4.1. Intent of PeTaL
An example of a thermal management problem in aerospace is shown below. Given a control volume and boundary conditions, what is the minimum material needed to dissipate Q watts while withstanding σ_x Pa of loading? There are natural systems and adaptations that deal with the distribution of fluids (rivers, trees, sponges), thermal management (pores on leaves, vascular systems), and structural loads (hierarchical structures, lattices, honeycombs). There are also human solutions to these challenges, such as heat pipes or photovoltaics, and materials with properties vastly different from those found in nature, such as silicon carbide composites that can withstand 3000 °F. The key to solving the problem is linking the patterns and mechanisms (or principles) in nature to the materials and processes available to humans. In some cases, we may discover new processes and add these to our ‘human toolbox’. This may be done for a specific component or a group of components (system). The challenge with large systems is to identify appropriate analogies in nature (living or otherwise) by identifying entities, connections and the flow of resources between the entities. Abstraction of specific mechanisms allows principles to be learned from natural history. There are numerous approaches that may be used, so we show one here as an example. One way to design a system would be to find inspiration from nature and then apply human materials and technology to scale the natural system model. This would entail knowing the design requirements, having access to sources of knowledge (journals, the internet or databases), being able to abstract the requirements and connect them to the contents of the sources of knowledge, extracting insights and being able to implement these insights. Such a process is shown in
Figure 4.
The problem-solving methodology, shown in Figure 4, is the following:
- (a) Problem definition
Identify a set of words or phrases (topics) relevant to the field (not exhaustive): distribution, generation, management of heat, energy, stress, convection, conduction, cooling, heating, evaporation, condensation. These terms form the basis for searching articles in the literature, patents, blogs or books. Ideally, these words themselves are connected in a ‘bag of words’ to related terminology to broaden the search and reduce the burden on the user.
Define the problem as a set of goals (minimum heat flux, maximum stress and minimum weight), boundary conditions (temperature, pressure, humidity, volume) and constraints (gravity, material must be continuous and supported, properties must exist in the natural world).
- (b) Identify possible models based on historical work
Human-generated literature
- i. Patents, literature: Use NLP (natural language processing) to identify articles related to the problem at hand through search terms related to the problem. The objective is to uncover solutions that have direct implementation, data that may be used to train models, or literature that identifies human technology (heat pipes, pulsating heat pipes, pins, heat exchangers, fluidic devices, cooling schemes for turbines, anti-icing technology) that may be further investigated through the generation of synthetic data.
Natural databases: The objective is to identify strategies in nature that enable solutions to the problem at hand. In the case of thermal management, one would search for articles that discuss distribution, generation, management of heat, energy, stress, convection, conduction, cooling, heating, evaporation, condensation. These terms form a word cloud. If articles from the literature are already classified or clustered into word clouds, then the search is greatly simplified. This may be done through the use of a search tool that translates user search terms to biology or through algorithms that find occurrences of noun, verb, noun combinations that represent function, object, and environment. Sources may include:
- ii. natural data sets such as Encyclopedia of Life, iDigBio and Wikipedia, which provide general information on organisms.
- iii. research data and information that may be gleaned from publications such as the Journal of Experimental Biology.
- (c) Based on the models identified, identify the structure, pattern or mechanism associated with the specific function(s) being performed by the model (whether natural or human-made). For example, given tuna, penguins, toucan bills and falcon wings as models, we would identify that a vascular or loop-like system should be considered, with a hierarchy of diameters, branches and lengths for a set of structural, weight and heat exchange requirements.
Unsupervised learning (clustering or topic modeling, for example) may be used to determine the parameter set associated with the environment we are designing for. Due to the sparse nature of most data available today, we may only be able to glean qualitative trends, such as ‘more branches on the heat sink’ or ‘more layers’ leading to increased heat flux. This may lead us to design a counter-current heat exchanger system with parameters derived from the most suitable model. We may also see that certain types of thermal management solutions co-occur with a certain type of structure.
Synthetic data sets may be generated based on the patterns found in nature to train machine learning algorithms that can provide insights into the relationship between form and multiple functions.
Optimization may be conducted to customize the structure to meet a specific distribution of objectives.
- (d) Down-select by comparing metrics (weight, heat flux, structural loads) to reference architecture or goals.
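Steps (a) and (b) of this methodology can be sketched as a minimal pipeline. The following is an illustrative sketch only: all field names, thesaurus entries and abstracts are invented assumptions, not PeTaL's actual schema or data, and the term-counting retrieval is a crude stand-in for real NLP.

```python
import re
from collections import Counter

# Step (a): encode the problem as topics, goals, boundary conditions and
# constraints. All field names and values below are invented for illustration.
problem = {
    "topics": ["convection", "conduction", "cooling", "evaporation"],
    "goals": {"heat_flux": "min", "stress": "max", "weight": "min"},
    "boundary_conditions": {"temperature_K": 350.0, "pressure_Pa": 101325.0},
    "constraints": ["material must be continuous and supported"],
}

# Broaden the user's terms with a 'bag of words' of related terminology.
THESAURUS = {
    "cooling": ["heat rejection", "thermal dissipation"],
    "convection": ["fluid flow"],
}

def expand_topics(topics):
    """Return the user's topics plus related terms from the toy thesaurus."""
    expanded = set(topics)
    for t in topics:
        expanded.update(THESAURUS.get(t, []))
    return expanded

# Step (b): rank candidate abstracts by occurrences of the expanded terms
# (real systems would use TF-IDF or embeddings rather than raw counts).
def score_abstract(abstract, terms):
    counts = Counter(re.findall(r"[a-z]+", abstract.lower()))
    return sum(counts[w] for term in terms for w in term.split())

abstracts = {
    "toucan": "Heat dissipation through the toucan bill aids thermal management.",
    "gecko": "Adhesion of gecko setae to vertical surfaces.",
}
terms = expand_topics(problem["topics"])
ranked = sorted(abstracts, key=lambda k: score_abstract(abstracts[k], terms),
                reverse=True)
```

Here the thermal-management abstract outranks the adhesion abstract because more of the expanded terms occur in it, which is the behavior a retrieval step in step (b) would need.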
Scaling a solution from nature is difficult without an understanding of physics. This is where machine learning or artificial intelligence can play a role (step 6 above). By training machine learning algorithms to understand the relationship between form and function we can assess the performance of biological form using synthetic materials. We can also use mapping techniques to transform arbitrary 3D volumes to normalized domains (
Figure 5). These normalized domains may then be populated with natural forms such as branches or lattice structures. First, topologies must be parameterized. For example, a branching structure may be represented by angles, a primary thickness, a ratio of lengths, ratios of thickness, the number of fins per unit area, and roughness. These parameters exist across length scales, from the macro (diameters) to the micro (roughness) and the nano (porosity). Another key parameter that should be noted is the functional degrees of freedom of the structure or process. Functional degrees of freedom (FDOF) are the mechanisms by which the structure or process may flex to accomplish objectives beyond those of the original structure. For example, the lid of a Starbucks® coffee cup does not seal the cup merely by seating it over the cup. It requires the user to flex the lid and snap it into place. The FDOF here would be flexibility or sealing effectiveness, driven by the modulus of elasticity of the material. This is the kind of data that researchers should store in databases intended for use in bio-inspired design. The data should be normalized across the entire dataset for scalability. For example, when creating a dataset of dragonfly wings, the largest and smallest chords, spans, Reynolds numbers, etc., should be used to scale the data rather than any individual parameter.
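The dataset-wide normalization recommended here can be sketched as follows. The wing measurements are invented for illustration; the point is that the minimum and maximum are taken over the entire data set, not per specimen.

```python
# Invented dragonfly-wing measurements, normalized by the extrema of the whole
# data set rather than by any individual specimen.
wings = [
    {"chord_mm": 8.0, "span_mm": 40.0},
    {"chord_mm": 12.0, "span_mm": 70.0},
    {"chord_mm": 10.0, "span_mm": 55.0},
]

def normalize(records, key):
    """Min-max scale `key` across the entire data set, storing `<key>_norm`."""
    lo = min(r[key] for r in records)
    hi = max(r[key] for r in records)
    for r in records:
        r[key + "_norm"] = (r[key] - lo) / (hi - lo)

for key in ("chord_mm", "span_mm"):
    normalize(wings, key)
```

Every normalized value then lies in [0, 1] on a scale shared by all specimens, which keeps features comparable when training across data sets.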
A hybrid approach may be used that couples machine learning and topology optimization. Machine learning can be used to determine grid densities locally in a domain that is to be optimized using a high-fidelity method such as topology optimization. The domain might be meshed with individual cells being marked for grid refinement. Once intelligent adaptive gridding is accomplished, topology optimization may proceed at significantly reduced cost. It is also possible to use a purely machine learning approach. We may train Bayesian classifiers or neural networks on selected biological models to understand relationships between the distribution of geometric coordinates (x, y, z) and performance (specific power, pressure drop) as a function of material properties and environment (boundary conditions). A combination of natural system data, industry data (including healthcare, automotive and aerospace) and synthetic data may be used to ‘teach’ physics to the algorithms. The intent of PeTaL is, thus, not to serve as a purely bio-inspirational tool, but rather to link human and biological ontologies through machine learning and advanced computing to directly provide engineering solutions.
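As a toy illustration of how a qualitative trend could be gleaned from synthetic data of this kind, consider the sketch below. The linear heat-flux relation used to synthesize the samples is an invented assumption for demonstration only, not a physical model.

```python
import random

random.seed(0)  # reproducible toy data

# Synthesize heat-sink samples. The linear relation below is an invented
# assumption used only to show how a trend could be recovered, not physics.
rows = []
for _ in range(200):
    branches = random.randint(2, 12)   # number of branches
    layers = random.randint(1, 5)      # hierarchy depth
    heat_flux = 10.0 * branches + 25.0 * layers + random.gauss(0, 5)
    rows.append({"branches": branches, "layers": layers, "heat_flux": heat_flux})

def mean_flux(data, predicate):
    subset = [r["heat_flux"] for r in data if predicate(r)]
    return sum(subset) / len(subset)

# Qualitative trend: 'more branches leads to increased heat flux'.
low = mean_flux(rows, lambda r: r["branches"] <= 4)
high = mean_flux(rows, lambda r: r["branches"] >= 10)
```

Comparing the two conditional means recovers the qualitative trend baked into the synthetic generator, which is exactly the kind of sparse-data insight the text describes.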
The association of pattern with function may be achieved by first classifying a biological feature by pattern composition (e.g., 50% lattice, 20% branching, 10% wavy, 20% explosion, and bilateral symmetry). Each pattern composition is associated with a use case from engineering that is multifunctional (e.g., primary factor: strength, secondary: heat transfer, tertiary: mobility). An example of this is shown in
Figure 6. Once the association is known, parametric variations on the particular pattern composition may be undertaken to optimize for engineering. A similar procedure may be followed for challenges related to other fields of study. A manual example of this approach used for a turbine blade is described in [
38]. Discovery of patterns in nature was first conducted, followed by an extraction of relevant parameters. Parametric variations were then simulated (such as varying the amplitudes and pitch of the waveforms along the span of the blade) and the results compared to a baseline. Work is in progress to extend this quasi-3D (designed for a cascade rather than rotating rig) design to 3D for general applicability in a turbine engine.
Comprehensive overviews of the current biomimetic toolkit are presented by Goel et al. [
39] and Wanieck et al. [
40]. Wanieck et al. show that, while there are a large number of databases to aid in biomimetics, there are few ontologies and fewer taxa and thesauri. Furthermore, there are gaps in the biomimetic process, especially in the translation of biological solutions to technology. It is stated that 77% of classified tools have shown their potential for usage. A landscape survey was conducted using the discovery tool QUID® to identify whether any significant works had been missed. Titles and abstracts for the period between 2003 and 2017 from SCOPUS [
41] were scraped and used as input data for the analysis. The results are shown in
Figure 7 and
Figure 8. The results indicate that while there is a large interest in biomimetics and bio-inspired design methods, there is not a sustained effort by any one group. This is clear from the lack of connections between publications. A healthy (or healthier) network should look like
Figure 9, which shows research in the field of heat exchangers inspired by biology. There are several references to previous publications, with many nodes being part of multiple sub-networks. The heat exchange and thermal management network is, however, one of the weaker areas of bio-inspired design from an application standpoint. This could be due to difficulties in understanding how to scale the operating environment of biological models to the high-temperature world in which we require our heat exchangers and thermal management systems to survive. This contrast is shown in
Figure 10. Mobility is a field that is more accessible and has been studied for a much longer time than heat exchange. This may explain why the number of articles focusing on thermal management is considerably lower than that for mobility. Also, the number of articles on land and air mobility is far greater than that for aquatic applications. This is again to be expected given the lack of resources available for, and challenges involved in, marine exploration. The entire landscape of biomimetics in the academic literature is visualized in
Figure 11. Mobility (robotics in particular) is a prominent cluster. Other dominant clusters are adhesives, synthesis, superhydrophobics and aerodynamics.
There is a need to provide a way for tools to communicate with each other and to share an underlying framework or ontology that would enable this. PeTaL is intended to be an open source platform to allow new and existing tools to be developed and to interact with each other. With this in mind, some requirements arise:
To enable broad access to the platform and development of code by all interested entities, the platform must be open and use an open-source language. Python was chosen for this purpose, although early development and proof of concept were carried out in R and RStudio. The platform must be accessible, and one way of doing this is through a virtual environment or website. V.I.N.E. (Virtual Interchange for Nature-inspired Exploration) [
42] is intended to be that virtual location and is described in subsequent sections.
The platform must enable new apps and tools to communicate through some common language and have some common reference of the universe. To this end, an ontology is proposed that uses existing ontologies for morphology (phylogenetic tree), physics, ecosystems, and states and properties of matter. The ontology must relate biological organisms to their ecosystem or habitat, including descriptions of the environment. The environment must include descriptions of states of matter such as solid, liquid or gas, and these must in turn connect to properties such as temperature, pressure, density, specific heat capacity, etc. The phylogenetic tree must include the evolutionary history of organisms and their inner composition, down to the cellular level. This gives the ability to zoom in on any part of the organism and to consider it as a model system for analysis. Thus, the entire ontology has the ability to define ecosystems. Given a network diagram, and having performed a clustering operation on this network, such ecosystems may emerge visually. Alternatively, clustering algorithms may produce similar groups in evolutionary time or in function.
Figure 12 shows an example of such an ontology. The entities shown must be connected by the laws of physics (force is mass times acceleration, pressure is force per unit area, etc.). A distinguishing feature relative to existing tools is the presence of a ‘human’ ontology. Given biological solutions such as a ‘snake, shark jaw and bat ultrasound’ inspired sensor, one must be able to identify materials that are flexible and possess the strength, finish and other qualities dictated by the biological model to realize a technological analog. This requires classification of human technology and a method of translation across domains.
The platform must provide a common database where data related to the elements of the ontology may reside. Data may be in the form of videos, images, text, or numerical values. An unstructured database would be a suitable choice, for example, MongoDB.
Tools must be able to communicate with the base entities of the platform (ontology, database) regardless of domain specificity or level of user expertise. Thus a natural language processing method is essential. One solution may be a thesaurus that takes queries or variables from users or user-defined tools and translates them to a form amenable to the PeTaL platform.
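A minimal sketch of such a thesaurus is shown below. The mappings are illustrative assumptions invented for this example, not the contents of the engineering-to-biology thesaurus cited later in this section.

```python
# Illustrative mappings only; not the engineering-to-biology thesaurus of the
# literature. A real implementation would cover far more terms and use NLP.
ENG_TO_BIO = {
    "heat exchanger": ["counter-current exchange", "rete mirabile", "vascular"],
    "adhesive": ["setae", "mucus", "van der Waals adhesion"],
}

def translate_query(query):
    """Append biology-flavoured search terms for any engineering phrase found."""
    terms = [query]
    for eng, bio in ENG_TO_BIO.items():
        if eng in query.lower():
            terms.extend(bio)
    return terms

terms = translate_query("compact heat exchanger design")
```

A user-facing tool could pass its query through `translate_query` before hitting the database, so that domain-specific engineering vocabulary still retrieves biological literature.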
Given a platform and a set of tools such as classification algorithms, image analysis, design or modeling tools, there should be a machine learning wrapper that navigates these tools in a manner that is likely to produce better outcomes given a particular set of user objectives, design philosophy and constraints. An example of such an algorithm is random forest, where several decision trees may be navigated depending on the set of input parameters, quality and quantity of data, time available to solve the problem, and computational resources. The idea here is to mimic the evolutionary process, where several mutations are permitted and the ones that are not flawed propagate. Thus, the platform is the earth, consisting of an ontology, database and language. These are themselves subject to change over time. The organisms are the tools that rely on the platform. The user problems and constraints are the forcing functions that ensure that problem-solving pathways are constantly updated. Some deviations from evolution may be beneficial, such as a lack of extinction (storing old tools that may be shown to have a place in the toolset at a later time).
Figure 13 shows the architecture of PeTaL. The primary components include:
- User Interface
- Ontology and database
- Translation Tools such as thesauri, mapping tools, abstraction tools
- Data science/machine learning/computer vision tools
- Physical models, such as the relationship between geolocation (including under water bodies) and temperature, pressure, density and humidity
- Data management (validation, entry, modification)
- Technology synthesis tools (given solutions and known constraints and resources, how to construct a functioning system?)
4.2. Taxonomy
In order to integrate tools and process information from variegated sources and domains, it is necessary to establish a taxonomy or language that an algorithm might use to perform tasks such as classification, clustering, data scraping and translation. Several taxa are available in the biomimetic landscape, but arguably, the most widely used is the Biomimicry Taxonomy, used by AskNature [
43]. The taxonomy was constructed by domain experts, but it is unclear if such a taxonomy is generally applicable or reproducible. These two qualities are necessary if used within a data science context. We therefore ask ourselves whether it is possible to construct a taxonomy that is objective and uses data science to discern topics based on the published literature. Due to the bias in the literature toward research topics that have been accessible to researchers thus far (as seen in
Figure 11), unsupervised clustering of articles is unlikely to yield a taxonomy of terms that is mutually independent and exhaustive. Hypothetically, one should be able to take a large body of work from, say, the Journal of Experimental Biology, and perform topic modeling on the corpus to reveal underlying themes. One might expect to discern clusters that correspond to environment, organisms and structures, but also to functions, which could then inform our taxonomic approach. Each cluster would then be the highest taxonomic level. Sub-clusters would then be created within each cluster for refined searches. This taxonomy would be updated periodically to reflect the literature that is being generated. The entire PeTaL architecture is therefore dynamic. In the short term, however, it may be advantageous to leverage the existing taxonomy from AskNature [
43] and develop a classification tool to label articles of interest to a user by the AskNature functions.
Text classification was conducted using a manually labeled training data set constructed collaboratively by a team of NASA researchers. Articles used for training the algorithm were provided by a collaborator, Dr. Julian Vincent [
19], and labels were created by assigning 1 of the 8 primary tiers from the Biomimicry Taxonomy from AskNature [
43] to 100 different articles. The training data set was crafted to represent a balanced distribution of labels from each of these 8 primary tiers. A total of 90 of these articles were randomly selected as the training data set, and 10 were used in a “hold out” validation/testing scheme. The process required converting the text of interest into a Corpus document and then removing punctuation, numbers, common English terms (referred to as stop words), and any terms found in more than 50% of the documents. The remaining words were then stemmed to reduce them to their root forms. The documents were then transformed into Document Term Matrices consisting of word frequencies and applied to an XGBoost [
44] algorithm. In a process of hyperparameter tuning, the max_depth, eta, gamma, subsample, colsample_bytree, min_child_weight and max_delta_step parameters of the XGBoost [
44] algorithm were each concurrently randomized over 10,000 iterations of 5-fold cross validation. The parameters which achieved the lowest testing error rate using the mlogloss evaluation criteria were then applied alongside the saved corresponding seed value (to reproduce the random-value environment responsible for generation of the model) towards construction of the final text classification model.
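The preprocessing steps described above can be sketched in a drastically simplified form. The stop-word list and suffix-stripping stemmer below are toy stand-ins for the real Corpus/Document Term Matrix pipeline that fed the XGBoost model; they are meant only to show the shape of the transformation from raw text to term frequencies.

```python
import re

# Toy stop-word list; the real pipeline also removed terms occurring in more
# than 50% of documents.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "by"}

def stem(word):
    """Crude suffix-stripping stemmer (a stand-in for a real stemmer)."""
    for suffix in ("ing", "tion", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # [a-z]+ on lower-cased text drops punctuation and numbers in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Rows are documents, columns are stemmed vocabulary terms, cells are counts."""
    vocab = sorted({t for d in docs for t in preprocess(d)})
    matrix = [[preprocess(d).count(v) for v in vocab] for d in docs]
    return vocab, matrix

docs = ["Cooling of the toucan bill.", "Cooling fins dissipate 300 watts."]
vocab, dtm = document_term_matrix(docs)
```

The resulting matrix of word frequencies is the form that a classifier such as XGBoost would consume as features.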
Figure 14 shows word clouds displaying the highest terms found in each of the classes/labels for the 90 articles used to train the XGBoost classification algorithm.
After creating the XGBoost model, the hold-out testing set was then classified to determine the level of harmonization as compared to the human label curation methodology. Once the classification tool was developed, it was also necessary to determine whether an unsupervised approach would yield a breakdown of topics similar to that of the AskNature taxonomy [
43] keeping in mind that there may be overlap in terms or an alternative way to view topics. Exploratory analysis of the text composition was carried out through an unsupervised application of a random forest machine learning algorithm, which was used to generate a proximity matrix to apply towards a partition around medoids (PAM) algorithm [
45]. The optimal k or number of clusters was analyzed, and a
k = 8 was selected to determine the degree of possible concordance between the manually assigned labels and the PAM clusters. Latent Dirichlet Allocation (LDA) [
46] was then also applied to the data by specifying 8 topics to compare concordance between the manually assigned labels and the LDA topics. The classification model was then also applied to a set of unlabeled articles focusing on extremophile and microbiology literature as a demonstrative use case for the algorithm’s capabilities, as well as to the aggregated and unstructured text data gathered through web scraping, to attempt to identify functional biomimetic engineering correlations with different species of interest. The hyperparameter-tuned XGBoost algorithm’s 5-fold cross validation obtained a minimum logloss value of 1.846 after 35 rounds, as shown in
Table 2.
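The partition-around-medoids idea can be illustrated with a toy implementation. The analysis in this section derived its distances from a random-forest proximity matrix; the sketch below instead uses plain 1-D distances on invented data, which keeps the alternating assign/update structure of PAM visible.

```python
def pam(points, k, iterations=20):
    """Toy 1-D partition around medoids: assign each point to its nearest
    medoid, then move each medoid to the member minimizing total in-cluster
    distance, repeating until the medoids stop changing."""
    medoids = list(range(k))  # start with the first k points as medoids
    clusters = []
    for _ in range(iterations):
        # Assignment step: nearest medoid by absolute distance.
        clusters = [min(range(k), key=lambda m: abs(p - points[medoids[m]]))
                    for p in points]
        # Update step: best medoid within each cluster.
        new_medoids = []
        for m in range(k):
            members = [i for i, c in enumerate(clusters) if c == m]
            best = min(members, key=lambda i: sum(abs(points[i] - points[j])
                                                  for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

points = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]  # two obvious groups
medoids, clusters = pam(points, k=2)
```

On this data the algorithm separates the two obvious groups; with a proximity matrix in place of `abs(p - q)`, the same loop structure applies to the article-clustering case described here.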
The results of this model, as applied to the hold-out testing data set, are displayed in
Table 3. Orange cells indicate the articles that were not classified in concordance with the manually assigned labels. The darkest blue cells indicate the highest probability scores from the XGBoost model’s results, while the lighter shade of blue indicates the second highest probability score, and the lightest shade of blue represents the third highest probability score. The dark green cells indicate the article was classified by the algorithm as the highest probability in concordance with the manually assigned label, the lighter shade of green indicates that the article’s second highest probability score from the algorithm matches the manually assigned label, and the lightest shade of green indicates that the third highest probability score matches the manually assigned label. In an effort to examine the underlying statistical trends existing within the training data set, unsupervised analysis was carried out via PAM through a random forest proximity matrix and LDA (Latent Dirichlet Allocation). PAM analysis was performed with k = 8 clusters and LDA with
k = 8 topics. In
Figure 15, a silhouette profile was generated for the purposes of PAM clustering and depicted optimal clustering with
k = 2 as well as a slight peak around
k = 7. The heat map in
Figure 16 depicts the co-localization between PAM clusters and the manually assigned Biomimicry Taxonomy labels, where it can be observed that PAM cluster 3 appears to contain a significant proportion of the “Maintain Community” labels, as well as a rough distribution of different labels across each cluster.
Analysis of the peak in the silhouette profile and the level of concordance between clusters and manual labels within the heatmap could thus possibly indicate the existence of a limited statistical basis underlying the results of the manual labeling strategy, despite ambivalent cluster distributions for several labels. The heat map in
Figure 17 demonstrates the co-localization between LDA topics and the manually assigned Biomimicry Taxonomy labels, where we observe a potentially significant correlation between the “Modify” label and LDA Topic 4, as well as a somewhat discernable scattered concentration of the manually assigned labels across LDA topics.
An analysis of the LDA topic highest word frequencies is shown in
Figure 18. Finally, from the heat map in
Figure 19 that compares articles found in the LDA topics vs. the PAM clusters, we can observe a potential relationship between PAM cluster 1 and LDA topic 1 with less overall concordance across topics and clusters reflecting the different statistical methodologies inherent to these analytic techniques.
Another approach taken to develop a taxonomy is shown in
Figure 20. Here, a system is defined as a collection of entities that perform one or more common or collective function(s). The entities may be called subsystems as they in turn are a collection of entities that perform one or more functions. A collection of systems may then be termed a super system. Every system except perhaps the universe has an external boundary (interface to super systems) and at least one internal boundary (interface to subsystems). Identification of system boundaries is a complex subject that has been explored by many authors [
57]. We assume that a boundary is dependent on the action or function of interest as identified by the flow of resources, energy or information across it. This flow or travel may occur completely within the system control volume, between the system and the environment or between the system and another system directly. The approach involves identifying a genus (
Aster) and breaking it down into organ systems (root system, reproductive system, etc.). The premise is that, within a genus, the morphology exhibits sufficient similarities to ensure that the same hierarchy of taxonomy exists across species. This is not to say that there is no inter-species variation, but that, in general, all species within a genus exhibit similar morphology. The organ systems were then further broken down into subsystems. At each level, we noted the action, the object of the action, the interface between the object and the entity performing the action, the result of the action (change or lack of change), and the states of the entity and object. This was done in a manner specific to a genus. The exercise was repeated with human organ systems as the systems of interest [
58]. Once specific functions for each system were identified, then the functions, resources and environment were abstracted to attempt to create a building block set of attributes. For example, the statement “Lung is an organ to exchange carbon dioxide and oxygen” is abstracted to “exchange of chemicals from liquid to gas”. We may break this down further to “a change in chemical composition of the lung system (increase of oxygen, decrease of carbon dioxide) through a physical mechanism”. Conversely, the process also results in a change in chemical composition of the air (decrease of oxygen, increase of carbon dioxide) through a physical mechanism. It emerged that, beyond a certain level, the functions and morphology for a species share much in common with other species in the genus and we do not need to research at a species level except for species-specific features or functions.
Some of the patterns that emerged are shown in
Table 4 (with a focus on changing states rather than maintaining states) with the final column of
Table 4 showing a mapping to the AskNature taxonomy [
43] and the third column showing a mapping to the tertiary terms of the engineering to biology thesaurus [
59]. The non-abstracted functions show 10 high level topics similar to that of AskNature’s taxonomy [
43]. The functions that emerged were Generate, Destroy, Attach, Detach, Transport, Transfer, Exchange, Regulate, Promote, Inhibit. These may be reduced to 7 if we combine Generate and Destroy, Attach and Detach, and Promote and Inhibit. However, upon abstraction (column 1), it appears that it may be possible to create a standardized way of identifying functions as shown in
Table 5. By combining 1 element from each column, we can create functional statements that correspond to each action in
Table 4. As an example, the statement that relates the attachment of a burr to a pair of trousers may be specifically written as “A solid is hooked (attached) to another solid” and abstracted to “A solid maintains its mass interface with another solid through physical mechanism”. This disambiguates a mass interface from an energy interface, which may be applicable in the case of two magnets that are attached or connected by their fields.
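The building-block idea behind this approach can be sketched by composing one element per column. The column contents below are illustrative, drawn from the burr example in the text rather than from the actual table.

```python
import itertools

# Illustrative column contents, following the burr-to-trousers example above.
SUBJECTS = ["A solid", "A liquid", "A gas"]
ACTIONS = ["maintains", "changes"]
INTERFACES = ["its mass interface with", "its energy interface with"]
TARGETS = ["another solid", "another liquid"]
MECHANISMS = ["through physical mechanism", "through chemical mechanism"]

def functional_statements():
    """Yield every statement formed by taking one element from each column."""
    for combo in itertools.product(SUBJECTS, ACTIONS, INTERFACES, TARGETS,
                                   MECHANISMS):
        yield " ".join(combo)

statements = list(functional_statements())
burr = ("A solid maintains its mass interface with another solid "
        "through physical mechanism")
```

Enumerating the Cartesian product of the columns yields every expressible functional statement, including the burr abstraction, which illustrates why a small set of building blocks can cover a large space of functions.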
Due to the preliminary nature of the building block taxonomy shown in
Table 5, we use the AskNature taxonomy [
43] as a starting point to classify journal articles, and the engineering to biology thesaurus to perform search queries in PeTaL. However, the taxonomy and ontology are easily modified through editable input files.
Figure 21 shows a possible workflow for PeTaL, while
Figure 22 shows multiple possible pathways for a problem solving process. Inspiration for this process could come from quantum biology. The intent of looking to quantum biology to determine workflow is to enable multiple approaches to be weighed simultaneously before choosing one that is appropriate for the task. Initially multiple pathways should be followed to build a robust training set upon which to train the algorithm. The particular details are left to future work. The decision on which workflow to traverse would depend on such factors as:
4.3. Status of PeTaL
PeTaL is currently being prepared for open-source deployment through NASA’s Software Portal. A prototype was initially created using R and R Shiny but was later transitioned to Python for scalability and customizability.
Figure 23 shows the original PeTaL user interface.
Refer to the components of PeTaL in
Section 4.1 (
Figure 13) to understand the relationship between the tools and capabilities that follow and the PeTaL architecture.
- a. User Interface: HTML, CSS, and JavaScript were used to create a barebones PeTaL user interface with a home, profile, map, vision, and graph page, implemented using the Bootstrap framework. The Python-based home page for PeTaL is shown in
Figure 24. SCSS, a superset of CSS, was introduced to allow writing maintainable CSS in an object-oriented manner, speeding up the process of styling elements and introducing new features and sections for the site. CSS was moved to more structured SCSS files. Doing so facilitated consistency and enabled enhancing the platform with interactions, such as subtle SCSS-driven scroll-triggered animations. Proper use of Git version control was key to developing new features and debugging. To reduce the loading time of the application, the number of HTTP requests was minimized.
- b. Ontology and database: The diversity and size associated with “Big Data” endeavors have forced the adoption of non-traditional data storage systems. Thus, we have witnessed the dawn of NoSQL systems that allow for drastic increases in size over time, a concept known as scalability, as well as flexibility in formatting [
60]. From this order of database systems there exists the genus of “Document-Oriented Databases”, which utilize key-value systems to store data in documents encoded in a standardized language. One particular species of note from this genus is known as MongoDB and has been widely lauded by industry and academia for its combined strengths of speed and schema-less formatting. PeTaL uses this database to store data. This database uses the Binary JavaScript Object Notation (BSON) format, which is extremely similar to the popular JavaScript Object Notation (JSON) format [
61].
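Because BSON is a binary superset of JSON, any JSON-serializable document can be stored directly. The sketch below shows what a schema-less PeTaL entry might look like; the field names and values are invented for illustration and do not reflect the actual PeTaL schema.

```python
import json

# Hypothetical PeTaL entry; field names and values are illustrative only.
entry = {
    "_id": "5d1b2c3e9f0a1b2c3d4e5f60",  # MongoDB assigns each document an ObjectId
    "name": "Anax junius",
    "taxon": {"order": "Odonata", "family": "Aeshnidae"},
    "functions": ["flight", "thermal regulation"],
    "source": "citizen-science upload",
}

# Round-trip through JSON: schema-less storage means no table definition
# is needed before inserting documents with differing fields.
restored = json.loads(json.dumps(entry))
```

With a running MongoDB instance, the same dictionary could be stored via pymongo’s `collection.insert_one(entry)`; no schema declaration is required beforehand, which is what enables heterogeneous data from researchers, citizen scientists, and other databases to coexist in one collection.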
Figure 25 depicts the PeTaL database as represented in NoSQL Booster [
62]. The database assigns each entry its own ObjectId. The intent is to store data gathered from researchers, citizen scientists, and other databases in this format.
A NetworkX graph—a compilation of ontology elements, literature, and the connections between them—was built from PeTaL’s OWL (Web Ontology Language) file, and then had information from the abstract dataset appended to it as additional nodes. This graph is primarily accessed via PeTaL’s profile pages, as shown in
Figure 26. The profile pages display the ontology element along with associated data such as a description, an image, linked elements, and relevant literature. The D3 force layout provides a full view of the PeTaL ontology that can be zoomed and manipulated. On mouse-over, the visualization highlights the relevant node and all of its nearest neighbors, and a double click leads to the associated profile page. The visualization allows a more comprehensive understanding of the PeTaL ontology and the significant relationships between its elements, which facilitates the bio-inspiration process. To populate the ontology with data autonomously, several sources and tools were considered, including Animal Diversity Web, Encyclopedia of Life, ITIS, and Wikipedia.
Figure 27 shows a view of the ontology within PeTaL. The nodes can be clustered according to function, morphology, or environment and can be clicked for more information. The ontology graph is rendered in 3D using various JavaScript libraries and WebGL. Rendering the graph in 3D makes the ontology dataset easier to explore by allowing users to distinguish relationships between nodes and to rotate the graph in 3D space. The graph uses a force-directed layout algorithm to minimize or completely avoid edge crossings, yielding a cleaner, easier-to-interpret graph.
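The core data structure behind the profile pages is a graph whose nodes mix ontology elements and literature. The following is a minimal standard-library sketch of that idea (the actual implementation uses NetworkX built from the OWL file); the node names here are invented for illustration.

```python
from collections import defaultdict

# Toy ontology/literature graph; node names are invented, not from PeTaL.
edges = [
    ("thermal regulation", "elephant ear"),
    ("thermal regulation", "termite mound"),
    ("termite mound", "Turner 2001 (abstract)"),  # literature node appended to the graph
    ("elephant ear", "vascular network"),
]

# Undirected adjacency: each edge is stored in both directions.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def highlight(node):
    """Return a node plus its nearest neighbors (the mouse-over behavior
    of the force-layout visualization)."""
    return {node} | graph[node]
```

In NetworkX the same structure would be `nx.Graph()` with `add_edge`, and the neighbor query would be `G.neighbors(node)`; the profile page for a node is then populated from the node’s adjacent ontology elements and literature.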
- c.
Translation Tools such as thesauri, mapping tools, abstraction tools:
Figure 28 shows the bio-inspired research and design (BIRD) tool for the PeTaL interface. This tool provides easy-to-use biology data engineering services for engineers. Because the translation and discovery phases are often the most difficult to overcome when beginning a biomimicry project, BIRD provides specific tools to address these challenges. At the heart of BIRD is a JSON file that translates common engineering functions into their biology counterparts in the appropriate language. The program takes primary biology research articles, matches them against this list of relevant biology terms, ranks the top 50 results, and displays them to the user. The output of this literature search can either be human readable, as shown in
Figure 28, or in a format compatible with PeTaL’s clustering tools.
Figure 29 shows the software architecture of BIRD’s back end. One of the largest benefits of using BIRD is the high degree to which a user can customize their search.
Figure 29 shows the overall structure of BIRD as presented to users upon download. The working directory has a folder labeled “highwire” that contains all of the tar.gz files taken from the HighWire [
63] repository. There are also three other, initially empty, folders labeled “cluster”, “data”, and “article_checkpoint”. The text file “journal_inputs.txt” contains a list of journal abbreviations that the user can edit to add or remove journals from their search. The “e2b.json” file contains the modified engineering-to-biology thesaurus. Both the e2b and journal_inputs files were designed to be easily modified by the user to enable a customized experience. Throughout the process, at designated checkpoints, the dataclass instances are exported as pickle files into the “article_checkpoint” folder. In combine.py, the user can specify whether the output of the ranking and sorting tool should be in a human-readable, PeTaL-compatible format or in a format compatible with a clustering tool also housed in PeTaL. If the user specifies human-readable output, the final text files representing the top 50 results for each engineering term are written to the “data” folder; if the user specifies cluster-compatible output, the text files are written to the “cluster” folder. Overall, combine.py runs highwire_scrapter.py and hands off the result, a list of Article dataclass instances, to article_rank.py, which produces the final output.
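The core of the ranking step can be sketched as a term-frequency score against the engineering-to-biology mapping. The `e2b` excerpt, article texts, and function name below are invented placeholders, not the contents of the real e2b.json; the sketch only illustrates the match-and-rank idea.

```python
from collections import Counter

# Hypothetical excerpt of an engineering-to-biology mapping; the real
# thesaurus lives in e2b.json and is user-editable.
e2b = {
    "heat dissipation": ["thermoregulation", "evaporative cooling", "vasodilation"],
}

# Stand-ins for scraped article text, keyed by article identifier.
articles = {
    "art1": "vasodilation and evaporative cooling in desert hares ...",
    "art2": "gene expression profiles of deep-sea microbes ...",
}

def rank_articles(function, articles, top_n=50):
    """Score each article by occurrences of the biology terms mapped to an
    engineering function, then return up to top_n articles with a nonzero score."""
    terms = e2b[function]
    scores = Counter()
    for name, text in articles.items():
        text = text.lower()
        scores[name] = sum(text.count(term) for term in terms)
    return [name for name, score in scores.most_common(top_n) if score > 0]
```

In BIRD the equivalent logic is split across highwire_scrapter.py (building the article instances) and article_rank.py (scoring and sorting), with combine.py orchestrating the hand-off.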
Figure 30 shows a contributed tool that uses AskNature [
43] by Dr. Marjan Eggermont [
64]. The intent is to continue adding tools from contributing developers. Several independent projects have already begun to combine tools to enable easier integration into PeTaL.
- d.
Data science/machine learning/computer vision tools:
Figure 31 shows a topic modeling tool that helps researchers identify models for inspiration in the short term and provides a method to generate automated taxonomies in the long term based on the methods that are presented in
Section 4.2. Given a set of input documents (a corpus), Latent Dirichlet Allocation (LDA) and nonnegative matrix factorization (NMF) with Kullback–Leibler (KL) divergence are used to provide topic clusters to the user. Topic clusters are the latent underlying themes of a body of text. For the text modeling problem, NMF-KL is the equivalent of probabilistic latent semantic analysis. Both LDA and NMF-KL are preferred modeling tools within the topic modeling community [
65,
66,
67,
68,
69]. These tools are used to identify biological specimens relevant to the user. The user can search for a cluster with relevant topics, find articles, and identify the biological models and mechanisms associated with the function of interest. It is also possible to sub-cluster a given topic to uncover underlying topics. Another tool developed for PeTaL is an image classifier that determines the organism(s) and patterns present in an image. Convolutional neural networks (CNNs) have demonstrated an impressive ability to learn abstract features from raw pixel information in order to classify images [
70,
71]. ResNet [
72], pretrained on 1000 ImageNet classes, was used for organism classification. This was implemented with the Keras Applications package and no training or validation was required. The relevant output classes that pertain to organisms in PeTaL are mapped to their respective phylum page for relevant information and literature within the PeTaL platform. For geometric pattern classification, a standard convolutional neural network was trained with four convolution layers, each followed by batch normalization, ReLU non-linear activation, and max pooling. We will simply refer to this model as the CNN. A more complex CNN architecture, called MobileNet [
73], was also trained. MobileNet is an image classification network designed to perform well with limited computing resources. MobileNet was chosen because it is a good compromise between classification performance and required time to train. The classifier was trained on naturally occurring patterns [
74]. Examples include trees and roots for branching, dandelions for explosion, spiral galaxies and staircases for spiral, and tile roofing and giraffe spots for tile. Images were sourced from ImageNet. Images were automatically gray-scaled to avoid biasing patterns to certain colors. All images were down-sampled to 224 × 224 resolution.
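The NMF-KL factorization used for topic clustering can be written compactly with multiplicative updates. The sketch below is a minimal NumPy implementation of Lee and Seung’s update rules for the KL objective, not PeTaL’s actual code; the matrix `V` and rank `k` in the usage are arbitrary.

```python
import numpy as np

def nmf_kl(V, k, iters=300, eps=1e-9, seed=0):
    """Factor a nonnegative term-document matrix V (terms x docs) into W @ H
    using multiplicative updates for the KL-divergence objective.
    Columns of W are topics; columns of H give per-document topic weights."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        # W update: scale by how well each topic explains the residual ratio V/WH.
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        # H update: symmetric rule for the document-topic weights.
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H
```

In practice a library implementation would be used instead; scikit-learn’s `NMF(beta_loss="kullback-leibler", solver="mu")` applies the same multiplicative-update scheme, and the topic clusters shown to the user come from the largest entries in each column of W.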
Figure 32 shows the image classifier workflow in PeTaL. Ongoing work includes identifying pattern locations on the images using heatmaps (
Figure 33) to allow for verification of the classification tool and also to enable extraction of multiple patterns and organisms from an image, revealing connections of systems to their environment. Eventually, computed tomography (CT) scans might become available for large databases of specimens, and an autonomous approach to identifying the relationship between function, morphology, and pattern might be useful.
In addition to the tools and online databases that have been utilized, an effort was made to determine best practices for collecting raw data for quantitative insights. Data were collected at the Cleveland Museum of Natural History (CMNH) in Cleveland, OH, the Cincinnati Museum Center (CMC) in Cincinnati, OH, and the Smithsonian National Museum of Natural History in Washington, D.C., for use as a test set for PeTaL’s user interface. The initial data collection focused on Odonata at the CMNH because all the specimens in that collection have been digitized [
75]. Hymenoptera were also characterized. These two orders were selected because of the easy access to their wings. Another large group in the database is the trilobites, which are easy to characterize and show large inter-species diversity. A sample of three to five specimens per species, and two species per family, was characterized. Not all families were represented at the aforementioned museums.
Three hundred specimens have been characterized so far. For the CMNH entomology specimens, measurements were taken with a Leica microscope running the Leica Application Suite (LAS); vernier calipers were used for the fossil specimens at the CMNH and the CMC. For the Smithsonian specimens, an Olympus microscope was used to collect the data.
Figure 34 shows the data captured through direct measurement in the PeTaL user interface.
Figure 35 shows a 3D scatter plot with randomly generated data. The intent is to enable the user to change axes and view multiple datasets simultaneously to gain insight into patterns or trends across scales, implementing the concept in
Figure 3, in addition to allowing the user to perform comparative studies and find gaps in the literature or the phylogenetic tree.
As an open source platform, PeTaL seeks data contributions from researchers of varying expertise. Ongoing efforts are exploring opportunities for collaboration with universities and other organizations to develop a standardized way for non-proprietary data to be curated for PeTaL (e.g., a web interface) that is accessible to all.