Over the last decade, the genomes of new Protozoa species were sequenced and deposited in GenBank [1]. The availability of a primary genome sequence is a good starting point for the community to contribute further analyses (e.g., identification and functional annotation of coding sequences, as well as comparative genomics) in order to infer new information on the biology of these organisms. Analyzing the large amounts of data generated by genomics experiments, especially those using Next Generation Sequencing (NGS), is not a trivial task. Ongoing advances in NGS technology make the sequencing of more and more eukaryote genomes a reality, giving rise to new paradigms: either the development and improvement of semi-automatic analysis/annotation systems for this huge amount of data, or an object-view concept in which raw reads are the main, fixed object, and assemblies with their annotations act as dynamically changing views of that object [5].
The processes involved in sequencing and preparing genomic information can be represented in a way similar to the software life cycle (Figure 1). The first step is data acquisition, which can be performed by (i) downloading from public databases and/or (ii) sequencing on one or more platforms, such as Sanger and/or NGS (Illumina, Ion Torrent, Nanopore and/or Pacific Biosciences). The second step, pre-processing, formats and stores the genomic data for subsequent use. The third step uses a number of computational tools to transform raw data into knowledge. The fourth and last step distributes this information and makes it available to the community for further analysis and inference.
Therefore, in order to facilitate information extraction [6], we developed the ProtozoaDB [7] database system, which in its first version included five protozoan genomes (Entamoeba histolytica, Leishmania major, Plasmodium falciparum, Trypanosoma cruzi, and T. brucei) and a set of tools for searching and analyzing data, including phylogeny inference. Here we present a new version, ProtozoaDB 2.0 (http://protozoadb.biowebdb.org), which, according to the description above, fits into the third and fourth (final) steps of the bioinformatics cycle: transforming raw data into information, followed by its distribution. The development of new-generation databases such as ProtozoaDB is being encouraged by the community, especially in the context of the BioCreative initiative [8], and has been reviewed by Krallinger et al. (2008) [9].
The system has been fully remodeled to accommodate new tools and a broader view of the data, using advanced computational techniques and providing a wider range of information for users. Now covering the genomes of 22 pathogenic Protozoa, this new version includes analyses such as: (i) similarities with other databases (Homo sapiens, model organisms, the Conserved Domains Database (CDD) and the Protein Data Bank (PDB)); (ii) visualization of metabolic pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [10]; (iii) protein structures from PDB [11]; (iv) homology studies, using results from OrthoMCL [12], KEGG Orthology (KO), and OrthoSearch [13]; (v) searches for related publications in PubMed; (vi) superfamily classification [14]; and (vii) phenotype inferences based on comparisons with model organisms, particularly Saccharomyces cerevisiae.
The ProtozoaDB source code was completely rewritten in another programming language, using more elaborate techniques. It now uses Rails (http://rubyonrails.org/), a framework for developing web applications. The system was developed in layers, which separates the business logic from the pages displayed to users; this makes maintenance easier and page access lighter and faster. Furthermore, a specific layer handles data fetched from other sources. The Ruby language, on which Rails is built, was adopted for this version, together with the BioRuby library [15].
It is also possible to transfer functional information based on similar phenotypes, and a specialized database called PhenomicDB was developed using this concept [17].
The previous version of ProtozoaDB contained only five pathogenic protozoa and some basic analyses. ProtozoaDB 2.0 adds 17 protozoan species, for a total of 22 genomes and proteomes. New analyses were added in this version, such as homology analysis among the 22 organisms, using two different approaches, and phenotype inference through orthology with a model organism. Furthermore, to provide more comprehensive information about these organisms, several queries are performed in real time against third-party (remote) sites, retrieving information about the organisms' proteomes.
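The idea of phenotype inference through orthology can be illustrated with a minimal sketch. It is not code from ProtozoaDB 2.0: the function name `infer_phenotypes`, the group and protein identifiers, and the phenotype labels are all hypothetical, and the sketch assumes that any group member without a phenotype annotation is a query protein.

```python
from collections import defaultdict


def infer_phenotypes(ortholog_groups, model_phenotypes):
    """Propose phenotypes for proteins from those of model-organism orthologs.

    ortholog_groups: dict mapping a group ID to a list of protein IDs.
    model_phenotypes: dict mapping model-organism protein IDs to phenotype lists.
    Simplifying assumption: a group member without a phenotype annotation
    is treated as a query protein.
    """
    inferred = defaultdict(set)
    for members in ortholog_groups.values():
        annotated = [m for m in members if m in model_phenotypes]
        queries = [m for m in members if m not in model_phenotypes]
        for query in queries:
            for hit in annotated:
                inferred[query].update(model_phenotypes[hit])
    # Proteins with no annotated ortholog are simply absent from the result.
    return {protein: sorted(phenos) for protein, phenos in inferred.items()}


# Hypothetical data, for illustration only.
groups = {"OG1": ["tbru_p1", "scer_g1"], "OG2": ["lmaj_p2"]}
phenotypes = {"scer_g1": ["slow growth", "inviable"]}
result = infer_phenotypes(groups, phenotypes)
```

Here `tbru_p1` inherits the phenotypes of its annotated ortholog as hypotheses, while `lmaj_p2`, which has no annotated ortholog, receives none.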
Other databases contain Protozoa species [25]; however, ProtozoaDB is the first database and web server that provides “all-in-one” information about the comparative genomics of 22 species.
The use of web services allows for a flexible system that (i) integrates a range of related information; (ii) has direct access to information at its original (remote) sources; and (iii) does not store third-party data locally, which would otherwise require periodic updates. These advantages keep our system up to date, since most of the information is queried directly from the source databases through web services. Web services are already established practice in bioinformatics and are used by a number of research groups, e.g., BioSWR [27] and BOWS [28].
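The query-time retrieval pattern described above, in which remote data are fetched on demand and never stored locally, can be sketched as follows. This is an illustration under stated assumptions, not the ProtozoaDB 2.0 implementation: the endpoint template, the function names, and the record fields are hypothetical, and a stand-in fetcher is used so the example runs without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

# Hypothetical endpoint template, used only for illustration.
REMOTE_TEMPLATE = "https://remote.example.org/api/protein?id={accession}"


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def fetch_annotation(accession: str, get: Callable[[str], str] = http_get) -> dict:
    """Query the remote source at request time; nothing is stored locally."""
    url = REMOTE_TEMPLATE.format(accession=urllib.parse.quote(accession))
    return json.loads(get(url))


def fake_get(url: str) -> str:
    """Stand-in for the remote service so the sketch runs offline."""
    return json.dumps({"url": url, "description": "example record"})


record = fetch_annotation("ABC 123", get=fake_get)
```

Passing the fetcher as a parameter keeps the remote source replaceable and makes the lookup testable without contacting any third-party site.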
Using an AJAX-based framework enables ProtozoaDB 2.0 to perform all queries through web services while keeping query response times suitable for online analysis. AJAX frameworks are used in modern websites, including those related to health [29].
The new search engines, particularly BLAST, allow researchers to query the ProtozoaDB 2.0 data directly by the protein or gene of interest and to view several pieces of information about it. Thus, it is possible to find a potential drug target just by browsing through the system and using the information provided.
3.1. T. brucei Case Study
Farnesyltransferase is an enzyme of the prenyltransferase group that attaches a 15-carbon isoprenoid farnesyl group to proteins bearing a CAAX motif, a four-amino-acid sequence at the carboxyl terminus of a protein [31]. Farnesylation is a type of prenylation, a post-translational modification of proteins [32] that binds an isoprenyl group (here, a 15-carbon isoprenoid) to a cysteine residue. In other words, protein farnesylation involves protein farnesyltransferase (PFT, also abbreviated FTase), which catalyzes the attachment of the farnesyl group from farnesyl pyrophosphate (FPP) to the cysteine SH of the C-terminal sequence motif CAAX, where C is cysteine, A is usually, but not always, an aliphatic residue, and X is the terminal amino acid. The terminal amino acid determines farnesylation because FTase is preferentially active on protein substrates with CAAX [19]. This process is important for mediating protein-protein and protein-membrane interactions [31].
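The CAAX rule described above can be checked mechanically on a protein sequence. The following is a minimal sketch and not part of ProtozoaDB 2.0; the function name `has_caax_box`, the `strict` flag, the choice of Ala, Val, Ile and Leu as the aliphatic set, and the example peptides are illustrative assumptions.

```python
# Illustrative sketch only; not code from ProtozoaDB 2.0.
# Assumption: "aliphatic" is taken to mean Ala, Val, Ile, Leu.
ALIPHATIC = frozenset("AVIL")


def has_caax_box(protein: str, strict: bool = True) -> bool:
    """Return True if the last four residues form a C-terminal CAAX box.

    strict=True requires both middle residues to be aliphatic.
    strict=False only requires cysteine at the fourth-to-last position,
    reflecting that the 'A' positions are not always aliphatic.
    """
    seq = protein.strip().upper().rstrip("*")  # drop a trailing stop symbol
    if len(seq) < 4:
        return False
    c, a1, a2, _x = seq[-4:]
    if c != "C":
        return False
    if not strict:
        return True
    return a1 in ALIPHATIC and a2 in ALIPHATIC


# Hypothetical peptides, for illustration only.
strict_hit = has_caax_box("MSKKCVLS")
relaxed_only = has_caax_box("MSKKCKQS", strict=False)
```

A motif match of this kind only flags candidate substrates; it does not by itself establish that a protein is farnesylated.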
Prenylation (farnesylation) and subsequent modifications are essential for the correct membrane targeting and cellular functioning of a number of proteins in eukaryotic cells, such as Ras superfamily GTPases [34]. The farnesyltransferase enzyme is a heterodimer with two subunits: alpha (α) and beta (β). The α subunit consists of a double layer of paired alpha helices stacked in parallel, which partly enfold the β subunit like a mantle.
As shown in Figure 9, the prenyltransferase alpha subunit is present in various eukaryote species, and several studies show that this protein is potentially a good drug target in trypanosomatids [33], especially because inhibitors have potent activity against cultured forms and are less toxic to mammalian cells than to parasite cells. In addition, PFT inhibitors have been developed as antimalarial agents [35].
3.2. Comparison between ProtozoaDB 2.0 and EuPathDB
Both information extraction tools evaluated have several features that allow a wider analysis of the organism studied. EuPathDB offers a more comprehensive view of the characteristics of the protein investigated, whereas ProtozoaDB 2.0 focuses on inferring and/or confirming the functional annotation of a given protein, based on its primary annotation deposited in GenBank. Furthermore, ProtozoaDB 2.0 also gives a view of the role played by the protein in biological systems, including information on related literature. With ProtozoaDB 2.0 it is possible to re-annotate some of the proteins identified as “hypothetical”, using similarity-based programs as well as SuperFamily-based classification. Finally, using the tools provided by ProtozoaDB 2.0, it is also possible to infer potential drug targets, as described in our case study.