In Silico Tools and Phosphoproteomic Software Exclusives

Proteomics and phosphoproteomics have emerged as new dimensions of omics. Phosphorylation has a profound impact on the biological functions and applications of proteins, influencing everything from intrinsic activity and extrinsic function to cellular localization. With the advent of faster instrumentation, this post-translational modification has been studied in detail. The major strength of phosphoproteomic research is that it gives an overall picture of the workforce of the cell and deeper insight into the mechanisms behind the development and progression of disease. This review consolidates, for the first time, the list of existing bioinformatics tools developed for phosphoproteomics. The gap between the development of bioinformatics tools and their implementation in clinical research is highlighted. The main challenge to progress is believed to be the interdisciplinary nature of this field of research. For meaningful solutions and deliverables, these tools need to be implemented in clinical studies to answer pharmacodynamic questions while saving time, cost and energy. This review hopes to provoke some thought in this direction.


Introduction
Recent decades have seen an escalation in the application of computer-based knowledge to the life sciences, especially medicine. Proteomics and bioinformatics are now so deeply rooted in the biological sciences that it is hard to progress without their integration. Both of these interdisciplinary approaches draw on disciplines such as physics, chemistry, biology, computer science and engineering, and both have shown their potential in various areas of the biological sciences, particularly medicine. This interdisciplinary research has had an unequivocal impact on both fields and, through them, on the fundamental understanding of the core biological processes affecting human health and welfare.
Bioinformatics uses computational approaches to answer theoretical and experimental questions in the life sciences. The growth of the biotechnology industry has enormously impacted disease characterization, pharmaceutical discovery, clinical healthcare, forensics, molecular understanding and agriculture. These are core areas that fundamentally affect economic and social issues worldwide [1]. The incorporation of computing into biotechnological research has been responsible for rapid progress in this field. Scientific research has been in transition in recent years owing to the collective information obtained from numerous genome projects and the application of high-throughput technologies and mass spectrometry. The development of computational tools has not only roused hope, but has also opened up increasing opportunities for the study of biological systems [2]. Bioinformatics is now very much an enabling technology. A fundamental understanding of protein-protein interactions, protein identification and characterization, and post-translational modification has been achieved through bioinformatics approaches. The prediction of primary, secondary, tertiary and quaternary structures, as well as molecular modeling and visualization, has been realized through inputs from bioinformatics. Genomics, epigenomics, lipidomics, glycomics, foodomics and transcriptomics have all been a working hub of ongoing bioinformatics.
Phosphorylation is the chemical addition of a phosphoryl group (PO3−) to an organic molecule. The removal of a phosphoryl group is called dephosphorylation. Phosphorylation and dephosphorylation are carried out by kinases and phosphatases, respectively. Protein phosphorylation is the addition of a phosphoryl group to an amino acid residue. The residue is most commonly serine, although threonine and tyrosine in eukaryotes and histidine in prokaryotes are also phosphorylated. Protein phosphorylation occurs predominantly as a post-translational modification (PTM). Phosphoproteomics is the identification and characterization of proteins carrying phosphorylation as a PTM. This branch of omics provides insights into proteins that regulate essential signaling pathways. It also aids in the understanding of cellular processes, enabling the location of potential drug targets. Developments in sample preparation, enrichment, quantification and data analysis strategies have led to targeted phosphoproteome profiling. In shotgun phosphoproteomics, protein samples are enzymatically digested into peptides and phosphopeptides before analysis. Phosphoproteomics has enabled the identification of site-specific phosphorylation in plants [3]. Technological advancements in analytical instrumentation, sample preparation and data analysis [4][5][6][7][8][9][10][11][12][13][14] have made it possible to obtain high-quality, reproducible and comprehensive data sets. Researchers [15] were able to detect 50,000 phosphopeptides in a single human cancer cell line and quantify thousands of peptides within short time frames. Published reviews have discussed proteomics and phosphoproteomics at length in the context of precision medicine [16][17][18]. The ability of phosphoproteomics data to provide mechanistic information for the understanding of disease has been a crucial breakthrough [19][20][21].
The resistance of melanoma cells to BRAF inhibitors [19], as well as that of glioblastoma cells to mTOR inhibitors, has been understood via phosphoproteomic studies. This information has led to the discovery of novel combination therapies [20]. Other authors [22] used phosphoproteomics data to assign tumor types for designing treatment routines. The same authors studied acute myeloid leukemia primary cells to identify differences in kinase activation between cells and relate them to drug resistance profiles [23]. Phosphoproteomics has helped unravel the bidirectional signaling between endothelial cells and tumor cells, giving a better understanding of the metastatic mechanisms of tumor cells [21]. Phosphoproteomic data have been employed to create mechanistic models of colorectal cancer cell lines for the understanding of specific drug resistance [24]. It is well known that technological advancements and community efforts to standardize protocols and achieve reproducible results are vital for disease and patient stratification. Besides the data reproducibility issue that the mass spectrometry community is confronting, the need for data type-specific methods to extract valuable information is another issue. The role of bioinformatics in proteomics/phosphoproteomics is thus evident: storage of huge volumes of information, cross-examination and cross-verification of patient sample information, simulation studies, simplification of in vivo/in vitro processes through theoretical approaches, and understanding of the underlying interactions and networks within diseased cells. Figure 1 gives the overall workflow of phosphoproteomics, indicating the junctures (data acquisition and data analysis) where bioinformatic tools play a pivotal role. The present review focuses on highlighting the importance of phosphoproteomic research and of bioinformatics approaches and inputs into this area of research.
The milestones achieved thus far via such an integration are presented, and the challenges facing this integration are discussed. This review shows that, in spite of the valuable deliverables from phosphoproteomics, interest from the research community in this area of omics is limited. Only a few tens of publications are on record; the need for implementation and the reasons for this limited popularity are also discussed in this review.

Biocomputational Tools for Proteomics-A Snapshot
With the increasingly large variety of proteomics workflows and data outcomes, the Human Proteome Organisation (HUPO) [25] is facing a major challenge. It is here that there is room for a new generation of the ProteinScape™ bioinformatics platform, supported by LOOPP and PROCHECK software, to contribute. This could prove helpful in furthering the functional characterization of specified proteins [26]. Other basic resources for proteins and genes include the UniProt Knowledgebase and Entrez; tools such as PANTHER, DAVID, KEGG and IPA have been improved for data mapping. These tools are useful in understanding the functions of proteins in cells and their intricate interactions. The Coon OMSSA Proteomic Analysis Software Suite (COMPASS) is freely available software for high-throughput analysis of proteomics data, based on the Open Mass Spectrometry Search Algorithm [27]. SPIRE (Systematic Protein Investigative Research Environment) has an easy-to-use web interface that generates simple, interactive data formats [28] for mass spectrometric (MS) data. ScanRanker identifies unassigned high-quality spectra (those that evaded identification) and selects spectra for de novo sequencing and for the analysis of cross-linked proteins [29]. Also available are computer-based tools for biological pathways, such as iPath, Protein Lounge, BioCarta, KEGG and MetaCyc. Other software available for network analysis includes Ingenuity Pathway Analysis, the MetaCore integrated software suite based on MetaBase, PathwayStudio, GenMAPP (Gene Map Annotator and Pathway Profiler) and Cytoscape [30].
To transform large-scale, biologically relevant proteomic data into valuable information [31], novel and improved computational tools are required. Various bioinformatics tools tailored to address the pressing needs of proteomics are available: Proteo Connections, Pathway Browser, and interaction databases such as IntAct, ChEMBL, BioGRID [32] and ProteoRed MIAPE [33]. Proteomic repositories such as PRIDE, Global Proteome Machine and PeptideAtlas are also available to store huge volumes of mass spectral data and their respective protein identifications [34]. ANTILOPE is used for mathematical programming [35] and Peptidomimetics Based Inhibitor Design is a drug design tool [36]. Genome Medicine Database of Japan Proteomics (GeMDBJ Proteomics) is a free database [37], and HUPO [38] is another proteomic database that exchanges data with resources such as Primer3 software [39], ClustalW [40], SWISS 2DPAGE and others [41,42]. The MAPU (Max-Planck Unified Proteome Database) 2.0 database contains a huge collection of proteomes of organelles, tissues and cell types [43]. It aids in the retrieval of organism-specific proteomic data obtained from high-accuracy MS-based proteomics and provides insight into general features ranging from gene ontology classification to SwissProt annotation. MODELLER 9v2 software is used to predict the three-dimensional structure of proteins, and PROCHECK and VERIFY 3D are used to assess the quality of the resulting models [44,45].
Emerging tools used in various R&D sectors are summarized [46] as follows: (i) FindMod: Predicts post-translational modifications and single amino acid substitutions in peptides; (ii) FindPept: Identifies peptides; (iii) Mascot: Useful in protein identification by peptide mass fingerprinting; (iv) PepMAPPER: A web-based mapping tool developed for the purpose of epitope prediction and for sequence-structure alignment of proteins; (v) ProFound: Searches known protein sequences; (vi) ProteinProspector: Tools for peptide masses data (MS-Fit, MS-Pattern, MS-Digest); (vii) AACompIdent: Identifies a protein by its amino acid composition; (viii) AACompSim: Compares amino acid composition; (ix) TagIdent: Identifies proteins based on isoelectric point (pI), molecular weight (Mw) and sequence tag; (x) MultiIdent: Identifies proteins; (xi) InterPro Scan: Searches associated proteins with PROSITE, Pfam, PRINTS and other family and domain databases; (xii) MyHits: Establishes connectivity between protein sequences and motifs; (xiii) ScanProsite and HamapScan: Scans a sequence against PROSITE/HAMAP families; and finally, (xiv) MotifScan: Scans a sequence against protein profile databases [47].
The list extends further; these tools have been extensively reviewed elsewhere [47][48] in the context of integrating bioinformatic tools for proteomics research.

Biocomputational Tools for Phosphoproteomics
Phosphoproteomic tools generally follow one of two approaches. The first involves searching a sequence database and then analysing the results with designated tools [49][50]. The second employs searching a spectral localization library [51][52][53]. The following sections survey the phosphoproteomic software options currently available.

Tools for Analysis of Phosphopeptide Data/Spectra
SimPhospho accurately simulates phosphopeptide tandem mass spectra, which can then be searched through spectral libraries, leading to highly accurate phosphosite validation. The SimPhospho software uses the ProteoWizard project [54] and an XML library (Thomson), and includes a Qt framework-based user interface. Two XML files [55] serve as input to SimPhospho: (i) a pep.xml file that contains search results [56] and (ii) an mzXML file that holds mass spectra. The software can be retrieved at https://sourceforge.net/projects/simphospho/.
The typical outcome of MS-based proteomics is the identification of peptides assigned to proteins. As extensive sub-proteomes and sub-phosphoproteomes of living cells are detected, the description, storage, management and recovery of the resulting data become challenging. For this purpose, PHOSIDA, the Phosphorylation Site Database (http://www.phosida.com), was created [57]. The aim of PHOSIDA is to make high-quality, quantitative phosphoproteomic data available for mapping cell regulation after treatment with a stimulus. PHOSIDA is multifunctional in that it predicts putative phosphorylation sites, acetylation sites and other post-translational modification sites, and analyzes phosphorylation events in proteins of interest. Computer-based extraction of knowledge from comprehensive datasets of this kind is the agenda of 'knowledge discovery in databases' (KDD).
A large number of phosphopeptides and proteins are detected through mass spectrometry-based phosphoproteomics. The critical challenge is the manual analysis of the downstream data. To automate this step, a software tool called PhosFox [58] has been launched, which enables peptide-level processing of phosphoproteomic data produced by Mascot, Sequest and Paragon. PhosFox aids in qualitative and quantitative phosphoproteomics studies and detects phosphorylated peptides and proteins. It also distinguishes differences between phosphorylation sites.
Normalization is a crucial step when analyzing phosphoproteomics data. Global centering by median normalization has been widely employed in label-free MS-based proteomics [59]. It works on the assumption that the abundances of most peptides do not change between samples [60,61]. Researchers have reported, however, that applying global-centering normalization introduces bias in the distribution of fold changes of phosphopeptides across samples. To address this, an R package called phosphonormalizer, which performs pairwise normalization, has been launched [62].
While thousands of phosphopeptides are identified in complex biological specimens, tools to detect and evaluate large numbers of phosphopeptides and their related data are needed. Skyline is a freely available, open-source Windows client application for building Selected Reaction Monitoring (SRM)/Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM), Data Independent Acquisition (DIA/SWATH) and Data Dependent Acquisition (DDA) with MS1 quantitative methods, and for analyzing the resulting mass spectrometer data. MaxQuant [63] and Skyline [64] have been used on a few occasions for phosphopeptide identification and quantification. A couple of excellent reviews describe these software programs in more detail [65,66].
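To make the global-centering step concrete, the following is a minimal Python sketch of median centering on log2 intensities. It is only an illustration of the general idea, not the pairwise method implemented in phosphonormalizer; the function name, the sample names and the intensity values are hypothetical.

```python
from statistics import median


def median_center(log_intensities):
    """Subtract each sample's median log2 intensity from its values.

    log_intensities maps a sample name to a list of log2 intensities,
    with None marking a peptide that was not quantified in that sample.
    """
    centered = {}
    for sample, values in log_intensities.items():
        observed = [v for v in values if v is not None]
        sample_median = median(observed)
        # Missing values stay missing; observed values are shifted.
        centered[sample] = [
            None if v is None else v - sample_median for v in values
        ]
    return centered


# Hypothetical log2 intensities for four phosphopeptides in two samples.
samples = {
    "control": [20.0, 22.0, 24.0, None],
    "treated": [21.0, 23.0, 25.0, 27.0],
}
normalized = median_center(samples)
```

After centering, every sample has a median of zero, which is exactly why the method can distort fold changes when a treatment shifts phosphorylation globally rather than for a minority of peptides.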

Tools for Phosphorylation Site Assignment
Correct phosphorylation site assignment is a critical aspect of any phosphoproteomic analysis. PhosphoScore [67] is one such site assignment program. It scores candidate assignments using both the match quality and the intensity of observed spectral peaks relative to a theoretical spectrum. PhosphoScore is reported [67] to produce >95% correct MS2 assignments. Ascore [68] is another statistical algorithm, which measures the probability that a phosphorylation site has been correctly localized. Phosphorylation sites with an Ascore ≥ 19 are usually considered unambiguously assigned.
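Probability-based localization scores of this kind can be illustrated with a short Python sketch. It assumes, as a simplification, that fragment-ion matches follow a binomial model and that a score is -10·log10 of the chance probability; the score for a site is then the difference between the best and second-best candidate positions. The published Ascore algorithm has additional steps (such as peak-depth selection and site-determining ions) that are not reproduced here, and all function names and numbers below are hypothetical.

```python
from math import comb, log10


def binomial_peptide_score(n_matched, n_trials, p_match):
    """Return -10*log10 of the chance of at least n_matched random matches.

    n_trials is the number of theoretical fragment ions considered and
    p_match the assumed probability that any one of them matches by chance.
    """
    tail = sum(
        comb(n_trials, k) * p_match**k * (1 - p_match) ** (n_trials - k)
        for k in range(n_matched, n_trials + 1)
    )
    return -10 * log10(tail)


def site_score_difference(matched_best, matched_second, n_site_ions, p_match):
    """Score difference between the two best candidate site positions."""
    best = binomial_peptide_score(matched_best, n_site_ions, p_match)
    second = binomial_peptide_score(matched_second, n_site_ions, p_match)
    return best - second


# Hypothetical case: 5 discriminating ions, the best site matches all 5,
# the runner-up matches only 2, and the chance match probability is 0.1.
difference = site_score_difference(5, 2, 5, 0.1)
```

A large difference means the spectrum discriminates clearly between the two candidate positions, which is the intuition behind applying a fixed cut-off to such a score.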

Tools for Prediction of Phosphorylation Sites
Protein phosphorylation is catalyzed by a group of enzymes called kinases, which add a phosphate group (PO4) to serine (S), threonine (T), tyrosine (Y) and histidine (H) residues. Conversely, phosphate moieties on substrates can be removed by phosphatases. Since many members of the human protein kinase family are implicated in cancer, their alteration or dysregulation is reported to provide clinically validated targets for the personalized treatment of cancer [69,70]. Identifying and characterizing kinases and their phosphorylation sites is therefore a prerequisite for understanding protein kinase-regulated signaling pathways and their impact on health and disease. While most or all protein kinases have been identified, the sites that they phosphorylate are not well understood. Many computational techniques for phosphorylation site prediction have been proposed. These differ in several ways, including the machine learning technique; the sequence information used; the number of residues surrounding the phosphorylation site; the use of structural versus sequence information; and whether predictions are made for specific kinases or for kinases in general. A few review articles that discuss computational phosphorylation site prediction have previously been published. Kobe et al. [71] provided a brief review of this field [72], while Miller and Blom [73] briefly summarized the literature on phosphorylation site prediction and discussed their NetPhos [74,75] family of tools. Xue et al. [76] reviewed, and Trost and Kusalik [77] extensively reviewed, the tools available for prediction of phosphorylation sites. The list includes tools such as NetPhosK, PHOSITE and Predikin 1. MusiteDeep [78] is an advanced deep-learning framework that predicts general and kinase-specific phosphorylation sites.
DeepPhos [79] is another deep-learning architecture for the prediction of protein phosphorylation that can also be applied to kinase-specific prediction. DeepPhos is reported to outperform competing predictors in both general and kinase-specific phosphorylation site prediction. PhosphoPredict [80] is yet another bioinformatics tool, which combines protein sequence and functional features to predict kinase-specific substrates and their associated sites.
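Whatever the learning technique, most sequence-based predictors start from a window of residues around each candidate site. The Python sketch below shows that preprocessing step only; it is not taken from any of the tools named above, and the function name, the default flank of 7 residues, the padding character and the example sequence are all illustrative assumptions.

```python
def candidate_site_windows(sequence, flank=7, residues="STY", pad="-"):
    """List (position, residue, window) for every candidate phosphosite.

    Positions are 1-based. Each window holds `flank` residues on either
    side of the candidate, padded with `pad` at the protein termini so
    that all windows have the same length (2 * flank + 1).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for index, amino_acid in enumerate(sequence):
        if amino_acid in residues:
            window = padded[index : index + 2 * flank + 1]
            windows.append((index + 1, amino_acid, window))
    return windows


# Hypothetical five-residue sequence with a flank of two residues.
sites = candidate_site_windows("MKSPT", flank=2)
```

The fixed-length windows can then be encoded numerically (for example one-hot per residue) and passed to whichever classifier a given tool uses; the window size is one of the design choices on which the published predictors differ.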

Tools for Detection of Phosphosites and Kinase Activity from Phosphopeptide Data
Another approach to phosphoproteomics is through biochemical methods in which kinase activities are assessed in vitro [81,82]. The major limitations are that these methods are low in throughput and time-consuming. In vitro methods also do not reliably reflect the in vivo activities of kinases, which is why MS-based methods are needed for evaluating kinase activity [83,84]. An approach to link phosphoproteomics data with the activity of kinases, known as kinase activity analysis (KAA), was presented by Qi et al. [85]. CLUE (CLUster Evaluation) is a method designed specifically for phosphoproteomics data [86], based on the hypothesis that phosphosites targeted by the same kinase will show similar temporal profiles. This principle is used to guide the clustering algorithm and to identify the kinases associated with the resulting clusters. The abundances of the target phosphosites are studied using MS followed by in vitro enzymatic reactions. Since every phosphorylation event results from the activity of a kinase, such data can be used to infer the activity of many kinases without the need for dedicated experiments. This task requires computational analysis of the detected phosphorylation sites (phosphosites), since thousands of phosphosites can routinely be measured in a single experiment. GSEA (Gene Set Enrichment Analysis) is generally applied to an entire set of gene expression data in order to obtain extensive information. It has also been reported to be useful for inferring kinase activity from phosphoproteomics data, a task analogous to inferring transcription factor activity from gene expression data.
There are many freely available databases that collect experimentally verified phosphosites, such as PhosphoSitePlus [87], Phospho.ELM [88], Signor [89] and PHOSIDA (explained above) [90]. These databases differ in size and aim. For example, Phospho.ELM computes a score for the conservation of a phosphosite, and Signor focuses on interactions with proteins involved in signal transduction. PhosphoNetworks [91] is dedicated to kinase-substrate interactions. One other prominent database for interactions between kinases and individual phosphosites is PhosphoSitePlus. PhosphoGRID provides analogous information [92] for Saccharomyces cerevisiae. Specific information about phosphatase targets can be found in DEPOD [93]. It has been estimated that there are between 100,000 [94] and 500,000 possible phosphosites in the human proteome, and this has motivated the development of computational tools to predict in vivo kinase-substrate relationships [95]. Scansite [96] uses position-specific scoring matrices (PSSMs) obtained by positional scanning of peptide libraries [97] or phage display methods [98]. NetPhorest [99] classifies phosphorylation sites instead of predicting individual kinase-substrate links [75,100]. The software packages NetworKIN [101] (extended as KinomeXplorer [102]) and iGPS [103] combine information about kinase recognition motifs, in vivo phosphorylation sites and contextual information (STRING database [104][105][106]).
Currently available applications that offer kinase-related analyses include inference of kinase activities from phosphoproteomics (IKAP) [107], kinase perturbation analysis (KinasePA) [86], CLUE [108] and Kinase Enrichment Analysis (KEA) [109], now updated as KEA2. IKAP is platform-specific, KinasePA and CLUE are limited to multi-condition studies, and KEA is based on substrate overrepresentation. Kinase-Substrate Enrichment Analysis (KSEA) [110] scores each kinase based on the relative hyperphosphorylation or dephosphorylation of its substrates. To make KSEA available to the wider scientific community, a web-based implementation called the KSEA App has been developed. KSEA App version 1.0 is hosted on the shinyapps.io server as a free online tool: https://casecpb.shinyapps.io/ksea/. The tool is also available as the R package 'KSEAapp' on CRAN: https://CRAN.R-project.org/package=KSEAapp/.
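As an illustration of how a position-specific scoring matrix is applied to a candidate site, the following Python sketch sums one per-position score for each residue of a peptide window. The matrix values are invented for the example (loosely resembling a basophilic motif) and are not Scansite's matrices; the function name and the default score of zero for unlisted residues are likewise assumptions.

```python
def pssm_score(peptide, pssm, default=0.0):
    """Sum the per-position scores of a peptide under a PSSM.

    pssm is a list with one dict per position, mapping an amino acid to
    its score at that position; residues not listed score `default`.
    """
    if len(peptide) != len(pssm):
        raise ValueError("peptide and PSSM must have the same length")
    return sum(
        column.get(amino_acid, default)
        for amino_acid, column in zip(peptide, pssm)
    )


# Hypothetical five-position matrix centred on the fourth residue.
toy_pssm = [
    {"R": 2.0},            # position -3
    {},                    # position -2, no preference
    {},                    # position -1, no preference
    {"S": 3.0, "T": 2.5},  # phosphoacceptor
    {"P": 1.0},            # position +1
]

strong = pssm_score("RKLSP", toy_pssm)
weak = pssm_score("AKLTA", toy_pssm)
```

Ranking all candidate windows in a proteome by such a score, and comparing against a background distribution, is the general way a motif matrix is turned into a list of predicted substrate sites.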
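The idea behind substrate-based kinase scoring can be sketched in a few lines of Python. The z-score below, (mean of substrate log2 fold changes minus mean of all sites) multiplied by the square root of the substrate count and divided by the standard deviation of all sites, is one common way such scores are formulated and is used here as an assumption; it is not copied from the KSEAapp package. The function name, the use of the sample standard deviation, the site identifiers and the fold-change values are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def kinase_z_scores(site_log2fc, kinase_substrates, min_substrates=1):
    """Score each kinase from the log2 fold changes of its substrates.

    site_log2fc maps a phosphosite identifier to its log2 fold change.
    kinase_substrates maps a kinase name to a list of site identifiers.
    """
    all_values = list(site_log2fc.values())
    background_mean = mean(all_values)
    background_sd = stdev(all_values)
    if background_sd == 0:
        raise ValueError("fold changes have zero variance")
    scores = {}
    for kinase, sites in kinase_substrates.items():
        # Only substrates that were actually quantified contribute.
        observed = [site_log2fc[s] for s in sites if s in site_log2fc]
        if len(observed) < min_substrates:
            continue
        shift = mean(observed) - background_mean
        scores[kinase] = shift * sqrt(len(observed)) / background_sd
    return scores


# Hypothetical data: four quantified sites and two kinases.
fold_changes = {"A_S10": 2.0, "B_T5": 1.0, "C_Y7": -1.0, "D_S3": -2.0}
substrates = {
    "K1": ["A_S10", "B_T5"],
    "K2": ["C_Y7", "D_S3"],
    "K3": ["E_S1"],  # substrate not quantified, so K3 is skipped
}
scores = kinase_z_scores(fold_changes, substrates)
```

A positive score indicates that a kinase's known substrates are collectively more phosphorylated than the background, a negative score that they are less so; the quality of the result depends directly on the kinase-substrate annotations supplied.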

Future Direction-Implementation of Biocomputation Integrated Phosphoproteomics
As summarized above and in Table 1, it is evident that biocomputation has played a vital role in establishing phosphoproteomics as a well-accomplished offshoot of omics. However, little implementation of these tools in real applications has been reported. Most publications present the potential of the bioinformatics tools, or their development, and very few target their implementation in relevant applications. Few publications on applying phosphoproteomics to precision medicine have been reported. Additionally, not much progress has been made in applying any of these bioinformatics tools for phosphoproteomics in clinical or plant/animal biotechnological research.
We assume that the challenge arises because, compared to proteomics, phosphoproteomics is a more specialized field requiring more expertise. This could be one reason for the lack of extensive research interest in this direction. Moreover, the highly interdisciplinary nature of the field, which demands familiarity with molecular biology, computation, protein chemistry and informatics, could itself be a limiting factor. However, with good progress having been made in the development of such valuable bioinformatics tools, it is now high time that these resources are put to productive, real-world use and their ultimate utility realized. This review points out the lacuna: in spite of so many tools being developed, little has been accomplished in terms of fundamental understanding of human diseases or animal/plant pathogenicity. Except for a few reports on cancer-related studies where bioinformatics-driven phosphoproteomic approaches proved useful, there appears to be almost no implementation. An interdisciplinary approach, with researchers from different disciplines collaborating, should lead to progress and to practical benefits.

Conclusions
This review aimed to consolidate the bioinformatic tools available, giving a snapshot of those useful for proteomics and touching on the tools available for phosphoproteomics. Despite such valuable tools having been developed, their real-world application in clinical/pathological research and investigations is still far from accomplished. It is about time that bioinformatics tool developers worked together with biologists to implement their tools.