An In Silico Identification of Common Putative Vaccine Candidates against Treponema pallidum: A Reverse Vaccinology and Subtractive Genomics Based Approach

Sexually transmitted infections (STIs) are caused by a wide variety of bacteria, viruses, and parasites that are transmitted from one person to another primarily by vaginal, anal, or oral sexual contact. Syphilis is a serious disease caused by a sexually transmitted infection. Syphilis is caused by the bacterium Treponema pallidum subspecies pallidum. Treponema pallidum (T. pallidum) is a motile, gram-negative spirochete, which can be transmitted both sexually and from mother to child, and can invade virtually any organ or structure in the human body. The current worldwide prevalence of syphilis emphasizes the need for continued preventive measures and strategies. Unfortunately, effective measures are limited. In this study, we focus on the identification of vaccine targets and putative drugs against syphilis disease using reverse vaccinology and subtractive genomics. We compared 13 strains of T. pallidum using T. pallidum Nichols as the reference genome. Using an in silicoapproach, four pathogenic islands were detected in the genome of T. pallidum Nichols. We identified 15 putative antigenic proteins and sixdrug targets through reverse vaccinology and subtractive genomics, respectively, which can be used as candidate therapeutic targets in the future.


Introduction
Sexually transmitted infections (STIs) are triggered by a number of bacteria, viruses, and parasites that are transferred mainly by vaginal, anal, or oral sexual contact between people. Different STIs can be existent or transmitted instantaneously, and such infections can trigger other STIs [1]. The World Health Organization (WHO) has reported more than 30 different bacteria, viruses, and parasites that are responsible for disease transmission through sexual contact.
Syphilis is among the most severe sexually transmitted infections (STIs) caused by the Treponema pallidum subspecies pallidum, a motile, gram-negative spirochete bacterium [2]. The annual estimated frequency of infectious syphilis is 36 million cases and over 11 million new infections; thus, it is an important public health burden globally [3]. Furthermore, the number of cases increased 10-fold

Identification of Intra-Species Conserved Non-Host Homologous Proteinsand Pathogenicity Islands
We compared 13 Treponema pallidum strains (Table 1) using Treponema pallidum Nichols as the reference using the orthoMCL software [26]. Coding DNA sequences (CDSs) shared by all species were considered a part of the core genome. Considering the human genome as the host genome, a set of 565 conserved non-host homologous proteins were identified. The prediction of genomic islands (GIs) was subsequently performed. GIs are gene clusters, usually >8 kb in size, likely acquired via horizontal gene transfers (HGT), and often playing a role in the environmental or host adaptation of bacteria. GIs significantly influence bacterial evolution and provide further insight in

Identification of Intra-Species Conserved Non-Host Homologous Proteinsand Pathogenicity Islands
We compared 13 Treponema pallidum strains (Table 1) using Treponema pallidum Nichols as the reference using the orthoMCL software [26]. Coding DNA sequences (CDSs) shared by all species were considered a part of the core genome. Considering the human genome as the host genome, a set of 565 conserved non-host homologous proteins were identified. The prediction of genomic islands (GIs) was subsequently performed. GIs are gene clusters, usually >8 kb in size, likely acquired via horizontal gene transfers (HGT), and often playing a role in the environmental or host adaptation of bacteria. GIs significantly influence bacterial evolution and provide further insight in differentiating bacterial species and strains. For T. pallidum Nichols strains, 10 putative GIs were identified through the Genomic Island Prediction Software (GIPSy) [27], using Treponema denticola as a closely related, non-pathogenic organism. Of the 10 GIs, four are classified as pathogenicity islands (PAIs), i.e., they present high concentrations of virulence factors and are absent in the aforementioned closely related non-pathogenic organism ( Figure 2). differentiating bacterial species and strains. For T. pallidum Nichols strains, 10 putative GIs were identified through the Genomic Island Prediction Software (GIPSy) [27], using Treponema denticola as a closely related, non-pathogenic organism. Of the 10 GIs, four are classified as pathogenicity islands (PAIs), i.e., they present high concentrations of virulence factors and are absent in the aforementioned closely related non-pathogenic organism ( Figure 2). . Genomic islands (GIs) of T. pallidum Nichols strains as predicted by the genomic island prediction software (GIPSy) using Treponema denticola as a closely related non-pathogenic organism. The outermost circle highlighted in red shows the four pathogenicity islands from 10 GIs. Guanine-Cytosine (GC) content is shown in black.

Figure 2.
Genomic islands (GIs) of T. pallidum Nichols strains as predicted by the genomic island prediction software (GIPSy) using Treponema denticola as a closely related non-pathogenic organism. The outermost circle highlighted in red shows the four pathogenicity islands from 10 GIs. Guanine-Cytosine (GC) content is shown in black.

Assessment of Essential Genes
Essentiality analysis identifies significant genes required for pathogen survival such as adhesion, entry into the host, infection, and persistence in the host [13]. The conserved 565 non-hosts homologous proteins were subjected to the Database of Essential Genes (DEG) for the identification of essential proteins, through which a final set of 268 proteins was obtained (Table S1). Essential proteins are necessary for the survival of pathogen within the host. When these essential proteins are declared to be virulent, they can be of vital significance to unveil novel therapeutic targets. There is a probability of essential proteins to be conserved among various populations and species because of their vital roles in various pathways for pathogen survival [13,28]. Virulence is the characteristic of a pathogen responsible for causing severe human diseases. In the present study, these properties have been given high priority to identify potential vaccine candidates computationally. Although only 268 proteins were identified as essential by DEG, we considered all 565 proteins for our analyses.

Prediction of Candidate Vaccine Target for T. pallidum
The subcellular localization of conserved non-hosts homologous proteins of T. pallidum strains were predicted with the SurfG+ software [29]. We classified 207 gene products as putative surface-exposed (PSE) proteins, secreted proteins, or membrane proteins ( Table 2). The proteins predicted by SurfG+ were further analyzed with the software Vaxign [30] for antigenic properties with adhesion probabilities greater than 0.51, resulting in the detection of three proteins in the T. pallidum strains Nichols (Table 3). We found that out of these three proteins, Tp_Nichols141 and Tp_Nichols797 were hypothetical proteins. Tp_Nichols141 belongs to the pathogenicity island 1 ( Figure 2). When the adhesion probability threshold was >0.4, we also identified 12 more proteins that can also be considered potential vaccine candidates against T. pallidum. Previous studies have shown the importance of targeting proteins involved in the capability of T. pallidum to invade host tissues and to evade the functional immune response, contributing to its persistence during the "latency" stage. Most of the described gene targets code for proteins responsible for the attachment to extracellular matrix bridges (Tp0136, TP0155, Tp0483, and Tp0751), such as the low density integral Outer Membrane Proteins (OMPs) [6]. Briefly, in our predictions of good vaccine targets, we have identified Tp_Nichols350 and TpNichols852 with similarities to two previously described OMPs (TP0453 and Tp_0326), along with two additional OMP domain containing proteins: Tp_Nichols797 and Tp_Nichols141. Interestingly, both Tp_Nichols797 and Tp_Nichols141 presented adhesion probabilities higher than 0.5 and should be given priority in in vitro assays.

High Throughput Structural Modeling
The main focus of this study was to find candidate vaccine targets. However, according to Caroline et al., 2014 [6], the difficulty in curing syphilis is due to the vilification of many antibiotics for treatment or prophylaxis. Our contributionsincludetheprediction of some novel drug targets against Treponema pallidum. For this, the identified 565 conserved non-host homologous Treponema pallidum proteins were submitted to MHOLline [31] an online web tool, to predict the modelome. MHOLline utilizes multi-fasta files of amino acids as an input data and then uses HMMTOP, BLAST, BATS, MODELLER, and PROCHECK programs for the detailed analyses. The program HMMTOP detects transmembrane regions. The BLAST algorithm is used to identify the template structure by performing a random search against the Protein Data Bank. BATS (Blast Automatic Targeting for Structures) carries out the refinement in the template search and it is a key step for the model construction. BATS refinement identifies sequences that make the modeling possible by selecting a template from a BLAST output file using their BATS scores, expectation values, identity, and sequence similarity as criteria, as well as considering the number of gaps and the alignment coverage. BATS selects the best template for 3D model generation and performs automated alignment using the MODELLER program. Furthermore, it gathers all the BLAST output files into four distinctive groups (i.e., G0, G1, G2, and G3) according to the following criteria: G0 = unaligned sequence; G1 = E-value > 10 × 10 −5 or identity <15%; G2 = E-value ≤ 10 × 10 −5 and identity ≥25% AND LVI ≤ 0.7; G3 = E-value ≤ 10 × 10 −5 and identity ≤15% and <25% OR LVI (Length Variation Index) >0.7. Only the first three distinct quality G2 model groups were taken into consideration in this study; these were: 1-very high quality model sequences (identity ≥75%) (LVI ≤ 0.1), 2-high quality model sequences (identity ≥50%) and <75%) (LVI ≤ 0.1), and 3-good quality model sequences (identity ≥50%) (LVI > 0.1 and ≤0.3) [31]). Therefore, all the considered protein 3D models were constructed from sequences for which their template is available with identity ≥50%. We found 26 proteins (8 very high, 12 high, and 6 good) in the first 3 distinct quality G2 model groups.
The membrane and cell wall associated proteins are, theoretically, more exposed as targets than the cytoplasmic drug targets. However, membrane proteins are difficulty to purify and assay [32]. Cytoplasmic membrane proteins are also very important for the physiology of bacteria, as they are involved in many important metabolic functions. Therefore, the membrane, putative surface exposed, and secreted proteins are better applicable as targets for reverse vaccinology, whereas the pivotal role of cytoplasmic proteins in maintenance of cell viability makes them more favorable as drug targets [33]. Out of the 26 proteins, only cytoplasmic proteins that were present in any GIs were selected as candidate drug targets. Six proteins that were also present in the 268 proteins were identified as essential in the DEG analyses and were considered for the target prioritization and docking studies ( Table 4).
The outer membrane may pose a barrier for drugs to gain access to cytoplasmic targets. However, small molecules are able to gain access to the periplasm through porins and reach the cytoplasm. In previous studies, it was shown that one of the pore forming OMPs, OmpF, has an exclusion limit of 600 Daltons, for example, which is used by ions, amino acids, and small sugars as a means to reach the periplasm [34]. The molecular weight of the compounds used here varies from~275.1 g/mol (liriodenine) to~488.7 g/mol (jacarandic acid) and they may also be able to use porins to gain access to the periplasm. Alternatively, the use of nanoparticles as delivery systems or a combined treatment, such as with polymyxins and derivatives that increase the permeability of the outer membrane, may also help in overcoming the outer membrane barrier [35].

Analyses of Non-Host Homologous Targets and Molecular Docking
In molecular docking, lower energy scores represent better protein-ligand bindings compared to higher energy values [37]. We considered the lower MolDock score and the interaction with the residues that were involved in the active site of the target for the prediction of therapeutic candidates. For each target protein (uvrB, pfp, asnA, recA, ndh, and dxs), a library of 28 natural compounds were docked to examine each molecule one-by-one for the selection of the final set of promising molecules that showed favorable interactions with the active site residues of targets. The biological importance for each target is described here (Table 4) along with an analysis of the predicted protein-ligand interaction(s). The name of the molecules, MolDock scores for the selected ligands, and the number of predicted hydrogen bonds with the active residues involved in these interactions are shown below for each target protein ( Table 5). The predicted configurations of one of the best-docked molecules are also shown for each pathogen target in Figure 3A

Analyses of Non-Host Homologous Targets and Molecular Docking
In molecular docking, lower energy scores represent better protein-ligand bindings compared to higher energy values [37]. We considered the lower MolDock score and the interaction with the residues that were involved in the active site of the target for the prediction of therapeutic candidates. For each target protein (uvrB, pfp, asnA, recA, ndh, and dxs), a library of 28 natural compounds were docked to examine each molecule one-by-one for the selection of the final set of promising molecules that showed favorable interactions with the active site residues of targets. The biological importance for each target is described here (Table 4) along with an analysis of the predicted protein-ligand interaction(s). The name of the molecules, MolDock scores for the selected ligands, and the number of predicted hydrogen bonds with the active residues involved in these interactions are shown below for each target protein ( Table 5). The predicted configurations of one of the best-docked molecules are also shown for each pathogen target in Figure 3A-F. Based on a structural comparison with a crystallographic structure of the uvrB template (2d7d, uvrB from Bacillus subtilis), the active site residues involved in H-bond interactions with the crystallographic ligand adenosine-5′-diphosphate are Phe10, Gln11, Gln16, Gly41, Gly43, and Arg541. One of these residues, Gly41, was predicted to make hydrogen bonds to the ligand potamogetonin (CID 5742898) with a MolDock score of −97.81. Similarly, for the target pfp template (2F48, Borrelia burgdorferi), the active site residues involving in H-bond interactions are Lys211, Pro210, Asp214, Gly90, Tyr434, Arg154, Met259, Arg261, and Glu320. The residue Lys211 interacts Based on a structural comparison with a crystallographic structure of the uvrB template (2d7d, uvrB from Bacillus subtilis), the active site residues involved in H-bond interactions with the crystallographic ligand adenosine-5 -diphosphate are Phe10, Gln11, Gln16, Gly41, Gly43, and Arg541. One of these residues, Gly41, was predicted to make hydrogen bonds to the ligand potamogetonin (CID 5742898) with a MolDock score of −97.81. Similarly, for the target pfp template (2F48, Borrelia burgdorferi), the active site residues involving in H-bond interactions are Lys211, Pro210, Asp214, Gly90, Tyr434, Arg154, Met259, Arg261, and Glu320. The residue Lys211 interacts with jacarandic acid (CID 73645) and pinoresinol (CID 234817) with MolDock scores of −62.15 and −112.67, respectively. The compound leptophyllin B (CID 10447482) interacts with the identified active site residues Ser111, Cys113, Asp115, Tyr218, and Ser251of asnA (PDB ID: 12AS from Escherichia coli) and Leu298, Asp32, and Asn36 of ndh (PDB Template ID: 2BC0 from Streptococcus pyogenes).
Interestingly, the drug molecule pinoresinol (CID234817) was predicted to show good results against four of our targets uvrB, pfp, asnA, and dxs. Pinoresinol is a lignan, biphenolic compound found in Araucaria araucana and Sambucus williamsii. It possesses bactericidal and fungicidal activities and therapeutic potential as an antifungal agent for the treatment of fungal infectious diseases in humans [38,39]. Thus, the identification of pinoresinol in our in silico study strengthens our protocol and can be potentially used as a new drug for the treatment of syphilis.

Selection of Data
The genome sequences of all 13 strains of T. pallidum were retrieved from the NCBI (National Center for Biotechnology Information) server [40]. For homogeneity in the functional annotation, all genomes were annotated using the RAST server (RapidAnnotationsusing SubsystemsTechnology) [41]. Furthermore, these annotated genome sequences were used for analysis.

Identification of Intra-Species Conserved Non-Host Homologous Proteins
In comparative genomics, the orthologous genes are clustered to obtain a framework to integrate information from multiple genomes, highlighting the conservation and divergence of gene families and biological processes. For pathogens, clustering orthologs can facilitate drug and/or vaccine targets identification. We compared 13 strains of Treponema pallidum using Treponema pallidum Nichols as the reference genome, using orthoMCL software [26] with an E-value of 1 × 10 −50 . CDSs shared by all strains were considered a part of the core genome. The possible candidates for drugs and/or vaccines should be non-homologues to human proteins; thus, autoimmunity is avoided, and an accurate immune response is elicited against the targeted pathogen. Accordingly, these core genes were subjected to orthoMCL software (E-value = 1 × 10 −50 ) against the human genome for the identification of non-host homolog targets.

Identification of Pathogenicity Islands
Knowledge about pathogenicity islands, the virulence factors they encode, their mobility, and their structure is not only helpful in understanding the bacterial evolution and their interactions with eukaryotic host cells, but may also facilitate in providing delivery systems for vaccination and tools for the development of new approaches for treating bacterial infections [28]. The identification of pathogenicity islands in the genome of T. pallidum Nichols was performed with GIPSy (Genomic Island Prediction Software) [27] through the detection of regions presenting: deviations in genomic signature (i.e., anomalous G+C and/or codon usage deviation); presence of transposase, virulence or flanking tRNA genes; and absence in the non-pathogenic organism Treponema denticola.

Assessment of Essential Genes
A subtractive genomics approach was followed to identify conserved targets that were essential to the bacteria [13]. The set of core conserved proteins of T. pallidum Nichols was subjected to the Database of Essential Genes (DEG) [42] for homology analyses. The DEG contains experimentally validated data from bacteria, archaea, and eukaryotes that are comprised of currently reported essential genomic elements including protein-coding genes that are indispensable to support cellular life. The cut-off values used for BLASTp were: E-value = 0.0001, bit score =100, and identity = 25% [15,18,30].

Reverse Vaccinology Approach for Prediction of Putative T. pallidum Vaccine Targets
For potential vaccine targets, subcellular localization and the secretion of pathogenic proteins are important factors for consideration, where secreted and membrane proteins are the first to be in contact with the host, eliciting an immune response. Therefore, the prediction of the exoproteome or secretome, composed of the proteins localized in the extracellular matrix or outer membrane of the organism, is highly valuable for reverse vaccinology strategies. In combination with subtractive proteomics, reverse vaccinology can provide a more reliable output compared to screening of the whole data set without considering prioritizing parameters [13]. The non-host homologous conserved proteome of T. pallidum Nichols was screened using SurfG+ software [29] to identify secreted proteins, membrane proteins, and putative surface exposed proteins. We searched for cleavage sites and transmembrane helices in all 15 proteins using SignalP [43] and TMHMM (Transmembrane Helix prediction server, based on a hidden Markov model) [44], respectively, and we also predicted the presence of functional domains for all the 15 proteins with InterProScan, which uses several databases for domain prediction [45]. The dataset was screened by Vaxign [30] by searching for proteins with the following features: major histocompatibility complex (MHC I) and (MHC II) binding properties, an adhesion probability greater than 0.51, and no similarity to host proteins.

High Throughput Structural Modeling
MHOLline [31] was used to predict the modelome (complete set of protein 3D models for the whole conserved core non-host homologous proteome). MHOLline utilizes multi-fasta files of amino acids as input data and then uses HMMTOP, BLAST, BATS, MODELLER, and PROCHECK programs for the detailed analyses. The program HMMTOP detects transmembrane regions [46]. The BLAST algorithm is used to identify template structure by performing random searches against the Protein Data Bank [47]. BATS (Blast Automatic Targeting for Structures) performs the refinement in the template search; its use represents a key step for the model construction. BATS refinement identifies sequences that make the modeling possible by selecting templates from the BLAST output file using their BATS scores, expectation values, identity, and sequence similarity as criteria as well as considering the number of gaps and the alignment coverage. BATS selects the best template for 3D model generation and performs automated alignment used by the MODELLER program. The adopted methodology was revised accordingly from the original work by Hassan et al. [46].

Ligand Libraries and Docking Analyses
The ligand libraries of 28 natural compounds presented by Tiwari et al., 2014 [48] were used for the docking analysis. The 3D structures of all target proteins were carefully examined for structural errors (wrong bonds, missing atoms, and protonation states) in the MVD (Molegro Virtual Docker) [37]. The active side residues of the target proteins were identified by comparing its 3D structure to the respective templates. Furthermore, taking identified cavities from a template used in a grid for molecular docking. The program includes three search algorithms for molecular docking analyses, namely MolDock Optimizer [37], MolDock Simplex Evolution (SE), and Iterated Simplex (IS). We employed the MolDock Optimizer search algorithm, which is based on a differential evolutionary algorithm, using the default parameters, that are (a) population size = 50; (b) scaling factor = 0.5; and (c) crossover rate = 0.9. The 3D poses of docked molecules were analyzed in Chimera [49]. Molecular function (MF) and biological process (BP) for each target protein were determined using UniProt [41]. The biochemical pathway of these proteins were checked using KEGG (Kyoto Encyclopedia of Genes and Genomes) [50], SurfG+ software [29], and virulence using GIPSy [31]. The final list of targets was based on 12 criteria, as described earlier in [13,46].

Conclusions
Here, the genomic information was used with the aim of determining the conserved proteome of 13 strains of Treponema pallidum in a search for regions of genome plasticity. Moreover, we used reverse vaccinology and subtractive genomics to predict new antigenic/drug targets, which can be used in the development of new vaccines and drugs for Treponema pallidum. After a detailed in silico analysis between host and pathogen proteins, we suggest that the identified non-host homologous proteins could be considered for prophylaxis of syphilis due to further experimental validations.