Genome Subtraction and Comparison for the Identification of Novel Drug Targets against Mycobacterium avium subsp. hominissuis

Mycobacterium avium complex (MAC) is a major cause of non-tuberculous pulmonary and disseminated diseases worldwide; it induces bronchiectasis and affects HIV-positive and other immunocompromised patients. Within MAC, Mycobacterium avium subsp. hominissuis is a pathogen that infects humans and other mammals, which is why it is the focus of this study. It is crucial to find essential drug targets to eradicate the infections caused by these virulent microorganisms. The application of bioinformatics and proteomics has made a significant impact on the discovery of unique drug targets against deadly pathogens. One successful bioinformatics methodology is in silico subtractive genomics. In this study, the aim was to identify unique, non-host and essential protein-based drug targets of Mycobacterium avium subsp. hominissuis via an in silico subtractive genomics approach, in which the complete proteome is subtracted systematically to shortlist potential drug targets. For this, the complete protein dataset of Mycobacterium avium subsp. hominissuis was retrieved. The applied subtractive genomics method, which involves a homology search between the host and the pathogen to subtract the non-druggable proteins, resulted in the identification of a few prioritized potential drug targets against three strains of M. avium subsp. hominissuis, i.e., MAH-TH135, OCU466 and A5. In conclusion, the current study prioritized vital drug targets, which opens future avenues for structural as well as biochemical studies on the predicted drug targets against M. avium subsp. hominissuis.


Introduction
Mycobacterium species that do not cause tuberculosis are referred to as non-tuberculous mycobacteria (NTM) and are ubiquitous in nature. NTM cause pulmonary diseases in which organisms of the Mycobacterium avium complex (MAC) are widely distributed [1]. The incidence of infection caused by M. avium is higher than that of other Mycobacterium species. For example, a literature survey showed that the rate of pulmonary infection in Japan is sevenfold greater for M. avium than for any other Mycobacterium species [2]. MAC consists of two closely related species, M. intracellulare and M. avium [3]. Furthermore, M. avium comprises four subspecies: M. avium subsp. paratuberculosis (MAP), M. avium subsp. avium (MAA), M. avium subsp. silvaticum (MAS) and M. avium subsp. hominissuis (MAH).
With 80% identity, 20 sequences were identified as paralogous out of 4614 proteins in MAH-TH135, 54 out of 5165 in MAH-OCU466 and 14 out of 4502 proteins in the A5 strain. CD-HIT clustered the paralogous sequences and, hence, reduced the total number of sequences for each strain. The resulting sequence dataset comprised 4596, 5111 and 4488 protein sequences for the MAH-TH135, OCU466 and A5 strains, respectively.

Searching of Essential, Non-Homologous and Druggable Proteins
In this step, protein sequences that were present only in the pathogens were segregated. By applying a subtractive approach, sequences that showed similarity to the human host were excluded. The orthologous sequences retained from the previous step were subjected to BLASTp against the complete human proteome, and the resultant file was parsed. Only sequences that returned "no hits found" were retained, giving a total of 3151, 3619 and 3072 non-homologous sequences in the MAH-TH135, OCU466 and A5 strains, respectively.
The Database of Essential Genes (DEG) provides information on essential genes of Gram-positive and Gram-negative bacteria determined by experimental methods (http://www.essentialgene.org/). The essentiality of the non-homologous proteins was inferred from their homology with sequences in the DEG. To do this, the parsed results of each strain from the previous step were subjected to BLASTp against the DEG with an e-value threshold of 10⁻⁵. The BLASTp results yielded 1360, 1451 and 1352 essential protein sequences in MAH-TH135, OCU466 and A5, respectively. These sequences were considered vital for the pathogen's life cycle. They include functional, non-functional and uncharacterized proteins, which were characterized further using different bioinformatics tools.

Subcellular Localization
Tracing the location of essential proteins is important for understanding their functions within the appropriate cell compartments. Knowing the localization of a drug target also helps to optimize the mode of action of a drug for that specific target. The sub-cellular localization of the essential non-homologous protein sequences was predicted with the computational tool PSORTb. The results show that approximately 48% of proteins resided in the cytoplasm in each strain, and 23% were located in the cytoplasmic membrane. The rest of the proteins were present in different regions, including ~1% of proteins in the extracellular region, >1.5% of proteins in the periplasm and very few proteins in the outer membrane of each strain. In addition, some fractions were labeled "unknown" because the tool predicted those proteins at multiple sites simultaneously. The distribution of proteins predicted by PSORTb is shown graphically for each strain in Figure 1.

Functional Family Classification
The functional families of the protein sequences were determined using the Support Vector Machine of Proteins (SVM-Prot) tool. Only sequences whose functions were not previously known were submitted to this tool; hence, only uncharacterized sequences were retrieved from the non-homologous essential protein sequences. In total, 193, 119 and 187 uncharacterized sequences of the TH135, OCU466 and A5 strains, respectively, were classified by the SVM-Prot method. The results of the SVM-Prot tool are depicted in Figure 2. The proteins were broadly classified based on their molecular and biological functions and were further sub-divided into several protein classes, i.e., enzymes, transporters, transmembrane proteins, proteins binding zinc, magnesium or other elements, DNA condensation, repair, etc. Complete information on the classes for each strain is summarized in Supplementary Table S1.

Metabolic Pathway Analysis via KEGG
The KEGG database provides a network of metabolic pathways with their complete annotation. It helps to predict which protein sequences play an essential and unique role in metabolism, so this step predicts potential drug targets based on the pathogen's unique metabolism. Metabolic pathway analysis was carried out for the essential protein sequences by submitting the DEG results to the KEGG database via the KEGG Automated Annotation Server (KAAS). Briefly, out of 675 protein sequences of the MAH-TH135 strain, 72, 70, 29, 16 and 103 proteins were found to take part in carbohydrate metabolism, energy metabolism, lipid metabolism, nucleotide metabolism and amino acid metabolism, respectively. For OCU466, 76 proteins were involved in carbohydrate metabolism, while 69, 30 and 15 took part in energy metabolism, lipid metabolism and nucleotide metabolism, respectively, whereas the A5 strain possessed 93 proteins that contributed mainly to amino acid metabolism. The distribution of proteins across the different types of metabolism is presented in Figure 3a-c. Details are provided in Supplementary Tables S2-S4.

Discussion of Significant Unique Metabolic Pathways (UMPs) of the Pathogens
Bacterial metabolism refers to the collection of biochemical reactions required for bacterial survival and growth, which mainly includes respiration (aerobic and anaerobic) and fermentation. Bacteria that are pathogenic to humans carry out all the same basic types of biochemical reactions that a human cell performs. However, bacteria may also have several types of energy-generating metabolism that do not exist in human or other eukaryotic cells. This diversity of energy generation and metabolism allows bacteria to survive in a variety of habitats and to flourish in otherwise unsuitable conditions. On the other hand, these differential metabolic pathways make bacteria susceptible, because they serve as ideal targets for antibiotics. Metabolic pathways that exist only in pathogens are called unique metabolic pathways (UMPs). These UMPs are listed in Supplementary Table S5. We provide brief information on some bacterial UMPs and their significance as antibiotic targets.

Energy Metabolism
Energy is the capacity needed to perform work and maintain life; it is usually acquired by breaking one chemical bond and stored by making another, very often in the form of ATP. Methane metabolism is one of the UMPs by which bacteria can obtain energy by oxidizing one-carbon compounds (e.g., methanol, methane). Methanotrophic bacteria are generally considered environmentally friendly organisms, as they contribute to oxidizing environmental methane, thereby mitigating the effects of global warming [32]. Methane monooxygenases are the main enzymes that catalyze methane oxidation [33]. Several bacterial UMPs are related to photosynthesis and carbon fixation and can be exploited for the purpose of drug target identification.

Biosynthesis of Secondary Metabolites
Secondary metabolites are molecules that are not strictly required for the survival of an organism. A large portion of bacterial metabolism deals with the biosynthesis of secondary metabolites. However, these pathways have a minimal role in bacterial growth and viability and are not considered suitable targets for antibiotics. Even so, many of these pathways are manipulated by researchers for valuable purposes, such as penicillin and cephalosporin biosynthesis, carbapenem biosynthesis and streptomycin biosynthesis.

Amino Acid Metabolism
Amino acid metabolism in bacteria is diverse in nature and plays a pivotal role in maintaining bacterial growth. Amino acid metabolism has emerged as a potential target for new antibiotics, and a number of new drug targets have been proposed in recent years [34-37]. Some of these drug targets have shown promising results. Lysine biosynthesis, a pathway essential for bacterial survival and growth, is reported to be a potential target for antibiotics [38,39]. Similarly, D-alanine metabolism is a significant target; the antibiotic D-cycloserine, which targets D-alanine metabolism, is already in clinical use against Mycobacterium tuberculosis [40,41]. The heterogeneity of amino acid metabolism implies an enormous scope for discovering new antibiotic targets using modern computational tools.
Other types of metabolic activity in bacteria, such as terpenoid and polyketide metabolism, glycan biosynthesis and drug resistance, also support bacterial growth and survival; however, these metabolic routes are not prioritized targets for anti-bacterial drugs. Rather, they are often manipulated for advantageous purposes [42].

Shortlisting of Protein Sequences as Druggable
The potential drug targets were shortlisted based on information obtained from earlier successful literature reports. The druggability of the earlier shortlisted non-host, uncharacterized proteins that are essential in metabolic pathways was determined by performing BLASTp against the druggable protein sequences present in the DrugBank Database. In this search, only one protein was prioritized in TH135, whereas four and seven potential drug targets emerged for the OCU466 and A5 strains, respectively (Table 1). All these potential drug targets were similar to FDA-approved drug target sequences in the DrugBank Database, including the DNA polymerase III subunit ε of the TH135 strain and the Inter-α-trypsin inhibitor heavy chain H4, exopolyphosphatase, DNA polymerase III subunit ε, mannoside ABC transport system and sugar-binding protein of the OCU466 strain. In addition to all the proteins from the OCU466 strain, diacylglycerol acyltransferase/mycolyltransferase, Ag85C and nickel-binding periplasmic protein were found for the A5 strain. All the proposed drug targets could be analyzed for 3D structural information to prioritize novel drug targets against pathogens. Therefore, BLASTp was performed for the target proteins against the Protein Data Bank (PDB) database, which revealed that 12 protein sequences had no 3D structure available yet in the PDB. This study therefore offers those 12 protein sequences not only as a potential druggable genome, but also as candidates for future 3D structure determination, either by homology modeling (template-based) or by ab initio (template-free) methods [43].

Materials and Methods
An overview of the subtractive genomics approach is illustrated in Figure 4.

Extraction of the Host-Pathogen Proteome
The whole proteomes of the host, i.e., Homo sapiens, and the pathogen, i.e., Mycobacterium avium subsp. hominissuis, were downloaded from the UniProtKB database [44] to retrieve the protein sequences. The drug target identification approach was carried out on the pathogenic MAH-TH135, MAH-OCU466 and A5 strains.

Grouping of Common Proteins in All Strains
The CD-HIT tool [45] clusters protein or nucleotide sequences, which reduces redundancy and manual effort in sequence analysis. It was used as a standalone command-line tool to remove paralogous or duplicated sequences from all strains with a sequence identity threshold of 80%. The remaining set of proteins was grouped as orthologous sequences.
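The clustering step can be sketched as a small Python helper that assembles a CD-HIT command line for one strain. This is only an illustration under stated assumptions: the file names are hypothetical, the helper name `cd_hit_command` is introduced here, and the word length `-n 5` follows the CD-HIT user guide's recommendation for protein identity cutoffs between 0.7 and 1.0 rather than any setting reported in this study.

```python
import shlex


def cd_hit_command(fasta_in, fasta_out, identity=0.80):
    """Build the argument list for one CD-HIT run (the command is not executed here)."""
    # CD-HIT couples the word length (-n) to the identity cutoff (-c);
    # -n 5 is the documented choice for protein cutoffs between 0.7 and 1.0.
    if not 0.7 <= identity <= 1.0:
        raise ValueError("this sketch only covers identity cutoffs from 0.7 to 1.0")
    return [
        "cd-hit",
        "-i", fasta_in,            # input proteome in FASTA format
        "-o", fasta_out,           # non-redundant representative sequences
        "-c", f"{identity:.2f}",   # sequence identity threshold
        "-n", "5",                 # word length matching the cutoff range
    ]


# Hypothetical file names for the MAH-TH135 proteome.
cmd = cd_hit_command("MAH_TH135.fasta", "MAH_TH135_nr80.fasta")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where CD-HIT is installed; the same call would be repeated for the OCU466 and A5 proteomes.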

Identification of Non-Homologous Proteins
Standalone BLAST version 2.8.1 was downloaded from the NCBI FTP server [46]. The orthologous sequences were subjected to BLASTp against the H. sapiens database with an expectation value (e-value) of 10⁻³ [47]. In the output, the keyword "no hits found" marked unique proteins, and "significant alignments" marked sequences with similarity to the human (host) proteome. The results were analyzed, and only protein sequences with no homology to the human host were retained, while the rest were removed. The retained proteins were labelled as non-homologous proteins and were finally extracted using our in-house scripts.
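The in-house scripts are not shown in the paper, so the following is only a minimal sketch of the subtraction logic. It assumes BLAST tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 the e-value) instead of the keyword-based text parsing described above; all function names and the toy records are hypothetical.

```python
import io


def fasta_ids(handle):
    """Return sequence IDs (first word of each '>' header) in file order."""
    return [line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()]


def queries_with_hits(handle, max_evalue=1e-3):
    """Collect query IDs with at least one hit at or below max_evalue."""
    hits = set()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:  # column 11 holds the e-value
            hits.add(fields[0])              # column 1 holds the query ID
    return hits


def non_homologous(proteome_ids, human_hits):
    """Keep pathogen proteins that have no significant human hit."""
    return [pid for pid in proteome_ids if pid not in human_hits]


# Toy pathogen proteome and toy BLASTp-versus-human results.
fasta = io.StringIO(">P1 kinase\nMKT\n>P2 hypothetical\nMAA\n>P3 transporter\nMLL\n")
blast = io.StringIO(
    "P1\tHUMAN_A\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-40\t150\n"
    "P3\tHUMAN_B\t28.0\t60\t43\t1\t3\t62\t10\t69\t0.5\t30.1\n"
)
retained = non_homologous(fasta_ids(fasta), queries_with_hits(blast))
print(retained)
```

In this toy input, P1 is discarded because of its strong human hit, P2 has no hit at all, and P3 is kept because its only hit lies above the 10⁻³ cutoff.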

Finding of Essential Genes
The genes required to sustain the life cycle of bacteria are called essential genes. The Database of Essential Genes (DEG) contains lists of genes, with their corresponding sequences, that are essential for bacterial survival [48]. Therefore, the DEG was used to find the sequences that are essential to the bacterial pathogen studied here (i.e., M. avium subsp. hominissuis). The non-homologous proteins were aligned against the DEG database using BLASTp, with the expectation value set to 10⁻⁵. As a result, the non-homologous essential genes, which may encode hypothetical or uncharacterized proteins, were obtained.
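Whereas the previous step keeps proteins without a hit, the essentiality step keeps proteins that do have a significant hit. A minimal sketch of that selection follows, assuming the BLASTp results against the DEG have already been parsed into (query, subject, e-value) tuples; the function name and the identifiers `DEG_x1` to `DEG_x4` are hypothetical placeholders, not real DEG accessions.

```python
def best_deg_hits(rows, max_evalue=1e-5):
    """Map each query to its lowest e-value DEG hit at or below max_evalue."""
    best = {}
    for query, subject, evalue in rows:
        if evalue > max_evalue:
            continue  # hit too weak to support essentiality
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Toy input: IDs of non-homologous proteins and parsed BLASTp-versus-DEG hits.
non_homologous_ids = ["P2", "P3", "P4"]
rows = [
    ("P2", "DEG_x1", 3e-20),
    ("P2", "DEG_x2", 1e-30),  # better hit for P2, replaces DEG_x1
    ("P3", "DEG_x3", 1e-4),   # above the 1e-5 cutoff, so P3 is dropped
    ("P4", "DEG_x4", 1e-5),   # exactly at the cutoff, so P4 is kept
]
best = best_deg_hits(rows)
essential = [pid for pid in non_homologous_ids if pid in best]
print(essential)
```

Keeping the best-scoring DEG subject for each query, rather than only a yes/no flag, makes it possible to trace which experimentally essential gene supports each prediction.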

Information about Metabolic Pathways
The metabolic pathways of the identified non-homologous essential proteins were searched in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [49] through the KAAS server. KAAS [50] uses BLASTp for the comparison of query proteins against the KEGG database and annotates functions. KAAS provides the KEGG Orthology (KO) identifiers and information on the metabolic pathways of the proteins.

Annotation of the Curated Proteins
Annotation of proteins includes information about the location of the proteins in various regions of the cell and the families to which they belong. PSORTb version 3.0 [51] is widely used to predict the subcellular localization (SCL) of proteins. The SCL includes different compartments in which the proteins reside, such as the cytoplasmic membrane, cytoplasm, cell wall, extracellular region and unknown regions of the cell. All the non-homologous essential proteins, including the hypothetical ones, were compared against protein databases with known functions using SCL BLAST through the web-based server. SVM-Prot [52] is an online tool for the classification of protein functional families. It applies a machine-learning method and predicts a diverse set of molecular and biological functions covering 192 functional families of proteins, including all major classes of enzymes, channels, transporters, receptors, DNA/RNA binding proteins, etc. Proteins whose functions are still unknown were labeled as non-homologous, hypothetical/uncharacterized proteins and submitted to the SVM-Prot server for classification into functional families.

Druggability of the Shortlisted Sequences
In order to determine the novel drug targets, standalone BLASTp was run between the hypothetical non-homologous essential proteins and drug target sequences taken from the DrugBank Database [53], with an e-value cutoff of 10⁻³. The DrugBank Database provides detailed information on drugs and drug targets. It is a large database that lists up to 8261 drugs, including FDA-approved, experimental and nutraceutical drugs.

Conclusions
Different bioinformatics tools were applied in this study to identify vital drug targets of Mycobacterium avium subsp. hominissuis. Protein sequences of M. avium subsp. hominissuis were filtered through multiple steps of the subtractive genomics approach, and a few of them were shortlisted as possible drug targets because they fulfilled the druggability criteria. The shortlisted sequences are non-homologous to the human host; thus, they can be proposed as ideal drug targets. None of the drug targets identified in the different strains of MAH has been characterized before as a drug target, and we propose them here as potential targets against which new drug compounds can be designed. The study is therefore significant to the scientific community, as it provides a prioritized list of possible drug targets sorted by the computational subtractive genomics method, and it has the potential to lead to the discovery of novel drug targets against M. avium subsp. hominissuis.