Machine Learning Approaches for Discriminating Bacterial and Viral Targeted Human Proteins

Infectious diseases are one of the core biological complications for public health. It is important to recognize the pathogen-specific mechanisms to improve our understanding of infectious diseases. Differentiating between bacterial- and viral-targeted human proteins is important for improving both prognosis and treatment for the patient. Here, we introduce machine learning-based classifiers to discriminate between the two groups of human proteins. We used the sequence, network, and gene ontology features of human proteins. Among different classifiers and features, the deep neural network (DNN) classifier with amino acid composition (AAC), dipeptide composition (DC), and pseudo-amino acid composition (PAAC) (445 features) achieved the best area under the curve (AUC) value (0.939), F1-score (94.9%), and Matthews correlation coefficient (MCC) value (0.81). We found that the top 100 bacteria-targeted and the top 100 virus-targeted human proteins, selected from candidate pools of 1618 and 3916 proteins, respectively, were part of distinct enriched biological processes and pathways. Our proposed method will help to differentiate between bacterial and viral infections based on the targeted human proteins on a global scale. Furthermore, identification of the crucial pathogen targets in the human proteome would help us to better understand the pathogen-specific infection strategies and develop novel therapeutics.


Introduction
Despite the current improvements in antimicrobial therapy and vaccination, infectious diseases remain a major threat to public health worldwide. They cause significant morbidity across nations, pose a major burden on the economy, and cause a substantial number of deaths in the less developed countries [1]. The majority of infectious diseases are caused by pathogenic bacteria and viruses. Pathogens interact with the host system right from the point of their entry into the host, primarily to evade the host immune response and create their own niche for survival and growth [2]. The identification of host proteins targeted by pathogens and of pathogen-host protein-protein interactions (PPIs) is crucial to understand the mechanisms underlying infectious diseases [3]. Differentiating between the bacterial- and viral-targeted host proteins is critical to delineate the specific infection strategies of these two groups of pathogens. While this may help in the diagnosis of the etiology, it is particularly important from the treatment perspective, which is distinct for bacterial and viral infections. Antibiotics kill bacterial pathogens but are ineffective against viruses. Finally, identification of the specific biological processes for the bacterial- and viral-targeted human proteins could improve disease prognosis and treatment.
Several studies attempted to explore the mechanisms underlying infectious diseases from the study of pathogen-host PPIs [4][5][6][7][8][9][10][11][12][13]. The availability of experimentally verified pathogen-host PPIs in the public domain significantly helped these efforts [14][15][16][17][18][19][20]. However, only one study compared pathogen-host PPIs for bacterial and viral infections [21]. This study addressed common as well as distinct infection strategies for bacterial and viral infections. To distinguish between bacterial- and viral-targeted human proteins, the authors used only the degree centrality, betweenness centrality, and gene ontology (GO) features of different proteins. They drew a general conclusion that viruses tend to interact with human proteins having much higher connectivity and centrality values than those targeted by bacteria. They proposed that viral-targeted human proteins function in the cellular process to manipulate it, while bacteria-targeted human proteins interact with the immune system. Here, we used more rigorous techniques, such as machine learning algorithms, to differentiate the bacteria-targeted human proteins from the virus-targeted proteins. To this end, we used the sequence, network, and gene ontology features of human proteins extensively. We identified the best feature set for discriminating between bacterial- and viral-targeted proteins and listed the top predicted targets. Finally, the differences between the bacterial- and viral-targeted human proteins were validated by GO and pathway enrichment analysis.

Data Collection
All the experimentally validated bacteria-human and virus-human protein-protein interaction (PPI) datasets were collected from PHISTO, a pathogen-host interaction search tool [22]. We found 8993 bacteria-human and 35,120 virus-human PPIs, and detected 3673 bacterial- and 5887 viral-targeted human proteins. Out of these, 1780 proteins were common targets of both bacteria and viruses (shown in Figure 1) and were excluded from our analysis. We searched the remaining 1893 bacterial- and 4107 viral-targeted human proteins in UniProt, a worldwide hub of protein knowledge [23]. We found 1618 bacterial-targeted and 3916 viral-targeted reviewed human proteins in UniProt (Supplementary Tables S1 and S2), which were considered for further analysis.
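The exclusion of common targets described above is plain set arithmetic. The following is a minimal sketch, assuming the targets are held as sets of protein accessions; the function name and the toy accessions are hypothetical and are not taken from PHISTO.

```python
def split_exclusive_targets(bacterial, viral):
    """Return (bacteria_only, virus_only, common) target sets."""
    bacterial, viral = set(bacterial), set(viral)
    common = bacterial & viral          # proteins targeted by both groups
    return bacterial - common, viral - common, common


# Toy accessions, used purely for illustration.
bact = {"P00001", "P00002", "P00003"}
vir = {"P00003", "P00004"}
bact_only, vir_only, common = split_exclusive_targets(bact, vir)
print(sorted(bact_only), sorted(vir_only), sorted(common))

# The counts reported in the text are consistent with this arithmetic.
assert 3673 - 1780 == 1893
assert 5887 - 1780 == 4107
```

The further reduction to 1618 and 3916 proteins comes from keeping only reviewed UniProt entries, which this sketch does not model.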

Sequence Features
All the above human protein sequences were downloaded from the UniProt database. For the prediction of proteins and PPIs, sequence features such as the amino acid composition (AAC), dipeptide composition (DC), pseudo-amino acid composition (PAAC), and composition-transition-distribution (CTD) were reported as important features [24][25][26]. We computed AAC, DC, PAAC, and CTD using PyDPI, a freely available Python package for chemoinformatics, bioinformatics, and chemogenomics studies [27]. We used these sequence features to discriminate the bacterial-targeted from the viral-targeted human proteins.
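To make the two simplest descriptors concrete, the sketch below computes AAC (20 values) and DC (400 values) in plain Python. It does not call PyDPI; the function names are hypothetical, and the scaling (fractions rather than percentages) and the handling of non-standard residues are assumptions that may differ from PyDPI's output.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    counts = Counter(residues)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def dc(seq):
    """Dipeptide composition: fraction of each adjacent residue pair."""
    seq = seq.upper()
    # Pairs containing a non-standard residue are skipped (an assumption).
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)
             if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS]
    n = len(pairs)
    counts = Counter(pairs)
    return [counts[a + b] / n if n else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]


features = aac("MKTAYIAKQR") + dc("MKTAYIAKQR")
print(len(features))  # 20 AAC values followed by 400 DC values
```

AAC and DC together give 420 values per protein; the remaining features of the 445-feature set reported in this paper come from PAAC, which is not sketched here.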

Network Features
To compute network features for human proteins, we retrieved expert-curated human PPIs from the Human Protein Reference Database (HPRD) (Release 9) [28] and constructed a network using these PPIs. NetworkAnalyzer (a Cytoscape plugin) was used to compute the network properties, such as degree, closeness centrality, neighborhood connectivity, average shortest path length, betweenness centrality, clustering coefficient, topological coefficient, eccentricity, and radiality [29].
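Three of the listed node properties can be illustrated on a toy undirected network. This is a sketch under stated assumptions: it is plain Python rather than the Cytoscape plugin, the helper names and the four-node edge list are hypothetical, and the definitions used are the common textbook ones.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected adjacency sets from (u, v) pairs; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree(adj, node):
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def average_shortest_path_length(adj, node):
    """Mean breadth-first-search distance from node to every reachable node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others) if others else 0.0


adj = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
print(degree(adj, "C"), clustering_coefficient(adj, "C"),
      average_shortest_path_length(adj, "D"))
```

Closeness centrality is commonly defined as the reciprocal of the average shortest path length, so it follows directly from the last function.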

Gene Ontology (GO) Features
All the GO identifiers (IDs) for the respective 1618 bacterial- and 3916 viral-targeted human proteins were downloaded from UniProt. We found a total of 23,737 GO IDs for the 1618 bacteria-targeted human proteins, while the number of GO IDs for the viral-targeted human proteins was 67,035. The occurrence of each GO ID was counted separately for the above two groups, followed by sorting based on the occurrence value. The top 100 and 280 GO IDs for the bacterial- and viral-targeted human proteins, respectively, were extracted for GO features. However, only 282 were unique among the top 380 GO IDs (Supplementary Table S3). Therefore, we considered the unique IDs for GO features (Supplementary Figure S1). For each human protein, the presence or absence of each top GO ID was encoded as 1 or 0, respectively.
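The counting, ranking, and binary encoding steps above can be sketched as follows. The function names, the tie-breaking rule, and the toy annotations are assumptions made for illustration; the actual cutoffs in the text are 100 and 280.

```python
from collections import Counter


def top_go_ids(protein_to_gos, k):
    """Return the k most frequent GO IDs (ties broken alphabetically)."""
    counts = Counter(go for gos in protein_to_gos.values() for go in set(gos))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [go for go, _ in ranked[:k]]


def unique_vocabulary(*id_lists):
    """Merge several top-ID lists, keeping the first occurrence of each ID."""
    seen, vocab = set(), []
    for ids in id_lists:
        for go in ids:
            if go not in seen:
                seen.add(go)
                vocab.append(go)
    return vocab


def go_features(gos, vocabulary):
    """Binary vector: 1 if the protein carries the GO ID, else 0."""
    gos = set(gos)
    return [1 if go in gos else 0 for go in vocabulary]


# Toy annotations, used purely for illustration.
bact = {"P1": ["GO:0005515", "GO:0005634"], "P2": ["GO:0005515"]}
vir = {"P3": ["GO:0005737", "GO:0005515"], "P4": ["GO:0005737"]}
vocab = unique_vocabulary(top_go_ids(bact, 1), top_go_ids(vir, 1))
print(vocab, go_features(vir["P3"], vocab))
```

With the cutoffs used in the text, the merged vocabulary would contain the 282 unique GO IDs, giving one 282-dimensional binary vector per protein.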

Classification
The distinction between the bacterial- and viral-targeted human proteins may be viewed as a binary (two-class) classification problem. To differentiate between the proteins, we used well-known classifiers, namely support vector machines (SVM), random forest (RF), and deep neural networks (DNN).

Support Vector Machines (SVM)
The SVM classifier maps the data into a vector space to find a decision surface that maximizes the margin between the data points of the two classes. For the SVM classifier, we used the scikit-learn Python package [30]. To find the best performance of the SVM classifier, we tested different combinations of the cost and gamma parameters of the radial basis function (RBF) kernel.

Random Forest (RF)
In RF, several decision trees (DTs) are grown, each using a random subset of the features. Each tree classifies a new object and "votes" for a class. Based on a majority vote, the forest elects the classification. We also used the scikit-learn Python package for the RF classifier. Optimal parameters were utilized to find the best performance.

Deep Neural Networks (DNN)
The DNN method was shown to perform well with diverse problems. DNN is more robust and useful than other methods for complex classification problems and is becoming a popular algorithm in the field of modern computational biology. We used TensorFlow DNN, which is a widely used deep learning package for classification, to discriminate between the bacterial- and viral-targeted human proteins [31].

10-Fold Cross-Validation
To avoid the performance bias of the prediction methods, we used the 10-fold cross-validation technique. In 10-fold cross-validation, the whole dataset is divided into 10 sets (folds) of equal or nearly equal sizes. Training and testing are repeated 10 times so that each time, a different set (fold) goes out for testing, while the remaining 9 sets (folds) are used for training. The average performance measures over the 10 folds are considered for the overall performance of the model.
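The fold rotation described above can be sketched in a few lines. This is an illustrative implementation only: the function names, the fixed shuffling seed, and the unstratified split are assumptions, and the text does not state whether the folds were stratified by class.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # sizes differ by at most 1
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


n = 1618 + 3916  # size of the full dataset used in this paper
# Placeholder scorer: a real one would train and score a classifier.
mean_score = cross_validate(n, lambda train, test: len(test) / n)
print(round(mean_score, 3))
```

In practice, `evaluate` would fit a classifier on the training indices and return a measure such as the AUC on the held-out fold.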

Feature Selection
We used several feature selection methods, such as univariate feature selection (UFS), recursive feature elimination (RFE), feature selection using SelectFromModel (SFM), and tree-based feature selection (TBFS). In UFS, the K best features were selected based on the univariate statistical tests. We used all the univariate statistical test methods available in scikit-learn for the purpose of classification. In RFE, the least important features are excluded in each recursive step, until the desired number of features is reached. The important features are selected from the model in SFM. In TBFS, a tree-based estimator computes the importance of the features and irrelevant features are discarded.

Performance Measures
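The performance measures of the classification problem, such as sensitivity, specificity, accuracy, positive predictive value (PPV or precision), Matthews correlation coefficient (MCC), and F1-score, were calculated using the following equations:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{PPV} = \frac{TP}{TP + FP}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\text{F1-score} = \frac{2 \times \text{PPV} \times \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}}$$

where
True Positive (TP): Bacterial-targeted human proteins are correctly identified as bacterial-targeted human proteins.
False Positive (FP): Viral-targeted human proteins are incorrectly identified as bacterial-targeted human proteins.
True Negative (TN): Viral-targeted human proteins are correctly identified as viral-targeted human proteins.
False Negative (FN): Bacterial-targeted human proteins are incorrectly identified as viral-targeted human proteins.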
The area under the receiver operating characteristic curve (AUC), for all the cases, was also computed.
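For reference, the standard definitions of these measures can be computed directly from the four confusion-matrix counts. The sketch below uses a hypothetical function name and made-up counts; returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the text.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention: 0.0 when undefined

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ppv,
        "f1": ratio(2 * ppv * sensitivity, ppv + sensitivity),
        "mcc": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


# Made-up counts, with bacterial-targeted proteins as the positive class.
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in m.items()})
```

Unlike accuracy, the MCC uses all four counts symmetrically, which is why it is often preferred when the two classes differ in size, as they do here (1618 versus 3916 proteins).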

GO Enrichment Analysis
The top 100 bacterial-targeted and the same number of viral-targeted human proteins predicted by our method were considered for GO enrichment analysis. To this end, we used Enrichr, a comprehensive gene set enrichment analysis web server, 2016 update [32]. We considered only the biological process terms with p-values < 0.05 for the GO enrichment analysis.
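An enrichment p-value of this kind is commonly obtained from the upper tail of a hypergeometric distribution. The sketch below shows that calculation under stated assumptions: the function name and all numbers in the example are hypothetical, and this is not necessarily the exact statistic or background that Enrichr uses.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
               for i in range(hits, upper + 1))
    return tail / total


# Hypothetical example: 100 input proteins, a background of 20,000 genes,
# a term annotating 150 of them, and 5 of the input proteins carrying it.
p = hypergeom_enrichment_p(population=20000, annotated=150, drawn=100, hits=5)
print(p < 0.05)
```

Terms whose p-value falls below the chosen threshold (0.05 in this study) would be reported as enriched.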

Pathway Enrichment Analysis
The above-mentioned 200 human proteins (100 each of the bacterial- and viral-targeted proteins) were also considered for pathway enrichment analysis. We used the Reactome Pathway Knowledgebase for this purpose [33]. Pathways with p-value < 0.05 were treated as enriched pathways.

Selection of Features
Important features of human proteins, such as the sequence, GO, and network features, were considered to discriminate between the bacteria- and virus-targeted human proteins. Among the individual sequence features, dipeptide composition (DC) achieved the highest AUC of 0.931, with an F1-score of 90.3% and an MCC of 0.67 (Table 1 and Supplementary Table S4). However, the sequence features AAC, PAAC, and CTD showed poor performances, with CTD being the poorest. We tested different combinations of the above features to achieve a high performance. We observed that a combination of AAC, DC, and PAAC achieved the best AUC of 0.939, F1-score of 94.9%, and MCC of 0.81.
Of the other features, the GO feature attained the maximum AUC of 0.886, F1-score of 86.4%, and MCC of 0.51. On the other hand, the network feature was unable to distinguish between the bacteria- and virus-targeted human proteins. We also tested mixed feature sets to measure the performance. Among these, the combination of AAC, DC, PAAC, and GO features achieved the highest AUC of 0.914, F1-score of 88.3%, and MCC of 0.60. Together, the above results suggested that the combination of the AAC, DC, and PAAC features attained the highest level of performance. We applied multiple feature selection methods, such as UFS, RFE, SFM, and TBFS, to the combination of AAC, DC, and PAAC features. Among these methods, TBFS achieved the highest AUC of 0.805, F1-score of 84%, and MCC of 0.44 (Table 2 and Supplementary Table S5). However, the features selected by these methods were unable to attain a performance similar to that of the original feature set. This result suggested that the feature selection methods did not perform better than the primary features. As a result, we selected the combination of AAC, DC, and PAAC (445 features) as the best feature set.

Performance Comparison of Different Classifiers
To find the best classifier for our dataset, we compared the performance of the SVM, RF, and DNN classifiers. Performances were calculated for different parameter settings of these classifiers, and only the best result is reported here. In the majority of cases, we observed that the DNN classifier achieved the best performance (Tables 1 and 2). As shown in Figures 2 and 3, the performance of the DNN classifier is far superior to that of SVM and RF. Together, the results suggested that DNN performed better than the other, conventional machine learning techniques (MLTs).

Gene Ontology Enrichment Analysis
Prediction probability scores of all the bacteria- and virus-targeted human proteins were sorted (Supplementary Tables S6 and S7). The top 100 bacteria-targeted and the same number of virus-targeted human proteins by prediction score were investigated further to understand the specific infection strategies. GO enrichment analysis of the predicted bacteria-targeted proteins displayed negative regulation for catalytic activity, cellular response to hypoxia, cellular catabolic process, nitric oxide biosynthetic process, nitric oxide metabolic process, calcium ion import, RIG-I signaling pathway, cell adhesion mediated by integrin, and heart rate, etc. (Table 3). In contrast, virus-targeted human proteins showed biological processes, such as the peptide biosynthetic process, translation, mitochondrial ATP synthesis-coupled electron transport, mitochondrial translation elongation, cellular macromolecule biosynthetic process, mitochondrial translational termination, respiratory electron transport chain, and translational termination upon GO enrichment analysis (Table 4). Overall, the top bacteria- and virus-targeted human proteins were related to 48 and 96 enriched biological processes, respectively. We found that most of the enriched biological processes were distinct for bacteria- and virus-targeted human proteins (Figure 4).

Table 3. Top 20 GO biological processes for bacterial-targeted human proteins.

Pathway Enrichment Analysis
Pathway enrichment analysis showed the uptake and function of anthrax toxins, defective NEU1 causing sialidosis, and Vitamin B1 (thiamin) metabolism pathways for the top 100 bacteria-targeted human proteins (Table 5). Likewise, the top predicted virus-targeted human proteins showed the enrichment of pathways, including the formation of the cornified envelope, keratinization, translation, and mitochondrial translation termination, etc. (Table 6). We found that the enriched pathways for bacteria- and virus-targeted human proteins were different (Figure 5).
The above results suggested that bacterial-targeted human proteins were enriched in gene ontology (GO) terms and pathways distinct from those of viral-targeted human proteins.

[Table fragment: pathway "RUNX1 interacts with cofactors whose precise effect on RUNX1 targets is not known", p-value 0.044545497]

Table 6. Top 5 pathways for viral-targeted human proteins.

Discussion
Rapid, safe, cost-effective, and accurate tools for etiological diagnosis of suspected infections are of paramount importance for individual and public health. It is particularly important to discriminate between the bacterial and viral causes of infectious diseases, given the alarming rise of antibiotic resistance due to the indiscriminate and unnecessary use of antibiotics. An estimated 30-50% of antibiotics are prescribed in hospitalized patients of the United States for wrong indications, most commonly viral infections (https://www.cdc.gov/antibiotic-use/stewardship-report/outpatient.html, accessed on 21 October 2021) [34]. Traditional culture methods for bacterial infections are low throughput, time-consuming, and labor intensive, in addition to the challenges of sample collection from some of the infected tissues and the lack of wide availability of culture techniques for many pathogen species. On the other hand, the diagnosis of viral infections by serology may lack specificity, while nucleic acid detection methods require sophisticated equipment and technical expertise. However, no reliable methods or markers are currently available for the rapid diagnosis of bacterial and viral etiologies of infectious diseases.
Attempts have been made to develop complementary diagnostics for infectious diseases by focusing on specific host responses. In addition to being capable of discriminating between colonization and infection, this approach is not limited by the availability of infected tissue samples. Moreover, host response-based categorization of infections provides additional insights into the disease pathogenesis and immune response and may help to identify new targets for therapeutic intervention.
Multiple attempts have been made to diagnose infectious diseases based on host-specific biomarkers. Widely used parameters, such as WBC counts and C-reactive protein (CRP), may help to differentiate between bacterial and viral infections, but lack sensitivity and specificity, leading to frequent misdiagnosis. Newer bacterial infection markers, such as presepsin, procalcitonin, and CD64, are used for severe sepsis, while proADM may predict prognosis of the disease [35,36]. In contrast, cytokines, such as IL-2, IL-8, and IL-10, were suggested as early biomarkers for viral infection [37]. Several research groups reported that the antiviral host protein MxA is a clinically useful marker for acute viral infection and, combined with CRP and/or procalcitonin, may distinguish between bacterial and viral infections [38]. A double-blind, multicenter study found that a strategy to integrate CRP, tumor necrosis factor-related apoptosis-inducing ligand (TRAIL), and interferon γ-induced protein-10 (IP-10) performed significantly better than the individual markers to identify acute viral infection in pediatric patients [39]. However, the authors did not validate their tools against reference diagnostic methods, limiting their utility. Other studies also suggested that a combination of markers may perform better than a single biomarker [40]. However, in a different study, combining CRP with other markers did not improve its ability to differentiate between bacterial and viral lower respiratory tract infections [41].
High throughput genomic and proteomic studies have been employed to identify infection-specific host gene sets. Although they were useful for novel biomarker discovery, the gene sets often contained a large number of candidates, making them difficult to apply clinically [42][43][44]. Through multi-cohort analysis of these large datasets, smaller gene sets optimized for the diagnosis of bacterial and viral infections were identified later on [45].
Machine learning techniques have been extensively used for disease biomarker discovery, including for infectious diseases. However, they were mostly used for individual microbial species or groups of pathogens. The increasing availability of bacteria-human and virus-human PPIs now permits researchers to compare bacterial- and viral-specific infection strategies and identify host proteins that are differentially targeted by these two classes of pathogens. We applied well-known machine learning methods, such as SVM, RF, and DNN, to the available PPI datasets to distinguish between bacteria- and virus-targeted human proteins.
We considered the updated and comprehensive sets of experimentally validated bacteria-human and virus-human PPIs from PHISTO. We found 1780 human proteins that are common targets of bacteria and viruses. During bacterial and viral infection, these common proteins might help to execute several commonalities, such as immune response patterns, acute onset, and response to antimicrobial agents in humans. The primary goal of the current study was to differentiate between bacterial- and viral-targeted human proteins. Therefore, we excluded these 1780 human proteins from our analysis. The proposed method used 1618 bacterial-targeted and 3916 viral-targeted human proteins. To ensure utilization of a larger dataset of the two classes, we considered the complete dataset for building the model. For imbalanced datasets, performance measures such as the AUC, MCC, and F1-score are more informative than sensitivity, specificity, and accuracy. Therefore, we compared the AUC, MCC, and F1-score for all the cases. We found that sequence and gene ontology features performed far better than network features. The network properties of human proteins were unable to distinguish between bacterial- and viral-targeted human proteins (Table 1), suggesting indistinguishable network feature patterns for the two groups. The majority of frequent GO IDs for bacterial- and viral-targeted human proteins are common (Supplementary Figure S1). Therefore, gene ontology features were unable to perform better than the sequence features. Among the sequence features, we found that DC achieved better performance than the others. A combination of AAC, DC, and PAAC features (445 features) achieved the best performance (Table 1). In addition, the feature sets selected by the different feature selection techniques showed a poorer performance than the above feature set.
Therefore, we reported that the combination of AAC, DC, and PAAC (445 features) is the best feature set for discriminating between bacterial- and viral-targeted human proteins. If the two classes are distinct due to true biological reasons, then conventional MLTs, such as SVM and RF, can also be expected to perform well (shown in Table 1, and Figures 2 and 3). The DNN performed well due to the large number of data points and features. Furthermore, we identified the top 100 human proteins targeted by bacteria and the top 100 human proteins targeted by viruses. The gene ontology enrichment analysis of these 200 proteins showed a greater number of enriched biological processes for viral-targeted human proteins than for bacterial-targeted human proteins (Figure 4). Similarly, we observed a greater number of enriched pathways for viral-targeted human proteins than for bacterial-targeted human proteins. These results imply that viruses influence more biological processes and pathways than bacteria. Viruses are known to be totally dependent on the host and therefore exploit more of the host machinery than bacteria, which is consistent with the above results. In addition, we observed that the majority of the enriched biological processes and pathways were different for bacterial- and viral-targeted human proteins. These functional annotations also validated our method for discriminating between bacterial- and viral-targeted human proteins.

Conclusions
We proposed a computational method to distinguish between the bacteria- and virus-targeted human proteins. We employed widely used and state-of-the-art machine learning techniques, such as SVM, RF, and DNN, and integrated important biological information on human proteins, including the sequences, networks, and GO, to achieve this goal. We found that the best performance was achieved with the sequence features and the DNN classifier.
We developed a prediction model to maximize the performance measures and to identify the best features for doing so. Therefore, we did not use the model for prediction on future data. However, the proposed model may be utilized for predicting and discriminating between the possible interactions of human proteins with bacterial and viral proteins. We identified distinct targets for bacterial and viral infections upon GO and pathway enrichment analysis of the top predicted human proteins. Bacterial targets predominantly included immune response-related genes and transcriptional machinery, while viruses targeted protein translation and mitochondrial energy metabolism. The distinction between bacteria- and virus-targeted human proteins might help to improve infection-specific diagnosis and treatment. In the future, we will look for the differences between RNA and DNA viruses, and between Gram-positive and Gram-negative bacteria, to understand their specific infection strategies.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/pr10020291/s1, Figure S1: Venn diagram of Gene Ontology (GO) IDs of bacteria- and virus-targeted human proteins, Table S1: Bacterial targeted reviewed human proteins, Table S2: Viral targeted reviewed human proteins, Table S3: Top GO IDs for bacterial and viral targeted human proteins, Table S4: Full table of feature-wise performance measures on bacterial and viral targeted human proteins, Table S5: Full table of selected feature-wise performance measures of bacterial and viral targeted human proteins, Table S6: Probability score of top 100 bacteria targeted human proteins, Table S7: Probability score of top 100 virus targeted human proteins.