Investigation of Virulence Genes Detected in Antimicrobial-Resistance Pathogens Isolates for Five Countries across the World

A large portion of annual deaths worldwide are due to infections caused by disease-causing pathogens. These pathogens contain virulence genes, which encode mechanisms that facilitate infection and microbial survival in hosts. More recently, antimicrobial resistance (AMR) genes, also found in these pathogens, have become an increasingly large issue. While the National Center for Biotechnology Information (NCBI) Pathogen Detection Isolates Browser (NPDIB) database has been compiling genes involved in microbial virulence and antimicrobial resistance through isolate samples, few studies have identified the genes primarily responsible for virulence and compared them to those responsible for AMR. This study performed the first multivariate statistical analysis of the multidimensional NPDIB data to identify the major virulence genes from historical pathogen isolates for Australia, China, South Africa, UK, and US—the largely populated countries from five of the six major continents. The important virulence genes were then compared with the AMR genes to study whether there is correlation between their occurrences. Among the significant genes and pathogens associated with virulence, it was found that the genes fdeC, iha, iss, iutA, lpfA, sslE, ybtP, and ybtQ are shared amongst all five countries. The pathogens E. coli and Shigella, Salmonella enterica, and Klebsiella pneumoniae mostly contained these genes and were common among four of the five studied countries. Additionally, the trend of virulence was investigated by plotting historical occurrences of gene and pathogen frequency in the annual samples. These plots showed that the trends of E. coli and Shigella and Salmonella enterica were similar to the trends of certain virulence genes, confirming the two pathogens do indeed carry important virulence genes. While the virulence genes in the five countries are not significantly different, the US and the UK share the largest amount of important virulence genes. The plots from principal component analysis and hierarchical clustering show that the important virulence and AMR genes were not significantly correlated, with only few genes from both types of genes clustered into the same groups.


Introduction
Every year, 25% of total deaths worldwide are attributed to microbial pathogens [1]. The emergence of a relatively new threat called antimicrobial resistance, or "AMR", is largely responsible for this high percentage in recent decades. Despite the initial success of antibiotics, overuse and misuse of the drugs have led to targeted bacteria developing resistance and mutations, and the emergence of superbugs. For example, the rate of global antibiotic consumption increased by 39% from 2000 to

Materials and Methods
This study's data originate from the NPDIB database. The hundreds of thousands of data samples were exported into individual spreadsheets organized by country (Australia, China, South Africa, the UK, and the US, specifically). To virtualize the multidimensional data into two dimensions, PCA was used. Combined with hierarchical clustering, the virulence genes and pathogens that differed the most from the rest (i.e., outliers) were identified as "important", as they represented the genes/pathogens that show different pattens from the bulk genes/pathogens. They are thus significantly associated with virulence. Furthermore, profiles of the historical occurrences of these important genes and pathogens in each country were used to study the virulence trends in each country. Finally, more clustering statistical analyses, as well as time profiles and frequency scatterplots, were used to study the correlation in the occurrence frequencies of important virulence genes and AMR genes.

The Historical Data on Virulence Genes from the NCBI Pathogen Detection Isolates Browser
Data from the NPDIB database contain the following information for each sample: the location of collection, the date of collection, the name of pathogen found, the detected virulence genes, the detected AMR genes, and the type of isolate (i.e., clinical or environmental). To translate the data into a usable format, each individual country's dataset of samples was saved and preprocessed as a comma-separated value (.csv) file in Excel. In each file, the row refers to an individual sample, and a specific column refers to an information label mentioned above (e.g., the location of collection, the name of the pathogen, the collection date). In particular, one column is designated for each gene to indicate whether the gene is detected in each sample (i.e., each row), with 1 s indicating a gene was found in a sample, and 0 s indicating it was not found. These data matrices typically had several thousand rows and a few hundred columns.

Principal Component Analysis (PCA) and Hierarchical Clustering
PCA is extremely useful in analyzing high-dimensional data. As the data extracted from the NPDIB database contain hundreds of dimensions (i.e., variables) to plot normally, PCA reduces the dimensions and projects the data into a two-dimensional space. The new data points are based on new coordinate dimensions, called principal components. A principal component is a linear combination of the initial coordinate variables. The first principal component (i.e., PC1) is always in the direction where projections have the largest variance. The second principal component (i.e., PC2) is the one with the greatest variance among all directions orthogonal (perpendicular) to PC1. The maximum amount of initial information is compressed into the first component, with the subsequent components containing less and less information. Once the data points are projected into a plot based on these principal components, the data are much easier to analyze since the data can be projected into only few dimensions. Identification of any correlations or outliers is now much simpler; in this study, the outliers were especially important. For example, the outlier genes show significantly different distribution patterns in how they are carried by pathogens from the bulk genes that are lumped together in the PC1~PC2 space. They typically show significantly larger occurrences than the bulk genes. Since there are more than one hundred virulence genes, these outlier genes can serve as better indicators of pathogenic virulence and thus better targets for virulence inhibition. Similarly, the outlier pathogens are regarded important, as they typically carry more virulence genes and have higher occurrence frequencies in the data.
In order to identify these important virulence genes in PCA, a new data matrix was created in R so that: (1) each row represents one gene; (2) each column stands for one pathogen; and (3) each element in the matrix represents the number of historical samples in which the gene in the row was detected in the pathogen in the column ( Figure 1A). The PCA was then implemented on the matrix to plot the genes in a two-dimensional space characterized by PC1 and PC2 ( Figure 1B). The outlier genes typically show more involvement in virulence than those bulk genes that typically lump together in the PCA plot. The R function prcomp was performed for PCA. A second PCA analysis for the transposed version of the same data matrix (now pathogens in the rows and virulence genes in the columns) allowed us to also identify important pathogens carrying virulence genes. Figure 1. An illustrative example of using principal component analysis (PCA) and hierarchical clustering to identify important virulence genes: (A) a matrix was constructed with elements representing the number of the samples in which the gene in each row was detected in the pathogens in the columns; (B) genes from the high-dimensional matrix were projected onto the PC1~PC2 two-dimensional space; (C) the hierarchical clustering was used to group the data points on the basis of their projection onto the reduced dimensional space shown in (B); (D) the total within-cluster sum of square was calculated when the genes were grouped into one to six clusters (indicated by the dashed line in (C) in the dendrogram). A bend was observed when the genes were separated into three clusters, with F, ED, and ABC forming individual clusters, respectively. The clustering result was further confirmed by the genes' projection on the PC1~PC2 space. While only six genes were used here for the purpose of illustration, a large number of genes with low occurrence patterns were typically lumped together such as genes A, B, and C in this work. The outlier genes such as genes F, E, and D were thus considered important genes. While PCA plots visualize the genes in a two-dimensional space, these plots were not able to quantify the relationship between individual genes. In particular, most genes lump together so that they are not distinguishable in the PCA plots. In order to quantify the relationship between individual genes, hierarchical clustering was used in conjunction with PCA. The pairwise distance between the genes is defined as the Euclidean distance from the genes' projections on the PC1~PC2 space. The distance between different branches in the hierarchical clustering tree is calculated by the complete linkage approach. After calculating the Euclidean distance between points on a PCA plot, the closest or most similar points are then graphed as a cluster on a dendrogram, or a tree diagram ( Figure 1C). Points are identified to be in the same cluster because they are connected by a line; points with the shortest distance, and thus shortest line between them, are the most similar. Each cluster is on a different level of the tree, with similarities in height reflecting similarities between clusters. The Elbow method [18] was used to determine the optimal number of clusters from the dendrogram. In particular, the number of clusters from the hierarchical clustering tree was gradually increased, with all data points regarded as one cluster from the beginning ( Figure 1D). After increasing the cluster number one by one, the total within-cluster sum of square (i.e., TWSS_i, i for individual clusters) was calculated and then summarized together over all clusters (i.e., TWSS_all). TWSS_all was then plotted over the number of clusters. The number of clusters was determined by the location of the bend in the plot. While Figure 1 shows an illustrative example with three clusters, the genes in this work were generally separated into six clusters to identify the outliers. The genes showing few occurrences or patterns were generally lumped together in a single cluster, while the outliers were distributed in other clusters. The outliers are thus regarded important genes/pathogens.

Time Profiles and Frequency Scatterplots
To study the trend of virulence over time, the historical profiles of the important virulence genes and pathogens were plotted. To reduce the bias of some years having more collected samples, the profiles were normalized by dividing the annual occurrences of genes or pathogens with the total number of samples collected in that year.
Additionally, plotting the number of AMR genes versus the number of virulence genes for each sample in each country, scatterplots of the different combinations of gene numbers were produced. For example, a combination of zero AMR genes and three virulence genes would be represented as a point; a combination of two AMR genes and one virulence gene would be another point. Additionally, with different colors, the frequencies of each combination were represented. Points with warmer colors, such as red or orange, had a high frequency of samples of that specific combination. Conversely, points represented as cooler colors, such as green or blue, had a lower frequency. In this way, combinations that were common in individual countries as well as across the five different countries were identified from the scatterplots. With this information, the relationship between the two types of genes could be qualitatively determined, if the correlation between the number of AMR genes and number of virulence genes was positive and clustered across a line, then it would suggest the two types of genes are more closely related. If the combinations of the number of AMR genes and number of virulence genes were more scattered or followed a negative trend, then there would be a much weaker general connection between virulence genes and AMR genes.

Important Virulence Genes in Samples from Australia, China, South Africa, the UK, and the US
Visualized by principal component analysis and hierarchical clustering, as described in the Materials and Methods section, important virulence genes were identified for each of the five countries. In summary, the genes clustered away from the other ones, represented as outliers in the PC1-PC2 space and as more separated in the results from hierarchical clustering, are the important genes of a country. As an example, the results from hierarchical clustering of the US are shown in Figure 2. The PCA representation is not shown here, instead it can be found in the Appendix A. Since the US has over 150 different virulence genes, it is rather difficult to discern important genes in the PCA graph (with lumping genes). The two red rectangles in Figure 2 show the genes in the US that are important. Similar methods were used to identify the significant virulence genes in Australia, China, South Africa, and the UK as well. Table 1 summarizes the important virulence genes across each country. There is quite a bit of variation in the number of virulence genes identified as important in each country. For example, South Africa has only 15 genes while the US has 36. Despite this disparity, eight genes were shared among all of the five countries (fdeC, iha, iss, iutA, lpfA, sslE, ybtP, and ybtQ), and several more were common between three or more countries. These important shared genes are highlighted in red and blue in Table 1. Table 1. The important virulence genes identified in the five countries. In red are the genes shared across each country, and in blue are the important genes identified in at least three different countries (note: underlined genes show generally increasing trend in their occurrence from 2010-2020).  ipaH1  iroE  mchF  iutA  espK  iss  iroN  papA  lpfA  espP  iucA  iss  senB  lpfA1  espX1  iutA  iucA  sslE  lpfA2  etpD  lpfA  iutA  vactox  mchF  fdeC   mchB  lpfA  ybtP  nleB2  iha  papE  mchF  ybtQ  pic  iss  papF  pic  sepA  iucA  papH  sslE  sinH  iutA  pic  ybtP  sslE  katP  senB  ybtQ  stxA1a  lpfA  sepA  stxA2c  lpfA1  sigA  stxB1a  lpfA2  sinH  stxB2a  nleA  sslE  stxB2c  nleB  stxB2c  virF  nleB2  toxB  ybtP  nleC  tsh  ybtQ  sinH  virF  sslE  ybtP  stxA1a  ybtQ  stxA2c  stxB1a  stxB2a tccP tir toxB ybtP ybtQ More detailed descriptions of these shared genes are shown in Table 2, with red again representing the eight genes common among all countries, and blue representing the genes shared between at least three countries. Expectedly, many of these genes have common functions such as pathogenic infectivity through adhesin-like proteins (fdeC [19], iha [20], lpfA [21], and sinH [22]), increased iron reception and uptake (iutA [23], ybtP [24], ybtQ [24], and iucA [25]), and also toxin synthesis for host inhibition (auto-sat [26], espX1 [27], and mchF [28]). The iss and sslE genes are most unlike the others in terms of function as they work to promote extraintestinal infectivity [29] and biofilm production [30], respectively. Additionally, the common genes between all countries were found in E. coli; however, a trend was noticed in genes shared among at least three countries as well. Generally, it seems that virulence genes with a role in infectivity that are found in E. coli are important. Table 2. A summary of the important virulence genes shared among the five countries with information on gene function, and species of which the gene is found. Highlighted in red are the virulence genes identified in all countries, and in blue are the genes belonging to at least three different countries. To further study the important virulence genes, normalized historical occurrences of individual genes in annual samples from 2010-2020 were plotted separately for each country. The genes with generally increasing trends after 2011 were identified and they are underlined in Table 1. For each country, these important genes were then plotted together in a summarizing historical time profile to identify any trends. Figure 3 shows the important increasing genes found in the US. There is a large increase in gene portion in 2011, and there are smaller subsequent spikes in 2014 and 2016; generally, the behavior seems to stabilize after then. Similar graphs were produced for other countries to compare their trends as well.

Important Pathogens Carrying Virulence Genes in Samples from Five Countries
In addition to identifying the important genes in each country, principal component analysis and hierarchical clustering were used to identify the key pathogens that carried these genes. In Table 3, the important pathogens found in each country are shown. Highlighted in red are the three pathogens that were common among at least four countries (no pathogen was shared among all countries). These pathogens were E. coli and Shigella, Klebsiella pneumoniae, and Salmonella enterica. As previously stated, many of the important virulence genes identified in Table 1 were found in E. coli and Shigella samples. Furthermore, as shown in Table 2, a few common virulence genes were also found in Klebsiella pneumoniae or Salmonella enterica, so it makes sense that these three pathogens were found to be common among all countries except Australia. Normalized time profiles of the pathogens were also produced to compare the historic trends in each country. Figure 4 shows such a graph of the important pathogens in the US. There does not seem to be any distinct pattern before 2015, but after that year, the percentage of the pathogens found in the total annual sample seemed to stabilize. The line representing E. coli and Shigella also seems to have a similar trend to that of important increasing virulence genes as seen in Figure 3, with a spike in 2011 as well as small increases in 2014 and 2016. These results align with the fact that many of the important virulence genes were found in E. coli and Shigella.

Comparisons between AMR Genes and Virulence Genes
In addition to identifying the important virulence genes across the five countries, principal component analysis and hierarchical clustering were performed to compare the similarity between virulence genes and previously studied AMR genes [7,8]. Previously identified important AMR genes were combined with the genes found in Table 1, and through hierarchical clustering, the virulence genes and AMR genes in each country that connected in the clustering branches were found. For example, the hierarchical clusters of important genes in the United States are shown in Figure 5. As before, it is hard to identify the overlap or connecting genes with the PCA representation, so only the hierarchical clustering is shown here, and the principal component analysis graph can be found in the Appendix A. From the clustering, it can be seen that sinH is the only virulence gene clustered with AMR genes, and it is closely related to other antimicrobial resistance genes such as floR and fosA7. Table 4 shows virulence genes and AMR genes that are connected in the clustering tree. As shown in the table, there were far fewer "overlap" genes than either important virulence genes or AMR genes. In South Africa, there was also minimal overlap between virulence genes and AMR genes, but in Australia, China, and the UK, there were several important genes of both types clustered near each other. The specific virulence genes and AMR genes clustered together were different in each country, but some genes were consistently related to each other across the five geographies. For example, the yersiniabactin virulence genes ybtP and ybtQ were correlated with beta-lactamase AMR genes in several countries. Connections between these genes likely relate to a common pathogen of E. coli. The general delineation between virulence genes and AMR genes, however, suggests that there is little similarity in the number of pathogens carrying the two types of genes.  To further compare virulence genes and AMR genes, normalized time profiles of the historical occurrences of these connecting genes were plotted to identify any trends or patterns. These profiles looked different for each country. As an example, Figure 6 shows the time profile of important virulence genes and AMR genes in the US. The behaviors of the virulence genes and AMR genes have some similarities as they all share a gene portion spike in 2015 and relatively flat behavior after that, but there are also differences, as the sinH gene has a large increase in 2011 while the two AMR genes do not. In Australia, China, South Africa, and the UK, fluctuations in gene portions cause spikes in uniquely different years, but generally the overlap genes that link virulence and AMR share similar behaviors in each country.
While Figure 5 and Table 4 show the connected important AMR genes and virulence genes in hierarchical clustering, we further studied the total number of AMR genes and virulence genes in each sample for each country. In other words, the number of genes, instead of the gene ID, was studied to quantify the correlation between the number of the two types of genes in individual pathogen isolates. We plotted the number of virulence genes against the number of AMR genes found in samples from each individual country. Figure 7 shows such a scatterplot of the gene amounts in the US, and similar graphs were produced for the other countries as well. In the US, there were no instances in which a sample had less than three AMR genes (as those pathogens are antimicrobial resistant). Additionally, almost no samples had a high amount of virulence genes as well as a high amount of antimicrobial resistance genes; most samples had less than 25 virulence genes and 15 AMR genes. In other countries, it was also unlikely that samples had a lot of virulence genes and AMR genes, and it was also rare to have many virulence genes without at least one AMR gene. Generally, all countries had scattered points and no apparently positive correlation between virulence genes and AMR genes were found. These results support the idea that while a few AMR genes may be linked with virulence genes, the two types of genes are mostly not correlated.     Table 1 lists the important virulence genes for the five selected countries. The number of important virulence genes that are shared by each pair of countries is shown in Figure 8. The diagonal elements in the table are for individual countries, while the non-diagonal elements illustrate the similarities of virulence genes between countries. While South Africa shares less than 10 important virulence genes with either Australia or the US, most countries share 10 or more important virulence genes with others. In particular, the US and the UK share the largest amount of important virulence genes (i.e., 19 genes), followed by the UK-Austria pair, which contains 14 shared virulence genes. Pathogens and their genes can spread to other countries through trade and travel through contamination of goods or an unknowingly infected person, for example. Once the pathogen reaches an unfamiliar country, it would spread from host to host using its virulence mechanisms, allowing these new virulence genes to integrate with existing ones so that the countries share similar virulence genes. The greater amount of similarities in certain countries could be explained by trade or travel, both of which provide pathogens, and therefore genes, access to other countries. For example, the US-UK and UK-Australia pairs have the largest amounts of shared important genes due to their close trading and traveling relationships. In addition to these three countries, China shares more than 10 important virulence genes with any one of the other countries. This may be explained by the trading relationship between China and the other countries. For example, China is Australia's largest trading partner for both imports and exports; Australia is China's sixth largest trading partner overall [31]. In addition, China has provided more travelers to Australia than any other country, with over a million Chinese tourists visiting in 2019 [32]. This number accounted for about 15% of Australia's short-term visitors that year [32]. Immigrants from China also make up the second highest foreign-born population in Australia, at about 2.7% of Australia's population [33]. While South Africa shares relatively less important virulence genes with other countries, its pairs with China and the UK may be implied by its trading relationship with these two countries. For example, the UK is South Africa's fourth largest trading partner, receiving 5.1% of total South African exports [34]. Although South Africa is not a main trading partner of the UK, South Africa is the UK's largest export and import market in Africa [35]. The most common destination for South African emigrants is the UK [36], where South Africans make up the seventh largest group of immigrants [37].

The Trend of Virulence Genes Indicated from Time Profiles
The time profiles of the pathogens are associated with the corresponding virulence genes. As shown in Figure 3, many genes in the US spike in 2011, fall from 2011-2013, then increase and stay relatively constant onwards. One example of this correlation is in the espX1 and sslE gene spike in 2011, as seen in Figure 3. This corresponded with an increase in the E. coli pathogen, shown in Figure 4, as both of the above genes are associated with virulence in E. coli. sslE is a zinc-metalloprotease involved in the degradation of mucin substrates, and an important colonization factor favoring E. coli access to both metabolic substrates and target cells [38]. espx1 is an effector protein injected into host cells, contributing to virulence of E. Coli [39].

Correlation between Virulence Genes and AMR Genes
Generally, virulence and AMR genes are distinctly separated in PCA and clustering for each country. The few overlap genes (both virulence and resistance) are mostly found in E. coli and Shigella. Firstly, these results suggest that these important pathogens (as identified in Table 3) are the main hosts for genes. It also implies that virulence genes and AMR genes evolved on their own time scales and therefore do not share many links with each other. Indeed, virulence genes have been around in pathogenic bacteria for millions of years, but antimicrobial resistance has only recently evolved (i.e., last 50 years) after the first usage of antibiotics [40]. On the other hand, certain interplays between virulence genes and AMR genes have also been found from the clustering branches containing both types of genes (listed in Table 4). Some of these interplays have been reported [40,41]. For example, AMR genes tet(A), sul1, sul2, mph(A), blaTEMp-1 were found in plasmid pRSB107 that also contains a virulence-associated system (e.g., an aerobactin iron acquisition siderophore system iucA) [40]. While certain interplays between virulence genes and AMR genes have been validated, the other overlapping genes (such as iss, sslE, and ybtP/Q) may have led, either as a direct precursor or potentially as a mutation point, to modern AMR genes. It is also possible that this correlation is coincidental, but because of the commonality of select shared genes in the five countries, it is more likely that there is an association between the virulence genes and resistance genes that were clustered into the same group.

Limitations and Future Work
In this study, a couple of limitations are related to uncontrollable factors in the NCBI Pathogen Detection Isolates Browser database. First, there is a large difference in samples isolated from each country. In particular, countries such as South Africa with a smaller number of samples from NPDIB may be less representative of their country's actual trends of virulence genes or AMR genes. In addition, some countries do not have data from before 2012. Therefore, general conclusions cannot be extrapolated to years before this decade. Another limitation that cannot be overstated is the impact of SARS-CoV-2. Some data collection was likely cancelled or not made possible because of COVID-19 restrictions, so even with a normalized time profile, pathogen and gene prevalence from the year 2020 likely does not follow the historic ten-year trend from 2010.
There are several opportunities for further study. Since this work aims to identify the important virulence genes carried by pathogens, the matrix for PCA analysis was built by setting the genes in rows and the pathogens in columns. Thus, the hierarchical clustering mainly considers the genes' presence. In addition to studying how genes were similar from how pathogens carried the genes, the impact of time was evaluated by plotting the time profiles of the important virulence genes. The genes were further compared across different countries to investigate the impact of the regions. All the genes studied in this work have been identified from genomic sequences in the NCBI Pathogen Detection Isolates Browser database. Analysis of the sequence data may reveal the correlation from other genes with these important virulence genes. This is an interesting research direction for further investigation. Moreover, analyzing other countries covering all six major land masses may provide a more complete picture of genes and pathogens globally. All these potential ideas for future work would benefit scientists in finding novel and particularized methods of treating pathogens.

Conclusions
Research on virulence genes is important as it is a potential avenue to develop new treatments for pathogens. In this work, statistical analysis methods such as PCA (principal component analysis) and hierarchical clustering were used to study the extensive data on virulence genes from the NCBI Pathogen Detection Isolates Browser. These methods, implemented through R, were able to identify important genes and pathogens mainly involved in virulence for Australia, China, South Africa, UK, and US. Significant genes that were common for all five countries included fdeC, iha, iss, IpfA, sslE, ybtP, ybtQ, and iutA; important pathogens that were common for every country except Australia included E. coli and Shigella, Salmonella enterica, and Klebsiella pneumoniae. These results were further studied through normalized time profiles, which plotted the portion of the above genes and pathogens in annual samples. Both the gene and pathogen profiles showed a generally increasing yet highly fluctuating trend for each studied country. Additionally, in each country's pathogen profile, E. coli and Shigella and Salmonella enterica were consistently at high percentages, indicating they have a strong presence in virulence. The similarity analysis of the important virulence genes indicated that the US and the UK share the largest amount of important virulence genes and South Africa shares the least number of genes with the other countries. The correlation between virulence genes and AMR genes was studied by the PCA and clustering approach, along with the comparison of their time profiles. It turned out that the two gene types are mostly distinctly separated in the plot returned by PCA and hierarchical clustering. This may be due to vastly different evolution timescales. Occasionally, there are virulence genes that are strongly related to AMR genes possibly because both types are associated with survival in harmful conditions, so it is possible that certain virulence genes and AMR genes paired with each other in certain pathogens for the purpose of survival.

Conflicts of Interest:
The authors declare no conflict of interest. Table A1. Important virulence genes (red) and AMR genes (blue) identified in each country. Both types of genes were determined using independent hierarchical clustering analyses.