Genome-Wide Analysis of Four Pathotypes of Wheat Rust Pathogen (Puccinia graminis) Reveals Structural Variations and Diversifying Selection

Diseases caused by Puccinia graminis are some of the most devastating diseases of wheat. Extensive genomic understanding of the pathogen has proven helpful not only in understanding host- pathogen interaction but also in finding appropriate control measures. In the present study, whole-genome sequencing of four diverse P. graminis pathotypes was performed to understand the genetic variation and evolution. An average of 63.5 Gb of data per pathotype with about 100× average genomic coverage was achieved with 100-base paired-end sequencing performed with Illumina Hiseq 1000. Genome structural annotations collectively predicted 9273 functional proteins including ~583 extracellular secreted proteins. Approximately 7.4% of the genes showed similarity with the PHI database which is suggestive of their significance in pathogenesis. Genome-wide analysis demonstrated pathotype 117-6 as likely distinct and descended through a different lineage. The 3–6% more SNPs in the regulatory regions and 154 genes under positive selection with their orthologs and under negative selection in the other three pathotypes further supported pathotype 117-6 to be highly diverse in nature. The genomic information generated in the present study could serve as an important source for comparative genomic studies across the genus Puccinia and lead to better rust management in wheat.


Introduction
Rusts are among the most devastating fungal pathogens of wheat worldwide. Stem (Puccinia graminis f. sp. tritici), stripe (P. striiformis f. sp. tritici), and leaf (P. triticina) rust diseases of wheat have globally affected wheat production and food security [1,2]. All the rusts of wheat are known to occur in India, of which, stem (black) rust of wheat is restricted to about seven million hectares of peninsular and central India only. Because of its destructiveness and the economic significance of its cereal hosts, stem rust is one of the most widely studied among all plant rust pathogens [3]. Despite continuous genetic interventions to improve wheat yields, its production has been nearly stagnant during the last few years because of biotic and abiotic constraints. Since the biotrophic pathogens cannot be cultured on artificial medium, the molecular characterization of functional genes has been extremely difficult in the rust fungi [4]. Nevertheless, in the recent years, enormous amounts of genomic information of rust pathogens, generated through next generation sequencing (NGS) these pathotypes were multiplied on a susceptible wheat genotype 'Agra local' by a single spore infection. (A) Description of the four pathotypes collected from various regions of India used in this study with their year of detection and host isolation. The map of India was downloaded from site (http://www.d-maps.com/carte.php?num_car=24868& lang=en, accessed on 12 July 2018) using "free-to-use images" which is under the license (Creative Commons-attribution 4.0 international CC BY 4.0) to be used as free for educational as well as commercial purposes. (B) CIRCOS plot of four individual P. graminis pathotypes. The outside of the outer-most circle in red, green, blue, and gray depicts the genome size of pathotypes 117-6, 14, 40-3 and 40A, respectively, with 1 Mb breakpoints increasing in the clockwise direction. Further inwards, the second circle is a density tile plot of all the annotated exons (orange color). The third inner circle in purple is the density histogram plot of total SSR content in the four pathotypes. The fourth circle is the density histogram plot depicting total genome coverage (green color). The inner-most circle in blue is the density scatter plot of total heterozygous SNPs identified in the four genomes. A clear distinction of SNP distribution in pathotype 117-6 can be observed as compared to the other three pathotypes. (C) Genes > 450bp excluding the hypothetical genes were annotated against NCBI nr database and categorized into 22 different functional categories. Percentage distribution of the genes in the individual class showed genes within energy metabolism to be the highest (~19%) in all the pathotypes, followed by mobile and extra chromosomal elements (~8%), and transport and binding proteins (~9%). About 7% genes were observed to fall under the categories of conserved and predicted proteins.

Genome Sequence and Assembly
Individual separate paired-end (PE) libraries (100 bp) were prepared from the genomic DNA of all the pathotypes by using TruSeq DNA Library Preparation Kits as per the manufacture's protocol and were sequenced using Hiseq 1000 (Illumina) automated sequencer (Illumina, Inc., San Diego, CA, USA). Reference-based assembly was performed for the processed data by GS Reference Mapper (Roche) version 2.5.3 with default parameters (minimum read length = 20 bp, minimum overlap length = 40 bp, minimum overlap identity = 90%, alignment identity score = 2, and all contig threshold = 100) using the genome sequence of P. graminis pathotype CDL 75-36-700-3 (Puccinia Group Sequencing Project, Broad Institute of Harvard and MIT (http://www.broadinstitute.org, accessed on 5 March 2017) as a reference. Unmapped reads were assembled and analyzed by using CLC Genomics Workbench 7.5.1. The quality of assembly was assessed by QUAST 3.2 software tool (Supplementary Materials Figure S1 and Table S1). Genome completeness of the assembled pathotypes was checked by CEGMA using a set of very highly conserved 248 core eukaryotic genes (CEGs), single-copy genes [33,34]. Raw reads of all four pathotypes were also mapped against the assembled data of self as well as against the other three pathotypes. The quality of the assembly was carried out by QUAST 3.2 software tool (Supplementary Materials Figure S1 and Table S1). Intra-pathotype SNPs were obtained by aligning the sequence reads of each pathotype to its assembled contigs. Further heterokaryotic and homokaryotic SNPs were predicted by aligning the reads of each isolate to the assembled contigs of the other pathotypes (inter-pathotype SNPs). Sequence data from this article have been deposited with the NCBI GenBank and their BioProject IDs (https://www.ncbi.nlm.nih.gov/bioproject/, accessed on 15 January 2017) under Accession No. LAQV00000000, LAQW00000000, LAQX00000000, LAQY00000000. If requested, the database will withhold release of data until publication.

Gene Prediction and Annotations
Genes were predicted from large contigs (≥2 kb) by FGENESH 3.1.2 (MolQuest2.2) trained against Puccinia spp. possessing no less than 80% homology with default parameters. For expression analysis, the predicted genes were searched against the National Center for Biotechnology Information (NCBI) EST dataset using the BLAST function. Functional annotation was done for the genes (≥450 bp) by searching against the NCBI-nr database, again via the BLAST function. Genes with significant hits (E ≤ 1 × 10 −10 ) were categorized into different functional categories based on a literature search.

SNP and InDel Analysis
Whole-genome SNPs and InDels were detected using Sequence Alignment/Map tools (SAM tools) software package at 10× coverage with the quality value of Phred score ≥20. Sequence of P. graminis Race SCCL (CDL 36-700-3, Broad Institute, Cambridge, MA, USA) was used as a reference for the prediction of SNPs. The annotation of SNPs was performed using SnpEff software [37]. For the analysis of haplotype variations within the genomes, SNPs were detected by "Basic Variant Detection" module of CLC Genomics Workbench 7.5.1 with default parameters (Ploidy = 2, minimum coverage 10×, Variant Frequency ≥35%) by aligning raw reads of each pathotype against their respective contigs (assembled genome) for intra-pathotype SNPs and against assembled genomes of other three pathotypes for inter-pathotype SNPs. For depicting the heterozygosity level, interpathotype SNPs were classified as either 'Homozygous' or 'Heterozygous' having only one or more than one variant called at that position, respectively. Heterozygous SNPs, if referring to a variant position that is homozygous in other pathotypes, were classified as heterokaryotic SNPs, and Homozygous SNPs, if found polymorphic between two independent pathotypes, were classified as homokaryotic SNPs [6].

Secretome Analysis
Identification and analysis of the secretory proteins within the four P. graminids pathotypes was performed by using different software tools. Initially, proteins (≥50 amino acids) with a SignalP D-score = Y (SignalP version 4.1; www.cbs.dtu.dk/services/SignalP, accessed on 13 December 2017) and TargetPLoc = S (TargetP version1.1; www.cbs.dtu. dk/services/TargetP, accessed on 13 December 2017) were merged. These were then scanned for transmembrane spanning regions using TMHMM (TMHMM version2.0; http://www.cbs.dtu.dk/services/TMHMM, accessed on 15 December 2017). Peptides with 0 or 1 transmembrane regions were retained and transmembrane regions located in fewer than 10 amino acids in a mature peptide from a predicted cleavage site were considered for further analysis. The eventual locations of these proteins were predicted by the integral prediction of protein location score obtained by ProtComp version 10 (http://linux1.softberry. com/berry.phtml/berry.phtml?topic=protcompan&group=programssubgroup=proloc, accessed on 14 December 2017). Proteins showing the integral prediction of protein location and extracellular secreted (ES) were kept in the final secretome data set. BLASTP was used for the annotation of the predicted secretome. Conserved domains in the secretome were predicted through the Pfam domain database with profile gathering cut-off threshold [38]. Relative gene conservation of the total predicted ES proteins among the four pathotypes was determined using OrthoVenn web server (http://www.bioinfogenome. net/OrthoVenn/, accessed on 17 December 2017). Cysteine content in the extracellular secreted proteins was calculated and divided by the total number of residues in the peptide and converted to percentage. Potential pathogenicity genes in these genomes were identified by a BLAST search of predicted genes (≥450 bp) against 2647 protein sequences of PHI-base (Pathogen-Host Interactions database version 3.6) and genes with significant hits (with E ≤ 0.05 and bit score ≥100) were considered the pathogenicity-related genes.

Identification of Homologous Genes within and across the P. graminis Pathotypes
A combined approach of the 'best bidirectional hit' (BBH) method and the In-Paranoid program (INP) algorithm [39] with higher and more stringent threshold values was used for the identification of orthologs and paralogs [40,41]. Genes (≥450 bp) from each of the four genomes were searched against each other via a BLAST search following an all-against-all BLAST search. BLAST hits with bit score ≥100, E-value ≤ 0.05 and at least 40% identity between amino acid sequences over at least 70% of the protein length were filtered out as the significant hits [42,43]. Any two significant BLAST hits fulfilling the afore-mentioned 3-scale threshold (bit-score, E-value, and identity percentage) and those with bidirectional hits with each other were considered as paralogs or orthologs to each other based on if they belonged to the same or different pathotypes, respectively. The predicted ortholog pairs were clustered by using Excel sheets parsing into three groups with two, three, and four genes from two, three, and four different genomes, respectively. A circular diagram was constructed with Circos Table Viewer V0. 63-9 [44], to show the syntenic view for one-to-one comparison of ortholog pairs having genes from any two pathotypes. To interactively show the number of ortholog pairs within two or three or all four isolates, a Venn diagram was constructed using VENNY 2.1 [45].

Phylogenetic Analysis
Comparative evolutionary analysis of the four pathotypes of P. graminis, along with the reference genome, CRL 75-36-700-3 (race SCCL) was performed through conservation distance matrix-based guide tree obtained by genome alignment using Mauve version 20150226 build 10 [46] with default parameters. For the SNP-based analysis, a phylogenetic tree of the pathotypes was constructed using ClustalX by applying the neighbor-joining method with 100 bootstrap replicates.

Diversifying Selection Analysis and Estimation of Substitution Rates
Among the four isolates, all genes (≥450 bp) with orthologs were chosen for the detection and analysis of diversifying selection. Gene sequences in each cluster were aligned by ClustalX 2.0 [47] and converted to PAML format by PAL2NAL version 14.0 [48]. For the estimation of synonymous and non-synonymous substitution rates, PAML format was used to calculate pairwise dN/dS ratios by YN00 of pamlX 1.3.1 [49]. Clusters with three or four genes and having atleast one gene with dN/dS ratios > 1 were further subjected to CODEML of pamlX 1.3.1 with two likelihood ratio tests (LRTs), i.e., model M1 (neutral) to model M2 (selection), and model M7 (beta) to M8 (beta and omega) to assess the site-specific diversifying selection. A gene was considered to be undergoing site-specific diversifying selection if both the M1/M2 and M7/M8 LRTs for that gene were found significant with the chi-square tests threshold, p < 0.05.

Genome Sequencing and Assembly of the P. graminis Pathotypes
To investigate genetic variation between different pathotypes which are virulent/avirulent on standard differential genotypes ( Figure 1A), four P. graminis f. sp. tritici (P. graminis) pathotypes from India (14, 40A, 40-3, and 117-6) were selected for sequencing using the NGS method. Approximately 6.2 Gb sequence data were generated for these pathotypes on the Illumina HiSeq 1000 platform. Reference-guided assembly using GS Reference Mapper Software (version 2.0, Roche) against the reference genome of P. graminis CDL 75-36-700-3, race SCCL [8] (Puccinia Group Sequencing Project, Broad Institute of Harvard and MIT (http://www.broadinstitute.org, accessed on 15 March 2017) resulted in a total of 58,140 to 70,264 sequence contigs with an N50 that ranged from 4.1 to 5.2 kb. We obtained 96× to 103× depth coverage of the assembled genome with the size varying from 59 to 66Mb (Table 1). The final assembled data were subjected to quality assessment by using QUAST 3.2 software tool (Supplementary Materials Figure S1 and Table S1) resulting in satisfactory scores of N50 and L50 for the pathotypes. The genome completeness of the four assembled pathotypes along with the reference genome was assessed with the Core Eukaryotic Gene Mapping Approach (CEGMA) with defined 248 single-copy conserved eukaryotic genes (CEGs). Results obtained in this study showed the presence of 75.81% to 79.84% of complete CEGs and 82.66% to 85.48% of partial CEGs in the genome assembly of all the pathotypes. Normalized values for genome completeness with respect to the reference genome ranged from 84.30% to 88.79% for complete CEGs and from 89.13% to 92.17% for partial CEGs (Supplementary Materials Table S2A,B).
Genetic variations were expected between the two independent nuclei of asexual dikaryotic urediniospores. Inter-and intra-individual (within and across pathotypes) SNPs were predicted to investigate such variations in these four pathotypes. We identified an average SNP frequency of 23.75 ± 2.90 SNPs/kb between the two nuclei within a single individual (intra-pathotype SNPs) with highest frequency of 25.93 SNPs/kb for pathotype 40-3 followed by 25.71and 23.66 for pathotypes 40A and 14, respectively. A frequency of 19.68 SNPs/kb was obtained for the pathotype 117-6, being the lowest among the four pathotypes (Supplementary Materials Table S3). On average, heterokaryotic SNPs across the pathotypes were more frequent (11.78 ± 1.77 SNPs/kb) than homokaryotic SNPs (3.57 ± 2.35 SNPs/kb). The highest levels of diversity of over 13 SNPs/kb for heterokaryotic sites were found when reads of pathotypes 40A and 40-3 were individually mapped onto the other three pathotypes with an average of 13.18 ± 0.35 and 13.26 ± 0.34 SNPs/kb, respectively. Pathotype 117-6 was found to have the lowest diversity of 9.13 ± 0.23 SNPs/kb for heterokaryotic sites with the other three pathotypes whereas the diversity level for homokaryotic SNPs was the maximum, i.e., 5.32 ± 0.58 SNPs/kb with the other three pathotypes (Supplementary Materials Table S4).

Gene Prediction and Annotations
We predicted 13,854, 12,636, 12,670, and 15,401 protein-coding genes in the pathotypes 14, 40A, 40-3, and 117-6, respectively, by using FGENESH 3.1.2 (MolQuest2.2). Among the predicted proteins, an average of 90% of genes (≥150 bp) produced significant hits against NCBI database. Annotations of the genes with ≥450 bp (an average of 9273 genes in the four pathotypes) were performed by a BLASTP search against NCBI nr-database and 67% of these genes from the total predicted genes showed significant sequence similarity to the genes of all the four pathotypes in the database (Supplementary Materials Table S5). Among the twenty-three different functional classes assigned to the proteins, 46.4 to 49.3% of genes were hypothetical. The remaining 53.6 to 50.7% of the annotated genes in all the four pathotypes were categorized into twenty-two different functional classes ( Figure 1C and Supplementary Materials Table S6). Four major classes with the genes "transport and binding" (8.9%), "predicted" (7.6%), "mobile and extra chromosomal elements" (7.3%), and "cellular processes" (6.7%) were obtained in all the four pathotypes. Genes (~19%) under the class "energy metabolism" showed illustrative differences in the pathotypes.

Identification and Analysis of Secretory Proteins
For successful infection, pathogenic fungi largely depend on a range of secreted proteins, particularly effectors. An in-house-designed pipeline was used to carry out the prediction of secreted proteins. All the four pathotype genomes encoded an average of 11.7% of the total predicted proteins (≥50 aa) as the potentially secreted proteins. The transmembrane proteins predicted through TMHMM were eliminated from the protein data set except for the proteins identified with value 1Tm. A total of 588 (14), 529 (40A), 535 , and 681 (117-6) proteins were predicted as extracellular secreted (ES) proteins using only mature peptide sequences (>20 aa). These ES proteins accounted for 4.75% of the predicted proteins in all the four pathotypes ( Figure 2A and Supplementary Materials Table S7). Further, a BLAST search of these ES proteins against the NCBI-nr database produced significant hits for 468 (14), 435 (40A), 430 , and 532 (117-6) predicted secretory proteins. Among these hits, 79.6 to 82.2% of the proteins were annotated as hypothetical proteins, with only ten genes in pathotype 14, nine genes in pathotype 117-6, and seven genes each in pathotypes 40A and 40-3 with precise annotations and assigned functional classes. These were all single-copy genes within the genomes ( Figure 2B). The remaining proteins (with no significant hits against NCBI nr-database) were further included in a BLAST search against the Australian isolate of P. graminis, pathotype 21-0 database (http://webapollo.bioinformatics.csiro.au/puccinia_graminis_tritici_PGTAuspan/index.html, accessed on 18 December 2017). Analyses of the BLAST search against both the nr-database as well as the Australian isolate resulted in 13.8 to 16.8% of total secretory proteins with no significant hits. These putative proteins could thus be considered as novel proteins either specific or common among the four pathotypes. Conserved domains with precise function were searched with Pfam, and 7.6 to 10.4% of the ES proteins could be identified with a conserved functional domain in the respective genomes (Supplementary Materials Figure S2).
In order to determine the relative gene conservation among four P. graminis pathotypes, the putative ES effectors 588 (14), 529 (40A), 535 (40-3) and 681 (117-6) were searched for the presence of orthologs using the OrthoVenn webserver. Out of 612 clusters formed, 610 were orthologous clusters (gene clusters from any of the two pathotypes) and 270 were singlecopy gene clusters which were shared single-copy genes among the four pathotypes. There were only two gene clusters unique to pathotype 117-6 with no homolog in the other three pathotypes. Individually, 502, 479, 475 and 471 gene clusters in the pathotypes 14, 40A, 40-3 and 117-6, respectively, shared orthologs with at least one of the pathotypes. While, 84, 50, 59 and 202 single-copy genes were unique to their respective genomes ( Figure 2C). These gene clusters were further subjected to annotation analysis using GO and Swiss-prot databases and the majority of them did not show any hit with known genes.  Table S7. (B) ABLAST search against the NCBI-nr database was performed in order to annotate the extracellular secreted proteins (ProtComp). Ten single-copy genes which were either present or absent in any of the four genomes were observed (Supplementary Materials Table S8). (C) Venn diagram represents the comparative analyses of extracellular secreted proteins performed for four pathotypes. None of the proteins showed specificity in genomes 14, 40A, or 40-3 with two proteins being specific to pathotype117-6. Color bars show orthology of genes within a genome with rest of the three genomes. Pink horizontal bar represents the number of genes orthologous between any two or three, or all the four genomes.

Genome-Wide Analysis for Cysteine-Rich Genes
To investigate the cysteine-rich proteins within 588 (pathotype14), 529 (pathotype 40A), 535 (pathotype 40-3), and 681 (pathotype117-6) ES proteins identified in the pathotypes, further analysis was performed. In accordance to the growing evidence in the literature about effectors having unconventional characteristics, such as no predicted signal peptide, a low number of cysteine residues, or a large size [50][51][52][53], we performed our study by separating the proteins into sets of 50-200 aa and >200 aa categories. Considering both the sets, few proteins were observed to contain a very high percentage of cysteine residues (>8 to >18), while the majority of the proteins of four pathotypes had cysteine residues either ≥2 or ≤8. Comparatively the number of small ES proteins within pathotype 117-6 possessed a highest percentage of cysteine residues ( Figure 3A,B). Interestingly, 23 to 25% of small ES proteins were found to contain more than 5% cysteine while only a negligible percentage contained more than 5% cysteine within the large ES proteins in the four pathotypes ( Figure 3C). Overall our results revealed small ES proteins (50-200aa) to be prominently more rich in cysteine residues compared to the larger ES proteins (>200 aa) among all four pathotypes ( Figure 3C). All predicted genes (>150 aa) identified in the four pathotypes were subjected to PHI db genes to further categorize them based on their role in pathogenicity. Major group of genes were related to reduced virulence followed by other groups including unaffected pathogenicity, loss of pathogenicity, and lethal genes. (E) Functional annotation of the genes under reduced virulence into 23 classes showed energy metabolic genes as the most abundant class. Other major classes such as transport and binding, mobile and extra chromosomal elements, and cellular processes contained potential pathogenicity genes. A large number of genes were classified as hypothetical, conserved, predicted which could be novel or specific to individual pathotypes.

Identification of Pathogenicity-Related Genes
Pathogenicity genes were determined for the genes >150 aa identified in the four pathotype genomes by using pathogen-host interaction (PHI) gene database version 3.6 [43]. The protein-coding genes (7.4 to 7.8%) identified in four P. graminis pathotypes showed homology with the genes present in the PHI db. Details of the number of genes sharing homology with the PHI db in the four P. graminis genomes analyzed in this study are given in Supplementary Materials Table S7. We obtained homologs to enhanced antagonism, loss of pathogenicity genes, lethal genes, increased virulence genes, mixed pathogenesis genes, and unaffected pathogenicity genes in the P. graminis genomes ( Figure 3D).
Based on the functional groups identified earlier for all the predicted genes in this study, PHI gene homologs of "reduced virulence genes" were classified into 23 functional groups. Major groups included genes related to energy metabolism, mobile and extrachromosomal elements, cellular processes, transport and binding, transcription, and genes under hypothetical, unpredicted, and conserved domains ( Figure 3E). Interestingly, of the total average 12,271 protein-coding genes (>150 aa) identified in all the four pathotypes, 7.4% genes showed similarity with the PHI database (Supplementary Materials Table S8), which is suggestive of their significance in pathogenesis.

Genome-Wide Identification of SNPs
In order to investigate polymorphism in the four P. graminis pathotypes, DNA variants (SNPs, InDels) were predicted against the reference genome [7] at a 10× depth coverage. On average, a density of one SNP per 64 bp was observed in the four pathotypes. Overall, we identified 93% substitutions, 43% insertions, and 25% deletions in pathotype 117-6. Deletions within pathotypes 40A and 40-3 were 30% while in pathotype 14, 28% deletions were obtained (Table 2). The type of effects (start codon gained or lost, stop codon gained or lost, exon deletion, or gene ablation due to deletion of a gene, etc.) caused by the variants (SNPs and InDels) across different genomic regions showed that most of these variants were found in the downstream and upstream regions (5kb upstream of the most distal transcription start site and 5 kb downstream of the most distal polyA addition site, respectively), with 28% and 24% of the total variations in the pathotypes 40A and 40-3, respectively. For variants in pathotypes 14 and 117-6, 31% and 33% were found in the downstream region and 26% and 30% in the upstream regions, respectively. The percentage of variants in the intergenic regions was more (26%) in pathotypes 40A and 40-3 than in pathotype 14 (23%) and pathotype 117-6 (19%). Intronic regions showed a similar percentage of variants in all the pathotypes ( Figure 4A). Further, functional annotation of the SNPs within the exonic regions resulted in 43%, 48%, 45%, and 43%non-synonymous SNPs in the pathotypes 14, 40A, 40-3, and 117-6, respectively. Silent substitutions were most abundant with 57% in pathotypes 14 and 117-6, 55% in pathotype 40-3, and 52% in pathotype 40A ( Figure 4B). On an average, the ratio of non-synonymous to synonymous SNPs was 0.82 in these four pathotypes.

Identification of Repetitive Elements
Total repeat contents were analyzed in the genome of the four pathotypes and were divided into two categories-like transposable elements (TEs) and tandem repeats. TEs were found to be 40.0% (~26 Mb) in the genomes (Table 1 and Figure 4C). The most dominant class was represented by the LTR retrotransposons with 27.3%, corresponding tõ 17Mb of the genome in all the pathotypes. LTR retrotransposons were followed by DNA transposons and non-LTR retrotransposons with approximately 6.6 Mb (10.6%) and 2.6 Mb (4.2%) of their genomes, respectively ( Figure 4D). These repeats were further annotated into different super families within the genomes of the pathotypes. The Gypsy (10.4 Mb) and Copia (5.6 Mb) elements were found to be the most abundant, followed by the presence of super families such as Tad 1, Tc1/Mariner, MuDR and Harbinger elements (~5 Mb) in their respective genomes ( Figure 4E). Due to the lack of specific coding regions, certain elements (~4.0 Mb) could not be annotated into a specific super family and thus were placed into the class "Others". Only a negligible portion (~0.8Mb) of the assembled genomes (~1.3%) of all the four pathotypes were composed of tandem repeats ( Figure 4F). Further analysis showed that the microsatellite repeat (SSR) distribution within the predicted protein coding region of pathotypes revealed tri-nucleotide repeats as the most abundant class, comprising 41.8%, followed by tetra-(24.5%) and hexa-(15.5%) repeats. Mono-and penta-repeats were both similar with~4.0% occurrence within the genomes and the least represented class (1.3%) was the di-nucleotide repeats ( Figure 4G). These results demonstrated a clear difference between the TEs and tandem repeat contents within the genomes of four pathotypes of P. graminis, which is a common feature among the genomes [54].

Conservation of Orthology and Paralogy Genes in the Pathotypes
In order to investigate the protein evolution, we identified homologous genes among the four pathotypes ( Figure 5A). A combined approach of the 'best bidirectional hit' (BBH) method and the In-Paranoid program (INP) algorithm [39] with higher and more stringent threshold values was used for the identification of orthologs and paralogs [40,41]. Using a BLASTp search, any two significant hits with bit score ≥ 100, at least 40% identity over 70% query coverage, and E-value 1 × 10 −20 cut offs along with bidirectional hits with each other were considered as paralog pair being in the same pathotype or ortholog pair being in any of the other three pathotypes.
Pairwise comparison of the numbers of orthologs between any two pathotypes in combination, demonstrated that while pathotypes 40A and 40-3 showed a good homology between each other with 353 conserved protein pairs, they shared the least number of pairs with pathotypes 117-6 and 14. Pathotype 14, in contrast to this, shared the highest number of homologous protein pairs with pathotype 117-6 (Supplementary Materials  Table S9). These results indicate a distinction of pathotype 117-6 from pathotypes 40A and 40-3. Considering ortholog pairs among any three pathotypes at a time additionally demonstrated that pathotypes 117-6 and 14 are very distinct from pathotypes 40A and 40-3 (Supplementary Materials Table S9). All the pathotypes were observed with a higher percentage of orthologous gene pairs than unique and/or specific genes ( Figure 5A). Comparative analysis of the homologous genes in all the four pathotypes suggested that 7106 homologs were present in all the pathotypes. These genes are referred to as core genes ( Figure 5B). Interestingly, pathotype 117-6 had a fairly large number of unique and specific genes (1882), more than 3-fold higher than that of pathotypes 40A (593) and 40-3 (567) and more than 2-fold higher than pathotype 14 (830), further supporting its probable distinction from the other three pathotypes and indicative of its probable adaptive nature (Supplementary Materials Tables S3 and S9). Moreover, pathotype-specific genes having paralog pairs but no orthologous genes across pathotypes were highest in pathotype 117-6 (93) followed by only nine genes found in pathotype 14. There was a single gene specific to pathotype 40-3 while no such specific gene was identified in pathotype 40A (Supplementary  Materials Table S9). Synteny analysis revealed 65% to 70% gene conservation between pathotypes 117-6 and 14 and 55% synteny among the genes of pathotypes 40-3 and 40A ( Figure 5C). Altogether these results indicate the distant nature of pathotype 117-6 and pathotype 14 compared to pathotypes 40A and 40-3.

Evolutionary Relationship among the P. graminis Pathotypes
Phylogenetic analysis was performed to investigate the evolutionary relationship among the four pathotypes (40A, 40-3, 14, and 117-6) and the reference genome (CDL 75-36-700-3), based on whole-genome sequence alignment. The conservation distance calculated showed pathotypes 40A and 40-3 to be very closely related. Pathotype 14 was more closely related to pathotype 117-6 than 40A and 40-3 ( Figure 5D). Similar results were observed in the analysis based on SNPs sharing a close relationship with pathotypes 40A and 40-3, and pathotype 117-6 was distantly related to three pathotypes ( Figure 5E). These results showed the intraspecies discrimination of closely related pathotypes 40A and 40-3 with pathotype 117-6.

Genome-Wide Analysis of Diversifying Selection
Apart from the effectors and pathogenicity-related genes in pathogens, numerous other genes have been known to undergo diversifying selection due to the potential strong selection pressure [55]. In order to investigate the evolutionary divergence, diversifying selection analysis was performed for all the predicted genes (≥450 bp (150 aa)) of the four pathotypes sharing orthologs among themselves. Signatures of diversifying selections were analyzed using two methods from PALM software [56]. Genes with at least one ortholog were analyzed by the counting method [57] utilizing YN00 to estimate the pairwise dN/dS ratios. Additionally, site-specific diversifying selection analysis for genes possessing at least two orthologs was performed by using two LRTs of CODEML. Maximum dN/dS ratios were obtained for pathotype 40A, followed by 40-3, and minimum in the pathotype 117-6. The site-specific diversifying selection was obtained for 5.1% of the genes in pathotype 14, 5.3% genes of pathotypes 40A and 40-3, and 4.9% genes of pathotype 117-6 (Supplementary Materials Table S10). Pathotype 117-6 had a greater number of lineage-specific genes that lacked the predicted orthologs among the pathotypes. Therefore, the percentage of genes analyzed for this pathotype was relatively low. The mean dN/dS ratio was 0.3 in pathotype 117-6, while the other three pathotypes (14, 40A, 40-3) were similar with the ratio of 0.28.

Analysis of Substitution Rates of Sequence Divergence
Despite the dN/dS ratio varying among the gene pair combinations in all the pathotypes, the mean ratio was 0.30 ± 0.02, which was suggestive of a probable strong functional constrain for most of the genes. Among the genes analyzed for diversifying selection (YN00), we found that in all the four pathotypes around 97% of the genes had a dN/dS ratio <1. Genes with adN/dS ratio = 1 were only found as single-copy genes in pathotypes 117-6 and 40A (Supplementary Materials Table S11). We further investigated the existence and conservation of the genes with adN/dS ratio >1 by considering genes having orthology in the genomes of all the pathotypes. A set of these genes (154) only with dN/dS > 1 in pathotype 117-6 was used as a source to sort out the orthologs in the other three pathotypes inclusive of the genes with dN/dS < or > 1 ( Figure 6A). This enabled the confirmation of the presence of orthologous genes along with their functional relationship, and if they are under some selection pressure with respect to pathotype 117-6. A comparative analysis was also performed to investigate the functional conservation among these genes (dN/dS > 1) within the genome of all the pathotypes. Clusters were generated on the basis of ortholog pairs shared between combinations of two pathotypes (two way), three pathotypes (three way), and all four pathotypes (four way). This analysis showed that pathotypes 117-6 and 14, and pathotypes 40A and 40-3 shared a higher number of genes common to each other as compared to any other two pathotypes observed together in the two-way gene cluster ( Figure 6B). Functional annotation of all the genes (dN/dS ≥ 1) showed that approximately 75% of the genes were hypothetical in all the genomes analyzed in this study (Supplementary Materials Table S12). (B) Clustered heat map generated for the genes having dN/dS values > 1 in all the four pathotypes. Clusters were sorted on the basis of orthologous pairs shared between any two (two way) pathotypes, any three (three way) pathotypes, and all four pathotypes (four way). Vacant (no color) space shows the absence of any shared genes in the respective pathotype.

Identification of Genes under Site-Specific Diversifying Selection and Their Functional Categorization
Amino acid sequence analysis of genes (p < 0.05) underlying the selection in the present study suggested differences in the nature of the pressures exerted on them. Therefore, the site-specific diversifying genes (p < 0.05) with dN/dS ≥ 1 were further analyzed. Our results showed that some of the genes were under strong positive selection. Overall, the percentage of site-specific diversifying genes wascomparatively similar in pathotypes 14, 40-3, and 40A, while it waslowest in the pathotype 117-6 (Supplementary Materials Table S11). Distribution of these genes in different functional classes demonstrated that "energy metabolism", "protein fate", "mobile elements", and "cellular process" were among the major categories that classified these genes (Supplementary Materials Table S13).

Discussion
It has been reported that about 85% of the global population requires wheat as one of their only calorie sources [58]. Wheat is cultivated in about 215 million hectares in the world and provides 20% of the calorie and protein requirements for 4.5 billion people in 94 countries [59]. Although, among the three existing rust diseases of wheat (leaf, stripe, and stem), stem rust is less common and prevalent compared to leaf rust but is considered the most destructive of the three wheat rust diseases [60]. Further, with the emergence of the strain TTKSK (Ug99) and the still-new race [61], the stem rust pathogen has recently acquired much attention due to the danger it poses to the global wheat productions in the near future. The mega wheat variety PBW343, covering nearly 8 million hectares in northern India has also become susceptible to Ug99 [62]. Therefore, a definite program to manage wheat rusts in India should be in place. The incidence and virulence pattern of stem rust pathogens are monitored on wheat crops in India for early detection of possible new virulence, evolution of pathotypes, and changes in pathotype composition and their distribution patterns in summer and regular wheat crops [23,24]. The pathotypes are designated as per the binomial system [63] with modifications [4]. The information thus generated can be used to select stem-rust-resistance genes for incorporation in the development and deployment of new wheat varieties to diversify resistance and avoid yield losses.
The years and places of detection of the four pathotypes selected and investigated in this study are quite different, but are being maintained at the national repository for studying their evolutionary update and disease prevalence each year. The phenotypic and phylogenetic features of the four pathotypes include pathotype 14, the least virulent in terms of not causing disease on wheat lines containing specific resistance genes, while pathotype 40-3 is the most recent and virulent, being able to overcome Sr7a, Sr13 and Sr30 (IT3+) genes. Pathotype 40A is avirulent to these three Sr genes but the virulence pattern for more than~24 Sr genes is shared with pathotype 40-3. Pathotype 117-6 is virulent on durum wheat unlike the other three pathotypes which are virulent on bread wheat.Additionally, 117-6 is also virulent on Sr21(IT3+). Pathotype 117-6 is virulent on Sr13 but avirulent on Sr7a and Sr30(IT2), while pathotype 14 is avirulent on Sr7a, Sr13, and Sr30(IT 1-2) and virulent on Sr21(IT3+) (Supplementary Materials Figure S3 and Table S14). Thus, pathotype 14 could show an intermediary relationship with pathotypes 40A and 117-6in the evolutionary analysis. The phylogenetic analysis performed in this study based on the whole-genome comparisons as well as SNPs identified further pointed towards the diverse nature of the pathotypes.
The assembled genome data for the previously published P. graminis strain, CDL 75-36-700-3 (race SCCL) [8] from the U.S. confirms the high-quality of the assembly and larger (88.6 Mb) genome. Nevertheless, various quality assessments of our data indicate a relatively smaller size of the assembled genomes from India. The sequence depth coverage of 96× to 103× mapped reads of all the pathotypes added credence to our data. There was no evidence of whole-genome segmental duplications in the genomes (Supplementary  Materials and Table S16) as similar to the strain CDL 75-36-700-3 from U.S. [8]. Since the dikaryotic spore stage was sequenced in the present study, the haplotype natures of the genomes were addressed by aligning the assembly against itself individually for each pathotype using CLC genomics workbench. Also, there is minimal possibility of any haplotype sequence being internally aligned elsewhere since the GS reference mapper allows a mapped read as a whole or a portion to be utilized only once during the wholegenome assembly while the possibility of repeats being assembled on top of each other could be considered. The total number of predicted CDS was found to be highest in pathotype 117-6 (15401) followed by pathotypes 14, 40A, and 40-3. The mean coding sequence length (1092 to 1152 bp) in all the four pathotypes was similar to the U.S. strain, i.e., 1075 bp [8]. With the whole-genome sequences of rust fungi now available [4][5][6][7][8][9][10], there are tremendous opportunities of deciphering the predicted genes and understanding their roles. In the present study, functionally annotated and predicted genes (~51.57%) containing significant (~19%) genes related to "energy metabolism", and~7-9% genes related to "cellular processes" and "transport and binding", respectively, in the four individual Puccinia genomes, are a huge resource to understand the metabolic basis of the self-regulation within dormant/resting spores.
It has been demonstrated that diversifying selection has a strong impact on pathogens, especially in case of biotrophs, which have intimate connections to their hosts via effectors [64][65][66][67][68]. Species-specific effectors like Melampsora lini Avr gene homologs which are only found in M. larici-populina, as well as RTP1 effector homologs that are conserved across the rust fungi might have undergone natural selection [69,70]. This illustrates the existence of a triggered regulatory mechanism which encompasses a precise set of genes within the pathogen to survive and infect the host [71]. Therefore, a strong selection pressure is exerted on the genes. In the present study, comparative analysis of a core set of genes with dN/dS > 1 in pathotype 117-6 demonstrated that a majority of the orthologs present in the other three pathotypes (14, 40A, and 40-3) possess dN/dS < 1, possibly indicating a greater adaptability of pathotype 117-6 compared to other three pathotypes. Furthermore, orthologs of the genes under positive selection were predominantly shared between pathotypes 117-6 and 14 rather than with any other pathotype when compared individually. Therefore, the results of the present study indicate that pathotype 117-6 possesses lineagespecific genes which could be under strong positive selection. Also, a co-relation of an old avirulent pathotype (14) with a recent, virulent, and adaptable pathotype (117-6) could be seen, which suggests conservation of certain essential groups of genes. Thus, this study was focused on finding the impact of diversifying selection on the whole repertoire of predicted genes so that a common set of rust-specific genes could be identified within the different functional classes of these four pathotypes. Identification of such specific genes with unique features within individual pathotype would enable a better understanding of the evolution of recent virulent rust pathotypes which follow a different lineage.
Despite increasing evidence that not all the effectors are small in size and have cysteinerich peptides, secreted pathogen proteins are still referred to as potential effectors [72]. Of the total ES proteins identified and analyzed in the present study, pathotype 117-6 showed 2 to 6% higher proteins with cysteine content >5 compared to the other three pathotypes. Two of these hypothetical genes were specific to pathotype 117-6. Similarly, the potential pathogenicity genes of pathotype 117-6 showed greater homology with the reduced virulence genes in PHI db (www.phibase.org, accessed on 13 December 2017 [43]) than the pathotypes 14, 40A, and 40-3. These results seem to differentiate pathotype 117-6 from rest of the three pathotypes in terms of the number of effectors and pathogenicityrelated genes that may have important roles in contributing to it being the most virulent and specific to durum wheat. Prediction of ES proteins and identification of small proteins enriched in cysteine amino acids with the secretion signal form strong evidence for the presence of potential effectors in the P. graminis genome.
Pathotype 117-6 seemed to belong to a different group in this study, which is consistent with the published reports [25,73]. Mutational events are considered to be one of the main reasons for genetic variations in rust fungi [74][75][76] and might be responsible for the emergence of new virulent races [77,78]. At the same time, fewer molecular variations within the species have been observed [65,79]. Consistent with this, our results on wholegenome DNA polymorphism analysis within the four individuals showed quite a similarity in terms of percent distribution of SNPs in the genomic regions. Our results suggested most of the genes to be under purifying selection and 3% of genes potentially under positive selection which possibly could be responsible for the existing diversity among the P. graminis pathotypes. However, diversity among populations is also observed due to genetic drift, migration, and demographic events [80,81]. Furthermore, comparative analysis demonstrated the conservation of functional genes and a close relationship among themselves despite the diversity observed. Point mutations in non-protein-coding DNA sequences can also have functional consequences, particularly if they affect a regulatory element [82]. Consistent with this, an increased percentage of SNPs identified in the downstream, upstream, and intergenic regions compared to the exonic regions in all the four individuals could be of significant importance.

Conclusions
In this study we sequenced the genomes of four distinct P. graminis pathotypes (14, 40A, 40-3, and 117-6) and performed genome-wide comparative analyses. This is the first report using NGS for reference-based genome sequencing and genome-wide comparative analysis of four Indian P. graminis pathotypes. Pathotype 117-6 showed significant difference in possessing small ES proteins with high cysteine content, SNP distribution in the regulatory regions, genes of adaptability, and fairly large numbers of unique and specific genes. Whole-genome phylogenetic analysis revealed that there may be evolutionary relationships among the pathotypes in terms of host specificity, insertion and deletion events, and virulence profiles. Pathotype 14, being an older avirulent strain, could probably have an intermediary relationship between the bread wheat pathotypes, 40A and 40-3 and the durum wheat pathotype, 117-6. Considering any possible artifacts during sequencing and downstream analysis, this study nevertheless could serve as an important genomic resource for studying population structure and diversity analysis. It could also be helpful in monitoring the evolution of new variants of the rust pathogen and mapping genes for various traits to further enable a better management of the stem rust pathogen.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/jof7090701/s1, Figure S1: Assembly statistics of Puccinia graminis pathotypes. Figure S2: Identification of conserved domains in the Puccinia graminis pathotypes. Figure S3: Depiction of IT (infection types) for the code of pathotypes on wheat leaves. Table S1: Genome assembly of the Puccinia graminispathotypes. Table S2: Statistics of the completeness of the genome based on analysis by CEGMA. Table S3: Identification of intra-pathotype genetic variation within the Puccinia graminispathotypes. Table S4: Identification of inter-pathotype genetic variations among the Puccinia graminis pathotypes. TableS5: Gene predictions of the Puccinia graminis pathotypes. Table S6: Categorization of annotated genes (>450) into various classes based on their functional roles. Table S7: Secretome analysis of the Puccinia graminis pathotypes. Table S8: Functional annotation of proteomes in Puccinia graminis pathotypes (against PHI database). Table S9: Analysis of pairwise orthologous genes across the Puccinia graminis pathotypes. Table S10: Diversifying selection analysis of genes (≥450bp) in four Puccinia graminis pathotypes. Table S11: Genes under specific selection pressure in the Puccinia graminis pathotypes. Table S12: Functional classification of genes with omega (dN/dS) > 1. Table S13: Functional classification of site-specific diversifying genes. Table S14: Virulence/avirulence formula for the Puccinia graminis pathotypes. Table S15: Wheat lines/varieties carrying Sr genes. Table S16: Segmental duplication in Puccinia graminis genomes.
Author Contributions: T.R.S. conceived and designed the experiments, K.K. generated data, H.C.R. and H.D. performed the computational biology experiments, K.K. and H.C.R. analyzed the data, S.C.B. and R.J. contributed in providing reagents/materials, R.D. and H.D. revised the manuscript and provided input, and K.K. and T.R.S. wrote the manuscript. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: Sequence data from this article have been deposited with the NCBI GenBank and their BioProject IDs (https://www.ncbi.nlm.nih.gov/bioproject/) under Accession No. LAQV00000000, LAQW00000000, LAQX00000000, LAQY00000000". If requested, the database will withhold release of data until publication. support and input during initial stages of this study. We thank William Wesley Crump, (Washington State University Irrigated Agriculture Research and Extension Center) for the English language correction in the revised manuscript.

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.