Comparative Pan-Genomic Analysis Revealed an Improved Multi-Locus Sequence Typing Scheme for Staphylococcus aureus

The growing prevalence of antibiotic-resistant Staphylococcus aureus strains mandates selective susceptibility testing and epidemiological investigations. It also draws attention to an efficient typing strategy. Whole genome sequencing helps in genetic comparison, strain differentiation, and typing; however, it is not that cost-effective. In comparison, Multi-Locus Sequence Typing (MLST) is an efficient typing method employed for bacterial strain typing and characterizations. In this paper, a comprehensive pangenome and phylogenetic analysis of 502/1279 S. aureus genomes is carried out to understand the species divergence. Additionally, the current Multi-Locus Sequence Typing (MLST) scheme was evaluated, and genes were excluded or substituted by alternative genes based on reported shortcomings, genomic data, and statistical scores calculated. The data generated were helpful in devising a new Multi-Locus Sequence Typing (MLST) scheme for the efficient typing of S. aureus strains. The revised scheme is now a blend of previously used genes and new candidate genes. The genes yQil, aroE, and gmk are replaced with better gene candidates, opuCC, aspS, and rpiB, based on their genome localization, representation, and statistical scores. Therefore, the proposed Multi-Locus Sequence Typing (MLST) method offers a greater resolution with 58 sequence types (STs) in comparison to the prior scheme’s 42 STs.


Introduction
S. aureus, both commensal and pathogenic, inhabits humans' noses, mucosal passages, and skin, which can result in mild to severe infections, sometimes resulting in death [1]. The infections can spread through inhalation, direct contact, and contamination [2]. The statistics of its infections occurring worldwide and its isolation in healthcare settings increase the need for effective typing schemes for close monitoring of various strains and lineages.
According to the Centers for Disease Control's (CDC) (https://www.cdc.gov/ (accessed on 15 September 2022)) 2019 report "Antibiotic Resistance Threats in the United States, 2019", antibiotic resistance of methicillin-resistant S. aureus is ranked ninth in the list of serious threats [3]. In 2011, according to CDC, the invasive MRSA infection cases were 80,451. Of these cases, approximately 11,285 deaths occurred in the United States alone [4]. S. aureus was the second most isolated microbe from bacteremia in Europe in 2008 with the genome phylogeny. This calls for our attention to the improvement and revision of the current scheme to understand the current emerging patterns of S. aureus.
The rapid evolution of the genome of S. aureus as indicated by the comparative pangenome analysis and the increasing evidence of the limitation of the genes of the MLST scheme of S. aureus from the literature was the major reason for the assessment of the current scheme [17].

Materials and Methods
This study was comprised of two main components both of which were linked to each other: comparative pan-genomic analysis, and the development of an improved MLST scheme. The results of the pangenome analysis were used extensively in identifying the pan and core genes for the S. aureus genome, possible gene candidates for the MLST scheme, studying the evolution of S. aureus on a molecular and functional basis, configuring an improved MLST scheme, analyzing the pathogenicity, estimating the gene acquisition and loss, investigating the recombination in S. aureus, and estimating how open the genome of S. aureus is.
The details of each of the components are discussed in Figure 1 below.
scheme represents a portion of the whole genome where approximately 43% of the g nomic area remains unrepresented in the present scheme. This shortcoming can result failure to represent the genome phylogeny. This calls for our attention to the improveme and revision of the current scheme to understand the current emerging patterns of S. reus. The rapid evolution of the genome of S. aureus as indicated by the comparative pa genome analysis and the increasing evidence of the limitation of the genes of the ML scheme of S. aureus from the literature was the major reason for the assessment of t current scheme [17].

Materials and Methods
This study was comprised of two main components both of which were linked each other: comparative pan-genomic analysis, and the development of an improv MLST scheme. The results of the pangenome analysis were used extensively in identifyi the pan and core genes for the S. aureus genome, possible gene candidates for the ML scheme, studying the evolution of S. aureus on a molecular and functional basis, config ing an improved MLST scheme, analyzing the pathogenicity, estimating the gene acq sition and loss, investigating the recombination in S. aureus, and estimating how open t genome of S. aureus is.
The details of each of the components are discussed in Figure 1 below. Figure 1. Overall methodology. The flowchart demonstrates the overall methodology followed this study. It shows the two main components: phylogenetic analysis and the development of a n MLST scheme.

Sequence Retrieval and S. aureus Genome Data and Gene Annotation
The total genomic sequences available for S. aureus to date are 1279, and 502 out 1279 were downloaded from National Center for Biotechnology Information (NC (https://www.ncbi.nlm.nih.gov/ (accessed on 21 October 2022)) and further subjected different genomic analysis. DNA sequences in FASTA format were used to make a 1 rRNA phylogenetic tree and an SNP-based phylogenetic tree. DNA sequences in FAST format were annotated by Prokka at default parameters for estimation of pan-genom [18], a tool for the complete annotation of the bacterial draft genome in approximately min. The steps are further elaborated in Figure 2.

Sequence Retrieval and S. aureus Genome Data and Gene Annotation
The total genomic sequences available for S. aureus to date are 1279, and 502 out of 1279 were downloaded from National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/ (accessed on 21 October 2022)) and further subjected to different genomic analysis. DNA sequences in FASTA format were used to make a 16s rRNA phylogenetic tree and an SNP-based phylogenetic tree. DNA sequences in FASTA format were annotated by Prokka at default parameters for estimation of pan-genome [18], a tool for the complete annotation of the bacterial draft genome in approximately 10 min. The steps are further elaborated in Figure 2.

Pan-Genome Profile Analysis
We first performed Pangenome analysis using PanRV in-house pipeline to analyze the 502/1279 complete genome sequences. PanRV is a pangenome reverse vaccinology tool that assesses the genome against each category, such as pan, core, accessory, and unique, using a pangenome estimate module built on roary. The output file from prokka in GFF3 format was needed as input in running the pangenome estimation module (PGM) of a PanRV [19]. The steps followed and the software used are elaborated in Figure 2 above.

Development of an Improved MLST Scheme
For the current scheme, the literature was reviewed to identify the problems of the current MLST scheme. Meanwhile, the data of the current scheme were subjected to further analysis using different software. The allelic profile for 502 genomes was determined using PubMed. The MLST tree of the 502 genomes was constructed, and clones were identified.
The prediction of new gene candidates for the new MLST scheme was the second part of the methodology depicted in Figure 3 for which an inclusion-exclusion criterion based on the size and unique alleles was established based on the literature review and drawbacks of the current scheme.
Allelic profiles of all the genes were retrieved using PubMed. These proposed gene candidates then underwent statistical analysis that led to further shortlisting of the genes. A new scheme constituting proposed gene candidates and current scheme candidates was constructed. The new scheme eliminated all the genes of the current MLST scheme having lower shortcomings in satisfying the new criteria. Finally, STs were assigned for the proposed MLST scheme. Subsequently, an MLST tree was constructed. This tree was compared with the current MLST tree, and insights were drawn.

Pan-Genome Profile Analysis
We first performed Pangenome analysis using PanRV in-house pipeline to analyze the 502/1279 complete genome sequences. PanRV is a pangenome reverse vaccinology tool that assesses the genome against each category, such as pan, core, accessory, and unique, using a pangenome estimate module built on roary. The output file from prokka in GFF3 format was needed as input in running the pangenome estimation module (PGM) of a PanRV [19]. The steps followed and the software used are elaborated in Figure 2 above.

Development of an Improved MLST Scheme
For the current scheme, the literature was reviewed to identify the problems of the current MLST scheme. Meanwhile, the data of the current scheme were subjected to further analysis using different software. The allelic profile for 502 genomes was determined using PubMed. The MLST tree of the 502 genomes was constructed, and clones were identified.
The prediction of new gene candidates for the new MLST scheme was the second part of the methodology depicted in Figure 3 for which an inclusion-exclusion criterion based on the size and unique alleles was established based on the literature review and drawbacks of the current scheme.
Allelic profiles of all the genes were retrieved using PubMed. These proposed gene candidates then underwent statistical analysis that led to further shortlisting of the genes. A new scheme constituting proposed gene candidates and current scheme candidates was constructed. The new scheme eliminated all the genes of the current MLST scheme having lower shortcomings in satisfying the new criteria. Finally, STs were assigned for the proposed MLST scheme. Subsequently, an MLST tree was constructed. This tree was compared with the current MLST tree, and insights were drawn.

Population Structure
The population structure of S. aureus developed using the current MLST scheme was compared with that of the newly proposed scheme. For this purpose, the seven loci of housekeeping genes of 502 genomes (available at the time of analysis) were subjected to MLST scheme analysis and were assigned allele numbers based on the already reported MLST scheme [20].

New Gene Candidates for S. aureus
The purpose of this scheme is to represent the whole genome and to select genes with a higher discriminatory index than the previously reported genes. The higher the index, the greater the ability to differentiate between strains. Genes that play crucial roles in cellular pathways were selected with the help of BacMap, UniProt, and COG. The following are the criteria for the selection of candidate genes for the MLST scheme.

Core genes
In the selection process, the two parameters of size and core genome presence were kept side by side. Core genome analysis means that a particular gene is present throughout the species [13]. Each gene was searched on PubMed blast Sequence query for its presence in the core genome.

Role in Cellular Pathways
These housekeeping genes are constitutively active and play a crucial role in cellular mechanisms. Genes playing roles in transcription, translation, recombination, repair, defense mechanism, cell cycle regulation, cell division, and chromosome partitioning were selected. Moreover, the core genomes in FASTA selected from the PanRV output were subjected to COG analysis for functional annotation. The sequences of the genes were also retrieved using the online database "BacMap Genome Atlas" (http://bacmap.wishartlab.com/ (accessed on 23 October 2022)), Uniprot (www.uniprot.org/ (accessed on 23 October)) [21][22][23][24].

Population Structure
The population structure of S. aureus developed using the current MLST scheme was compared with that of the newly proposed scheme. For this purpose, the seven loci of housekeeping genes of 502 genomes (available at the time of analysis) were subjected to MLST scheme analysis and were assigned allele numbers based on the already reported MLST scheme [20].

New Gene Candidates for S. aureus
The purpose of this scheme is to represent the whole genome and to select genes with a higher discriminatory index than the previously reported genes. The higher the index, the greater the ability to differentiate between strains. Genes that play crucial roles in cellular pathways were selected with the help of BacMap, UniProt, and COG. The following are the criteria for the selection of candidate genes for the MLST scheme.

Core genes
In the selection process, the two parameters of size and core genome presence were kept side by side. Core genome analysis means that a particular gene is present throughout the species [13]. Each gene was searched on PubMed blast Sequence query for its presence in the core genome.

Role in Cellular Pathways
These housekeeping genes are constitutively active and play a crucial role in cellular mechanisms. Genes playing roles in transcription, translation, recombination, repair, defense mechanism, cell cycle regulation, cell division, and chromosome partitioning were selected. Moreover, the core genomes in FASTA selected from the PanRV output were subjected to COG analysis for functional annotation. The sequences of the genes were also retrieved using the online database "BacMap Genome Atlas" (http://bacmap. wishartlab.com/ (accessed on 23 October 2022)), Uniprot (www.uniprot.org/ (accessed on 23 October)) [21-24].

Mobile Elements
S. aureus is commonly known for recombination events within the core genome, which could be attributed to the presence of mobile elements [25]. Therefore, each candidate gene was looked for the presence of mobile elements in or around the proximity of candidate genes through an online tool, CENSOR by "Giri Repbase" (http://www.girinst.org/censor/ index.php (accessed on 22 October 2022)) [26]. Then 100 bp upstream and downstream of each gene was looked up for the mobile elements using the same software.

Size of the Gene
The size of the gene in an effective MLST scheme for PCR is very important. In the '90s, 400-600 bp was considered the right length for single-run gel-based sequencing [27]. A gene that is too short or too long fails to produce the result [28]. Hence, genes with appropriate sizes were selected that led to the shortlisting of 15 gene candidates.

Unique Alleles of Each Gene
The pragmatic choice of genes for MLST scheme development requires discrimination power. Allele's profile and sequence types are used to distinguish different strains. A stable gene having more unique alleles can characterize more strains [13]. Alleles for all the selected genes for 502 genomes were labeled using PubMed (https://pubmlst.org/bigsdb? db=pubmlst_S. aureus_seqdef (accessed on 21 October 2022)).

Data Analysis
Sequence alignment was performed using Mega 7 through MUSCLE. The aligned sequences of MLST genes and newly predicted genes were then subjected to different statistical analysis. Statistical analysis was performed using "dnasp5" and "SplitsTree" software. SplitsTree was used to construct a Split decomposition tree, Phi test for recombination, and to calculate the invariant sites (http://www.splitstree.org (accessed on 25 October 2022)) [29]. "Dnasp5" was used to perform the phi test, Tajima's D test, and Tajima's dN/dS ratio calculation (http://www.ub.edu/dnasp/index_v5.html (accessed on 25 October 2022)).

Significance of Statistical Parameters
The statistical score was used for the selection of candidate genes. The scores are significant in determining the stability of the genes over the course of evolution. The following tests were used to evaluate the genes.
Phi test for Recombination: Phi test is performed to estimate recombination events. Phi test was performed using "Splitstree" in order to estimate the recombination events [29]. The higher rate of recombination is directly related to increased virulence and more pathogenicity [30].

•
Tajima's D value: Tajima's D value was calculated for already reported MLST scheme and newly predicted genes using "dnasp5". If the value is zero or closer to zero, it indicates that the observed variation of the population is equal to or closer to the expected. A negative Tajima's D value shows population expansion. Similarly, a positive value indicates population shrinking or balancing selection. A value between 2 and −2 shows that the genes were driven by neutral selection [31]; • Tajima's dN/dS ratio: It is a ratio of non-synonymous and synonymous mutations. The ratio is used to distinguish between variants of a population [32]; • dN/dS = (Non-synonymous mutations)/(Synonymous mutations) The dN/dS ratio was calculated using "dnasp5" software. If the value is greater than 1, it indicates more non-synonymous mutations. However, if the value is smaller than 1, it shows that there are more synonymous mutations.
Primers for new candidate genes were constructed. The following primers were constructed for the suitable candidates shown in Table 1.

Sequence Type Assigning
After the statistical tests, the new gene candidates were finalized. The STs for both the current and new schemes were assigned. For the current scheme, the STs were assigned based on data available on PubMed. For the new scheme, the allelic numbers were assigned manually after the data retrieval of each allele for every gene. The validity of data was confirmed using the online "eburst" tool (http://eburst.mlst.net/ (accessed on 27 October 2022)) to ensure that no two allelic profiles have the same STs [11].

MLST Data Analysis
For analyzing the clonal complexes and studying the phylogenetics of the species the MLST scheme must be presented in a graphical form for analysis. The data analysis can be performed in two ways:

•
Based on ST profiles of the genes; • Based on FASTA sequences of each allele.
The choice of selection of the method depends on the congruency of the alleles. If alleles are congruent then the second method is used, if not, then the other method is used. S. aureus is highly clonal, which ultimately refers to the choice of the first method for the analysis. For the very purpose of analysis, "Phyloviz", which is both downloadable and online software, was used (https://online.phyloviz.net/index (accessed on 28 October 2022)). Two files, the Profile data file and the Accessory data file, were given to the software and the clones of data were analyzed. The Profile data file included the allelic profiles and STs. The Accessory data file included the names of the organisms in the same order as that of the Profile data file. In the Accessory data file, we can add as many details of an organism as we want. This can include the name, origin, source, etc.

Staphylococcus aureus Genomic Features and Statistics
As multiple drug and antibiotic resistance has emerged in S. aureus, making it a member of the list of major threats to public health by the CDC, and it has become a serious concern for both researchers and the community alike [33]. There has been an exponential increase in the data of S. aureus on international databases owing to the advancements and ease in sequencing whole genomes. Currently, 1279 complete genomes and~31283 incomplete genomes are publicly available. A total of 502/1279 complete genome sequences were downloaded for further analysis as indicated in Table 1 below [34]. The size ranged from 2.66466 Mbp to 3.13248 Mbp with the reference genome, NCTC 8325, having an average size of 2.8 Mbp. The particular bacteria have been evolving and acquiring antibiotic-resistant and virulent genes since the 1940s, with outbreaks occurring mainly in, but not limited to, hospitals and healthcare settings [35].

Pan-Genome Profile Analysis
The pan-genome analysis is a comprehensive study to understand the evolution of bacteria on a molecular and functional basis. The pan-genome of 502 S. aureus genomes of a total of 1279 available genomes, is comprised of 12,477 gene families of which 2320 genes (19%) are part of the core genome. According to the COG analysis, which is an appropriate way to identify the functions of individual proteins or protein sets, the majority of the 2320 gene families that make up the core genome are involved in the metabolism transforming nutrients into cellular energy. This represents approximately 19% of the entire genome, the majority of which falls into the category of metabolism by COG analysis. Soft core genes are 10 (0.08%) which represents~95% incidence of the genome, shell genes are 848 (70%) which cover almost 10-95% of the genome, and 9299 (74%) are numbered as cloud genes which ensure <10% occurrence. The pan and core genome shows fluctuation in the number of pan and core genomes. Overall, the pangenome sizes are not stable and suggest an open pangenome as shown in Figure 4. Pangenome graphs demonstrated a significant expansion in pangenome size upon the addition of new genomes whereas the core genome size drops initially and becomes almost constant and rather stable. Thus, the core genome attains a level of relative stability toward the end of the genomes, making it an ideal component for the prediction of genes to be included in the MLST scheme as well as candidates for vaccine design.
As multiple drug and antibiotic resistance has emerged in S. aureus, making it a mem-ber of the list of major threats to public health by the CDC, and it has become a serious concern for both researchers and the community alike [33]. There has been an exponential increase in the data of S. aureus on international databases owing to the advancements and ease in sequencing whole genomes. Currently, 1279 complete genomes and ~31283 incomplete genomes are publicly available. A total of 502/1279 complete genome sequences were downloaded for further analysis as indicated in Table 1 below [34]. The size ranged from 2.66466 Mbp to 3.13248 Mbp with the reference genome, NCTC 8325, having an average size of 2.8 Mbp. The particular bacteria have been evolving and acquiring antibiotic-resistant and virulent genes since the 1940s, with outbreaks occurring mainly in, but not limited to, hospitals and healthcare settings [35].

Pan-Genome Profile Analysis
The pan-genome analysis is a comprehensive study to understand the evolution of bacteria on a molecular and functional basis. The pan-genome of 502 S. aureus genomes of a total of 1279 available genomes, is comprised of 12,477 gene families of which 2320 genes (19%) are part of the core genome. According to the COG analysis, which is an appropriate way to identify the functions of individual proteins or protein sets, the majority of the 2320 gene families that make up the core genome are involved in the metabolism transforming nutrients into cellular energy. This represents approximately 19% of the entire genome, the majority of which falls into the category of metabolism by COG analysis. Soft core genes are 10 (0.08%) which represents ~95% incidence of the genome, shell genes are 848 (07%) which cover almost 10-95% of the genome, and 9299 (74%) are numbered as cloud genes which ensure <10% occurrence. The pan and core genome shows fluctuation in the number of pan and core genomes. Overall, the pangenome sizes are not stable and suggest an open pangenome as shown in Figure 4. Pangenome graphs demonstrated a significant expansion in pangenome size upon the addition of new genomes whereas the core genome size drops initially and becomes almost constant and rather stable. Thus, the core genome attains a level of relative stability toward the end of the genomes, making it an ideal component for the prediction of genes to be included in the MLST scheme as well as candidates for vaccine design.

Population Analysis Using Present MLST Scheme
In the population analysis of the current MLST scheme for our 502 genomes, two new arcC alleles, which encode for Carbamate kinase, involved in catabolic processes and intermediate metabolism were found for strains FDAARGOS_159 and TCH60. One new aroE allele, aroE gene encode for Shikimate dehydrogenase, which is again involved in catabolic processes, was found for the NZ15MR0322 strain. One new gmk allele, which is essential for the recycling of gmp, was identified for the S. aureus 2148.N strain. Furthermore, one new allele was found for tpi and yqil in strains SA40TW and RF122. Consequently, novel alleles and genes were identified during this analysis step, respectively. A total of 10 new STs were found in the available data as manifested in the Additional file Table S2. The numbers of unique alleles of S. aureus identified for each of the seven housekeeping genes were arcC 19, aroE 20, glpf 18, gmk 15, pta 19, tpi 21, and yqil 23. The mean number of unique alleles is 19.2. The most common allele for different loci was arcC~3 (77 isolates), aroE~3 (74 isolates), glpf~1 (135 isolates), gmk~1 (91 isolates), pta~4 (79 isolates), tpi~4 (80 isolates), and yqil~3 (78 isolates). The total allelic profiles found in 502 strains are 135. Similarly, the total number of STs found is 52. The most common sequence type found in the data set is ST8 [36]. Furthermore, the dN/dS ratio, which compares synonymous substitution rates (dS) with nonsynonymous substitution rates (dN) to calculate the mode and degree of selection for each allele was also calculated and was demonstrated in Table 2.

Limitations of Existing Scheme
The present MLST scheme for S. aureus has a few shortcomings according to the literature and our study of the phylogenetics of organisms.
The present/existing MLST scheme represents a portion of the whole genome, not the complete genome. The complete genome is comprised of many more segments than just those currently represented in the MLST scheme. Approximately 43% of the genomic area goes unrepresented in the present scheme as manifested in Figures 5 and 6. This shortcoming can result in the failure to represent the genome phylogeny [36]. This leaves us with an incomplete picture of how fast evolving or rapidly evolving the species is, which can make it hard for researchers to draw conclusions about its evolutionary history, and it may be impossible to accurately identify all strains from one species.
It is reported that due to 12 bp, overlap in the 3' region of gmk is difficult to amplify in the laboratory which should be conserved, and that led to difficulty in typing of strains. This may be probably due to 12 variations in the 3 region of 12 bp revealed in our analysis. When the genome was subjected to MLST on PubMed (https://pubmlst.org/bigsdb?db=pubmlst_S. aureus seqdef (accessed on 29 October 2022)), the website did not show any allele number for yqil. When it was investigated, the site showed that yqil was found truncated from 1-58 in 22 genomes. Furthermore, the statistical analysis revealed that the gene aroE and gmk of the current scheme have dN/dS values of 12.60328 and 4.06854, respectively. Further details of the statistical analysis of the current scheme are given in Table 3.  It is reported that due to 12 bp, overlap in the 3' region of gmk is difficult to amplify in the laboratory which should be conserved, and that led to difficulty in typing of strains. This may be probably due to 12 variations in the 3′ region of 12 bp revealed in our analysis. When the genome was subjected to MLST on PubMed (https://pubmlst.org/bigsdb?db=pubmlst_S. aureus seqdef (accessed on 29 October 2022)), the website did not show any allele number for yqil. When it was investigated, the site showed that yqil was found truncated from 1-58 in 22 genomes. Furthermore, the statistical analysis revealed that the gene aroE and gmk of the current scheme have dN/dS values of 12.60328  It is reported that due to 12 bp, overlap in the 3' region of gmk is difficult to amplify in the laboratory which should be conserved, and that led to difficulty in typing of strains. This may be probably due to 12 variations in the 3′ region of 12 bp revealed in our analysis. When the genome was subjected to MLST on PubMed (https://pubmlst.org/bigsdb?db=pubmlst_S. aureus seqdef (accessed on 29 October 2022)), the website did not show any allele number for yqil. When it was investigated, the site showed that yqil was found truncated from 1-58 in 22 genomes. Furthermore, the statistical analysis revealed that the gene aroE and gmk of the current scheme have dN/dS values of 12.60328

Selection of New Genes
The selected candidate genes denoted the genomic region unrepresented in the previous MLST scheme. Afterward, the genome map was modified in the form of Figure 7.
Genes 2022, 13, 2160 11 of 17 and 4.06854, respectively. Further details of the statistical analysis of the current scheme are given in Table 3.

Selection of New Genes
The selected candidate genes denoted the genomic region unrepresented in the previous MLST scheme. Afterward, the genome map was modified in the form of Figure 7. The initial shortlisting included 130 candidate genes in the unrepresented region, which were further evaluated for their suitability as candidates for the MLST scheme as manifested in Table 4.  The initial shortlisting included 130 candidate genes in the unrepresented region, which were further evaluated for their suitability as candidates for the MLST scheme as manifested in Table 4. These genes were examined for the presence of mobile elements, representing the intra-genome and inter-genomes mobility of the DNA, resulting in a reduction to 55 genes. The genes were further screened for the presence of mobile elements in 100 bp upstream and downstream reducing the number to 23. Later, only nine genes, found in the core genome, were selected after shortlisting based on size and unique alleles.

Statistical Values for New Gene Candidates
For each new candidate, the phi-test for recombination, invariant sites, nucleotide diversity, Tajima's D, Tajima's D significance, Tajima's Non-syn/Syn ratio, Fu and Li's test, and Fu and Li's test significance were determined as represented in Table 5. Based on these statistical scores, only three genes were selected for MLST on the basis of the above-mentioned test parameters.

New MLST Scheme
Based on the proposed genes and the current MLST scheme genes that had no reported drawbacks, a new MLST scheme was constructed as shown in Table 6. This scheme eliminated the genes of the current scheme that included aroE, gmk, and yqil. Three new genes were selected namely, opuCC, aspC, and rpiB on the basis of genome localization and statistical scores.

Sequence Type Assigning
The new scheme was assigned STs, and it resulted in 58 STs as manifested in Additional file Table S1. The previous scheme constituted a total of 52 STs, of which 10 were new STs. Hence, the genomic data of 502 sequences showed 42 reported STs and 10 new STs.

MLST Tree Construction
MLST trees for current and new schemes were constructed as depicted in Figures 8 and 9. localization and statistical scores.

Sequence Type Assigning
The new scheme was assigned STs, and it resulted in 58 STs as manifested i tional file Table S1. The previous scheme constituted a total of 52 STs, of which new STs. Hence, the genomic data of 502 sequences showed 42 reported STs and STs.

MLST Tree Construction
MLST trees for current and new schemes were constructed as depicted in F and 9.

Discussion
The advancements in Next Generation Sequencing (NGS) and consequently the increase in the number of genomic data show that there is a need to compile and analyze the data collectively to be able to combat effectively against bacterial pathogens. S. aureus is coming into more and more focus as the incidents of the outbreaks have risen over the past decades. Phylogenetic analysis using the sequenced genome data not only helps in determining the evolutionary relationship between the strains but also gives a clue about the way the genome is evolving over time to adapt to its environment. However, the horizontal gene transfer and the acquisition of antibiotic resistance genes make the need of classifying the strains prominent. Several typing methods are currently in use with great room for improvement.
The rapid evolution of bacterial genomes with the passage of time asserts the need for the evaluation of the effectiveness of the current typing methods. Consequently, the

Discussion
The advancements in Next Generation Sequencing (NGS) and consequently the increase in the number of genomic data show that there is a need to compile and analyze the data collectively to be able to combat effectively against bacterial pathogens. S. aureus is coming into more and more focus as the incidents of the outbreaks have risen over the past decades. Phylogenetic analysis using the sequenced genome data not only helps in determining the evolutionary relationship between the strains but also gives a clue about the way the genome is evolving over time to adapt to its environment. However, the horizontal gene transfer and the acquisition of antibiotic resistance genes make the need of classifying the strains prominent. Several typing methods are currently in use with great room for improvement.
The rapid evolution of bacterial genomes with the passage of time asserts the need for the evaluation of the effectiveness of the current typing methods. Consequently, the MLST schemes also need to be evaluated and revised with time. The revision of schemes makes them suitable for the effective characterization of pathogenic strains. Henceforth, the revision of schemes is a regular process towards the improvement of the typing method. The evaluation and the revision of the MLST schemes improve the resolution power just like the revision of the MLST scheme of the Staphylococcus epidermis gave significant results [36].
According to the literature review and genomic data analysis of S. aureus, the current MLST scheme has some shortcomings as well. The polymorphic region of gmk and truncated region of yqil make it difficult to type the strains in regular laboratory settings. Moreover, the high non-synonymous mutations of the genes indicate less stability of the genes like that of aroE and gmk. This contradicts the very purpose of MLST genes that are supposed to be stable over the period. Hence, it draws the attention of researchers to look for suitable gene candidates. It is expected that this revision will prove of vital significance and will help us in characterizing the rapidly evolving S. aureus genome.

Conclusions
The current typing methods for bacterial strains have certain limitations and there is no single method available to perform efficiently enough under all situations. Nevertheless, it remains a challenge for the researchers to keep looking at and improving the existing methods until a new or revised method is proposed with ideal typing potentials. The current MLST scheme was evaluated, and an improved scheme was proposed. gmk, yqil, and aroE were excluded from the present scheme, and opuCC, rpiB, and aspS were included in the proposed scheme along with the arcC, glpF, tpi, pta from the current scheme based on strict exclusion inclusion criteria. According to our research, the proposed genes for the new MLST scheme are relatively stable as shown by the statistical analysis. The Tajima's D value, Tajima's non-synonymous to synonymous mutation ratio, and Phi test show promising outcomes for the new scheme. The new scheme also showed more resolution as compared to the previous scheme for our 502-complete genome dataset. Our reported scheme shows 58 different sequence types as opposed to the 52 sequence types that are delineated in the present scheme. The scheme can be used for strain typing in the future.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13112160/s1, Table S1: New STs assigned through Sequence Type Assigning. The table represents the new STs that were assigned based on the new gene alleles. The allele numbers of the previous genes are kept the same to avoid confusion; Table S2: The presence and absence of new genes inserted in a new scheme in the core genomes of S. aureus 502 strains.