Article

Composite Genome Quality Index for Pathogenic Bacterial Genomes

1
Department of Food Science, University of Guelph, Guelph, ON N1G 2W1, Canada
2
Department of Human Health and Nutritional Sciences, University of Guelph, Guelph, ON N1G 2W1, Canada
*
Author to whom correspondence should be addressed.
Appl. Microbiol. 2025, 5(4), 144; https://doi.org/10.3390/applmicrobiol5040144 (registering DOI)
Submission received: 27 April 2025 / Revised: 29 November 2025 / Accepted: 4 December 2025 / Published: 7 December 2025

Abstract

High-quality bacterial genomes are essential for robust comparative genomics, reliable taxonomic assignment, and accurate pathogen and antimicrobial resistance (AMR) surveillance. Yet public repositories still contain highly heterogeneous assemblies, and genome quality is often judged using single metrics in isolation. Here we develop an integrative Genome Quality Index (GQI) that combines four complementary metrics (BUSCO single-copy completeness, contig number, N50, and unmapped read percentage) into a composite, interpretable score. We re-assembled and evaluated 474 pathogenic bacterial genomes submitted from South Korea using a standardized Illumina-based pipeline and validated the framework on an independent Enterobacteriaceae dataset (n = 5781). Species-level analyses and unsupervised clustering revealed pronounced variation in genome quality (one-way ANOVA, p < 1.05 × 10⁻³³), with Cronobacter sakazakii and Listeria monocytogenes showing consistently high GQI scores, whereas Mycobacterium tuberculosis exhibited broad variability, including clear low-quality outliers. After log-transforming skewed variables, contig count and N50 remained strongly negatively correlated (r = −0.83), while BUSCO completeness showed a moderate positive association with N50 and a negative association with unmapped reads. GQI scores spanned 0.23–0.96, with most genomes clustering between 0.70 and 0.85. A Random Forest classifier trained on the four raw metrics predicted GQI-based quality tiers (low, medium, high) with 97% accuracy. From the top-decile genomes, we derived empirical thresholds (BUSCO ≥ 98.6%, contigs ≤ 30, N50 ≥ 1 Mb, and unmapped reads ≤ 0.82%) that refine existing recommendations and provide actionable curation criteria. Our framework complements tools such as CheckM, gVolante, and Hybracter by offering a platform-agnostic composite scoring system that can be integrated into submission workflows and surveillance pipelines to systematically flag low-quality genomes and improve the reliability of microbial genomics.

1. Introduction

Scientists rely heavily on public genomic resources because they often reuse sequence data generated by others. If the quality of the available data is low, it not only affects downstream analyses but also propagates further into public resources. Ensuring the quality of public data in molecular biology is therefore crucial, and it poses a challenge for developing automated error detection and processing approaches [1].
Whole genome sequencing (WGS) has revolutionized the surveillance of human pathogenic bacteria, and numerous databases host WGS data. As of October 2025, the genomes of 101 species are stored in the National Center for Biotechnology Information Pathogen Detection Database (NCBI-PD), ensuring convenient accessibility to these genomes [2]. However, strict scrutiny procedures to identify and filter out low-quality or contaminated raw sequences, misassembled genomes, and incorrectly assigned taxonomies are lacking during data submission. This poses challenges in correctly assembling raw reads, assigning accurate species taxonomies, and characterizing genes. Particularly, raw data submitted to the Sequence Read Archive (SRA) undergoes limited scrutiny at present [3].
Genome assemblies often encounter quality issues such as sequencing errors, misassembly, and contamination. Contamination-related errors are particularly concerning as they can lead to misinterpretation of data, mischaracterization of gene content, inaccurate species assignment, and biases in genomic analyses. Contamination is suspected to be widespread, originating from foreign DNA present in raw biological materials or introduced during library preparation. Sequencing errors and misassembly can result in fragmented genomes, which hinder downstream analysis, including the detection of antibiotic resistance genes and mobile genetic elements such as plasmids and phages [4].
Three widely used quality control (QC) parameters for genome assessment are completeness, contiguity, and accuracy. Completeness is determined by the gene content of contigs, because a contiguous genome that contains little gene content is not useful for downstream analysis [5]. High-quality genomes exhibit completeness above 90% [6]. Contiguity relates to the size and number of contigs, with high-quality genomes characterized by fewer contigs of larger size [7]. Accuracy is typically evaluated by comparing the assembled genome with a reference or by mapping reads to the assembled genome to assess coverage. Accurate genomes show consistent contig order (for reference-based assembly) or a lower percentage of unmapped reads (for de novo assembly) [8].
Individual measures of contiguity, completeness, or accuracy alone can be misleading when judging genome quality [7]. Although numerous quality control measurements are available, there is currently no consistent framework for integrating these metrics to accurately report the quality of a whole genome [9]. Hence, an integration of quality control parameters is necessary for precise genome quality assessment.
Given the importance of reliable genomic resources and the potential drawbacks of low-quality genomes in public databases, it is crucial to address the gaps in quality assessment and control. The present study introduces a composite Genome Quality Index (GQI) that integrates four widely reported and biologically interpretable metrics: BUSCO single-copy completeness, contig count, N50, and unmapped read percentage. We apply this framework to a controlled case study of 474 pathogenic bacterial genomes originating from South Korea, selected because they share consistent Illumina sequencing strategies, complete raw read availability, and harmonized metadata within public repositories. Focusing on a single country minimizes confounding variation arising from heterogeneous submission protocols and sequencing platforms while still providing taxonomically diverse genomes. This geographic focus reflects a controlled case study rather than a country-specific bias; the GQI framework itself is globally applicable.
We further extend the analysis in three ways. First, we re-assemble all genomes from raw reads under a standardized pipeline to minimize methodological variation and then compute the GQI from normalized, log-transformed metrics. Second, we perform species-level analyses and unsupervised clustering to identify genome-quality typologies and taxon-specific biases. Third, we compare GQI distributions between our curated dataset and an independent Enterobacteriaceae cohort (n = 5781) and evaluate the ability of a Random Forest classifier to automatically assign genome-quality tiers. From the top-decile GQI genomes, we derive empirical thresholds that refine existing recommendations such as Minimum information about a single amplified genome (MISAG)/metagenome assembled genome (MIMAG) [10].
Together, these analyses yield a reproducible and scalable composite quality framework that complements established tools such as BUSCO, QUAST, ALE, gVolante, CheckM, and Hybracter [11,12,13,14,15,16,17] and can be embedded into submission workflows and surveillance dashboards to systematically flag low-quality genomes.

2. Materials and Methods

2.1. Data Retrieval and Study Design

We compiled a primary dataset of 474 pathogenic bacterial genomes from the NCBI Pathogen Detection (NCBI-PD) database. Inclusion criteria were: (i) isolates originating from South Korea; (ii) Illumina short-read WGS with publicly available raw reads in the Sequence Read Archive (SRA); (iii) clear species-level taxonomic labels; and (iv) complete metadata enabling standardized processing. Focusing on a single country ensured homogeneous sequencing technologies and submission practices, while still capturing multiple clinically relevant species (Supplementary File S1). Raw reads were retrieved from the SRA (accessed on 12 December 2024) using the SRA Toolkit (https://github.com/ncbi/sra-tools). Briefly, prefetch was used to download the data for each SRA accession number, and the validity of the raw sequencing data was checked using vdb-validate. Forward and reverse reads were retrieved using fastq-dump with the --split-files parameter. As part of validation, we also compiled an independent Enterobacteriaceae set (n = 5781) comprising public genomes from Escherichia coli, Salmonella enterica, Klebsiella pneumoniae, and related taxa with ≥20 assemblies per species and available raw reads or high-quality assemblies (Supplementary File S2). These genomes were processed using the same pipeline wherever raw data were available.
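The retrieval steps can be sketched as a small helper that builds the three SRA Toolkit command lines in order. This is an illustration only: the function name, the output directory, and the placeholder accession are assumptions and do not come from the study's scripts.

```python
import shlex


def sra_commands(accession, outdir="reads"):
    """Return the SRA Toolkit invocations for one accession, in execution order.

    `outdir` is a hypothetical output directory chosen for this sketch.
    """
    return [
        ["prefetch", accession],        # download the archive for the accession
        ["vdb-validate", accession],    # check the integrity of the downloaded data
        # write forward and reverse reads to separate FASTQ files
        ["fastq-dump", "--split-files", "--outdir", outdir, accession],
    ]


if __name__ == "__main__":
    # "SRR0000000" is a placeholder, not an accession from the dataset.
    for cmd in sra_commands("SRR0000000"):
        print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` so that a failed validation stops processing of that accession.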

2.2. Quality Assessment of Whole Genomes

Quality trimming of the raw reads was performed using fastp v1.0.1 [18]. Genomes were assembled using SPAdes v4.2.0 assembler [19] with default settings, and subsequent polishing of the assembled genomes was carried out using Pilon v1.24 [20]. The taxonomy of the genomes was determined through 16S RDP classifier v1.23 [21]. To assess genome quality, we employed the benchmarking universal single-copy orthologs (BUSCO v5) tool [11] with the --auto-lineage-prok detection parameter. BUSCO measures several metrics, including complete single-copy, duplicated, fragmented, and missing orthologs in the genomes. The contiguity of the assembled genomes was evaluated using QUAST v5.3.0, a quality assessment tool [12], with default parameters. Contiguity features including number of contigs and N50 values (defined as the length of the contig at which half of the genome is represented by contigs of that size or larger [7]) were determined. Accuracy, which examines the positions of sequence read pairs within an assembly to identify anomalies [8], was evaluated using the Assembly Likelihood Evaluation (ALE) framework [13]. ALE measures the ratio of mapped and unmapped reads to the assembled genome. The complete workflow for genome quality assessment is illustrated in Figure 1.
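To make the N50 definition concrete, the following minimal sketch computes it from a list of contig lengths. It illustrates the definition only and is not the QUAST implementation; the function name is chosen for this example.

```python
def n50(contig_lengths):
    """Return the contig length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig downwards, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # the two 8-unit contigs already cover half of 32
```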
Species names were harmonized to consistent labels (e.g., “Salmonella enterica subsp. enterica” collapsed to S. enterica). To ensure meaningful species-level statistics, only taxa with ≥5 genomes were included in groupwise analyses and this yielded 11 distinct taxa (Supplementary File S1). Because contig count, N50, and unmapped read percentage exhibited skewed distributions and apparent non-linear relationships in preliminary plots, we applied log10 transformations to these metrics prior to correlation and regression analyses. We use the following qualitative categories for correlation strength: “very strong” (|r| ≥ 0.90), “strong” (0.70–0.89), “moderate” (0.40–0.69), “weak” (0.20–0.39), and “negligible” (|r| < 0.20). These thresholds are reported alongside exact r values to avoid ambiguous wording.
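The qualitative categories above can be expressed as a simple lookup. The sketch below treats each band as half-open (for example, 0.70 ≤ |r| < 0.90 is "strong"), which is an assumption made here so that values such as 0.895 are not left unclassified.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the qualitative label used in the text."""
    magnitude = abs(r)
    if magnitude >= 0.90:
        return "very strong"
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.40:
        return "moderate"
    if magnitude >= 0.20:
        return "weak"
    return "negligible"


if __name__ == "__main__":
    for value in (-0.83, -0.61, 0.2, 0.1):
        print(value, correlation_strength(value))
```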

2.3. Genome Quality Index (GQI) Construction and Clustering

To construct the GQI, we first min–max normalized each metric to the [0, 1] range so that higher values consistently reflected better quality. Specifically, BUSCO complete single-copy percentages and log10-transformed N50 values were normalized directly, whereas log10(contig count) and log10(unmapped read percentage) were inverted prior to normalization so that genomes with fewer contigs and fewer unmapped reads received higher scores. For descriptive purposes, genomes were stratified into three quality tiers using fixed GQI cut-offs: “low” (GQI ≤ 0.50), “medium” (0.50 < GQI ≤ 0.75), and “high” (GQI > 0.75). In addition, we applied k-means clustering to the same normalized metrics, with the optimal number of clusters chosen via silhouette scores, to explore genome-quality typologies.
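A minimal sketch of this construction is given below. Two points are assumptions, not details stated in the text: the four normalized components are combined as an unweighted mean, and a small pseudocount (`eps`) is added before taking log10 of the unmapped read percentage so that a value of zero does not fail. Inversion is done as 1 − x after normalization, which is equivalent to negating the values before min–max scaling.

```python
import math


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.5 (sketch convention)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def gqi_scores(busco, contigs, n50, unmapped, eps=0.01):
    """Composite score per genome; equal weighting is an assumption of this sketch."""
    busco_n = min_max(busco)
    n50_n = min_max([math.log10(v) for v in n50])
    # Fewer contigs and fewer unmapped reads should score higher, so invert.
    contigs_n = [1 - s for s in min_max([math.log10(v) for v in contigs])]
    unmapped_n = [1 - s for s in min_max([math.log10(v + eps) for v in unmapped])]
    components = zip(busco_n, n50_n, contigs_n, unmapped_n)
    return [sum(parts) / 4 for parts in components]


def quality_tier(gqi):
    """Tier cut-offs as given in the text."""
    if gqi <= 0.50:
        return "low"
    if gqi <= 0.75:
        return "medium"
    return "high"


if __name__ == "__main__":
    # Three invented genomes, ordered from best to worst on every metric.
    scores = gqi_scores(
        busco=[99.0, 95.0, 80.0],
        contigs=[20, 100, 500],
        n50=[1_000_000, 100_000, 10_000],
        unmapped=[0.1, 1.0, 5.0],
    )
    for s in scores:
        print(round(s, 3), quality_tier(s))
```

Because min–max scaling depends on the extremes of the dataset, scores computed this way are comparable only among genomes normalized together.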

2.4. Species-Level and Comparative Analyses

Species-level differences in GQI were tested using one-way ANOVA followed by Tukey’s HSD post hoc tests. To compare our curated All_QC set (obtained from 474 genomes) to the Enterobacteriaceae validation cohort (n = 5781), we used Welch’s t-test or the Mann–Whitney U test depending on normality, along with Kolmogorov–Smirnov tests to compare entire distributions.
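For readers unfamiliar with the test statistic, the one-way ANOVA F value is the ratio of between-group to within-group mean squares. The sketch below computes it from first principles on invented numbers. It does not reproduce the study's analysis, and the p-value and Tukey's HSD comparisons would normally come from a statistics library.

```python
def one_way_anova_f(groups):
    """F statistic for k groups: (SS_between / (k - 1)) / (SS_within / (N - k))."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


if __name__ == "__main__":
    # Invented GQI-like values for three hypothetical species.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```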

2.5. Machine-Learning Classification

We trained a Random Forest classifier (scikit-learn) using the four raw metrics (BUSCO complete, contig count, N50, unmapped read percentage) to predict the three GQI-defined quality tiers. The dataset was split into 80% training and 20% test sets, stratified by class. Hyperparameters were tuned via five-fold cross-validation on the training set. Performance was evaluated using accuracy, macro-averaged F1-score, and confusion matrices on the held-out test set.
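The training procedure can be sketched with scikit-learn as follows. The hyperparameter grid, random seed, and synthetic data generator are assumptions made for illustration. The study's actual grid and data are not given in the text, and the toy labelling rule here is not the GQI.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

TIERS = ["low", "medium", "high"]


def train_tier_classifier(X, y, seed=0):
    """80/20 stratified split, five-fold CV tuning, evaluation on the held-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed
    )
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        # Illustrative grid only; not the grid used in the study.
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
        cv=5,
        scoring="f1_macro",
    )
    search.fit(X_train, y_train)
    predicted = search.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "macro_f1": f1_score(y_test, predicted, average="macro"),
        "confusion": confusion_matrix(y_test, predicted, labels=TIERS),
    }


def synthetic_dataset(n=300, seed=0):
    """Invented metrics and tiers, used only to make the sketch runnable."""
    rng = np.random.default_rng(seed)
    busco = rng.uniform(60, 100, n)
    contigs = rng.integers(10, 1000, n)
    n50 = rng.uniform(1e4, 2e6, n)
    unmapped = rng.uniform(0.05, 10, n)
    X = np.column_stack([busco, contigs, n50, unmapped])
    toy_score = 0.5 * (busco - 60) / 40 + 0.5 * (1 - contigs / 1000)
    y = np.where(toy_score <= 0.4, "low", np.where(toy_score <= 0.6, "medium", "high"))
    return X, y


if __name__ == "__main__":
    X, y = synthetic_dataset()
    result = train_tier_classifier(X, y)
    print(result["accuracy"], result["macro_f1"])
    print(result["confusion"])
```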

2.6. Threshold Derivation

To propose empirical quality thresholds, we examined genomes in the top decile of GQI (>0.85) and computed the 10th percentile for BUSCO completeness and N50, and the 90th percentile for contig count and unmapped reads. These percentiles were used to derive threshold values that characterize consistently high-quality genomes.
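The percentile-based derivation can be sketched as below. The dictionary keys and the linear-interpolation percentile are choices made for this example, since the text does not specify which percentile method was used.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    data = sorted(values)
    position = (len(data) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (position - lower)


def derive_thresholds(genomes, gqi_cutoff=0.85):
    """Lower bounds from the 10th percentile, upper bounds from the 90th,
    computed only over genomes whose GQI exceeds the cut-off."""
    top = [g for g in genomes if g["gqi"] > gqi_cutoff]
    if not top:
        raise ValueError("no genomes above the GQI cut-off")
    return {
        "busco_min": percentile([g["busco"] for g in top], 10),
        "n50_min": percentile([g["n50"] for g in top], 10),
        "contigs_max": percentile([g["contigs"] for g in top], 90),
        "unmapped_max": percentile([g["unmapped"] for g in top], 90),
    }


if __name__ == "__main__":
    # Invented records: eleven high-GQI genomes and one low-GQI genome.
    demo = [
        {"gqi": 0.9, "busco": 90 + i, "n50": 1000 * (i + 1),
         "contigs": 10 + i, "unmapped": i / 10}
        for i in range(11)
    ]
    demo.append({"gqi": 0.5, "busco": 50, "n50": 10, "contigs": 999, "unmapped": 9.0})
    print(derive_thresholds(demo))
```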

3. Results and Discussion

3.1. Taxonomy Assignment

We used the 16S RDP classifier to assign taxonomy to the 474 whole genomes. Fourteen of these genomes contained 16S rRNA gene sequences from other bacteria, primarily Bradyrhizobium sp. Such contamination can result in incorrect taxonomy assignment and strain identification. The existence of contaminated genomes within public resources undermines the reliability of downstream analysis [22]. Notably, all of the contaminated genomes were of low quality in terms of completeness, contiguity, and accuracy. This observation aligns with previous studies that reported similar contamination issues in genomes sequenced using Illumina-based short-read sequencers and among publicly available genomes in NCBI [23,24]. Therefore, implementing proper screening procedures during genome submission to public databases is crucial to improve the overall quality of these databases.

3.2. Distributions of Completeness, Contiguity, and Accuracy Metrics

We first summarized the distributions of completeness, contiguity, and accuracy metrics for the 474 assemblies (Figure 2A). The median value of complete single-copy BUSCO was 99.10% with an interquartile range (IQR) of 4.5%. Median percentages of duplicated, fragmented, and missing BUSCO were 0.30%, 0.10% and 0.40%, with IQRs of 0.20%, 0.80% and 0.90%, respectively.
For contiguity, the median number of contigs was 113 (IQR 209) and the median N50 was 112,932 bp (IQR 156,297 bp). For accuracy, the median percentage of unmapped reads was 0.65% (IQR 1.88%). The relatively wide IQRs, particularly for contig count, N50, and unmapped reads, reflect the presence of both high-quality and poor-quality genomes in the dataset.
These values are broadly consistent with previous reports that consider genomes with complete single-copy BUSCO above 90% and low levels of missing, duplicated, and fragmented BUSCO as high-quality [25]. In our dataset, 392 genomes (83%) had complete single-copy BUSCO > 90%; 444 (94%), 370 (78%), and 372 (78%) had less than 2% duplicated, fragmented, and missing BUSCO, respectively (Supplementary File S3). Regarding contiguity, 354 genomes (75%) had N50 values greater than 50 kb, and 346 (73%) had fewer than 200 contigs. For accuracy, 348 genomes (73%) had less than 2% unmapped reads. While more than 70% of genomes showed favorable values for individual metrics, they did not always perform consistently across all metrics. This motivated a more detailed examination of relationships among completeness, contiguity, and accuracy.

3.3. Relationships Among Quality Metrics and Non-Linear Trends

The correlation among the completeness, contiguity, and accuracy parameters was assessed using Spearman’s rank order correlation coefficient. Statistically significant correlations (p < 0.05) are shown in Figure 2B and are described using the strength categories defined in Section 2.2. The percentage of complete single-copy BUSCO showed a moderate negative correlation (ρ = −0.61) with the number of contigs. It also exhibited a weak negative correlation with unmapped reads (ρ = −0.3) and a weak positive correlation with N50 values (ρ = 0.2). However, it is challenging to establish a consistent relationship between complete single-copy BUSCO, N50 values, and the number of contigs because the dataset contains genomes with diverse quality metrics. Some genomes with low N50 values and a higher number of contigs still had a wide range of complete single-copy BUSCO percentages and vice versa.
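Spearman's coefficient is the Pearson correlation of the ranks, with tied values receiving their average rank. The following self-contained sketch illustrates the calculation on invented values and does not reproduce the study's analysis.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented example: more contigs paired with lower completeness.
    contigs = [20, 60, 150, 400, 900]
    busco = [99.5, 99.0, 97.0, 92.0, 80.0]
    print(spearman_rho(contigs, busco))
```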
This pattern likely reflects that BUSCO targets conserved single-copy orthologs, which can remain intact even when the assembly is highly fragmented. Repetitive or high copy elements such as rRNA operons, tRNA clusters, or mobile elements can break assemblies locally without greatly affecting these core genes, and short high confidence contigs in high coverage datasets may still contain complete BUSCOs. Consequently, BUSCO completeness alone can overestimate the quality and usability of fragmented genomes, reinforcing the need to integrate multiple metrics as in our GQI framework.
We observed a strong positive association (0.73) between fragmented BUSCO and the number of contigs. Fragmented BUSCO also showed strong associations (0.8) with N50 values and unmapped reads. The percentage of missing BUSCO had a weak negative correlation (−0.25) with N50 values, a moderate positive correlation (0.5) with the number of contigs, and a strong positive correlation (0.74) with unmapped reads. Duplicated BUSCO values did not show a significant correlation with N50 values or unmapped reads; however, they exhibited a moderate positive correlation (0.42) with the number of contigs. Although these trends were not consistent across all genomes, they provide insights into potential associations among the parameters. The negative relationship between single-copy BUSCO and the number of contigs, along with the positive relationships of the other three completeness parameters, suggests that genomes with fewer contigs are better assembled and exhibit higher completeness. In such genomes, N50 values tend to be higher and the percentage of unmapped reads lower, as indicated by their associations. Higher percentages of fragmented and duplicated BUSCO can indicate contamination in the genome sequences and are often associated with an elevated number of contigs and lower N50 values. Conversely, lower percentages of these completeness parameters suggest better contiguity and accuracy of the genomes [14].
To explore potential non-linear relationships, we examined six pairwise combinations of four key metrics (complete single-copy BUSCO, log10(number of contigs), log10(N50), and log10(unmapped reads)) using linear regression (Figure 3). As expected, the strongest linear trend was the inverse relationship between contig number and N50. Relationships involving unmapped reads and BUSCO scores displayed curvature (e.g., an inverted-U pattern between BUSCO and unmapped reads), indicating that linear fits on raw scales can understate or misrepresent true associations. This justifies the use of log-transforms and multivariate approaches rather than relying solely on simple linear correlations or single-metric thresholds.

3.4. Genome Quality Index (GQI) Construction

GQI values across the 474 genomes ranged from 0.23 to 0.96, with a central cluster between 0.70 and 0.85 (Figure 4A). Using the three GQI tiers, we classified 100 genomes (21%) as low-quality (GQI ≤ 0.50), 179 (38%) as medium-quality (0.50 < GQI ≤ 0.75), and 195 (41%) as high-quality (GQI > 0.75). High GQI genomes showed the expected profile of high completeness, low fragmentation, and strong read support, whereas low GQI genomes typically combined reduced completeness, many contigs, small N50 values, and elevated unmapped reads. Notably, several genomes that would be considered acceptable based on a single metric, such as BUSCO completeness above 90%, fell into the medium or low GQI tiers. This illustrates the added resolution obtained by integrating multiple indicators into a single composite score.

3.5. Species-Level Differences in Genome Quality

GQI allowed us to explore species-specific patterns (Figure 4B). Stacked bar plots of quality tiers revealed that Cronobacter sakazakii and Listeria monocytogenes genomes were predominantly high-quality, with very few low-GQI assemblies (Figure 4A). In contrast, Klebsiella pneumoniae and Mycobacterium tuberculosis displayed a broader spread of GQI values, including a substantial fraction of medium- and low-quality genomes.
Boxplots and ranked mean GQI values by species (Figure 4B,C) confirmed these patterns. One-way ANOVA showed that mean GQI differed significantly among species (F = 39.37, p < 1.05 × 10⁻³³), and Tukey’s HSD tests identified pronounced contrasts between consistently high-quality taxa (e.g., C. sakazakii, L. monocytogenes) and more variable taxa (e.g., K. pneumoniae, M. tuberculosis). These differences likely reflect a combination of biological factors (e.g., genome size, GC content, repeat structure) and technical aspects such as sequencing depth and assembly strategy [7,25,26].
From a surveillance standpoint, the presence of intermediate- and low-GQI genomes among common pathogens such as K. pneumoniae and Salmonella enterica suggests that public collections include assemblies that may not be suitable for high-resolution comparative analyses without further curation.

3.6. Comparison with Public Enterobacteriaceae Genomes

To test the generality of the framework, we applied the same pipeline and GQI calculation to an independent Enterobacteriaceae dataset. The resulting GQI distributions broadly overlapped with those of the curated All_QC set but were shifted slightly toward lower values and showed greater dispersion (Figure 5A). Enterobacteriaceae assemblies also contained a larger fraction of low and medium GQI genomes (Figure 5B).
These findings are consistent with previous work documenting variable quality, contamination and incomplete metadata in public genomes from Enterobacteriaceae and other pathogens [3,4,22]. The comparison illustrates how GQI can be used as a screening tool: genomes with very low GQI values can be flagged for reassembly, reannotation, or exclusion from sensitive analyses such as AMR surveillance and outbreak investigation [27,28,29].

3.7. Machine-Learning Prediction of Quality Tiers

We next evaluated whether quality tiers could be predicted automatically from the four raw metrics. The Random Forest classifier achieved 97% accuracy on the held-out test set, with macro-averaged F1-scores above 0.95 for all three classes. Feature importance analysis indicated that contig count and N50 contributed most strongly to discrimination, followed by BUSCO completeness and unmapped read percentage.
This high performance shows that a small set of interpretable metrics is sufficient for automated quality tier assignment once a composite score such as GQI is defined. Such classifiers could be embedded into submission portals or institutional pipelines to provide instant feedback to depositors, similar to the quality dashboards used in other genomics contexts [6,9,10,15,27,30].

3.8. Empirical Thresholds for High-Quality Genomes

We next derived empirical thresholds for high-quality genomes by examining the top decile of assemblies (GQI > 0.85). These genomes shared a characteristic profile: complete single-copy BUSCO ≥ 98.6%, ≤30 contigs, N50 ≥ 1 Mb and unmapped reads ≤0.82%. These values are broadly consistent with, but somewhat more stringent than, criteria proposed in MISAG/MIMAG and other frameworks for high-quality bacterial and metagenome-assembled genomes [10,26].
When applied back to the full dataset, these thresholds captured most high-GQI genomes while excluding most low-GQI assemblies, confirming their internal consistency with the composite index. We therefore propose them as practical guidance for curators and submitters, with the caveat that taxon-specific adjustments may be necessary for particularly challenging genomes.
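Applied as a screening rule, the four thresholds reduce to a single conjunction. The sketch below uses the values reported above; the function and argument names are chosen for this example.

```python
# Thresholds reported for the top GQI decile.
BUSCO_MIN_PCT = 98.6
CONTIGS_MAX = 30
N50_MIN_BP = 1_000_000
UNMAPPED_MAX_PCT = 0.82


def meets_high_quality_thresholds(busco_pct, contigs, n50_bp, unmapped_pct):
    """True only when all four empirical thresholds are satisfied."""
    return (
        busco_pct >= BUSCO_MIN_PCT
        and contigs <= CONTIGS_MAX
        and n50_bp >= N50_MIN_BP
        and unmapped_pct <= UNMAPPED_MAX_PCT
    )


if __name__ == "__main__":
    # Invented example genomes.
    print(meets_high_quality_thresholds(99.1, 25, 1_200_000, 0.40))
    print(meets_high_quality_thresholds(99.1, 113, 112_932, 0.65))
```

A genome can exceed 90% BUSCO completeness and still fail this rule on contiguity, which is the situation the composite index is designed to expose.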

3.9. Biological Implications of Low-Quality Genomes

Low-quality assemblies have direct consequences for biological interpretation and public health. Fragmentation can obscure mobile genetic elements, genomic islands, and structural variants, while misassemblies may introduce false positives or negatives in gene presence–absence matrices, distort phylogenies, and misclassify transmission clusters, especially when strains differ by only a small number of SNPs [2,3,26,27,29].
Our results show that a non-trivial fraction of genomes in contemporary surveillance efforts and public repositories fall below conservative GQI-based thresholds. This echoes prior calls for stricter submission guidelines, systematic quality checks, and integration of assembly metrics into surveillance pipelines [15,29,31]. GQI offers a quantitative and interpretable scaffold for such initiatives, enabling consistent triage of genomes for reassembly, reannotation, or exclusion from high-stakes analyses.

3.10. Methodological Considerations and Limitations

Several limitations warrant discussion. First, we did not have explicit per-sample coverage information for all genomes. While low unmapped read fractions and high BUSCO completeness suggest adequate coverage for most assemblies, certain edge cases, particularly those with high unmapped reads, could reflect low or uneven coverage. Future work integrating coverage profiles and depth-based metrics would refine GQI further [30,32].
Second, we chose BUSCO rather than CheckM as the primary completeness metric because BUSCO provides gene-level resolution across a broad bacterial lineage set and is widely integrated into assembly workflows [11,14]. Nonetheless, CheckM’s explicit contamination estimates and lineage-specific models offer clear advantages for metagenome-assembled genomes and complex communities [16]. We therefore view GQI as complementary to CheckM and anticipate that future implementations could integrate both metrics.
Third, our analysis was based on re-assembled genomes rather than the originally submitted assemblies. This design was intentional: by standardizing the assembly pipeline, we sought to isolate intrinsic genome properties and raw data quality from submission pipeline variability. However, it means that our study evaluates the quality achievable under a uniform pipeline, not the exact quality of deposited assemblies. For repository auditing, a future study could compute GQI directly on in situ assemblies, perhaps stratifying by assembler or sequencing technology [1,3,6,25,26,27].
Finally, while our case study focuses on South Korean isolates to leverage homogeneous metadata and raw reads, the framework is not geographically constrained. Applying GQI across multiple countries, sequencing centers and genome types (including metagenome-assembled genomes) will be an important next step.

4. Conclusions

We developed a composite Genome Quality Index (GQI) that integrates BUSCO completeness, N50, contig count and unmapped read percentage into a single, interpretable score for bacterial genomes. Applied to 474 re-assembled pathogenic genomes and an independent Enterobacteriaceae dataset, GQI captures established relationships among completeness, contiguity and accuracy metrics while revealing non-linear behavior that is not apparent from single metrics alone. It identifies species-specific quality patterns, which indicates that expectations for genome quality should be taxon-aware, and it distinguishes high-, medium- and low-quality assemblies more effectively than individual metrics, thereby supporting robust quality-tier classification. It further enables empirical derivation of practical thresholds for high-quality genomes (BUSCO ≥ 98.6%, ≤30 contigs, N50 ≥ 1 Mb, unmapped reads ≤0.82%) and can be approximated with high accuracy by a Random Forest classifier using only four routinely reported metrics. By complementing tools such as BUSCO, QUAST, ALE, gVolante, CheckM, and Hybracter [11,12,13,14,15,16,17], the GQI framework provides an actionable, reproducible, and scalable approach to elevating genome-quality standards in microbial genomics. Its integration into submission portals, repository dashboards, and pathogen surveillance pipelines will help ensure that downstream analyses, from AMR surveillance to comparative evolutionary studies, are based on genomes whose quality has been assessed in a transparent and quantitatively rigorous manner.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/applmicrobiol5040144/s1, Supplementary File S1: primary dataset of 474 pathogenic bacterial genomes from the NCBI pathogen detection (NCBI-PD) database; Supplementary File S2: Independent Enterobacteriaceae set (n = 5781) comprising public genomes from Escherichia coli, Salmonella enterica, Klebsiella pneumoniae, and related taxa with ≥20 assemblies per species and available raw reads or high-quality assemblies; Supplementary File S3: dataset.

Author Contributions

Conceptualization, A.F. and A.R.; methodology, A.F. and A.R.; software, A.F.; validation, A.F., A.R.; formal analysis, A.F. and A.R.; investigation, A.F.; resources, A.F.; data curation, A.F. and A.R.; writing—original draft preparation, A.F. and A.R.; writing—review and editing, A.F.; visualization, A.F. and A.R.; supervision, A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available by the author upon request.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BUSCO: Benchmarking Universal Single-Copy Orthologs
ALE: Assembly Likelihood Evaluation
QUAST: Quality Assessment Tool
GQI: Genome Quality Index
WGS: Whole Genome Sequencing
NCBI: National Center for Biotechnology Information
NCBI-PD: NCBI Pathogen Detection database
SRA: Sequence Read Archive
RDP: Ribosomal Database Project
IQR: Interquartile Range
ANOVA: Analysis of Variance
RF: Random Forest
Kb: Kilobase
Mb: Megabase

References

  1. Chiara, M.; Pavesi, G. Evaluation of quality assessment protocols for high throughput genome resequencing data. Front. Genet. 2017, 8, 94.
  2. Timme, R.E.; Wolfgang, W.J.; Balkey, M.; Venkata, S.L.G.; Randolph, R.; Allard, M.; Strain, E. Optimizing open data to support one health: Best practices to ensure interoperability of genomic data from bacterial pathogens. One Health Outlook 2020, 2, 20.
  3. Robertson, J.; Yoshida, C.; Kruczkiewicz, P.; Nadon, C.; Nichani, A.; Taboada, E.N.; Nash, J.H.E. Comprehensive assessment of the quality of Salmonella whole genome sequence data available in public sequence databases using the Salmonella in silico typing resource (SISTR). Microb. Genom. 2018, 4, e000151.
  4. Francois, C.M.; Durand, F.; Figuet, E.; Galtier, N. Prevalence and implications of contamination in public genomic resources: A case study of 43 reference arthropod assemblies. G3 Genes Genomes Genet. 2020, 10, 721–730.
  5. Xie, L.; Wong, L. PDR: A new genome assembly evaluation metric based on genetics concerns. Bioinformatics 2020, 37, 289–295.
  6. Jung, H.; Ventura, T.; Chung, J.S.; Kim, W.-J.; Nam, B.-H.; Kong, H.J.; Kim, Y.-O.; Jeon, M.-S.; Eyun, S.-I. Twelve quick steps for genome assembly and annotation in the classroom. PLoS Comput. Biol. 2020, 16, e1008325.
  7. Jauhal, A.A.; Newcomb, R.D. Assessing genome assembly quality prior to downstream analysis: N50 versus BUSCO. Mol. Ecol. Resour. 2021, 21, 1416–1421.
  8. Studholme, D.J. Genome update. Let the consumer beware: Streptomyces genome sequence quality. Microb. Biotechnol. 2016, 9, 3–7.
  9. Whalley, J.P.; Buchhalter, I.; Rheinbay, E.; Raine, K.M.; Stobbe, M.D.; Kleinheinz, K.; Werner, J.; Beltran, S.; Gut, M.; Hübschmann, D.; et al. Framework for quality assessment of whole genome cancer sequences. Nat. Commun. 2020, 11, 5040.
  10. Bowers, R.M.; Kyrpides, N.C.; Stepanauskas, R.; Harmon-Smith, M.; Doud, D.; Reddy, T.B.K.; Schulz, F.; Jarett, J.; Rivers, A.R.; Eloe-Fadrosh, E.A.; et al. Minimum information about a single amplified genome (MISAG) and a metagenome assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 2017, 35, 725–731.
  11. Simão, F.A.; Waterhouse, R.M.; Ioannidis, P.; Kriventseva, E.V.; Zdobnov, E.M. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015, 31, 3210–3212.
  12. Gurevich, A.; Saveliev, V.; Vyahhi, N.; Tesler, G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 2013, 29, 1072–1075.
  13. Clark, S.C.; Egan, R.; Frazier, P.I.; Wang, Z. ALE: A generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics 2013, 29, 435–443.
  14. Waterhouse, R.M.; Seppey, M.; Simão, F.A.; Manni, M.; Ioannidis, P.; Klioutchnikov, G.; Kriventseva, E.V.; Zdobnov, E.M. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 2018, 35, 543–548.
  15. Nishimura, O.; Hara, Y.; Kuraku, S. gVolante for standardizing completeness assessment of genome and transcriptome assemblies. Bioinformatics 2017, 33, 3635–3637.
  16. Parks, D.H.; Imelfort, M.; Skennerton, C.T.; Hugenholtz, P.; Tyson, G.W. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015, 25, 1043–1055.
  17. Bouras, G.; Houtak, G.; Wick, R.R.; Mallawaarachchi, V.; Roach, M.J.; Papudeshi, B.; Judd, L.M.; Sheppard, A.E.; Edwards, R.A.; Vreugde, S. Hybracter: Enabling scalable, automated, complete and accurate bacterial genome assemblies. Microb. Genom. 2024, 10, e001244.
  18. Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34, i884–i890.
  19. Prjibelski, A.; Antipov, D.; Meleshko, D.; Lapidus, A.; Korobeynikov, A. Using SPAdes de novo assembler. Curr. Protoc. Bioinform. 2020, 70, e102.
  20. Walker, B.J.; Abeel, T.; Shea, T.; Priest, M.; Abouelliel, A.; Sakthikumar, S.; Cuomo, C.A.; Zeng, Q.; Wortman, J.; Young, S.K.; et al. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 2014, 9, e112963.
  21. Lan, Y.; Wang, Q.; Cole, J.R.; Rosen, G.L. Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms. PLoS ONE 2012, 7, e32491.
  22. Gonçalves, R.S.; Musen, M.A. The variable quality of metadata about biological samples used in biomedical experiments. Sci. Data 2019, 6, 190021.
  23. Jeong, H.; Pan, J.-G.; Park, S.-H. Contamination as a major factor in poor Illumina assembly of microbial isolate genomes. bioRxiv 2016.
  24. Steinegger, M.; Salzberg, S.L. Terminating contamination: Large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol. 2020, 21, 115.
  25. Jung, H.; Winefield, C.; Bombarely, A.; Prentis, P.; Waterhouse, P. Tools and strategies for long-read sequencing and de novo assembly of plant genomes. Trends Plant Sci. 2019, 24, 700–724.
  26. Molina-Mora, J.A.; Campos-Sánchez, R.; Rodríguez, C.; Shi, L.; García, F. High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Sci. Rep. 2020, 10, 1392.
  27. Bradnam, K.R.; Fass, J.N.; Alexandrov, A.; Baranay, P.; Bechner, M.; Birol, I.; Boisvert, S.; Chapman, J.A.; Chapuis, G.; Chikhi, R.; et al. Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2013, 2, 10.
  28. Jayakumar, V.; Sakakibara, Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief. Bioinform. 2019, 20, 866–876.
  29. Land, M.L.; Hyatt, D.; Jun, S.-R.; Kora, G.H.; Hauser, L.J.; Lukjancenko, O.; Ussery, D.W. Quality scores for 32,000 genomes. Stand. Genom. Sci. 2014, 9, 20.
  30. Pfeifer, S.P. From next-generation resequencing reads to a high-quality variant data set. Heredity 2017, 118, 111–124.
  31. Lischer, H.E.L.; Shimizu, K.K. Reference-guided de novo assembly approach improves genome reconstruction for related species. BMC Bioinform. 2017, 18, 474.
  32. Liao, X.; Li, M.; Zou, Y.; Wu, F.-X.; Yi, P.; Wang, J. Current challenges and solutions of de novo assembly. Quant. Biol. 2019, 7, 90–109.
Figure 1. Workflow for composite genome quality assessment and Genome Quality Index (GQI) construction. Public Illumina short-read data were retrieved using the SRA Toolkit and pre-processed with fastp, followed by de novo assembly and polishing with SPAdes and Pilon. Taxonomy was validated using 16S rRNA sequences and the RDP Classifier. Assembly contiguity (N50 and number of contigs) was quantified with QUAST, completeness with BUSCO, and accuracy by remapping reads and evaluating assemblies with ALE. These metrics were integrated in a central Genome Quality Assessment step and combined using principal component analysis to derive the Genome Quality Index (GQI), a composite score that was subsequently used for quality-tier classification, species-level comparisons, clustering, and derivation of empirical quality thresholds.
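The final integration step of this workflow can be illustrated with a minimal sketch. The caption does not give the computational details, so the sketch assumes that skewed metrics are log-transformed beforehand, that each metric is z-scored and oriented so that larger values mean better quality, and that the first principal component is min-max rescaled to the range 0 to 1. The function name `composite_index`, the orientation flags, and the toy values are hypothetical and are not taken from the published pipeline.

```python
import numpy as np

def composite_index(metrics, higher_is_better):
    """Hypothetical PCA-style composite score (an assumption, not the published GQI).

    metrics: rows are genomes, columns are quality metrics.
    higher_is_better: one boolean per column.
    """
    x = np.asarray(metrics, dtype=float)
    # Orient every column so that larger values mean better quality.
    x = x * np.where(higher_is_better, 1.0, -1.0)
    # Standardize each column (z-scores) so no metric dominates by scale.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # The first right singular vector of z is the first principal component.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = vt[0]
    # The sign of a singular vector is arbitrary; flip it so better is higher.
    if pc1.sum() < 0:
        pc1 = -pc1
    scores = z @ pc1
    # Min-max rescale to [0, 1] within this dataset.
    return (scores - scores.min()) / (scores.max() - scores.min())

# Toy columns: BUSCO single-copy %, contig count, N50 (bp), unmapped reads %.
toy = np.array([
    [99.5, 20, 1.5e6, 0.3],
    [98.0, 80, 3.0e5, 1.0],
    [96.0, 150, 1.5e5, 2.0],
    [85.0, 900, 2.0e4, 8.0],
])
toy[:, 1:3] = np.log10(toy[:, 1:3])  # log-transform the skewed columns
scores = composite_index(toy, [True, False, True, False])
```

Because the rescaling is relative to the dataset being scored, scores produced this way are comparable only within one dataset. Fixed reference values would be needed to compare across datasets.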
Figure 2. Overview of quality metrics and their associations in 474 pathogenic bacterial genomes. (A) Violin plots showing the medians and interquartile ranges of seven quality parameters: single-copy BUSCO, duplicated BUSCO, fragmented BUSCO, missing BUSCO, number of contigs, N50 values, and unmapped reads. (B) Spearman rank-order correlations among the genome completeness, contiguity, and accuracy metrics (n = 7) in the NCBI-PD whole-genome dataset. Circle size and color represent the strength of the correlation: dark blue indicates strong, significant positive correlations, and dark orange indicates strong, significant negative correlations. A p value < 0.05 was considered statistically significant.
Figure 3. Pairwise relationships between BUSCO completeness, contiguity, and read-mapping metrics. (A) Single-copy BUSCOs (%) versus number of contigs, (B) N50 values (log-transformed) versus single-copy BUSCOs (%), (C) single-copy BUSCOs (%) versus proportion of unmapped reads, (D) N50 values (log-transformed) versus number of contigs, (E) number of contigs versus proportion of unmapped reads, and (F) N50 values (log-transformed) versus proportion of unmapped reads. Each point represents one genome assembly; red lines show the fitted linear regression, and the corresponding Pearson correlation coefficient (r) and coefficient of determination (R²) are indicated in each panel.
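The per-panel statistics can be reproduced in outline with the sketch below, which fits an ordinary least-squares line and reports Pearson's r together with R² (for a single-predictor linear fit, R² equals r²). The function name `fit_line` and the toy values are hypothetical, and the choice of base-10 logarithm for N50 is an assumption, since the caption says only "log-transformed".

```python
import math

def fit_line(x, y):
    """Return (r, r_squared, slope, intercept) for a simple linear fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)    # Pearson correlation coefficient
    slope = sxy / sxx                 # least-squares slope
    intercept = mean_y - slope * mean_x
    return r, r * r, slope, intercept

# Toy values in the spirit of panel (D): contig count and log10-transformed N50.
contigs = [20, 80, 150, 900]
log_n50 = [math.log10(v) for v in (1.5e6, 3.0e5, 1.5e5, 2.0e4)]
r, r_squared, slope, intercept = fit_line(contigs, log_n50)
```

With these toy values r is negative, which matches the direction of the contig-count and N50 relationship, though not its reported magnitude.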
Figure 4. Species-level variation in Genome Quality Index (GQI) among pathogenic bacterial genomes. (A) Boxplots summarizing the distribution of GQI values by species. Central lines indicate medians, boxes represent interquartile ranges (IQRs), whiskers extend to 1.5 × IQR, and points denote outliers. The boxplot colors are used only to aid visualization and do not encode any additional categories. (B) Stacked bar plots showing the proportion of genomes in each GQI tier (low, medium, high) for species represented by ≥5 genomes. Tiers are defined as low (GQI ≤ 0.50), medium (0.50 < GQI ≤ 0.75), and high (GQI > 0.75). (C) Mean GQI for each species, ranked from lowest to highest. The dashed horizontal line marks the overall mean GQI across all genomes. Species differ significantly in their GQI distributions (one-way ANOVA F = 39.37, p < 1.05 × 10⁻³³), with taxa such as Cronobacter sakazakii and Listeria monocytogenes enriched for high-quality genomes and species such as Klebsiella pneumoniae and Mycobacterium tuberculosis exhibiting broader and lower quality spectra. The colors of the horizontal bars are used only to aid visualization and do not represent additional categories.
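The tier cut-offs stated in this caption translate directly into a small classifier. The sketch below is illustrative only: the function names `gqi_tier` and `tier_proportions` and the example scores are hypothetical, while the boundaries follow the caption (a GQI of exactly 0.50 is low and exactly 0.75 is medium).

```python
from collections import Counter

def gqi_tier(gqi):
    """Map a GQI score to the tier named in the caption."""
    if gqi <= 0.50:
        return "low"
    if gqi <= 0.75:
        return "medium"
    return "high"

def tier_proportions(gqi_values):
    """Fraction of genomes in each tier, as plotted in a stacked bar."""
    counts = Counter(gqi_tier(v) for v in gqi_values)
    total = len(gqi_values)
    return {tier: counts[tier] / total for tier in ("low", "medium", "high")}

# Hypothetical scores for one species.
example = tier_proportions([0.23, 0.50, 0.75, 0.96])
```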
Figure 5. Validation of the Genome Quality Index (GQI) using an independent Enterobacteriaceae dataset. (A) Comparison of GQI distributions between the primary pathogenic genome dataset (All_QC, n = 474) and an independent set of Enterobacteriaceae genomes. Boxplots show the distribution of composite GQI scores for the genomes of each species. The central line in each box marks the median, the box shows the interquartile range (IQR), whiskers extend to 1.5 × IQR, and individual points represent outlier genomes. Different box colors are used only to distinguish species and do not represent additional categories or values. (B) Distribution of Enterobacteriaceae genomes across GQI quality tiers. Bars indicate the proportion of genomes classified as low (GQI ≤ 0.50), medium (0.50 < GQI ≤ 0.75), and high (GQI > 0.75). The larger fraction of low- and medium-tier genomes in the Enterobacteriaceae set highlights variability in public assembly quality and supports the utility of GQI as a screening tool for repository curation and downstream analyses.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Farooq, A.; Rafique, A. Composite Genome Quality Index for Pathogenic Bacterial Genomes. Appl. Microbiol. 2025, 5, 144. https://doi.org/10.3390/applmicrobiol5040144
