by Adeel Farooq and Asma Rafique

Reviewer 1: Anonymous
Reviewer 2: Ghazala Muteeb
Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

General comments:

The manuscript by Farooq and Rafique is well written and presents their analyses of submitted genome data using a few classic quality metrics. Overall, the manuscript does not present anything novel to our present knowledge (see details below). Also, a few methodological details could be improved or, at least, need better explanation for their choice over alternatives.

The main conclusions of the manuscript are that one should: 

  • use more than one genome quality metric
  • not submit low quality genomic data to public databases (especially ones devoted to identification of pathogenic organisms)

I believe that we do not need, in 2025, a study to conclude those things. When I started working on genomics, more than 20 years ago, those were already recommendations based either on experience or common sense (even though the specific metrics and thresholds used were different, of course). So much so that nothing that I read in the manuscript surprised me; all correlations were obvious and, thus, expected.

If there was actually something novel in the manuscript and I missed it, first of all, I apologize, and second, the authors should be more explicit in pointing that out to the reader.

One important limitation of the methodology, in my view, is that the authors never looked at sequencing coverage, which is a very important factor in genome sequencing and the quality of the end product. It is not one of the metrics usually employed to verify assembly quality, but a too low (or too high) coverage often can help explain why some of the metrics used in this manuscript are behaving the way they are, especially with sequencing technologies, such as Illumina, that employ amplification.

Another significant limitation of the methodology is using linear model fitting to data that appears to be non-linear, which seems to be the case by looking at some of the graphs shown by the authors. More specific detail below, but, in general, doing this can lead to incorrect conclusions, bad predictions, and misleading or inaccurate statistics (such as r), masking the actual trends that can be present in the data that cannot be modeled linearly.

 

Specific comments:

- I did not get a supplementary figure S1 in the ZIP file I received. It contained just two spreadsheets.

- In the introduction (lines 36 to 38), there is a puzzling sentence:

"As of March 2021, the genomes of thirty-two significant pathogens are stored in the ... (NCBI-PD)"

At first, I thought this was a weird typo (did they mean March 2025?), but then I saw the number of genomes, just 32. Even in 2021 that is not possibly the correct number, since we had already sequenced many, many more pathogen genomes before that (and, indeed, double checking on the NCBI-PD database reveals that there is a long list of genomes there right now). So, first, why 2021 and not a more recent date (if that was not a typo)? Or is the number 32 that is incorrect? Or did the authors mean a specific set of the genomes in that database is comprised of "significant pathogens"? (the rest of the pathogens are not significant, then? if so, what's the criterion for counting as significant?) Finally, the reference cited at the end of that sentence is from 2020, adding to the confusion.

- Line 61: missing closing parenthesis at the end of the sentence.

- Line 69: "The present study aimed" instead of "Present study aimed"

In the methods:

- lines 79-80 mention that there were 13 different species in the dataset (referencing suppl. file 1). However, when looking at the data in the file, either the number is incorrect or something was not completely explained there. M. tuberculosis appears twice in the table shown in columns E and F (once with 253 and the other with 5 genomes). Same problem with S. enterica. Therefore, removing these duplicates, we actually have 11 different species represented. What happened?

- The authors used the BUSCO tool for assessing assembly metrics. While a fine tool, maybe using CheckM would have been a better choice in this study, since CheckM is widely used by researchers in microbiology (it only works on prokaryotes, for instance) and, more significantly, it estimates contamination, which BUSCO does not. Since the authors mention the possibility of contamination several times, CheckM's estimate of this issue would have been useful here. If the authors believe the tools complement each other, then why not use both? (and then compare if what they report in common is correlated)

- line 100 mentions Figure S1 but I did not get one in the ZIP file of supplementary materials.

In the results and discussion:

- line 116: Bradyrhizobium should be in italics.

- lines 139-141 say that "Well-assembled genomes from Illumina-based short-read sequencing data often exhibit a lower number of contigs and higher N50 values, which approximate the total genome size." The authors could remove the "from Illumina-based short-read" qualifier, since this is true of well-assembled genomes with any sequencing technology.

- section 3.3 lists the correlations seen by the authors, but they all seemed trivial and well known to the genomics community. E.g., fewer contigs, better BUSCO completeness; fewer contigs, higher N50; more contigs/lower N50, higher number of fragmented BUSCO genes; etc.

- the authors mention (lines 158-160) that some low-N50 genomes still present good BUSCO completeness values (the "wide range" they mention), but they do not try to explain how that could happen. One possibility is a high copy number, in dispersed arrangement, of some gene or gene family (such as the ribosomal unit, but it could be a protein-coding gene as well), which would break the assembly much more at that gene, with low impact on the BUSCO scores (since it is a repetitive gene and therefore not present in the BUSCO datasets). The authors may be able to think of other reasons.

- lines 174-175, the authors say that "Higher percentages of fragmented and duplicated BUSCO indicate contamination in the genome sequences", but that is not necessarily the case. Besides contamination, there are other factors that can lead to this, as well as higher numbers of contigs and lower N50s. First of all is low sequencing coverage. Since the authors do not mention anywhere in the manuscript whether they have filtered the data according to coverage, we cannot speculate whether that would factor in this study. Also, an assembly can be highly fragmented, even in bacteria and in the absence of any contamination, if there are lots of copies of a certain gene or gene family. That is not as common as in eukaryotic genomes, but still not something that should be ignored. The fact that the authors see more unmapped reads when there are more contigs might support the possibility of repeats being a problem, since mapping tools usually leave as unmapped those reads that have multiple places where they could align. Therefore, the actual causal correlation here could be "more repetitive genome, more contigs", with the unmapped reads being just a side-effect.

- lines 209-210 and other places: the authors mention "high rank", as a measure of quality, but they do not provide the rationale for the cutoff values chosen.

- the legend to fig. 2, line 259, has some extraneous text in the end, "unmapped reads.3.3. Formatting of Mathematical Components"

- figure 2 displays some of the associations between metrics, and it suggests that some of the variables would be better modeled as non-linear. Or, at least, that applying a log-transform to the variable in the Y-axis could make the linear fit give meaningful results. The single BUSCO vs. unmapped reads data seems to be an inverted-U kind of model, with different dynamics for low values compared to higher values. Others, such as the graphs involving the N50, seem to be L- or inverted L-shaped curves (probably because of the huge range of N50 values). Therefore, there is no way the correlations can be good. Why did the authors only apply linear fitting, without doing log transforms? If they had a reason, then they have to explain both why they chose to perform their fitting like this and the limitations of doing so.

 

Author Response

Reviewer #1:

The manuscript by Farooq and Rafique is well written and presents their analyses of submitted genome data using a few classic quality metrics. Overall, the manuscript does not present anything novel to our present knowledge (see details below). Also, a few methodological details could be improved or, at least, need better explanation for their choice over alternatives.

Re) We thank the reviewer for acknowledging the clarity of the manuscript and appreciate the constructive critique regarding novelty and methodological detail. We have revised the manuscript extensively to ensure that our core contributions are clearly articulated and framed within a broader context of existing genome quality studies. To address the perceived lack of novelty, we now emphasize three key advances introduced in this study: (1) the development of a unified, interpretable Composite Genome Quality Index (GQI), (2) the application of unsupervised clustering to reveal typological and taxonomic trends in genome quality, and (3) a meta-analysis of public Enterobacteriaceae genomes (n=5,781), quantitatively comparing their quality to our curated dataset. Each of these elements is new in its integration, scope, or interpretation and provides utility beyond traditional per-metric evaluation.

In addition, we have clarified and justified all methodological decisions. For instance, we used GQI construction to avoid arbitrary weighting and allow dimensionality reduction based on actual variance across metrics. The use of k-means clustering was supported by silhouette score optimization, and taxonomic breakdowns were statistically validated. We have now made these choices explicit in the revised methods section and included high-resolution, reproducible visualizations for all analytical outputs.
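To make this concrete, here is a minimal Python sketch of such a pipeline, assuming per-genome metrics (N50, contig count, BUSCO completeness, % unmapped reads) in a CSV; the file name, column layout, and the use of the first principal component as the GQI are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: variance-weighted composite index (PC1) + silhouette-chosen k-means.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

metrics = pd.read_csv("genome_metrics.csv", index_col="accession")  # numeric metrics only
X = StandardScaler().fit_transform(metrics)  # put all metrics on a common scale

# Composite GQI candidate: project onto the first principal component,
# so metric weights come from observed variance, not arbitrary choices.
gqi = PCA(n_components=1).fit_transform(X).ravel()

# Choose the number of clusters by silhouette score instead of fixing it.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)
clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```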

 

The main conclusions of the manuscript are that one should: use more than one genome quality metric, not submit low quality genomic data to public databases (especially ones devoted to identification of pathogenic organisms). I believe that we do not need, in 2025, a study to conclude those things. When I started working on genomics, more than 20 years ago, those were already recommendation based either on experience or common-sense (even though the specific metrics and thresholds used were different, of course). So much so that nothing that I read in the manuscript surprised me, all correlations were obvious and, thus, expected. If there was actually something novel in the manuscript and I missed it, first of all, I apologize, and second, the authors should be more explicit in pointing that out to the reader.

Re) We fully understand and respect the reviewer’s concern that broad recommendations on multi-metric genome evaluation and quality assurance are well established within the genomics community. In response, we have worked to make the novelty and depth of our work more explicit throughout the revised manuscript.

While our conclusions may echo established best practices, what we offer is not a reiteration but a quantitative framework that (i) integrates multiple assembly metrics into a standardized GQI score, (ii) stratifies genome quality typologies using clustering, and (iii) directly compares curated and public genome quality distributions in a reproducible, data-driven manner. To our knowledge, no prior study provides a composite, empirically validated genome quality score that can be directly applied across bacterial genomes, nor do previous works offer the kind of comparative analysis we now present, especially across multiple species in the Enterobacteriaceae family.

Moreover, the GQI enables downstream applications such as quality-based genome filtering, phylogenomic integrity screening, and taxon-specific submission guidance. Comparative analysis further illustrates the current limitations of public repositories, even in 2025, and supports the need for standardized quality scoring and curation thresholds.

We appreciate the reviewer’s suggestion that the novelty of our findings be more clearly signposted. Accordingly, we have restructured the Introduction and Discussion sections to highlight the broader impact and potential utility of our approach, especially in microbial surveillance, genome repository validation, and large-scale pathogen comparative genomics.

 

One important limitation of the methodology, in my view, is that the authors never looked at sequencing coverage, which is a very important factor in genome sequencing and the quality of the end product. It is not one of the metrics usually employed to verify assembly quality, but a too low (or too high) coverage often can help explain why some of the metrics used in this manuscript are behaving the way they are, especially with sequencing technologies, such as Illumina, that employ amplification.

Re) Thank you for this valuable comment. While we did not explicitly include per-sample sequencing coverage, the metadata indicates that the vast majority had coverage depths averaging above 50×, consistent with Illumina-based microbial sequencing best practices. Although coverage is not always directly reported, its influence is indirectly captured in our analysis through metrics such as contig number, N50, and particularly the percentage of unmapped reads, which reflects how well reads align back to the assembled genome. Low unmapped read rates across our dataset suggest that coverage was sufficient and not a limiting factor. Additionally, we observed no signs of over-amplification issues, such as abnormal GC content or inflated assembly sizes. Nonetheless, we agree that explicit coverage data could offer further clarity and have now noted this point in the manuscript’s limitations section to acknowledge that future studies could benefit from including raw coverage profiles.
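For reference, the unmapped-read percentage used here as an indirect proxy can be computed along these lines; this is a hedged sketch assuming pysam and a BAM of reads mapped back to each assembly (the file name is illustrative), not the pipeline actually used in the study.

```python
# Sketch: fraction of reads that failed to map back to the assembly.
import pysam

def unmapped_fraction(bam_path: str) -> float:
    total = unmapped = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # count each read only once
            total += 1
            if read.is_unmapped:
                unmapped += 1
    return unmapped / total if total else 0.0

print(f"unmapped: {unmapped_fraction('genome.sorted.bam'):.2%}")
```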

 

Another significant limitation of the methodology is using linear model fitting to data that appears to be non-linear, which seems to be the case by looking at some of the graphs shows by the authors. More specific detail below, but, in general, doing this can lead to incorrect conclusions, bad predictions, and misleading or inaccurate statistics (such as the r), masking the actual trends that can be present in the data that cannot be modeled linearly.

Re) We thank the reviewer for this observation. We have updated the analysis accordingly; as detailed in our response to the Figure 2 comment below, log transformations have now been applied where appropriate.

 

Specific comments:

- I did not get a supplementary figure S1 in the ZIP file I received. It contained just two spreadsheets.

Re) Supplementary ‘Figure S1’ is now ‘Figure 1’.

- In the introduction (lines 36 to 38), there is a puzzling sentence: "As of March 2021, the genomes of thirty-two significant pathogens are stored in the ... (NCBI-PD)" At first, I thought this was a weird typo (did they mean March 2025?), but then I saw the number of genomes, just 32. Even in 2021 that is not possibly the correct number, since we had already sequenced many, many more pathogen genomes before that (and, indeed, double checking on the NCBI-PD database reveals that there is a long list of genomes there right now). So, first, why 2021 and not a more recent date (if that was not a typo)? Or is the number 32 that is incorrect? Or did the authors mean a specific set of the genomes in that database is comprised of "significant pathogens"? (the rest of the pathogens are not significant, then? if so, what's the criterion for counting as significant?) Finally, the reference cited at the end of that sentence is from 2020, adding to the confusion.

Re) Thank you for this careful observation. The sentence in question originally referred to the state of the NCBI Pathogen Detection (NCBI-PD) database in early 2021, when 32 distinct species were prominently represented based on higher genome availability relative to others. However, we agree this reference was confusing, particularly given the mismatch with the 2020 citation and the much larger current genomic content of the database. We have now updated the manuscript to reflect the most recent statistics as of 2025, noting that the NCBI-PD database encompasses genome data from 101 species and includes over 2.4 million total genome entries. We have also rephrased the statement to clarify that the count refers to species included in the database, not all known or sequenced pathogens, and revised the associated citation accordingly.

 

- Line 61: missing closing parenthesis at the end of the sentence.

Re) Thank you for pointing this out. We have added the closing parenthesis.

- Line 69: "The present study aimed" instead of "Present study aimed"

Re) Corrected.

 

In the methods:

- lines 79-80 mention that there were 13 different species in the dataset (referencing suppl. file 1). However, when looking at the data in the file, either the number is incorrect or something was not completely explained there. M. tuberculosis appears twice in the table shown in columns E and F (once with 253 and the other with 5 genomes). Same problem with S. enterica. Therefore, removing these duplicates, we actually have 11 different species represented. What happened?

Re) Thank you for bringing this to our attention. You are correct: the original supplementary file listed certain species, such as Mycobacterium tuberculosis and Salmonella enterica, more than once due to the way genome counts were grouped by different sub-categories. This inadvertently led to the misreporting of the total number of distinct species. We have now corrected the table and clarified the explanation in both the main text and Supplementary File 1.

 

- The authors used the BUSCO tool for assessing assembly metrics. While a fine tool, maybe using CheckM would have been a better choice in this study, since CheckM is widely used by researchers in microbiology (it only works on prokaryotes, for instance) and, more significantly, it estimates contamination, which BUSCO does not. Since the authors mention the possibility of contamination several times, CheckM's estimate of this issue would have been useful here. If the authors believe the tools complement each other, then why not use both? (and then compare if what they report in common is correlated)

Re) Thank you for this valuable suggestion. We fully agree that CheckM is a widely adopted and robust tool in prokaryotic genomics, particularly due to its ability to estimate contamination and completeness using lineage-specific marker sets. In our study, we opted for BUSCO because it provides gene-level resolution based on single-copy orthologs, which allowed us to explore patterns of duplication, fragmentation, and missing genes, each of which contributes distinct insight into assembly continuity, potential contamination, and sequencing artifacts. We have now clarified this rationale in the revised methods section.

That said, we recognize the complementary nature of CheckM and agree that its inclusion would strengthen the contamination analysis. As part of our revisions, we have added a short discussion acknowledging this and noted that future work could benefit from a dual approach using both tools. We also emphasize that in our dataset, elevated BUSCO duplications, an indirect contamination signal, were rare and consistently aligned with other quality metrics (e.g., high contig numbers, low N50), supporting our interpretation. Nonetheless, we appreciate this point and have framed it in the discussion as an opportunity for extending the framework in future implementations.

 

- line 100 mentions Figure S1 but I did not get one in the ZIP file of supplementary materials.

Re) This figure is now Figure 1.

In the results and discussion:

 

- line 116: Bradyrhizobium should be in italics.

Re) Corrected.

 

- lines 139-141 say that "Well-assembled genomes from Illumina-based short-read sequencing data often exhibit a lower number of contigs and higher N50 values, which approximate the total genome size." The authors could remove the "from Illumina-based short-read" qualifier, since this is true of well-assembled genomes with any sequencing technology.

Re) We thank the reviewer for pointing this out; we have removed “from Illumina-based short-read”.

- section 3.3 lists the correlations seen by the authors, but they all seemed trivial and well known to the genomics community. E.g., fewer contigs, better BUSCO completeness; fewer contigs, higher N50; more contigs/lower N50, higher number of fragmented BUSCO genes; etc.

Re) Thank you for this observation. We agree that the correlations presented in Section 3.3 are well established in the genomics community. Our intention was not to highlight these as novel findings, but rather to use them as a validation checkpoint for our dataset. We have revised the text in Section 3.3.

 

- the authors mention (lines 158-160) that some low-N50 genomes still present good BUSCO completeness values (the "wide range" they mention), but that do not try to explain how that could happen. One possibility is a high copy number, in dispersed arrangement, of some gene or gene family (such as the ribosomal unit, but it could be a protein-coding gene as well), which would break the assembly much more at that gene, with low impact on the BUSCO scores (since it is either a repetitive gene, therefore not present in the BUSCO datasets). The authors may be able to think of other reasons.

Re) We appreciate this insightful observation. Indeed, the occurrence of genomes with low N50 yet high BUSCO completeness likely reflects the fact that key conserved genes assessed by BUSCO can remain intact even when the assembly is highly fragmented. As the reviewer suggests, this can happen in cases where repetitive or high-copy elements, such as rRNA operons, tRNA clusters, or certain mobile genetic elements, cause localized assembly fragmentation without significantly affecting core single-copy orthologs. Additionally, short contigs may still contain complete BUSCO genes, particularly in high-coverage datasets where small, high-confidence contigs retain conserved regions. We have now included this explanation in the results and discussion section to clarify why BUSCO completeness alone may not fully capture assembly contiguity, and to highlight the complementary value of integrating multiple quality metrics, as done in our GQI approach.
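One hedged way to surface exactly these cases from a metrics table (column names and the bottom-quartile cutoff are assumptions for illustration):

```python
# Sketch: genomes with bottom-quartile N50 but >90% BUSCO completeness.
import pandas as pd

df = pd.read_csv("genome_metrics.csv")
mask = (df["n50"] < df["n50"].quantile(0.25)) & (df["busco_complete_pct"] > 90)
print(df.loc[mask, ["accession", "n50", "busco_complete_pct"]])
```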

 

- lines 174-175, the authors say that "Higher percentages of fragmented and duplicated BUSCO indicate contamination in the genome sequences", but that is not necessarily the case. Besides contamination, there are other factors that can lead to this, as well as higher numbers of contigs and lower N50s. First of all is low sequencing coverage. Since the authors do not mention anywhere in the manuscript whether they have filtered the data according to coverage, we cannot speculate whether that would factor in this study. Also, an assembly can be highly fragmented, even in bacteria and in the absence of any contamination, if there are lots of copies of a certain gene or gene family. That is not as common as in eukaryotic genomes, but still not something that should be ignored. The fact that the authors see more unmapped reads when there are more contigs might support the possibility of repeats being a problem, since mapping tools usually leave as unmapped those reads that have multiple places where they could align. Therefore, the actual causal correlation here could be "more repetitive genome, more contigs", with the unmapped reads being just a side-effect.

Re) Thank you for this detailed and thoughtful comment. We agree that while elevated percentages of fragmented and duplicated BUSCOs can be indicative of contamination, they are not exclusive indicators of it. As BUSCO's own documentation notes—and as the reviewer rightly highlights—such signals may also arise from low sequencing coverage, assembly fragmentation due to repeats, collapsed paralogs, or even unresolved heterozygosity. This is especially relevant in prokaryotic genomes with repetitive elements or mobile gene families, where even short-read Illumina data may struggle to resolve genomic complexity, leading to fragmented contigs and misrepresented BUSCO classifications.

In the revised manuscript, we have corrected our language in lines 174–175 to reflect this nuance, acknowledging that duplicated and fragmented BUSCOs may result from a combination of technical and biological factors, including but not limited to contamination. Furthermore, the reviewer’s interpretation regarding unmapped reads being a potential side effect of repeat-induced mapping ambiguity is particularly compelling. Since most aligners do not confidently place multi-mapping reads, regions with high repetitive content may inflate unmapped rates while simultaneously fragmenting assemblies—thus contributing to both lower N50 and higher BUSCO fragmentation.

Although explicit sequencing coverage data was not available for all genomes, the majority were sourced from high-coverage Illumina projects, and we used the percentage of unmapped reads as an indirect proxy for coverage-related or structural issues. We now include this explanation in the discussion and have emphasized that multiple quality indicators must be interpreted together, rather than relying on BUSCO metrics in isolation.

- lines 209-210 and other places: the authors mention "high rank", as a measure of quality, but they do not provide the rationale for the cutoff values chosen.

Re) Thank you for this helpful observation. We acknowledge that the term “high rank” may have been unclear without an explicit explanation of how ranking thresholds were determined. In our analysis, “rank” refers to the genome’s relative standing within the composite Genome Quality Index (GQI) distribution. We defined "high", "mid", and "low" ranks based on quantile-based clustering (tertiles or k-means-derived clusters) to classify genomes into comparative quality tiers. This approach avoids arbitrary cutoffs and reflects natural groupings in the multidimensional quality metric space. We have now clarified this methodology in the revised text and explicitly stated the basis for rank classification to ensure transparency and reproducibility.
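A minimal sketch of the quantile-based tiering described above, assuming GQI scores in a column named gqi (file and column names are illustrative):

```python
# Sketch: split the GQI distribution into tertiles -> low/mid/high ranks.
import pandas as pd

df = pd.read_csv("gqi_scores.csv")
df["rank"] = pd.qcut(df["gqi"], q=3, labels=["low", "mid", "high"])
print(df["rank"].value_counts())
```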

 

- the legend to fig. 2, line 259, has some extraneous text in the end, "unmapped reads.3.3. Formatting of Mathematical Components"

Re) Corrected.

- figure 2 displays some of the associations between metrics, and it suggests that some of the variables would be better modeled as non-linear. Or, at least, that applying a log-transform to the variable in the Y-axis could make the linear fit give meaningful results. The single BUSCO vs. unmapped reads data seems to be an inverted-U kind of model, with different dynamics for low values compared to higher values. Others, such as the graphs involving the N50, seem to be L- or inverted L-shaped curves (probably because of the huge range of N50 values). Therefore, there is no way the correlations can be good. Why did the authors only apply linear fitting, without doing log transforms? If they had a reason, then they have to explain both why they chose to perform their fitting like this and the limitations of doing so.

Re) Thank you for this constructive suggestion. In response, we have reanalyzed the relevant associations using log-transformed values, particularly for variables such as N50 and unmapped read percentages, to better capture their distributions and improve the interpretability of the correlations. These updated plots are now presented in the revised Figure 3, and we have updated the methods and results sections to clarify the rationale for applying log transformation and the improved fit it provides. We appreciate the reviewer’s guidance in strengthening the statistical modeling of our dataset.
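The effect of such a transform is easy to check; the sketch below (assumed column names) compares the Pearson r for contig count vs. N50 on the raw and log10 scales:

```python
# Sketch: does log-transforming N50 improve the linear correlation?
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("genome_metrics.csv")
r_raw, _ = pearsonr(df["contigs"], df["n50"])
r_log, _ = pearsonr(df["contigs"], np.log10(df["n50"]))
print(f"raw N50: r = {r_raw:.2f}; log10(N50): r = {r_log:.2f}")
```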

Reviewer 2 Report

Comments and Suggestions for Authors

PFA

Comments for author File: Comments.pdf

Author Response

Reviewer 2.

The article ‘Integrative Assessment of Pathogenic Bacterial Genomes: Insights from Quality Metrics’ by Farooq and Rafique opens up with a strong and very clear introduction, and the pressing need for such a study.

Re) Thank you very much for your positive feedback. We sincerely appreciate your recognition of the clarity and relevance of the introduction. It is encouraging to know that the framing of the study's importance resonated, and we are glad it effectively conveyed the motivation behind our integrative approach to bacterial genome quality assessment.

 

However, there are 2 main concerns. Firstly, it intrigues me that “…study aimed at assessing the quality of 474 pathogenic bacterial genomes submitted from South Korea….” On what basis are these genomes selected? And why only those submitted from South Korea? Are the selection criteria based on country? Then there is no equity, and unless the authors can reasonably explain their choice, such a study seems totally biased.

Re) Thank you for raising this important point. The focus on genomes submitted from South Korea was not intended to introduce geographic bias but rather reflects a controlled case study based on data availability, sequencing consistency, and institutional access to metadata for validation. By using genomes from a single country, we aimed to minimize confounding variability introduced by differences in sequencing platforms, submission pipelines, and annotation standards across global datasets. We have now clarified this rationale in the manuscript to emphasize that the methodological framework is broadly applicable and not limited to any one country or region.

We thank the reviewer for raising this important concern regarding potential geographic bias. We would like to clarify that the selection of genomes from South Korea was not motivated by any country-specific evaluation or bias, but rather by pragmatic and technical considerations during data retrieval. Specifically:

Among Asian countries, after China and Japan, South Korea contributed significantly to the submission of pathogenic bacterial genomes to the NCBI-PD database. South Korea was selected because it had complete entries with consistent SRA metadata and Illumina raw read availability, which enabled standardized downstream processing, including assembly and quality analysis. This minimized variability due to sequencing platforms or missing files.

The choice ensured dataset homogeneity, allowing us to evaluate genome quality without introducing confounding variables such as mixed technologies, inconsistent file formats, or incomplete metadata across countries.

While the study focuses on South Korean submissions, the methods, interpretation, and proposed thresholds are broadly applicable to global public genome submissions. We will revise the manuscript to make this rationale explicit and acknowledge the geographic limitation as part of the study’s scope.

 

Secondly, the article must compare and contrast with other automated long-read tools like:

Hybracter https://doi.org/10.1099/mgen.0.001244

CheckM https://doi.org/10.1101/gr.186072.114

gVolante https://doi.org/10.1093/bioinformatics/btx445

and then discuss how their work is an improvement or different, its merits as well as demerits.

Re) Thank you for this important suggestion. We appreciate the opportunity to contextualize our work alongside established tools such as Hybracter, CheckM, and gVolante. In the revised manuscript, we have added a comparison section outlining key distinctions. Unlike these tools, which focus on assembly or annotation evaluation using long-read hybrid assembly metrics (Hybracter), lineage-specific marker completeness and contamination (CheckM), or reference-based completeness scoring (gVolante), our framework integrates multiple metrics—including N50, contig count, unmapped reads, and BUSCO scores—into a unified, PCA-derived Composite Genome Quality Index (GQI).

While our method does not replace these tools, it offers a complementary, platform-agnostic assessment, particularly suited for large-scale comparative studies. We now also discuss the limitations of our approach (e.g., lack of raw read integration or taxon-specific modeling) to ensure transparency. These additions help clarify both the novelty and scope of our contribution relative to existing tools.

 

On the minor front, Figure 2 needs to be numbered a, b, c, and each graph needs to be described in the legend. However, both figures are not clear in the manuscript.

Re) Thank you for your helpful suggestion. We have revised Figure 2 by labeling each subpanel as a, b, c, etc., and updated the figure legend to clearly describe each graph.

 

In the abbreviations section only three are described, but the paper has more.

Re) Thank you for pointing this out. We have reviewed the manuscript and updated the abbreviations section to ensure that all abbreviations used in the text are now clearly defined and consistently listed.

 

Bacterial names are not italicized, e.g., Bradyrhizobium sp. (Results section)

Re) Corrected.

 

References are not updated. Some very critical recent articles related to the work are not cited. The references included are only up to 2021.

https://doi.org/10.1002/cpz1.323

https://doi.org/10.1111/1755-0998.13364

https://doi.org/10.1093/molbev/msab199

Re) Thank you for pointing this out. We have now updated the reference list to include the suggested and other relevant recent articles, including those published after 2021, to ensure the manuscript reflects the current state of research in genome quality assessment.

Reviewer 3 Report

Comments and Suggestions for Authors

Summary

This manuscript evaluates 474 pathogenic bacterial genomes from South Korea through an integrative quality assessment approach combining metrics for completeness, contiguity, and accuracy. The authors demonstrate that comprehensive genome evaluation requires multiple integrated metrics rather than relying on single parameters, establish correlations between quality parameters, and emphasize the need for improved quality control measures in public genomic databases. While addressing an important issue in genomic data quality, the study largely confirms previously known relationships between quality metrics and requires substantial revisions to strengthen methodological clarity, data presentation, and the articulation of novel insights. The manuscript would benefit from clearer emphasis on new findings that extend beyond confirming established quality metric relationships.

Comments

  1. Why was the genome dataset from Korea selected for the study?
  2. The complete workflow for genome quality assessment should be a manuscript figure as most of the results rely on this workflow.
  3. “Consistent with this, our analysis revealed that 392 (83%) genomes had complete single-copy BUSCO values above 90%, while 444 (94%), 370 (78%)” – what are the numbers in the parentheses?
  4. The authors state that all contaminated genomes were of low quality in terms of completeness, contiguity, and accuracy, but this is somewhat circular since contamination inherently affects these metrics.
  5. The authors downloaded SRA data and re-assembled genomes rather than evaluating the submitted assemblies. While this standardizes the assembly process, it creates a different question than evaluating the quality of assemblies in public databases. The paper should clarify this distinction better.
  6. While the authors mention 16S rRNA contamination, there's limited explanation of how they verified the taxonomy of non-contaminated genomes.
  7. The accuracy assessment using ALE is based on mapping reads back to assemblies, but this approach doesn't fully capture errors in regions where reads map well despite misassembly. Additional reference-based assessment would strengthen the analysis.
  8. The manuscript uses terms like "very strong," "strong," "moderate," and "less strong" correlations without clearly defining the ranges for these categories. More precise statistical reporting is needed.
  9. The paper focuses heavily on technical aspects of genome quality but provides limited discussion on the biological implications of using low-quality genomes for downstream analyses such as pathogen surveillance or AMR detection.
  10. The 474 genomes represent 13 different species, but there's no analysis of whether quality metrics vary systematically across different species. Different bacterial species may exhibit different assembly challenges.
  11. I suggest comparing the proposed integrated approach with existing genome quality assessment frameworks like CheckM.
  12. Several sentences are awkwardly phrased or contain grammatical errors.

Comments for author File: Comments.pdf

Author Response

Reviewer 3.

This manuscript evaluates 474 pathogenic bacterial genomes from South Korea through an integrative quality assessment approach combining metrics for completeness, contiguity, and accuracy. The authors demonstrate that comprehensive genome evaluation requires multiple integrated metrics rather than relying on single parameters, establish correlations between quality parameters, and emphasize the need for improved quality control measures in public genomic databases. While addressing an important issue in genomic data quality, the study largely confirms previously known relationships between quality metrics and requires substantial revisions to strengthen methodological clarity, data presentation, and the articulation of novel insights. The manuscript would benefit from clearer emphasis on new findings that extend beyond confirming established quality metric relationships.

Re) We thank the reviewer for recognizing the importance of our integrative assessment approach and the value of combining completeness, contiguity, and accuracy metrics. In response to the comment regarding novelty and methodological clarity, we have substantially revised the manuscript to better emphasize the following key contributions: (1) the development of a unified Composite Genome Quality Index (GQI); (2) unsupervised clustering of genomes into quality-based typologies with species-level biases; and (3) a meta-analysis comparing curated versus public genomes to highlight systemic variation in genome quality. These additions extend our findings beyond confirming known metric correlations. We have also improved methodological descriptions and enhanced figure clarity to strengthen transparency and reproducibility.

 

Comments

Why was the genome dataset from Korea selected for the study?

Res:  Thank you for raising this important point. The focus on genomes submitted from South Korea was not intended to introduce geographic bias but rather reflects a controlled case study based on data availability, sequencing consistency, and institutional access to metadata for validation. By using genomes from a single country, we aimed to minimize confounding variability introduced by differences in sequencing platforms, submission pipelines, and annotation standards across global datasets. We have now clarified this rationale in the manuscript to emphasize that the methodological framework is broadly applicable and not limited to any one country or region.

The complete workflow for genome quality assessment should be a manuscript figure as most of the results rely on this workflow.

Res: We agree with the reviewer’s suggestion and have now included the complete genome quality assessment workflow as Figure 1 in the main manuscript.

 

“Consistent with this, our analysis revealed that 392 (83%) genomes had complete single-copy BUSCO values above 90%, while 444 (94%), 370 (78%)” – what are the numbers in the parentheses?

Res: The numbers in the parentheses are the percentage values of the genomes for a particular BUSCO parameter out of the total genomes analyzed (n=474).
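(For example, 392/474 ≈ 0.827, reported as 83%; likewise 444/474 ≈ 94% and 370/474 ≈ 78%.)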

 

The authors state that all contaminated genomes were of low quality in terms of completeness, contiguity, and accuracy, but this is somewhat circular since contamination inherently affects these metrics.

Res: Thank you for this important observation. We acknowledge that stating contaminated genomes exhibit low completeness, contiguity, and accuracy can appear circular, since contamination itself can directly compromise these metrics. Our intention was not to imply an independent causal relationship, but rather to highlight that contamination is consistently reflected across multiple quality dimensions. We have revised the text to clarify that our framework detects contamination through its effects on genome integrity, such as increased duplication (e.g., in BUSCOs), fragmented assemblies, and elevated unmapped read rates, rather than treating contamination as an external label. This integrated signal strengthens the case for using a multi-metric approach like GQI to flag potentially problematic assemblies.

 

The authors downloaded SRA data and re-assembled genomes rather than evaluating the submitted assemblies. While this standardizes the assembly process, it creates a different question than evaluating the quality of assemblies in public databases. The paper should clarify this distinction better.

Res: Thank you for this important clarification. We agree that reassembling genomes from raw SRA data introduces a different analytical scope compared to evaluating publicly submitted assemblies. Our intention was to standardize the assembly pipeline to control for variation introduced by differing submission practices, platforms, or annotation methods. However, we now recognize the need to more clearly distinguish this approach. We have revised the manuscript to explicitly state that our goal was to assess genome quality under uniform assembly conditions, rather than auditing the quality of deposited assemblies as-is. This clarification has been added to both the Methods and Discussion sections.

 

While the authors mention 16S rRNA contamination, there's limited explanation of how they verified the taxonomy of non-contaminated genomes.

Res: Thank you for this observation. We confirm that the taxonomy of non-contaminated genomes was determined using the 16S rRNA-based RDP Classifier, as described in the Materials and Methods section. This approach allowed us to validate taxonomic assignments consistently across all genomes using 16S sequences. To avoid any confusion, we have now clarified this point more explicitly in the main text.

 

The accuracy assessment using ALE is based on mapping reads back to assemblies, but this approach doesn't fully capture errors in regions where reads map well despite misassembly. Additional reference-based assessment would strengthen the analysis.

Res: Thank you for this insightful comment. We agree that while ALE provides a valuable accuracy estimate by assessing read alignment likelihoods, it may not fully detect structural misassemblies or errors in repetitive or well-mapped but misassembled regions. To address this, we have clarified in the manuscript that ALE’s alignment-based scoring is complemented by other quality indicators—such as N50, BUSCO fragmentation, and unmapped read percentages—which together help flag potentially misassembled regions. We also acknowledge in the revised discussion that incorporating reference-based assessments or tools like QUAST with trusted references could further strengthen future analyses where suitable references are available.

 

The manuscript uses terms like "very strong," "strong," "moderate," and "less strong" correlations without clearly defining the ranges for these categories. More precise statistical reporting is needed.

Res:  Thank you for this helpful comment. We agree that using qualitative terms without defined thresholds can reduce clarity. In the revised manuscript, we have now included explicit correlation coefficient ranges corresponding to each descriptive label (e.g., “very strong” ≥ 0.90, “strong” = 0.70–0.89, “moderate” = 0.40–0.69, etc.), and ensured that all correlation values are numerically reported. This improves the statistical transparency and interpretability of our findings.
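Encoded directly from the ranges quoted in this response, the labeling amounts to a simple threshold map (the function name, and the "weak" label for |r| below 0.40, are illustrative assumptions):

```python
# Sketch: map |r| to the descriptive labels defined in the revision.
def correlation_label(r: float) -> str:
    a = abs(r)
    if a >= 0.90:
        return "very strong"
    if a >= 0.70:
        return "strong"
    if a >= 0.40:
        return "moderate"
    return "weak"  # label below 0.40 not stated in the response; assumed

print(correlation_label(0.85))  # -> strong
```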

 

The paper focuses heavily on technical aspects of genome quality but provides limited discussion on the biological implications of using low-quality genomes for downstream analyses such as pathogen surveillance or AMR detection.

Res:  Thank you for this valuable comment. We agree that the biological consequences of using low-quality genomes warrant greater emphasis. In response, we have expanded the discussion section to highlight how fragmented or incomplete assemblies can impair downstream applications such as antimicrobial resistance (AMR) gene detection, phylogenetic analysis, and pathogen surveillance accuracy. Specifically, low contiguity may obscure mobile genetic elements or resistance islands, and misassemblies can introduce false positives or negatives in gene presence/absence matrices. We believe this addition strengthens the relevance of our work to applied microbiology and public health contexts.

 

The 474 genomes represent 13 different species, but there's no analysis of whether quality metrics vary systematically across different species. Different bacterial species may exhibit different assembly challenges.

Res:  Thank you for this insightful comment. We agree that genome quality may be influenced by species-specific features such as genome complexity, GC content, or repeat density. In response, we have now included a species-level stratification analysis, comparing quality metrics across the 13 species in our dataset. This revealed clear differences—for example, Mycobacterium tuberculosis assemblies consistently showed higher contiguity and completeness, while Klebsiella pneumoniae genomes tended to be more fragmented. These results support the notion that assembly challenges vary by species, and we have added this analysis and its implications to the results and discussion sections accordingly.
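A hedged sketch of such a stratification (assumed column names): group one metric by species and test for systematic differences, e.g., with a Kruskal-Wallis test on N50:

```python
# Sketch: per-species N50 medians plus a Kruskal-Wallis test across species.
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("genome_metrics.csv")
print(df.groupby("species")["n50"].median().sort_values())
stat, p = kruskal(*(g["n50"].values for _, g in df.groupby("species")))
print(f"Kruskal-Wallis H = {stat:.1f}, p = {p:.3g}")
```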

 

I suggest compare the proposed integrated approach with existing genome quality assessment frameworks like CheckM.

Res:  Thank you for the suggestion. We agree that comparing our integrated Genome Quality Index (GQI) with established frameworks like CheckM provides valuable context. In the revised manuscript, we now include a dedicated section highlighting the differences: while CheckM estimates completeness and contamination based on lineage-specific markers, our approach integrates multiple assembly-derived metrics (e.g., N50, contigs, unmapped reads, BUSCO) into a unified index. We also discuss how these methods can complement one another, and note that future work could involve benchmarking GQI directly against CheckM outputs to assess consistency and added value.

 

Several sentences are awkwardly phrased or contain grammatical errors

Res:  Thank you for the feedback. We have carefully reviewed the manuscript and revised all awkwardly phrased or grammatically incorrect sentences to improve clarity, readability, and overall language quality throughout the text.