Review Reports - Identification and Genetic Diversity Analysis of Cucurbita Varieties Based on SSR Markers

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

MAIN COMMENT

Zou et al. present the results of a study aimed at developing an SSR-based protocol for the identification of Chinese Cucurbita cultivars (C. moschata, C. pepo, and C. maxima). Starting from 175 SSR primer pairs (125 taken from the literature, 50 newly developed), a set of 24 highly polymorphic markers was selected and tested on 306 Cucurbita varieties, achieving a 98.36% discrimination rate. Population structure analysis showed the presence of three genetic clusters, corresponding to the three analysed species, though some putative interspecific hybrids were noted.

While this work could represent a valuable contribution for the efficient identification of Chinese Cucurbita cultivars, supporting the preservation of local cultivars and their correct inclusion in registers of varieties, the manuscript suffers from several important flaws and weaknesses (listed below).

(1) Scope and breadth of the study:

While the manuscript is entitled "Identification and genetic diversity analysis of Cucurbita varieties based on SSR markers", all the analysed varieties are Chinese cultivars (if I understood correctly), which calls into question the universality of the described SSR genotyping system and of the associated fingerprint database. Cucurbita maxima, C. pepo and C. moschata originated in North, Central and South America and, while China is now a major producer of pumpkins and squash and has plenty of local landraces, many important commercial varieties were developed elsewhere.

In this regard, at lines 383-385 the authors stated: "The genetic diversity of the materials in this study is relatively narrow, which may be the reason why the average number of allelic variations and the average PIC value of the core primers in this study are slightly lower." In the context of the development of "a new approach for population genetic analysis, variety identification and protection in Cucurbita cultivars with high efficiency, accuracy, and inexpensive compared with conventional methods", this could be a limitation, since a set of cultivars representing a "narrow" genetic diversity might not be the best choice for this aim: indeed, the same marker set might not perform well when applied to distantly related cultivars.
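
To make the link between narrow diversity and low PIC concrete, here is a minimal sketch of the per-locus PIC calculation, assuming the commonly used Botstein et al. (1980) formula, PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2. The function name and the allele frequencies are hypothetical illustrations, not values from the manuscript.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphism information content of one locus from its allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Probability that two randomly drawn alleles are identical.
    homozygosity = sum(p * p for p in freqs)
    # Correction term for uninformative matings between identical heterozygotes.
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross


# Hypothetical loci: one dominated by a single allele, one with four balanced alleles.
narrow_panel = [0.90, 0.05, 0.05]
diverse_panel = [0.25, 0.25, 0.25, 0.25]

print(f"narrow panel PIC:  {pic(narrow_panel):.3f}")
print(f"diverse panel PIC: {pic(diverse_panel):.3f}")
```

A locus dominated by one allele scores far lower than a balanced one, which is why a marker set tuned on a narrow panel may behave differently on broader germplasm.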

On the other hand, obtaining a high identification rate from a dataset of Chinese cultivars characterized by relatively low diversity could also be a plus, as long as the effectiveness and relevance of the presented protocol is limited to Chinese cultivars and not extended to Cucurbita cultivars in general. In my opinion, even a crop genotyping study focusing on a single country can deserve to be published, particularly when focusing on a vast and diverse country such as China and in the context of local germplasm preservation, but it should be presented for what it is, without overemphasis, and properly contextualized starting from the title.

Lastly, if the study included a list of local varieties, I would expect a tentative geographic analysis of germplasm diversity.

(2) Novelty of the study:

As stated by the authors (lines 70-71), "molecular marker technology has been widely applied in the identification of Cucurbita varieties", including SSR markers.

In particular, Sim et al. (2015), in their study entitled “DNA Profiling of Commercial Pumpkin Cultivars Using Simple Sequence Repeat Polymorphisms”, developed a core set of 29 SSR markers (starting from 300 SSRs) for DNA profiling and cultivar identification of C. pepo, C. moschata and C. maxima, which they tested on a set of 160 commercial cultivars by means of fluorescence capillary electrophoresis on a Genetic Analyzer 3130XL system (Applied Biosystems, USA).

The authors of the current manuscript at lines 86-87 wrote: “Currently, research on establishing a molecular identification system and an SSR fingerprint database for Cucurbita varieties based on the fluorescence capillary electrophoresis platform has not been reported”. This sentence is in contradiction with lines 70-71 and, in my opinion, is not true. Sim et al. (2015) is indeed an example of “research on establishing a molecular identification system and an SSR fingerprint database for Cucurbita varieties based on the fluorescence capillary electrophoresis platform”: am I wrong? If I am not wrong, the current manuscript does not represent the first attempt in this direction. Moreover, Sim et al. concluded that “we found that 29 SSR markers were able to differentiate all 160 cultivars”, and “The UPGMA dendrogram showed that all 160 commercial pumpkin cultivars in our collection were differentiated by the 29 SSR markers”. As far as I understand, the discrimination rate in Sim et al. was therefore 100%, higher than that of the present study, although the Sim et al. dataset included a lower no. of cultivars (160 against 306).

To conclude, the protocol and marker set described in the current manuscript can still represent a useful molecular resource for Cucurbita cultivar identification, but the novelty of the research is different from what the authors state.

(3) Accessibility and usability of the results:

Considering the above-mentioned clarifications, in my opinion the only true element of novelty of the current study could perhaps be the generation of a fingerprint database for 306 Chinese varieties of the genus Cucurbita. However, the authors, despite claiming the fingerprint database as the main output of their study (see e.g. Conclusion), did not provide it as a public resource (neither in a public repository nor in the Supplementary Material). The latter point not only contrasts with the great applied importance given by the authors themselves to this output, but also does not allow the reproducibility of all the other results presented. Similarly, 20 reference varieties, which ideally could serve as a standard "to correct the systematic errors between different test batches or detection platforms", were selected by the authors (section 3.4). However, if the genotypes for these 20 varieties are not available to the scientific community, they cannot be used as a reference by other authors (i.e., standardization is not guaranteed).

(4) Methodological issues:

Microsatellite quality control is lacking: apparently, neither basic quality tests (detection of allele dropouts and other scoring errors, e.g., software Micro-Checker), nor null allele detection (e.g., software FreeNA) and linkage disequilibrium tests were performed for the selected 24 core SSRs.
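
As a rough illustration of what a null-allele check involves, the sketch below applies Brookfield's (1996) first estimator, r = (He - Ho) / (1 + He), to a single locus. It is not a substitute for dedicated tools such as Micro-Checker or FreeNA. The function name and the genotype data are hypothetical, and the estimator assumes random mating, so in inbred cultivars a homozygote excess does not by itself demonstrate null alleles.

```python
from collections import Counter


def brookfield_null_freq(genotypes):
    """Estimate null-allele frequency at one locus (Brookfield 1996, estimator 1).

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    """
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    total_alleles = 2 * n
    # Expected heterozygosity from observed allele frequencies.
    he = 1.0 - sum((c / total_alleles) ** 2 for c in allele_counts.values())
    # Observed heterozygosity: fraction of individuals with two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    # A heterozygote excess gives a negative value, reported here as zero.
    return max(0.0, (he - ho) / (1.0 + he))


# Hypothetical fragment sizes (bp): only homozygotes are observed at this locus.
suspicious_locus = [(100, 100)] * 10 + [(104, 104)] * 10
print(f"estimated null-allele frequency: {brookfield_null_freq(suspicious_locus):.3f}")
```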

Moreover, many important methodological issues are poorly reported; e.g.: (a) Lines 119-120: which genomic resources did you use? Please specify (e.g., if deposited in public repositories, report identification codes); (b) data analysis (section 2.8): several software packages are mentioned (SSR Analyser, GenAlEx, PowerMarker, MEGA and "RSudio"), with no literature references: this is not the way software and tools should be cited in a scientific manuscript. Regarding "the software RSudio", which the authors used for construction of the PCA, I guess it is RStudio, which is not an analysis software but a development environment for R. Which version of R did the authors use? Which R package was used for PCA? What was the input for PCA? Were the variables scaled and the PCA centered? (see: Jombart et al. 2009. Genetic markers in the playground of multivariate analysis. Heredity). For population structure analysis, no details are given about the parameter settings used in STRUCTURE (e.g., admixture and allele frequency models, no. of assumed populations (K), no. of iterations for burn-in and runs, no. of independent runs, etc.). These are key methodological details that cannot be omitted.

Lastly, sections 2.6.1, 2.6.2, 2.6.3, and 2.7 are written in the form of instructions for a protocol (i.e., using verbs in the imperative form), and not in the form of a scientific manuscript (i.e., describing the methods that were used by the authors). E.g.: "Calibrate the results on the SSR fingerprint analyzer according to the molecular weight internal standard, read the data, and save it": I don't think you need to specify that you read and saved the data.
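
The questions about PCA input, centering and scaling matter because each choice changes the result. The following is a minimal sketch, assuming the input is an individuals-by-alleles matrix of allele counts and using NumPy's SVD. The function name and the random data are hypothetical, and this is not a reconstruction of the authors' workflow.

```python
import numpy as np


def allele_pca(counts, scale=False, n_components=2):
    """PCA of an individuals x alleles count matrix via SVD.

    Returns the component scores and the fraction of variance explained.
    """
    X = np.asarray(counts, dtype=float)
    X = X - X.mean(axis=0)  # centering: subtract each allele's mean count
    if scale:
        sd = X.std(axis=0, ddof=1)
        keep = sd > 0  # drop monomorphic alleles, which cannot be scaled
        X = X[:, keep] / sd[keep]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]


# Hypothetical data: 12 individuals, 8 alleles, counts of 0, 1 or 2 copies.
rng = np.random.default_rng(0)
counts = rng.integers(0, 3, size=(12, 8))
scores, explained = allele_pca(counts, scale=True)
print("variance explained by PC1 and PC2:", np.round(explained, 3))
```

Running the same data with `scale=False` generally gives different axes, which is why the setting needs to be reported alongside the package and input format.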

(5) Phenotypic characterization:

This was performed by the authors only for 7 varieties, the ones which could not be distinguished by the 24 pairs of SSR core primers (all belonging to C. pepo). However, in a study of this kind, and also considering that in Cucurbita spp. "traditional variety identification methods mainly rely on morphological traits to distinguish varieties" (line 53), it would be interesting to see a comparison between variety identification based on phenotypic vs molecular methods, at least on a subset of varieties (possibly more than 7). Although not a critical requirement per se, if the authors used phenotypic characterization when molecular identification failed, it would be important to test phenotype-based identification also in cases when molecular identification worked. Moreover, in Material and methods, the authors only wrote: "The investigation of traits referred to the "Guidelines for the Test of Distinctness, Uniformity and Stability of New Plant Varieties Cucurbita (C. moschata)" (NY/T 2762-2015) [38], the "Guidelines for the Test of Distinctness, Uniformity and Stability of New Plant Varieties Summer Squash" (NY/T 2343-2013) [49], and the "Guidelines for the Test of Distinctness, Uniformity and Stability of New Plant Varieties Summer Squash" (submitted for approval draft)", without specifying in the manuscript which traits were measured and how the measurements were performed.

In my opinion, this study can be considered for publication only if the authors carefully address all the points raised here, solve or mitigate the identified weaknesses, re-writing the manuscript accordingly, and/or provide convincing replies to the concerns raised.

MINOR COMMENTS

Lines 16-17: "has impeded the development of the plant variety protection and genetic breeding of Cucurbita": this sentence is a bit convoluted. Better something like: "has impeded the development of proper conservation strategies and marker assisted genetic breeding for Cucurbita varieties". Please note: genetic breeding can be performed also without the use of genetic markers, as it was practiced for centuries before the advent of molecular biology; what you cannot do is marker assisted genetic breeding: that is the reason for the suggested clarification.

Line 18: "provide" should probably be corrected into "providing" (since you previously used "distinguishing").

Line 19: "variety protection" could be changed into "variety preservation", to avoid repetition of the word "protection" in the same sentence; "allelic variations": do you mean "allelic variants"?

Line 21: "successfully distinguished 300 varieties from 306 Cucurbita varieties": not clear to me. You meant "successfully distinguished 300 out of 306 Cucurbita varieties"?

Line 24: "cultivars were categorized into three populations, which are C. moschata, C. pepo, and C. maxima": from a scientific point of view, speaking about "populations" when you are dealing with different species can be misleading. Suggested correction: "categorized into three genetic clusters, corresponding to the three species: C. moschata, C. pepo, and C. maxima".

Line 24: "morphological characteristic" is not wrong, but perhaps "phenotypic traits" is more accurate and consistent with the terminology used in the rest of the manuscript.

Line 27: please replace "a new approach" with "new tools". If you speak about "a new approach" referring to "population genetic analysis", readers may think you are describing a new theoretical approach.

Line 27: I don't think you can say it's "inexpensive". Maybe better "less expensive" (or "lower costs compared to…").

Introduction: at the beginning, perhaps you should provide a taxonomic classification for the genus Cucurbita (e.g., specifying the family) as well as the common names, and a very brief outline of the reasons why the three mentioned species are cultivated (I guess fruit consumption). I am aware this information can be considered obvious, but a brief description and contextualization of the study species is needed.

Lines 46-49: I understand the general meaning of this sentence, but some passages are confusing; "introduction of varieties": you mean from different countries? "self-breeding between different regions": not clear to me, same for "interlacing of strains". Please rephrase in a clearer form.

Lines 59-61: the last part of the sentence (after the comma at line 59) is unnecessarily long and includes repetitions (e.g., "rapid identification" repeated two times). You could shorten it as follows:

"are of great significance for the rapid identification and preservation of Cucurbita varieties, genetic diversity analysis, market supervision, resolution of variety rights disputes, and auxiliary screening of similar varieties in DUS testing". Moreover, please specify what "DUS" stands for.

Line 64: perhaps "fast, efficient and reproducible" is more complete than "fast and efficient".

Lines 64-67: what you write here is well-known and the list of cultivated plants with developed SSR marker-based variety identification systems is extensive and growing. This sentence could be removed or summarized.

Line 70 and following: there seems to be a contradiction between what you wrote here and how you introduced your study in the abstract. Indeed, in the abstract you wrote about "the lack of molecular identification system" for Cucurbita, while here you wrote "Molecular marker technology has been widely applied in the identification of Cucurbita varieties" and, among the examples you mentioned, studies with SSR markers are also reported (e.g., "Sim [18] used 29 SSR markers to distinguish 160 Cucurbita varieties"). When I reached this point in the reading, I started to ask myself where the novelty in your study is. Moreover, when you report a study which was performed by multiple authors, you should add "et al." after the name of the first author. Please check throughout the manuscript.

Line 80: what do you mean by "simple operation"?

Lines 87-96: this paragraph is too long and convoluted: it's a single, 10-line sentence, with only commas as punctuation and nonsense syntax ("By extensively collecting...consulting literature...using the published genomes...using polyacrylamide gel electrophoresis...providing technical support...": where is the main statement?). Please report the aims of the study and the adopted methodology in a schematic and clearer way.

Lines 94-96: "for the rapid identification of Cucurbita varieties, genetic diversity analysis, market supervision, rapid identification of variety disputes, protection of new varieties, and auxiliary screening of similar varieties in DUS testing" are identical to lines 59-61: "for the rapid identification of Cucurbita varieties, genetic diversity analysis, market supervision, rapid identification of variety rights disputes, protection of new varieties, and auxiliary screening of similar varieties in DUS testing", and almost identical to lines 438-440: "for rapid identification of Cucurbita varieties, analysis of genetic diversity, market supervision, rapid identification in variety disputes arbitration, protection of new varieties, and auxiliary screening of similar varieties for DUS testing".

Section 3.5: please check hyphenation.

Lines 343-344: "When the results of molecular identification are inconsistent with those of phenotypic identification, the results of phenotypic identification shall prevail": who said this? Please add some reference.

Lines 395-396: Reference no. 9 is not Ling et al., but instead: Wang et al. Construction of an SSR-Based Standard Fingerprint Database for Corn Variety Authorized in China. Scientia Agricultura Sinica. 2017, 50, 1-14. The only paper by Ling et al. reported in the references is no. 12. Please check citations and references.

Line 400: the title of section 4.2 should be revised.

Figure 7: what is the % of variance explained by PC1 and PC2?

Figure 8: labels for the X and Y axes are lacking.

Figure 6: varieties are represented here by numbers and not by their names, but a legend is lacking.

Discussion: in 2015, Sim et al. wrote in their Discussion section about the potential use of NGS-based SNP markers, alone or together with SSR markers. Ten years later, in the genomic era, and particularly after works such as: “Nguyen et al. 2020. Genome-wide SNP discovery and core marker sets for assessment of genetic variations in cultivated pumpkin (Cucurbita spp.). Hortic. Res.”, I believe that one cannot do without mentioning the pros and cons of microsatellite markers compared to SNPs.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear authors,

In the attached document, you will find a few recommendations to improve your manuscript.

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors did not provide convincing replies to the concerns raised, nor did they implement the modifications, additions, and clarifications that were required to mitigate the identified weaknesses and shortcomings. As a result, the reported study still fails to meet the robustness and reproducibility criteria necessary to support the results and conclusions, the methodological reporting does not meet the minimum standard for a scientific publication, and data availability is not ensured.

More specifically:

(1) Data availability and reproducibility of results are not ensured.

In my main comment to the original manuscript, I highlighted that, in my opinion, the only true element of novelty of the current study was perhaps the generation of a fingerprint database for 306 Chinese varieties of the genus Cucurbita. However, I had pointed out that the authors did not provide this database as a public resource (neither in a public repository nor in the Supplementary Material), therefore making the output of their study inaccessible to the scientific community, while also failing to ensure validation and reproducibility of the results. The authors replied "The database is hosted by the Plant New Variety Testing Division of the Science and Technology Development Center, Ministry of Agriculture and Rural Affairs", without providing any link. From these words, I must conclude that the SSR fingerprint database, the main outcome and element of novelty of this study, is not publicly available to the scientific community.

Moreover, since the SSR fingerprint database is nothing but the microsatellite genotyping results, i.e., the main result of the current analysis, this is equivalent to saying that the reproducibility of the presented findings is not guaranteed. This is contrary to the MDPI Research Data Policy, which states: "We recommend that data and code should be deposited in a trusted repository that will allow for maximum reuse... If this is not possible, authors are encouraged to share the specific reason in the Data Availability Statement and make this material available upon request to interested researchers...Data sharing policies concern the minimal dataset that supports the central findings of a published study. Generated data should be publicly available and cited in accordance with journal guidelines".

(2) Lack of microsatellite quality control tests.

In my previous comments I had warned the authors that microsatellite quality control was lacking in their analysis, specifying that neither basic quality tests, such as detection of allele dropouts and other scoring errors (e.g., software Micro-Checker), nor null allele detection (e.g., software FreeNA) and linkage disequilibrium tests were performed for the selected 24 core SSRs.

The authors replied "The 24 pairs of core primers were distributed across all 20 chromosomes (Figure 4), and there were no cases of equivalent identification. The primer sequences are shown in Table 2. These 24 pairs of core primers amplified a total of 152 allelic variations and 308 genotypes in 306 varieties...". This answer has nothing to do with quality control tests of microsatellite data.

The above-mentioned tests are basic standard requirements for the validity and reliability of microsatellite-based results.

(3) Insufficient methods reporting.

In my previous comments I had pointed out that, in section 2.8 (Data analysis), several software packages were mentioned but important methodological details and the parameter settings used were lacking.

In the new version of the manuscript, the R package/function used for PCA is still not specified, nor is it stated what the input for PCA was, whether the variables were scaled, and whether the PCA was centered. I had explicitly asked for all these details, providing the authors with the following paper as a guideline for multivariate analysis of microsatellite data: Jombart et al. 2009. Genetic markers in the playground of multivariate analysis. Heredity. Apparently, the authors ignored my requests and input.

Similarly, I had pointed out that no details about parameter settings for population structure analysis with the STRUCTURE software were reported (specifying which parameters I was talking about: admixture and allele frequency models, no. of assumed populations (K), no. of iterations for burn-in and runs, no. of independent runs, etc.). It is well-known that different STRUCTURE models and parameter settings can lead to different (or inconsistent) results: these are therefore key methodological details that cannot be omitted. Again, the authors ignored my remarks and no specification was added to the text. Moreover, the authors did not provide any citation for the software STRUCTURE and the implemented methodology, despite the fact that I had already pointed out this shortcoming as well, and despite the fact that the STRUCTURE documentation clearly specifies how to cite the program: https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/structure_doc.pdf (p. 37).

Lastly, I had pointed out that the authors did not specify in the manuscript which traits were measured and how the measurements were performed. As a reply, the authors wrote: "Field trait investigations were conducted with reference to the Guidelines for Testing Specificity, Uniformity, and Stability of New Plant Varieties - Pumpkin (Cucurbita moschata), Guidelines for Testing Specificity, Uniformity, and Stability of New Plant Varieties - Zucchini (Cucurbita pepo), and Guidelines for Testing Specificity, Uniformity, and Stability of New Plant Varieties - Butternut Squash (Cucurbita maxima) (submitted for approval). Traits were converted into codes for analysis according to the grading ranges specified in the guidelines. The guidelines specify 35 traits for C. moschata and C. maxima, and 61 traits for C. pepo". Although this answer is partly understandable for reasons of brevity, deferring all these methodological details to external resources that are not easily accessible to the reader, thus forcing them to make an additional effort to understand the present manuscript, is not ideal from the perspective of transparency and the effort to clearly convey what the authors have done. At the very least, the authors could have made an effort to summarize the most important aspects in a methodological appendix within the Supplementary Information.

(4) Lastly, in one of my comments, I pointed out that, since the study includes local varieties, I would have expected a geographical analysis of germplasm diversity. The authors' response, namely "The source locations of the 306 materials in this study have been shown in Table S1 of the supplementary materials", cannot be considered sufficient. I was hoping for the addition of a tentative geographic analysis of germplasm diversity, not for a table with the region of provenance of each variety.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

In the fingerprinting table, provided through the QR code, the first row ("Fingerprint code") is a list from V1 to V152, without any explanation. I suppose those are the 152 different alleles. I would expect at least locus-by-locus information.

Author Response

Comments 1: In the fingerprinting table, provided through the QR code, the first row ("Fingerprint code") is a list from V1 to V152, without any explanation. I suppose those are the 152 different alleles. I would expect at least locus-by-locus information.

Response 1: It has been revised. I have added the SSR primers corresponding to the alleles (V1~V152) in the second row of the fifth column ("Fingerprint code") in the fingerprint map table. In addition, I have added a note below the table: Note: The second row (NG2~NW98) of the fifth column ("Fingerprint code") are the 24 SSR primers in this study, and the third row (V1~V152) are the different alleles amplified by each SSR primer.
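
The locus-by-locus layout requested in Comment 1 can be sketched as follows, assuming the fingerprint table stores one presence/absence value per flat allele code (V1, V2, ...) and that the codes are numbered consecutively within each primer. The primer names NG2 and NW98 come from the response above; the allele counts, the function names and the example variety are hypothetical.

```python
def group_alleles_by_locus(allele_counts):
    """Map each SSR locus to its consecutive run of flat allele codes."""
    mapping = {}
    next_code = 1
    for locus, n_alleles in allele_counts.items():
        mapping[locus] = [f"V{i}" for i in range(next_code, next_code + n_alleles)]
        next_code += n_alleles
    return mapping


def genotype_by_locus(presence, mapping):
    """Convert a flat presence/absence row into the alleles scored at each locus."""
    return {
        locus: [code for code in codes if presence.get(code, 0) == 1]
        for locus, codes in mapping.items()
    }


# Hypothetical example: two of the 24 primers, with made-up allele counts.
mapping = group_alleles_by_locus({"NG2": 3, "NW98": 2})
variety_row = {"V1": 1, "V2": 0, "V3": 1, "V4": 0, "V5": 1}
print(genotype_by_locus(variety_row, mapping))
```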