De Novo SNP Discovery and Genotyping of Masson Pine ( Pinus massoniana Lamb.) via Genotyping-by-Sequencing

: Masson pine ( Pinus massoniana Lamb.) is an important tree species in China, but its genomic research has been hindered due to a large genome size. Genotyping-by-sequencing (GBS) has been a powerful approach to revolutionize the ﬁeld of genomic research by facilitating the discovery of thousands of single nucleotide polymorphisms (SNPs) and genotyping in non-model organisms, at relatively low cost. Here, we performed de novo SNP discovery and genotyping in 299 trees via the genotyping-by-sequencing (GBS) approach. The effort produced 9.33 × 10 9 sequence reads, 265,525 SNP-associated contigs, and 6,739,240 raw SNPs. Further ﬁltering and validation of the SNP-associated contigs for reliable SNPs were performed using blasting against the Pinus tabuliformis reference genome, functional annotation, technical replicates, and custom parameter settings for the optimization. The 159,372 SNP-associated contigs were aligned and validated for SNP prediction, in which 60,038 contigs were searched with hits in the NCBI nr database. We further improved the SNP discovery and genotyping with multiple technical replicates and custom parameter settings ﬁltering. It was found that the use of blasting, annotation, technical replicates, and speciﬁc parameter settings removed many unreliable SNPs and identiﬁed 20,055 more precise and reliable SNPs from the 10,712 ﬁltered contigs. We further demonstrated the informativeness of the identiﬁed SNPs in the inference of some genetic diversity and structure. These ﬁndings should be useful to stimulate genomic research and genomics-assisted breeding of Masson pine.


Introduction
Masson pine (Pinus massoniana Lamb., 2n = 24), a diploid conifer of the family Pinaceae, is one of the most commercially important timber tree species in China, providing timber, fiber pulp, rosin, and pollen pini for industrial, chemical, and medical use [1].Genomics has the potential to revolutionize conventional forest tree breeding with promising genotypebased genomic selection and increased genetic gain [2,3].However, affordable, reliable, and sufficient genome-wide markers are lacking for Masson pine, due to its large and complex genome, which is similar to that of most gymnosperms [4][5][6], such as the sequenced genomes of 23.2 Gbp in Pinus taeda [5] and 25.4 Gbp in Pinus tabuliformis [6].This has hindered the adoption of genomic selection in breeding programs.Efforts have been made to obtain single nucleotide polymorphisms (SNPs) using RNA-seq [7] and a set of genomewide genetic variation using SLAF-seq [8].Those efforts have facilitated the development of high-throughput SNPs for Masson pine, but it remains expensive to genotype large Forests 2023, 14, 387 2 of 18 numbers of samples with sufficient read depths through higher degrees of multiplexing, as forest tree breeding programs traditionally operate on a large number of individuals [3].Fixed array platforms have been regarded as the gold standard for robust and reliable high-throughput genotyping [9].However, the development of a species-specific SNP array for species like Masson pine without sufficient genetic resources requires a much more competitive price.Therefore, developing cost-effective genotyping platforms is essential for incorporating genomics into Masson pine breeding schemes.
Genotyping-by-sequencing (GBS), also known as reduced representation sequencing (RRS), is a widely used approach to reduce the complexity of large genomes to identify high-throughput SNP markers [10][11][12][13].This approach enables subset diverse but identical enzyme-target and recognized genomic regions for sequencing multiple samples.For species without a reference genome, a pseudo-reference or homogenous reference genome is used for the identification of SNPs [14].GBS can provide rapid, cost-effective, and comprehensive reduction of complex genomes, and thus is suitable to develop genomewide molecular markers in non-model species, especially in tree species, as most forest trees are non-model species with large and complex genomes [15].The compromise of sequence read depth and the number of sequenced individuals makes GBS a cost-effective molecular marker development platform [16].It has been applied to some tree species with large genomes [17,18], such as Pinus sylvestris [18], and has discovered large volumes of SNPs in studied species in the absence of species-specific genomic resources.However, concerns about the precision and reliability of the genetic data resulting from the GBS approach are increasing [19,20].As raw SNP datasets resulting from all genotyping experiments are typically inaccurate and incomplete [21,22], it is technically difficult to identify reliable SNPs for such a large-genome species without a sequenced genome.Thus, quality control (QC) procedures are important steps to identify more precise and reliable SNPs via GBS.
QC involves many aspects from the initial data preparation to core bioinformatics analysis [20], including filtering out poor-quality or suspected artifactual SNP loci; filtering out individuals related to missing data, anomalous genotype call, and genetic synonymies; and characterizing the identified SNPs [22].Generally, the custom per-marker QC of GBS data consists of at least three steps: (i) filtering out SNPs with an excessive missing genotype, (ii) the removal of all markers with very low minor allele frequency (MAF) or minor allele count (MAC), and (iii) the removal of all individuals with large missing data [23].For a non-model species without a reference genome, such as in forest tree species, extra approaches have also been used to ensure more accurate reads or contigs used for SNP prediction, such as blasting reads against the sequenced reference genome for filtering reads, performing SNP-associated contigs annotation, and using technical replicates to mitigate the genotyping errors [24][25][26].For example, a significantly higher proportion of good loci have been obtained using paired replicates in GBS for Fagus sylvatica and Quercus robur L. [26].However, little is known about the characteristics of identified SNPs resulting from the GBS approach in Masson pine with its un-sequenced, complex, and massive genome.
The specific objectives of this study were to (1) employ the GBS approach to generate a set of reliable SNPs in 299 Masson pine samples; (2) validate the de novo assembled contigs for SNP prediction by blasting against the well-assembled Pinus taeda and Pinus tabuliformis reference genomes and functional annotation; (3) illustrate the characteristics of SNPs identified through GBS application by the means of technical replicates; and (4) validate the informativeness of optimized SNPs with population genetic diversity analysis.These efforts aimed to generate a set of highly reliable SNP markers for Masson pine via GBS and reveal the characteristics of identified SNPs, as well as to provide more understanding of high-throughput molecular marker development using GBS in non-model tree species with large genomes.

Plant Material and DNA Extraction
The study material comprised 299 samples, including 293 open pollinated progenies from 65 selected families and an additional six parents with two replicates each from the advanced Masson pine (Pinus massoniana Lamb.) seed orchard in the Shanghang Baisha State-Owned Forest Farm in Fujian province, China.Information about the 299 sample collection is displayed in Table S1.There were 293 progenies from 65 families in 15 subpopulations from four local locations with a range of 1-25 individuals per family and the extra six parents.A set of 305 samples of young needles including six replicates were collected on 20-28 July 2019, sealed in a bag with full silica gel, and delivered to the lab in 4 °C containers, and then stored at −25 • C prior to DNA extraction.Genomic DNA was extracted from the dry needles using a modified cetyl trimethyl ammonium bromide (CTAB) method [27].The quality and quantity of DNA were measured using the NanoDrop 1000 spectrophotometer (ThermoFisher Scientific, Wilmington, DE 19810, USA).The final concentration of DNA was adjusted to 50 ng/µL for GBS library construction.

GBS Library Preparation and Sequencing
A GBS library of each sample was prepared according to the protocol described in the Illumina ® TruSeq ® Nano DNA LT Library Prep kits (FC-121-4001).The 200 ng of genomic DNA from each sample was restriction-digested in a total volume of 50 µL, containing 5 units each of EcoRV and ScaI-HF, as well as 5 µL NEB 10 × CutSmart Buffer (New England Biolabs, Ipswich, MA, USA).The reaction was incubated at 37 • C for 16 h for digestion and then heat-inactivated at 80 • C for 10 min.The enzyme digestion products were purified and recovered with magnetic beads in TruSeq ® Library Building Kit.Barcoded and common adapters were designed as described in Illumina ® TruSeq ® Nano DNA LT Library Preparation kits to complement the restriction overhangs created by EcoRV and ScaI-HF, respectively, and the overhangs resulting from fragmentation were converted into blunt ends using End Repair Mix 2. Following end repair, the appropriate library size of about 400-550 bp was selected using different ratios of the SPB (Sample Purification Beads).After adenylating 3 ends, each restriction-digested sample was then ligated to a unique 5 barcoded adapter and a common 3 adapter.The process of ligating adapters was performed to ligate multiple indexing adapters to the ends of the DNA fragments for sequencing use, and then clean up the ligated fragments.To enrich the ligated DNA and perform size selection, PCR amplification was performed using 10 µL of the ligated products' DNA, 25 µL of KAPA HiFi HotStart Ready Mix PCR Kit, and the common PCR1 and indexed PCR2 primers, respectively.The PCR primers used were specific to each adapter and comprised an Illumina index sequence and flow cell annealing complementary sequences.The ligated products were amplified with PCR by the following conditions: initial denaturation at 95 • C for 3 min; 8 cycles of denaturation at 98 • C for 15 s, annealing at 60 • C for 15 s, and extension at 72 • C for 30 s; and then final extension at 72 • C for 5 min.After electrophoresis, the PCR products of the library with the length of the inserted fragment in the target interval (500-550 bp) were cut and purified for subsequent sequencing according to the preset scheme in TIANGEN (DP209-02).
The quality of individual libraries and median fragment size were assessed on the Agilent Technologies 2100 Bioanalyzer using Qubit™ 1X dsDNA HS Assay Kits.Indexed DNA libraries were normalized to 10 nM in the DCT plate and then pooled in equal volumes in the PDP plate.The pooled library underwent a final 0.8X Ampure XP bead cleanup to remove any remaining residual fragments shorter than 500 bp.The concentration of the final bead-cleaned library was determined in preparation for sequencing.Sequencing was performed using the Illumina NovaSeq 6000 SP Reagent Kit v1.5 (500 cycles) for 150 bp paired-end sequencing (Illumina, 20028402) in LC-BIO Co., Ltd.(Hangzhou, China).

Sequence Quality Analysis and Filtering
Sequence reads were de-multiplexed by using the outer dual index barcode information and assigned to sequenced samples.Reads containing the correct restriction sites in R1 and R2 were obtained by searching restriction site sequences in the raw reads.Quality filtration was carried out as follows: the adapters were removed, the low quality reads with lengths less than 40 bp were removed, and the unstable bases in the read within the first 10 bp and the last 8 bp were trimmed using Trimmomatic v0.39 [28] and Fastp v0.23.1 pipelines [29].

Sequence Quality Analysis and Filtering
Sequence reads were de-multiplexed by using the outer mation and assigned to sequenced samples.Reads containing in R1 and R2 were obtained by searching restriction site sequen ity filtration was carried out as follows: the adapters were rem with lengths less than 40 bp were removed, and the unstable b first 10 bp and the last 8 bp were trimmed using Trimmomatic pipelines [29].

De Novo Assembly, Read Alignment, and SNP Calling
A set of 24 samples was used for de novo assembly using kmer-size ranging from 29 to 141bp and the optional de no kmer-size of 141 were selected.The obtained de novo assemb the Lobolly pine (v2.0) sequenced reference genome in the local NCBI (https://www.ncbi.nlm.nih.gov/assembly/GCA_000404(https://blast.ncbi.nlm.nih.gov/Blast.cgi)[32].The aligned asse ≧95% and length ≧141 bp) was then split into 40 subsets, each Pre-individual variants were called using a set of custom shell eratingBamFiles.sh"implemented with Bowtie2 v2.4.2 [33], BCFtools [35] for parallel runs for subset BAM files generatin nomic subsets before a joint genotype call over all 305 samples eral, the bowtie2-build command was applied to index the refe samtools faidx was used to deal with the indexed reference; then with the application of Bowtie2 v2.4.2 and SAMtools v1.11 pip speed up the multiple sample-specific SAM file construction p files, SAMtools v1.11 was used to create, sort, and index BAM on allele frequencies that were estimated with genotype likelih files, the SNP calling based on certain subset reference contigs out by using the following commands : angsd -bam bam.filelist -G 2 -doMajorMinor 1 -SNP_pval 1e-6 -doGeno 5 -doPost 1 -postCuto

SNP Filtering
A total of 305 samples with an extra six replicates for six The six replicates were processed with the 299 individuals at yses were conducted based on the filtering of missing data npGeno pipeline [37] to filter out the singleton, duplicated, a missing threshold <30%.The obtained SNPs with missing < PLINK v1.9 [38] and the minor allele frequency (MAF) was ca MAF > 0.01 were treated as the initial SNPs set, and the result million.By using the six sample replicates for genotyping e script was developed to label each SNP with a distinct genoty missing data (MD), locus error (LE), missing loci (ML), and m as good loci (GL); these are described in Table 1.The repx1SNP i both replicates in the paired replicates had consistent alleles b data among the selected pairs of replicates in certain repx filter the rep11SNP was obtained by the removal of all inconsistent d the rep61SNP was the selected good loci with consistent allele licates across six sets of replicates.Those SNP-associated contig well-assembled Chinese pine (Pinus tabuliformis

95% and length
Forests 2023, 14, x FOR PEER REVIEW Sequencing was performed using the Illum cycles) for 150 bp paired-end sequencing (Ill Zhou, China).

Sequence Quality Analysis and Filtering
Sequence reads were de-multiplexed b mation and assigned to sequenced samples. in R1 and R2 were obtained by searching rest ity filtration was carried out as follows: the a with lengths less than 40 bp were removed, a first 10 bp and the last 8 bp were trimmed usi pipelines [29].

De Novo Assembly, Read Alignment, and SN
A set of 24 samples was used for de nov kmer-size ranging from 29 to 141bp and th kmer-size of 141 were selected.The obtained the Lobolly pine (v2.0) sequenced reference g NCBI (https://www.ncbi.nlm.nih.gov/assem(https://blast.ncbi.nlm.nih.gov/Blast.cgi)[32] ≧95% and length ≧141 bp) was then split int Pre-individual variants were called using a s eratingBamFiles.sh"implemented with Bow BCFtools [35] for parallel runs for subset BA nomic subsets before a joint genotype call ov eral, the bowtie2-build command was applied samtools faidx was used to deal with the index with the application of Bowtie2 v2.4.2 and SA speed up the multiple sample-specific SAM files, SAMtools v1.11 was used to create, sort on allele frequencies that were estimated wit files, the SNP calling based on certain subset out by using the following commands : angsd 2 -doMajorMinor 1 -SNP_pval 1e-6 -doGeno 5 -

SNP Filtering
A total of 305 samples with an extra six The six replicates were processed with the 2 yses were conducted based on the filtering npGeno pipeline [37] to filter out the single missing threshold <30%.The obtained SNP PLINK v1.9 [38] and the minor allele frequen MAF > 0.01 were treated as the initial SNPs s million.By using the six sample replicates script was developed to label each SNP with missing data (MD), locus error (LE), missing as good loci (GL); these are described in Table both replicates in the paired replicates had c data among the selected pairs of replicates in the rep11SNP was obtained by the removal of the rep61SNP was the selected good loci with licates across six sets of replicates.Those SN well-assembled Chinese pine (Pin 141 bp) was then split into 40 subsets, each containing ~20,000 contigs.Pre-individual variants were called using a set of custom shell scripts named "parallelGeneratingBamFiles.sh"implemented with Bowtie2 v2.4.2 [33], Samtools v1.11 [34], and BCFtools [35] for parallel runs for subset BAM files generating separately on the 40 genomic subsets before a joint genotype call over all 305 samples using ANGSD [36].In general, the bowtie2-build command was applied to index the reference contigs first, then the samtools faidx was used to deal with the indexed reference; then, the custom program cycle with the application of Bowtie2 v2.4.2 and SAMtools v1.11 pipeline was run in parallel to speed up the multiple sample-specific SAM file construction process.Based on the SAM files, SAMtools v1.11 was used to create, sort, and index BAM files for SNP calling.Based on allele frequencies that were estimated with genotype likelihoods from the sorted.bamfiles, the SNP calling based on certain subset reference contigs using ANGSD was carried out by using the following commands: angsd -bam bam.filelist -GL 2 -out gatk_outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doGeno 5 -doPost 1 -postCutoff 0.95.

SNP Filtering
A total of 305 samples with an extra six replicates for six samples were sequenced.The six replicates were processed with the 299 individuals at the same time.Initial analyses were conducted based on the filtering of missing data using Further_deletion.sh in npGeno pipeline [37] to filter out the singleton, duplicated, and homogenous loci with missing threshold < 30%.The obtained SNPs with missing < 10% were formatted using PLINK v1.9 [38] and the minor allele frequency (MAF) was calculated.Only the loci with MAF > 0.01 were treated as the initial SNPs set, and the resulting SNPs for use were 0.67 million.By using the six sample replicates for genotyping error detection, a custom R script was developed to label each SNP with a distinct genotyping error label, including missing data (MD), locus error (LE), missing loci (ML), and missing allele (MA), as well as good loci (GL); these are described in Table 1.The repx1SNP included the SNPs in which both replicates in the paired replicates had consistent alleles but there were nonmissing data among the selected pairs of replicates in certain repx filtering scenarios.For example, the rep11SNP was obtained by the removal of all inconsistent data in one pair of replicates; the rep61SNP was the selected good loci with consistent alleles between each pair of replicates across six sets of replicates.Those SNP-associated contigs were blasted against the well-assembled Chinese pine (Pinus tabuliformis) reference genome (https://db.cngb.org/search/project/CNP0001649/,accessed on 2 March 2022) [6] to remove contigs with identity < 95% and multi-location hits against the reference genome (Blasted).The remaining SNP-associated contigs were further searched for validation with functional annotation (Annotated).Another filtering was considered to delete the loci with less than four samples with a minimum number of minor genotypes per loci (minimum minor genotype count ≥ 4), and different MAF values at different missing levels ranged from 0 (M0), 5% (M5) to 10% (M10).Extra filtering was performed to exclude all SNPs within 35 bp distance of those markers.The remaining SNPs were obtained for statistics summary.As each SNP loci had a distinct genotyping error label, the SNP count according to genotyping error tags in different filtering scenarios could be determined accordingly.To simply mark the custom SNP filtering scenario, seven parameters were included as follows: missing level (M), minor allele frequency (MAF), minimum minor genotype count (mC), different paired replicates for SNP filtering (repx), blasted against Chinese pine reference genome (blast), SNP-associated contigs annotation (Annot), and 35bpFiltering (35 bp).A certain SNP filtering strategy was the combination of the above several parameters labeled as SNP filtering scenario.For example, an SNP dataset with missing < 5% (M5), minor allele frequency (MAF) > 0.05 (F05), the minimum minor genotype count ≥ 4 (C4), using six paired replicates (rep6), blasted (blast), annotation contigs (Annot), and 35 bp distance filtered (35 bp) for SNP filtering was recorded as a "rep6M5F05C4blastAnnot35bp" filtering scenario.

SNP-Associated Contig Validation
BLAST research for the SNP-associated contig in the 0.67 million SNPs was performed, taking advantage of the well-assembled 25.4 Gb chromosome-level assembly of Chinese pine downloaded from GSA (https://db.cngb.org/search/project/CNP0001649/,accessed on 2 March 2022) [6].Those contigs with identity Forests 2023, 14, x FOR PEER REVIEW Sequencing was performed using the Illumina NovaSeq 6000 cycles) for 150 bp paired-end sequencing (Illumina, 20028402) in Zhou, China).

Sequence Quality Analysis and Filtering
Sequence reads were de-multiplexed by using the outer d mation and assigned to sequenced samples.Reads containing t in R1 and R2 were obtained by searching restriction site sequenc ity filtration was carried out as follows: the adapters were remov with lengths less than 40 bp were removed, and the unstable ba first 10 bp and the last 8 bp were trimmed using Trimmomatic v0 pipelines [29].

De Novo Assembly, Read Alignment, and SNP Calling
A set of 24 samples was used for de novo assembly using M kmer-size ranging from 29 to 141bp and the optional de nov kmer-size of 141 were selected.The obtained de novo assembly the Lobolly pine (v2.0) sequenced reference genome in the local d NCBI (https://www.ncbi.nlm.nih.gov/assembly/GCA_00040406(https://blast.ncbi.nlm.nih.gov/Blast.cgi)[32].The aligned assem ≧95% and length ≧141 bp) was then split into 40 subsets, each c Pre-individual variants were called using a set of custom shell sc eratingBamFiles.sh"implemented with Bowtie2 v2.4.2 [33], S BCFtools [35] for parallel runs for subset BAM files generating nomic subsets before a joint genotype call over all 305 samples u eral, the bowtie2-build command was applied to index the refere samtools faidx was used to deal with the indexed reference; then, with the application of Bowtie2 v2.4.2 and SAMtools v1.11 pipe speed up the multiple sample-specific SAM file construction pr files, SAMtools v1.11 was used to create, sort, and index BAM fi on allele frequencies that were estimated with genotype likeliho files, the SNP calling based on certain subset reference contigs u out by using the following commands : angsd -bam bam.filelist -G 2 -doMajorMinor 1 -SNP_pval 1e-6 -doGeno 5 -doPost 1 -postCutoff

SNP Filtering
A total of 305 samples with an extra six replicates for six The six replicates were processed with the 299 individuals at th yses were conducted based on the filtering of missing data us npGeno pipeline [37] to filter out the singleton, duplicated, an missing threshold <30%.The obtained SNPs with missing <10 PLINK v1.9 [38] and the minor allele frequency (MAF) was calc MAF > 0.01 were treated as the initial SNPs set, and the resultin 95% and unique position hit were selected as the reference-located contigs.Those reference-located contigs were then used to extract the reference-located-SNP from the 0.67 million SNPs.Those reference-located-SNPassociated contigs were further used for functional annotation analysis.The reference-located-SNP-associated contigs that might putatively encode proteins were searched against the nonredundant protein database at the NCBI (National Center for Biotechnology Information) (USA) with minimum E-value of <1.0 × 10 −6 as the threshold to extract the Gene Ontology (GO) terms associated with the blasted hits using the program Blast2GO [39].The three major GO terms, cellular component (CC), biological process (BP), and molecular function (MF) were also determined with the e-value hit filter < 1.0 × 10 −6 .In a final step, details of the pathway annotation with the reference-located-SNP-associated contigs were produced using the KEGG (Kyoto Encyclopedia of Genes and Genomes) database accessed on 12 March 2022 [40].

Diversity Analyses
To evaluate the impact of obtained SNPs with different filtering scenarios in genetic diversity, the genetic associations of the 299 Masson pine samples were assessed using principal component analysis from the R program of AveDissR v6 [41,42], and the PCoA plots of the first three resulting principal components were made based on the obtained SNPs in different filtering scenarios.For individual comparison, the resulting PCoA plots were individually labeled with respect to the sample's original source group.To investigate the impact of missing SNP data on genetic diversity analysis, another PCoA analysis was performed with the MAF > 0.05 at different missing levels of 0, 5%, and 10% at rep6F05C4blastAnnot35bp filtering scenarios.

High Throughput Sequencing and Assembly
Following the workflow outlined in Figure S1, this study generated a total of 9.33 × 10 9 clean sequence reads in 610 FASTQ files.The average number of reads after cleaning was obtained with 1.52 × 10 7 reads/sample, ranging from 0.6 × 10 7 to 7.6 × 10 7 reads/sample.It was found that the replicated pairs with higher read numbers generated slightly more SNPs; however, larger differences in the numbers of reads between individuals within the pair led to a slight decrease in the quantity of data obtained per pair.
A set of 24 paired FASTQ files with sequence reads ranging between 2.05 × 10 7 to 7.63 × 10 7 /sample were de novo assembled using MEGAHIT.To select a set of reliable contigs as a de novo assembly for SNP calling, the de novo assembled contigs were first blasted against the sequenced genome of Lobolly pine (Pinus taeda) and those contigs with identity < 95% and length < 141 bp were filtered out.The remaining 843,351 contigs were used as a de novo assembly for SNP discovery.In the polished de novo assembly, a majority of 98.93% of the contigs had a length ranging between 150 and 1100 bp, with an average of 329 bp.

SNP Calling and Filtering
All 305 samples (including each replicate of six samples) were used for SNP calling using custom shell scripts of ref-ANGSD pipeline based on the polished 843,351 contigs.The obtained 0.67 million raw SNP data from 265,525 contigs were filtered with minor allele frequency (MAF) > 0.01 and SNP calling rate > 90% firstly based on PLINK v1.9 formatted data.A statistical summary of SNP count under varied SNP filtering strategies is illustrated in detail in Table 2.
The SNP and contig counts with and without replicate filtering in all the SNP filtering processes are displayed in Table 2.It was found that different filtering steps resulted in a reduced number of SNPs and contigs, which meant that more rigorous filtering was being carried out.For example, the blasting and annotating filtering at rep0M10F01 with missing < 10% (M10), MAF > 0.01 (F01) resulted in a sharp decrease from 6,739,240 to 2,199,317 in total SNP count.With further filtering at rep0M10F05C4blastAnnot, the SNP count was decreased from 2,199,317 to 627,362.The custom parameter settings on MAF, mC at different missing levels at rep0F05C4blastAnnot resulted in remaining SNPs with the proportions of 9.31% (627,362; M10), 3.63% (244,948; M5), and 0.10% (6787; M0).Obviously, the blasting, annotating, and custom parameters filtering removed a majority of 90.69% (6,739,240 vs. 627,362) or more un-reliable SNPs from the raw SNP sets in the rep0 scenario.There was only a small proportion of 1.08% (6787) without missing data in the M10 (627,362) SNP set at rep0M10F05C4blastAnnot.The small proportion of 1.08% of SNPs with nonmissing data showed that the identified 627,362 SNPs in M10 were generally harbored with missing data in the Masson pine GBS application.Along with the missing level in the identified SNPs reduced from M10 and M5 to M0 at the rep0F05C4blastAnnot scenario, the contigs were sharply decreased from 20,554 and 15,434 to 2312, respectively.A steep reduction of 13,122 (85.02%) in SNP-associated contigs was observed when the missing level decreased from M5 to M0 at rep0F05C4blastAnnot, which meant a sharp shrinking of the SNP coverage in the de novo assembly.After the exclusion of all SNPs within a 35 bp distance in each set of SNPs in M10, M5, and M0, the remaining SNPs were reduced to 9780, 14,612, and 2942 at rep0F05C4blastAnnot35bp, respectively.Note: a: rep01SNP-The rep01SNP was obtained by the removal of all missing data across 12 replicates; b: rep61SNP-The rep61SNP was the selected good loci with consistent alleles between each pair of replicates across six samples; c: The numbers in brackets were the SNP-associated contigs count; d: The filtering strategy at missing <10%, MAF > 0.01, and minimum minor genotype count >= 4; e: Blasting-Blasting against the sequenced genome of Chinese pine and filtering out the SNP-associated contigs with identity < 95% and multi-locus location hits.
The effects of technical replicates in obtained SNP counts are also displayed in Table 2.The initial 6,739,240 SNPs dataset at rep0M10F01 was used for paired replicates filtering.A notable decrease in SNP count was observed among the three datasets from 6,739,240 (100%) to 5,420,678 (80.43%) and 2,626,541 (38.97%) in rep01SNP, rep11SNP, and rep61SNP, respectively.At the same time, the contigs decreased from 265,525 (100%) and 256,986 (98.78%) to 204,433 (76.99%) in rep01SNP, rep11SNP, and rep61SNP, respectively.The combination of blasting, annotation, and six replicates filtering resulted in a proportion of 12.53% (844,311 vs. 6,739,240) of SNPs being retained and 19.01%(50,447 vs. 265,525) of contigs being retained at the rep6M10F01blastAnnot scenario.With further filtering with MAF and missing and minimum minor genotype counts in each SNP set, the SNPs decreased from 844,311 to 95,115 at the rep6M10F05C4blastAnnot scenario (Table 2).Different missing levels at rep6F05C4blastAnnot resulted in 95,115, 60,143, and 2532 SNPs in M10, M5, and M0, respectively.The removal of all SNPs within a 35 bp distance retained 26,680, 20,055, and 1641 SNPs in M10, M5, and M0, respectively, at rep6F05C4blastAnnot35bp. Compared with the identified SNPs in the two scenarios at rep0F05C4blastAnnot35bp and rep6F05C4blastAnnot35bp with different missing levels, the SNP count with missing in the rep6 scenario obtained more markers than in the rep0 scenario.For example, there were 9780 vs. 26,680 in M10 and 14,612 vs. 20,055 in M5, due to a longer distance between markers resulting from the removal of inconsistent loci between each pair of replicates in the rep6 scenario.Consequently, more SNP were retained in rep6 than in the rep0 scenario.
Statistics were conducted to illustrate the distribution of the identified 20,055 SNPs and related SNP-associated contigs on each of twelve chromosome-level linkage groups and contigs of the Chinese pine reference genome at rep6M5F05C4blastAnnot35bp (Table 3).Except for the seventh linkage group, evenly distributed SNPs with unique locations were observed across the 11 chromosome linkage groups and contigs, which indicated that the identified SNPs were detected across the whole genome.Further investigation of the SNPs in the seventh linkage group showed that there were no unique hits, but rather two or more hits, recorded for the SNP-associated contigs located in this group.Consequently, no SNP with unique location was obtained in the seventh linkage group.

Characterization of Identified SNPs
Based on the initial 0.67 million SNPs, the statistics of the SNP counts at different missing levels of M0, M5, and M10 within different filtering steps are illustrated in Table 4.A greatly reduced proportion of filtered SNPs was observed along with the QC steps at different missing levels.For example, the proportions of filtered SNPs decreased from the largest of 100%, 71.96%, and 53.21% to the smallest of 9.95%, 4.61%, and 2.96% in the M0, M5, and M10, respectively (Table 4).The notably reduced proportions of the SNP counts were seen in the rep6 replicates filtering process with proportions decreased more than 61.95%, 45.24%, and 24.44% in the M0, M5, and M10, respectively.Interestingly, the smallest reduction proportions were seen in the Blasted filter process based on the rep6 replicates filtered GL SNP results.
Simultaneously, the key indicators of filtered SNPs in each filtering step in rep0, rep0blast, and rep0blastAnnot at F05C4 were counted according to the genotyping error label for each SNP locus.Generally, the proportion of GL decreased as the missing levels increased from M0 to M10 at F05C4 in different filtering steps.The missing allele (MA) and missing locus (ML) types always occupied the largest proportion of genotyping error types within each of the identified SNPs.For example, the two genotyping error types of MA and ML occupied the very large proportions of 73.77% and 82.17% in M5 and M10 at rep0F05C4, respectively.After the Blasted and Annotated filtering, the two genotyping error types of MA and ML remained at the very large proportions of 74.71% and 82.39% in M5 and M10 at rep0F05C4blastAnnot.Meanwhile, the proportion of GL within each identified SNP set remained stable around 25.47%~24.55%and 15.31%~15.16% in M5 and M10, respectively.first 10 bp and the last 8 bp were trimmed usi pipelines [29].

De Novo Assembly, Read Alignment, and S
A set of 24 samples was used for de nov kmer-size ranging from 29 to 141bp and th kmer-size of 141 were selected.The obtained the Lobolly pine (v2.0) sequenced reference g NCBI (https://www.ncbi.nlm.nih.gov/assem(https://blast.ncbi.nlm.nih.gov/Blast.cgi)[32] ≧95% and length ≧141 bp) was then split int Pre-individual variants were called using a s eratingBamFiles.sh"implemented with Bow BCFtools [35] for parallel runs for subset BA nomic subsets before a joint genotype call ov eral, the bowtie2-build command was applied samtools faidx was used to deal with the index with the application of Bowtie2 v2.4.2 and SA speed up the multiple sample-specific SAM files, SAMtools v1.11 was used to create, sor on allele frequencies that were estimated wi files, the SNP calling based on certain subset out by using the following commands : angsd 2 -doMajorMinor 1 -SNP_pval 1e-6 -doGeno 5 -

A total of 305 samples with an extra si
The six replicates were processed with the 2 yses were conducted based on the filtering npGeno pipeline [37] to filter out the single missing threshold <30%.The obtained SNP PLINK v1.9 [38] and the minor allele frequen MAF > 0.01 were treated as the initial SNPs million.By using the six sample replicates script was developed to label each SNP with missing data (MD), locus error (LE), missing as good loci (GL); these are described in Table both replicates in the paired replicates had c data among the selected pairs of replicates in the rep11SNP was obtained by the removal of the rep61SNP was the selected good loci wit licates across six sets of replicates.Note: * biallelic SNPs; a: GL-good loci, MA-missing allele, LE-locus error, ML-missing loci, MD-missing data; ** rep01SNP refers to the remaining SNPs after the removal of any missing data in selected six samples' genotypes; *** rep6sSNP refers to the GL SNPs in the corresponding filtering steps using the six pair of replicates filtering.
Specific allele distribution with respect to the minor allele frequency is illustrated in Figure 1 at M5C4 with rep0 (A), rep1 (B), rep6 (C), rep6blast (D), rep6blastAnnot (E), and rep6blastAnnot35bp (F).Interestingly, more rigorous filtering altered the distribution of minor allele frequency among the six datasets from similar patterns of U shape (A, B) to reverse L shape (C, D, E, F) in SNP count numbers.The reverse L shape patterns observed in the last four datasets indicated that a higher proportion of heterozygous loci were retained and more low-frequency minor alleles were removed in these SNP datasets.

Functional Analysis of SNP-Associated Contigs
A total of 265,525 SNP-Associated contigs for 6,739,240 SNPs prediction at M10F01 were blasted against the Chinese pine reference genome and there were 159,372 contigs with identity ≥95% and unique location in the reference genome for functional annotation.The length distributions of the searched 159,372 contigs are displayed in Figure 2A.The aligned SNP-Associated-blasted 159,372 contigs were searched against the nr (NCBI nonredundant protein sequences database) via BLAST with a minimum E-value of < 1.0 × e −6 as a similarity threshold (Table S2).There were 60,038 contigs corresponding to known protein sequences with a proportion of 37.67% annotated among the 159,372 contigs.

Functional Analysis of SNP-Associated Contigs
A total of 265,525 SNP-Associated contigs for 6,739,240 SNPs prediction at M10F01 were blasted against the Chinese pine reference genome and there were 159,372 contigs with identity ≥ 95% and unique location in the reference genome for functional annotation.The length distributions of the searched 159,372 contigs are displayed in Figure 2A.The aligned SNP-Associated-blasted 159,372 contigs were searched against the nr (NCBI nonredundant protein sequences database) via BLAST with a minimum E-value of <1.0 × 10 −6 as a similarity threshold (Table S2).There were 60,038 contigs corresponding to known protein sequences with a proportion of 37.67% annotated among the 159,372 contigs.A total of 17,396 contigs were searched in GO annotation with a proportion of 10.92% annotated.The functional annotations resulted in 50 GO terms (Figure S2A).These 50 GO terms were further classified into three functional categories such as cellular component (CC, 16 GO terms), molecular function (MF, 11 GO terms), and biological process (BP, 23 GO terms).Some contigs matched with more than one GO term, whereas a few matched A total of 17,396 contigs were searched in GO annotation with a proportion of 10.92% annotated.The functional annotations resulted in 50 GO terms (Figure S2A).These 50 GO terms were further classified into three functional categories such as cellular component (CC, 16 GO terms), molecular function (MF, 11 GO terms), and biological process (BP, 23 GO terms).Some contigs matched with more than one GO term, whereas a few matched only one GO term.The three most predominant GO subcategories in the CC category were associated with cell (category I; GO:0005623) with 3984 contigs, cell parts (category II; GO: 0044464) with 3981 contigs, and organelle (category III; GO: 0043226) with 3635 contigs.The two most predominant GO subcategories in the MF category were associated with function activity (category I, GO: 0003824) with 10,897 contigs and binding (category II, GO: 0005488) with 10,879 contigs.The two most predominant GO subcategories in the BP category were associated with metabolic process (category I; GO: 0008152) with 12,942 contigs and cellular process (category II, GO: 0009987) with 12,220 contigs.
Analysis of KEGG pathway details from annotation results showed that a total of 3694 contigs were involved in five categories and 19 subtypes (Figure S1).Based on the greatest number of contigs identified in each functional category, the largest functional category detected most often was the Metabolism category, which involved the largest number of genes and was divided into 11 subcategories.Among them, the largest two were annotated in the global and overview maps (3157 contigs) and nucleotide metabolism (1793 contigs).Another high number of genes was detected in the category of Genetic Information Processing, with the high number of 919 genes involved in transcription folding, 439 genes involved in translation folding, and so on.There were small numbers of genes detected in the other three categories of Environmental Information Processing, Cellular Processes, and Organismal Systems.Those annotated contigs would be a set of valuable genetic resources to stimulate Masson pine genomic research in the future.

Patterns of Genetic Relationship in Obtained SNP Sets
The impacts on the genetic analysis based on the obtained SNPs from different filtering scenarios were analyzed to explore the extent and influences displayed in downstream genetic analysis using different filtering scenarios.According to the original background of 299 samples from four local populations displayed in Table S1, the clustering patterns among samples were used to reveal the precise genetic background in the genetic structure analysis.The impact of missing levels in the SNPs set was evaluatedbased on the PCoA analysis using the obtained three sets of SNPs at repxblastAnnot35bp in 299 Masson pine samples with different missing levels, 0 (M0), 5% (M5), and 10% (M10).The PCoA plots resulting fromSNPs with different missing levels illustrated that the removal of all missing data from the SNPs would result in problematic problem on the relationship inference compared to the patterns inferred with missing data of 5% (M5) and 10% (M10) (Figure 3), as four more distinct clusters according to the four local sample locations were obtained in the M5 and M10 SNP sets than in the M0 SNP set.However, the cluster patterns in the M5 and M10 sets seem similar to each other.
were obtained in the M5 and M10 SNP sets than in the M0 SNP set.However, the cluster patterns in the M5 and M10 sets seem similar to each other.S1.
The effects of identified SNPs in downstream genetic analysis in different QC steps were illustrated in PCoA plots based on the obtained six sets of SNPs in rep0, rep1, rep6 and rep6blast, rep6blastAnnot, and rep6blastAnnot35bp at M5F05 (Figure 4A-F).The same signal of individuals in the PCoA plots represents the same original subpopulation of families which would indicate more common background retained within the original subpopulations.It was found that the open pollination of parents from four local populations in the advanced seed orchard had resulted in a heterozygous genetic background among the 293 progenies from 65 families, seen as the same color signal scattering among the four clusters in the PCoA plots.When the number of replicates increased from zero, one to six, a more reasonable four distinct gathering patterns according with the four local populations of the samples' original sources were observed in the PCoA plots based on the obtained SNPs from the rep0 to rep6 filtering scenarios (Figure 4A-C).Four more stable clustering patterns were observed along with the QC procedure from rep0 to rep6blas-tAnnot35bp at M5F05.A little more overall compacted but more distinct separation among different color signals in the gathering patterns was displayed in the PCoA plots in Figure 4C-F.According to the signal clustering patterns, the PCoA plots based on SNPs in the rep6blastAnnot35bp filtering scenario displayed more useful information in revealing the heterozygous genetic background of samples, as more variations that were displayed among individuals both in the same color signals with more gathering patterns and different color signals with more distinct separation and less overlapping patterns (Figures 4F and S3F).Furthermore, to display the usefulness of the QC procedure for the optimization of identified SNPs resulting from different filtering steps, the two subpopulations of YQ and ZB were taken as an example and the PCoA plot was highlighted in Figure S3 based on the same QC procedure from rep0 to rep6blastAnnot35bp at M5F05.It was found that more variations were observed with the scattering patterns within and among subpopulations than in the rep6 scenario (Figure S3F,C).The scattering trend of the highlighted samples in the YQ and ZB subpopulations developed along with the SNP sets ranged from rep6, rep6blast, rep6blastAnnot, and rep6blastAnnot35bp step by step (Figure S3C-F).Overall, those clustering patterns illustrated that it was helpful for facilitating the improvement in the reliability of SNPs to conduct the filtering strategy of blasting, annotation, and the removal of SNPs within a 35 bp distance in SNP filtering, as well as the custom filtering parameter settings.S1.
The effects of identified SNPs in downstream genetic analysis in different QC steps were illustrated in PCoA plots based on the obtained six sets of SNPs in rep0, rep1, rep6 and rep6blast, rep6blastAnnot, and rep6blastAnnot35bp at M5F05 (Figure 4A-F).The same signal of individuals in the PCoA plots represents the same original subpopulation of families which would indicate more common background retained within the original subpopulations.It was found that the open pollination of parents from four local populations in the advanced seed orchard had resulted in a heterozygous genetic background among the 293 progenies from 65 families, seen as the same color signal scattering among the four clusters in the PCoA plots.When the number of replicates increased from zero, one to six, a more reasonable four distinct gathering patterns according with the four local populations of the samples' original sources were observed in the PCoA plots based on the obtained SNPs from the rep0 to rep6 filtering scenarios (Figure 4A-C).Four more stable clustering patterns were observed along with the QC procedure from rep0 to rep6blastAnnot35bp at M5F05.A little more overall compacted but more distinct separation among different color signals in the gathering patterns was displayed in the PCoA plots in Figure 4C-F.According to the signal clustering patterns, the PCoA plots based on SNPs in the rep6blastAnnot35bp filtering scenario displayed more useful information in revealing the heterozygous genetic background of samples, as more variations that were displayed among individuals both in the same color signals with more gathering patterns and different color signals with more distinct separation and less overlapping patterns (Figures 4F and S3F).Furthermore, to display the usefulness of the QC procedure for the optimization of identified SNPs resulting from different filtering steps, the two subpopulations of YQ and ZB were taken as an example and the PCoA plot was highlighted in Figure S3 based on the same QC procedure from rep0 to rep6blastAnnot35bp at M5F05.It was found that more variations were observed with the scattering patterns within and among subpopulations than in the rep6 scenario (Figure S3C,F).The scattering trend of the highlighted samples in the YQ and ZB subpopulations developed along with the SNP sets ranged from rep6, rep6blast, rep6blastAnnot, and rep6blastAnnot35bp step by step (Figure S3C-F).Overall, those clustering patterns illustrated that it was helpful for facilitating the improvement in the reliability of SNPs to conduct the filtering strategy of blasting, annotation, and the removal of SNPs within a 35 bp distance in SNP filtering, as well as the custom filtering parameter settings.S1.

Discussion
GBS is considered as one of most cost-effective and powerful approaches to develop high-throughput SNPs for non-model tree species without a reference genome [15,16,43,44].To date, the genetic resources for SNP discovery in Masson pine remain limited [45].In this study, we set out to develop a set of useful genetic resources and genomewide SNPs of Masson pine in a cost-and time-efficient manner using the GBS approach.We selected the combination of EcoRV and ScaI-HF enzymes to sample subsets of genome in Masson pine GBS to develop genetic resources and perform more accurate SNP discovery in 299 Masson pines in the absence of a reference genome.Considering the possible problem of genotyping errors in GBS, the SNP quality control tools were applied to deal with the precision and reliability of the identified SNPs by the combined QC strategies of Blasted, Annotated, technical replicates, as well as custom filtering parameter settings in SNP call rate, MAF and mC [23].Those QC processes filtered most of the unreliable SNPs and improved the downstream genetic structure analysis illustrated with 299 individuals in PCoA analysis (Figures 4 and S3).The application of GBS in 299 Masson pine samples generated 20,055 SNPs and 159,372 contigs as a set of reliable SNPs and informative genetic resources for Masson pine.The Blasted against related databases and the homologous reference genomes of Pinus taeda and Pinus tabuliformis revealed alignments with lengths of roughly 26.09Mb.The validated 60,038 functional-associated contigs can be used as informative genetic resources in Masson pine breeding.These efforts are available to stimulate more reliable, confident, and high-throughput SNP discovery in Masson pine, as well as in the tree species with large genomes, using GBS approach.
It is reported that the Pinus ssp genomes have sizes of more than 22-32 Gbp, of which more than 80% are repetitive sequences and hypermethylated [46,47].Those repetitive sequences cause ambiguous assembly of paralogous loci and thus genotyping errors have occurred in SNP identification via GBS [19][20][21] S1.

Discussion
GBS is considered as one of most cost-effective and powerful approaches to develop high-throughput SNPs for non-model tree species without a reference genome [15,16,43,44].To date, the genetic resources for SNP discovery in Masson pine remain limited [45].In this study, we set out to develop a set of useful genetic resources and genome-wide SNPs of Masson pine in a cost-and time-efficient manner using the GBS approach.We selected the combination of EcoRV and ScaI-HF enzymes to sample subsets of genome in Masson pine GBS to develop genetic resources and perform more accurate SNP discovery in 299 Masson pines in the absence of a reference genome.Considering the possible problem of genotyping errors in GBS, the SNP quality control tools were applied to deal with the precision and reliability of the identified SNPs by the combined QC strategies of Blasted, Annotated, technical replicates, as well as custom filtering parameter settings in SNP call rate, MAF and mC [23].Those QC processes filtered most of the unreliable SNPs and improved the downstream genetic structure analysis illustrated with 299 individuals in PCoA analysis (Figures 4 and S3).The application of GBS in 299 Masson pine samples generated 20,055 SNPs and 159,372 contigs as a set of reliable SNPs and informative genetic resources for Masson pine.The Blasted against related databases and the homologous reference genomes of Pinus taeda and Pinus tabuliformis revealed alignments with lengths of roughly 26.09Mb.The validated 60,038 functional-associated contigs can be used as informative genetic resources in Masson pine breeding.These efforts are available to stimulate more reliable, confident, and high-throughput SNP discovery in Masson pine, as well as in the tree species with large genomes, using GBS approach.
It is reported that the Pinus ssp genomes have sizes of more than 22-32 Gbp, of which more than 80% are repetitive sequences and hypermethylated [46,47].Those repetitive sequences cause ambiguous assembly of paralogous loci and thus genotyping errors have occurred in SNP identification via GBS [19][20][21].With the help of blasting against available genetic resources of closely relative well-assembled reference genomes [16,43,44], the precision of target contigs could be improved much to facilitate the reliability of SNP discovery in the species under study [48].In this study, we applied two pines' sequenced reference genomes (Pinus taeda and Pinus tabuliformis) for the correction of the obtained de novo assembly in Masson pine with identity >= 95% to ensure more accurate contigs were used for SNP discovery (Table 2).Both the two reference genomes were helpful to identify the SNPs across the genome.Our blasting with SNP-associated contigs against the sequenced Pinus tabuliformis reference genome provided more repeatability and reliability for the identification of SNPs, as well as a set of available genetic resources for Masson pine genomic research.
Peculiarities of SNPs with high missing data are a common concern in GBS application [19].A custom choice of disregarding the missing SNPs with SNP calling rates lower than 80% across all the assayed samples was performed to deal with the missing data in GBS application.However, filtering out all the missing data from the identified SNPs dataset seemed to not be a good solution to deal with missing data for downstream analysis according to our study, as a problematic problem on the clustering pattern was observed in the M0 dataset compared to the patterns in M5 and M10 (Figure 3).The empirical data analysis in our study showed that the removal of all missing data from M5 to M0 resulted in a substantial loss of 18,414 (91.82%) reliable SNPs (20,055 vs. 1641) and a decrease of a majority proportion of 88.56% of the contigs (10,712 vs. 1225) at rep6F05C4blastAnnot35bp filtering scenario (Table 4), which indicated that the balance between the missing data and the reliable SNPs, along with the contigs genome coverage representation, should be evaluated in the SNP filtering.Nevertheless, to obtain informative SNPs, high-quality DNA must be prepared first to avoid a large proportion of missing data in tree GBS application [25].
The genotyping errors, inconsistent alleles detected with the help of paired replicates, are universally known in molecular marker development [19][20][21]24,49,50].The use of technical replicates facilitated the detection and evaluation of genotyping errors in genetic data within different filtering steps to characterize the identified SNPs in this study.The count number based on the key indicators of the filtered SNPs in different filtering steps illustrated the detailed characteristics of identified SNPs in GBS application in Masson pine (Table 4).With the help of technical replicates, the detection and selection of GL from the identified SNPs were more feasible in GBS application.The detected genotyping errors types of MA and ML displayed high proportions of 74.14% and 82.39% in M5 and M10 at F05C4 in the rep0balstAnnot scenario, respectively (Table 4).Those genotyping errors retained in the identified SNPs would heavily bias the inference of genetic analysis [19,20].For example, the patterns of PCoA plots based on two sets of SNPs (rep0 vs. rep6) displayed notable differences (Figure 4A,C), which prompted a great concern to address the precision and reliability of SNPs identified through the GBS approach in massive-genome tree species.Thus, great concern should be addressed to the optimization and monitoring of genotyping errors in large-genome forest tree GBS applications prior to sequencing experiment design.
Much higher genotyping error rates were observed in the obtained SNPs set in this study of Masson pine GBS application compared to GBS applications in forest tree species with small genomes [24,26].For example, a high proportion of 47.14% genotyping errors on MA between replicates was detected at rep0M5F05C4blastAnnot in this study.However, Mastretta-Yanes et al. [24] found that only a small fluctuation between 5.9% and 8.8% of alleles were not concordantly called between replicates, based on the optional parameter settings in the de novo assembly in Berberis alpina.Another study found that a wider range of 1.96% to 22.66% genotyping error rates was observed in de novo assembly-based SNP calling with the help of replicates in Fagus sylvatica and Quercus robur L. using three reduced-genome genotyping approaches [26].The difference between our study and these reported genotyping error rates would be due to the structure of the genomes related to genome size and genome complexity in the four forest tree species [24,26].It is well known that the Pinus ssp have huge genomes of more than 22 GB (Chinese pine 25.4 GB vs. Loblolly pine 22.1 GB) and those genomes are filled with repetitive DNA sequences [6,47,48], while a much smaller genome compared to Pinus ssp is observed in the sequenced genome of Q. robur L. [51].These characterizations of identified SNPs in the Masson pine GBS application provide more understanding to identify high-throughput SNPs based on GBS in non-model forest trees with a large genome.
Our analysis also focused on the reliability of identified SNPs in Masson pine and a series of quality control (QC) procedures were performed for the optimization (Tables 2 and 4).Converting the raw GBS sequences into high-throughput SNP markers involved a number of steps, each of which contributed to the accuracy of the final genotype calls [20].In this study, along with the QC procedures, most of the SNPs from the raw SNPs set were filtered out.For example, the count was reduced from 4,339,365 SNPs in 263,822 contigs at rep0F05 to 20,055 SNPs in 10,712 contigs at rep6F05C4blastAnnot35bp with only a small proportion of 0.46% SNPs retained (Table 2); however, more informative relationships were illustrated in the resulting small number of 20,055 SNP sets (Figures 4 and S3).Obviously, the searching and filtering using different QC processes including technical replicates, Blasted and Annotated on the consistent SNPs, and reliable contigs checking had removed a majority of the unreliable SNPs and contigs from the SNPs and SNP-associated contigs; consequently, the precision and reliability of the identified SNPs in the remaining SNPs set and contigs should be much improved.
The informativeness of the resulting SNPs sets from the QC procedure was displayed in the PCoA plots according to the 299 individuals' genetic background from four local populations.The open pollination of the sampled 65 families in the advanced seed orchard had resulted in a heterozygous genetic background, as illustrated by the scattering pattern of the same color signals in the resulting four clusters in the PCoA plots (Figure 4).More stable cluster patterns were obtained in the PCoA plots from the rep6 to rep6blastAnnot35bp filtered scenarios (Figure 4C Those patterns indicated that more precise original genetic background in the sampled individuals were illustrated along with the optimization of identified SNPs.

Conclusions
We developed a protocol for the identification of reliable SNPs and optimized SNPassociated contigs for Masson pine via the GBS approach.The QC procedures for the precision and reliability of identified SNPs via the GBS approach were achieved by the combination of blasting the de novo assembly on available sequenced reference genomes, functional annotation, technical replicates, and 35 bp interval distance filtering, as well as the custom parameter settings on missing, minor allele frequency, and minimum minor genotype count.The use of available reference genomes and technical replicates during the generation of SNPs provided possible solutions for the mitigation of the effect of genotyping errors.The derived SNPs may have some problematic problems if no optimization was carried out on those SNPs because of the detected high genotyping error rate in Masson pine GBS in this study, which has prompted notable attention to the mitigation of genotyping errors in GBS application in large-genome forest tree species, such as the Pinus ssp.Thus, for any high-throughput sequencing data, the characteristics of raw data, assembly quality, the utilization of reference genomes, and the range of parameter values used for bioinformatics analysis should be carefully considered for precise genotyping in forest tree GBS application.However, our research is encouraging, as a continuous search for more affordable, accurate, and reliable SNP discovery in forest tree breeding for the adoption of genomic selection is possible and may yield more accurate prediction in molecular forest tree breeding.These

Forests 2023 , 18 Figure 2 .
Figure 2.The contig length distribution based on the searched 159,372 SNP-Associated-blasted contigs in the de novo assembly for reliable SNP prediction (A), the number of the 60,038 annotated SNP-Associated-blasted contigs with blasted hits (B).

Figure 2 .
Figure 2.The contig length distribution based on the searched 159,372 SNP-Associated-blasted contigs in the de novo assembly for reliable SNP prediction (A), the number of the 60,038 annotated SNP-Associated-blasted contigs with blasted hits (B).

Figure 3 .
Figure 3. PCoA plots of 299 Masson pine samples based on the SNPs with different missing levels of M0, M5, and M10.The different signals of colored circles with the related abbreviations represent the original resource of subpopulations from four local populations displayed in TableS1.

Figure 3 .
Figure 3. PCoA plots of 299 Masson pine samples based on the SNPs with different missing levels of M0, M5, and M10.The different signals of colored circles with the related abbreviations represent the original resource of subpopulations from four local populations displayed in TableS1.
-F).Further SNP filtering resulted in more reasonable genetic relationships of 293 individuals with more overall compacted but less overlapped patterns among different color signals of the 15 original subpopulations.The overall compacted clustering patterns along with the QC procedure displayed more similarities illustrated within each family based on the optimized SNPs.The clustering patterns in the PCoA plots with two highlighted colors of the YQ and ZB subpopulations clearly revealed more variations that were illustrated with more gathering within the same color signals and less overlapping between different color signals in the two subpopulations (Figure S3C-F).

Table 1 .
Key indicators used to assess genotyping error types.
* R1/R2-first and second technical replicates for a pair; T, G-example nucleotides; N|N-missing genotype.

Table 2 .
SNP filtering process and SNP count and contig numbers with and without one or six replicates filtering.

Table 3 .
The distribution of SNPs at rep6M5F05C4blastAnnot35bp across the Chinese pine reference genome in the twelve linkage groups and contigs.

Table 4 .
The SNP counts and each key indicator genotyping error count for different missing levels in QC steps.
. With the help of blasting against available