Preliminary Investigation of Essentially Derived Variety of Tea Tree and Development of SNP Markers

The continuous emergence of Essentially Derived Varieties (EDVs) in tea tree breeding threatens the innovation capacity and development potential of tea tree breeding. In this study, genotyping by sequencing (GBS) was used for the first time to screen high-quality genomic SNPs and investigate the derivation relationships among 349 tea trees from 12 provinces in China. A total of 973 SNPs with high discrimination capacity, uniformly covering the 15 tea tree chromosomes, were selected as the core SNP set. A genetic similarity analysis showed that 136 pairs of tea trees had a genetic similarity coefficient (GS) > 90%, among which 60 varieties/strains were identified as EDVs, including 22 registered varieties (19 of them indisputably EDVs). Furthermore, 21 SNPs that identified all 349 tea trees were selected as rapid identification markers, of which 14 SNP markers could be used for 100% identification of non-EDVs. These results provide a basis for analyzing the genetic background of tea trees in molecular-assisted breeding.


Introduction
True breeding innovation requires substantial investment and careful long-term research. However, nominal "new varieties" are constantly being bred from a small pool of breeding resources. These "new varieties" share a single genetic background, which erodes population genetic diversity. The main reason is that they are all inbred offspring of a similar initial variety (IV) and retain the essential characteristics arising from the genotype or combination of genotypes of the IV. Such "new varieties" are called Essentially Derived Varieties (EDVs) [1]. Many "new varieties" of ornamental plants, fruit trees, rice, and wheat have been reported, produced mainly through mutation and genetic modification, and the genetic backgrounds of these EDVs are seriously homogeneous [2][3][4][5]. Registering these EDVs as new varieties damages the interests of IV owners, dampens the enthusiasm of original breeding innovators, and encourages the proliferation of cosmetic breeding, which does not help to genuinely improve agricultural production traits. In addition, because EDVs are genetically very similar to their IVs, a large number of EDVs narrows the genetic base, which also hinders further genetic improvement [6]. As EDVs proliferate, truly new varieties will inevitably become scarcer and breeding innovation will decline, resulting in a vicious circle. Therefore, the identification and control of derived varieties is an important problem to be solved.
With the development of domestic and international seed trade, the commercial quality of seeds based on authenticity and purity is becoming more and more important for both

Genotyping by Sequencing (GBS)
GBS library construction was done according to Wu [24]. Briefly, the genomic DNA was digested using restriction enzymes (EcoR I and Nla III) and then ligated with barcoded adapters and common Illumina sequencing adapters. A total of 349 multiplex libraries corresponding to 349 tea trees were constructed, with each library DNA sample having a unique nucleotide multiplex identifier in the barcode adapter. These libraries were subjected to paired-end sequencing on the Illumina NovaSeq 6000 platform (2 × 150 bp, Shanghai Biozeron Co., Ltd.).
When examining genetic similarities between moderately or closely related germplasm, more than 300 SNPs may be required, with relatively high PIC values and uniform genomic coverage [18]. Therefore, an optimal core SNP set with MAF > 0.15, miss rate ≤ 5%, PIC > 0.15, more than 300 SNPs, and uniform coverage of the whole genome was selected for further analysis. The observed heterozygosity (Ho), minor allele frequency (MAF), and polymorphism information content (PIC) were calculated using PowerMarker 3.25 software (Raleigh, NC, USA) [31].
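The filtering criteria above can be illustrated with a minimal sketch. It assumes biallelic or multiallelic SNP calls coded as two-letter strings and uses the standard Botstein-type PIC formula; whether this matches the exact definitions (for example, sample-size corrections) used by the software cited above is an assumption, and the function names `snp_stats` and `passes_core_filter` are hypothetical.

```python
from collections import Counter

def snp_stats(genotypes):
    """Summary statistics for one SNP.

    genotypes: list of two-letter calls such as "AA", "AG", "GG";
    None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    miss_rate = 1 - len(called) / len(genotypes)
    if not called:
        return {"miss_rate": 1.0, "maf": 0.0, "ho": 0.0, "gd": 0.0, "pic": 0.0}
    allele_counts = Counter(allele for g in called for allele in g)
    total = sum(allele_counts.values())
    freqs = [n / total for n in allele_counts.values()]
    # Minor allele frequency; a monomorphic site has MAF 0.
    maf = min(freqs) if len(freqs) > 1 else 0.0
    # Observed heterozygosity: share of called samples with two different alleles.
    ho = sum(g[0] != g[1] for g in called) / len(called)
    # Gene diversity (uncorrected): 1 - sum(p_i^2).
    gd = 1 - sum(p * p for p in freqs)
    # PIC: gene diversity minus sum over i<j of 2 * p_i^2 * p_j^2.
    pic = gd - sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return {"miss_rate": miss_rate, "maf": maf, "ho": ho, "gd": gd, "pic": pic}

def passes_core_filter(stats, min_maf=0.15, max_miss=0.05, min_pic=0.15):
    """Apply the MAF > 0.15, miss rate <= 5%, PIC > 0.15 thresholds from the text."""
    return (
        stats["maf"] > min_maf
        and stats["miss_rate"] <= max_miss
        and stats["pic"] > min_pic
    )
```

Uniform genome coverage and the minimum number of SNPs are not handled here; they would need chromosome positions, which this sketch does not model.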

Population Structure Analysis
Population structure was analyzed with ADMIXTURE software [32]. All tea trees were assigned to groups based on their membership coefficients. The optimal K value was determined according to the cross-validation error (CV error). Principal component analysis (PCA) was performed using PLINK software, and the principal components were plotted against one another [33] in R 3.4 to visualize patterns of genetic variation. A neighbor-joining (NJ) phylogenetic tree was constructed with MEGA software and color-coded with iTOL [34].

Genetic Similarity Analysis
A genetic similarity analysis was carried out for pairs of samples using the NTSYSpc2.11 software [35] to calculate the distribution frequency of SNPs and the genetic similarity coefficient (GS), GS = NS/(NS + ND), where NS is the number of identical SNP genotypes and ND is the number of different SNP genotypes between the two samples [36].
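As a minimal sketch of the coefficient defined above, GS for one pair of samples can be computed as follows. Skipping loci with a missing call in either sample and treating "AG" and "GA" as the same genotype are assumptions of this illustration, and `genetic_similarity` is a hypothetical helper, not part of the software cited above.

```python
def genetic_similarity(sample_a, sample_b):
    """GS = NS / (NS + ND) over the loci called in both samples.

    sample_a, sample_b: equal-length lists of genotype strings ("AA", "AG", ...);
    None marks a missing call.
    """
    ns = nd = 0
    for ga, gb in zip(sample_a, sample_b):
        if ga is None or gb is None:
            continue  # assumption: loci with a missing call are not compared
        # Sort alleles so that "AG" and "GA" count as the same genotype.
        if sorted(ga) == sorted(gb):
            ns += 1
        else:
            nd += 1
    if ns + nd == 0:
        raise ValueError("no loci called in both samples")
    return ns / (ns + nd)
```

For example, two samples that share three of four comparable genotypes have GS = 3/(3 + 1) = 0.75.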
According to the recommendations of the International Seed Federation (ISF) [37], a GS value greater than 0.9 represents strong evidence of essential derivation. Therefore, pairs of samples with GS > 0.9 were considered likely to have an essentially derived relationship. Six pairs of samples of the same variety from different growing places were used as positive controls. When the GS value of a pair of samples was greater than that of the positive controls, the pair was considered to have an indisputable essentially derived relationship. Within such a pair, the variety with the later registration date was considered the EDV.
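The decision rule described above can be sketched as a small classifier. Using the smallest positive-control GS as the cutoff for an indisputable relationship is an assumption about how "greater than that of the positive controls" is applied, and the function names and example values below are hypothetical.

```python
def classify_pair(gs, control_gs_values, edv_threshold=0.9):
    """Label a sample pair by its genetic similarity coefficient.

    control_gs_values: GS values of the positive-control pairs
    (same variety grown in different places).
    """
    indisputable_cutoff = min(control_gs_values)  # assumption: smallest control value
    if gs >= indisputable_cutoff:
        return "indisputable"
    if gs > edv_threshold:
        return "likely"
    return "none"

def assign_roles(variety_a, year_a, variety_b, year_b):
    """Return (IV, EDV): the earlier-registered variety is treated as the IV."""
    if year_a == year_b:
        raise ValueError("registration dates are equal; roles cannot be assigned")
    if year_a < year_b:
        return variety_a, variety_b
    return variety_b, variety_a

# Hypothetical control values and registration years, for illustration only.
controls = [0.980, 0.975, 0.990]
print(classify_pair(0.985, controls))         # indisputable
print(classify_pair(0.930, controls))         # likely
print(assign_roles("X", 1985, "Y", 2002))     # ('X', 'Y')
```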

Core Varieties/Strains Analysis
Following Wang et al., 2007 [38], the least distance stepwise sampling method was used to construct the core germplasm so as to preserve as much of the genetic diversity of the original population as possible. Within each population, a pairwise comparison matrix was built from the number of differential SNP genotypes between collections; missing genotypes were treated as null. Fewer differential SNP genotypes indicate a smaller distance between two collections. Because clustering is one of the important factors affecting the results of core collection construction [38], collections with highly admixed backgrounds were excluded from this analysis based on the phylogenetic and population structure analyses. If an IV and its EDVs, or two collections with a known parent-offspring relationship, appeared in the results at the same time, the EDV or offspring was removed and replaced by the next collection.
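The idea behind least distance stepwise sampling can be sketched in simplified form: repeatedly find the closest pair of collections and drop one member until the target size is reached. This is an illustration only and not the published procedure of [38]; in particular, dropping the alphabetically later member of the closest pair, skipping missing calls, and the function names are all assumptions made here.

```python
def differential_count(a, b):
    """Number of loci with different genotypes; missing calls (None) are skipped."""
    return sum(
        1 for ga, gb in zip(a, b)
        if ga is not None and gb is not None and ga != gb
    )

def least_distance_sampling(genotypes, target_size):
    """Simplified stepwise sampling.

    genotypes: dict mapping collection name -> list of genotype calls.
    Returns the sorted names of the retained collections.
    """
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    remaining = dict(genotypes)
    while len(remaining) > target_size:
        names = sorted(remaining)
        # Closest pair = fewest differential SNP genotypes (ties broken by name).
        _, _, redundant = min(
            (differential_count(remaining[a], remaining[b]), a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
        )
        # Assumption: drop the alphabetically later member of the closest pair.
        del remaining[redundant]
    return sorted(remaining)
```

The exclusion of admixed collections and the replacement of EDVs or known offspring described above would be applied before and after this step, respectively.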

SNP Markers for Rapid Variety Identification
To achieve rapid variety identification with fewer SNPs, markers were selected from the 973 core SNPs using the Perl-based method of Yang et al., 2019 [39] and Liu et al., 2019 [40]. The discernibility of pairwise comparisons across all samples was the first filter criterion, and among datasets with equal discernibility the one with the higher PIC was preferred. First, the SNP locus with the highest discernibility was chosen as the initial dataset, and each remaining SNP was added to it in turn to form candidate datasets. The candidate with the highest discernibility determined the second SNP, which was added to the initial dataset. Subsequent SNPs were selected in the same way until discernibility reached its maximum. Finally, a minimal set of SNP markers with high discernibility was selected for the rapid identification of varieties. Primers were designed in the two conserved flanking regions, and 10 samples were randomly selected for verification.
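The stepwise selection described above is a greedy procedure, which can be sketched as follows. Here, discernibility is taken to be the number of sample pairs separated by at least one selected SNP, and PIC breaks ties; both readings, the handling of missing calls, and the name `greedy_marker_selection` are assumptions of this sketch rather than details of the Perl scripts in [39,40].

```python
from itertools import combinations

def greedy_marker_selection(matrix, pic):
    """Greedily pick SNPs until no further sample pairs can be separated.

    matrix: dict snp_id -> list of genotype calls, one per sample (same order).
    pic:    dict snp_id -> PIC value, used only to break ties.
    Returns (selected SNP ids in order, set of sample pairs left unresolved).
    """
    n_samples = len(next(iter(matrix.values())))
    all_pairs = set(combinations(range(n_samples), 2))
    # Pairs of samples that each SNP can tell apart (missing calls never separate).
    separates = {
        snp: {
            (i, j) for i, j in all_pairs
            if calls[i] is not None and calls[j] is not None and calls[i] != calls[j]
        }
        for snp, calls in matrix.items()
    }
    unresolved = set(all_pairs)
    selected = []
    while unresolved:
        candidates = [s for s in matrix if s not in selected]
        if not candidates:
            break
        # Highest gain in discernibility first, then highest PIC.
        best = max(candidates, key=lambda s: (len(separates[s] & unresolved), pic[s]))
        gain = separates[best] & unresolved
        if not gain:
            break  # discernibility has reached its maximum
        selected.append(best)
        unresolved -= gain
    return selected, unresolved
```

An empty set of unresolved pairs corresponds to 100% identification of the samples. A greedy search of this kind yields a small marker set but does not guarantee the smallest possible one.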

Genome-Wide Perfect SNP Discovery in 349 Tea Trees
A total of 829.05 G sequencing data were obtained from 349 tea trees, with an average of 2.38 G of data for each sample, with an average proportion of 98% Q20 and 93.05% Q30. A total of 5,737,108,966 sequences were obtained. There were 5,668,097,099 sequences that could be matched to the tea reference genome, with a matching rate of 98.83% (Table S1). These results indicated a high sequencing quality. After strict filtering, a total of 12,937,679 SNPs were obtained (Table S2). Raw data obtained by sequencing have been uploaded to the NCBI database (BioProject number: PRJNA924950).

Population Structure Analysis
The population structure of 349 tea trees was analyzed based on the 12,937,679 perfect SNPs. The model-based structure analysis showed that the best K value was K = 5 (Figure 1a,b). The phylogenetic analysis (Figure 1c) was in accordance with the population structure inferred by the structure analysis. In addition, PCA was conducted to assess the population structure (Figure 1d) and indicated five clusters of tea trees, consistent with the structure analysis at K = 5. Together, these results suggested that the 349 tea trees could be classified into five populations. Admixture was observed within the five populations (individuals whose membership coefficient for their own population was less than 0.8). There were 159, 72, 51, 50 and 17 tea trees in populations 1-5, respectively (Table S3).
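The two decisions described above, choosing K by the lowest cross-validation error and assigning each tree to the population with its largest membership coefficient, can be sketched as follows. The helper names and the example CV errors are hypothetical; only the 0.8 admixture cutoff and the population sizes come from the text.

```python
def best_k(cv_errors):
    """Return the K with the lowest cross-validation error.

    cv_errors: dict mapping K -> CV error.
    """
    return min(cv_errors, key=cv_errors.get)

def assign_population(membership, admixture_cutoff=0.8):
    """Assign one individual to a population from its membership coefficients.

    membership: list of coefficients, one per population.
    Returns (population number starting at 1, admixed flag).
    """
    top = max(range(len(membership)), key=lambda k: membership[k])
    return top + 1, membership[top] < admixture_cutoff

# Hypothetical CV errors, for illustration only.
print(best_k({4: 0.52, 5: 0.48, 6: 0.50}))           # 5
print(assign_population([0.05, 0.90, 0.05, 0, 0]))   # (2, False)
print(assign_population([0.60, 0.30, 0.10, 0, 0]))   # (1, True)

# The reported population sizes account for all 349 trees.
assert sum([159, 72, 51, 50, 17]) == 349
```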

Core SNP Set Exploration
A set of 973 SNPs with MAF > 0.15, miss rate ≤ 5% and PIC > 0.15 was further selected as the core SNP set (Table S4). Using these 973 SNPs, the phylogenetic tree analysis showed that the clustering relationships of the vast majority of tea trees were consistent with those obtained using 12,937,679 SNPs, while only 21 individuals with highly admixed backgrounds were inconsistent (Figure 2a). The result of the PCA was also almost consistent with that using all 12,937,679 SNPs (Figure 2b). These results suggested that these 973 SNPs could largely represent the genetic diversity of the 349 tea trees.

GS and EDV Analysis
A genetic similarity matrix of GS values between all pairs of samples was built for the 349 tea trees (Table S5). Among the 349 samples, 136 pairs (0.22%) had GS values higher than 0.9 (Figure 4). The GS values of six positive controls were used as reference thresholds for indisputable EDVs. In the six positive controls, the GS value ranged from 0.9764 to 0.9846. Therefore, a GS ≥ 0.97 was taken as the threshold for indisputable EDVs (Table S6). According to the time at which varieties were rated by the Tea Variety Certification Committee, the earlier varieties were set as IVs, and the others as EDVs. Among the 349 samples, 121 have been registered as varieties (including 63 national varieties and 58 provincial varieties), of which 22 are considered EDVs, and 19 are indisputable EDVs (Table 1). In addition, 38 landraces had essentially derived relationships with other tea trees, of which 30 were indisputable (Table S7).
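The 0.22% share reported above follows from the number of distinct sample pairs. A short check is given below, assuming the percentage is taken over all unordered pairs of the 349 samples.

```python
from math import comb

n_samples = 349
high_gs_pairs = 136

# Number of unordered sample pairs: 349 * 348 / 2
total_pairs = comb(n_samples, 2)
share_percent = 100 * high_gs_pairs / total_pairs

print(total_pairs)              # 60726
print(round(share_percent, 2))  # 0.22
```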

Core Varieties Analysis
A genetic similarity matrix was built based on the number of differential SNP genotypes between each sample among 349 tea trees (Table S8). The top 20% of varieties with minimum differential SNP genotypes were core varieties within each population (Table 2). Finally, 70 tea trees were considered to be the core or backbone of 349 tea trees (Table S9).

Screening of SNP Markers for Rapid Identification
By pairwise comparison discernibility screening of the 973 core SNPs, a set of 21 SNPs was selected for easier and faster identification of the 349 samples (Figure 5a), among which 14 SNPs could identify non-EDVs (Table S10, Figure 5b). Across the 349 samples, the 21 SNPs had an average PIC of 0.32, an average Ho of 0.420, and an average GD of 0.38 (Table 3). Twenty-one pairs of primers for the SNP markers were designed (Tables 3 and S11) and randomly verified by PCR with good results (Figure 6).
Figure 6. Primers were randomly verified by PCR. Ten samples were randomly selected for first-generation sequencing and typing; arrows indicate clearly identifiable heterozygous sites.

Discussion
In this study, a population of 349 tea trees from 12 provinces, including 63 national varieties, 58 provincial varieties and 228 landraces, was collected to evaluate, for the first time, the essentially derived relationships between tea trees using GBS-SNPs. The GBS approach identified a total of 12,937,679 high-quality SNPs, and all samples passed the quality assessment with an average call rate of 98.8%. The large number and high quality of the SNPs suggested that the GBS approach is powerful for genetic diversity analyses of tea species. Population structure and diversity analyses are important for the classification and identification of germplasm resources. Our population structure, PCA, and phylogenetic analyses consistently showed that the 349 tea trees could be divided into five populations. This division was based mainly on the genetic relationships between tea trees rather than on morphological characteristics such as tree shape or leaf size, which was consistent with previous studies on the differentiation of wild and cultivated tea trees [41]. Thus, these five populations should represent different lineage sources. This further suggested that distance coefficients measured using morphological data are of limited use in distinguishing between IVs and putative EDVs, that they do not always reflect lineages or genetic relationships, and that molecular markers have stronger discriminative power and are, therefore, more suitable than morphology for the identification of EDVs [42].
A suitable set of SNP markers for the implementation of the EDV system should represent genetic diversity, cover the whole genome uniformly, and have a high discrimination capacity [42]. In this study, a set of SNP markers with an extremely low miss rate (≤5%) and PIC > 0.15 was selected that was evenly distributed across the 15 chromosomes of the tea plant genome. A population genetic diversity analysis using the 973 SNPs was almost consistent with that using all 12,937,679 SNPs, which suggested that these 973 core SNPs were sufficient to represent the genetic diversity of the 349 samples. The relatively modest average PIC and MAF values might be due to the presence of EDVs among the 349 tea cultivars, suggesting that breeding practices reduce genetic diversity in cultivated germplasm [43]. The high average Ho (0.355) and average GD (0.3) suggested that these 973 perfect SNPs are informative, have good discrimination capacity, and are suitable for subsequent variety identification.
Which similarity ratio should be used as the threshold for an EDV has been a controversial issue. CIOPORA (International Community of Breeders of Asexually Reproduced Horticultural Plants) recommended a genetic similarity coefficient of 0.9 as the standard [37,44]. However, a genetic similarity coefficient of 0.9 may not be a suitable threshold for all crops, and a marker-based DUS (distinctness, uniformity, and stability) assessment system for EDVs requires a crop-by-crop approach [42]. At present, the specific EDV threshold for tea plants has not been clarified and needs to be further explored. In this study, we could only provisionally take the genetic similarity coefficient of 0.9 as the threshold and regard it as a reference for whether a derivation relationship exists between tea trees. In addition, even individuals of the same variety will accumulate a small number of genetic mutations when cultivated in different places for a long time. According to the examples given in the UPOV act, mutants are considered EDVs [12,45]. If the genetic similarity between two individuals is extremely high, even greater than the genetic similarity between individuals of the same variety, but they have different phenotypes, they should be considered mutants with a derived relationship. Therefore, six pairs of tea trees known to belong to the same variety but grown in different places were used as positive controls to determine an indisputable derivation relationship, and the smallest genetic similarity value among them (0.97) was taken as the threshold for an indisputable EDV according to the "tail principle" [4,46]. According to this criterion, 22 of the 121 registered varieties used in the study were EDVs (GS > 0.9), and 19 were indisputable EDVs (GS > 0.97).
It is worth noting that some varieties with derived relationships come from different provinces, which means that a certain number of EDVs might have been produced in the process of tea tree cross-breeding due to the extensive introduction of tea trees and the excessive selection of elite varieties such as Fudingdahao, Tezao213, etc.; these results were also consistent with previous breeding records [41,47]. Further, to provide a reference for the selection of parents in subsequent tea tree breeding, the core varieties among the 349 tea trees were also analyzed by the least distance stepwise sampling method [38][39][40]. After excluding varieties with possible derived relationships and known progeny varieties, the top 20% of the remaining varieties of each population were considered the core/backbone varieties for the future breeding of tea trees.
Regarding the application of SNP markers to tea varieties and EDVs, we conducted the first investigation of EDVs in 349 tea trees using 973 SNPs screened by GBS technology. The 349 tea tree samples used in this study came from a wide range of sources (12 different provinces and regions, including 63 national elite cultivars, 58 provincial elite cultivars and 228 landraces); therefore, this SNP set has certain application value in tea variety identification. To identify varieties more conveniently and economically with fewer SNP markers, a set of 21 SNPs was further screened from the 973 SNP markers to identify the 349 tea trees, and DNA fingerprints were established; of these, only 14 SNPs were needed to identify the non-EDVs among the 349 tea trees. Random validation of these 21 SNP markers showed good results, which could meet the requirements of efficient, flexible, simple, rapid and low-cost detection for these 349 tea trees in the future.
Furthermore, the 973 SNP markers in this study have been strictly screened, and there is no interference from other SNPs on either side of these sites, so these SNP markers are very suitable for future detection of tea trees by SNaPshot technology, also known as mini-sequencing, a primer extension-based method developed for the analysis of SNPs [48]. Although the derivation threshold for the tea tree has not been definitively established, pairs of individuals with GS > 0.9 could be regarded as having a suspected derivation relationship according to the recommendations of ISF [44]. In contrast, pairs of individuals with GS > 0.97 could be clearly regarded as having a derivation relationship. Nevertheless, if this marker set is to be used for future EDV identification in tea trees, further data need to be collected in collaborative studies of genetic material obtained through different breeding methods or from reference populations. In addition, it is still necessary to combine pedigree relationships and morphological data to make a comprehensive EDV identification of tea trees.

Conclusions
In conclusion, this study was the first to investigate and analyze the EDV status of tea trees in China, and 60 varieties/strains were identified as EDVs, including 22 registered varieties (19 were indisputably EDVs). Based on the high discrimination capacity and genome coverage, our study provided a set of 973 SNP markers capable of identifying 349 tea trees, of which only 21 SNP markers were able to identify all 349 tea trees (14 SNP markers could be used for 100% identification of non-EDV). Our results provide a research basis for the genetic background analysis of tea trees in the process of molecular-assisted breeding and contribute to the implementation of the EDV system in the field of tea trees in the future.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/plants12081643/s1. There are 11 supplementary tables, Table S1: 349 tea trees information and GBS resequencing data output statistics; Table S2: Quality filtering of GBS resequencing raw data; Table S3: Population structure analysis of 349 tea trees by using the model-based program STRUCTURE; Table S4: A set of 973 core SNPs for 349 tea trees; Table S5: A genetic similarity matrix by calculating GS values between each tea trees was built within 349 tea trees; Table S6: Accession pairs with essentially derived relationships (GS > 0.9) within 349 tea trees; Table S7: Analysis of essentially derived variety/strain within 349 tea trees; Table S8: A genetic similarity matrix was built based on the number of differential SNP genotypes between each sample among 349 tea trees; Table S9: Analysis of the core variety in 5 populations within 349 tea trees; Table S10: The 21 SNP markers polymorphism for 349 sample identification; Table S11: The 21 pairs of primers for 349 tea trees rapid identification.
Author Contributions: L.L. and X.L. contributed to the study conception and design. Sample preparation of these materials was conducted by L.L., F.L., J.Z., Y.Z., W.Z. and L.F. Data collection, analysis, and interpretation were performed by L.L. and X.L. The first draft of the manuscript was written by L.L. All authors have read and agreed to the published version of the manuscript.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data will be made available on request.