Utilization of Phytochemical and Molecular Diversity to Develop a Target-Oriented Core Collection in Tea Germplasm

Abstract: Tea has received attention due to its phytochemicals. For the direct use of tea germplasm in breeding programs, a core collection that retains the genetic diversity and various phytochemicals in tea is needed. In this study, we evaluated the content of eight phytochemicals over two years and the genetic diversity through 33 simple sequence repeat (SSR) markers for 462 tea accessions (entire collection, ENC) and developed a target-oriented core collection (TOCC). Significant phytochemical variation was observed in the ENC between genotypes and years. The genetic diversity of ENC showed high levels of molecular variability. These results were incorporated into developing TOCCs. The TOCC showed a representation of the ENC, where the mean difference percentage, the variance difference percentage, the variable rate of coefficient of variance percentage, and the coincidence rate of range percentage were 7.88, 39.33, 120.79, and 97.43, respectively. The Shannon's diversity index (I) and Nei's gene diversity (H) of TOCC were higher than those of ENC. Furthermore, the accessions in TOCC were shown to be selected proportionally, thus accurately reflecting the distribution of the overall accessions for each phytochemical. This is the first report describing the development of a TOCC retaining the diversity of phytochemicals in tea germplasm. This TOCC will facilitate the identification of the genetic determinants of trait variability and the effective utilization of phytochemical diversity in crop improvement programs.


Introduction
Since the International Board for Plant Genetic Resources (IBPGR) was established in 1974 to coordinate the global efforts to systematically collect and conserve the world's threatened genetic plant diversity, many countries and organizations have founded gene banks, and millions of crop resources have been preserved [1,2]. As a result of the global efforts to conserve plant genetic resources for food and agriculture, the number and scale of ex situ germplasm collections have increased tremendously in the last 40 years [3]. However, the large sizes of redundant collections, either individually or collectively, for particular species have become an obstacle to the characterization, evaluation, utilization, and maintenance of those species [2,3]. As a solution, the authors in [4] proposed that collections could be pruned to core collections, which could "represent with a minimum of repetitiveness, the genetic diversity of a crop species and its relatives." Such a core collection serves as a working collection that can be extensively examined, while the accessions excluded from it can be preserved as reserve collections [5]. Therefore, core collections can facilitate both the use of crop germplasm and the management of entire collections [2].
Tea (Camellia sinensis (L.) Kuntze) is a woody evergreen plant in the family Theaceae and is native to the region covering the northern part of Myanmar, as well as the provinces of Yunnan and Sichuan in China. It is one of the most popular beverages and has become a daily drink for many people around the world [6]. To combat climate change, biological threats, and market fluctuations, the main tea-producing countries of China, Sri Lanka, and India have managed and preserved their tea and genetic resources both in situ and ex situ [7]. In addition, they have developed core collections or subsets of tea germplasm that maintain the original diversity of the collections but at a size that facilitates the evaluation, use, and conservation of the entire collections using geographical origin, phenotypic traits, and molecular markers [8][9][10][11][12]. Knowledge and understanding of the genetic background, genetic diversity, relationships, and identification are important for the collection, preservation, characterization, and utilization of tea resources [13]. The proper characterization and evaluation of genetic resources via systematic preservation and maintenance is the most important factor in utilizing such resources for improving crops [14]. The characterization of germplasm can be carried out using morphological, biochemical, and molecular descriptors according to the standard criteria contained in the tea descriptors [15]. Among the characteristics in tea descriptors, morphological traits and phytochemical content tend to be most affected by environmental factors. In addition, since phytochemical content can have very large variations depending on the environment, these characteristics need to be evaluated for multiple years, yielding more precise data. On the other hand, molecular markers are rarely influenced by the environment and thus directly offer an observation of genomic diversity.
The phytochemical characterization of plant germplasm is an acceptable method to define biochemical diversity [16]. The composition of phytochemicals in tea is important, as these chemicals contribute to tea's quality and pharmacological properties [15]. Tea consists of compounds rich in polyphenols, theanine, and caffeine, which not only determine the quality of tea but also provide tremendous health benefits [17]. Among tea polyphenols, catechins account for 8% to 26% of the tea leaves' dry weight [18]. Previous studies reported that because each catechin monomer has a different chemical structure they each have unique bioactivity, bioavailability, and physiological pharmacokinetic properties [19,20]. In addition, the origin and growing conditions of the tea plant affect the contents of the tea's phytochemicals, which changes bioactivity [21,22]. The leaves of the tea tree have been primarily cultivated as a source of tea beverages, in which phytochemicals such as catechin and caffeine are the main functional compounds. The development of a new variety that contains enhanced phytochemical contents (qualitatively or quantitatively) is the ultimate objective of tea breeding programs. Therefore, along with the evaluation of phytochemical diversity, the development of a core collection that can represent the diversity of the entire germplasm is very important not only for the conservation and management of germplasm but also for tea breeding programs.
To assess genetic diversity and/or develop new cultivars, molecular markers such as restriction fragment length polymorphism (RFLP) [23,24], random amplified polymorphic DNA (RAPD) [25][26][27], amplified fragment length polymorphism (AFLP) [24], and simple sequence repeats (SSR) [11,28–30] have been used in many countries. The authors in [31] reported that morphological traits, despite their advantages, have drawbacks such as environmental influences on trait expression, epistatic interactions, and pleiotropic effects. Molecular markers, on the other hand, are used because they are least affected by environmental factors and are almost unlimited in number. In addition, they make it possible to observe the genome directly and thus eliminate the shortcomings inherent in phenotypic observation [32].
In our previous study, we analyzed the genetic diversity of tea accessions collected in Korea using 21 SSRs [28]. In this study, we evaluated the content of eight phytochemicals over two years (2018 and 2019) and analyzed the genetic diversity through 33 SSR markers for 462 tea accessions collected from Korea, China, Japan, and Indonesia. In addition, a target-oriented core collection was developed using both the phytochemical content and genetic diversity. This core collection will be used to efficiently preserve, manage, and evaluate tea germplasm in the genebank of Korea and will be provided to tea breeding programs as breeding material.
Agronomy 2020, 10, 1667

Plant Material
A total of 462 tea accessions were obtained from the National Agrobiodiversity Center (NAC) at the Rural Development Administration in South Korea (Table S1). These accessions are currently preserved as genetic resources in the Tea Industry Institute (34°46′ N, 127°5′ E) and are maintained through similar horticultural practices. Fresh tea buds and young leaves of the first flush were harvested between 09:00 and 12:00 (noon) on 24 May in 2018 and 2019. All samples were stored in a freezer at −80 °C until analysis.
All data were collected from three replicate experiments. Summary statistics for the phytochemicals in the tea accessions were calculated, and the annual variation in the phytochemicals under consideration was evaluated through a multivariate analysis of variance (MANOVA) using PAST 3 [33]. Hierarchical clustering was performed using the R statistical software (http://www.r-project.org).

DNA Extraction
Genomic DNA was extracted from the leaves of the tea accessions using a Qiagen DNA extraction kit (Qiagen, Hilden, Germany). The DNA quality and quantity were measured using 1% (w/v) agarose gel and spectrophotometry (Epoch, BioTek, Winooski, VT, USA). The extracted DNA was diluted to 30 ng/µL and stored at −20 °C until further PCR amplification.

SSR Genotyping
For the SSR analysis, a total of 33 SSRs were selected from previous studies [11,34] based on linkage groups and PIC value (Supplementary Table S2). The primers were fluorescently labelled (6-FAM, HEX, and NED) to facilitate the detection of the amplification products. The PCR reactions were carried out in a 25 µL reaction mixture containing 30 ng template DNA, 1.5 mM MgCl2, 0.2 mM of each dNTP, 0.5 µM of each primer, and 1 U Taq polymerase (Inclone, Korea). Amplification was performed with the following cycling conditions: initial denaturation at 94 °C for 5 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 55-62 °C (depending on the primers, Table S2) for 30 s, extension at 72 °C for 1 min, and a final extension step at 72 °C for 10 min. Each amplicon was resolved on an ABI prism 3500 DNA sequencer (ABI3500, Thermo Fisher Scientific Inc., Wilmington, DE, USA) and scored using the Gene Mapper Software (Version 4.0, Thermo Fisher Scientific Inc.).

Genetic Diversity and Population Structure
The number of alleles (Na), number of genotypes (Ng), Shannon-Wiener index (S), Expected heterozygosity (He), and Evenness were calculated using the poppr package for the R software [35]. An analysis of molecular variance (AMOVA) within and between the gene pools was performed using the GenAlEx software v. 6.5 [36].
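As a rough illustration of what these per-locus statistics measure, the following sketch computes them from diploid SSR genotypes using textbook definitions. It is not the poppr implementation: the function name `locus_diversity`, the toy allele sizes, and the choice of uncorrected He and Pielou-type evenness are assumptions, and poppr may apply different estimators (for example, sample-size corrections or a different evenness statistic).

```python
import math
from collections import Counter

def locus_diversity(genotypes):
    """Textbook diversity statistics for one SSR locus.

    genotypes: list of (allele_a, allele_b) tuples, one per accession.
    Assumption: uncorrected estimators, which may differ from poppr's.
    """
    alleles = [a for g in genotypes for a in g]
    counts = Counter(alleles)
    n = len(alleles)
    freqs = [c / n for c in counts.values()]

    shannon = -sum(p * math.log(p) for p in freqs)   # S = -sum(p ln p)
    he = 1.0 - sum(p * p for p in freqs)             # He = 1 - sum(p^2)
    # Pielou-type evenness: observed S relative to its maximum ln(Na)
    evenness = shannon / math.log(len(freqs)) if len(freqs) > 1 else 0.0

    return {
        "Na": len(counts),                                   # number of alleles
        "Ng": len({tuple(sorted(g)) for g in genotypes}),    # number of genotypes
        "S": shannon,
        "He": he,
        "Evenness": evenness,
    }

# Hypothetical allele sizes (bp) for four accessions at one locus.
toy_locus = [(100, 100), (100, 104), (104, 104), (104, 100)]
stats = locus_diversity(toy_locus)
```

With two equally frequent alleles, as in the toy locus, He is 0.5 and evenness is 1, which is the maximum diversity a two-allele locus can show.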
The population structure was analyzed using STRUCTURE v.2.3.4 [37] and a discriminant analysis of principal components (DAPC). In the STRUCTURE analysis, Bayesian-based clustering was performed with three independent runs for each K, with K ranging from 1 to 10. Each run had a burn-in period of 50,000 iterations followed by 500,000 Markov chain Monte Carlo (MCMC) iterations, assuming an admixture model. The output was subsequently visualized with STRUCTURE HARVESTER v.0.9.94 [38]. The most likely number of clusters was inferred according to the Evanno method [39]. The DAPC analysis was performed using the adegenet package for the R software [40,41] according to Lee et al. [28]. A Mantel test was performed using R software [40] in order to investigate the relationship between the genetic and phytochemical distances of tea accessions.
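The ΔK criterion used to choose K can be sketched as follows. This is a minimal illustration, not the STRUCTURE HARVESTER code: it assumes ΔK is evaluated as |mean L(K+1) − 2·mean L(K) + mean L(K−1)| divided by the standard deviation of L(K) across replicate runs, which is one common reading of the statistic, and the log-likelihood values are invented for the example.

```python
from statistics import mean, stdev

def evanno_delta_k(lnp):
    """Compute Delta K for each interior K.

    lnp: dict mapping K -> list of ln P(D|K) values from replicate runs.
    Delta K is undefined for the smallest and largest K tested, because
    the second-order difference needs both neighbours.
    """
    delta = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue
        second_diff = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
        sd = stdev(lnp[k])
        if sd > 0:                      # skip K whose runs are identical
            delta[k] = second_diff / sd
    return delta

# Hypothetical log-likelihoods for three replicate runs per K.
toy_runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-790.0, -795.0, -785.0],
    4: [-788.0, -780.0, -796.0],
}
delta_k = evanno_delta_k(toy_runs)
best_k = max(delta_k, key=delta_k.get)
```

Because the statistic is undefined at the lowest K tested, it can never select K = 1, which is worth keeping in mind when interpreting a peak at K = 2.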

Development and Evaluation of the Core Collection
The POWERCORE program [42] was used to develop the independent core collection using the phytochemical data for two years (2018 and 2019) and the genotypic data of 33 SSR markers. The mean difference percentage (MD%), variance difference percentage (VD%), variable rate of coefficient of variance (VR%), and coincidence rate of range (CR%) were calculated to assess the level of diversity captured in the core collection compared to the entire collection [43]. In addition, the representation of the core collection was evaluated by estimating Shannon's diversity index (I) and Nei's diversity index (H). The distance matrix was used to construct a dendrogram via the neighbor-joining (NJ) method with 1000 bootstrap replicates. The principal coordinate analysis (PCoA) was performed using DARwin v. 6.0 [44].
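The four evaluation indices can be illustrated with a short sketch. The formulas below are the definitions commonly given alongside PowerCore (means, variances, coefficients of variation, and ranges averaged over traits); the exact formulas in [43] may differ in detail, and the function name `core_indices`, the dictionary layout, and the toy values are assumptions made for illustration.

```python
import math
from statistics import mean, variance

def core_indices(entire, core):
    """Average MD%, VD%, VR%, and CR% over all traits.

    entire, core: dicts mapping trait name -> list of values for the
    entire collection and the core collection, respectively.
    """
    md = vd = vr = cr = 0.0
    for trait, e in entire.items():
        c = core[trait]
        me, mc = mean(e), mean(c)
        ve, vc = variance(e), variance(c)            # sample variances
        cve, cvc = math.sqrt(ve) / me, math.sqrt(vc) / mc
        md += abs(me - mc) / mc                      # mean difference
        vd += abs(ve - vc) / vc                      # variance difference
        vr += cvc / cve                              # ratio of CVs (core / entire)
        cr += (max(c) - min(c)) / (max(e) - min(e))  # range retained by the core
    m = len(entire)
    return {name: 100.0 * total / m
            for name, total in (("MD%", md), ("VD%", vd), ("VR%", vr), ("CR%", cr))}

# Hypothetical caffeine contents (mg/g): the core keeps both extremes.
toy_entire = {"Caf": [10.0, 20.0, 30.0, 40.0, 50.0]}
toy_core = {"Caf": [10.0, 30.0, 50.0]}
indices = core_indices(toy_entire, toy_core)
```

In the toy case the core keeps both extremes and the mean, so MD% is 0 and CR% is 100, while VR% exceeds 100 because a core enriched for extreme values has a larger coefficient of variation than the entire collection.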
The levels of Caf, EC, ECG, EGCG, and TC in the 462 tea accessions between 2018 and 2019 demonstrated a normal distribution, and their H' levels were high (≥2.00). In contrast, minor components such as GC, C, CG, and GCG did not show a normal distribution, and their H' values were lower, although their coefficients of variation were high.
All nine phytochemicals showed highly significant differences between tea accessions (p < 0.001) and experimental years (p < 0.001) (Table S3). In the year × accessions interactions, C and CG did not show significant differences, while the other phytochemicals showed highly significant differences (p < 0.001).

Clustering Analysis
In total, 462 tea accessions were classified into four clusters according to their phytochemicals (Table 2). Cluster I [...] EGCG, GCG, and TC between 2018 and 2019. Cluster II had 111 accessions and showed lower contents of C, EC, GC, and GCG in 2018 and higher contents of Caf, ECG, EGCG, GC, and TC in 2019. Cluster III consisted of 59 tea accessions and showed higher C and EC and lower Caf, ECG, and EGCG content between the two years. Cluster IV had 184 tea accessions with lower C and GCG contents between 2018 and 2019.
Figure 1. Hierarchical clustering analysis of the phytochemicals in the 462 tea accessions. The colors in the heatmap indicate the z-score, which was calculated by subtracting the mean of each phytochemical across all samples and dividing by the standard deviation of that phytochemical across all samples. Red indicates a positive z-score, white a z-score of zero, and blue a negative z-score. Higher color intensity indicates a higher magnitude of the z-score. The dendrogram on the x-axis indicates the degree of similarity between the phytochemicals: the closer the phytochemicals, the higher their similarity. Similarly, the dendrogram on the y-axis indicates the degree of similarity between the samples: the closer the samples, the higher their similarity. Both dendrograms were obtained by hierarchical clustering (Ward, Euclidean distance).
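The z-score standardization described in the caption can be sketched as below, together with the Euclidean distance on which the clustering operates. This is only an illustration with invented values: the use of the sample standard deviation is an assumption, the function names are hypothetical, and the Ward linkage step itself is not reproduced here.

```python
import math
from statistics import mean, stdev

def zscore_columns(rows):
    """Standardize each column (phytochemical) across all rows (accessions)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # assumption: sample standard deviation
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two standardized accession profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical contents (mg/g) for three accessions and two phytochemicals.
toy = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]]
z = zscore_columns(toy)
d_first_last = euclidean(z[0], z[2])
```

Standardizing first keeps abundant compounds such as EGCG from dominating the distances simply because their absolute contents are larger than those of minor catechins.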

SSR Fingerprinting
A total of 428 alleles were detected in 33 SSR loci among the 462 tea accessions (Table 3). The number of observed alleles (Na) and the number of genotypes (Ng) ranged from 5 (TM324 and TM480) [...]
The diversity indices among the four origins are shown in Table 4. The Na and Ng values ranged from 3.2 (IDN) to 11.8 (KOR) and from 2.4 (IDN) to 43.6 (KOR), respectively. The S and He values ranged from 1.05 (IDN) to 1.91 (CHN) and from 0.73 (JPN) to 0.81 (CHN), respectively. The Evenness ranged from 0.76 (KOR) to 0.90 (IDN), with an average of 0.79.
The Mantel test showed a statistically significant but very weak correlation between the genetic and phytochemical distances among tea accessions (r = 0.0899, p = 0.017), indicating that the two analyses (genetic and phytochemical) grouped the genotypes in largely different ways.
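For readers unfamiliar with the Mantel test, the sketch below shows the permutation logic on two small distance matrices. It is a simplified, one-sided version written for illustration: the function names, the number of permutations, and the toy coordinates are assumptions, and it does not reproduce the R routine used in this study.

```python
import math
import random
from statistics import mean

def _upper(matrix, order):
    """Upper-triangle distances after relabelling rows/columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    x = _upper(d1, identity)
    r_obs = _pearson(x, _upper(d2, identity))
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)              # relabel the accessions in d2 only
        if _pearson(x, _upper(d2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Hypothetical one-dimensional "genetic" and "phytochemical" coordinates.
points_a = [0.0, 1.0, 3.0, 7.0, 12.0]
points_b = [0.5, 0.9, 3.2, 6.5, 12.4]
dist_a = [[abs(p - q) for q in points_a] for p in points_a]
dist_b = [[abs(p - q) for q in points_b] for p in points_b]
r, p = mantel(dist_a, dist_b)
```

Permuting accession labels, rather than individual distances, preserves the dependence among entries of a distance matrix. With hundreds of accessions, even a small r can reach significance, so the size of r matters as much as the p-value.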

Population Structure
The relatedness among genotypes and their rooting with geographical designation were studied using a population structure analysis. Determination of the log mean probability and change in the log probability (∆K) (following [29]) provided two subpopulations (K = 2) (Figure 2A). STR_C1 was dominated by genotypes belonging to 191 accessions from KOR, 12 accessions from JPN, 3 accessions from CHN, and 2 accessions from IDN (Figure 2B). STR_C2 contained 217 accessions from KOR, 35 accessions from CHN, and one accession each from JPN and IDN. The mean alpha value (an estimate of the degree of admixture) for the analyzed samples was 0.2759.
To understand the genetic relationship among the 462 tea accessions, a DAPC analysis was performed (Figure 3). Four clusters were detected in coincidence with the lowest BIC values using the find.clusters function. The DAPC analysis was carried out using the detected number of clusters. The first 50 PCs of the PCA (60.3% of variance conserved) and three discriminant eigenvalues were retained. These values were confirmed via a cross-validation analysis. The four clusters were titled D1-4. A major shift in accessions from STR_C1 to D1 and D3 was observed, and the main tea accessions of D2 and D4 were located in STR_C2. D1 contained 132 accessions (126 from STR_C1 and six from STR_C2), with 123 (117 from STR_C1 and 6 from STR_C2) from KOR, 6 (STR_C1) from JPN, two (STR_C1) from IDN, and one (STR_C1) from CHN.
The molecular variance within and between the regional pools, as well as the sub-populations derived from the clustering analysis, STRUCTURE, and DAPC analysis, was evaluated (Table 5). For the regional gene pools, the percentage of variance within and among populations was found to be 94% and 6% of the total variation, respectively. The clustering analysis and DAPC provided a variance of 99% and 1% for within and among sub-populations, respectively, while the two sub-populations derived from STRUCTURE showed approximately 100% of the total variance within the groups. Among the four AMOVA results, regional pools showed the minimum within-population variance (94%) and the maximum among-population variance (6%), indicating that the regional pools are fairly structured groups for the panel under consideration. The genetic differentiation (PhiPT) of the four groupings ranged from 0.005 (STRUCTURE) to 0.056 (regional pools).
Table 5. Analysis of the molecular variance (AMOVA) among and within populations in the regional pools and sub-populations derived from the clustering analysis, STRUCTURE, and DAPC.

Development and Evaluation of a Core Collection
The MANOVA analysis indicated significant year effects, as well as significant interaction effects between the year and accession effects by considering all quantitative traits together (Table S4). Therefore, the phytochemical data for both years (2018 and 2019) and molecular marker data were treated independently for the development of the core collection. The target-oriented core collection (TOCC) was developed with phytochemicals and molecular data using POWERCORE. TOCC included 100 accessions (21.6% of the entire collection) belonging to four origins, with 73 accessions from KOR, 22 from CHN, 4 from JPN, and 1 from IDN.
Differences between the means of the entire collection (ENC) and TOCC were found to be not significant for all traits (Table 6). The mean difference percentage (MD%), coincidence rate of range (CR%), variance difference percentage (VD%), and variable rate of the coefficient of variance (VR%) were used to compare the properties of TOCC with those of ENC (Table 7). Across the nine phytochemicals, MD%, VD%, VR%, and CR% were 7.88%, 39.33%, 120.79%, and 97.43%, respectively.
To evaluate the quality of TOCC, Shannon's diversity index (I), Nei's diversity index (H), and the number of alleles (Na) were calculated using the molecular data (Table 7). The number of alleles (Na) in TOCC was the same as that in ENC. The genetic diversity of TOCC revealed by these markers was compared with that of ENC. The I and H of TOCC were higher (0.335, 0.209) than those of ENC (0.308, 0.195).
The distribution of tea accessions in TOCC was determined via a line graph obtained through the phytochemicals of ENC, along with a Principal coordinate analysis (PCoA) and Neighbor Joining (NJ) obtained through the genetic analysis of ENC (Figure 4 and Supplementary Figure S1). The accessions in TOCC were shown to be selected proportionally, accurately reflecting the distribution of the overall accessions for each phytochemical. For the distribution of molecular data, TOCC showed a balanced distribution in PCoA and NJ.

Discussion
A vast collection consisting of 15,234 accessions of tea is available in 23 gene banks around the world [7]. The biochemical characterization of tea germplasm in earlier studies demonstrated significant variability [18,[45][46][47][48]. Despite the substantial diversity of compounds in tea germplasm, the development of tea cultivars was limited due to bottlenecks in tea breeding, such as long gestation periods, high inbreeding depression, and self-incompatibility [49]. In addition, the tea quality and yield in the main tea producing countries, such as China, India, Sri Lanka, Kenya, Japan, etc., were significantly improved with an increase in the ratio of clonal tea acreage [50]. Breeding strategies often focus on a limited set of target traits, resulting in cultivars with a narrow genetic base. Yao et al. [51] reported that the developed tea cultivars from China, Japan, and Kenya have a narrow genetic basis due to the popularity of only a few cultivars for breeding and planting. This has produced several problems, such as the spread of specific diseases and insects, the concentration of plucking time in the tea season, the non-uniformity of taste and flavor, and susceptibility to environmental changes [40,51]. Meegahakumbura et al. [29] noted that a molecular analysis that can discern not only patterns of lineage, but the origin of tea germplasm is also required because the morphological characteristics that are traditionally used to define cultivars are highly plastic and easily influenced by environmental conditions. The present study attempted to address the above issue by generating a core collection of tea germplasm that includes data on the molecular variability of the crop, in addition to biochemical characterization.


Phytochemical Diversity of Tea Germplasm
Significant variation was observed among the 462 tea accessions for catechin and caffeine content in this study (Table 1). In addition, significant differences between the two years were observed (Table S3). Catechins and caffeine serve as secondary metabolite defense compounds in tea plants. They provide sessile plants with protection against pathogens and predators, oxidative stress, and other environmental variables. Thus, the content of catechins and caffeine varied in the tea samples based on environmental variability [45]. Many previous studies reported a large variation in catechin and caffeine contents in tea accessions [15,18,52,53]. The authors in [54] noted that a biochemical characterization with different proportions of total catechins and their components would be a useful tool for the development of quality-tea clones. The authors in [55] reported that differences between locations were far larger than the variations among cultivars, implying that environmental effects should be taken into consideration when total catechin and its component contents are utilized as biochemical markers in tea breeding programs.
The concentrations of catechins in tea have been ranked in the following order: EGCG > ECG > EGC > EC > GC > C [52,53,57,58]. In addition, antioxidant activity has been reported in the following order: ECG > EGCG > EC > EGC [59]. The variation of catechin contents in tea accessions depends on the condition of the tea germplasm, such as the number of samples and the origin of the tea accessions, in each study. The range of each catechin's content in the previous studies was as follows: EGCG, 13.0 to 139.0 mg/g; ECG, 3.2 to 89.1 mg/g; EGC, 2.1 to 249 mg/g; EC, 2.0 to 54.5 mg/g; GC, 1.4 to 22.7 mg/g; and C, 0.3 to 30.9 mg/g [52,54,57,59,60]. In this study, the 462 tea accessions showed a level of catechin content similar to that in previous studies (Table 1). The concentration of catechins in tea germplasm is important for tea quality. For instance, the ratio (EGCG + ECG) × 100/EGC has been suggested as a quality index for measuring the difference in the catechin levels of fresh tea shoots across growing seasons [60]. In addition, the catechin index (CI), defined as (EC + ECG)/(EGC + EGCG), has been used as a biochemical marker for studying the genetic diversity of tea germplasms [54]. The tea accessions with desirable compositions of catechins in this study could be incorporated into breeding programs for crop improvement.
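The two ratios mentioned above are simple to compute from measured catechin contents. The sketch below only restates the formulas from the text; the function names and the example contents (in mg/g) are hypothetical and are not values from Table 1.

```python
def quality_index(egcg, ecg, egc):
    """Quality index: (EGCG + ECG) x 100 / EGC."""
    return (egcg + ecg) * 100.0 / egc

def catechin_index(ec, ecg, egc, egcg):
    """Catechin index (CI): (EC + ECG) / (EGC + EGCG)."""
    return (ec + ecg) / (egc + egcg)

# Hypothetical catechin contents (mg/g) for a single accession.
sample = {"EGCG": 60.0, "ECG": 20.0, "EGC": 40.0, "EC": 10.0}
qi = quality_index(sample["EGCG"], sample["ECG"], sample["EGC"])
ci = catechin_index(sample["EC"], sample["ECG"], sample["EGC"], sample["EGCG"])
```

Both indices are ratios, so they depend on catechin composition rather than on total catechin content.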
Caffeine is the most abundant alkaloid in tea, with content usually between 15 and 50 mg/g [15]. In this study, the caffeine content of 462 tea accessions ranged from 0.4 to 36.6 mg/g (2018) and 0.4 to 28.8 mg/g (2019) ( Table 1). Kottawa-Arachchi et al. [15] noted that various amounts of caffeine have been observed in different tea growing countries. Due to the pharmacological properties of caffeine on the central nervous system, the demand for low-caffeine tea is increasing greatly, from 2% of total tea consumption in 1980 to 15% in the early twenty-first century [61]. Although many countries have invested in methods and techniques to make decaffeinated tea, such techniques can remove the tea's unique aroma and taste, which will worsen the quality. It is thus important to develop low caffeine clones through breeding and selection, as such clones could be a solution to the problem of high caffeine levels and contribute tremendously to the provision of natural low-caffeine tea [18]. The tea accessions with a lower caffeine content in this study could be used as naturally low-caffeine genetic resources for crossbreeding parents.

Genetic Diversity of Tea Germplasm
In our previous study, we analyzed the genetic diversity and population structures of 410 tea accessions collected from South Korea using 21 SSR markers and revealed the narrow genetic base of South Korean tea accessions [28]. In the present study, the genetic diversity and population structure of 462 tea accessions from China, Japan, Indonesia, and Korea (conserved in NAC) were analyzed using 33 SSR markers. As shown in Table 4, the accessions from China showed the highest S and He values among the four origins. Other studies similarly reported that the Chinese tea population exhibited a higher level of genetic diversity than tea populations from other countries [24,51]. In general, China is thought to be the origin of tea, so Chinese tea populations are the most likely to account for the largest proportion of diversity [51]. Our previous study noted that Korean tea germplasm showed low genetic diversity because of limitations in the gene stock from China, political and religious reasons, and extreme environmental conditions [45]. Tanaka et al. [62] reported that the tea plant in Japan was first introduced from China about 1200 years ago and that the country's original tea populations were established based on only a few seeds from a restricted source. In addition, the authors in [23,25] suggested that the low genetic diversity of tea accessions in Japan could be attributed to long and intensive selection and breeding from the genetically limited tea stock in Japan.
It is important to identify the correspondence between the genetic diversity of tea accessions and their origins. In this study, the different approaches (STRUCTURE and DAPC) used to analyze the population structures of the 462 tea accessions were able to provide complementary information. However, the structuring of tea accessions at K = 2 (based on the estimated ∆K value in STRUCTURE) and K = 4 (based on the BIC and DAPC) clearly did not segregate the accessions based on geographical distinctions. The ∆K statistic of the Evanno method can be artificially maximal at K = 2 in some cases, because it finds the highest level of structure in the data by focusing only on the changes in slope [39,63]. Similar results were obtained in previous studies on tea germplasm structures based on SSR (K = 2) [11,30,64,65]. The DAPC method does not require that populations be in Hardy-Weinberg (HW) equilibrium and can handle large sets of data without using parallel processing software, so it provides an interesting alternative to the STRUCTURE software [66]. In addition, the DAPC analysis provided more detailed clusters compared to the STRUCTURE analysis in previous analyses using SSR [28,66,67]. Our results also agree with those of previous studies where the DAPC analysis (K = 4) provided more detail than STRUCTURE (K = 2). However, these groupings showed lower genetic differentiation (PhiPT; DAPC = 1.2%; clustering analysis of phytochemicals = 0.8%; STRUCTURE = 0.5%) than the grouping by collection area (5.6%). This might be due to an imbalance in the distribution of tea accessions used in this study, as 88.3% of tea accessions in this study were collected from South Korea. In our previous study, the genetic differentiation in the DAPC analysis of Korean tea germplasm was 1.4% [23]. This imbalance likely contributed to the low genetic differentiation between the groups resulting from the population structure analyses, although the genetic differentiation among tea origins was also low (5.6%).

Development of a Target-Oriented Core Collection
To develop core collections, various types of data, such as phenotypes, proteins, and molecular markers, have been used. However, there is no universally accepted method for constructing a core collection because every method has advantages and disadvantages [68]. Previous studies have shown that phenotypes are useful parameters for developing core collections [2,12,69]. Kumar et al. [70] reported that molecular markers are more effective for developing a core collection than other data, such as morphological traits, which are sensitive to environmental effects. In addition, molecular markers are more effective in identifying and minimizing redundancy. Le et al. [71] suggested that using phenotypic and molecular data together is more effective than using either individually when constructing a core collection. In this study, molecular markers and biochemical contents were used to construct a core collection of tea germplasm with the POWERCORE program, which has been successfully used to build core collections for various plant species, including olive [69], safflower [71], and tea [9].
In this study, the seasonal data sets were handled independently to develop the core collections because the MANOVA presented noticeable genotype × environment interactions. In addition, the evaluation indices (MD%, VD%, VR%, and CR%) used to validate the core collection were comparable and reflected its effectiveness in capturing diversity. MD%, VD%, and VR% were used to evaluate the statistical consistency between the core and entire collections [42]. MD% represents the difference in trait means between the core and entire collections and should be <20% for a representative core collection. VD% indicates the variance captured by the core collection, and VR% compares the coefficients of variation of the core and entire collections. CR% indicates whether the range of each variable in the entire collection is well represented in the core set and should be greater than 80% [12,42,43,70]. In this study, the core collection yielded a CR% of more than 80% (97.43%) and an MD% of less than 20% (7.88%) (Table 7). Similar results have been reported for other species, in which core collections with a lower MD% or a higher CR% were more representative of the entire collections [72,73]. In addition, the distribution of each phytochemical in the core collection was similar to that in the entire collection (Figure S1). In general, core collections can be classified into three types: core collections representing (1) individual accessions, (2) extremes, and (3) the distribution of accessions in the entire collection [3]. Odong et al. [3] suggested that a core collection of type 3 (distribution of accessions) is only of interest if the aim is to provide an overview of the composition of the whole collection using only a part of the collection. The authors in [23,74] suggested that this type of core collection can be obtained by maximizing the representativeness of the pattern of trait variation in the whole collection. Considering these reports, the core collection developed in this study showed a pattern similar to type 3 and could therefore represent the entire collection.
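As an illustration of the four indices discussed above, the following sketch computes them under one commonly used formulation, in which each index is averaged over traits: MD% and VD% are the absolute differences in means and variances relative to the core collection, VR% is the ratio of the core to entire coefficients of variation, and CR% is the ratio of the core to entire ranges. The exact definitions applied in this study follow [42] and may differ in detail. The function name, trait names, and values are hypothetical.

```python
from statistics import mean, variance

def core_evaluation_indices(entire, core):
    """Return MD%, VD%, VR%, and CR% averaged over traits.

    entire, core: dict mapping trait name -> list of trait values.
    Assumed formulation (per trait, then averaged):
      MD = |mean_e - mean_c| / mean_c     VD = |var_e - var_c| / var_c
      VR = cv_c / cv_e                    CR = range_c / range_e
    """
    totals = {"MD%": 0.0, "VD%": 0.0, "VR%": 0.0, "CR%": 0.0}
    traits = list(entire)
    for trait in traits:
        e, c = entire[trait], core[trait]
        mean_e, mean_c = mean(e), mean(c)
        var_e, var_c = variance(e), variance(c)
        cv_e = var_e ** 0.5 / mean_e  # coefficient of variation, entire
        cv_c = var_c ** 0.5 / mean_c  # coefficient of variation, core
        totals["MD%"] += abs(mean_e - mean_c) / mean_c
        totals["VD%"] += abs(var_e - var_c) / var_c
        totals["VR%"] += cv_c / cv_e
        totals["CR%"] += (max(c) - min(c)) / (max(e) - min(e))
    return {name: 100.0 * total / len(traits) for name, total in totals.items()}

# Hypothetical trait values, for illustration only.
entire = {
    "trait_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "trait_b": [1.0, 2.0, 3.0, 4.0, 5.0],
}
core = {
    "trait_a": [2.0, 6.0, 10.0],  # keeps both extremes
    "trait_b": [2.0, 3.0, 5.0],   # misses the lowest value
}
indices = core_evaluation_indices(entire, core)
print(indices)
```

Under this formulation, a core set that retains the extremes while dropping intermediate accessions keeps CR% near 100 but inflates the variance, so VD% rises and VR% exceeds 100, a pattern also visible in the values reported for the TOCC.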
By integrating genetic diversity and phytochemical content, we developed a target-oriented core collection, an approach that had not previously been attempted in tea germplasm. The main targets for tea breeding and use are mostly related to catechin content; therefore, the phytochemical analysis and the development of the TOCC allow the use of tea germplasm to be extended more broadly. Furthermore, the TOCC retained the phytochemical and genetic diversity of the ENC because the accessions were selected after analyzing the variation in phytochemical content over two years together with the molecular marker data. The genetic diversity indices (I and H) and the distribution of accessions (NJ and PCoA) also indicate that the TOCC is well developed and reflects the whole diversity of the ENC. Through this process, we developed a core collection with greater added value, which will not only provide useful materials to breeders but also aid in the efficient management of gene banks. This target-oriented core collection is distinguished from previous core collections, in which accessions were selected based on agronomic traits and molecular markers. Our upgraded core collection, which focuses on the phytochemical content of tea germplasm, suggests new directions for the use and conservation of tea germplasm.
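The two diversity indices mentioned above can be sketched for a single locus as follows, assuming the standard definitions I = −Σ p_i ln p_i (Shannon's index) and H = 1 − Σ p_i² (Nei's gene diversity), where p_i is the frequency of allele i. The function names and allele calls are hypothetical, and the software used in this study may apply additional corrections (for example, for sample size).

```python
from collections import Counter
from math import log

def allele_frequencies(alleles):
    """Return the frequency of each distinct allele in a list of allele calls."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [n / total for n in counts.values()]

def shannon_index(freqs):
    """Shannon's diversity index: I = -sum(p * ln p)."""
    return -sum(p * log(p) for p in freqs if p > 0)

def nei_gene_diversity(freqs):
    """Nei's gene diversity: H = 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in freqs)

# Hypothetical allele calls at one SSR locus, for illustration only.
entire_alleles = ["a"] * 6 + ["b"] * 2  # redundant collection: 0.75 / 0.25
core_alleles = ["a", "a", "b", "b"]     # redundancy removed: 0.50 / 0.50

entire_freqs = allele_frequencies(entire_alleles)
core_freqs = allele_frequencies(core_alleles)
print(shannon_index(entire_freqs), nei_gene_diversity(entire_freqs))
print(shannon_index(core_freqs), nei_gene_diversity(core_freqs))
```

Both indices increase as allele frequencies become more even, so removing redundant accessions while retaining all alleles is one plausible reason why I and H can be higher in a core collection than in the entire collection.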

Conclusions
Evaluating plant germplasm and establishing a core collection will enhance the proper utilization of plant genetic resources [73]. In particular, core collections have been developed for various crops because their size facilitates evaluation, use, and conservation while maintaining the existing genetic diversity [7]. In this study, the phytochemical content and genetic diversity of 462 tea accessions were evaluated, and a target-oriented core collection was constructed based on these results. The phytochemical contents of the 462 tea accessions showed varying distributions, although the genetic diversity was low. In addition, this is the first attempt to combine molecular diversity data with phytochemical data to develop a core collection of the tea germplasm conserved in NAC. This target-oriented core collection will provide access to genetic diversity and phytochemical traits, which will be useful for characterizing the genetic determinants of the traits of interest. Furthermore, it could be used to design more effective breeding programs to increase the global utility of tea as a functional crop.

Supplementary Materials: Table S1. List and phytochemical contents of the 462 tea accessions in this study. Table S2. List of the 33 SSR primers in this study. Table S3. Mean squares for phytochemicals according to year, accessions, and year × accession interactions. Table S4. Multivariate analysis of variance (MANOVA) to study yearly differences in all quantitative traits together.