Developing and Testing Molecular Markers in Cannabis sativa (Hemp) for Their Use in Variety and Dioecy Assessments

Cannabis sativa (2n = 2x = 20) is a popular species belonging to the Cannabaceae family. Despite its use for medical, recreational, and industrial purposes as well as its long history, the genetic research on this species is in its infancy due to the legal implications and the prohibition campaigns. The recent legalization of Cannabis in many countries along with the use of genomics boosted the approaches aimed at marker-assisted selection, germplasm management, genetic discrimination, and authentication of cultivars. Nonetheless, the exploitation of molecular markers for the development of commercial varieties through marker-assisted breeding schemes is still rare. The present study aimed to develop an informative panel of simple sequence repeat markers to be used for the genotyping of high breeding value C. sativa lines. Starting from 41 nuclear SSR designated by in silico analyses, we selected 20 highly polymorphic and discriminant loci that were tested in 104 individuals belonging to 11 distinct hemp varieties. The selected markers were successful in accessing homozygosity, genetic uniformity, and genetic variation within and among varieties. Population structure analysis identified eight genetic groups, clustering individuals based on sexual behaviors (dioecious and monoecious) and geographical origins. Overall, this study provides important tools for the genetic characterization, authentication, conservation of biodiversity, genetic improvement and traceability of this increasingly important plant species.


Introduction
Cannabis sativa L. is a multiuse plant species cultivated for its fiber and seeds (var. sativa; hemp) or for its high content in cannabinoids (var. sativa and var. indica) [1]. Among the latter, the principal psychoactive constituent of Cannabis is tetrahydrocannabinol (THC) with varieties containing up to 30% THC by dry weight [2]. The genus Cannabis belongs to the family Cannabaceae (order Rosales), and its nomenclature has been disputed since the Linnaeus classification in 1753. At that time, it was not clear whether the genus was mono-or polytypic [3][4][5]. Subsequently, Small and Cronquist [4] proposed a unique species system, establishing the existence of two subspecies of C. sativa, namely, subsp. sativa and indica, the validity of which is still universally accepted. Recently, some authors have proposed a classification of C. sativa varieties based on their cannabinoid and terpenoid profiles [6][7][8]. In contrast to this classification, a method based on DNA markers may be useful to develop a fast and low-cost technique enabling the solution to this taxonomic debate. These markers may also be employed in the identification and characterization of uncertified Cannabis strains, including those originating from the black market [9]. strategies. The karyological identity of monoecious plants to female ones [12,36] makes the identification of tightly linked markers difficult [37].
Linkage disequilibrium (nonrandom association of alleles at different loci) is a sensitive indicator of the population genetic forces that structure a genome [38]. In the present study, linkage disequilibrium was used to examine correlations between sex and the developed SSR markers and among the markers themselves.

Overall Genetic Diversity
Based on a previous study on the C. sativa (cs10) genome (BioProject ID: PRJNA560384), a total of 126,593 perfect and 12,017 compound SSR regions were identified with a density equal to 148 SSRs/Mbp and 0.34% of the total length of the genome [9]. In total, 41 primer pairs amplifying an equivalent number of SSR loci, with an average of 4 markers per chromosome, were designed in silico and tested on a subset of DNA samples obtained considering at least one sample from each variety. Of these loci, 4 did not produce any amplicon, 6 produced nonspecific products, and 11 were discarded for the high number of missing data among samples. The remaining 20 loci were used to screen on the entire collection, comprised of 104 individuals belonging to 11 different hemp varieties of European origin, both monoecious and dioecious. The SSR profiles of each sample is provided as supplementary material (Table S1). Table 1 summarizes the number of detected alleles per locus (N), the frequency of the most abundant allele (p i ), the observed heterozygosity (H o ), expected heterozygosity (He), average heterozygosity (H a ) and the polymorphism information content (PIC) of the 20 selected loci.
In total, 301 alleles were detected among 11 varieties with several observed alleles per locus ranging between 3 (SSR_6-4) and 28 (SSR_X-1). According to [39], 16 of 20 SSR loci were highly informative (PIC > 0.5), and 4 SSR loci were reasonably informative (0.5 > PIC > 0. 25). It is likely that the high number of alleles per locus (N a ) and, therefore, the resulting PIC values reflect the extension of the geographical area of varietal origin. A similar average PIC value (0.71) was obtained by [40] by analyzing samples collected from different European regions with 16 SSR loci. In contrast, limiting analyses to samples derived exclusively from Turkey with 22 markers [41] obtains a much lower average PIC (0.28). The frequency of the most common marker allele (p i ) was low when the observed number of marker alleles was high and vice versa. For instance, SSR_6-4 and SSR_X-1, had Na values equal to 3 and 28, showed pi values of 0.67 and 0.13, respectively.
Although extremely variable (loci ranging from 0.07-SSR_2-1 to 0.84-SSR_8-2), the observed heterozygosity (H o ), was always lower than the expected heterozygosity (H e ), highlighting an excess of homozygosity equally distributed across the genome. One possible explanation for this discrepancy is the excess of false homozygotes caused by null alleles. Nuclear SSRs are in fact characterized by high frequencies of null alleles, i.e., alleles that are not amplified due to the failure of primer annealing in the flanking regions of a target SSR locus [42]. Consequently, null alleles are not detected in the expected heterozygous form and the genotypes are incorrectly scored as homozygotes. The occurrence of these alleles can be inferred (and their proportions quantified) in Hardy-Weinberg (HW) equilibrium natural populations or in progenies where parent genotypes are known [42]. In our study, the varieties used do not meet the criteria required for null alleles estimate. Beyond the low size of each variety group (3 < n < 13) and the lack of both parents and offspring genotyping data, the main limitation is the lack of HW equilibrium due to the strong selection pressure imposed by breeders. Moreover, the accessions analyzed for each variety are unrelated and were chosen as representative of the variety itself. A second plausible explanation is that the excess of homozygosity is the result of a strong selection pressure and inbreeding, which has characterized hemp cultivation over the years. Industrial hemp is in fact grown on a wide scale leading to a stabilization of agronomically important traits in seed stocks and, likely, to a reduced heterozygosity. On the contrary, drug-type varieties, being the results of hybridization events and being mainly propagated via cloning, are usually characterized by higher levels of heterozygosity [13,43]. In support of this, the inbreeding coefficient (F is ) result was always positive and, on average, equal to 0.41 (Table 1). Similarly, F it (heterozygosity between subpopulations) and F st (fixation index) indices were highly variable with values ranging from 0.10 to 0.89 and from 0.06 to 0.37, respectively. N m (gene flow measure) ranged from 0.23 to 2.78 with an average value of 1.02 ± 0.13, showing the presence of gene flow (N m > 1) for 9 loci out of 20. Table 2 summarizes the statistics related to the 104 accessions analyzed in the present study, and they are grouped into the 11 varieties of C. sativa sampled. The number of polymorphic loci was high among all varieties analyzed (at least 90% polymorphic loci), except for the FRA3 variety. Nevertheless, it should be emphasized that this result may be affected by the reduced number of individuals considered in this variety compared to the others (only 3 in contrast to the 8-13 of the other varieties). A similar result was observed for the Shannon index (I) of phenotypic diversity and simple matching-based genetic similarity, which showed high values in all varieties (1 to 1.5 and 63% to 69%, respectively), except for the FRA3 variety.
On the other end, the mean genetic similarity (MGS) among all varieties was more uniform with values ranging between 86% and 91%. The mean number of observed and effective alleles per locus ranged from 2.10 to 6.15 and from 1.89 to 4.43, respectively. Despite the different geographical origins and the different sexual behaviors, the observed heterozygosity was comparable among all varieties, and on average, it was equal to 0.58 ± 0.09, consistent with the allogamous reproductive system. When compared to the expected heterozygosity, all varieties showed an excess of homozygosity as confirmed by Wright's statistics. Both F is and F it were higher than 0 in all varieties, suggesting repeated crosses of genetically related individuals and thus strong selective pressure operated by breeders. Estimates of the fixation index (F st ) changed considerably (from 0.54 to 0.73) among the 11 varieties, indicating an unbalancing contribution of the investigated populations to the total genetic variation. Accordingly, our estimates of inbreeding coefficients suggested that these varieties are characterized by a relatively high degree of genetic differentiation with approximately 64% (average F st = 0.64) of the genetic variation found among varieties and 36% of the total genetic variation expressed within varieties. As expected, the gene flow (N m ) estimate derived from F st was low (ranging from 0.09 to 0.21), demonstrating the absence of gene flow among the varieties.
Private alleles, beyond estimating genetic diversity, are also useful for varietal registration and traceability. The 301 alleles were screened for private markers by considering the core collections from different perspectives. The first classification was made between dioecious and monoecious samples, and the second classification considered the 11 varieties as distinct groups. In addition, the third classification was based on the geographical origin (Italy, France, Netherlands, Poland, Hungary and Finland). In the first case, 22 alleles were found to be private for dioecious samples and 13 for monoecious samples. By considering the 11 groups separately, the highest number of private alleles was found in the ITA3 variety (23 out of 123). Finally, the third division of the samples based on their geographical origin found the highest number of private alleles in the Italian, Finnish and Hungarian samples with 13 out of 175, 12 out of 124, and 10 out of 153, respectively. However, it should be noted that the frequency of each private allele was low with few highly frequent alleles. For example, in the case of the ITA3 variety, the private allele frequency ranged from 5% (loci 8-4) to 41.7% (loci 5-2) with 16 out of 23 private alleles having a frequency ≤10%. Although this finding would discourage the use of single SSR loci for traceability or registration purposes, the combination of multiple private alleles at different loci may still represent an excellent tool for varietal identification [44]. The fact that a high number of private alleles was assigned to Italian and Hungarian clusters and varieties suggests a higher degree of isolation and local adaption (or selection) to environmental conditions for the southern samples, and at the same time, it may indicate that this area is the European domestication center of Cannabis. In fact, it is conceivable that the longer a species is present in each geographical area, the higher the number of allelic variants (albeit rare) that have accumulated.

Population Structure of Cannabis Germplasms and Cluster Analysis
The genetic structure of the germplasm collection was investigated to infer population relationships among individuals ( Figure 1). is the European domestication center of Cannabis. In fact, it is conceivable that the longer a species is present in each geographical area, the higher the number of allelic variants (albeit rare) that have accumulated.

Population Structure of Cannabis Germplasms and Cluster Analysis
The genetic structure of the germplasm collection was investigated to infer population relationships among individuals ( Figure 1). Following the procedure of [45], a maximum ΔK value was found at K = 2 and K = 8 (ΔK = 181 and ΔK = 23, respectively, Figure S1). For K = 2 ( Figure 1, panel a), samples were clearly divided based on geographical origin into two groups as follows: one consisting of varieties from southern Europe (Italian and Hungarian Origin) and the other from northern Europe (France, Poland, Netherlands, and Finland). The clustering of genotypes revealed that 92 of 104 samples showed strong ancestry association (>90%), and on average, the membership of the southern and northern samples to their ancestral group was 92.3% and 96.9%, respectively. In the southern samples, 12 showed admixed ancestry, and in particular, HUN2-10 was the only sample with a higher percentage membership in the Following the procedure of [45], a maximum ∆K value was found at K = 2 and K = 8 (∆K = 181 and ∆K = 23, respectively, Figure S1). For K = 2 ( Figure 1, panel a), samples were clearly divided based on geographical origin into two groups as follows: one consisting of varieties from southern Europe (Italian and Hungarian Origin) and the other from northern Europe (France, Poland, Netherlands, and Finland). The clustering of genotypes revealed that 92 of 104 samples showed strong ancestry association (>90%), and on average, the membership of the southern and northern samples to their ancestral group was 92.3% and 96.9%, respectively. In the southern samples, 12 showed admixed ancestry, and in particular, HUN2-10 was the only sample with a higher percentage membership in the opposite group (northern group). Given the geographical distances between these two groups, the most likely hypothesis is that the admixed samples are the intended result of hybridization events among southern and northern genotypes performed in the past by impassionate amateurs or expert geneticists to introduce genetic variability and constitute new lines [9]. It should also be emphasized that the subdivision into two large groups is not only geographic but also based on the reproductive system. In fact, the southern group contained only dioecious plants (i.e., both male and female plants), while the northern group contained only monoecious plants with few exceptions (FIN group).
Historical information on the introduction of Cannabis in Europe is vague. If the center of origin is in Central or East Asia [46], the first data certifying its presence in Europe date back to approximately 10,000 years ago. In particular, paleobotanical evidence shows the presence of wild Cannabis/Humulus-type pollen in the southeastern part of Europe and, in particular, in Hungary, Romania, and Bulgaria approximately 10200-8500 years BP [46] as well as in Italy (Lake Albano, Rome) 11000 BCE. Although the most accredited hypothesis [47] suggests that dioecious plants (such as those belonging to the southern group in this study) descended from monoecious plants (such as those belonging to the northern group). However, our molecular data along with the paleobotanical evidence mentioned above discourage the hypothesis that the Italian-Hungarian group evolved from the Northern Europe group. It is more likely that the introduction of Cannabis in the presence of wild Cannabis/Humulus-type pollen in the southeastern part of Europe and, in particular, in Hungary, Romania, and Bulgaria approximately 10200-8500 years BP [46] as well as in Italy (Lake Albano, Rome) 11000 BCE. Although the most accredited hypothesis [47] suggests that dioecious plants (such as those belonging to the southern group in this study) descended from monoecious plants (such as those belonging to the northern group). However, our molecular data along with the paleobotanical evidence mentioned above discourage the hypothesis that the Italian-Hungarian group evolved from the Northern Europe group. It is more likely that the introduction of Cannabis in Northern Europe and the Mediterranean basin occurred independently at different times. The clear division between the two European clusters was also clearly visible with PCoA analysis (Figure 2), in which the southern samples were positioned in the top-right part and the northern samples were positioned in the bottom-left part. When analyzing the germplasm collection with an additional level of population structure (K = 8), sample clustering by geographical area becomes even more detailed. Plant materials from ITA1 and ITA2 were clearly assigned to a single, distinct cluster by STRUCTURE software (Figure 1, panel b) with an average membership value of 91.3%. PCoA ( Figure 2) analysis further confirmed this finding with these two varieties visibly separated from the others and occupying the top-right quadrant. Instead, the third group of Italian varieties (ITA3) formed a separate cluster in the structure analysis. This sample When analyzing the germplasm collection with an additional level of population structure (K = 8), sample clustering by geographical area becomes even more detailed. Plant materials from ITA1 and ITA2 were clearly assigned to a single, distinct cluster by STRUCTURE software (Figure 1, panel b) with an average membership value of 91.3%. PCoA ( Figure 2) analysis further confirmed this finding with these two varieties visibly separated from the others and occupying the top-right quadrant. Instead, the third group of Italian varieties (ITA3) formed a separate cluster in the structure analysis. This sample group showed the lowest levels of similarity with all the varieties analyzed (Table S2) and clustered separately both in the PCoA ( Figure 2) and in the resulting dendrogram ( Figure S2). Additionally, the average similarity values with ITA1 (85.4%) and ITA2 (85.1%) were lower than that calculated between ITA1 and ITA2 (88.4%). Since the introduction of Cannabis in Italy, it could be hypothesized that the ITA3 group has remained moderately isolated both from the ITA1/ITA2 group and from all other European varieties. This hypothesis was further supported by the number of private alleles found in ITA3, which was the highest among the 11 varieties.
In particular, the HUN1 and HUN2 dioecious samples were assigned to two different clusters by STRUCTURE software. Members of the HUN1 variety showed, on average, higher membership percentages to the cluster (91.9%) compared to the HUN2 ones (85.3%), which was clearly visible in the PCoA with the HUN1 samples tightly clustered in the bottom-right quadrant and the HUN2 samples dispersed around the previous ones. Similarly, samples from Finland (FIN), Poland (POL), and Netherlands (NED) were assigned to three different clusters with an average membership equal to 92.6%, 93.2%, and 91.2%, respectively. The results observed for the three French varieties (FRA1, FRA2, and FRA3) were unexpected. FRA1 and FRA2 formed a cluster apart with an important share of admixed samples and a consistent representation of the Polish ancestor. The only exception was represented by FRA1-8 that resulted admixed also with the Hungarian cluster HUN2 and towards which it showed a mean GS of 88% (Table S2). Similarly, within the FRA3 variety, FRA3-1 was admixed, with a consistent representation of the Hungarian cluster HUN1 (the mean GS between FRA3-1 and HUN1 group was 89%, Table S2); FRA3-2 and FRA3-3 samples clustered instead with high ancestry membership (96.4% and 93.1%, respectively) with the Polish cluster. The close phylogenetic relationship existing between French and Polish samples was also evident from the PCoA and from the similarity matrix. The three French groups showed an average degree of similarity with the POL group oscillating between 86.6% (FRA2) and 88.3% (FRA3). In support of this, a recent study conducted on the occurrence of the CBDA-synthase gene (CBDAS), THCA-synthase gene (THCAS) and two CBDAS pseudogenes across 110 Cannabis accessions has demonstrated the tight relationship existing among three French and two Polish genotypes, suggest a recent common ancestor [48]. On the contrary, those samples proving an admixed ancestry between French and Hungarian clusters (i.e., FRA1-8 and FRA3-1) could be the results of hybridization events that have never been described in previous studies.
Overall, this is the first study in which a robust and novel panel of SSRs has been successfully used to investigate the geographical origin of samples derived from different European countries. Previous studies have mainly concerned Asian accessions [49], while the only other study focusing on accessions collected from the same countries considered here (i.e., Italy, Hungary, France, and Finland) failed to reconstruct the geographical origin of the samples [50].

Sex Determination and Linkage Disequilibrium
Separating male and female plants at the seedling stage is useful in breeding programs because plant gender influences the economic value and the selection schemes, which is particularly true in Cannabis. Thus, the entire germplasm was screened using UBC354151 RAPD-deriving SCAR119, originally developed by Törjék [35] and named male-associated DNA from Cannabis (MADC4). In the last few years, several sex predictive markers have been developed and tested [51], but MADC4 is still considered the most reliable marker [34]. Similarly, our study confirmed that SCAR was 100% effective in predicting the sex of the samples as a 119 bp amplicon was detected in all the male individuals but not in female and monoecious plants. Moreover, to investigate any possible pattern of disequilibrium, we analyzed the association existing between the MADC4 marker locus and each of the 20 selected SSR loci in all possible pairwise combinations. Significant (P, 0.01) linkage disequilibria were found between the SCAR119 sex determination locus and the 2.1 and 5.5 SSR loci (Table 3), attributed to chromosomes 2 and 5, respectively.
In the scientific literature, there is no information regarding the localization of the SCAR119 marker as its identification precedes the sequencing of the Cannabis genome. However, by aligning the SCAR119 sequence (AB021659.1) on the Cannabis reference genome (GCA_900626175.2, cs10), we located MADC4 on chromosome 2 (E-value = 0, at position 88,801,370 bp), justifying the association with SSR_2-1, which is located on the same chromosome as much as 15,695,145 bp upstream. Although neither the SCAR119 marker locus nor the 2.1 and 5.5 SSR marker loci are located on the sex chromosome (i.e., the chromosome in Cannabis that determines the sex of the flower), it is likely that specific regions of autosomal chromosomes also contribute to sex determination. In this regard, a recent study has reported that, of 555 sex-linked genes, 363 mapped to sex chromosomes, while the remaining 192 (i.e., 35% of all sex-linked genes) mapped to all the other autosomes [52]. Table 3. Pairwise linkage disequilibria (LD) between all SSR marker loci and the SCAR119 marker. Significant LD **: 0.001 < p<0.01 ***: p < 0.001.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 • C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.
Genomic DNA (gDNA) was extracted from 70-100 mg of each sample using the "DNeasy ® 96 Plant Kit" (Qiagen, Hilden, Germany) following the manufacturer's instructions. The quality of the genomic DNA samples was assessed by electrophoresis on a 1% (w/v) agarose gel stained with 1X SYBR ® Safe™ DNA Gel Stain (Life Technologies, Carlsbad, CA, USA) in Tris-acetate-EDTA (TAE) running buffer. The yield and purity were evaluated using a NanoDrop 2000c UV-vis Spectrophotometer (Thermo Scientific, Pittsburgh, PA, USA). Following DNA quantification, all the DNA samples were diluted to a final concentration of 20 ng/µL to be used as templates for PCR amplification.

Analysis of the SSR Marker Loci
The amplification reactions were performed using the three-primer strategy reported by Schuelke [53] with some modifications. Briefly, for each primer pair, universal sequences (namely, M13 and PAN1-3) were used to tag the 5 end of the forward primer and were utilized in PCR assays in combination with M13, PAN1, PAN2, and PAN3 fluorophorelabeled oligonucleotides. The fluorophores used in all amplification reactions were 6-FAM, VIC, NED, and PET. As a preliminary step, 41 SSR primer pairs previously designed by [9] based on a de novo SSR identification analysis on the cs10 genome were evaluated to amplify a subset of DNA samples (at least one for each population). Each PCR contained  (Table 5) and then used to amplify all 104 genomic DNA samples using the same conditions previously reported.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4. Table 4. Detailed information of the 104 samples analyzed in this study. The name of each variety, the geographical origin, the number of samples and the sex behavior are reported along with a representative picture of the leaf morphology.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4. Table 4. Detailed information of the 104 samples analyzed in this study. The name of each variety, the geographical origin, the number of samples and the sex behavior are reported along with a representative picture of the leaf morphology.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4. Table 4. Detailed information of the 104 samples analyzed in this study. The name of each variety, the geographical origin, the number of samples and the sex behavior are reported along with a representative picture of the leaf morphology.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4. Table 4. Detailed information of the 104 samples analyzed in this study. The name of each variety, the geographical origin, the number of samples and the sex behavior are reported along with a representative picture of the leaf morphology.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Plant Materials of Cannabis
Fresh samples from 104 individuals representing 11 different Cannabis sativa industrial hemp varieties of high breeding value kindly provided by Gruppo Padana (Paese, TV, Italy) were collected in December 2020 and stored at −20 °C. Varieties were selected based on their geographical origin (Italy, Finland, Hungary, France, Poland, and Netherlands) and their sex behavior (dioecious and monoecious). Detailed information is presented in Table 4.

Molecular Data Analysis
Statistical analyses for all SSR marker loci were performed using the PopGene software package v. 1.32 [54]. The observed number of alleles per locus (n e ), Levene's observed heterozygosity (H o ; [55]), Nei's expected heterozygosity (H e ; Nei, 1973) and the average heterozygosity (H a ; Nei, 1978) were computed for each SSR locus and over all SSR markers. PopGene software was also used to estimate F-statistics [56], including the heterozygosity within (F is ) and between (F it ) subpopulations as well as the fixation index (F st ), according to Wright [57]. The phenotypic diversity of the allele profiles was estimated using Shannon's information index (I) as reported by Lewontin [58]. Gene flow (N m ) estimates among subpopulations were derived from the fixation index as described by McDermott and McDonald [59].
The genetic similarity (GS) in all the pairwise comparisons and the principal component analysis (PCoA) for the two main dimensions were conducted under default settings using NTSYS software package v. 2.21c [60]. Moreover, the cluster dendrogram was built based on the unweighted pair-group with arithmetic mean (UPGMA) method by applying Dice's coefficient [61], and a bootstrap analysis was conducted with 1,000 resampling replicates [62] using Past3 software [63].
The population structure of the C. sativa collection was investigated using the modelbased (Bayesian) clustering algorithm implemented in STRUCTURE software v.2.2 [64], which groups individuals according to marker allele combination and distribution. The number of founding groups ranged from 1 to 15, and the method described by [45] was used to evaluate the most likely estimation of K. A burn-in of 2 × 10 5 and a final run of 10 6 Markov chain Monte Carlo (MCMC) steps were set. GENALEX v 6.503 [65] was used to investigate the total genetic variation by means of analysis of molecular variance (AMOVA) and to evaluate the presence of private alleles among populations. Alleles were considered private when present with at least 5% frequency in one group and absent in all others.

Prediction of Plant Sex through the SCAR119 Marker and Linkage Disequilibrium Analysis
To predict the sex of dioecious individuals (61 out of 104) at the seedling stage, [35] developed a SCAR-based assay that generates 119 bp bands only in male plants. To test the reliability of this molecular assay, we applied this analysis to the entire germplasm by using the protocol described by the authors and the following oligos: SCAR119_F: 5 -TCAAACAACAACAAACCG-3 and SCAR119_R: 5 -GAGGCCGATAATTGACTG-3 .
Electrophoresis was performed using a 1.5% agarose gel on a Mupid ® -one apparatus (Advance ® ) with Tris-acetate-EDTA buffer (TAE) and SYBR Safe for staining. The presence and size of PCR products were visualized using an UV transilluminator (Uvitec Cambridge ® ).
Any possible linkage disequilibrium existing between SCAR119 and the SSR loci previously analyzed was estimated and tested by Fisher's exact test. A probability lower than 0.01 was selected to indicate a statistically significant amount of disequilibrium. All calculations and analyses were conducted using Genetic Data Analysis (GDA) version 1.0 software [66].

Conclusions
Cannabis sativa is a high-demand crop with a long human history of medicinal, agricultural, industrial, and recreational uses. However, there is a lack of knowledge of this valuable plant in many fields of science, including the use of genomics for breeding new varieties and tracing their commercial derivatives.
Our research focused on the implementation and validation of an informative, reliable, and reproducible multilocus genotyping system in C. sativa that may be useful both for marker-assisted breeding of new varieties and for unequivocally identifying specific varieties already on the market or assessing their genetic identity. The availability of easy-to-use molecular tools for the authentication of newly released varieties is also crucial for protecting plant breeders' rights and Cannabis derivatives' consumers.
First, we demonstrated the effectiveness of our novel set of SSR marker loci for assessing the genetic distinctiveness, uniformity, and stability of individual varieties as well as for estimating heterozygosity/homozygosity statistics of single plants or populations and the extent of genetic variation within and genetic relationship among different C. sativa varieties. This information could be exploited for planning crosses and predicting heterosis as an expression of hybridity in experimental F1 populations based on the allelic divergence and genetic distance of the selected parental lines.
Finally, our study on linkage disequilibrium demonstrated a correlation between two newly developed SSR marker loci and the well-known sex determination locus, SCAR119, which may assist further investigations on dioecy in C. sativa.
Knowing the parental genotypes would enable not only protection of newly registered varieties but also assessment of the genetic purity and identity of the seed stocks of commercial F1 hybrids and certification of the origin of clonal samples (cuttings) of unknown derivation. Moreover, the newly developed panel of SSR markers not only discriminated southern European samples (Italian and Hungarian Origin) from northern European samples (France, Poland, Netherlands, and Finland) but also assigned each sample to a single variety of hemp or its country of origin with only a few exceptions (<5%).
We are confident that the DNA-based labeling of varieties covered by patent protection represents the only solution in the Cannabis market for managing intellectual property rights in the Cannabis seed industry and protecting professional producers of hemp cuttings and derivatives.