Transcriptome Analysis and Identification of a Female-Specific SSR Marker in Pistacia chinensis Based on Illumina Paired-End RNA Sequencing

Pistacia chinensis Bunge (P. chinensis), a dioecious plant species, has been widely found in China. The female P. chinensis plants are more important than male plants in agricultural production, as their seeds can serve as an ideal feedstock for biodiesel. However, the sex of P. chinensis plants is hard to distinguish during the seedling stage due to the scarcity of available transcriptomic and genomic information. In this work, Illumina paired-end RNA sequencing assay was conducted to unravel the transcriptomic profiles of female and male P. chinensis flower buds. In total, 50,925,088 and 51,470,578 clean reads were obtained from the female and male cDNA libraries, respectively. After quality checks and de novo assembly, a total of 83,370 unigenes with a mean length of 1.3 kb were screened. Overall, 64,539 unigenes (77.48%) could be matched in at least one of the NR, NT, Swiss-Prot, COG, KEGG, and GO databases, 71 of which were putatively related to the floral development of P. chinensis. Additionally, 21,662 simple sequence repeat (SSR) motifs were identified in 17,028 unigenes of P. chinensis, and the mononucleotide motif was the most dominant type of repeats (52.59%) in P. chinensis, followed by dinucleotide (22.29%), trinucleotide (20.15%). The most abundant repeats were AG/CT (13.97%), followed by AAC/GTT (6.75%) and AT/TA (6.10%). Based on these SSR, 983 EST-SSR primers were designed, 151 of which were randomly chosen for validation. Of these validated EST-SSR markers, 25 SSR markers were found to be polymorphic between male and female plants. One SSR marker, namelyPCSSR55, displayed excellent specificity in female plants, which could clearly distinguish between male and female P. chinensis. Altogether, our findings not only reveal that the EST-SSR marker is extremely effective in distinguishing between male and female P. chinensis but also provide a solid framework for sex determination of plant seedlings.


Introduction
At present, the divergence in the relationship between fossil energy consumption and global oil demand is becoming increasingly prominent. The volatility of international crude oil market and the environmental issues caused by the wide-scale use of fossil fuels have become two major areas of focus [1]. Moreover, it is bound to develop diversified, re-biochemical, and clean energy for humans. Therefore, the development of new biomass energy is one of the most important directions to solve the global energy crisis. P. chinensis is a dioecious plant species with wind-pollinated apetalous flowers, and its biomass can serve as a potentially important renewable energy option [2]. This species is commonly found in China and has many distinctive features. For instance, it can tolerate dry conditions and grow in alkaline or acidic soil [3]. The oil content of P. chinensis seed is typically higher than 40%, which is a non-drying type. The sixteen alkyl value of biodiesel produced by

Plant Sampling and RNA Extraction
The flowers of P. chinensis are unisexual and clustered in axillary panicles. Flowers are small, and perianth segments are lanceolate or narrowly lanceolate with a length of about 1.5-2 mm and with a pedicel of about 1 mm. Bracts are lanceolate or narrowly lanceolate, with a length of about 1.5-2 mm. Male inflorescences are closely arranged with a length of 6-7 cm, and the female inflorescences are loosely arranged with a length of 15-20 cm  Figure 1). The flower buds of P. chinensis (more than 100 buds per plant) were pooled from 3 female trees and 3 male trees (at least 10 m tall), respectively, in Kunming, Yunnan, southwest of China. All samples were snap-frozen in liquid nitrogen and kept at -80 • C until further analysis. Total RNA was extracted with the EASY spin Plus Plant RNA Kit following the manufacturer's instructions (Aidlab Biotech, Beijing, China). The purity and quantity of RNA samples were determined by a NanoDrop 2000 spectrophotometer (Life Technologies, CA, USA). Only those samples with an A260/A280 of >1.8 and an A260/A230 ratio of >1.8 were selected for subsequent analysis. Final quality assessment was performed using the Bioanalyzer RNA 6000 Nano assay (Agilent Technologies, Santa Clara, CA, USA) prior to deep sequencing. The qualified samples with RNA integrity number (RIN) of >6.5 were selected for further analysis. DNase I (RNase-free) treatment was performed on all RNA samples in order to eliminate possible DNA contamination.
Genes 2022, 13, x FOR PEER REVIEW 3 of 14 length of 6-7 cm, and the female inflorescences are loosely arranged with a length of 15-20 cm (Figure 1). The flower buds of P. chinensis (more than 100 buds per plant) were pooled from 3 female trees and 3 male trees (at least 10 m tall), respectively, in Kunming, Yunnan, southwest of China. All samples were snap-frozen in liquid nitrogen and kept at -80 °C until further analysis. Total RNA was extracted with the EASY spin Plus Plant RNA Kit following the manufacturer's instructions (Aidlab Biotech, Beijing, China). The purity and quantity of RNA samples were determined by a NanoDrop 2000 spectrophotometer (Life Technologies, CA, USA). Only those samples with an A260/A280 of >1.8 and an A260/A230 ratio of >1.8 were selected for subsequent analysis. Final quality assessment was performed using the Bioanalyzer RNA 6000 Nano assay (Agilent Technologies, CA, USA) prior to deep sequencing. The qualified samples with RNA integrity number (RIN) of >6.5 were selected for further analysis. DNase I (RNase-free) treatment was performed on all RNA samples in order to eliminate possible DNA contamination.

cDNA Library Construction and Paired-End Sequencing
cDNA libraries were constructed with the pooled RNA samples (5 μg) by using the NEB Next Ultra RNA Library Prep Kit for Illumina (NEB, Ipswich, MA, USA) according to the manufacturer's instructions. The quality and quantity of cDNA libraries were assessed using the StepOnePlus Real-Time PCR System (Applied Biosystems, Foster City, CA, USA) and Agilent 2100 Bioanaylzer (Agilent Technologies, Santa Clara, CA, USA). Two cDNA libraries of female and male P. chinensis plants were subjected to paired-end sequencing on an Illumina HiSeq 4000 platform (1Gene Company, Hangzhou, China).

Data Pre-Processing and De Novo Assembly
To obtain high-quality clean data, the raw reads were filtered using in-house Perl scripts. The Q20, Q30, GC content, and sequence duplication level of the clean sequences were determined. De novo assembly of the transcriptomic data was conducted with a Trinity assembler (trinityrnaseq-2.0.6), and the parameters are: minimum contig length = 200, min_kmer_cov = 2, and min_glue = 3 [23]. All contigs were generated by merging the sequences with a certain overlap length. The paired-end reads were then mapped back to the contigs, and the distance between the two end reads was revealed. Subsequently, the contigs were connected by the Trinity assembler to obtain the sequences that could no longer be extended on either end. These sequences were referred to as unigenes. The resulting unigenes were then subjected to sequence splicing and redundancy elimination by using the TGICL software system (Linux x86) with the parameters of repeats tringency = 0.95, min_match = 35, and min_score = 35 [24] in order to yield non-redundant unigenes with the maximum length. As a metric for assembly quality, the distribution of contigs

cDNA Library Construction and Paired-End Sequencing
cDNA libraries were constructed with the pooled RNA samples (5 µg) by using the NEB Next Ultra RNA Library Prep Kit for Illumina (NEB, Ipswich, MA, USA) according to the manufacturer's instructions. The quality and quantity of cDNA libraries were assessed using the StepOnePlus Real-Time PCR System (Applied Biosystems, Foster City, CA, USA) and Agilent 2100 Bioanaylzer (Agilent Technologies, Santa Clara, CA, USA). Two cDNA libraries of female and male P. chinensis plants were subjected to paired-end sequencing on an Illumina HiSeq 4000 platform (1Gene Company, Hangzhou, China).

Data Pre-Processing and De Novo Assembly
To obtain high-quality clean data, the raw reads were filtered using in-house Perl scripts. The Q20, Q30, GC content, and sequence duplication level of the clean sequences were determined. De novo assembly of the transcriptomic data was conducted with a Trinity assembler (trinityrnaseq-2.0.6), and the parameters are: minimum contig length = 200, min_kmer_cov = 2, and min_glue = 3 [23]. All contigs were generated by merging the sequences with a certain overlap length. The paired-end reads were then mapped back to the contigs, and the distance between the two end reads was revealed. Subsequently, the contigs were connected by the Trinity assembler to obtain the sequences that could no longer be extended on either end. These sequences were referred to as unigenes. The resulting unigenes were then subjected to sequence splicing and redundancy elimination by using the TGICL software system (Linux x86) with the parameters of repeats tringency = 0.95, min_match = 35, and min_score = 35 [24] in order to yield non-redundant unigenes with the maximum length. As a metric for assembly quality, the distribution of contigs and length of unigenes were calculated. After gene family clustering, all unigenes assigned Genes 2022, 13, 1024 4 of 14 into 2 categories: (i) cluster (unigenes with 70% similarity to each other) and (ii) singleton. Finally, the sequence directions of the unigenes were evaluated.

Functional Annotation
To characterize their putative functions, all unigenes were first aligned with the NCBI non-redundant (NR) protein database, Swiss-Prot Protein database, Cluster of Orthologous Groups of Proteins (COG) database, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment database by using BLASTx with a threshold E-value of 1.0 × 10 −5 . The unigenes were then aligned with the non-redundant nucleotide (NT) database in GenBank by BLASTn with an E-value of 1.0 × 10 −5 . According to a gene ontology (GO) functional classification, the unigenes were annotated using Blast2GO against the NCBI-NR database with a threshold E-value of 1.0 × 10 −5 [25]. To assess the distribution of gene functions in P. chinensis at the macro level, GO functional classification (molecular function, cellular component, and biological process) was conducted for all unigenes using the WEGO software program [26]. KEGG, an important public pathway-related database [27], was employed to unravel the complex functions of the unigenes associated with flower development in P. chinensis. To examine the difference in gene expression between male and female buds, the transcription levels of unigenes were quantified by aligning the RNA-seq reads from each library to the assembly. p-value < 0.01, FDR ≤ 0.001, and log2 (fold change) ≥ 1 or ≤−1 were used as thresholds to determine the significant differences between two samples.

DNA Extraction and EST-SSR Marker Evaluation
Leaf samples of 100 female and 102 male P. chinensis trees were collected from Anyang, Henan Province, China. After drying, genomic DNA was isolated from each leaf sample using the cetyltrimethylammonium bromide (CTAB) extraction method [30]. The quality and quantity of DNA samples were evaluated using the NanoDrop 2000 spectrophotometer. To construct two DNA pools using bulked segregant analysis (BSA), 20 DNA samples in each group of females and males were equally mixed. Then, the concentration of each DNA sample was diluted to 50 ng/µL, and all samples were kept at −20 • C until further analysis. Polymerase chain reaction (PCR) and gel electrophoresis were carried out to evaluate the amplification of 151 SSR primer pairs. PCR assay was initiated with 10 µL of reaction mixture containing 0.1 µL of Taq polymerase (5 U/µL), 1.0 µL of 10× PCR buffer with MgCl 2 (25 mM), 0.8 µL of dNTPs (2.5 mmol/L), 0.2 µL of each primer (10 mmol/L), 2.0 µL of template DNA (50 ng/mL), and 5.9 µL of sterile distilled water. PCR conditions were set as follows: an initial denaturation of 94 • C for 3 min, followed by 35 cycles of 94 • C for 30 s, 55-65 • C for 45 s, and 72 • C for 1 min and a final extension of 72 • C for 5 min. Equivalent aliquots (6 µL) of the PCR products were electrophoresed on a 6% polyacrylamide gel. After electrophoresis at 2000 V for 1-1.5 h, the PCR bands were visualized by silver nitrate staining.

Illumina Paired-End Sequencing and De Novo Assembly of P. chinensis Transcriptome
In total, 54,895,796 and 55,210,442 sequence reads were generated from the female and male buds, respectively. Of these, 50,925,088 and 51,470,578 sequences were of high Genes 2022, 13, 1024 5 of 14 quality after filtering ( Table 1). Considering that there is no reference genome available for P. chinensis, the high-quality sequences obtained from the two cDNA libraries were integrated into a reference transcriptome by de novo assembly using the Trinity short reads assembler [23]. Note: PC mean female bulk and PX mean male bulk. Q20 and Q30 percentages are the proportion of nucleotides with quality value larger than 20 and 30, respectively; GC percentage is the proportion of guanine and cytosine nucleotides among total nucleotides.

Illumina Paired-End Sequencing and De Novo Assembly of P. chinensis Transcriptome
In total, 54,895,796 and 55,210,442 sequence reads were generated from the female and male buds, respectively. Of these, 50,925,088 and 51,470,578 sequences were of high quality after filtering ( Table 1). Considering that there is no reference genome available for P. chinensis, the high-quality sequences obtained from the two cDNA libraries were integrated into a reference transcriptome by de novo assembly using the Trinity short reads assembler [23].  Figure 2). These findings indicated that the assembled sequences were qualified for subsequent analyses.
The results of NR annotation demonstrated that 40,643 unigenes were assigned least one GO term. The sequences enriched in "molecular function", "cellular co nent", and "biological process" clusters were grouped into 46 functional groups (F 4). The most dominant groups of the three major clusters were "cellular processes "metabolic processes", "cell" and "cell part", and "binding" and "catalytic activity spectively (Figure 4). Furthermore, all unigenes were searched against the COG database for funct classification. In total, 47,049 of the 58,543 unigenes were assigned to 25 COG categ including biochemistry metabolism, cellular structure, molecular processing, and s transduction ( Figure 5). Of all clusters, "general function prediction only" (8420, 17 was the most dominant, followed by "replication, recombination, and repair" ( 9.09%), and "transcription" (4149, 8.82%). However, only 13 and 2 unigenes were c fied under "extracellular structure" and "nuclear structure", respectively ( Figure 5). Furthermore, all unigenes were searched against the COG database for functional classification. In total, 47,049 of the 58,543 unigenes were assigned to 25 COG categories, including biochemistry metabolism, cellular structure, molecular processing, and signal transduction ( Figure 5). Of all clusters, "general function prediction only" (8420, 17.90%) was the most dominant, followed by "replication, recombination, and repair" (4276, 9.09%), and "transcription" (4149, 8.82%). However, only 13 and 2 unigenes were classified under "extracellular structure" and "nuclear structure", respectively ( Figure 5).

KEGG Pathway Assignment of Unigenes
Overall, 36,136 assembled sequences were assigned to 128 KEGG pathways, ranging from 3 to 8208 for each pathway. Table 4 shows the top 20 pathways with the highest sequence numbers. The most abundant genes were assigned to "metabolic pathways" (8208, 22.71%), followed by "biosynthesis of secondary metabolites" (4045, 11.19%), "plant-pathogen interaction" (2516, 6.96%), and "plant hormone signal transduction" (1729, 4.78%). More importantly, some unigenes were also enriched in KEGG pathways related to metabolism, including "metabolic pathways" and "biosynthesis of secondary metabolites". All these pathways can play a key role in metabolic regulation.

Identification of Sex-Linked SSR Markers
Among the 983 primer pairs, 151 were randomly selected to screen the efficiency of these primers and to identify sex-linked EST-SSR markers in P. chinensis (Supplementary  Table S2). Of the selected primers, 138 (91.39%) pairs showed clear amplifications in P. chinensis, whereas the remaining 13 failed to amplify. In the genotyping assay combined with BSA, all the 151 primer pairs were employed to detect putative EST-SSR markers in two DNA bulks of female and male P. chinensis. Notably, 25 primer pairs showed distinct polymorphisms between the two DNA bulks. To evaluate whether these polymorphisms can be used for sex identification, 20 DNA samples in each group of females and males were detected separately. The results showed that only PcSSR55 produced a female-specific marker that was not detectable in all male DNA samples ( Figure 6A). To further verify the female-specific marker, 100 female and 102 male DNA samples were amplified with PcSSR55 primer pair, and the results showed that all the female DNA samples exhibited the specific marker band, and only four male samples showed amplification of the specific band ( Figure 6B). Altogether, these findings indicate that PcSSR55 is effective for the sex determination of P. chinensis.

Characterization of P. chinensisTranscriptome
Next-generation sequencing technology presents opportunities for plant genome analysis and offers a fast, cost-effective way to characterize the whole transcriptomes of various organisms [32]. This technique has been applied to sequence an array of nonmodel plants, including strawberry, pistachio, grasspea, and others [33][34][35][36]. In this work, the pooled RNA samples from female and male P. chinensis were analyzed by the Illumina RNA-seq platform, and de novo assembly of their transcriptomes was conducted due to the scarcity of reference sequences in the publicly available databases. The quality of a de novo assembly can be assessed by the mean length and N50 value of the contigs. As presented in Table 2, the mean length and N50 value of the unigenes were 1.3 and 2.0 kb, respectively, which were comparatively better than other transcriptomic studies [37][38][39] and other Pistacia transcriptomes published in the literature [40,41]. Although the higher N50 value and greater average length can indicate an accurate and effective assembly [37,42], previous research has suggested that both measures are primitive and often misleading [43]. Thus, it is generally believed that N50 can be applied to measure the continuity of the unigenes but not their applicability [44].
Of the 83,370 high-quality unigenes of P. chinensis transcriptome, 64,539 unigenes (77.48%) were successfully annotated to the six databases (i.e., COG, GO, KEGG, NR, NT, and Swiss-Prot), and only 22.52% unigenes did not significantly match to any of those six

Characterization of P. chinensisTranscriptome
Next-generation sequencing technology presents opportunities for plant genome analysis and offers a fast, cost-effective way to characterize the whole transcriptomes of various organisms [32]. This technique has been applied to sequence an array of non-model plants, including strawberry, pistachio, grasspea, and others [33][34][35][36]. In this work, the pooled RNA samples from female and male P. chinensis were analyzed by the Illumina RNA-seq platform, and de novo assembly of their transcriptomes was conducted due to the scarcity of reference sequences in the publicly available databases. The quality of a de novo assembly can be assessed by the mean length and N50 value of the contigs. As presented in Table 2, the mean length and N50 value of the unigenes were 1.3 and 2.0 kb, respectively, which were comparatively better than other transcriptomic studies [37][38][39] and other Pistacia transcriptomes published in the literature [40,41]. Although the higher N50 value and greater average length can indicate an accurate and effective assembly [37,42], previous research has suggested that both measures are primitive and often misleading [43].
Thus, it is generally believed that N50 can be applied to measure the continuity of the unigenes but not their applicability [44].
Of the 83,370 high-quality unigenes of P. chinensis transcriptome, 64,539 unigenes (77.48%) were successfully annotated to the six databases (i.e., COG, GO, KEGG, NR, NT, and Swiss-Prot), and only 22.52% unigenes did not significantly match to any of those six datasets, which might be attributed to their short full-length transcripts or the high threshold of E-value [45]. Moreover, it is speculated that these unmatched unigenes have no similar annotations in the six databases and may represent the species-specific genes without prior characterization. From the BLASTX search against the NR database, 84.28% of the identified unigenes of P. chinensis displayed high homology with those of C. clementina, C. sinensis, T. cacao, V. vinifera, P. balsamifera subsp., trichocarpa, R. communis, and A. persica (Figure 3). Such similarity might be due to the lack of whole-genome-sequencing data in the publicly available databases for the related Pistacia species. Through the use of COG and GO databases, the unigenes were categorized into 25 sub-terms and 46 sub-categories (Figures 4 and 5), demonstrating that the identified unigenes possess a wide range of important functions in P. chinensis [46,47]. Next, a total of 36,136 unigenes were annotated and mapped to 128 KEGG pathways (Table 3), which help us to reveal the metabolic pathways and gene interaction. In addition, 1729 unigenes involved in "plant hormone signal transduction" were identified, and these findings may help us to discover potential candidate genes related to the sex differentiation of P. chinensis in the future. In short, de novo RNA-seq based transcriptome analysis of P. chinensis can facilitate future studies on the physiological, biochemical, and molecular aspects of other Pistacia species.

Validation and Polymorphism of EST-SSRs for Gender Identification of P. chinensis
Among the agriculturally important crops, such as pistachio, papaya, kiwifruit, and date palm, female trees are responsible for the production of commercial crop [55]. Thus, it is crucial to identify the gender of these plants [41]. For the development of EST-SSR markers, transcriptomic data mining can provide greater efficiency and flexibility in biomarker selection. SSR marker detection possesses several distinct advantages over other techniques, including low cost, rapidity, and being commonly available and applicable in various plants such as Myrica rubra [56], Phoenix dactylifera [57], A. chinensis [54], and Tapiscia sinensis [58]. In this study, 151 EST-SSR markers were developed and validated, which could provide necessary information for the determination of sex-linked markers in P. chinensis. Among these markers, one SSR marker was able to distinguish between female and male P. chinensis, and the percentage of accurate identification of sex was more than 98%. However, only 2% of the individual gender could not be identified. The reason may be that there is a certain physical distance on the genome between the marker and the gene controlling female traits, so the marker is only closely linked to the female and cannot be co-separated with the female. Previous studies also found that it was not easy to develop sex-linked markers in P. vera with 100% accuracy [5][6][7]. Until the male and female progenies belonging to the Siirt × Bagyolu F 1 population were sequenced, eight loci from seven RAD reads were successfully able to distinguish the sexes in P. vera [8]. Furthermore, seven novel sex-linked SNP markers were identified and mapped to the center of the chromosome and were therefore considered potential sex-linked markers for MAS in P. vera [9].

Conclusions
From the transcriptomic analysis of female and male P. chinensis, 83,370 unigenes were de novo assembled, and 77.48% of them were annotated and mapped to the publicly available databases. In addition, 21,662 SSR motifs were characterized, and one SSR marker was specific to female P. chinensis. Altogether, our findings provide useful insights into the genetic mechanism underlying sex differences in P. chinensis, which can be used as the basis for further research on the functional genomics and reproductive biology of this plant. Furthermore, the established EST-SSR markers may serve as an important molecular tool for the conservation and MAS of P. chinensis.