Transcriptome Characterization of Different Tissues of Stone Pine ( Pinus pinea L.): De Novo Assembly †

: Stone pine ( Pinus pinea L.) is an emblematic tree distributed around the whole Mediterranean basin. The species is well known for the economics of its timber, resins and edible seeds, the stone pine nuts commercialized in the food industry. Despite its relevance, the genomic information available for the species is scarce, and until now no reference genome has been available. The main purpose of this study was to characterize the stone pine transcriptome of seven different tissues, by performing a de novo transcriptome assembly. A total of 55,328 genes were predicted and functionally annotated based on the SWISS-PROT and nr-NCBI databases and InterProScan signatures.


Introduction
Stone pine (Pinus pinea L.) is a Mediterranean species, distributed in coastal areas from the western area of the Iberian Peninsula to Turkey [1]. Stone pine is a valuable species for its pine nuts or pine kernels, which are a delicious and highly nutritious edible seeds, providing a good source of fat, proteins and vitamins, among other phytochemical characteristics [2][3][4]. In addition, the species is also well known for the economics of its timber and resins. Between 2010 and 2015, the Portuguese stone pine area increased by 20,700 ha, reaching 193,600 ha in 2015 [5], representing the second largest area of stone pine of the world.
Advances in sequencing and assembly technologies have allowed rapid progress in the characterization of the genomes of angiosperms, whereas for gymnosperms, such as conifers, the same has not occurred due to the complexity and higher size of their genomes. For instance, the mean size of genomes in the Pinus genus and subgenus are 28.3 Gbs and 26.4 Gbs, respectively [6]. The recent advances in the third generation of high-throughput sequencing technologies and the reduction in their cost allowed the sequencing of two pine genomes from the genus Pinus, P. labertiana (GCA_001447015.2) and P. taeda [7].
Despite the scarce genomic information available for the stone pine, the characterization of the transcriptome even for species with no reference genome available can be performed using RNA-Seq. Transcriptome differences between different plant tissues have been well studied so far, providing a comprehensive characterization of the species transcriptome. Here, in order to explore the differences in the transcriptome between different tissues of stone pine, a transcriptome characterization of needles, xylem, stem bark, terminal bud, first and second year pine cones, and pine nuts, was performed using a de novo transcriptome assembly. This study provides, for the first time, transcriptome resources for seven different tissues of the stone pine, providing a valuable resource for further studies.

Sample Preparation, RNA Extraction, and Sequencing
Samples of different tissues (needles, xylem, stem bark, terminal bud, first and second year pine cones, and pine nuts) were collected from five trees located in Coruche (Portugal). Samples were immediately frozen in liquid nitrogen and stored at −80 • C until being processed. The RNA extraction was performed according to the method of Le Provost [8] with minor modifications. The extracted RNA was sequenced in two different Illumina platforms, NextSeq 550 and HiSeq 4000, producing paired-end (PE) reads of 75 bp and 100 bp in length, respectively.

Sequencing Data, Transcriptome Assembly, and Annotation
The raw reads were pre-processed with Trimmomatic v.0.38 [9], keeping reads with a minimum quality of 20, over a screen size window of 10% of the read length, and a minimum length of 80% of the read length. Then, the de novo transcriptome assembly was performed using Mira v.4.0.2 [10], discarding contigs shorter than 200 bp.
Gene prediction and transcriptome annotation was performed using TransDecoder v.5.5 [11], following its guidelines. BlastP was used to functionally annotate the predicted genes, identifying homologous genes from the SWISS-PROT and nr-NCBI plant databases, and InterProScan was used to obtain protein domains, gene ontology (GO) terms, and KEGG pathways [12][13][14][15].

Tissue-Specific Characterization
In order to characterize the transcriptome at the tissue level, the pre-processed reads of all individuals were mapped against the assembled transcriptome with STAR v.2.7.3a [16], using the two-pass mode according to the manual guidelines. The unique mapped reads were retained and used to estimate the RNA abundance of the predicted genes by means of StringTie (parameter used -e) [17]. The tissue-specific characterization was performed taking into account only genes with an abundance ≥5 in at least one of the biological replicates in at least one tissue, considering those genes as genes expressed. Then, the BinGO plugin from Cytoscape was used to identify GOs overrepresented over the set of genes expressed in each tissue, performing a (BH) multiple-testing correction with a p-value ≤ 0.05 [18].

Transcriptome Assembly and Annotation
A de novo transcriptome assembly was generated from the sampled tissues. Sequencing of cDNA of all samples from both sequencing platforms generated a total of 2,026,716,380 PE reads. After trimming low-quality bases and removing low-quality reads with Trimmomatic, 1,898,376,282 high-quality reads were kept, representing 93.7% of the raw reads ( Table 1). The transcriptome assembly of stone pine resulted in 165,179 contigs equal or greater than 200 bp, which represented an accumulative assembly size of 81,310 Mb (Table 2). A total of 55,328 candidate genes were identified by Transdecoder, of which 41,839 found at least one homologous hit against the SWISS-PROT database. The remaining predicted genes with no hits were further blasted against the nr-NCBI plants, with 8322 genes finding at least one homology hit. Functional categories in terms of GOs and associated KEGG pathways were identified using InterProScan. A total of 28,258 (51.07%) predicted genes were assigned with at least one GO term, covering 2079 different GO terms (BP-biological processes: 41.75%; MFmolecular function: 45.46%; CC-cellular components: 12.94%). Moreover, 4134 predicted genes were successfully assigned to at least one KEGG pathway of the 124 identified, codifying 482 different enzymes.

Tissue-Specific Characterization
After removing genes with low abundances, the universe of expressed genes considered for the transcriptome characterization was 54,627, of which 30,137 genes were co-expressed in all tissues (Figure 1). In addition, 5738 genes were found exclusively expressed in pine nuts, whereas in the other tissues the number of exclusively expressed genes was much lower (needles: 1087; stem bark: 212; xylem: 210; terminal bud: 143; first year pine cone: 74; second year pine cone: 21). By performing the pairwise comparisons of genes expressed per tissue, the pine nut tissue is the one with the lowest number of genes expressed in common with the other tissues. For instance, 5862 genes were co-expressed in all tissues but pine nut tissues. On the other hand, the two pine cone tissues, first year and second year, were the ones with the greatest similarity (the highest Jaccard index). expressed in pine nuts, whereas in the other tissues the number of exclusively expressed genes was much lower (needles: 1087; stem bark: 212; xylem: 210; terminal bud: 143; first year pine cone: 74; second year pine cone: 21). By performing the pairwise comparisons of genes expressed per tissue, the pine nut tissue is the one with the lowest number of genes expressed in common with the other tissues. For instance, 5862 genes were coexpressed in all tissues but pine nut tissues. On the other hand, the two pine cone tissues, first year and second year, were the ones with the greatest similarity (the highest Jaccard index).

Figure 1.
Pairwise comparison per tissue. In brackets are represented the total number of genes expressed per tissue. Within each square is represented the number of genes in common between tissues and below that number, the corresponding Jaccard index is presented. The higher the index value, the more similar the two tissues compared.
In order to understand the enrichment occurrence of overrepresented GO terms in each tissue, the proportion of genes expressed in each tissue was compared with the genes expressed overall among the transcriptome assembly. The analysis showed that 1170 GOs were found to be overrepresented among all tissues (BP: 683; MF: 284; CC: 203). When looking for exclusive overrepresented GO terms per tissue, a total of 409 overrepresented GO terms were found (Table 3). In terms of exclusively overrepresented GOs identified in each tissue, we observed that in pine nut tissue, most of them were related with seed maturation, initiation of transcription, translation and stored nutrient mobilization, cell expansion, root development, and cell division, among others, whereas those in needles were related with In order to understand the enrichment occurrence of overrepresented GO terms in each tissue, the proportion of genes expressed in each tissue was compared with the genes expressed overall among the transcriptome assembly. The analysis showed that 1170 GOs were found to be overrepresented among all tissues (BP: 683; MF: 284; CC: 203). When looking for exclusive overrepresented GO terms per tissue, a total of 409 overrepresented GO terms were found (Table 3). In terms of exclusively overrepresented GOs identified in each tissue, we observed that in pine nut tissue, most of them were related with seed maturation, initiation of transcription, translation and stored nutrient mobilization, cell expansion, root development, and cell division, among others, whereas those in needles were related with photosynthesis and energy metabolism. Additionally, metabolic processes associated with coenzymes and cofactors were related with exclusively overrepresented GOs in stem bark tissue and in terminal buds with the translation of elongation factors, cell structure, and cell wall organization. Finally, catalytic activities were related with exclusively overrepresented GOs in second-year pine cone tissue.
Regarding KEGG pathways, 482 different enzymes were codified by 4057 genes expressed among the whole transcriptome (needles: 3769; stem bark: 3474; terminal bud: 3067; first-year pine: 3114; pine nut: 2909; second-year pine: 2900; xylem: 2400). The most representative KEGG pathways per tissue are represented in Figure 2. Clear differences were observed between needles and stem bark tissues in comparison with the other tissues. Both contained a higher number of expressed genes that were associated directly with energy metabolism, such as glycolysis; gluconeogenesis, and diverse sugar metabolism (galactose, fructose and mannose). The highest difference was observed in both of these tissues against the others in "Glyoxylate and dicarboxylate" and "Carbon fixation in photosynthetic organisms" metabolisms, which are usually more active in photosynthetic tissues. photosynthesis and energy metabolism. Additionally, metabolic processes associated with coenzymes and cofactors were related with exclusively overrepresented GOs in stem bark tissue and in terminal buds with the translation of elongation factors, cell structure, and cell wall organization. Finally, catalytic activities were related with exclusively overrepresented GOs in second-year pine cone tissue.
Regarding KEGG pathways, 482 different enzymes were codified by 4057 genes expressed among the whole transcriptome (needles: 3769; stem bark: 3474; terminal bud: 3067; first-year pine: 3114; pine nut: 2909; second-year pine: 2900; xylem: 2400). The most representative KEGG pathways per tissue are represented in Figure 2. Clear differences were observed between needles and stem bark tissues in comparison with the other tissues. Both contained a higher number of expressed genes that were associated directly with energy metabolism, such as glycolysis; gluconeogenesis, and diverse sugar metabolism (galactose, fructose and mannose). The highest difference was observed in both of these tissues against the others in "Glyoxylate and dicarboxylate" and "Carbon fixation in photosynthetic organisms" metabolisms, which are usually more active in photosynthetic tissues.

Conclusions
This is the first time that a large-scale RNA-seq dataset was generated from seven different tissues from stone pines, providing a complete transcriptome characterization. The data produced in this process will be a useful resource for future studies on the species. The generated transcriptome assembly resulted in a total of 55,328 identified genes, of which 50,161 were functionally annotated. More studies are on-going in order to assess differences in gene expression between tissues in stone pine.

Conclusions
This is the first time that a large-scale RNA-seq dataset was generated from seven different tissues from stone pines, providing a complete transcriptome characterization. The data produced in this process will be a useful resource for future studies on the species. The generated transcriptome assembly resulted in a total of 55,328 identified genes, of which 50,161 were functionally annotated. More studies are on-going in order to assess differences in gene expression between tissues in stone pine.