De novo Transcriptome Assembly and Comprehensive Annotation of Two Tree Tomato Cultivars (Solanum betaceum Cav.) with Different Fruit Color

The tree tomato (Solanum betaceum Cav.) is an underutilized fruit crop native to the Andean region and phylogenetically related to the tomato and potato. Tree tomato fruits have a high amount of nutrients and bioactive compounds. However, so far there are no studies at the genome or transcriptome level for this species. We performed a de novo assembly and transcriptome annotation for purple-fruited (A21) and an orange-fruited (A23) accessions. A total of 174,252 (A21) and 194,417 (A23) transcripts were assembled with an average length of 851 and 849 bp. A total of 34,636 (A21) and 36,224 (A23) transcripts showed a significant similarity to known proteins. Among the annotated unigenes, 22,096 (A21) and 23,095 (A23) were assigned to the Gene Ontology (GO) term and 14,035 (A21) and 14,540 (A23) were found to have Clusters of Orthologous Group (COG) term classifications. Furthermore, 22,096 (A21) and 23,095 (A23) transcripts were assigned to 155 and 161 (A23) KEGG pathways. The carotenoid biosynthetic process GO terms were significantly enriched in the purple-fruited accession A21. Finally, 68,647 intraspecific single-nucleotide variations (SNVs) and almost 2 million interspecific SNVs were identified. The results of this study provide a wealth of genomic data for the genetic improvement of the tree tomato.


Introduction
The tree tomato or tamarillo (Solanum betaceum Cav.) is a Solanaceae crop native to the Andean region [1,2]. The tree tomato is phylogenetically related to the potato (S. tuberosum L.) and tomato (S. lycopersicum L.), forming part of the same clade [3]. The tree tomato plant develops into a small tree, even though some cultivars can grow up to four meters in height, with a fast-growing, shallow root system and simultaneous reproductive and vegetative development [4]. In recent years, the tree tomato has caught the attention of growers and the industry due to its attractive, fleshy, edible fruits, which can be consumed either in salads or as a dessert fruit, or processed for making jams, yogurts, juices, or alcoholic beverages, among others [5]. It has developed from being a neglected crop, with a local interest in subsistence farms [6], into a promising fruit crop, having been introduced in several countries of Oceania, Southeast Asia, Europe and Africa [7]. Aside from South American countries, New Zealand is the largest producer and exporter of the tree tomato, where the marketable word, "tamarillo", was coined from the Maori term "tama", meaning

Plant Material
The study was carried out in 2019 at the the Universitat Politècnica de València (UPV). A purple-fruited tree tomato accession (A21, with purple epicarp and mean fruit weight of 108.8 g) and an orange-fruited tree tomato accession (A23, with orange epicarp a mean fruit weight of 75.1 g) [6] (Figure 1), obtained from the UPV germplasm bank, were used for the present study. Seeds from each accession were germinated following the protocol of Ranil et al. (2015) [15]. Subsequently, the plants were grown in a greenhouse at UPV, Spain (GPS coordinates: latitude, 39 • 28 55 N; longitude, 0 • 20 11 W; 7 m above sea level). From each accession, tissues were sampled from several young leaves and flower buds and pools were made for each tissue and accession. Unfortunately, the two accessions did not set fruit under greenhouse conditions at our latitude, and thus fruit tissues were not used for the transcriptome assembly. All samples collected were immediately frozen in liquid nitrogen and stored at −80 • C for later use.

RNA Extraction, Library Construction and RNA Sequencing
Total RNA was isolated from each tissue using the Mini spin kit (Macherey-Nage, Dueren, Germany). RNA integrity was determined by 1.0% (w/v) agarose gel electrophoresis and RNA quantification was performed by Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). From each accession, tissues were sampled from several young leaves and flower buds and pools were made for each tissue and accession. A total of 2 µg of RNA for each pool was sent to Novogene (Cambridge, UK) for library preparation and sequencing. The cDNA paired-end libraries of 150 bp (250~300 bp insert size) libraries were constructed according to Illumina's instructions. The mRNA of each sample was purified from the total RNA by using Sera-mag Magnetic Oligo (dT), then fragmented into short fragments using the fragmentation buffer. Using these fragments as templates, the first strand of cDNA was synthesized. The second strand of cDNA was synthesized using the buffer containing dNTPs, RNase H, and DNA polymerase I. Short fragments (200 ± 20 bp) were connected to the sequencing adapters and suitable fragments were excised from an agarose gel using a gel extraction kit. Then, the library was sequenced using the Illumina Hiseq-2000 sequencer. The raw reads data are available at NCBI Sequence Read Archive (SRA) with accession number SRR15258852 (A21) and SRR15258851 (A23), within the bioproject number PRJNA749599, available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA749599 http://www.ncbi.nlm.nih.gov.

RNA Extraction, Library Construction and RNA Sequencing
Total RNA was isolated from each tissue using the Mini spin kit (Macherey-Nage, Dueren, Germany). RNA integrity was determined by 1.0% (w/v) agarose gel electrophoresis and RNA quantification was performed by Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). From each accession, tissues were sampled from several young leaves and flower buds and pools were made for each tissue and accession. A total of 2 μg of RNA for each pool was sent to Novogene (Cambridge, UK) for library preparation and sequencing. The cDNA paired-end libraries of 150 bp (250~300 bp insert size) libraries were constructed according to Illumina's instructions. The mRNA of each sample was purified from the total RNA by using Sera-mag Magnetic Oligo (dT), then fragmented into short fragments using the fragmentation buffer. Using these fragments as templates, the first strand of cDNA was synthesized. The second strand of cDNA was synthesized using the buffer containing dNTPs, RNase H, and DNA polymerase I. Short fragments (200 ± 20 bp) were connected to the sequencing adapters and suitable fragments were excised from an agarose gel using a gel extraction kit. Then, the library was sequenced using the Illumina Hiseq-2000 sequencer. The raw reads data are available at NCBI Sequence Read Archive (SRA) with accession number SRR15258852 (A21) and SRR15258851 (A23), within the bioproject number PRJNA749599, available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA749599 http://www.ncbi.nlm.nih.gov.

DNA Sequence Processing and De Novo Transcriptome Assembly
The quality of reads was assessed using FastQC v0.11.8 [16]. The adapter sequences, low-quality reads (Phred score <30) and reads with an average length of less than 135 bp were trimmed using Trimmomatic v0.36 [17]. The two accessions were assembled separately using Trinity software v2.10 [18] with a default k-mer size of 25. Identical or near-identical contigs were clustered into a single contig by CD-HIT-EST tool v 4.8.1 [19] with an identity of more than 80%. The quality and completeness of the assemblies were first evaluated with Bowtie2 v2.3.2 [20] for assessing the number of paired-end reads that were present in the assembled transcripts, then the Ex90N50 transcript contig length (the contig N50 value based on the set of transcripts representing 90% of the expression data) was computed using contig ExN50 statistic.pl script bundled with Trinity. Finally, the completeness of the assemblies was evaluated using BUSCO v4.1.1 [21,22] using a set of eukaryotic genes as a database (https://busco-data.ezlab.org/v5/data/lineages/ eukaryota_odb10.2020-09-10.tar.gz) (accessed on 10 August 2020).

Transcriptome Sequencing and Assembly
The RNA sequencing of the two tree tomato accessions yielded 100,919,310 (14.68 Gb) and 113,802,281 (15.84 Gb) raw paired-end reads for A21 and A23, respectively (Table 1). After the initial trimming and stringent quality filtering to remove adapters and low-quality data, 38,411,167 (4.25 Gb) clean paired-end reads were obtained for A21 and 54,474,055 (5.97 Gb) for A23 ( Table 1). The two cohorts of clean reads were assembled independently into transcriptomes using Trinity. For the A21 accession, the assembled transcriptome consisted of 174,252 transcripts and spanned 148,352,996 bp, with an average transcript length of 851.37 bp ( Table 1). The N50 value was 1494 bp and the GC content of 38.8% (Table 1). On the other hand, the A23 accession was assembled in 194,417 transcripts with a total length of 165,074,290 bp and an average length of 849.07 bp ( Table 1). The N50 value for the latter was 1503 bp and the GC content of 38.6% ( Table 1). The assembled sequence lengths ranged from the 200 bp cut-off value to a maximum transcript length of 17,046 bp for A21 and 16,865 bp for A23 ( Table 1). The majority of the assembled sequences were in the ranges of 200 bp to 500 bp and 501 to 1000 bp.
To evaluate the quality of the assemblies, the clean reads were mapped back to the final assembled transcriptome. The overall alignment rates using the alignment software Bowtie2 were 99.09% for A21 and 99.21% for A23 (Table 1). BUSCO was employed to evaluate the accuracy and completeness of our transcriptome assembly, gene set, and transcripts. When comparing the set of genes with the genome, we found that the proportion of complete BUSCO was 98.4% for A21 and 98.8% for A23, which indicated that the integrity of the whole transcriptome was very good (Table 1).

Structural and Functional Annotation
TransDecoder software was used to identify the open reading frames (ORFs) of the unitranscripts assembled and their associated functions, predicting 27,441 ORFs and 34,636 potential proteins for the A21 and 28,336 ORFs and 36,224 potential proteins for A23 ( Table 2). Subsequently, the unique transcripts and the putative proteins identified were annotated by performing Blast searches against several databases using the Trinotate pipeline. A total of 57,422 (33%) and 60,772 unigenes (31.3%) displayed a significant homology when Blastx was performed and 24,311 (14.0%) and 25,054 protein sequences (12.3%) when Blastp searches were performed against the UniProtKB/Swiss-Prot database (cut-off E-value of 1e-3) for A21 and A23, respectively (Table 2). Furthermore, 22,954 and 23,637 unique Pfam protein motifs, 1623 and 1,745 protein sequences with signal peptides (SignalP), and 6899 and 7216 transcripts with at least one transmembrane domain (TmHMM) were predicted for A21 and A23, respectively ( Table 2). The species distribution showed that most sequences exhibited a high similarity mainly to those of Arabidopsis thaliana (L.) Heynh. ( unique Pfam protein motifs, 1623 and 1,745 protein sequences with signal peptides (Sig-nalP), and 6899 and 7216 transcripts with at least one transmembrane domain (TmHMM) were predicted for A21 and A23, respectively ( Table 2). The species distribution showed that most sequences exhibited a high similarity mainly to those of Arabidopsis thaliana (L.) Heynh. (  GO-based functional classification for A21 and A23 transcriptomes assemblies retrieved a total of 196,800 GO terms for A21 and 204,090 for A23 from 22,096 and 23,095 transcripts, respectively ( Table 2). The largest number of GO terms (75.2%) was annotated in sequences with a length between 100 and 500 bp ( Figure 3). Both assemblies had a similar GO distribution for each category; four to nine terms in biological process (BP), three to nine in molecular function (MF) and four to eight in cellular components (CC) category ( Figure 4). The GO levels that ranged between 5 and GO-based functional classification for A21 and A23 transcriptomes assemblies retrieved a total of 196,800 GO terms for A21 and 204,090 for A23 from 22,096 and 23,095 transcripts, respectively ( Table 2). The largest number of GO terms (75.2%) was annotated in sequences with a length between 100 and 500 bp ( Figure 3).
unique Pfam protein motifs, 1623 and 1,745 protein sequences with signal peptides (Sig-nalP), and 6899 and 7216 transcripts with at least one transmembrane domain (TmHMM) were predicted for A21 and A23, respectively ( Table 2). The species distribution showed that most sequences exhibited a high similarity mainly to those of Arabidopsis thaliana (L.) Heynh. (  GO-based functional classification for A21 and A23 transcriptomes assemblies retrieved a total of 196,800 GO terms for A21 and 204,090 for A23 from 22,096 and 23,095 transcripts, respectively ( Table 2). The largest number of GO terms (75.2%) was annotated in sequences with a length between 100 and 500 bp ( Figure 3). Both assemblies had a similar GO distribution for each category; four to nine terms in biological process (BP), three to nine in molecular function (MF) and four to eight in cellular components (CC) category ( Figure 4). The GO levels that ranged between 5 and Both assemblies had a similar GO distribution for each category; four to nine terms in biological process (BP), three to nine in molecular function (MF) and four to eight in cellular components (CC) category ( Figure 4). The GO levels that ranged between 5 and 15, were 88.9% for biological processes, 69.8% for molecular function and 88.2% for cellular components, indicating that the precision of the annotation was accurate (Figure 4) and that a broad diversity of genes was sampled in our transcriptomes.

COG Classification
A Cluster Orthologous Group (COG) is defined as a cluster of three or more homologous sequences that diverge from the same speciation event. Orthologous groups were functionally annotated using the EggNog (evolutionary genealogy of genes: Nonsupervised Orthologous Groups) database. In total 97,437 for the A21 and 99,471 for the A23 GO were assigned to 14,530 and 14,928 unique sequences, respectively (Figure 8). The largest group is represented by the cluster for cellular processes and signaling (CPS) (6311; 21.4% and 6443; 21%), followed by metabolism (MB) (6052; 20.5% and 6,417; 20.9%), information storage and processing (ISP) (6040; 20.4% and 6,396; 20.9%) (Figure 8). Within the CPS category, the largest proportion was assigned to signal transduction mechanisms (T) (2359 for A21 and 2392 for A23) and post-translational modification, protein turnover, and chaperones (O) (2051 and 2104). Within the MB category, the largest proportion was assigned to amino acid transport and metabolism (E) (1232 and 1,342), and carbohydrate transport and metabolism (G) (1249 and 1314), and within the ISP category, the majority were assigned to replication, recombination, repair (L) (2043 and 2254), and transcription (K) (1997 and 2043) (Figure 8). Horticulturae 2021, 7, x FOR PEER REVIEW 9 of 19 Several candidate regulatory genes of the carotenoid biosynthetic pathway were identified in the assembled transcriptomes from S. lycopersicum and A. thaliana. The protein query sequences used for mining the transcriptomic data were the S. lycopersicum prolycopene isomerase (CRTISO), 9-cis-epoxycarotenoid dioxygenase (NCED1), lycopene epsilon cyclase (Lcy-e), neoxanthin synthase (NSY) and the A. thaliana protein ORANGE (OR) (Supplementary Table S2). All of them were found to be expressed in both cultivars, where CRTISO and Lcy-e homologues exhibited a high identity (96%), followed by NCED1 with 95%, NSY with 93%, and finally, OR, which showed a higher identity in A23 (74%) than A21 (71%) (Supplementary Table S2).

COG Classification
A Cluster Orthologous Group (COG) is defined as a cluster of three or more homologous sequences that diverge from the same speciation event. Orthologous groups were functionally annotated using the EggNog (evolutionary genealogy of genes: Non-supervised Orthologous Groups) database. In total 97,437 for the A21 and 99,471 for the A23 GO were assigned to 14,530 and 14,928 unique sequences, respectively (Figure 8). The largest group is represented by the cluster for cellular processes and signaling (CPS) (6311; 21.4% and 6443; 21%), followed by metabolism (MB) (6052; 20.5% and 6,417; 20.9%), information storage and processing (ISP) (6040; 20.4% and 6,396; 20.9%) (Figure 8). Within the CPS category, the largest proportion was assigned to signal transduction mechanisms (T) (2359 for A21 and 2392 for A23) and post-translational modification, protein turnover, and chaperones (O) (2051 and 2104). Within the MB category, the largest proportion was assigned to amino acid transport and metabolism (E) (1232 and 1,342), and carbohydrate transport and metabolism (G) (1249 and 1314), and within the ISP category, the majority were assigned to replication, recombination, repair (L) (2043 and 2254), and transcription (K) (1997 and 2043) (Figure 8).
We further analyzed the sequences of the candidate genes that played an important role in the carotenoids biosynthesis, identifying a total of 1548 SNVs in the two cultivars assessed when compared to the tomato reference genome (Table 5). Of them, 478 SNPs were found in the coding region of the prolycopene isomerase (CRTISO) gene, 372 in 9-cis-epoxycarotenoid dioxygenase (NCED1), 194 in lycopene epsilon cyclase (Lcy-e), 164 in neoxanthin synthase (NSY), and 340 in protein ORANGE (OR) ( Table 3). The impact of the majority of variants (42.2%) was classified as a "modifier", 31.6% as "low", 17.9% as "moderate" and 1.9% as "high". Regarding the effects on protein function, on average, 51% of the variants were synonymous mutations, while the remaining variants were missense mutations.

Discussion
Although tree tomato is one of the most promising fruit crops in the Mediterranean and temperate regions [4], its genomic landscape has not yet been explored yet. Other unexploited crops similar to the tree tomato, such as the cape gooseberry (Physalis peruviana L.) and amaranth (Amaranthus cruentus L.), have greatly benefited from genomic studies, which have fostered the dissection of multiple agronomic traits and breeding programs [35][36][37]. In this study, we conducted the de novo transcriptome assembly of two tree tomato cultivars to provide useful genomic data for the improvement of this unexploited but emerging crop. Through RNA sequencing, a total of 174,252 (for A21) and 194,417 (for A23) transcripts were assembled from 38 and 54 million filtered reads and with an average length of 851 and 849 bp, respectively. The number of transcripts of these accessions was slightly higher than those obtained in previous transcriptome studies in other related Solanaceae species such as tomato, potato or pepino (Solanum muricatum Aiton) [14,38,39], but it was similar to others obtained in plant species of the same family, such as S. commersonii Dunal and S. aculeatissimum [40,41], suggesting the high quality and reliability of our assemblies. Furthermore, the assembly and annotation completeness was quantitatively confirmed by the high percentage values (>98%) of the BUSCO assessment, values that were comparable or even higher than those of other recent Solanum transcriptomes, which exhibited values of 97% for S. tuberosum and 93% for S. chilense [42,43].
The functional annotation of the assembled unigenes is essential for understanding the role of the represented genes [44]. Even though the number of protein-coding genes is unknown in tree tomato, the prediction of the potential ORFs (27,441 in A21 and 28,336 in A23) and proteins (34,636 in A21 and 36,224 in A23) was in agreement with those observed for protein-coding genes in other Solanum species, such as tomato (35,535), potato (39,290), eggplant (S. melongena L.) (30,630 and 34,231) [45][46][47]. Similarly, signal peptides, transmembrane and Pfam domains were assigned to around 5%, 20%, and 65% of the identified proteins, respectively. These percentages were higher than those obtained in other plant species of the Solanaceae family such as S. trilobatum and S. sisymbriifolium [48,49]. The GO annotation revealed that unigenes could be categorized into three major functional categories: biological processes (68%), molecular functions (18%) and cellular components (12%). The top two subcategories were the cellular and metabolic processes in the biological processes, binding, and catalytic activity of the molecular function; and the cellular anatomical entity and protein-containing complex in the cellular component, which suggests that many novel genes involved in metabolic activities could play important roles during the growth and development stages of the plant.
The KEGG annotation allows for the functional analysis and interpretation of transcriptomic data and exhibits how the assembled transcripts are integrated into metabolic pathways and biological systems [50]. A total of 155 pathways in A21 and 161 in A23 involving 14,035 and 14,540 unigenes were annotated, including pathways of great interest that could be used to improve the quality of breeding programs for the breeding of tree tomato such as purine metabolism, drug metabolism, terpenoid backbone biosynthesis, and the biosynthesis of flavonoid and carotenoids. The increased accumulation of flavonoids and carotenoids in fruit crops improves their commercial and health values [51]. Among the biological features, the most renowned property of flavonoids and carotenoids is their antioxidant effects, which are often much higher than those of vitamin E and vitamin C [52,53]. Our transcriptional results confirmed the presence of known genes and enzymes in pathways related to the synthesis of flavonoids and carotenoids. These results are in agreement with previous studies that reported the tree tomato as an abundant source of carotenoids, anthocyanins, flavonoids, and phenolic compounds and has higher antioxidant activity than other antioxidant-rich fruits such as kiwifruit or grape [7]. The carotenoid concentration in tree tomato may be under the control of several genes that are associated with the structure and function of the genes in the carotenoid pathway. In the accession A21, our data showed that the carotenoid biosynthetic process GO terms were significantly enriched. This is in agreement with the previous results of [54,55] who reported that the purple cultivar had higher levels of carotenoids compared to the yellow or orange cultivars. Our results suggested that the flavonoid and carotenoid biosynthesis pathway-related genes were well conserved in the tree tomato when compared with the tomato [56]. The sequence variants in these genes among tree tomato varieties could be used as functional markers for marker-assisted breeding to obtain new varieties of tree tomato with improved nutritional values.
We obtained a total of 68,647 SNVs between both accessions, suggesting a high level of polymorphisms for tree tomato. The SNVs reported here were higher than the cohorts identified in other transcriptomic studies of Solanaceae, such as the 17,000 SNVs found in tomato [39]; however, in the case of potato a similar number of SNPs, 69,011, were reported [57]. The A21 accession exhibited a higher number of SNVs and, interestingly, the vast majority of the detected variants were heterozygous. The latter might be due to the fact that, even though some tree tomato cultivars are considered self-compatible and autogamous, the flowers are frequently visited by pollinator insects, which can lead to cross-pollination [4]. The annotated SNP effects located in the exon and intergenic regions and a transition to transversion ratio of 1.86 agree with previous findings in tomato and eggplant [39,58]. On the other hand, the number of interspecific variants detected with potato was significantly higher than those of tomato, confirming that tree tomato is phylogenetically closer to the latter [59]. The data also indicated that the differences in variant number between the tree tomato and its closely related species were evident, particularly for chromosomes 1 and 12, which were highly related to the physical length between them. Regarding the SVNs found in the candidate genes involved in the carotenoids biosynthesis pathway, our results showed that the CRTISO gene exhibited the highest number of SNPs, which could be due to mutations in its coding sequence. The carotenoid analysis in tomato ripe fruits showed that a mutation in CRTISO leads to a prolycopene accumulation instead of all-trans-lycopene compounds, resulting in a fruit color change from red to orange [60]. In addition, most of the SNVs within the genes involved in carotenoid metabolism resulted in synonymous substitutions. These results were consistent with previous studies in tomato [61] where protein expression and protein folding may be influenced by synonymous SNPs as they are involved in regulating microRNA-mediated genes [62,63]. Hence, the synonymous SNPs identified in the tree tomato cultivars in this work may have potential functional significance in carotenoid biosynthesis.
The identification of the intraspecific and interspecific variants will foster several applications including genetic mapping, genotype identification, marker-assisted selection, breeding, comparative genomics, and understanding the genetic control of adaptive traits in the tree tomato [64].

Conclusions
In this work, we assembled high-quality transcriptome sequences of two tree tomato cultivars, a fruit crop closely related to tomato and potato, with great potential in subtropical regions. The comprehensive annotation provided extensive and detailed information that facilitates the dissection of traits of agronomic interest, such as the content in bioactive compounds or the response to stresses, among others. In addition, this is the first study in tree tomato where a high number of polymorphisms have been identified, both intraspecifically and with closely related species that could be used in genetic diversity analysis, qualitative and quantitative trait mapping, and breeding programs in tree tomato. This information constitutes a valuable resource for tree tomato breeding programs and genetic diversity studies and will help in the enhancement of tree tomato and its successful introduction in other regions and countries.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/horticulturae7110431/s1', Table S1: GO enrichment analysis in A21 and A23, Table S2: Identification of regulatory genes of the carotenoid biosynthetic pathway in A21 and A23 from Solanum lycopersicum and Arabidopsis thaliana,  Data Availability Statement: The raw reads data are available at NCBI Sequence Read Archive (SRA) with accession number SRR15258852 (A21) and SRR15258851 (A23), within the bioproject number PRJNA749599, available at http://www.ncbi.nlm.nih.gov. VCF files are available upon request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.