De Novo Transcriptome Assembly of Cucurbita Pepo L . Leaf Tissue Infested by Aphis Gossypii

Zucchini (Cucurbita pepo L.), extensively cultivated in temperate areas, belongs to the Cucurbitaceae family and it is a species with great economic value. One major threat related to zucchini cultivation is the damage imposed by the cotton/melon aphid Aphis gossypii Glover (Homoptera: Aphididae). We performed RNA-sequencing on cultivar “San Pasquale” leaves, uninfested and infested by A. gossypii, that were collected at three time points (24, 48, and 96 h post infestation). Then, we combined all high-quality reads for de novo assembly of the transcriptome. This resource was primarily established to be used as a reference for gene expression studies in order to investigate the transcriptome reprogramming of zucchini plants following aphid infestation. In addition, raw reads will be valuable for new experiments based on the latest bioinformatic tools and analytical approaches. The assembled transcripts will serve as an important reference for sequence-based studies and for primer design. Both datasets can be used to support/improve the prediction of protein-coding genes in the zucchini genome, which has been recently released into the public domain. Dataset: https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA439198; https://www.ncbi.nlm.nih.gov/ nuccore/1377124260 Dataset License: CC-BY

Cucurbita pepo L. (2n = 2x = 40) belongs to the Cucurbitaceae family; it is widely cultivated in temperate region and ranks among the highest-valued vegetables worldwide [1].Historical records report that C. pepo is native to North America and was dispersed to other continents during the 16th century by transoceanic travels [2].C. pepo is extremely variable in fruit-related features.The edible forms of this species can be grouped into two sub-species: ssp.Pepo, which includes pumpkin, vegetable marrow, cocozelle, and zucchini; and spp.Ovifera, which includes acorn squash, scallop, crookneck, and straightneck.
One major threat related to zucchini cultivation, both in greenhouse and open-field, is the damage imposed by the cotton/melon aphid Aphis gossypii (Homoptera: Aphididae).Aphis gossypii is a Data 2018, 3, 36 2 of 10 cosmopolitan, highly polyphagous species, widely distributed in warm climate regions [3], which can both directly and indirectly affect host plant by inducing stunt growth, leaf curling, and necrosis and by vectoring several plant viruses.Furthermore, indirect damage is related to the deposition on plant tissue surfaces of honeydew, which provides a nutrient source for saprophytic fungi (microorganism growth), resulting in hampered photosynthesis [4].
Given the economic importance and the growing attention on this crop, in the last decade, a large number of genomic resources and tools has been developed to accelerate cucurbit crop improvement.Most of these resources and tools have merged into the Cucurbit Genomics Database (http://cucurbitgenomics.org/).
In 2011, Blanca et al. [5] sequenced C. pepo transcriptome using a 454 GS FLX Titanium platform.Three cDNA libraries (root, leaves, flower tissue) from two C. pepo varieties that differ for plant, flowering, and fruit traits were used to generate a collection of 49,610 unigenes that represents the first sequenced transcriptome of the species.A subset of single nucleotide polymorphism (SNP) markers identified within this transcriptome was then selected to design a custom Illumina GoldenGate genotyping assay used to build the first linkage map of Cucurbita and to identify quantitative trait loci (QTL) [1].
Subsequently, two more C. pepo transcriptomes, namely Acorn squash cv."Sweet REBA" and Pumpkin (C.pepo subsp.ovifera, cv."Big Moose" and "Munchkin"), were de novo assembled from Illumina reads in order to provide further genomic resources within the Cucurbita genus [6,7].Comparative analysis performed on "Big Moose" and "Munchkin" transcriptomes allowed genes with potential roles in fruit size and morphology to be identified, as well as microsatellite markers derived from expressed sequence tags (EST-SSR) to be generated [8].Also, the transcriptome of zucchini cultivar "True French", used as parent in crossing scheme of pathogen-resistant commercial varieties, has been sequenced and assembled as a valuable resource for genetic and genomic studies [9].Lastly, a high-quality draft of the zucchini genome organized into 20 chromosome-scale pseudomolecules was released into the public domain [10].Additionally, 40 transcriptomes of 12 species of the genus were assembled and used as the foundation for comparative genomic studies [10].
Aiming to contribute in this scenario, we performed RNA-sequencing and de novo assembly of the cv."San Pasquale" transcriptome following a compatible interaction with Aphis gossypii.As far as we know, this is the first zucchini transcriptome from leaf tissue challenged by an insect pest.

Illumina Read Processing and Transcriptome Assembly
We performed RNA-sequencing on cultivar "San Pasquale" leaves, uninfested and infested by A. gossypii, that were collected at three time points (24, 48, and 96 h post infestation).The schematic overview of the experimental design is depicted in Figure 1.All samples were subjected to sequencing using an Illumina HiSeq 2500 device in a 2 × 101 paired-end format.The overall process of read pre-processing, transcriptome assembly, and annotation, as well as transcriptome quality evaluation, is outlined in Figure 2. All samples were subjected to sequencing using an Illumina HiSeq 2500 device in a 2 × 101 paired-end format.The overall process of read pre-processing, transcriptome assembly, and annotation, as well as transcriptome quality evaluation, is outlined in Figure 2. All samples were subjected to sequencing using an Illumina HiSeq 2500 device in a 2 × 101 paired-end format.The overall process of read pre-processing, transcriptome assembly, and annotation, as well as transcriptome quality evaluation, is outlined in Figure 2. The sequencing generated ~34 million paired-end reads of 101 nucleotides in length per sample (Table 1).After the pre-processing step (see Methods), about 552.4 million of high-quality reads (average Q score 37.44; min Q score 30; max Q score 39) of 75-101 nucleotides in length (average length 98 nucleotides) were obtained and approximately 4 million reads were filtered out for each sample (Table 1).Then, all high-quality reads were combined for de novo assembly of the transcriptome.The total number of transcripts and major characteristics of the assembled transcriptome are reported in Table 2.

Annotation
The transcriptome was annotated using similarity-based searches against five different databases.A total of 58,945 transcripts (72%) had significant matches with proteins in the Cucumis melo dataset.BLASTx searches against Cucumis sativus and Arabidopsis thaliana protein sequences, as well as against the UniProtKB/SwissProt database, revealed that 43,796 (67%), 56,683 (79%), and 44,378 (68%) transcripts, respectively, had at least one significant match in the corresponding database.Based on these results, approximately 71% of all transcripts had at least one match in one of the four protein databases queried (Figure 1).The BLASTn comparison between the de novo assembled transcriptome with the publically available C. pepo transcriptome resulted in 70,334 (98%) sequences with significant matches.This means that the transcriptome assembly herein described include 1313 novel sequences.
In detail, 548 out of 1313 transcripts had at least one match with one of the four protein databases above.All considered, a total of 36,585 sequences (~51%) matched all five databases queried (Figure 3).
(68%) transcripts, respectively, had at least one significant match in the corresponding database.Based on these results, approximately 71% of all transcripts had at least one match in one of the four protein databases queried (Figure 1).The BLASTn comparison between the de novo assembled transcriptome with the publically available C. pepo transcriptome resulted in 70,334 (98%) sequences with significant matches.This means that the transcriptome assembly herein described include 1313 novel sequences.In detail, 548 out of 1313 transcripts had at least one match with one of the four protein databases above.All considered, a total of 36,585 sequences (~51%) matched all five databases queried (Figure 3).Annotation was refined using the Blast2GO software [11].Gene Ontology (GO) terms were assigned to 51,398 sequences, allowing us to classify C. pepo transcripts in a standard and controlled vocabulary.The number of GO terms per sequence varied between 1 and 74, with an average of seven GO terms per transcript.In total, 276,601 GO terms were retrieved, with 50% assigned to biological process, 27% assigned to molecular function, and 21% assigned to cellular component domain.Enzyme Commission (EC) numbers were associated with 15,304 transcripts out of 51,398 GO annotated sequences, whereas 10,426 sequences were mapped at least onto one Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway.
Finally, BLASTn-based sequence similarity search was performed using Rfam [12] as filtering database.Our dataset includes 1104 (1.5% of the total transcripts) potential non-coding RNAs.Among these transcripts, only 398 do not have a match with any of the queried protein databases.
All the annotations for each transcript are reported in Supplementary Table S1.

Evaluation of Transcriptome Quality and Completeness.
In order to assess the quality of the transcriptome, we performed open reading frame (ORF) prediction using ESTScan [13].The Arabidopsis thaliana training matrix was selected for peptide prediction in the C. pepo transcriptome.The results (Table 3) indicated that 67,534 sequences (about 94% of total transcripts) contain putative coding sequences that could be translated into proteins.Among these, 23,735 transcripts were categorized as complete ORF, containing defined start and stop codons.Additionally, 43,799 transcripts were classified as partial coding sequences.Specifically, 25,000 sequences were classified as "5′ truncated ORF" with clear stop codon and lacking the ATG start codon; 8220 transcripts displayed the initiating ATG codon, but not termination triplet.Annotation was refined using the Blast2GO software [11].Gene Ontology (GO) terms were assigned to 51,398 sequences, allowing us to classify C. pepo transcripts in a standard and controlled vocabulary.The number of GO terms per sequence varied between 1 and 74, with an average of seven GO terms per transcript.In total, 276,601 GO terms were retrieved, with 50% assigned to biological process, 27% assigned to molecular function, and 21% assigned to cellular component domain.Enzyme Commission (EC) numbers were associated with 15,304 transcripts out of 51,398 GO annotated sequences, whereas 10,426 sequences were mapped at least onto one Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway.
Finally, BLASTn-based sequence similarity search was performed using Rfam [12] as filtering database.Our dataset includes 1104 (1.5% of the total transcripts) potential non-coding RNAs.Among these transcripts, only 398 do not have a match with any of the queried protein databases.
All the annotations for each transcript are reported in Supplementary Table S1.

Evaluation of Transcriptome Quality and Completeness
In order to assess the quality of the transcriptome, we performed open reading frame (ORF) prediction using ESTScan [13].The Arabidopsis thaliana training matrix was selected for peptide prediction in the C. pepo transcriptome.The results (Table 3) indicated that 67,534 sequences (about 94% of total transcripts) contain putative coding sequences that could be translated into proteins.Among these, 23,735 transcripts were categorized as complete ORF, containing defined start and stop codons.Additionally, 43,799 transcripts were classified as partial coding sequences.Specifically, 25,000 sequences were classified as "5 truncated ORF" with clear stop codon and lacking the ATG start codon; 8220 transcripts displayed the initiating ATG codon, but not termination triplet.Furthermore, 10,579 sequences encoded for truncated proteins showing neither start nor stop codons.The remaining 4114 sequences (about 6% of all transcripts) were probably un-translated regions (UTRs) with interspersed stop codons or non-coding RNAs.

Table 3.
Results of the open reading frame (ORF) prediction analysis. 1ORF lacking ATG codon but including the stop codon. 2 ORF including ATG codon but lacking the stop codon. 3ORF with neither start nor stop codon.(#: number of).

Items # Sequences
complete ORF 23,735 5 truncated 1  25,000 3 truncated 2 8220 5 and 3 truncated 3  10,579 no good ORF 4114 Total 71,648 Transcriptome completeness was evaluated running the Benchmarking Universal Single-Copy Orthologs tool (BUSCO) [14].The number of complete BUSCOs was 1215 out of 1400 (707 complete and single-copy BUSCOs + 508 complete and duplicated BUSCOs).The number of fragmented BUSCOs was 80, while the number of missing BUSCOs was 145.In summary, the 84.4% of complete BUSCOs were found and this indicates how close to completeness the assembled zucchini transcriptome is.
Finally, transcripts were mapped on the C.pepo reference genome (version 4.1) using the Maker pipeline [15].Exactly 60,632 out of 71,648 (84.62%) transcripts were automatically transferred onto the genome sequence and converted into reliable gene structures.An additional 4735 transcripts (6.6%) were mapped on the genome by Maker and were tagged as "expressed_sequence_match".The remaining 6281 sequences were aligned (via BLASTn) against the C. pepo genome; 5597 of them (7.8% of the total) were successfully mapped (see Methods).In summary, over 90% of the assembled transcripts were mapped back on the C. pepo reference genome with high confidence.

Value of the Data
This resource was primarily established to be used as a reference for gene expression studies in order to investigate the transcriptome reprogramming of zucchini plants after aphid infestation.With this study, we are making available datasets for molecular biology and genetics research in Cucurbita spp.These resources will be of critical importance for the investigation of the molecular mechanisms and signals involved in the zucchini response to aphid infestation.Furthermore, the transcriptome and its functional annotation might be easily compared with the available cucurbit transcriptomes previously generated from the same or different tissues.Finally, both RNA-seq raw reads and the assembled transcripts will be valuable to support/improve the prediction of protein-coding genes in the zucchini genome [10].

Data Records
Raw FASTQ Illumina sequence data (Table 4) have been deposited at the Sequence Read Archive (SRA) under the accession number SRP136062 (PRJNA439198).The dataset includes 18 records (from SRS3072838 to SRS3072855).For each sample, three replicates were sequenced.
This Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GGKS00000000.The version described in this paper is the first version, GGKS01000000.

Biological Material and Experimental Design
Seeds of the aphid-susceptible cultivar "San Pasquale" were obtained from the seed company "La Semiorto Sementi".Zucchini were planted in plastic pots with a diameter of 10 cm and were enclosed in insect-proof cages.Plants were grown in a climatic chamber with a photoperiod of 16/8 h (light/dark), under a temperature of 22 ± 1 • C, with 75 ± 5% of relative humidity.
Aphis gossypii Glover (Homoptera: Aphididae) was isolated from watermelon plants under severe infestation in Terracina (Latina, Central Italy) and reared on "San Pasquale" plants in cages equipped with anti-insect nets (50 mesh).Aphid rearing was maintained in a dedicated climatic chamber under the environmental conditions described above at the Department of Agricultural Sciences of the University of Naples Federico II.Zucchini plants were transferred to a new climatic chamber (temperature: 22 ± 1 • C; relative humidity: 75 ± 5%; photoperiod: L16: D8), and were individually placed in insect-proof cages for the infestation assay.First and second leaves were infested with ten A. gossypii adults.Five aphids per leaf were transferred onto the adaxial surface with a paintbrush and their number was daily monitored.Control plants, individually enclosed in insect-proof cages, were grown under the same conditions.Aphids were left to feed for 24, 48, and 96 h, after which they were manually removed using a fine paintbrush.Leaf tissue was sampled and immediately frozen in liquid nitrogen.At the same time points, leaf tissue from aphid-free control plants was sampled.Three biological replicates for both infested and control plants were collected per time point and leaves of a single replicate were pooled for downstream analysis (Figure 1).

RNA Extraction, Library Construction and Sequencing
Total RNA was extracted from 100 mg of tissue previously ground in liquid nitrogen using the RNeasy Mini kit (Qiagen, Hilden, Germany), according to manufacturer's instructions.Next generation sequencing was performed by Genomix4life srl.(Baronissi, Salerno, Italy).Indexed libraries were prepared from 2 µg of RNA with the TruSeq Stranded mRNA Sample Prep Kit (Illumina, San Diego, CA, USA) following manufacturer's instructions.Libraries had insert sizes of 125 bp.Libraries were quantified using the RNA Bioanalyzer 2100 Plant Nano chip (Agilent Technologies, Santa Clara, CA, USA) and pooled to a final concentration of 2 nM such that each index-tagged sample was present in

Figure 1 .
Figure 1.Schematic overview of the experimental design.Three biological replicates for both infested and control plants were collected at 24, 48, and 96 h post infestation and leaves of a single replicate were pooled for downstream analysis.

Figure 1 .
Figure 1.Schematic overview of the experimental design.Three biological replicates for both infested and control plants were collected at 24, 48, and 96 h post infestation and leaves of a single replicate were pooled for downstream analysis.

Data 2018, 3 , 10 Figure 1 .
Figure 1.Schematic overview of the experimental design.Three biological replicates for both infested and control plants were collected at 24, 48, and 96 h post infestation and leaves of a single replicate were pooled for downstream analysis.

Figure 2 .
Figure 2. Data processing workflow.ORF-Open Reading Frame; BUSCO-Benchmarking Universal Single-Copy Orthologs; EST-Expressed Sequence Tags; BLAST-Basic Local Alignment Search Tool; CAP-Contig Assembly Program; GO-Gene Ontology.

Figure 3 .
Figure 3. Venn diagram showing the BLAST results of C. pepo transcriptome against five databases.

Figure 3 .
Figure 3. Venn diagram showing the BLAST results of C. pepo transcriptome against five databases.

Table 1 .
Number of reads generated from sequencing (raw data) and after quality filtering and adapter trimming (high quality data) for each sample.

Table 2 .
Statistics on the de novo assembled C. pepo transcriptome.

Table 4 .
Description of samples submitted to the NCBI Sequence Read Archive (SRA).