A Revised and Improved Version of the Northern Wheatear (Oenanthe oenanthe) Transcriptome

This work presents an updated and more complete version of the transcriptome of a long-distance migrant, the Northern Wheatear (Oenanthe oenanthe). The improved transcriptome was produced from the independent mRNA sequencing of adipose tissue, brain, intestines, liver, skin, and muscle tissues sampled during the autumnal migratory season. This new transcriptome has better sequencing coverage and is more representative of the species' migratory phenotype. We assembled 20,248 transcripts grouped into 16,430 genes, of which 78% were successfully annotated. All the standard assembly quality parameters were improved in the second transcriptome version.


Introduction
The transcriptome represents a very dynamic part of the cellular machinery. The gene-specific mRNA lifetime is strictly controlled because it is one of the first levels in the regulation of cellular and developmental processes [1,2]. Therefore, the transcriptome-level analysis of a tissue is a proxy for the biological functions it performs and for the phenotype of an organism [3,4].
Sequencing the transcriptome of non-model species is a challenge but is of special value. Species-specific sequence information can be used a priori for further functional studies and the understanding of specific phenotypes such as adaptations for migration in birds [5,6].
For molecular ecology and conservation studies, transcriptomic information can be used to infer population diversity and resilience [7][8][9][10] and not just mere genetic polymorphisms [11,12]. Furthermore, the novel assembly of transcriptomes opens the door for orthologous comparisons of biological functions among distinct species, with implications for discovering novel approaches to treat, e.g., metabolic diseases in humans [13,14].

Bird Migration
The study of the birds' migratory phenotype is of special interest since it shows the unique conjunction of a healthy super-fat phenotype and a simultaneous increase in physical activity [15,16], rarely found among other vertebrates. Thus, migratory birds have evolved to efficiently manage lipid metabolism and to counteract obesity-related diseases [17][18][19][20][21].
RNA-Seq techniques are powerful tools for the study of complex traits, and the assembly of a high-quality reference transcriptome is a necessary first step to guarantee reliable results [22].
As a part of the integrative study of the migratory phenotype of the Northern Wheatear (Oenanthe oenanthe), we previously published a preliminary and partial transcriptome [23]. In this work, we present an improved version of that transcriptome, which was obtained

Sequencing and Transcriptome Assembly
RNA samples (28S/18S ratio ≥ 1.7; RIN ≥ 8; A260/A280 ≥ 1.8) from three individuals were pooled together by tissue (brain, intestines, liver, adipose tissue, muscle, or skin) and body mass condition and sent to GATC-Biotech AG (Konstanz, Germany) for sequencing. Samples were sequenced on an Illumina HiSeq2000 device (sequence mode: 50 bp single-end reads). GATC-Biotech conducted cDNA library preparation following the company's standard protocols. Poly(A)+ RNA was isolated from the total RNA sample, and first-strand cDNA synthesis was primed with an N6 randomized primer. Adapters were then ligated to the 5′ and 3′ ends of the cDNA, which was finally amplified by PCR.
The reads were trimmed using Trimmomatic v1 [27]: homopolymers (60% over the entire length of the read represented by one nucleotide), primers, adapters, and ambiguous residues (Ns) were trimmed from both sides of the sequence. Base pairs with a Phred score of less than 20 were removed from the ends. After this step, reads with a length of <36 bases were discarded. SAMtools version 1.3.1 [28] was employed to generate the mapping statistics and filter out unmapped reads.
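The trimming rules described above can be illustrated with a minimal sketch. It is not the Trimmomatic implementation; the function names are hypothetical, quality values are assumed to be already decoded into integer Phred scores, and dropping reads that are still homopolymer-dominated after trimming is an assumption about how the 60% rule was applied.

```python
def is_homopolymer(seq, max_fraction=0.6):
    """Return True if one nucleotide makes up at least max_fraction of the read."""
    if not seq:
        return False
    most_common = max(seq.count(base) for base in "ACGT")
    return most_common / len(seq) >= max_fraction


def trim_read(seq, quals, min_phred=20, min_length=36):
    """Trim Ns and low-quality bases from both ends of a read.

    Returns (sequence, qualities) or None when the read is discarded.
    Discarding homopolymer-dominated reads is an assumption of this sketch.
    """
    start, end = 0, len(seq)
    # Walk inwards from the 5' end while the base is ambiguous or low quality.
    while start < end and (seq[start] == "N" or quals[start] < min_phred):
        start += 1
    # Walk inwards from the 3' end in the same way.
    while end > start and (seq[end - 1] == "N" or quals[end - 1] < min_phred):
        end -= 1
    trimmed_seq, trimmed_quals = seq[start:end], quals[start:end]
    if len(trimmed_seq) < min_length or is_homopolymer(trimmed_seq):
        return None
    return trimmed_seq, trimmed_quals
```

The thresholds (Phred 20, 36 bases, 60%) are taken from the text; everything else is illustrative.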

De Novo Assembly and Post-Processing
The short reads from all tissues and conditions (n = 18) were pooled together to assemble de novo the transcriptome of O. oenanthe using Trinity, version 2.6.5 [29], which was run under the following conditions: in silico normalization of reads (normalize_reads and normalize_max_read_cov 50) and single-end FASTQ data input.
The new transcriptome was merged in a FASTA file with the previous version [23], which had been sequenced using the Genome Sequencer Roche GS FLX System, generating 400 bp reads. The redundant transcripts were then removed using CD-Hit-EST [30] by merging transcripts with a sequence similarity of 0.9. Lowly expressed transcripts that accounted for at most 10% of the overall expression (cumulative sum of reads by transcript) were discarded. In this way, we obtained a new, compact, and non-redundant transcriptome for O. oenanthe.
To estimate the completeness of the transcriptome, the conserved ortholog gene content was assessed using the Benchmarking Universal Single-Copy Orthologs (BUSCO) tool, version 3.0.2 [31]. The analysis was conducted against the available avian information in the database aves_odb9. The standard transcript quality-control length parameters were also computed. To assess the quality and summarize the key standard parameters of the transcriptome assembly, we used the TransRate tool [32].
To check the quality of the assembly, the reads were remapped by tissue to the new transcriptome using the MEM algorithm implemented in BWA version 0.7.12 [33], while flagging potential chimeras as supplementary reads.
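One possible reading of the low-expression filter is sketched below: transcripts are ranked from the least to the most expressed, and those whose running sum of reads stays within 10% of the total are dropped. The function name, the input dictionary, and this exact interpretation of the cumulative-sum rule are assumptions made for illustration.

```python
def drop_low_expressed(read_counts, max_fraction=0.10):
    """Keep transcripts once the cumulative read sum exceeds max_fraction of the total.

    read_counts maps a transcript name to its number of mapped reads
    (hypothetical input; the counting step itself is not shown).
    """
    total = sum(read_counts.values())
    cutoff = max_fraction * total
    kept = {}
    cumulative = 0
    # Visit transcripts from the lowest to the highest read count.
    for name, count in sorted(read_counts.items(), key=lambda item: item[1]):
        cumulative += count
        if cumulative > cutoff:
            kept[name] = count
    return kept
```

With this reading, many weakly supported transcripts can be removed while at least 90% of the mapped reads remain assigned to a retained transcript.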
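The text does not list which length parameters were computed; the sketch below assumes the usual ones (number of transcripts, total assembled bases, mean length, and N50) and shows how they can be derived from a list of transcript lengths. The function name and the returned keys are hypothetical.

```python
def length_stats(lengths):
    """Summarize transcript lengths with commonly reported assembly metrics."""
    if not lengths:
        raise ValueError("at least one transcript length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    # N50: length of the transcript at which the running sum of lengths,
    # taken from the longest downwards, first reaches half of all bases.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "n_transcripts": len(ordered),
        "total_bases": total,
        "mean_length": total / len(ordered),
        "n50": n50,
    }
```

A higher N50 at a comparable or larger total number of bases is generally read as a less fragmented assembly, which is the kind of comparison made between the two transcriptome versions.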
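As an illustration of how remapping results can be summarized, the following sketch counts mapped reads and supplementary alignments directly from SAM records using the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name and the decision to report supplementary alignments separately from the read count are assumptions of this sketch, not a description of the pipeline actually used.

```python
UNMAPPED = 0x4         # read has no alignment
SECONDARY = 0x100      # alternative alignment of a read already counted
SUPPLEMENTARY = 0x800  # part of a split (potentially chimeric) alignment


def mapping_summary(sam_lines):
    """Count reads, mapped reads, and supplementary alignments in SAM text lines."""
    reads = mapped = supplementary = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t")[1])
        if flag & SUPPLEMENTARY:
            supplementary += 1
            continue
        if flag & SECONDARY:
            continue
        reads += 1
        if not flag & UNMAPPED:
            mapped += 1
    percent = 100.0 * mapped / reads if reads else 0.0
    return {
        "reads": reads,
        "mapped": mapped,
        "percent_mapped": percent,
        "supplementary": supplementary,
    }
```

Counting only primary records in the denominator keeps each read from being counted more than once when it has split alignments.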

Annotation
Transcripts were annotated to the coding sequences of Ficedula albicollis (release 96, downloaded from ftp://ftp.ensembl.org/pub/release-96/fasta/ficedula_albicollis/cds/; accessed on 15 May 2019) and the UniProtKB/Swiss-Prot 2019_05 databases using Blastn and Blastx, respectively. For the alignment, an E-value threshold of 0.005 was set, and the best hit (maximum bitscore) from either of the two sources of annotation was selected for naming the transcripts.
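The best-hit rule can be sketched as follows, assuming the Blastn and Blastx results have already been parsed into simple tuples. The tuple layout, the function name, and the source labels are hypothetical; only the E-value threshold and the maximum-bitscore criterion come from the text.

```python
def best_hits(hits, max_evalue=0.005):
    """Select, per transcript, the hit with the highest bitscore across both sources.

    hits is an iterable of (transcript, source, subject, evalue, bitscore)
    tuples, where source labels the database searched (hypothetical labels).
    """
    best = {}
    for transcript, source, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # hit does not pass the E-value threshold
        current = best.get(transcript)
        # Ties keep the hit seen first; this tie-break is an assumption.
        if current is None or bitscore > current[2]:
            best[transcript] = (source, subject, bitscore)
    return best
```

Transcripts with no hit passing the threshold are simply absent from the result, which corresponds to the unannotated category.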
Based on the annotation, we summarized all transcripts with a unique "gene symbol" into a single O. oenanthe gene (OOENG). If one "Trinity gene" clustered multiple transcripts that were annotated to different genes, those genes were split into different "oenanthe genes" to avoid ambiguity. For unannotated transcripts, either the code derived from Trinity or the original isotig/isogroup from the first transcriptome version was kept.
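The grouping logic can be sketched as below. Transcripts sharing a gene symbol receive the same OOENG identifier regardless of their Trinity gene, and unannotated transcripts keep their original code. The input layout, the function name, and the numbering of identifiers in order of first appearance are assumptions for illustration; the actual identifiers in Table S1 were not necessarily assigned this way.

```python
def assign_genes(transcripts):
    """Map each transcript to an O. oenanthe gene identifier.

    transcripts is a list of (transcript_id, fallback_code, gene_symbol),
    where gene_symbol is None for unannotated transcripts and fallback_code
    is the Trinity gene code or the original isogroup (hypothetical layout).
    """
    symbol_to_gene = {}
    assignment = {}
    for transcript_id, fallback_code, symbol in transcripts:
        if symbol is None:
            # Unannotated: keep the code inherited from the assembler.
            assignment[transcript_id] = fallback_code
            continue
        if symbol not in symbol_to_gene:
            # Numbering by first appearance is an assumption of this sketch.
            symbol_to_gene[symbol] = f"OOENG{len(symbol_to_gene) + 1:05d}"
        assignment[transcript_id] = symbol_to_gene[symbol]
    return assignment
```

Because the key is the gene symbol rather than the Trinity gene, two transcripts from one Trinity gene that carry different symbols end up in different genes, which is the splitting behaviour described above.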
The R programming language platform, version 3.2.2, was employed for statistical analysis and plotting [34].
The raw sequence data and the assembled transcriptome can be found in the Transcriptome Shotgun Assembly (TSA) database of the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov; accessed on 20 February 2021) under the accession number TSA GFYT00000000 version 2. The full annotated transcriptome can be found in the supplements (Table S1).

Results and Discussion
A more robust and complete transcriptome for O. oenanthe was obtained by combining the previously published data (one-time sequencing using ~400 bp long reads) with new transcripts derived from the assembly of short reads from the brain, intestines, liver, adipose tissue, muscle, and skin.
The percentages of reads mapped by tissue to both transcriptomes are shown in Table 1. The new transcriptome recovered between 8 and 13% more reads than the original one, meaning that if the old transcriptome were used for further differential gene expression analyses, these reads would remain unmapped, with the consequent loss of information. Therefore, the new version is more representative of the "migratory transcriptome", which should increase the sensitivity and robustness of future comparisons. The new version contains 20,248 transcripts (the old version had 21,746); 13,040 (64%) are from the newly assembled transcriptome, and 7208 (36%) are from the previous version. Up to 1112 new open reading frames (ORFs) are described in the new transcriptome, a good indicator of its improved quality [22]. Table 2 summarizes the parameters of both transcriptomes. The new version has longer transcripts and over 47 million assembled bases (32% more than the previous one). Thus, the new version is potentially more complete and less fragmented than the old version (Figure 1).

Transcriptome Completeness
The new O. oenanthe transcriptome shows better quality parameters regarding gene completeness (BUSCO scores) than the first version, in agreement with the previous findings of longer transcripts and a higher number of reported ORFs (Table 3). The percentages of complete and single-copy complete genes refer to the orthologous, near-universally distributed avian genes found in the wheatear's transcriptome [31]. The new version was enriched in approximately 30% of these genes, a reliable indicator of its higher quality. The 20,248 assembled transcripts were attributable to 16,430 genes, a number within the range reported for avian genes: 9909 in the Sunbird Asity (Neodrepanis coruscans) and 19,174 in the Zebra Finch (Taeniopygia guttata) [35]. In total, 10,311 of these genes were fully annotated: 12,164 (60%) to F. albicollis, 2455 (12%) to the Swiss-Prot database, and 5631 (28%) unannotated. Up to 6596 isoforms were found, which correspond to 2768 genes (Table 4). In the new version, as in the previous one, the gene with the highest number of isoforms is titin. These transcripts are most likely gene fragments and not real isoforms (OOENG09682 length interval: 469-9168 bp), since titin is the largest gene identified in vertebrates [36]. Titin fragmentation and the still modest completeness percentage of the new transcriptome suggest that sequencing the transcriptome of more tissues, sampled under different metabolic conditions and using more in-depth sequencing coverage or longer reads, is needed to better describe the wheatear's transcriptome.
In summary, this paper presents an updated version of the Northern Wheatear transcriptome, which we recommend for use in further RNA-Seq studies on the species.