3.1. Sampling and Primary Analysis
The choice of samples for the tomato transcriptome map was based on clustering of A. thaliana transcriptome data from Klepikova et al. [3]. We selected the Arabidopsis samples that had the most dissimilar expression profiles based on the clustering tree of samples, and collected tomato samples that corresponded to these Arabidopsis samples (for example, anthers and senescent leaves). Assuming that expression profiles in homologous organs and/or corresponding developmental stages are similar in Arabidopsis and tomato, this approach would result in a set of tomato samples representing the maximum diversity of expression profiles.
At least 20 million sequence reads were generated for each sample, with read lengths of 75 and 60 bp (see Materials and Methods). Initial quality analysis showed a high congruence of the biological replicates: Pearson r² correlation values for all replicates were between 0.79 and 1.0, with a mean value of 0.96 (median 0.98) (Supplementary Table S6), and a clustering tree of the replicates also indicated consistency of the data (Supplementary Figure S2). A hierarchical clustering tree of the samples reflected an organ- and age-specific structure (Figure 1). Most samples that are not replicates have highly divergent expression profiles (1 − r² > 0.3). This supports the initial assumption and indicates that our map indeed represents samples that are the most diverse in terms of expression profiles.
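The replicate check and the 1 − r² divergence criterion described above can be sketched as follows. This is a minimal illustration on toy data, not the pipeline used in the study: the function name, the toy matrix, and the use of NumPy are assumptions.

```python
import numpy as np

def r2_distance_matrix(expr):
    """Return the samples x samples matrix of 1 - r^2 (Pearson).

    expr: genes x samples array of normalized expression values.
    """
    r = np.corrcoef(expr, rowvar=False)  # columns are treated as samples
    return 1.0 - r ** 2

# Toy data (hypothetical): 4 genes, 3 samples.
# Columns 0 and 1 mimic two replicates; column 2 mimics a different organ.
expr = np.array([
    [10.0, 11.0, 200.0],
    [50.0, 52.0, 5.0],
    [300.0, 290.0, 40.0],
    [5.0, 6.0, 120.0],
])

dist = r2_distance_matrix(expr)
divergent = dist > 0.3  # the divergence cut-off quoted in the text
```

In this toy example the two replicate-like columns have a distance close to zero, while the third column exceeds the 0.3 cut-off against both.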
Annotation SL2.50 of the S. lycopersicum genome contains 33,810 coding genes. We used two thresholds to define genes as expressed in a certain sample: five normalized read counts in each of two replicates of the sample (weak threshold), and 16 normalized read counts in each replicate for the strong threshold (as defined by Su et al. [17]). Using the weak threshold, 26,283 (78%) genes were expressed in at least one sample (24,792 (73%) using the strong threshold; Supplementary Table S7). Under the weak and strong thresholds, 13,517 (40%) and 11,669 (35%) genes, respectively, were expressed in all samples (Supplementary Table S7). The lowest number of expressed genes (17,208 (51%) and 15,348 (45%) for the weak and strong thresholds, respectively) was observed in the Sol.FL.r sample (red pulp), while the greatest number (20,805 (62%) and 18,564 (55%)) was observed in the Sol.SD.y sample (young seeds) (Supplementary Figure S3).
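As an illustration of the two-threshold rule, the sketch below calls a gene expressed in a sample only when both replicates reach the threshold. The array layout, the function name, and the use of "greater than or equal to" are assumptions made for this example; the toy counts are invented.

```python
import numpy as np

WEAK, STRONG = 5, 16  # thresholds in normalized read counts, from the text

def expressed(counts, threshold):
    """Boolean genes x samples matrix of expression calls.

    counts: genes x samples x 2 array (two replicates per sample).
    A gene is called expressed when BOTH replicates reach the threshold
    (the >= comparison is an assumption).
    """
    return (counts >= threshold).all(axis=2)

# Toy data (hypothetical): 3 genes, 2 samples, 2 replicates each.
counts = np.array([
    [[20, 18], [6, 7]],    # strong call in sample 0, weak-only call in sample 1
    [[5, 4], [0, 1]],      # one replicate below 5: not expressed anywhere
    [[16, 30], [17, 16]],  # strong call in both samples
])

weak_calls = expressed(counts, WEAK)
strong_calls = expressed(counts, STRONG)

n_any_weak = int(weak_calls.any(axis=1).sum())      # expressed in at least one sample
n_all_strong = int(strong_calls.all(axis=1).sum())  # expressed in all samples
```

Counting rows with `any` and `all` gives the "expressed in at least one sample" and "expressed in all samples" tallies, and summing a column gives the per-sample number of expressed genes.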
The splicing analysis demonstrates that the current annotation of the tomato genome lacks many splice sites: our dataset reveals a high number of new splice sites. In contrast, only 10% of the 123,617 previously known splice sites are not found in our data. Even at the most stringent threshold, the number of new sites is twice the number of annotated sites. The results of the splicing analysis are summarized in Table 1.
To assess the completeness of the transcriptome map in terms of the representation of expressed genes, we used three publicly available datasets that represent different biological processes and organs. The complete list of samples is presented in Supplementary Table S2. The first dataset, DEVELOPMENT, includes 19 samples (floral bud, leaf, petal, root, and different parts of the fruit at five stages of fruit maturity) in one replicate with a sequencing depth of 14–26 million reads [13]. The second dataset, STRESS, includes two sets of samples from biotic stress (Cladosporium fulvum infection-treated and control plants [23] and PRJNA419151). Each set is a time series collected in three replicates, and the sequencing depth ranges from 9.7 to 31 million reads. The third dataset, FRUIT, is a detailed expression atlas of the developmental dynamics of the tomato fruit [12]. It includes 49 samples in three replicates and 84 samples in four replicates. The sequencing depth is moderate, ranging from 3.6 to 25 million reads. Out of 483 samples, 183 have more than 10 million reads, and 367 have more than 7 million reads. The total number of genes expressed in the samples from these three datasets is 27,562 (under the threshold of 5+5 reads); in our transcriptome map, we registered the expression of 26,283 genes (i.e., >95%). The same pattern is retained under the stronger threshold of 16+16 reads: 25,908 genes are expressed in these three datasets, and 24,792 genes are observed in our map. The expression of ~1000 genes is registered only in our dataset (Figure 2a); we assume that this is because several samples are unique to our map, e.g., meristems.
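The comparison with public datasets amounts to set operations on lists of expressed genes. The sketch below uses invented gene identifiers and a hypothetical helper; only the four totals in the last two lines are taken from the text.

```python
def compare_gene_sets(map_genes, public_genes):
    """Count genes shared by, and unique to, two sets of expressed genes."""
    map_genes, public_genes = set(map_genes), set(public_genes)
    return {
        "shared": len(map_genes & public_genes),
        "map_only": len(map_genes - public_genes),
        "public_only": len(public_genes - map_genes),
    }

# Hypothetical gene identifiers, for illustration only.
map_genes = {"g1", "g2", "g3", "g4"}
public_genes = {"g2", "g3", "g4", "g5", "g6"}
overlap = compare_gene_sets(map_genes, public_genes)

# Ratios of the totals reported in the text (weak and strong thresholds).
ratio_weak = 26283 / 27562
ratio_strong = 24792 / 25908
```

Both ratios exceed 0.95, in line with the ">95%" figure quoted above. They compare totals only; the shared and unique counts of the kind shown in Figure 2a require the actual gene lists.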
Next, we assessed the number of samples in which each gene was expressed (Supplementary Figure S4). Most of the protein-coding genes tended to be expressed in all or almost all samples: 16,326 (48%) genes under the weak threshold and 14,378 (43%) under the strong threshold were expressed in more than 25 samples, while some genes were expressed in only a few samples (3365 (10%) and 3674 (11%) genes, respectively, in 1–7 samples). We also investigated whether there was a correlation between the number of samples in which a gene was expressed and the expression level (Supplementary Figure S5). The mean and median expression levels across all samples were found to be higher for more widely expressed genes (i.e., those expressed in more samples). The most widely expressed genes also exhibited higher maximum and minimum expression levels, but the trend was less prominent for these two measures.
Analysis of splice sites using additional publicly available datasets shows that even in our dataset, many low-frequency sites remain unidentified. In particular, the addition of the detailed transcriptome map of fruit development results in a high number of additional splice sites (Figure 2b and Table 1). However, given the low coverage of these data, some of these sites may represent artefacts.
3.2. Comparison with Arabidopsis thaliana Transcriptome Map
We compared the global parameters of the tomato transcriptome map with those of the A. thaliana map and found that, despite the difference in the number of samples, they were similar in the two species. In particular, the distributions of the number of expressed genes and of Shannon entropy (H) are similar, with the only difference being that in tomato the peak at low entropy values is barely visible (Supplementary Figures S6 and S7). The maximum entropy is 4.16. There are 12,641 genes with H ≥ 3.7; they are highly enriched in categories associated with basic cellular metabolism (Supplementary Table S8). At the lower end, there are 298 genes with H ≤ 0.15. They are enriched in categories such as peroxidase activity (GO:0004601) and peptidase activity (GO:0008233) in molecular function, response to stress (GO:0006950) in biological process, and cell wall (GO:0005618) in cellular component (Supplementary Table S8). All other global parameters (see Supplementary Figures S5, S8, and S9), such as the distributions of maximum and minimum expression levels, DE score, and Z-score, are also almost identical in tomato and Arabidopsis.
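For readers unfamiliar with the measure, the Shannon entropy of an expression profile can be computed as in the sketch below. The logarithm base (2) and the normalization of each profile to proportions are assumptions made for this illustration; this section does not state the exact formula.

```python
import numpy as np

def shannon_entropy(profile):
    """Shannon entropy (base 2, an assumption) of one gene's expression profile.

    profile: expression values of a single gene across samples.
    Values are converted to proportions; zero entries contribute nothing.
    """
    p = np.asarray(profile, dtype=float)
    total = p.sum()
    if total == 0:
        return 0.0
    p = p / total
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A gene expressed evenly across 16 hypothetical samples: maximal entropy.
h_uniform = shannon_entropy([10.0] * 16)

# A gene expressed in a single sample: entropy of zero.
h_specific = shannon_entropy([0.0] * 15 + [250.0])
```

Under these assumptions, uniformly expressed genes approach the logarithm of the number of samples, while sample-specific genes approach zero, which is why high-H genes are dominated by basic cellular functions.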
It is interesting to compare the sets of genes that do not vary in expression between samples in Arabidopsis and in tomato. We considered only genes expressed in all samples; for each gene, the coefficient of variation (CV) was calculated. We found 123 genes with CV < 0.20, 657 with CV < 0.25, and 1527 with CV < 0.30 (Supplementary Table S9). The set of genes with CV < 0.20 was enriched in categories related to transport, protein and nucleic acid localization, and kinases (Supplementary Table S10). As in Arabidopsis, the addition of publicly available RNA-seq data (the DEVELOPMENT and STRESS sets) did not greatly decrease the number of stable genes (Supplementary Table S10). Unfortunately, the data from the fruit development atlas could not be used for the analysis of stable genes because the shallow sequencing depth can distort expression profiles (in particular, by underestimating lowly expressed genes).
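The selection of stable genes can be sketched as a per-gene coefficient of variation (standard deviation divided by mean) followed by a cut-off. The function name, the toy matrix, and the use of the sample standard deviation (`ddof=1`) are assumptions for this example.

```python
import numpy as np

def coefficient_of_variation(expr):
    """Per-gene CV (standard deviation / mean) across samples.

    expr: genes x samples array restricted to genes expressed in all
    samples, so every row mean is positive.
    """
    expr = np.asarray(expr, dtype=float)
    return expr.std(axis=1, ddof=1) / expr.mean(axis=1)

# Toy data (hypothetical): row 0 is stable, row 1 is highly variable.
expr = np.array([
    [100.0, 110.0, 90.0, 100.0],
    [10.0, 200.0, 50.0, 5.0],
])

cv = coefficient_of_variation(expr)
stable = cv < 0.20  # the strictest cut-off quoted in the text
```

Relaxing the cut-off to 0.25 or 0.30 only changes the final comparison, which is how the three nested gene sets reported above would be obtained.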
Analysis of GO enrichment of stable genes in Arabidopsis and tomato reveals similar categories: GO:0051169~nuclear transport, GO:0016192~vesicle-mediated transport, GO:0015031~protein transport, GO:0008104~protein localization, GO:0006886~intracellular protein transport, GO:0006497~protein amino acid lipidation, GO:0006403~RNA localization, GO:0006397~mRNA processing, GO:0004386~helicase activity, GO:0042175~nuclear envelope-endoplasmic reticulum network, GO:0016023~cytoplasmic membrane-bounded vesicle, GO:0005794~Golgi apparatus, GO:0005654~nucleoplasm, and GO:0005635~nuclear envelope.
The identification of stably expressed genes is important for further studies that use quantitative PCR (qPCR) to measure gene expression levels. It is well known that the genes traditionally used as references in qPCR experiments (glyceraldehyde 3-phosphate dehydrogenase (GAPDH), actin, tubulin, etc.) are in fact not stable across conditions and organs [24], and each species and each experimental system requires selection and validation of the optimal reference genes [25]. The set of stably expressed genes identified in our study could be used as a basis for such a selection in tomato. Notably, tomato orthologues of two genes that were identified as the most stable in Arabidopsis are also among the most stable in tomato (Supplementary Table S11).
3.3. Analysis of Expression Patterns of Duplicated Genes
A prominent feature of plant genomes is that they undergo multiple whole-genome or segmental duplications. Gene copies resulting from a duplication usually diverge in function (alternatively, one of the copies can be lost). When one gene of a model organism is an orthologue of two paralogous genes from a non-model organism, it is usually difficult to identify which of the co-orthologues retains the ancestral function because both of them have a similar level of sequence identity. Indeed, we found that the distribution of the identity values for interspecific pairs within ortho-triplets (1 Arabidopsis gene–2 tomato genes and vice versa) is almost identical to the distribution of the identities for ortho-pairs. Even the distributions of minimal and maximal identity values are not drastically different (Figure 3 and Table 2).
This means that in most ortho-triplets, both paralogs from one species are equally similar to the single gene from the other species. In contrast, the similarity of expression profiles differs greatly between the interspecific pairs: for most ortho-triplets, there is a pair with a low expression distance (close to the distance typical for ortho-pairs) and a pair with a high distance (see, e.g., Figure 4 and Table 2).
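The comparison of interspecific pairs within an ortho-triplet can be sketched as follows. The distance used here (1 minus the Pearson correlation over matched samples) is an assumption for illustration, since this section does not define the expression distance; the gene names and profiles are invented.

```python
import numpy as np

def expression_distance(a, b):
    """Distance between two profiles over matched samples.

    Defined here as 1 - Pearson r; this metric is an assumption.
    """
    return 1.0 - float(np.corrcoef(a, b)[0, 1])

def closest_co_orthologue(single, paralogs):
    """Return the paralog whose profile is closest to the single gene,
    together with the distances of all paralogs."""
    dists = {name: expression_distance(single, prof)
             for name, prof in paralogs.items()}
    best = min(dists, key=dists.get)
    return best, dists

# Hypothetical ortho-triplet: one gene in species A, two paralogs in species B,
# measured in four corresponding organs or stages.
single = [1.0, 5.0, 20.0, 3.0]
paralogs = {
    "paralog_a": [2.0, 9.0, 41.0, 5.0],   # profile resembles the single gene
    "paralog_b": [30.0, 4.0, 1.0, 12.0],  # profile has diverged
}

best, dists = closest_co_orthologue(single, paralogs)
```

Such a comparison presupposes that the samples of the two species are matched one to one, which is the rationale for selecting tomato samples that correspond to the Arabidopsis samples.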
In terms of function, this means that one of the co-orthologues in the ortho-triplet usually retains the ancestral function, while the other acquires a new function. Presumably, this occurs through divergence of the regulatory elements of the paralogs after duplication. At the same time, sequence similarity at the level of protein-coding sequences remains the same for both co-orthologues and does not allow conclusions about function to be drawn.