The Population Genomics of Sunflowers and Genomic Determinants of Protein Evolution Revealed by RNAseq

Few studies have investigated the causes of evolutionary rate variation among plant nuclear genes, especially in recently diverged species still capable of hybridizing in the wild. The recent advent of Next Generation Sequencing (NGS) permits investigation of genome wide rates of protein evolution and the role of selection in generating and maintaining divergence. Here, we use individual whole-transcriptome sequencing (RNAseq) to refine our understanding of the population genomics of wild species of sunflowers (Helianthus spp.) and the factors that affect rates of protein evolution. We aligned 35 GB of transcriptome sequencing data and identified 433,257 polymorphic sites (SNPs) in a reference transcriptome comprising 16,312 genes. Using SNP markers, we identified strong population clustering largely corresponding to the three species analyzed here (Helianthus annuus, H. petiolaris, H. debilis), with one distinct early generation hybrid. Then, we calculated the proportions of adaptive substitution fixed by selection (alpha) and identified gene ontology categories with elevated values of alpha. The “response to biotic stimulus” category had the highest mean alpha across the three interspecific comparisons, implying that natural selection imposed by other organisms plays an important role in driving protein evolution in wild sunflowers. Finally, we examined the relationship between protein evolution (dN/dS ratio) and several genomic factors predicted to co-vary with protein evolution (gene expression level, divergence and specificity, genetic divergence [FST], and nucleotide diversity pi). We find that variation in rates of protein divergence was correlated with gene expression level and specificity, consistent with results from a broad range of taxa and timescales. This would in turn imply that these factors govern protein evolution both at a microevolutionary and macroevolutionary timescale. Our results contribute to a general understanding of the determinants of rates of protein evolution and the impact of selection on patterns of polymorphism and divergence.


Introduction
Achieving a better understanding of the factors that shape patterns of divergence across genes, a central aim in evolutionary genetics, should become increasingly straightforward as the amount of sequencing data available grows exponentially [1 3]. A robust and simple way to quantify rates of gene evolution at the protein levels comes from d N /d S ratio measurements [4]. This approach quantifies selection pressures by comparing the rate of substitutions at silent sites (d S ), which are presumed neutral, to the rate of substitutions at non-silent sites (d N , amino acid changes), which may experience selection. A number of genomic parameters have been shown to correlate with rates of protein coding evolution in model organisms, including gene expression level and specificity, gene length, recombination rate, and mutation rate. A general conclusion is that gene expression level, specificity and essentiality account for a substantial amount of variation in rates of protein evolution due to selective constraints imposed by these genomic parameters [1,5 8]. On the other hand, divergence in protein sequence appears to be decoupled from divergence in expression levels. While early studies of a limited number of genes [9,10] did identify a positive correlation between the rate of protein evolution and expression divergence, an absence of such correlations is more commonly reported (e.g., [11,12]).
Although significant progress has been made toward understanding the genomic factors influencing gene evolution, most of these observations have come from studies of model organisms. Also, the majority of these comparisons have been among evolutionary distant species, and the relative importance of these genomic parameters at a microevolutionary time scale is less clear (e.g., [13]). This is especially true for plant studies describing the causes of evolutionary rate variation among nuclear genes; they are almost entirely restricted to relatively divergent species in the genus Arabidopsis or to evolutionary distant crop species such as Sorghum bicolor, Zea mays and Oryza sativa [14]. As a consequence, previous studies have estimated genetic divergence in terms of nucleotide substitutions rather than F ST , since the latter will be close to one in genetically distant species.
The North American annual wild sunflowers (Helianthus spp.) are emerging as a model system for adaptation and speciation genomics research [15,16]. Given their well characterized evolutionary and ecological history and more recently, their genomes, sunflowers are ideal for testing predictions about the roles of gene flow, the timing of divergence, and the role of gene expression divergence in the maintenance of a plant specie species: Helianthus annuus, H. debilis and H. petiolaris, which have diverged from one another less than two million years ago [16,17]. Helianthus annuus is widespread throughout much of the central and western United States. In southern Texas, morphologically distinct populations of H. annuus are described by Heiser [18] as an endemic subspecies (H. annuus texanus), yet these are fully interfertile with other populations of H. annuus. Populations of H. annuus, including those of ssp. texanus, are karyotypically divergent from H. debilis and H. petiolaris and have strong post-zygotic barriers that reduce, but do not eliminate, gene flow between them [18 20]. The three species also appear to have different ecological requirements, with H. annuus found mostly on moist, clay-based soils, whereas H. debilis and H. petiolaris are restricted to drier, sandy soils [21,22].
Illumina sequencing of expressed genes (RNAseq), a recently developed approach to transcriptome profiling that uses deep-sequencing technologies, has several advantages over classical microarray studies [23]. Not only does RNAseq provide more precise estimates of expression levels, but it also permits direct comparisons between gene expression and gene sequence divergence. This approach then yields a more informative and accurate view of the transcriptome at a lower cost compared with Sanger sequencing and/or microarrays. High throughput/low cost sequencing has been especially valuable for studies of non-model species by making genome sequencing, transcriptome sequencing and single nucleotide polymorphism (SNP) discovery efficient and accessible, even in the absence of a reference genome (e.g., [24]).
Here, we take advantage of next generation sequencing (RNAseq) data that was generated for H. annuus, H. debilis and H. petiolaris populations. Our objectives were to (i) confirm the genetic relationship among samples, then (ii) determine the effect of selection on protein coding evolution and finally (iii) investigate the effect of genomic factors which, based on previous studies, are expected to co-vary with rates of protein coding divergence. In turn, this will reveal if patterns observed in previous studies comparing more distantly related taxa also hold on shorter evolutionary timescales in recently diverged wild sunflowers.

Results
We generated a total of 122 GB (122 × 109 basepairs or 609 million paired-end reads) of sequencing data from RNA extracted from the 29 seedlings representing three Helianthus taxa: H. annuus, H. debilis, and H. petiolaris (Figure 1 and Tables 1 and 2, summary of sequencing statistics). We have deposited the reference transcriptome, SNP table and raw read counts per gene in the Dryad Digital Repository [25]. Following strict quality cutoffs (see methods), we kept 67 GB of sequencing data for alignment, of which 35 GB (52%) aligned to the reference transcriptome. One H. petiolaris (PI586932) individual yielded less than a third of the number of reads compared the average number of reads for all other individuals. These reads were also of lower quality and therefore we discarded this sample from further analyses. Following strict quality threshold in alignments and SNP calling, we discovered 433,257 polymorphic sites in 13,412 unique contigs covering 14.5 MB, with an average of 3.0 SNPs per 100 bp (Table 1).   a low sequencing output and sequences were also of lower quality than all other samples. As such, it was removed from further analysis.

Population Genetics
We used a random subset of 1,000 markers to conduct a Bayesian inference of population structure, which revealed a strong peak in for K = 3 ( for k2 = 84.6, for k3 = 713.1 and for k4 = 12.25) largely corresponding to the species H. annuus, H. petiolaris and H. debilis/praecox ( Figure 2). We analyzed population structure using three different sets of 1,000 random markers and obtained the same strong clustering results each time. One individual (btm15 2) clearly stood out as an early generation H. debilis-H. annuus hybrid in all runs conducted ( Figure 2). We removed this individual from further analyses of genetic, protein coding and gene expression divergence.
Given that no distinct genetic clusters between H. debilis and H. praecox were identified at K = 3 ( Figure 2) and the analysis of population structure with k = 4 did not reveal a clear split, all H. praecox individuals were subsumed within H. debilis for further analyses of genetic, protein coding and gene expression divergence. We also calculated the average mean F ST per gene as 0. 25 [27].
We then tested whether the ratio of non-synonymous to synonymous fixed differences was greater than the non-synonymous to synonymous polymorphism ratio using a g-test. For all comparisons, there was a significant excess of non-synonymous fixed differences (g-test < 0.0001 in all cases, Table 3). The genome wide proportion of base substitutions fixed by natural selection (alpha) was observed to be 0. 28  When focusing on the most abundant GO categories (comprising at least 50 genes), we found evidence that the proportion of adaptive evolution (alpha) varied significantly with the class of biological processes considered. In Table 4, we present the five GO categories with the highest alpha value in each comparison. These GO categories comprised a number of biological processes. In only one case, species pairs. Other GO categories with very high alpha values in at least one comparison included flower development, glycolysis, defense response to fungus, lipid catabolic process, and lipid metabolic process. Then, we noticed that for all three interspecific comparisons, correlation coefficients were similar. Therefore, in order to simplify the interpretations, we have presented the data as a mean of the three comparisons ( Figure 3, but see also Figures 4 6 for correlations per species pair). Rates of protein divergence (d N /d S ) were negatively correlated with gene expression level (partial cor. coef. = 0.03) and positively with gene expression specificity (0.06), consistent with results from a broad range of taxa [6,8,28] and positively correlated with overall sequence divergence (F ST , partial cor. coef. = 0.04). Among other notable results predicted from previous studies, we found a strong and significant negative correlation between expression level and specificity ( 0.15). Nucleotide diversity (pi) was strongly negatively correlated with sequence divergence ( 0.27) and surprisingly, positively with expression level (partial cor. coef. = 0.48). We also found that sequence divergence (F ST ) did not correlate with expression divergence.    . Partial correlation coefficients aiming to understand the genomic factors associated with variation in protein coding sequence. Partial correlation coefficients (above diagonal), color-coded according to sign and degree of correlation, and p-values (below diagonal). H. debilis-H. petiolaris comparison (4,196 ORFs).

Next-Generation Sequencing Era
Our approach using a reference transcriptome of 16,312 contigs allowed to survey a substantial fraction of all genes expressed and capture genome wide signals that would likely go undetected when using a handful of anonymous markers or a candidate gene strategy. Nearly every gene in our reference set had a least two sequences align to it and could therefore be considered expressed (15,756 out of 16,312 contigs) in the sequenced samples. The expression cut-offs commonly employed for microarrays or RT-PCR has created a perception that genes with a signal intensity below a certain ambient noise are not expressed. However, this perception probably relates more to the resolving power of the method rather than a real property of the gene; genes usually considered as silent are probably often expressed at a very low level making expression differences more subtle than originally perceived [29]. Conversely, we cannot rule out this possibility in some cases of uncertainty in sequence or alignment; for this reason genes with less than two aligned reads were removed from the data set. The fact that only about 50% of cleaned sequences aligned to our reference indicates that a significant fraction of the transcriptome remains un-assembled and/or that some reads are too divergent from the reference or represent chimeric sequences. Alternatively spliced genes and genes filtered from the reference dataset because they were inferred to represent families of close paralogs might also explain some of this missing fraction. In RNAseq studies, another source of bias can come from aligning reads from other species to a reference composed of a single species, here H. annuus. Fortunately, this does not seem to be a problem given the alignment parameters we chose and the fact that the same percentages (52%) of cleaned reads (Table 2) were aligned in all three species.
Given that the cultivated sunflower genome is nearing completion [30], there will soon be a more accurate reference to align sequences against. A fully sequenced genome will permit more accurate identification of gene space, alternative splicing events, untranslated 5' and 3' regions and intron/exon boundaries. In addition, we are concurrently developing reference transcriptomes for several closely related sunflower species including H. petiolaris and H. debilis. This will allow us to investigate orphan genes that are present in either H. petiolaris or H. debilis but have no detectable homolog in H. annuus. Such genes are rare, but may play an important role in generating species-specific evolutionary novelties [31]. Clearly, we have only begun to understand the evolutionary complexity underlying both sequence and expression divergence.

Population Genetics
Annual species of sunflowers in the genus Helianthus vary greatly in their geographic distribution and evolutionary origin. Based on Bayesian clustering of nuclear markers spanning the whole genome, we found a clear distinction between H. debilis, H petiolaris and H. annuus individuals. However, STRUCTURE often does not distinguish hierarchical sub-structuring and consequently, may only reveal the top level of population differentiation. Our study was not intended to resolve the phylogeny of Helianthus, as this would have needed much more extensive sampling. However, our results corroborate previous phylogenetic work [32,33] that places H. praecox within H. debilis.
As a whole, genetic divergence was surprisingly similar among all three species comparisons (  [33]. Hybridization between recognized species is known to participate in shaping the evolutionary diversity of North American sunflowers [20,34,35] and probably explains the patterns observed here. H. petiolaris and H. debilis are allopatric sister species, but both hybridize with the sympatric but evolutionary more distant H. annuus. H. annuus and H. debilis are known to have a long history of introgression in Texas, where these individuals were collected [20,35]. Similarly, H. petiolaris forms numerous hybrid zones with H. annuus throughout their ranges [34]. As such, these more distantly related species have exchanged genes across much of the genome while remaining morphologically and ecologically distinct [16], presumably accounting for lower than expected average F ST values.

Protein Coding Evolution
The longest open-ended ORFs (minimum length of 300 nucleotides) were kept as the most probable translated region of the gene. SNPs within these ORFs were considered as coding sites as thus it could be assessed whether these were synonymous or non-synonymous. This approach (as compared with a BLASTX approach to known protein sequences from model systems) has the advantage of identifying genes that have no detectable homolog present in public databases. Such orphan genes are potentially rare, but may play an important role in generating species-specific evolutionary novelties [31].
Our results are consistent with several expectations based on previous research [14]. First, mean d N /d S ratios were close to 0.2 for all three comparisons and few genes were identified with values greater than one, signaling strong constraint on amino acid replacements. This is consistent with the results of Yang and Gaut [28] who analyzed protein coding evolution between Arabidopsis thaliana and A. lyrata and estimated d N /d S at 0.203. Many studies have documented a positive correlation between d N and d S across genes, similar to that identified here [14]. The reason for this correlation is not well understood, although it may be caused by processes such as strong purifying selection towards translational efficiency or variation in mutation rates and recombination rates along chromosomes, which would affect both d N and d S in a similar fashion.
Based on calculations of the genome wide proportion of base substitutions fixed by natural selection (alpha), we conclude that a significant proportion of the most divergent SNPs represent selectively advantageous mutations: in all cases the ratio of non-synonymous to synonymous fixed differences was significantly greater than that for non-synonymous to synonymous polymorphisms (alpha, Table 3). By focusing on specific gene categories and examining patterns of polymorphism and divergence, we were able to provide additional evidence of adaptive evolution (Table 4). Gene categories exhibiting high alpha values included a number of biological processes that have previously been reported as undergoing positive selection in sunflowers, including defense response to fungus, flower development, lipid catabolic process, lipid metabolic process, and response to biotic stimulus [36 40]. Interestingly, we found one category, response to biotic stimulus, showing parallel levels of elevated alpha among species pairs. This is important given that parallel trends add stronger evidence for the role of selection in driving divergence for these particular traits [41]. Thus, natural selection imposed by other organisms appears to play an important role in driving protein evolution in wild sunflowers. However, we also acknowledge the dangers of story telling based on GO analysis and these trends should be confirmed before any strong conclusions can be reached [42].

Genomic Determinants of Rates of Protein Evolution
To understand the underlying causes of evolutionary rate heterogeneity among genes, researchers have investigated correlations between evolutionary rates and various genomic parameters [1,8,28]. In accordance with previous studies (e.g., [28]), we found a significant negative correlation between d N /d S ratios and expression level, with genes that are highly expressed being more constrained in terms of protein evolution. Likewise, d N /d S ratios were positively correlated with expression specificity. Again the leading explanation for a positive correlation is that widely expressed genes are more constrained in protein sequence evolution because they are probably involved in multiple biochemical pathways and face multiple selective environments in the different tissues they are expressed in [5]. Finally, we found a strong and significant negative correlation between expression level and specificity. Highly expressed genes are usually involved in a larger number of biochemical processes and expressed in a greater number of different tissues than lowly expressed genes. Since specificity was determined based on relatively shallow (44,000 reads) sequencing, lowly expressed genes may appear to have high specificity due to the stochastic lack of sampling in some libraries. At the same time, it is likely that with enough power (i.e., enough sequencing), we may find that all genes are expressed nearly everywhere (i.e., expression is leaky, see [29]). Therefore specificity may eventually need to be redefined or at least measured in independent samples, as in the present study.
All the trends reported above are consistent with theory and results from a broad range of taxa [1 3,5,7,28,43]. Interestingly however, the correlation coefficients reported here are generally lower than those observed in previous studies. This is probably due to the more recent divergence of sunflowers compared with the more distantly related taxa studied previously. For instance, the sunflowers species analyzed here have diverged less than two million years ago [19], compared with about eight MYA in the Arabidopsis thaliana-lyrata comparisons [28], six MYA for the Drosophila melanogaster group (but with much faster generation times than sunflowers [1,44], and about 75 MYA in the case of the mouse and human lineages [5]. As recently pointed out in a study quantifying protein coding evolution broadly across the Asteraceae family [13], it appears that, at least to some degree, [45]. Given that our results are also qualitatively similar to previous studies conducted at a more ancient timescale, we argue that macroevolutionary trends are governed by the same principles of microevolution. Yet, the importance of the genomic factors discussed above is possibly reduced relative to the idiosyncratic effects of drift and local variation in selection pressures when analyzed over shorter timescales [13]. Unlike previous studies, we also observed several new and sometimes unexpected associations. For one, we found that d N /d S ratios were positively correlated with overall genetic divergence (F ST ). Given that high values of d N /d S and F ST can result from positive selection, this association could be expected. Among other noticeable results, we did not find a significant correlation between genetic divergence (measured as F ST , or d N /d S ratios) and expression divergence. After observing no correlation in yeast between d N and expression divergence, Tirosh and Barkai [11] suggested that certain genes are more sensitive to changes in coding sequences, whereas others are more sensitive to changes in non coding regions regulating expression. Consequently, an absence of correlation between patterns of expression and genetic divergence, as we identified in the present study, suggests a decoupling of evolutionary forces shaping expression and genetic divergence.
Unexpectedly, nucleotide diversity (pi) was strongly correlated with expression level in Helianthus. With RNAseq, the confidence with which nucleotide variants are identified is a function of coverage, which itself represents expression. As a consequence, greater coverage (expression) increases the probability of identifying both alleles at heterozygous loci and may inflate the total number of variants [46]. Analyzing only the subset of genes with d N and d S greater than 0 should ensure that most low coverage genes are removed (mean expression value for all genes in dataset = 365, compared with mean expression of genes used in partial correlation analysis = 665). Using an even more stringent criterion on the minimum expression of genes (for example, minimum expression threshold of 100 instead of now 14) does slightly reduce the correlation between pi and expression level but does not substantially affect the other partial correlations reported in Figure 3. Thus, our conclusions hold when using more stringent filtering procedures. Lastly, nucleotide diversity (pi) was also strongly negatively correlated with F ST . This is not unexpected and likely pertains to the definition of F ST itself ([heterozygosity between-heterozygosity within]/heterozygosity between).
Essentially all previous studies looking at the genomic determinants of sequence evolution have employed one dataset comprised of sequenced genes, a separate dataset for calculating gene expression (i.e., microarray data often measured in different individuals or even populations) and another dataset to identify genomic context (i.e., an annotated genome). In addition, previous studies have mainly estimated genetic divergence in terms of nucleotide substitutions between distant species, rather than F ST , since the latter will be close to one in genetically distant species. Because of the independence between the expression and sequence data sets in these previous studies, the apparent ascertainment bias detected in the present study is avoided. Thus, while RNAseq increases the precision of expression analyses, and ensures that the same pairs of genes are compared for sequence and expression analyses, the approach is not without drawbacks.

Sample Preparation, Sequencing and Reference
We collected achenes (single seeded fruits) from seven Helianthus annuus and seven H. debilis individuals in Texas. Four of the H. debilis individuals were initially labeled as H. praecox, yet these clustered with H. debilis individuals genetically (see below). As molecular phylogenetic studies of Helianthus also place H. praecox within H. debilis [32,33], we follow an early taxonomy [47] that subsumes H. praecox into H. debilis. In addition, we acquired twelve H. petiolaris accessions spanning the distribution range of the species, either from USDA collections or from previous sampling efforts (Figure 1).
We germinated all achenes at the University of British Columbia (Vancouver, Canada) and grew them for approximately three weeks under growth chambers or similar greenhouse conditions (12 hours of daylight at 22 degrees). Then we harvested all aboveground biomass, flash froze it in and stems using a modified TRIzol Reagent protocol (Invitrogen, Carlsbad, CA, USA). We quantified the RNA samples using a NanoDrop (Thermo Fisher Scientific, Waltham, MA, USA) and verified their quality on agarose gels. We stored the total RNA in pure water. Finally, samples were retrotranscribed using a custom Invitrogen (now Life Technologies, Carlsbad, CA, USA) cDNA kit, nearly identical to the SuperScript ® Double-Stranded cDNA Synthesis Kit. After shearing by sonication, cDNA fragments 190 to 210 basepairs in length were isolated and PCR-amplified to generate the sequencing library, following standard Illumina (San Diego, CA, USA) Genome Analyzer paired-end library protocols. These were sequenced on a GAII Illumina platform (paired end sequencing, 2 × 100 bp reads) at the Genome Sciences Centre (Vancouver, Canada) and base calling was done as part of the standard pipelines at the Genome Sciences Centre sequencing facility. Samples were multiplexed (three samples per lane for H. annuus and two per lane for all others).
Our reference Expressed Sequence Tags (EST) set consists of 16,312 contiguous sequences (contigs). It was assembled from available Sanger and 454 EST sequences generated for H. annuus (both wild and cultivated individuals spanning the whole distribution of the species) and from several different tissue types [48]. Individual genotypes were first assembled with MIRA version 3.0 [49], using MIRA contigs and singletons were reassembled further using the program CAP3 at 94% identity [50] as in [24]. Contigs found in at least two different genotypes and with total length greater than 500 bp were kept for the final merged assembly. Based on comparisons to genetic map data [51], 783 contigs contain sequences that map to multiple locations and were removed, as they likely represent families of close paralogs that cannot be resolved. Of the 16,312 remaining sequences, 8,226 map to single locations on genetic maps [51] and appear to be single copy, while the remainder was not assessed. In addition to the mapped and unmapped nuclear loci, this reference set also contains the whole chloroplast genome, which spans nearly 127 KB. This assembly has therefore gone through strict quality trimming and should represent the majority of genes expressed in young seedlings.

Alignments
We first cleaned the raw sequencing files (fastq sequencing files in Illumina 1.3+ format) to remove any low quality reads and potential contaminating vector sequences. To ensure high quality of data, we discarded reads where one or more position had a phred quality score below 3 (in either paired end). We then aligned the reads against the reference transcriptome using BWA (ALN and SAMPE command, default parameters except for read trimming parameter (-q) set at 20 to ensure high quality alignments [52]). Following alignments, we used the Indel Realigner from the genome analysis toolkit (GATK [53]) to correct alignment errors near indel regions. We used SAMTOOLS [54] to extract the base-pair information at each site for each individual (PILEUP command, 17 million sites). Note that one individual (PI586932) was not included in the analysis given low sequencing yield (see results section). We parsed this output (base-pair information at each site for each individual) based on several criteria. First, sites with a read depth of less than three were called as missing. Then, if that site passed this first filter, the genotype was called as heterozygous only if the minor allele was represented by more than two reads and the minor allele frequency was at least 10%. At this point, we concatenated information from all individuals for all sites. From this dataset, we filtered on a per site basis, in order to keep only sites for which no more than 20% of all individuals had missing data, expected heterozygosity (He) was greater than 0.2 and observed heterozygosity (Ho) was smaller than 0.6. Polymorphic sites with values above the latter threshold likely represent paralogous sequence variants instead of true SNPs. Conversely, polymorphic sites with values below the former threshold either represent rare alleles, which possess little information for the sake of differentiating individuals, or sequencing errors (unless very high coverage is attained). Nevertheless, we are aware that our analysis likely contains a small fraction of false positives due to alignment and/or sequencing errors. However, given the large amount of data, high overall coverage, strict quality thresholds cut-offs and visual inspection of random subsets of alignments (several tens of kilobases), we expect the data to be more than sufficient for the analysis we report here. This is evident when the same trends are observed when conducting replicate analysis using small random subsamples (for example in the STRUCTURE runs, see results).

Population Genetics
To ensure that individuals clustered according to their described species, we assessed nuclear genetic structure through Bayesian model-based clustering in STRUCTURE v.2.3 [55]. STRUCTURE determines the most likely number of differentiated clusters (K) represented by the sample and assigns the sampled genotypes to the inferred clusters. Using a random subset of 1,000 markers, we estimated the log likelihood of the data, given different numbers of genetic clusters K, using an admixture model with correlated allele frequencies, without sampling locations as priors and with all other parameters as defaults. For each k value of 1 through 6, we ran ten replicates (50,000 burn-in cycles, 100,000 MCMC iterations), from which we calculated following [56]. We used the program STRUCTURE HARVESTER [57] to identify the number of populations K with the best support.

Protein Coding Evolution and Effect of Selection
We identified open reading frames (ORF) in our reference transcriptome using the program GETORF in EMBOSS (European Molecular Biology Open Software Suite [58]). We kept the longest open-ended ORF (minimum length of 300 nucleotides) as the most probable translated region of the gene. We used PAML (Phylogenetic Analysis by Maximum Likelihood [59], which accounts for codon usage bias and GC content, to calculate the rate of non synonymous mutations scaled by the rate of synonymous mutations (d N /d S ) in a maximum likelihood framework (run-mode = 0, CodonFreq = 2, model = 2). For d N /d S calculations, we first identified the consensus (major) allele for each site for each species and then annotated these sites into a single consensus sequence per gene per species. We discarded genes where either d N or d S equalled zero, keeping only those d N /d S ratios that could be calculated with confidence.
We first assessed the effect of selection on protein coding evolution globally. We tested if the ratio of non-synonymous to synonymous fixed differences was greater than the ratio of non-synonymous to synonymous polymorphisms using a g-test as in [27]. As an extension of this approach, the average proportion of amino-acid substitutions driven by positive selection (alpha) can be estimated using equation 3 in [60] 1 where all averages are across genes and P S , D S , P N and D N are the number of synonymous polymorphisms, synonymous substitutions, nonsynonymous polymorphisms and nonsynonymous substitutions, respectively. Confidence intervals of alpha were obtained through bootstraps (1,000 bootstraps) by randomly selecting genes with replacement as described in [60]. In this case, substitutions (D) were considered as SNPs with an F ST value greater or equal to 0.9. Another way of extracting ecologically relevant information is through the characterization of functional categories that are over-represented within certain subsets of genes [61]. We therefore examined patterns of adaptive protein evolution for distinct gene ontology categories as in [8]. We first compared genes using BLASTX and then employed the best hits to assign gene ontology (GO) annotation terms based on the uniprot database [62]. We calculated alpha for GO categories comprised of at least fifty genes and identified categories with the highest alpha levels (the top five for each comparison respectively).

Genomic Determinants of Protein Evolution
To assess what genomic factors are correlated with rates of protein evolution, we estimated partial correlation coefficients in the programming language R (v.2.13.0. [63], package PPCOR.R [64]). We assessed the degree of partial correlation between protein divergence (log(d N /d S )), genetic divergence (mean F ST per gene), genetic diversity (pi, calculated in SITES [65], expression specificity, log(expression level), and the absolute value of log(expression species 1/expression species 2). For genetic diversity (pi), we also considered using the Watterson estimator theta given that this parameter may have a different bias than pi [66]. Both estimators (pi and Watterson) were strongly correlated (r2 = 0.96) and partial correlation coefficient results were largely the same for both. As such, we only report values of pi. We calculated F ST values for each SNP (according to [67] using the R package HIERFSTAT [68] and then averaged it per gene. We measured gene expression tissue specificity using a dataset of publicly available Sanger sequences for H. annuus developed by the Compositae Genome Project [48]. Briefly, these libraries were constructed using eleven different tissues from two genotypes (callus, roots, disk and ray flowers, pre-fertilized flowers, developing kernel, chemically treated seedlings, environmentally stressed roots, environmentally stressed shoots, germinating seeds, environmentally stressed flowers, and hulls, see [48] for details). These libraries were Sanger sequenced to generate a total of 44,000 barcoded reads. We demultiplexed and compared all reads (BLASTX, e-value < 1e-10) against our reference transcriptome of 16,312 genes. We defined expression specificity (range 0 11) as the total number of libraries (11) minus the number of libraries where a gene scored at least one hit as in [7]. For gene expression calculations (level and divergence), we first normalized the number of reads aligned per gene per individual by the median expression of each library. Normalization via the common approach of Reads per Kilobase per Million Mapped Reads (RPKM) can be strongly biased by a relatively small proportion of highly expressed gene and we therefore decided against using it [69]. Given that we were also interested in interpreting differences in expression between genes, we normalized all expression data by gene length. Then, we used the R package DESEQ [70] to calculate expression divergence per gene, as the absolute mean value of expression for species 1 divided by expression for species 2 on a base two logarithmic scale.

Conclusions
As the cost of sequencing continues to drop, RNAseq is expected to replace microarrays for many applications that involve determining the structure and dynamics of the transcriptome [23]. While likely to revolutionize our understanding of transcriptome evolution, especially in non-model systems, a possible caveat is that sources of bias in coverage and therefore gene expression estimates are not yet fully understood. For example, it appears that base composition, library preparation and transcript length may all impact gene expression estimates [23]. As reviewed by [3], many challenges remain when dealing with next generation sequence data. In addition, more work is necessary to generalise the relationship between the evolution of promoter sequences, protein coding regions and gene expression in plants [14]. Our study provides a significant advance in our understanding of molecular evolution. It reveals several general trends that appear to be robust to variables such as timing of divergence or experimental growth conditions, given that they have been reported across a broad range of taxa. At the same time, a meta-analysis of studies conducted under different environmental conditions, developmental stages, and species may reveal novel insights into plant molecular evolution that are not apparent when studying only a single condition. With the increased availability of high-throughput sequencing datasets, such large-scale studies are becoming increasingly feasible.