Sow Thistle Chloroplast Genomes: Insights into the Plastome Evolution and Relationship of Two Weedy Species, Sonchus asper and Sonchus oleraceus (Asteraceae)

Prickly sow thistle, Sonchus asper (L.) Hill, and common sow thistle, Sonchus oleraceus L., are noxious weeds. Probably originating from the Mediterranean region, they have become widespread species. They share similar morphology and are closely related. However, they differ in their chromosome numbers and the precise relationship between them remains uncertain. Understanding their chloroplast genome structure and evolution is an important initial step toward determining their phylogenetic relationships and analyzing accelerating plant invasion processes on a global scale. We assembled four accessions of chloroplast genomes (two S. asper and two S. oleraceus) by the next generation sequencing approach and conducted comparative genomic analyses. All the chloroplast genomes were highly conserved. Their sizes ranged from 151,808 to 151,849 bp, containing 130 genes including 87 coding genes, 6 rRNA genes, and 37 tRNA genes. Phylogenetic analysis based on the whole chloroplast genome sequences showed that S. asper shares a recent common ancestor with S. oleraceus and suggested its likely involvement in a possible amphidiploid origin of S. oleraceus. In total, 79 simple sequence repeats and highly variable regions were identified as the potential chloroplast markers to determine genetic variation and colonization patterns of Sonchus species.


Introduction
Prickly (or spiny) sow thistle, Sonchus asper (L.) Hill, and common (or annual) sow thistle, S. oleraceus L., are two well-known worldwide noxious weeds. The seeds can germinate throughout the year over a broad range of temperatures [1,2]. These species are considered to be particularly troublesome weeds across the grain growing regions because they have allelopathic potential to function as interference competition even in weed communities [3] and are prolific seed producers (up to 25,000 seeds per single plant in a fallow); the seeds possess pappi, which helps wind-mediated dispersal [4]. Moreover, the evolution of herbicide resistance found that populations of these species threatens the efficiency of weed control. For example, several populations of S. oleraceus have been reported as resistant to acetolactate synthase (ALS) inhibiting herbicides including chlorsulfuron (Group B) and glyphosate herbicides (Group M) in Australia [5,6]. In addition to being troublesome Chloroplasts in plant cells serve as metabolic centers and encode many key proteins that are involved in photosynthesis and other metabolic processes, primarily participating in photosynthesis, transcription, and translation [25]. Chloroplast (cp) sequence polymorphisms have been extensively used as useful genetic markers at wide ranges of taxonomic levels in plants. They have provided valuable insights into phylogenetic relationships and the origin and evolution of the crop species [26] or introduced/invasive species [27][28][29][30] that are facilitated by maternal inheritance of cp genomes. However, earlier phylogenetic analyses utilizing the partial cpDNA sequences from several regions often resulted in limited sequence variation owing to highly conservative genome evolution, particularly for closely related and recently radiated groups of species [31]. The advent of high-throughput sequencing technologies of next-generation sequencing (NGS) has helped to reveal considerable genome-wide variations in terms of sequences and structures of entire chloroplast genomes. The benefits of genome-wide data have increased phylogenetic resolution and significantly enhanced our understanding of plant evolution and diversity in the field of chloroplast genetics and genomics [25]. The chloroplast phylogeny based on several coding and noncoding cpDNA regions in previous studies has not provided enough resolution to elucidate the phylogenetic relationship among weedy Sonchus species as well as the origin of S. oleraceus. In both of the cpDNA and nuclear DNA phylogenies, S. oleraceus was more closely related to S. asper than to S. tenerrimus, even though the precise relationship between them was not clear [13,[22][23][24]. Taking cpDNA phylogeny and maternal inheritance into consideration, it could be hypothesized that S. asper contributed as the maternal parent in the origin of amphidiploid S. oleraceus. To test this hypothesis and to better understand the evolutionary relationships of two weedy sow thistle species, we characterized four accessions of complete chloroplast genomes (two from S. asper and two from S. oleraceus) and conducted the comparative analyses of their whole chloroplast genomes. All but one accession of S. oleraceus was collected from the probable origin of diversity in the Mediterranean region. Considering their global distribution, one accession of S. oleraceus was collected from Dok-do Island in Korea and one more previously published accession of S. oleraceus from Australia (GenBank accession number MG878405) was also included in the comparative analyses.

Material Preparation, DNA Extraction, Genome Sequencing, and Annotation
The silica-gel dried leaves of four prickly and common sow thistles which were sampled from natural habitats in Spain and Korea were used as sources of DNA. Two accessions of S. asper were sampled from Seville (VIL-1) and Huelva (MAR-1) in Spain, while S. oleraceus was sampled from Huelva (MAR-1) in Spain and Dok-do Island in the East Sea, Korea. Total genomic DNA was isolated using the DNeasy Plant Mini Kit (Qiagen GmbH, Hilden, Germany) following the manufacturer's protocol. An Illumina paired-end (PE) genomic library was constructed and sequenced using the Illumina HiSeq platform (Illumina, Inc., San Diego, Ca, USA) at Macrogen Corporation (Seoul, Korea). The sequence reads of chloroplast genomes were assembled by the de novo genomic assembler, Velvet 1.2.10 [32] at coverages ranging from 789x to 1354x. Annotation was performed using the Dual Organellar GenoMe Annotator [33], ARAGORN v1.2.36 [34], and RNAmmer 1.2 Server [35]. Using Geneious v8.1.6 (Biomatters Ltd., Auckland, New Zealand), the draft annotation was inspected and corrected manually, performing blast search by comparison with homologous genes in Lactuca sativa (DQ383816), S. oleraceus (MG878405), S. canariensis (NC042381), S. acaulis (NC042382), and S. webbii (NC042383) from the GenBank database at the National Center for Biotechnology Information (NCBI) as references. The complete plastome sequences were registered in GenBank under the accession numbers MH908962 (S. oleraceus from Korea, Collection # Son0-uDS2), MK371006 (S. oleraceus from Spain, Population # MAR-1), MK371015 (S. asper from Spain, Population # VIL-1), and MK371016 (S. asper from Spain, Population # MAR-1). In addition, the raw HiSeq reads were deposited in the Short Read Archive (SRA) at NCBI under Bioproject ID PRJNA0577793 for MH908962 and PRJNA578572 for MK371006, MK371015, and MK371016. OGDRAW [36] was used to draw circular chloroplast genome maps ( Figure 2). Merged gene map of four weedy Sonchus chloroplast genomes of two accessions each of S. asper and S. oleraceus that were sequenced in this study. The genes inside and outside of the circle are transcribed in the clockwise and counterclockwise directions, respectively. Genes belonging to different functional groups are shown in different colors. The thick lines indicate the extent of the inverted repeats (IR A and IR B ) that separate the genomes into small single copy (SSC) and large single copy (LSC) regions. Large inversion and smaller inversion nested within the large inversion that is unique in Asteraceae are indicated with black lines outside the gene map.

Identification of Highly Divergent Regions
The five weedy Sonchus chloroplast genomes were compared to the reference genome of S. webbii at the entire chloroplast genomic level using DnaSP [38] and mVISTA [39]. S. webbii is a herbaceous perennial unlike other woody Sonchus species (S. acaulis and S. canariensis) in the Canary Islands and is sister to the clade containing five accessions of weedy Sonchus species. Overall sequence divergence was estimated for the five weedy Sonchus chloroplast genomes that were aligned and compared to the reference genome using the LAGAN alignment mode [40] in mVISTA. Nucleotide diversity (Pi) was calculated using the sliding window analysis (window length = 1000 bp and step size = 200 bp excluding sites with alignment gaps) to detect the most divergent regions (i.e., mutation hotspots) among the five weedy Sonchus genomes in DnaSP. The borders of large single copy (LSC), small single copy (SSC), and inverted repeats (IRs) were compared with the results of DnaSP and mVISTA.

Phylogenetic Analysis
To investigate the taxonomic position and phylogenetic relationship of the newly sequenced accessions of S. asper and S. oleraceus, 26 complete chloroplast sequences representing major lineages of the family Asteraceae were obtained from GenBank. In total, full sequences of 30 chloroplast genomes were aligned using MAFFT v.7 [41]. A maximum likelihood tree was produced based on the relationships of whole chloroplast genomes by IQ-TREE [42] with 1000 replicate bootstrap (BS) analyses. The best fit evolutionary model was chosen as TVM + F + I + G4, which was scored according to the Bayesian information criterion (BIC) scores and weights by using ModelFinder [43] implemented in IQ-TREE.

Comparative Genomic Analysis of Five Weedy Sonchus Chloroplast Genomes in Content, Order, and Organization
Despite morphological and cytological differences between two weedy Sonchus species (S. asper and S. oleraceus), gene content and arrangement were found to be identical in five chloroplast genomes, displaying 99.98% pairwise similarity in sequences ( Table 1). The total length of five cp genomes ranged from 151,808 (S. oleraceus from Spain and Australia) to 151,849 (the two samples of S. asper from Spain and S. oleraceus from Dok-do, Korea) base pairs (bp) and consisted of four typical regions: LSC, SSC, and a pair of inverted repeat regions (IR A and IR B ). One large inversion with a size of 22.8 kb and a second smaller inversion of 3.3 kb, nested within the large inversion, were found in all five chloroplast genomes ( Figure 2). These inversions are unique to Asteraceae and present in most of the family, but are absent in the species of the basal subfamily Barnadesioideae [44,45]. The overall guanine-cytosine (GC) content of each chloroplast genome was 37.6%, with LSC, SSC, and IR regions having 35.8%, 31.4-31.5%, and 43.1% GC contents, respectively. Each of the five cp genomes contained 130 genes, including 80 protein-coding genes (plus seven duplicated in IR), three rRNA genes (all duplicated in IR), and 30 tRNA genes (plus seven duplicated in IR). Eighteen genes contained introns, including seven tRNA genes. Three genes of clpP, rps12, and ycf 3 exhibited two introns. The trnK tRNA gene harbored the largest intron, which contained the matK gene in between. In total, 17 genes were duplicated in the IR regions, including seven tRNAs, three rRNAs, and seven protein genes. The trans-splicing gene rps12, consisting of three exons, was located in the LSC region for exon 1, but exon 2 and exon 3 of the gene were imbedded in the IR regions. Part of ycf 1 and rps19, for which we did not annotate in the four weedy Sonchus cp genomes that were sequenced in this study, were duplicated in the IR A region, forming pseudogenes. A pseudo ycf 1 gene in IR was extended to the SSC region and overlapped with the near ndhF gene (Figures 2 and 3; Tables 1 and 2). The genomic features of five weedy Sonchus genomes were nearly identical to congeneric species in the woody Sonchus alliance in the Macaronesian Islands (Atlantic Ocean), S. canariensis (NC042381), S. acaulis (NC042382), and S. webbii (NC042383) in gene content and overall GC rate [46]. The cp genomes of other Asteraceae species were also highly conservative in gene order and content with minor variations in the gene prediction of several genes (e.g., ycf 68 protein coding gene, 4.5 rRNA genes and pseudogenes), even though they represent morphologically and genetically diverse tribes of Asteraceae belonging to Anthemideae, Cardueae, Cichoroieae, Eupatorieae, Heliantheae, Millerieae, and Senecioneae [47][48][49].

SSRs and Large Repeat Sequences
Repeat sequences are considered to play an important role in genome recombination and rearrangement [50,51]. Particularly, SSRs, which represent a unique type of tandemly repeated genomic DNA sequence, have high polymorphisms due to large variations in motifs and number of repetitions. Because of the high level of polymorphisms and genome-wide distribution, they have been considered as powerful tools to measure genetic diversity and address population genetic issues at the level of inter-and intra-specific variations, such as gene flow, parentage, and population structure [52]. We found that all the five cp genomes contained the same numbers and distribution patterns of repeated sequences in each cp genome. There were 79 SSRs detected by MISA [53] based on search parameters set for 1-15 (mono-nucleotide motifs with 15 minimum numbers of repetition), 2-5, 3-3, 4-3, 5-3, and 6-3. The majority of the SSRs were tri-nucleotide motifs (66 SSRs, 84%). Remarkably, there were low proportions of other SSR types, i.e., three mono-nucleotide SSRs (4%), four di-nucleotide SSRs (5%), and six tetra-nucleotide SSRs (7%) ( Figure 4A). The abundance of tri-nucleotide SSRs was consistent with previous findings with similar parameter settings [54]; however, the frequency of mono-nucleotide SSRs was significantly lower because of the more stringent search parameter used in this study (i.e., minimum repeat of 15) than in previous studies (minimum repeat of 8 or 10) [47,54]. The most abundant repeat motif was "AAT/ATT" followed by "AAG/CTT" in all five genomes ( Figure 4B, Table S1). Interestingly, SSRs were distributed most abundantly in the coding regions (57%), followed by intergenic regions (38%), however much lower numbers were distributed in the non-coding introns (5%) in each cp genome. The coding regions with the highest number of SSRs were ycf genes as shown in Cynara cardunculus (globe artichoke) and other Asteraceae species [47]. Gene ycf 2 contained 10 SSRs (five duplicated in each IR) and ycf 1, three in SSC, emphasizing that these highly variable regions can be of specific interest to develop future cpSSR markers for phylogenetic studies of Sonchus species. Considering the quadripartite regional occupancy of SSRs, the IR and SSC regions were remarkably lower in overall SSR frequency compared with the LSC region: 19% from the SSC region and 10% from each of the two IR regions versus 61% from the LSC region (Table S2). In the case of the large repeats, we found 49 pairs in each cp genome using the parameters of maximum computed repeats = 50, minimum repeat size = 8 bp, and hamming distance = 0 by REPuter [37]. They contained 21 forward, 6 reverse, 1 complement, and 21 palindromic matches of repeats ( Figure 5A). Similar results were reported in previous studies with the majority of repeats in forward (21) and palindromic (13) in other Asteraceae species, Taraxacum plastomes (dandelions) [48]. Most of these large repeats were present in the intergenic spacers, but a large proportion was found within the ycf gene as in Ambrosia trifida (giant ragweed) [54]. Lengths of 20-22 repeats were the most frequent (45%) followed by lengths of 23-26 repeats (37%), with repeats of 27-30 (6%) and > 42 (12%) being quite rare ( Figure 5B).

Sequence Divergence and Hotspots
Nucleotide diversity among five weedy Sonchus cp genomes was estimated using DnaSP [38] with a sliding window analysis (window length = 1000 bp and step size = 200 bp excluding sites with alignment gaps). The divergence level was compared to the reference genome of S. webbii ( Figure 6). Overall nucleotide diversity value (Pi) among Sonchus chloroplast genomes including the two closely related species, S. asper and S. oleraceus, was 0.00117 and ranged from 0 to 0.00807. The SSC region showed the highest nucleotide diversity (0.001629) among the regions of LSC, SSC, and IRs, while the lowest value was in the IR boundary regions (0.000254). Ten divergence hotspots were suggested as the potential chloroplast markers for phylogenetic studies of Sonchus species: four intergenic regions (trnK-rps16, trnT-psbD, accD-psaI, and psbE-petL), two intron regions (trnL intron and rps16 intron), and four protein coding regions (psaA, ycf 1, ndhF, and ψycf 1). Most of them are located in the LSC region, but the most variable hotspots with the highest divergence values are in the SSC region. The result of mVISTA plotted against S. webbii also exhibited a high degree of synteny and gene order conservation among five weedy Sonchus cp genomes. A total of 528 polymorphic sites, which were identified in the DnaSP analysis, were visualized in mVISTA graph from mostly noncoding and intron regions, but also from several protein coding regions. The most divergent coding regions among Sonchus cp genomes were rpoB, rpoC2, atpA, accD, ycf 2, ndhF, and ycf 1 (Figure 7), while the coding genes of rpoC1, rpoC2, trnL, accD, clpP, psbB, ndhD, ycf 1, ndhA, rps16, and ndhF were presented as the most variable regions between Taraxacum and 18 other Asteraceae plastomes [48].

Phylogenetic Analysis
The taxonomic position and evolutionary relationship of S. asper and S. oleraceus sequenced in this study were assessed by comparative phylogenetic analysis among 30 representative Asteraceae species based on the sequences of whole cp genomes. For overall phylogenetic relationships within the family Asteraceae, the maximum likelihood tree supported the traditional taxonomy of the family, except the subfamily Asteroideae ( Figure 8). Asteroideae failed to form a monophyletic clade, as reported in a previous study [55]. Two tribes of Asteroideae, i.e., Heliantheae and Inuleae, were not nested in the monophyletic clade comprising other tribes of the same subfamily, Anthemideae, Gnaphalieae, Senecioneae, and Astereae. With regard to phylogenetic relationships within Sonchus, the genus formed a monophyletic clade (100% BS) within the tribe Cichorieae of the subfamily Cichorioideae. Sonchus was split into two subclades: i.e., the woody Sonchus alliance in the Canary Islands (100% BS) and the weedy Sonchus species distributed globally (100% BS). S. oleraceus and S. asper were nested in the same subclade, being closely related to each other, even though they were not reciprocally monophyletic. The whole chloroplast genome analysis suggested that both S. asper and S. oleraceus are not monophyletic at the chloroplast level. It is plausible that the species may in fact be monophyletic at the nuclear genome level but is paraphyletic at the chloroplast level, possibly due to incomplete lineage sorting. One lineage of S. oleraceus was represented by two accessions which were sampled from geographically widely separated regions (MK371006 from Spain and MG878405 from Australia) (98% BS) and they shared their most recent common ancestor with S. asper sampled from Spain (MK371016) (weak 33% BS). On the other hand, the accession of S. oleraceus (MH908962) sampled from Dok-do Island in the East Sea (between the Korean peninsula and Japanese archipelago) represented a different plastome type. It was deeply nested within the paraphyletic group of S. asper and, interestingly, the overall length of the whole plastome (151,849 bp) was the same as the ones of S. asper sampled from Spain (MK371015 and MK371016) ( Table 1). We consequently confirmed that the plant material studied from Dok-do Island showed typical morphological characteristics of S. oleraceus in its leaves and achenes in order to avoid misinterpretations from our phylogenetic results. The maximum likelihood tree showed that S. oleraceus from Dok-do Island (MH908962) is a sister clade containing one accession of S. asper (MK371016) and two accessions of S. oleraceus (MK371006 from Spain and MG878405 from Australia). Therefore, it is conceivable that S. oleraceus accession on Dok-do Island, which contains a distinct plastome type from two other accessions, represents a distinct lineage of S. oleraceus that is probably spread throughout Eurasia. Nevertheless, our results do not support the existence of clear geographic patterns for plastome lineages since the two accessions of S. asper from Spain (140 km apart, without apparent geographic barriers) seemed to originate from differentiated lineages and the samples of S. oleraceus from Spain and Australia (sampled in geographically remote areas) shared the same plastome profile. The current results generally support the hypothesis that S. asper contributed as the maternal parent in the hybrid origin of S. oleraceus in multiple lineages, a common feature in polyploidy processes [56,57]. However, the origin and evolution of amphidiploid S. oleraceus are yet to be determined precisely, which will likely require the inclusion of nuclear DNA analysis and cytological investigation (e.g., fluorescence in situ hybridization, FISH; or genomic in situ hybridization, GISH) to infer the paternal parents.

Conclusions
In this study, we sequenced, assembled, and annotated four cp genomes of two weedy Sonchus sow thistles (S. asper and S. oleraceus) and analyzed five cp genomes including an additional cp genome of S. oleraceus (Australia) obtained from GenBank. The results of comparative genomic analyses revealed that the genomes are highly conserved structurally, sharing most common genomic features of sequences, gene content, numbers, and distributions of repeated sequences despite the morphological and cytological differences between S. asper and S. oleraceus. The phylogenetic relationship based on whole chloroplast genome sequences suggested that amphidiploid S. oleraceus most likely originated multiple times from the closely related congeneric species S. asper. We believe that the SSRs, especially longer ones, show potential as useful markers if sample sizes are increased. Lastly, highly variable regions of both coding and noncoding regions were identified as potential phylogenetic markers for Sonchus species.

Conflicts of Interest:
The authors declare no conflict of interest.