Article

Development of a Target Enrichment Probe Set for Conifer (REMcon)

1 School of Biological Sciences, Faculty of Science, The University of Adelaide, Adelaide, SA 5005, Australia
2 State Herbarium of South Australia, Adelaide, SA 5000, Australia
3 CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650204, China
4 Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650204, China
* Author to whom correspondence should be addressed.
Biology 2024, 13(6), 361; https://doi.org/10.3390/biology13060361
Submission received: 27 March 2024 / Revised: 16 May 2024 / Accepted: 16 May 2024 / Published: 22 May 2024
(This article belongs to the Section Genetics and Genomics)

Simple Summary

Conifers are vital for both ecological and economic reasons, offering valuable insights into land plant evolution. Molecular phylogenetics plays a significant role in studying evolution, but research on conifers using large-scale data from multiple nuclear genes has been limited. Target enrichment sequencing has emerged as a crucial method in phylogenomic studies. However, a bait set designed specifically for conifers has been missing. The REMcon probe set targets around 100 single-copy nuclear loci for family- and species-level phylogenetic studies of conifers. High target recovery and read coverage were observed when the REMcon probe set was tested on 69 species, including conifers and other gymnosperm taxa. Phylogenetic analysis based on the DNA sequences generated with REMcon was consistent with the existing understanding of conifer relationships. The REMcon bait set will be beneficial in generating large-scale nuclear data consistently for any conifer lineage.

Abstract

Conifers are an ecologically and economically important seed plant group that can provide significant insights into the evolution of land plants. Molecular phylogenetics has developed as an important approach in evolutionary studies, although relatively few studies of conifers have employed large-scale data sourced from multiple nuclear genes. Target enrichment sequencing (target capture, exon capture, or Hyb-Seq) has become a key approach in modern phylogenomic studies. However, until now, there has been no bait set that specifically targets the entire conifer clade. REMcon is a target sequence capture probe set that targets c. 100 single-copy nuclear loci and is intended for family- and species-level phylogenetic studies of conifers. We tested the REMcon probe set on 69 species, including 44 conifer genera across six families and four other gymnosperm taxa, to evaluate how efficiently target capture generates comparable DNA sequence data across conifers. The recovery of target loci was high, with, on average, 94% of the targeted regions recovered across samples with high read coverage. A phylogenetic analysis of these data produced a well-supported topology that is consistent with the current understanding of relationships among conifers. The REMcon bait set will be useful in generating relatively large-scale nuclear data sets consistently for any conifer lineage.

1. Introduction

Conifers are the largest extant group among gymnosperms, with more than 722 species in 72 genera and 7 families, namely Araucariaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, Cephalotaxaceae, and Taxaceae [1,2,3]. They are of great economic, ecological, and evolutionary significance, comprising approximately 39% of the world’s forests, and have a fossil record spanning more than 300 million years [4,5,6,7]. Conifer diversity, trait evolution, genetic structure, and evolutionary history nevertheless remain poorly understood [3,5,6]. Molecular phylogenetic studies play an important role in understanding the mode and tempo of evolution amongst conifers, but to date, most studies have applied a limited range of markers, principally a small number of chloroplast loci plus nuclear ribosomal DNA regions, typically generated by direct amplicon sequencing (e.g., [5,8,9,10,11,12]). Further complicating the study of molecular evolution in this major land plant lineage is the large genome size and overall complexity of conifer genomes [13,14], leaving a notable gap in the exploration of conifers using large-scale data. Target enrichment by hybridization capture (e.g., Hyb-Seq; Weitemier et al. 2014) [15] provides an efficient and cost-effective approach for generating DNA sequence data for a large number of single- and low-copy nuclear gene regions across multiple samples.
Target enrichment approaches (File S1) have become the method of choice for many systematic, phylogenetic, and evolutionary studies in plants [16,17,18,19,20,21], in part fostered by the availability of ‘universal’ probe sets that can recover a common set of genes across broad evolutionary timescales. These include angiosperm-specific probe sets that target hundreds of nuclear genes [17,22] and one recently developed to enrich more than 400 nuclear genes across flagellate land plants [18]. While bait kits have been developed for individual conifer families (e.g., Pinaceae; [23]), we are not aware of a conifer-specific bait set that targets the entire clade.
Here we present a new molecular toolkit (REMcon) which, based upon published transcriptomic (The 1000 Plant Transcriptomes Initiative, 1KP; [24]) and genomic data [25,26], uses RNA baits to target approximately 100 low-copy nuclear genes across conifers. In the present study, we demonstrate the universality and application of these new molecular tools for reconstructing phylogenetic relationships among conifers based on a broad sample of gymnosperm taxa. The approach typically recovers conservative coding regions plus more variable non-coding regions that flank the exons and has application for evolutionary analyses among closely related species, as we will demonstrate in an upcoming study of the Podocarpus ‘Australis’ clade (Khan et al., in prep) of family Podocarpaceae [6,27,28].

2. Materials and Methods

2.1. Probe Design

Target enrichment probes were designed using genes selected from Duarte et al. (2010) [29], who identified a set of orthologous low-copy nuclear genes shared across angiosperms (Arabidopsis, Populus, Vitis and Oryza). For each of the selected genes, we extracted the putatively orthologous coding sequence (CDS) from the spruce (Picea abies, Pinaceae: https://plantgenie.org/, accessed on 4 April 2021) and western red cedar (Thuja plicata; https://phytozome-next.jgi.doe.gov/info/Tplicata_v3_1, accessed on 14 May 2021) genomes. These were used to retrieve putatively homologous transcript sequences for conifers from 1KP (https://www.onekp.com, accessed on 22 April 2021) using the China National GenBank (https://db.cngb.org, accessed on 12 April 2021) BLAST portal and the following settings: Discontiguous Mega-Blast, expect value = 10, maximum target sequences = 1000, selected organisms = Pinidae (taxid: 3313).
The sequences retrieved from the BLAST search for each target gene were downloaded and made into a BLAST database in Geneious (Kearse et al. 2012; https://www.geneious.com, accessed on 1 August 2021) [30]. We queried each BLAST database using the Thuja plicata gene family member with exon annotations manually added and the following settings: Discontiguous Mega-Blast, expect value = 10, maximum target sequences = 1200, results = Hit Table, retrieve = Matching Region with Annotation. We then extracted all sequences matching one or more exon annotations in P. abies, with the caveat that the exon was >180 bp in length to allow for bait tiling. The extracted sequences were clustered using CD-HIT-EST ([31,32,33]; http://weizhong-lab.ucsd.edu/cdhit_suite, accessed on 4 August 2021) with a sequence identity cut-off fraction of 0.88 (see [34,35]) and a length similarity fraction of 0.2, and one representative sequence (the longest) per cluster was selected. A total of 1124 representative sequences (mean length 1051 nucleotides, range 181–4416 nt) covering exons from 100 putative low-copy nuclear genes (Table 1; Table S1) were used for bait design with 120-nt baits and ~2× flexible tiling density for a total of 17,982 baits (see baits-Spruce [14853] for the nucleotide sequences of the baits). Bait design and synthesis were performed by Daicel Arbor Biosciences (formerly MYcroarray; Ann Arbor, MI, USA), which produced the myBaits Custom DNA-Seq kit™ used for target enrichment-based next-generation sequencing.
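To make the tiling step concrete, the following is a minimal sketch of how 120-nt baits could be laid across a representative sequence at roughly 2× density. It is not the vendor's design algorithm, which is not described here: the function name `tile_baits` and the rule of spacing bait starts evenly so that the final bait ends at the last base are assumptions made for illustration.

```python
import math


def tile_baits(seq, bait_len=120, density=2.0):
    """Return (start, bait) pairs that cover `seq` end to end.

    Assumption: "flexible" tiling is modelled as evenly spaced start
    positions, with spacing of about bait_len / density or less.
    """
    if len(seq) < bait_len:
        return []  # too short to hold one full-length bait
    step = bait_len / density       # nominal spacing: 60 nt at 2x density
    span = len(seq) - bait_len      # last valid 0-based start position
    n_baits = math.ceil(span / step) + 1
    if n_baits == 1:
        return [(0, seq[:bait_len])]
    starts = [round(i * span / (n_baits - 1)) for i in range(n_baits)]
    return [(s, seq[s:s + bait_len]) for s in starts]


# Hypothetical 240-nt exon: three baits starting at 0, 60 and 120.
exon = "ACGT" * 60
for start, bait in tile_baits(exon):
    print(start, len(bait))
```

As a rough consistency check using only the figures reported above, 17,982 baits × 120 nt is about 2.16 Mb of bait sequence laid over roughly 1124 × 1051 ≈ 1.18 Mb of target sequence, or about 1.8× coverage, which is in line with the stated ~2× tiling density.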

2.2. Taxon Selection

A total of 69 taxa, comprising 44 conifer genera representing six families, three species of Cycadales, and Ginkgo biloba, were included to evaluate the efficiency of target capture across conifers and more widely among gymnosperms. Most plant specimens were freshly collected from the living collections held at the Botanic Gardens of South Australia and dried in silica gel, and some were sampled from preserved specimens held at the State Herbarium of South Australia (Table S2).

2.3. DNA Extraction, Library Preparation, Hybrid Capture and Sequencing

For DNA extractions, about 15 mg of silica gel-dried leaf material per sample was homogenized in an Omni ruptor (Omni International, Kennesaw, GA, USA) using ceramic beads. DNA was extracted using the Qiagen Plant Mini kit (QIAGEN, Germantown, MD, USA) and normalized to 2 ng/µL before proceeding to library preparation, which followed the steps outlined in Waycott et al. [36] for their nuclear bait set. Hybrid capture was performed following the manual provided by myBaits with a hybridization temperature of 65 °C, and 150 bp paired-end sequencing was performed at the Australian Genome Research Facility (AGRF), Melbourne, Australia, on an Illumina NovaSeq S1 flow cell.

2.4. Bioinformatics Analyses

High-throughput paired-end reads were de-multiplexed and quality trimmed (using a Phred score threshold of 20) using CLC Genomics Workbench (v. 20; https://www.qiagenbioinformatics.com/, accessed on 22 April 2022). The Sequence Capture Processor pipeline (SECAPR v. 2.2.3; Andermann et al. [37]; http://antonellilab.github.io/seqcap_processor/, accessed on 24 April 2022) was used to generate nuclear DNA sequence data sets from the trimmed reads. First, the reads from each sample were assembled de novo using SPAdes [38] with default kmer values. Contigs matching the Thuja plicata reference sequences (i.e., annotated exons, see above) were extracted from the de novo assemblies with LASTZ v. 1.04 [39] using a target length of 0.5 and a similarity fraction of 0.75 (i.e., 50% of the contig has to overlap with the target gene and be no less than 75% similar). The extraction was run with the ‘keep paralogs’ flag both activated and deactivated to assess the extent of paralogy in the data. SECAPR identifies paralogs as multiple overlapping contigs matching a target sequence, keeping the longest contig if the ‘keep paralogs’ flag is activated. The extracted contigs were aligned per locus to produce multiple sequence alignments (MSAs) using MAFFT [40]. The aligned contigs were subsequently used for a reference-based assembly using the BWA read mapper v. 0.7.16a-r1181 [41] and the ‘sample specific’ flag, i.e., each sample is extracted from the alignment and mapped separately. Consensus sequences per sample from the subsequent read mappings were again aligned using MAFFT to produce MSAs for each targeted gene region. The approach developed by Yang and Smith [42] and modified with containerization for target capture data (Jackson et al. [43]; https://github.com/chrisjackson-pellicle/Yang-and-Smith-paralogy-resolution, accessed on 29 April 2022) was used to resolve groups of orthologous sequences (orthology inference) from targeted gene regions.
Following various filtering steps, the approach uses phylogenetic tree-based methods and the pruning of duplicated taxa from rooted phylogenies to resolve orthologous groups of sequences. In this study, de novo contigs for each sample from the SECAPR pipeline (above) were first imported into Geneious Prime (v. 2022.0.1; https://www.geneious.com, accessed on 24 May 2022) and made into a BLAST database. This database was queried using the extracted contigs from an outgroup (Ginkgo biloba) matching the targeted gene regions in P. abies. The contigs from Ginkgo were annotated with the CDS from P. abies, and the coding region(s) were queried against the BLAST database using blastn with a maximum expect value of 1e-10 and maximum hits set to 1000. The BLAST output was filtered using a minimum coverage fraction (query coverage of at least 0.4) to remove poorly aligned and short sequence fragments [42]. The resulting contigs were then used as input into the Yang-and-Smith-paralogy-resolution pipeline [43]. We used the monophyletic outgroups (MO) method to identify ortholog groups using reference genes from Ginkgo biloba as the outgroup. For downstream analyses, we retained alignments with >10 individuals in order to reduce the influence of missing data in tree inference. The aligned ortholog groups were concatenated, and a phylogeny was generated using IQ-TREE 2 [44], with ModelFinder [45] used to estimate the optimum partitioning scheme and partition-specific nucleotide substitution models (MFP+MERGE flag activated) and 1000 ultrafast bootstrap [46] replicates used to assess branch support.
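The two threshold-based filters described above (BLAST hits with an expect value of at most 1e-10 and query coverage of at least 0.4, and alignments with more than 10 individuals) can be sketched as follows. This is an illustration only and not the code used in the study: the `Hit` record and its field names are hypothetical, since the export format of the hit table is not specified, and query coverage is assumed to be the aligned query span divided by the query length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit; field names are hypothetical, not a real export format."""
    query: str
    subject: str
    q_start: int    # 1-based, inclusive position on the query
    q_end: int
    query_len: int
    evalue: float


def query_coverage(hit):
    """Fraction of the query covered by the aligned region (assumed definition)."""
    return (abs(hit.q_end - hit.q_start) + 1) / hit.query_len


def filter_hits(hits, max_evalue=1e-10, min_coverage=0.4):
    """Drop weak hits and short fragments."""
    return [h for h in hits
            if h.evalue <= max_evalue and query_coverage(h) >= min_coverage]


def filter_by_occupancy(alignments, min_samples=10):
    """Keep loci whose alignment holds MORE than `min_samples` sequences."""
    return {locus: seqs for locus, seqs in alignments.items()
            if len(seqs) > min_samples}


# Hypothetical hits: only the first passes both thresholds.
hits = [
    Hit("gene1", "contig_a", 1, 500, 1000, 1e-50),   # coverage 0.5, kept
    Hit("gene1", "contig_b", 1, 300, 1000, 1e-50),   # coverage 0.3, dropped
    Hit("gene1", "contig_c", 1, 900, 1000, 1e-5),    # e-value too high, dropped
]
print([h.subject for h in filter_hits(hits)])
```

Keeping the thresholds as keyword arguments makes it straightforward to explore how sensitive the retained data set is to either cut-off.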

3. Results and Discussion

The retrieval of target loci across the conifers was high, with, on average, 94% of the targeted gene regions recovered per sample (range 53–100), and 27 loci were recovered across all samples (Table 2). For the recovered loci, read coverage (read depth/position, averaged across loci) was also high, averaging c. 146 across the included taxa with a maximum of 622 in Chamaecyparis pisifera and a minimum of c. 6 in Agathis robusta (Figure 1, coverage heatmap). In general, the recovery of genes across the six conifer families was relatively consistent, and with the exception of Araucariaceae, the mean number of loci captured per family exceeded 90 (Table 2). The lower mean value for Araucariaceae (87 loci) is influenced by the poor recovery for Agathis robusta (57 loci), which is likely a consequence of DNA quality and/or issues with the library construction, given that target recovery amongst close relatives (e.g., Agathis microstachya, 94 loci) was high (Figure 1, Table S2). There was a lower recovery of target genes for the non-coniferous gymnosperm species (c. 80% of genes recovered across the four samples), although this is not unexpected given that these were not specifically targeted in the probe design. Furthermore, the identity of the target sequences (here, sourced from Thuja) could influence locus recovery with increasing evolutionary distance, and an approach similar to McLay et al. [47] may be valuable in increasing target locus recovery from specific lineages.
Overall, there was a large number of putatively paralogous gene copies recovered at most loci (c. 36% of loci/sample, averaged across all samples), but this was highly variable among taxa (Prumnopitys andina, Podocarpaceae: c. 13% of recovered loci; Sciadopitys verticillata, Sciadopityaceae: 80% of recovered loci) (Figure 2; Table S2). The extent of paralogy might reflect the generally large genome size of conifers, which is also highly variable, with at least an order of magnitude difference between the smallest and the largest conifer genomes [14]. Polyploidy is a major driver of genome size evolution amongst angiosperms, although until recently [48,49], this phenomenon was thought to be relatively rare among conifers (e.g., [13,50,51]). In addition, conifer genome size evolution has been attributed to other factors, such as a high copy number of long transposable elements (e.g., [25]). The distribution of paralogs in our data supports the view that genome size, per se, is only partly related to the frequency of duplicated genes. For instance, Podocarpaceae has the smallest average genome size [14] and the smallest proportion of paralogs in our data. On the other hand, Pinaceae generally have large genomes, and we found a large proportion of putative paralogs among samples from this family. However, the relatively large genome size among Araucariaceae is not strongly associated with a high number of paralogous genes in our data (Figure 2), suggesting that factors other than gene duplication (e.g., transposable elements, larger introns, and abundant pseudogenes; [13,25,48]) are also important drivers of genome size evolution.
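The recovery statistics reported above (percentage of targets recovered per sample, the mean across samples, and the loci recovered in every sample) can be derived from a simple table of which loci were recovered for which sample. The sketch below shows one way to compute them; the function name `summarize_recovery`, the sample labels, and the toy data are hypothetical and are not values from this study.

```python
def summarize_recovery(recovered, n_targets):
    """Summarize locus recovery.

    recovered: dict mapping sample name -> set of recovered locus names.
    n_targets: number of loci targeted by the bait set.
    Returns (percent recovered per sample, mean percent, loci found in all samples).
    """
    if not recovered:
        raise ValueError("no samples supplied")
    per_sample = {s: 100.0 * len(loci) / n_targets
                  for s, loci in recovered.items()}
    mean_pct = sum(per_sample.values()) / len(per_sample)
    shared = set.intersection(*(set(loci) for loci in recovered.values()))
    return per_sample, mean_pct, shared


# Toy example with four hypothetical target loci and three samples.
toy = {
    "sample_1": {"locus_a", "locus_b", "locus_c", "locus_d"},
    "sample_2": {"locus_a", "locus_b"},
    "sample_3": {"locus_a", "locus_b", "locus_c"},
}
per_sample, mean_pct, shared = summarize_recovery(toy, n_targets=4)
print(per_sample, mean_pct, sorted(shared))
```

Note that a single poorly performing sample lowers the mean and can sharply reduce the set of loci shared by all samples, which is why the per-sample range is reported alongside the average.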
Of the 100 targeted gene regions, 90 were recovered for Ginkgo, and these were included in the paralogy resolution analyses. Orthology inference recovered 98 MO ortholog groups, of which 95 included more than 10 samples and were retained for phylogenetic inference. The concatenated length of the 95 loci averaged c. 48,770 bp, with an aligned length of c. 74,179 bp and approximately 34% missing values. The average aligned length of the individual loci was c. 780 bp and ranged from 440 to 1691 bp. The concatenated alignment includes 47,469 (c. 64%) variable positions, of which 36,350 (c. 49%) are parsimony informative; within the conifer clade alone, 44,818 (c. 60%) positions are variable and 34,411 (c. 46%) are parsimony informative. The maximum likelihood topology inferred from these data is shown in Figure 3. Of the 65 clades recovered, only 7 have a bootstrap support value < 100, and of these, only 3 received less than 80% support (Figure 3). The inferred topology is generally in agreement with our current understanding of conifer relationships (e.g., [2,3,5]), while the poorly supported nodes are associated with short branches and may be inherently difficult to resolve (e.g., [52,53,54]). For example, the relationships within the Prumnopityoid clade of Podocarpaceae, and in particular the placement of Halocarpus, were found to be unstable in the recent analyses of Chen et al. [55] using a large transcriptome data set of c. 1000 nuclear and c. 40 chloroplast gene regions and are poorly resolved here (Figure 3).
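For readers unfamiliar with the alignment statistics quoted above, the following sketch counts variable and parsimony-informative columns in a nucleotide alignment. It uses the standard definitions (a variable site shows at least two different bases; a parsimony-informative site shows at least two bases that each occur in at least two sequences). Ignoring gaps and ambiguity codes is an assumption of this sketch, and the tool used in the study may treat them differently.

```python
from collections import Counter


def site_stats(alignment):
    """Return (variable, parsimony_informative) column counts.

    alignment: list of aligned sequences of equal length.
    Only A, C, G and T are counted; gaps and ambiguity codes are ignored.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("sequences must have equal length")
    variable = informative = 0
    for column in zip(*alignment):
        counts = Counter(base for base in (c.upper() for c in column)
                         if base in "ACGT")
        if len(counts) >= 2:
            variable += 1
            # informative: at least two states, each seen in >= 2 sequences
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative


# Toy alignment: column 1 is constant, column 3 is variable but has a
# singleton state, columns 2 and 4 are parsimony informative.
toy = ["ACGT", "ACGA", "AGGA", "AGCT"]
print(site_stats(toy))  # (3, 2)
```

The reported percentages follow directly from the counts and the aligned length: 47,469 / 74,179 ≈ 0.64 and 36,350 / 74,179 ≈ 0.49 for the full alignment, and 44,818 / 74,179 ≈ 0.60 and 34,411 / 74,179 ≈ 0.46 within the conifer clade.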

4. Conclusions

In conclusion, we present a conifer-specific hybrid-capture bait set that performs well in terms of the consistency of locus recovery across a broad range of gymnosperms, and the resulting data can be applied to credibly resolve deep phylogenetic relationships within the conifer clade. As part of ongoing studies (Khan et al. in prep) [6,27,28], we have found the REMcon bait set to be similarly successful in resolving relationships among closely related species groups within Podocarpaceae. The REMcon bait set offers an efficient and relatively cost-effective approach that fills an important gap in conifer and gymnosperm phylogenomics. This hybrid-capture bait set has promising future applications, including the resolution of complex phylogenetic relationships and population and comparative genomics, providing valuable insights into the evolution and conservation of conifers and other gymnosperms.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/biology13060361/s1, Table S1: Details of targeted genes; Table S2: Sample details; Table S3: Recovery of targeted gene regions across conifer families and four non-conifer gymnosperms. Locus recovery is averaged across samples (n), and the minimum and maximum recovery per sample is indicated. Figure S1: Paralogs/Family; File S1: HybCap76 NEBnext Podocarp03 half plate.

Author Contributions

Conceptualization, R.K., M.W. and E.B.; methodology, R.K., M.W., K.-j.v.D. and E.B.; software, E.B. and R.K.; validation, R.K., M.W., K.-j.v.D., R.S.H., J.L. and E.B.; formal analysis, E.B. and R.K.; investigation, E.B. and R.K.; resources, M.W. and R.S.H.; data curation, R.K. and E.B.; writing—original draft preparation, R.K. and E.B.; writing—review and editing, M.W., K.-j.v.D., R.S.H., J.L. and E.B.; supervision, M.W. and R.S.H.; project administration, M.W. and R.S.H.; funding acquisition, M.W. and R.S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Botanic Gardens and State Herbarium collaborative student support 2020–2022; collections digitization support from the Board of the Botanic Gardens and State Herbarium (South Australia) 2021–2022; PhD Student support from the University of Adelaide, School of Biological Sciences 2020–2022; and a capacity building grant from The Environment Institute, the University of Adelaide 2021; Key Research Program of Frontier Sciences, CAS (ZDBS-LY-7001), the National Natural Science Foundation of China (42211540718), the Top-notch Young Talents Project of Yunnan Provincial ‘Ten Thousand Talents Program’ (YNWR-QNBJ-2018-146), the Xingdian Talent Support Program of Yunnan Province (XDYC-QNRC-2022-0068), and the CAS ‘Light of West China’ Program. Raees Khan was supported by the Postdoctoral International Exchange Program of the Office of China Postdoctoral Council, the Postdoctoral Targeted Funding, and the Postdoctoral Research Fund of Yunnan Province.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article or Supplementary Materials.

Acknowledgments

We acknowledge the provision of plant material with permission from Adelaide Botanic Garden, Mount Lofty Botanical Garden South Australia, and the State Herbarium of South Australia.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ran, J.-H.; Gao, H.; Wang, X.-Q. Fast evolution of the retroprocessed mitochondrial rps3 gene in Conifer II and further evidence for the phylogeny of gymnosperms. Mol. Phylogenetics Evol. 2010, 54, 136–149.
  2. Yang, Y.; Ferguson, D.K.; Liu, B.; Mao, K.S.; Gao, L.M.; Zhang, S.Z.; Wan, T.; Rushforth, K.; Zhang, Z.X. Recent advances on phylogenomics of gymnosperms and an updated classification. Plant Divers. 2022, 44, 340–350.
  3. Khan, R.; Hill, R.S.; Liu, J.; Biffin, E. Diversity, Distribution, Systematics and Conservation Status of Podocarpaceae. Plants 2023, 12, 1171.
  4. Armenise, L.; Simeone, M.C.; Piredda, R.; Schirone, B. Validation of DNA barcoding as an efficient tool for taxon identification and detection of species diversity in Italian conifers. Eur. J. For. Res. 2012, 131, 1337–1353.
  5. Leslie, A.B.; Beaulieu, J.; Holman, G.; Campbell, C.S.; Mei, W.; Raubeson, L.R.; Mathews, S. An overview of extant conifer evolution from the perspective of the fossil record. Am. J. Bot. 2018, 105, 1531–1544.
  6. Khan, R.; Hill, R.S.; Dörken, V.M.; Biffin, E. Detailed seed cone morpho-anatomy of the Prumnopityoid clade: An insight into the origin and evolution of Podocarpaceae seed cones. Ann. Bot. 2022, 130, 637–655.
  7. Khan, R.; Hill, R.S.; Dörken, V.M.; Biffin, E. Detailed Seed Cone Morpho-Anatomy Provides New Insights into Seed Cone Origin and Evolution of Podocarpaceae; Podocarpoid and Dacrydioid Clades. Plants 2023, 12, 3903.
  8. Kelch, D.G. Phylogeny of Podocarpaceae: Comparison of evidence from morphology and 18S rDNA. Am. J. Bot. 1998, 85, 986–996.
  9. Conran, J.G.; Wood, G.M.; Martin, P.G.; Dowd, J.M.; Quinn, C.J.; Gadek, P.A.; Price, R.A. Generic relationships within and between the gymnosperm families Podocarpaceae and Phyllocladaceae based on an analysis of the chloroplast gene rbcL. Aust. J. Bot. 2000, 48, 715–724.
  10. Sinclair, W.; Mill, R.; Gardner, M.; Woltz, P.; Jaffré, T.; Preston, J.; Hollingsworth, M.; Ponge, A.; Möller, M. Evolutionary relationships of the New Caledonian heterotrophic conifer, Parasitaxus usta (Podocarpaceae), inferred from chloroplast trn LF intron/spacer and nuclear rDNA ITS2 sequences. Plant Syst. Evol. 2002, 233, 79–104.
  11. Knopf, P.; Schulz, C.; Little, D.P.; Stützel, T.; Stevenson, D.W. Relationships within Podocarpaceae based on DNA sequence, anatomical, morphological, and biogeographical data. Cladistics 2012, 28, 271–299.
  12. Little, D.P.; Knopf, P.; Schulz, C. DNA barcode identification of Podocarpaceae—The second largest conifer family. PLoS ONE 2013, 8, e81008.
  13. Ahuja, M.R.; Neale, D.B. Evolution of genome size in conifers. Silvae Genet. 2005, 54, 126–137.
  14. Zonneveld, B.J.M. Conifer genome sizes of 172 species, covering 64 of 67 genera, range from 8 to 72 picogram. Nord. J. Bot. 2012, 30, 490–502.
  15. Weitemier, K.; Straub, S.C.; Cronn, R.C.; Fishbein, M.; Schmickl, R.; McDonnell, A.; Liston, A. Hyb-Seq: Combining target enrichment and genome skimming for plant phylogenomics. Appl. Plant Sci. 2014, 2, 1400042.
  16. Vatanparast, M.; Powell, A.; Doyle, J.J.; Egan, A.N. Targeting legume loci: A comparison of three methods for target enrichment bait design in Leguminosae phylogenomics. Appl. Plant Sci. 2018, 6, e1036.
  17. Johnson, M.G.; Pokorny, L.; Dodsworth, S.; Botigue, L.R.; Cowan, R.S.; Devault, A.; Eiserhardt, W.L.; Epitawalage, N.; Forest, F.; Kim, J.T.; et al. A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering. Syst. Biol. 2019, 68, 594–606.
  18. Breinholt, J.W.; Carey, S.B.; Tiley, G.P.; Davis, E.C.; Endara, L.; McDaniel, S.F.; Neves, L.G.; Sessa, E.B.; von Konrat, M.; Chantanaorrapint, S.; et al. A target enrichment probe set for resolving the flagellate land plant tree of life. Appl. Plant Sci. 2021, 9, e11406.
  19. Shah, T.; Schneider, J.V.; Zizka, G.; Maurin, O.; Baker, W.; Forest, F.; Brewer, G.E.; Savolainen, V.; Darbyshire, I.; Larridon, I. Joining forces in Ochnaceae phylogenomics: A tale of two targeted sequencing probe kits. Am. J. Bot. 2021, 108, 1201–1216.
  20. Baker, W.; Dodsworth, S.; Forest, F.; Graham, S.; Johnson, M.; McDonnell, A.; Pokorny, L.; Tate, J.A.; Wicke, S.; Wickett, N. Exploring Angiosperms353: An open, community toolkit for collaborative phylogenomic research on flowering plants. Am. J. Bot. 2021, 108, 1059–1065.
  21. Zuntini, A.R.; Carruthers, T.; Maurin, O.; Bailey, P.C.; Leempoel, K.; Brewer, G.E.; Epitawalage, N.; Françoso, E.; Gallego-Paramo, B.; Baker, W.J.; et al. Phylogenomics and the rise of the angiosperms. Nature, 2024; Online ahead of print.
  22. Léveillé-Bourret, É.; Starr, J.R.; Ford, B.A.; Moriarty Lemmon, E.; Lemmon, A.R. Resolving rapid radiations within angiosperm families using anchored phylogenomics. Syst. Biol. 2018, 67, 94–112.
  23. Montes, J.R.; Peláez, P.; Willyard, A.; Moreno-Letelier, A.; Piñero, D.; Gernandt, D.S. Phylogenetics of Pinus subsection Cembroides Engelm. (Pinaceae) inferred from low-copy nuclear gene sequences. Syst. Bot. 2019, 44, 501–518.
  24. Leebens-Mack, J.H.; Barker, M.S.; Carpenter, E.J.; Deyholos, M.K.; Gitzendanner, M.A.; Graham, S.W.; Grosse, I.; Li, Z.; Melkonian, M.; Mirarab, S.; et al. One thousand plant transcriptomes and the phylogenomics of green plants. Nature 2019, 574, 679–685.
  25. Nystedt, B.; Street, N.R.; Wetterbom, A.; Zuccolo, A.; Lin, Y.C.; Scofield, D.G.; Vezzi, F.; Delhomme, N.; Giacomello, S.; Jansson, S.; et al. The Norway spruce genome sequence and conifer genome evolution. Nature 2013, 497, 579–584.
  26. Shalev, T.J.; El-Dien, O.G.; Yuen, M.M.; Shengqiang, S.; Jackman, S.D.; Warren, R.L.; Coombe, L.; van der Merwe, L.; Stewart, A.; Bohlmann, J.; et al. The western redcedar genome reveals low genetic diversity in a self-compatible conifer. Genome Res. 2022, 32, 1952–1964.
  27. Khan, R.; Hill, R.S. Reproductive and leaf morpho-anatomy of the Australian alpine podocarp and comparison with the Australis subclade. Bot. Lett. 2022, 169, 237–249.
  28. Khan, R.; Hill, R.S. Morpho-anatomical affinities and evolutionary relationships of three paleoendemic podocarp genera based on seed cone traits. Ann. Bot. 2021, 128, 887–902.
  29. Duarte, J.M.; Wall, P.K.; Edger, P.P.; Landherr, L.L.; Ma, H.; Pires, P.K.; Leebens-Mack, J.; dePamphilis, C.W. Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evol. Biol. 2010, 10, 1–18.
  30. Kearse, M.; Moir, R.; Wilson, A.; Stones-Havas, S.; Cheung, M.; Sturrock, S.; Buxton, S.; Cooper, A.; Markowitz, S.; Duran, C.; et al. Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 2012, 28, 1647–1649.
  31. Li, W.; Jaroszewski, L.; Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001, 17, 282–283.
  32. Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
  33. Huang, Y.; Niu, B.; Gao, Y.; Fu, L.; Li, W. CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics 2010, 26, 680–682.
  34. Hancock-Hanser, B.L.; Frey, A.; Leslie, M.S.; Dutton, P.H.; Archer, F.I.; Morin, P.A. Targeted multiplex next-generation sequencing: Advances in techniques of mitochondrial and nuclear DNA sequencing for population genomics. Mol. Ecol. Resour. 2013, 13, 254–268.
  35. Hugall, A.F.; O’Hara, T.D.; Hunjan, S.; Nilsen, R.; Moussalli, A. An exon-capture system for the entire class Ophiuroidea. Mol. Biol. Evol. 2015, 33, 281–294.
  36. Waycott, M.; van Dijk, J.K.; Biffin, E. A hybrid capture RNA bait set for resolving genetic and evolutionary relationships in angiosperms from deep phylogeny to intraspecific lineage hybridization. bioRxiv 2022.
  37. Andermann, T.; Cano, Á.; Zizka, A.; Bacon, C.; Antonelli, A. SECAPR—A bioinformatics pipeline for the rapid and user-friendly processing of targeted enriched Illumina sequences, from raw reads to alignments. PeerJ 2018, 6, e5175.
  38. Bankevich, A.; Nurk, S.; Antipov, D.; Gurevich, A.A.; Dvorkin, M.; Kulikov, A.S.; Lesin, V.M.; Nikolenko, S.I.; Pham, S.; Prjibelski, A.D.; et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 2012, 19, 455–477.
  39. Harris, R.S. Improved Pairwise Alignment of Genomic DNA. Ph.D. Thesis, The Pennsylvania State University, State College, PA, USA, 2007.
  40. Katoh, K.; Standley, D.M. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 2013, 30, 772–780.
  41. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009, 25, 1754–1760.
  42. Yang, Y.; Smith, S.A. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: Improving accuracy and matrix occupancy for phylogenomics. Mol. Biol. Evol. 2014, 31, 3081–3092.
  43. Jackson, C.; McLay, T.; Schmidt-Lebuhn, A.N. hybpiper-nf and paragone-nf: Containerization and additional options for target capture assembly and paralog resolution. Appl. Plant Sci. 2023, 11, e11532.
  44. Minh, B.Q.; Schmidt, H.A.; Chernomor, O.; Schrempf, D.; Woodhams, M.D.; Von Haeseler, A.; Lanfear, R. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 2020, 37, 1530–1534.
  45. Kalyaanamoorthy, S.; Minh, B.Q.; Wong, T.K.; Von Haeseler, A.; Jermiin, L.S. ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods 2017, 14, 587–589.
  46. Hoang, D.T.; Chernomor, O.; Von Haeseler, A.; Minh, B.Q.; Vinh, L.S. UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 2018, 35, 518–522.
  47. McLay, T.G.; Birch, J.L.; Gunn, B.F.; Ning, W.; Tate, J.A.; Nauheimer, L.; Joyce, E.M.; Simpson, L.; Schmidt-Lebuhn, A.N.; Baker, W.J.; et al. New targets acquired: Improving locus recovery from the Angiosperms353 probe set. Appl. Plant Sci. 2021, 9, e11420.
  48. Li, Z.; Baniaga, A.E.; Sessa, E.B.; Scascitelli, M.; Graham, S.W.; Rieseberg, L.H.; Barker, M.S. Early genome duplications in conifers and other seed plants. Sci. Adv. 2015, 1, e1501084.
  49. Stull, G.W.; Qu, X.J.; Parins-Fukuchi, C.; Yang, Y.Y.; Yang, J.B.; Yang, Z.Y.; Hu, Y.; Ma, H.; Soltis, P.S.; Soltis, D.E.; et al. Gene duplications and phylogenomic conflict underlie major pulses of phenotypic evolution in gymnosperms. Nat. Plants 2021, 7, 1015–1025. [Google Scholar] [CrossRef] [PubMed]
  50. Murray, B.G. Nuclear DNA amounts in gymnosperms. Ann. Bot. 1998, 82 (Suppl. 1), 3–15. [Google Scholar] [CrossRef]
  51. Kinlaw, C.S.; Neale, D.B. Complex gene families in pine genomes. Trends Plant Sci. 1997, 2, 356–359. [Google Scholar] [CrossRef]
  52. Philippe, H.; Brinkmann, H.; Lavrov, D.V.; Littlewood, D.T.J.; Manuel, M.; Wörheide, G.; Baurain, D. Resolving difficult phylogenetic questions: Why more sequences are not enough. PLoS Biol. 2011, 9, e1000602. [Google Scholar] [CrossRef]
  53. Whitfield, J.B.; Lockhart, P.J. Deciphering ancient rapid radiations. Trends Ecol. Evol. 2007, 22, 258–265. [Google Scholar] [CrossRef] [PubMed]
  54. Mongiardino Koch, N. Phylogenomic subsampling and the search for phylogenetically reliable loci. Mol. Biol. Evol. 2021, 38, 4025–4038. [Google Scholar] [CrossRef] [PubMed]
  55. Chen, L.; Jin, W.T.; Liu, X.Q.; Wang, X.Q. New insights into the phylogeny and evolution of Podocarpaceae inferred from transcriptomic data. Mol. Phylogenetics Evol. 2022, 166, 107341. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Heatmap showing gene recovery and read depth across samples. The gene recovery includes samples that were flagged as potentially paralogous using the SECAPR pipeline. Sample abbreviations as in Table S1.
Figure 2. Box and whisker plots showing the recovery of paralogs (number of paralogous genes per sample) by family. Abbreviations: Ar., Araucariaceae; Cu., Cupressaceae; Pi., Pinaceae; Po., Podocarpaceae; Ta., Taxaceae; Sc., Sciadopityaceae. X = mean; middle horizontal bar = median; the upper and lower bounds of the box are the 75th and 25th percentiles.
Figure 3. Maximum likelihood phylogeny estimated from the concatenated MO ortholog gene alignments. All branches have maximum bootstrap support unless indicated adjacent to the branch. Sample abbreviations as in Table S1.
Table 1. Targeted nuclear gene regions and the length of the probe sequences.

| S# | Picea abies Gene Name | Arabidopsis thaliana Putative Homolog | Length of the Probe Sequences (bp) |
|---|---|---|---|
| 1 | MA_10437158 | AT5G06430 | 195 |
| 2 | MA_10437143 | AT1G12370 | 480 |
| 3 | MA_10437077 | AT5G02250 | 621 |
| 4 | MA_10437070 | AT5G10920 | 720 |
| 5 | MA_10436603 | AT1G03750 | 945 |
| 6 | MA_10436489 | AT4G37510 | 878 |
| 7 | MA_10435966 | AT4G38890 | 822 |
| 8 | MA_10435879 | AT2G33630 | 613 |
| 9 | MA_10435851 | AT5G04520 | 510 |
| 10 | MA_10435433 | AT2G44760 | 426 |
| 11 | MA_10435005 | AT2G40570 | 787 |
| 12 | MA_10434812 | AT1G36310 | 1088 |
| 13 | MA_10434753 | AT1G49380 | 539 |
| 14 | MA_10433768 | AT2G31955 | 942 |
| 15 | MA_10433107 | AT5G64150 | 825 |
| 16 | MA_10432498 | AT1G74640 | 453 |
| 17 | MA_10431375 | AT2G24830 | 321 |
| 18 | MA_10430781 | AT4G35910 | 432 |
| 19 | MA_10429426 | AT1G30070 | 240 |
| 20 | MA_10428930 | AT1G15390 | 256 |
| 21 | MA_10428614 | AT2G34640 | 259 |
| 22 | MA_10428345 | AT1G57770.1 | 315 |
| 23 | MA_10428134 | AT2G04560 | 273 |
| 24 | MA_10427767 | AT1G21370 | 291 |
| 25 | MA_10427729 | AT5G67530 | 1224 |
| 26 | MA_10427590 | AT1G17160 | 480 |
| 27 | MA_10427203 | AT2G36740 | 543 |
| 28 | MA_10426631 | AT4G36390 | 1533 |
| 29 | MA_10426581 | AT2G33450 | 231 |
| 30 | MA_10426376 | AT2G38270 | 504 |
| 31 | MA_9578808 | AT4G18372 | 387 |
| 32 | MA_9514062 | AT5G20220 | 315 |
| 33 | MA_9503281 | AT1G48175 | 257 |
| 34 | MA_8815984 | AT2G34640 | 1693 |
| 35 | MA_8715484 | AT4G38020 | 501 |
| 36 | MA_8687206 | AT4G26980 | 408 |
| 37 | MA_8286794 | AT3G17170 | 342 |
| 38 | MA_8140147 | AT2G28605 | 480 |
| 39 | MA_7890741 | AT2G44660 | 783 |
| 40 | MA_5587080 | AT4G20060 | 447 |
| 41 | MA_957334 | AT1G05055 | 462 |
| 42 | MA_945784 | AT5G06410 | 380 |
| 43 | MA_939779 | AT4G27390 | 468 |
| 44 | MA_938037 | AT5G49570 | 580 |
| 45 | MA_894439_ | AT2G30100 | 1306 |
| 46 | MA_824260 | AT4G28020 | 441 |
| 47 | MA_762004 | AT1G28560 | 675 |
| 48 | MA_759516 | AT5G08720 | 461 |
| 49 | MA_749379 | AT4G11980 | 201 |
| 50 | MA_587488 | AT4G01040 | 377 |
| 51 | MA_546546 | AT4G17760 | 252 |
| 52 | MA_537299 | AT5G54840 | 264 |
| 53 | MA_458270 | AT5G06830 | 690 |
| 54 | MA_388031 | AT2G20330 | 486 |
| 55 | MA_341112 | AT5G11980 | 276 |
| 56 | MA_332596 | AT2G34460 | 333 |
| 57 | MA_314789 | AT1G56345.1 | 603 |
| 58 | MA_261436 | AT4G33030 | 1290 |
| 59 | MA_253636 | AT3G51050 | 768 |
| 60 | MA_225872 | AT5G14260 | 456 |
| 61 | MA_224167 | AT2G20790 | 900 |
| 62 | MA_199851 | AT3G01660 | 350 |
| 63 | MA_196209 | AT4G36530 | 273 |
| 64 | MA_187402 | AT4G31460 | 471 |
| 65 | MA_173127 | AT4G28740 | 548 |
| 66 | MA_159115 | AT2G27600 | 1191 |
| 67 | MA_159115 | AT4G27600 | 1056 |
| 68 | MA_127668 | AT3G15290 | 465 |
| 69 | MA_123340 | AT2G19870 | 1137 |
| 70 | MA_121485 | AT1G02410 | 749 |
| 71 | MA_121026 | AT1G08460 | 570 |
| 72 | MA_106933 | AT2G26680 | 1636 |
| 73 | MA_104872 | AT3G26580 | 507 |
| 74 | MA_99242 | AT4G29070 | 412 |
| 75 | MA_98424 | AT1G07130 | 558 |
| 76 | MA_95157 | AT5G09820 | 292 |
| 77 | MA_83545 | AT5G65860 | 514 |
| 78 | MA_78599 | AT2G40760 | 252 |
| 79 | MA_73742 | AT2G21840 | 939 |
| 80 | MA_73742 | AT1G21840 | 939 |
| 81 | MA_67861 | AT2G26680 | 369 |
| 82 | MA_66902 | AT2G36145 | 234 |
| 83 | MA_66902 | AT2G34145 | 234 |
| 84 | MA_63465 | AT3G24315 | 290 |
| 85 | MA_61548 | AT1G65030 | 681 |
| 86 | MA_55048 | AT5G19130 | 858 |
| 87 | MA_43083 | AT5G48330 | 717 |
| 88 | MA_41847 | AT3G03790 | 303 |
| 89 | MA_35149 | AT3G02300 | 431 |
| 90 | MA_34295 | AT1G43580 | 378 |
| 91 | MA_30194 | AT5G16210 | 369 |
| 92 | MA_29076 | AT3G57910 | 513 |
| 93 | MA_26068 | AT2G37560 | 414 |
| 94 | MA_25177 | AT1G07970 | 472 |
| 95 | MA_24252 | AT4G24090 | 600 |
| 96 | MA_19954 | AT2G02590 | 414 |
| 97 | MA_11407 | AT3G47860 | 312 |
| 98 | MA_10909 | AT2G04270 | 318 |
| 99 | MA_6888 | AT3G24080 | 2286 |
| 100 | MA_4586 | AT2G22650 | 303 |
Table 2. Recovery of targeted gene regions, including potentially paralogous loci, across conifer families and four non-conifer gymnosperms. Locus recovery is averaged across samples (n), and the minimum and maximum recovery per sample are indicated.

| Family | n | Average Locus Recovery | Min | Max |
|---|---|---|---|---|
| Araucariaceae | 5 | 85 | 53 | 97 |
| Cupressaceae | 22 | 98 | 89 | 100 |
| Pinaceae | 11 | 96 | 95 | 90 |
| Podocarpaceae | 26 | 93 | 76 | 97 |
| Sciadopityaceae | 1 | 95 | - | - |
| Taxaceae | 3 | 97 | 96 | 97 |
| Non-conifer gymnosperms | 4 | 81 | 75 | 92 |

Khan, R.; Biffin, E.; van Dijk, K.-j.; Hill, R.S.; Liu, J.; Waycott, M. Development of a Target Enrichment Probe Set for Conifer (REMcon). Biology 2024, 13, 361. https://doi.org/10.3390/biology13060361

AMA Style

Khan R, Biffin E, van Dijk K-j, Hill RS, Liu J, Waycott M. Development of a Target Enrichment Probe Set for Conifer (REMcon). Biology. 2024; 13(6):361. https://doi.org/10.3390/biology13060361

Chicago/Turabian Style

Khan, Raees, Ed Biffin, Kor-jent van Dijk, Robert S. Hill, Jie Liu, and Michelle Waycott. 2024. "Development of a Target Enrichment Probe Set for Conifer (REMcon)" Biology 13, no. 6: 361. https://doi.org/10.3390/biology13060361

APA Style

Khan, R., Biffin, E., van Dijk, K. -j., Hill, R. S., Liu, J., & Waycott, M. (2024). Development of a Target Enrichment Probe Set for Conifer (REMcon). Biology, 13(6), 361. https://doi.org/10.3390/biology13060361

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop