The Importance of Genomics for Deciphering the Invasion Success of the Seagrass Halophila stipulacea in the Changing Mediterranean Sea

: The Mediterranean Sea is subject to pressures from biological invasion due to coastal anthropic activities and global warming, which potentially modify its biogeography. The Red Sea tropical seagrass Halophila stipulacea entered the Eastern Mediterranean over a century ago, and its occurrence is expanding towards the northwest. Here, we highlight the importance of genomics for deciphering the evolutionary and ecological procedures taking place during the invasion process of H. stipulacea and review the relatively sparse genetic information available for the species to date. We report the ﬁrst draft whole-genome sequencing of a H. stipulacea individual from Greece, based on Illumina Sequencing technology. A comparison of the Internal Transcribed Spacer (ITS) regions revealed a high divergence of the herein sequenced individual compared to Mediterranean populations sequenced two decades ago, rendering further questions on the evolutionary processes taking place during H. stipulacea adaptation in the invaded Mediterranean Sea. Our work sets the baseline for a future analysis of the invasion genomic of the focal species.


Introduction
Biological invasion is a major force of altering marine ecosystems [1], as anthropic activity and climate change redistribute diversity outside their native biogeographical range [2]. The Mediterranean Sea has become a hot spot of non-indigenous species of tropical origin due to the Suez Canal opening, along with increased maritime traffic and aquaculture activities [3]. At the same time, the resemblance of the thermal conditions in the Mediterranean to the natural ranges of tropical species facilitates the expansion and establishment of tropical species in the region [4]. Furthermore, the Mediterranean Sea, particularly the Eastern part of the basin, are warming up, suggesting that the performance and competitive capacity of tropical species may be enhanced in the future Mediterranean Sea contrary to that of their native counterparts, which appear to be detrimentally affected by warming [5].
Halophila stipulacea (Forsskål) Ascherson 1867 is a tropical seagrass species (native to the Red Sea and Indian Ocean) that entered the Mediterranean Sea following the opening of the Suez Canal in 1869. At present, the species occupies the Eastern and Central Mediterranean, but preliminary estimations show that it will be able to expand throughout the Mediterranean during the next 100 years [6]. Species traits that facilitate the successful invasion of H. stipulacea include its ability to extend over a wide range and despite the expansion of H. stipulacea, to our knowledge, no relevant information on H. stipulacea genome is currently available.
Here, we aim to review the genetic information available for H. stipulacea and provide the first draft whole-genome assembly of the species, offering the seagrass community a tool that will enhance H. stipulacea research at the genomics level. Most importantly, it will offer the possibility to study the invasion genomics of the species through the application of further genomic tools (e.g., RAD-Seq technologies) which substantially benefit from the availability of reference genomes. Such experiments will unveil the population structure of invasive versus source populations and will allow researchers to search for candidate loci responsible for the invasion success and adaptation to the new environment.

Literature Survey
To get an overview of the research interest on the genetics of our focal species, we performed a literature review of studies related to seagrass genomics and/or genetics using the Web of Science search engine and the search terms "seagrass" AND "genetic" OR "genomic". We took into account all relevant publications published until February 2020. The search produced a total of 347 results.

Sampling
Divers collected H. stipulacea shoots by hand on 29/5/2019 at a depth of 20 m from Crete, Greece (Hersonissos, 35 • 18 53.74"N, 25 • 25 7.23"E). The shoots were transferred to the laboratory in seawater immediately after sampling and were gently cleaned to remove debris and epiphytes. A single fragment of H. stipulacea bearing five shoots that did not have any signs of degradation was selected for further preparation. Extra care was taken to only select green leaves and rhizomes (no roots). The tissues were immersed in RNAlater TM Stabilization Solution, incubated for 48 h at 4 • C, and then stored at −80 • C until DNA extraction.

DNA Extraction, Library Preparation, and Sequencing
High molecular weight DNA was extracted using a modified protocol of the CTAB chloroform/isoamyl alcohol (24:1) isolation method, followed by post-extraction RNase treatment (Ambion ® RNase Cocktail™, ThermoFisher Scientific, 27 Forge Parkway, Franklin, MA, USA) [39,[47][48][49] and cleaning with ProNex-beads technology (ProNex ® Size-Selective Purification System, Promega Corporation 2800 Woods Hollow Road Madison, WI, USA), allowing us to eliminate any RNA traces and enzyme traces, improve the purity, and size select for next generation whole-genome sequencing. The final elution was made with 50 µL EB Qiagen's ® buffer (Tris-EDTA pH = 8. The DNA integrity was assessed by electrophoresis in 0.4% w/v megabase agarose gel. Template DNA for Illumina sequencing was sheared by ultrasonication by employing a Covaris instrument. A PCR-free library was prepared with the Kapa Hyper Prep DNA kit with TruSeq Unique Dual Indexing. Paired end 2 × 150 bp sequencing was performed at the Norwegian Sequencing Centre (NSC) on an Illumina Hiseq4000 platform.

Bioinformatic Analyses
All raw sequences have been uploaded to the NCBI SRA database (BioProject ID: PRJNA642709). The quality of the Illumina Sequencing reads was assessed using FastQC (https://www.bioinformatics. babraham.ac.uk/projects/fastqc/). Then, to remove low-quality reads and adapters and trim low-quality read edges, we used fastp v0.19.10 (https://github.com/OpenGene/fastp) with the parameters '-detect_adapter_for_pe -w 20 -l 140 -q 25 -u 20 -y -c'. The high-quality reads that passed the filtering criteria were used for downstream analyses. To evaluate the level of contamination of the sequenced reads, we used kraken2 [50] against the precompiled database 16S_Silva138_20200326.
Prior to assembly, we estimated the H. stipulacea genome size using kmergenie (http://kmergenie. bx.psu.edu/). Then, we input the quality-filtered reads in Spades v3.13.0 [51] to build the genome assembly. The produced assembly was assessed in terms of contiguity using QUAST [52] and quality with BUSCO v3 [53] using the database odb10. BUSCO assesses the presence of key genes in a taxonomically-informed manner. For our focal species, we searched the produced assembly against the viridiplantae geneset. The resulting assembly has been uploaded in the NCBI (BioProject ID: PRJNA642712).
Finally, to see how the sequenced individual compared to other H. stipulacea individuals, we downloaded all available H. stipulacea ITS sequences from the NCBI (see Supplementary File 1 for the list). The complete sequences were kept and used for a blast search against the H. stipulacea genome, with the aim of identifying the homologous sequence. All the different queries returned one contig as the best hit, which included the ITS of the sequenced individual. The top hit contig was combined with all downloaded sequences and aligned with mafft (-auto) [54]. The aligned sequence was manually curated in Jalview2 [55] (see Supplementary File 2 for the alignment). Then, the alignment was used for phylogenetic analysis in RaxML (model GTRCAT) using 100 bootstraps for branch support [56].

Research Interest on H. stipulacea Genetics
Although the need for the application of genomic techniques to seagrasses was recognized early enough, it still lags behind compared to other taxa, as important challenges emerge at multiple levels during the application of such technologies, e.g., tissue sampling, the application of wet lab techniques, and bioinformatic analyses. For H. stipulacea in particular, research interest on molecular profiling is limited compared to other seagrass species ( Figure 1). It appears that there is no information on H. stipulacea genome, and even any genetic information is extremely limited [57][58][59][60], despite the increase in the total research effort for this species shown above.
25 -u 20 -y -c'. The high-quality reads that passed the filtering criteria were used for downstream analyses. To evaluate the level of contamination of the sequenced reads, we used kraken2 [50] against the precompiled database 16S_Silva138_20200326. Prior to assembly, we estimated the H. stipulacea genome size using kmergenie (http://kmergenie.bx.psu.edu/). Then, we input the quality-filtered reads in Spades v3.13.0 [51] to build the genome assembly. The produced assembly was assessed in terms of contiguity using QUAST [52] and quality with BUSCO v3 [53] using the database odb10. BUSCO assesses the presence of key genes in a taxonomically-informed manner. For our focal species, we searched the produced assembly against the viridiplantae geneset. The resulting assembly has been uploaded in the NCBI (BioProject ID: PRJNA642712).
Finally, to see how the sequenced individual compared to other H. stipulacea individuals, we downloaded all available H. stipulacea ITS sequences from the NCBI (see Supplementary File 1 for the list). The complete sequences were kept and used for a blast search against the H. stipulacea genome, with the aim of identifying the homologous sequence. All the different queries returned one contig as the best hit, which included the ITS of the sequenced individual. The top hit contig was combined with all downloaded sequences and aligned with mafft (--auto) [54]. The aligned sequence was manually curated in Jalview2 [55] (see Suppl. File 2 for the alignment). Then, the alignment was used for phylogenetic analysis in RaxML (model GTRCAT) using 100 bootstraps for branch support [56].

Research Interest on H. stipulacea Genetics
Although the need for the application of genomic techniques to seagrasses was recognized early enough, it still lags behind compared to other taxa, as important challenges emerge at multiple levels during the application of such technologies, e.g., tissue sampling, the application of wet lab techniques, and bioinformatic analyses. For H. stipulacea in particular, research interest on molecular profiling is limited compared to other seagrass species (Figure 1). It appears that there is no information on H. stipulacea genome, and even any genetic information is extremely limited [57][58][59][60], despite the increase in the total research effort for this species shown above. The existing studies mostly account for genetic diversification of the species using a single locus or a few multi-locus markers. Although undoubtedly important, such analyses cannot infer population genetic parameters at a deeper level, and cannot precisely address the origin of invasions or whether the introduction occurred once or multiple times. The comparison of ribosomal ITS regions between Red Sea and Mediterranean populations by [59] proposed that the studied invading populations originated from the Red Sea; a deduction based on the absence of differentiation among these populations for the ITS region, where as they underlined, phylogenetic analyses when using this ribosomal region, must be taken with caution. Moreover, results of the same study showed that The existing studies mostly account for genetic diversification of the species using a single locus or a few multi-locus markers. Although undoubtedly important, such analyses cannot infer population genetic parameters at a deeper level, and cannot precisely address the origin of invasions or whether the introduction occurred once or multiple times. The comparison of ribosomal ITS regions between Red Sea and Mediterranean populations by [59] proposed that the studied invading populations originated from the Red Sea; a deduction based on the absence of differentiation among these populations for the ITS region, where as they underlined, phylogenetic analyses when using this ribosomal region, must be taken with caution. Moreover, results of the same study showed that there was a high degree of intra-individual variability for this DNA region, whilst [58] found no ITS intra-individual nucleotide diversity for an East-Aegean population. On the other hand, an analysis of the first extensive population recorded in the western part of the basin with randomly amplified polymorphic DNA (RAPD) markers showed a high genetic diversity and clear genetic difference between shallow and deep stands of the same population [57]. Using RAPD, [60] were able to confirm the molecular identity of El-Bardawil lake isolates of Halophila stipulacea and also verified the absence of intra-individual variability for the ITS region considered among the isolates from Turkey [58]. A recent comparative study on karyotypes across Halophila sp. [61] excludes the possibility of polyploidy being the factor for the observed intra-individual variability reported by [59], and revealed that our focal species has a relatively smaller 2C value (<10 pg) compared to other Halophila species [61]. These findings make the genome sequencing of H. stipulacea a much more feasible target compared to some other Halophila species (H. decipiens and H. spinulosa also have small genomes). The availability of the H. stipulacea genome is of extreme importance as it would enable scanning for genes related to H. stipulacea plasticity and invasiveness; assist with transcriptomic analyses; and also assist with the development of more polymorphic and reliable markers that span across the genome and thus with tracking, with a greater precision, the origin of invasions.

The First Draft Reference Genome of H. Stipulacea
Our sequencing effort yielded~626 million reads in total, summing~94.5 Gbases. Quality control led to the elimination of 77% of the raw dataset, maintaining 482 million high-quality reads summing up to~72 Gb. The filtered dataset was used for the genome size estimation, which resulted in a kmer of 43 and a genome size of 3.4 Gb. Based on the genome size estimation, we also calculated that the genomic coverage of our sequencing effort was~21X. Then, a draft genome assembly was built and assessed (see Table 1 for basic assembly statistics). The assembly covered 3.7 Gb overall, and scored 48.8% complete and 19.5% fragmented BUSCO genes, summing up to 68.3% discovered genes (294 genes out of 430 tested). A search for contaminants showed a low percentage of sequence contamination (0.71% overall), with top contaminant sequences coming from Archaeplastida (including green and red algae and terrestrial plants) and Cyanobacteria (see Supplementary File 3 for a complete list). Our genome assembly (3.4 Gb) is highly comparable to the estimated genome size (3.6 Gb) of the congeneric H. ovalis [62] (i.e., the same tool for genome estimation was used in both studies), which suggests a rather large genome for all Halophila species, but a slightly smaller genome size for H. stipulacea compared to H. ovalis, as also found in a karyotypic study [61]. The resulting draft genome assembly is close to our genome size estimation, confirming that H. stipulacea has a much larger genome compared to Z. marina, with a genome assembly of~200 Mb and BUSCO results of 93.9% complete and 1.9% fragmented. Note: * Given that all contigs are sorted in terms of their length, the N50 value refers to the length of the contig that divides the assembly into two equal parts in terms of bases; ** L50 is the number of contigs whose summed length is N50.
Based on the BUSCO results, it appears that we have captured at least~70% of the H. stipulacea genome using solely short reads. This outcome is rather satisfying given the difficulties characterizing the Halophila genomes, which are mainly reflected by their genome size and organization; factors that are most probably correlated with the number of repetitive sequences [63,64], such as "low-complexity" sequences [65], and transposable elements [66]. These genomic elements are considered as key contributors to the eukaryotic genome structure and in many plants, their abundance can be up to 90% of their genome sequence [63,67]. Bioinformatic data analysis in seagrasses, like in many plants, is usually complicated and more time-and resource-consuming relative to animal genomic analysis. The major factors influencing the success of building a genomic reference for seagrasses and plants in general are the often larger genome size (e.g., the sugar pine Pinus lambertiana 31Gb genome [68]), the genomic complexity (e.g., the rice genome [69]), and the possibility for polyploidy (e.g., strawberry genome [70]). Efforts to sequence the genome of H. ovalis have taken place [62], but this remains the first successfully assembled genome for the genus of Halophila. However, the current assembly can be improved through two possible avenues: either by increasing the sequencing effort to reach the genomic coverage obtained by studies sequencing smaller seagrass genomes (e.g., the Z. marina genome was sequenced at a coverage of~50x [39]) or by using the long reads provided by third-generation sequencing platforms, such as Oxford Nanopore and Pacific Biosciences. The latter option ensures the contiguity and quality of the assembly, as confirmed by other studies combining the two technologies from other species of interest (e.g., [71,72]), even though it has not yet been applied to seagrasses. However, we expect that with the future use of long reads, the contiguity and quality of the assembly will be improved, reaching the level of the first published seagrass genome, that of Z. marina.
To see how the sequenced individual relates to other H. stipulacea individuals, we compared the ITS sequence of the genome with complete sequences available in NCBI from the study of [59]. The resulting phylogenetic tree showed a remarkably high divergence of the sequenced individual compared to all other sequences from Greece, Italy, and Egypt (see Figure 2). Although the topology clustered the sequenced H. stipulacea individual with an individual from Egypt and another from Italy, the clustering was not supported by bootstrap values (all <10). The extreme divergence observed in the sequenced ITS region resembles the regions described as pseudogenes by [59], characterized by large branch lengths reflecting faster evolutionary rates of the duplicated ITS regions. Other such sequences with longer branches are included in Figure 2. The presence of duplicated ITS regions leads to the relaxation of selective pressures acting on the duplicated sequences, which then evolve at a faster rate. This hampers the phylogenetic positioning of species when using duplicated markers. Therefore, the conducted analysis could not resolve the relationship between the newly sequenced individual and other sequenced haplotypes. However, this divergence, including tandem duplications and possible pseudogenization [59], needs to be further studied by sampling multiple individuals from multiple populations and conducting a thorough population genomic analysis.
In this study, we managed to build the first draft assembly of H. stipulacea. The next step would be to predict the gene models using transcriptomic data and then compare the protein-coding sequences to those of other seagrasses and terrestrial plants. Such study will allow a comparison of the gene content among the sequenced seagrasses and will deepen our understanding of the pathways that have been secondarily lost during the transition to the sea (e.g., stomatal genes), as described in detail in [39] and further illustrated through including Z. muelleri, H. ovalis, and five non-marine plant species in the comparison [62]. Finally, it may lead to the identification of large-scale duplications that will inform us about past whole-genome duplication events (see the Ks-based age distributions analysis in [39]) and will highlight genes with a selective pressure and possible functions linked not only to the adaptation to seawater, but also to the response to climate change. In this study, we managed to build the first draft assembly of H. stipulacea. The next step would be to predict the gene models using transcriptomic data and then compare the protein-coding sequences to those of other seagrasses and terrestrial plants. Such study will allow a comparison of the gene content among the sequenced seagrasses and will deepen our understanding of the pathways that have been secondarily lost during the transition to the sea (e.g., stomatal genes), as described in detail in [39] and further illustrated through including Z. muelleri, H. ovalis, and five nonmarine plant species in the comparison [62]. Finally, it may lead to the identification of large-scale duplications that will inform us about past whole-genome duplication events (see the Ks-based age distributions analysis in [39]) and will highlight genes with a selective pressure and possible functions linked not only to the adaptation to seawater, but also to the response to climate change.

Closing Remarks
Although our literature search reflected the response of the seagrass community to the 'genomics age', only a few seagrass genomes have previously been sequenced. Inspired by the ecological importance of the focal invasive species, we managed to provide the first draft wholegenome assembly. Although all Halophila genomes are remarkably large, it is still feasible to sequence them using only short reads. However, the addition of third-generation sequencing long reads will assist in refining the assembly. Here, we have offered the backbone that will unleash multiple downstream genomic analyses needed not only for understanding the complex evolutionary and ecological procedures taking place during the invasion process of H. stipulacea, but also the genomic changes linked to climate change.

Closing Remarks
Although our literature search reflected the response of the seagrass community to the 'genomics age', only a few seagrass genomes have previously been sequenced. Inspired by the ecological importance of the focal invasive species, we managed to provide the first draft whole-genome assembly. Although all Halophila genomes are remarkably large, it is still feasible to sequence them using only short reads. However, the addition of third-generation sequencing long reads will assist in refining the assembly. Here, we have offered the backbone that will unleash multiple downstream genomic analyses needed not only for understanding the complex evolutionary and ecological procedures taking place during the invasion process of H. stipulacea, but also the genomic changes linked to climate change.