A First Genome Survey and Genomic SSR Marker Analysis of Trematomus loennbergii Regan, 1913

Simple Summary
The scaly rockcod (Trematomus loennbergii) is distributed in the Antarctic Ocean, an area isolated by the Antarctic Circumpolar Current and an important region for studying evolutionary diversity. Trematomus is the main genus, comprising 11 species, and their habitat distribution is well known. However, their genetic and genomic information has not been studied, and some species have similar morphology. In this study, a genome survey of T. loennbergii and a microsatellite motif analysis were conducted to obtain a genomic profile. Fundamental data such as genome size, heterozygosity ratio, duplication ratio, and microsatellite motifs were obtained. These data will provide a foundation for further whole-genome sequencing and the development of new molecular markers for T. loennbergii.

Abstract
Trematomus loennbergii Regan, 1913, is an evolutionarily important marine fish species distributed in the Antarctic Ocean. However, its genome has not been studied to date. In the present study, whole genome sequencing was performed using next-generation sequencing (NGS) technology to characterize its genome and develop genomic microsatellite markers. The 25-mer frequency distribution was found to be the most suitable, and the genome size was predicted to be 815,042,992 bp. The heterozygosity, average rate of read duplication, and sequencing error rate were 0.536%, 0.724%, and 0.292%, respectively. These data were used to analyze microsatellite markers, and a total of 2,264,647 repeat motifs were identified. The most frequent repeat motif class was di-nucleotide (87.00%), followed by tri-nucleotide (10.45%), tetra-nucleotide (1.94%), penta-nucleotide (0.34%), and hexa-nucleotide (0.27%). The AC repeat motif was the most abundant among di-nucleotides and among all repeat motifs. Among the microsatellite markers, 181 were selected, and PCR was used to validate a subset of them; 15 markers produced a single band. In summary, these results provide a good basis for further studies, including evolutionary biology and population genetics studies of Antarctic fish species.


Introduction
The Antarctic shelf is the deepest ice shelf in the world. The depth of the Ross Ice Shelf, which is the largest ice shelf of Antarctica, is approximately 500 m, which is much deeper than the average depth (130 m) of other continental shelves. Moreover, the Southern Ocean is isolated by the Antarctic Circumpolar Current (ACC), and the coastal water is cold and partly frozen, making it an extreme environment. Therefore, the Antarctic Ocean is an important region for biologists to study survival adaptations and evolutionary diversification [1,2]. In the order Perciformes, Notothenioidei is the dominant suborder of fishes in the Antarctic area in terms of diversity and biomass [3], and fish species belonging to the suborder Notothenioidei have a variety of characters distinguishing them from other teleost fishes, such as the lack of hemoglobin and the presence of antifreeze glycoproteins (AFGPs), which protect their body fluids from freezing [4,5]. The subfamily Trematominae is fundamental to the study of the coastal Antarctic ecosystem and includes only 14 species. Trematomus is the main and most well-known genus in this subfamily; it comprises 11 species, and their plasticity and diversity in habitat distribution are well known [6]. However, it is difficult to clearly distinguish between several Trematomus species because they have very similar morphologies. For example, Trematomus loennbergii and Trematomus lepidorhinus differ only in the absence or presence of scales on the preorbital and lower jaw [7]. Trematomus loennbergii Regan, 1913, is known as the scaly rockcod, and the average length of this species is 20 cm, according to FishBase [8]. It is widely distributed in the Southern Ocean and is commonly found at depths of over 300 m [9]. Its swimming activity is more spontaneous than that of other benthic fishes from the same family. The diets of females and males are similar; they feed on a wide range of prey, and their main food resources are epifaunal and tube-dwelling polychaetes [10].
T. loennbergii is an important species for studying evolution in the Antarctic area. Some morphological studies have emphasized the difficulty of distinguishing between this species and similar species. Owing to the development of molecular biology research techniques, especially next-generation sequencing (NGS) technology, genomic data, such as whole genome sequencing data, could be used for this purpose. Recently, some Antarctic species have been studied using this technology, but the genome of T. loennbergii has not been assessed to date. Therefore, low-coverage genome sequencing (e.g., a genome survey by K-mer analysis) needs to be carried out before performing large-scale sequencing to provide a genomic reference. Moreover, microsatellites or simple sequence repeats (SSRs), which consist of one to 10 nucleotides, are widely distributed throughout the genomes of eukaryotes [11][12][13]. Therefore, in the present study, we conducted K-mer and QDD analyses to investigate the genome size and repeat sequences of T. loennbergii and to develop new microsatellite markers. These data can provide useful basic information on the T. loennbergii genome.

Sample Collection and DNA Extraction
A T. loennbergii specimen (Figure 1) was caught in the Ross Sea (77°05′ S, 170°30′ E, in CCAMLR Subarea 88.1), Antarctica, and stored in a freezer. The conventional phenol-chloroform method [14] was used for DNA extraction from the muscle tissue of the frozen specimen. A quality check was conducted using a fragment analyzer (Agilent Technologies, Palo Alto, CA, USA), and DNA quantity was estimated using a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).

Library Construction and Sequencing
DNA library preparation was performed according to the Illumina TruSeq DNA PCR-Free Library prep protocol. For library sample preparation, 2 µg of genomic DNA for a 550 bp insert size was randomly sheared to yield DNA fragments using the Covaris S2 system (Covaris, Woburn, MA, USA). The fragments were blunt-ended, and a single 'A' nucleotide was added to the 3′ ends of the fragments for adaptor ligation. After the ligation step with adaptors having different sequences at the 5′ and 3′ ends of each fragment, a library quality check was conducted using a Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The library was clustered on the Illumina cBot station, and paired ends were sequenced for 101 cycles on an Illumina NovaSeq 6000 sequencer (Illumina, San Diego, CA, USA) according to the Illumina cluster and sequencing protocols.

K-mer Analysis, Genome Assembly, and Microsatellite Analysis
After evaluation of quality values, all clean reads were used for K-mer analysis using Jellyfish and GenomeScope [15,16]. After estimating the genome size, the T. loennbergii genome was assembled with MaSuRCA [17]. The QDD version 3.1.2 pipeline [18] was used to identify the microsatellite motifs of T. loennbergii. The microsatellite repeat units in the genome were analyzed to calculate their length, quantity, and sequence. The following parameters were analyzed: number of mono-nucleotide repeats, di-nucleotide repeats, tri-nucleotide repeats, tetra-nucleotide repeats, penta-nucleotide repeats, and hexa-nucleotide repeats. These repeats were extracted from steps 1, 2 and 3. The parameter of each step was -contig 1, -make_cons 0 and -contig 1, respectively. After QDD analysis, a total of 181 primer sets were selected using the following parameters: motifs with more than five repetitions, 100-300 bp amplification product, 18-28 mer primer size, and 58-62 °C melting temperature. Among these primer pairs, 40 sets with a 20 bp primer size and a 60 °C annealing temperature were randomly selected. PCR was conducted in a total volume of 20 µL, including 5 µL genomic DNA (30 ng/µL), 10 µL 2× EmeraldAmp PCR Master Mix (Takara Bio, Shiga, Japan), 1 µL (10 pmole/L) each of the forward and reverse primers, and 3 µL ddH2O. The PCR program was 2 min at 94 °C, followed by 35 cycles of 94 °C for 30 s, 60 °C for 30 s and 70 °C for 1 min, and the final extension was 10 min at 72 °C. The amplified PCR products were separated by 4% agarose gel electrophoresis, and the 20 bp DNA ladder (Takara Bio, Shiga, Japan) was used to estimate the PCR product size.
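The primer selection criteria listed above can be expressed as a simple filter. The following Python sketch is illustrative only: the `PrimerPair` record and its field names are hypothetical and do not correspond to the QDD output format, and the random draw merely mimics choosing a subset of passing pairs for PCR.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimerPair:
    """One candidate SSR primer pair (hypothetical fields, not QDD's columns)."""
    motif_repeats: int   # number of times the SSR motif is repeated
    product_bp: int      # expected amplicon length in bp
    fwd_len: int         # forward primer length in nt
    rev_len: int         # reverse primer length in nt
    fwd_tm: float        # forward primer melting temperature in deg C
    rev_tm: float        # reverse primer melting temperature in deg C


def passes_filters(p):
    """Apply the four criteria from the text to one primer pair."""
    return (
        p.motif_repeats > 5                                   # more than five repetitions
        and 100 <= p.product_bp <= 300                        # 100-300 bp product
        and all(18 <= n <= 28 for n in (p.fwd_len, p.rev_len))        # 18-28 mer primers
        and all(58.0 <= t <= 62.0 for t in (p.fwd_tm, p.rev_tm))      # Tm 58-62 deg C
    )


def pick_for_pcr(pairs, n, seed=0):
    """Randomly draw up to n passing pairs; the seed is fixed for reproducibility."""
    kept = [p for p in pairs if passes_filters(p)]
    return random.Random(seed).sample(kept, min(n, len(kept)))
```

A pair such as `PrimerPair(8, 180, 20, 20, 60.0, 60.1)` passes, whereas a pair with a 320 bp product or a 17-mer primer is rejected.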

Sequencing Data Statistics
In this study, the paired-end method with the Illumina NovaSeq platform was used to generate raw sequence data, and low-quality reads were filtered out. As a result, a total of 53.48 Gb of data were obtained. The data showed that Q20 was 96.3% and Q30 was 91.3% (Table 1). The Illumina NGS platform specifies that high-quality reads should have a Q20 of at least 90% and a Q30 of at least 85% [19]. Therefore, the T. loennbergii sequencing data were highly accurate. Genome assembly, and thereby the genome sequencing quality, can be influenced by GC content; high GC content can reduce the sequencing coverage and cause sequencing bias. However, GC content between 30% and 50% has no effect on genome sequence quality [20][21][22]. In the present study, the GC content of T. loennbergii was 41.3% (Table 1); thus, it had no effect on the assembly results.
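For readers unfamiliar with these metrics, Q20 and Q30 are the percentages of bases whose Phred quality score is at least 20 and 30, respectively, and GC content is the percentage of G and C bases. The sketch below is a minimal illustration, assuming Phred+33 encoded quality strings (the usual encoding for current Illumina FASTQ files) and reads supplied as (sequence, quality) pairs; the function name `read_stats` is our own and is not part of any tool used in this study.

```python
def read_stats(records, offset=33):
    """Return Q20, Q30 and GC percentages for an iterable of (sequence, quality) pairs.

    Assumes Phred+33 quality encoding; GC% is computed over all bases, including N.
    """
    total = q20 = q30 = gc = 0
    for seq, qual in records:
        scores = [ord(c) - offset for c in qual]   # decode ASCII to Phred scores
        total += len(scores)
        q20 += sum(s >= 20 for s in scores)
        q30 += sum(s >= 30 for s in scores)
        gc += sum(base in "GCgc" for base in seq)
    if total == 0:
        raise ValueError("no bases supplied")
    return {
        "Q20": 100.0 * q20 / total,
        "Q30": 100.0 * q30 / total,
        "GC": 100.0 * gc / total,
    }


# Toy example: 'I' encodes Phred 40 and '5' encodes Phred 20.
stats = read_stats([("ACGT", "IIII"), ("GGCC", "5555")])
```

In the toy example all eight bases reach Q20, half reach Q30, and six of eight bases are G or C.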

K-mer Analysis and Genome Size Prediction
Using the sequencing data, K-mer analysis was conducted, and the 25-mer frequency distribution was found to be the most suitable. The estimated genome size was 815,042,992 bp (Table 2), which is quite similar to that of Pogonophryne albipinna (~883.8 Mb), which inhabits the deep waters of the Antarctic Southern Ocean [23], but smaller than that of the Antarctic blackfin icefish Chaenocephalus aceratus (1.06 Gb) [24]. The genome size of most fish is approximately 1 Gb, except for a few species [25], and the estimated genome size of T. loennbergii was similar to that of most fish species. Based on K-mer analysis, the heterozygosity was 0.536% and the average rate of read duplication was 0.724%. The sequencing error rate was 0.292%, and the highest frequency was near 40× coverage (Figure 2).
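The principle behind K-mer based genome size estimation can be illustrated with a simplified calculation: the total number of non-error K-mers divided by the depth of the main peak of the K-mer frequency histogram. The Python sketch below is not the GenomeScope method, which fits a statistical model to the whole profile; the error cutoff `min_depth` and the toy histogram are assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a K-mer frequency histogram.

    histogram maps depth -> number of distinct K-mers observed at that depth.
    Depths below min_depth are treated as sequencing errors (assumed cutoff).
    Returns (estimated size in K-mers, peak depth).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no K-mers at or above min_depth")
    peak_depth = max(solid, key=lambda d: solid[d])      # most common depth
    total_kmers = sum(d * n for d, n in solid.items())   # total K-mer occurrences
    return total_kmers / peak_depth, peak_depth


# Toy histogram with an error spike at low depth and a main peak at 40x.
toy = {1: 1000, 2: 100, 38: 50, 39: 80, 40: 100, 41: 80, 42: 50}
size, peak = estimate_genome_size(toy)
```

For the toy data the peak lies at depth 40 and the estimate equals the 360 distinct non-error K-mers, as expected for a symmetric peak.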

De Novo Assembly
To perform de novo assembly, MaSuRCA [17] was used with all clean reads. The total size of the contigs was 820,644,295 bp, and the number of contigs was 613,288. The largest contig was 59,484 bp, and 192,849 contigs (31.4%) were longer than 1 kb. The N50 contig length was 1526 bp and the L50 contig number was 148,364. The GC content was 40.65% (Table 3). Assembly is not difficult if the heterozygosity rate is lower than 0.5% [15], and the heterozygosity rate of T. loennbergii was close to this threshold (~0.5%). In addition, high-quality reads should have a Q30 of at least 85% [19], and the Q30 of the T. loennbergii reads was 91.3%.
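As a reminder of how these contiguity statistics are defined, N50 is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly size, and L50 is the number of contigs needed to reach that point. N50 therefore can never exceed the length of the largest contig. A minimal Python sketch with made-up contig lengths follows; the function name is our own.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)     # longest contig first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:                     # half of the assembly is covered
            return length, count
    raise ValueError("no contigs given")


# Toy assembly: total length 54, so the halfway point is 27 (10 + 9 + 8).
n50, l50 = n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

In the toy assembly the three longest contigs cover half of the total length, so L50 is 3 and N50 is the length of the third contig.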
These assembly data represent the first genome survey of T. loennbergii and provide useful information for genomic research on the Trematomus group. However, to improve the whole genome sequencing and de novo assembly data, further study combining more advanced technologies, such as PacBio long-read sequencing and high-throughput chromosome conformation capture (Hi-C), is needed.

Identification of Microsatellite Motifs
Based on the genome assembly, the QDD pipeline was used to identify SSR markers, and the total number of identified microsatellite motifs was 2,264,647. Among them, di-nucleotide repeats were the most abundant (1,970,270, 87.00%), followed by tri-nucleotide repeats (236,541, 10.45%), tetra-nucleotide repeats (43,907, 1.94%), penta-nucleotide repeats (7733, 0.34%), and hexa-nucleotide repeats (6196, 0.27%). The trend in di-nucleotide repeat frequency in the studied species was similar to that in other fish species, such as Pseudosciaena crocea and Megalobrama amblycephala [26,27]. Furthermore, this result was consistent with data indicating that repeat frequency decreases as repeat length increases because longer repeats are associated with higher mutation rates [28]. The most frequent repeat motif among the four types of di-nucleotide repeats was the AC/GT repeat, and it was also the most abundant repeat motif (1,394,543, 61.57%) among all repeat motifs. The hexa-nucleotide repeat motif was the least frequent but had the highest number (n = 80) of repeat motif types (Table 4). After this analysis, we selected 181 microsatellite primer pairs (Table S1). Among these primer sets, we randomly chose 40 primer pairs and conducted PCR amplification with T. loennbergii. As a result, 15 primer pairs produced one clear band (Figure 3). Therefore, the present data can be applied to the development of molecular genetic markers, although further validation studies using various Trematomus groups are needed.
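To illustrate how repeat motifs such as AC/GT are detected and grouped, the following Python sketch scans a sequence for di- to hexa-nucleotide units repeated at least five times and reduces each motif to a canonical form, so that AC, CA, GT and TG are counted together. This is a simplified regular-expression approach under assumed thresholds and is not the algorithm implemented in QDD; the function names are our own.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# A unit of 2-6 bases followed by at least four further copies (five in total).
# The lazy quantifier makes the regex try the shortest unit first.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")


def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement."""
    rev_comp = motif.translate(COMPLEMENT)[::-1]
    rotations = [m[i:] + m[:i] for m in (motif, rev_comp) for i in range(len(m))]
    return min(rotations)


def find_ssrs(sequence):
    """Return (start, unit, repeat_count) for each non-overlapping SSR found."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs matched as e.g. 'AA' units
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence containing (AC)6 and (ATG)5.
toy = "TTG" + "AC" * 6 + "GGATC" + "ATG" * 5 + "CC"
ssrs = find_ssrs(toy)
```

Applying `canonical_motif` to each detected unit before counting gives one tally per motif family, which is how a combined AC/GT class can be reported.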

Conclusions
In the present study, a genome survey was conducted and the estimated genome size of T. loennbergii was reported. Furthermore, microsatellite analysis was performed to study this genome. We found that the genome size was approximately 815.04 Mb, the heterozygosity rate was 0.536%, and the GC content was 40.65%. In addition, microsatellite motif analysis showed that the most abundant repeat class was the di-nucleotide motifs and the most abundant repeat type was the AC repeat motif, accounting for 61.57% of the total number of repeats. These findings will be helpful for future studies on the population genetics and evolutionary biology of T. loennbergii and related species.