Identification of Mitochondrial DNA (NUMTs) in the Nuclear Genome of Daphnia magna

This is the first study in which the Daphnia magna (D. magna) nuclear genome (nDNA) obtained from the GenBank database was analyzed for pseudogene sequences of mitochondrial origin. To date, there is no information about pseudogenes localized in D. magna genome. This study aimed to identify NUMTs, their length, homology, and location for potential use in evolutionary studies and to check whether their occurrence causes co-amplification during mitochondrial genome (mtDNA) analyses. Bioinformatic analysis showed 1909 fragments of the mtDNA of D. magna, of which 1630 were located in ten linkage groups (LG) of the nDNA. The best-matched NUMTs covering >90% of the gene sequence have been identified for two mt-tRNA genes, and they may be functional nuclear RNA molecules. Isolating the total DNA in mtDNA studies, co-amplification of nDNA fragments is unlikely in the case of amplification of the whole tRNA genes as well as fragments of other genes. It was observed that TRNA-MET fragments had the highest level of sequence homology, thus they could be evolutionarily the youngest. The lowest homology was found in the D-loop-derived pseudogene. It may probably be the oldest NUMT incorporated into the nDNA; however, further analysis is necessary.


Introduction
Daphnia, commonly known as the water flea, are small crustaceans usually inhabiting freshwater ponds and lakes on all continents of the globe. It has long been used as a model for the elucidation of animal responses and adaptations to environmental changes [1]. It has been also used in diverse biological research areas such as ecology, ecotoxicology, evolution, and reproductive biology due to its important position in the aquatic food chain, a high degree of phenotypic plasticity, and cyclical parthenogenesis responding to environmental stimuli [1][2][3][4][5][6][7]. Its sensitive behavioral and physiological responses are parameters used as biomarkers of the effect induced by various substances [1][2][3][4][5][6][7][8]. Thus, it has been used for reproduction tests, acute toxicity studies, and chronic toxicity tests in the OECD Guidelines [9,10]. The current state of knowledge of the nuclear genome of Daphnia magna is based on reports by Routtu et al. [11,12], Dukić et al. [13], and Lee et al. [14]. Low-and high-density genetic linkage maps were obtained, in which they assembled the whole genome sequence of D. magna. Specific genetic markers from a high-resolution genetic linkage map of D. magna Xinb3 were evaluated in toxicological studies by Korea Institute of Toxicology (KIT) [14].
Genomic resources are steadily being developed for many species of the genus Daphnia. In particular, a database of around 12,000 expressed sequence tags (EST) is currently available (http://wfleabase.org) [15], providing a useful resource to isolate polymorphic genetic markers in this species. However, there is no information about pseudogenic sequences of mitochondrial origin (NUMTs) in the D. magna genome.
Nuclear DNA sequences that are homologous to the mitochondrial genome are often referred to as mitochondrial pseudogenes, or NUMTs [16]. NUMTs may differ in length and be as large as the full length of the mitochondrial genome [17]. It has been reported that the NUMT length is positively correlated with the genome size, suggesting potential roles of non-coding DNA gain and loss in NUMT accumulation [18]. NUMTs have been documented in almost all eukaryotic genomes studied [19]. The transfer of mitochondrial DNA sequences into the nuclear genome is an ongoing evolutionary process [20], which has markedly influenced the evolution and function of eukaryotic genomes [19]. Thus, NUMTs are good materials for studying the evolution of nuclear sequences without selective constraints [21]. However, because of their homology, NUMTs may confound mtDNA studies, as the NUMT co-amplification product could interfere with sequence analysis [22]. This is the first study in which the D. magna nuclear genome deposited in the GenBank database was analyzed for pseudogene sequences of mitochondrial origin. The present study aimed to identify NUMTs, their length, homology, and location for potential use in evolutionary studies and to check whether their occurrence causes co-amplification during mitochondrial genome analyses.

Results
Bioinformatic analysis showed 1909 fragments of the mitochondrial genome, of which 1630 fragments were located in ten linkage groups of the nuclear genome of D. magna. The other fragments were localized in scaffolds which were used during genome sequencing. All the NUMTs found in this research are listed in supplementary materials (Table S1). The total length of the NUMT sequences in the linkage groups corresponded to the number of fragments on individual LG ( Table 1). The total length of NUMTs in the D. magna genome was 44.391 base pairs (bp), which accounted for 0.042% of the length of the nuclear genome. Their percentage content was 0.037% in the longest linkage group 2, in which 228 NUMTs were identified, and 0.047% in the shortest linkage group 10 ( Table 1). The most frequently occurring fragments of the mtDNA sequence in the nuclear genome included ND2 (115), ND3 (113), TRNA-CYS (110), and 16S rRNA (105). However, the highest number of NUMTs was observed for the non-coding area, i.e., the D-loop (147) ( Table 2). The lowest numbers of mtDNA fragments found in nDNA were observed mainly for genes encoding tRNA molecules: TRNA-PRO and TRNA-MET (6), TRNA-TYR and TRNA-ASN (4), and TRNA-SER1 (2). The highest numbers of mtDNA fragments were recorded on LG2-228, and the lowest-on LG8-134 ( Table 2).
The longest fragments of the mitochondrial genome present in the nuclear genome were observed for the D-loop (182 bp), ND4 (108 bp), ND3 (99 bp), and ND5 and COX3 (94 bp each) ( Table 3). The 182-bp fragment of the D-loop constituted 63% of the entire sequence of this region. In contrast, in the case of the other protein-encoding genes, the fragment size ranged from 4% (CYTB) to 32% (ATP8). In turn, NUMTs were recorded among genes encoding tRNA, constituting over 90% of the mtDNA gene sequence: TRNA-ARG and TRNA-THR (95% each) and TRNA-GLU (97%). All the analyzed sequence fragments had a minimum length in the range of 16-25 bp (Table 3).    Of the 1630 NUMTs (Table 1), 253 fragments, representing 16% of all NUMTs, showed 100% homology with the mtDNA gene sequences. 100% sequence homology for TRNA-MET was found for all 6 NUMTs (Tables 2 and 4). 23 NUMTs whose sequence homology was 100% were observed for the D-loop region and 21 NUMTs for the gene ND3. In contrast, in the case of genes TRNA-SER1 and TRNA-ALA, no NUMTs with 100% sequence homology were observed ( Table 2). The mean values of the percentage of sequence identity ranged from 88.0% (ATP8) to 100% (TRNA-MET). The percentage identity for the individual linkage groups was in the range of 90-91%. The largest homology was recorded on LG10 (90.7%) and the lowest-on LG6 (90.2%) ( Table 4). At least one sequence with 100% homology was identified on each of the linkage groups. Table 5 compares the NUMTs' number and length with mtDNA fragments found in pseudogenes and transcriptome of D. magna. The overall number of mtDNA fragments in pseudogenes was 599 and their highest length was estimated between 18 bp (TRNA-MET) to 99 bp (ND3). The lowest number of mtDNA fragments being part of pseudogenes was observed for TRNA-PRO (1), ATP8 (2), and TRNA-ASP and TRNA-ILE (3). The highest number of these fragments was observed for 16s rRNA (63) and D-loop (48). In the case of mtDNA fragments found in the transcriptome, the highest number was observed for TRNA-CYS (101) and 16S rRNA (100). The lowest number of mtDNA fragments was found for TRNA-SER1 and TRNA-PHE gene (1). The highest length ranged between 20 bp to 64 bp. The main localization of mtDNA fragments in transcriptome was in different mRNA transcript variants of various protein-coding genes as well as ncRNA and misc_RNA sequences (Table S1). The overall number of mtDNA fragments in transcriptome was 1275. Table 4. Mean % identity of mtDNA gene fragments located in the linkage groups.

Discussion
This is the first study in which the D. magna nuclear genome deposited in the GenBank database was analyzed for pseudogene sequences of mitochondrial origin. The first complete information about the genome of D. magna was published by Lee et al. in 2019 [14]. To date, there is no information about pseudogenes localized in the genome of the water flea. Our research provides comprehensive bioinformatic details about the location of NUMTs found in the reference genome, which may be useful for future phylogenetic, evolution, and/or population analyses.
A computer-based search for NUMTs in the nuclear genome of 85 species of animals, plants, fungi, and protists showed that the total length of detected NUMTs varied from 0 to 823.9 kb per nuclear genome [23]. For instance, the NUMT content was 0 in Anopheles gambiae, 263.478 bp in Homo sapiens, and more than 800 kbp in Oryza sativa [23]. In the case of D. magna (Table 1), the overall length of NUMT in the nuclear genome was 44.391 bp. The total length of 24 NUMTs was 9.989 bp in the Pteromalus puparum genome, and 42.972 bp in Nasoni vitripennis [24], and more than 230 kbp in Apis mellifera [25]. There seems to be a positive correlation between the haploid genome sizes (C-values) and NUMT amount/prevalence in eukaryotes [26].
Another explanation for the differences in the total length of NUMTs is the fact that they accumulate in the genomes in a continuous evolutionary process [18]. Like in other species, e.g., Myotis lucifugus [27], the percentage of NUMTs in the genome was less than 0.1%. In contrast, the number and length of NUMTs may vary depending on the computer-based query for NUMTs in the BLAST tool, as in the case of the genome of Canis lupus familiaris [22,28]. In our study, the sequence search in BLASTN 2.6.0 implemented in CLC Genomics Workbench 12.0 yielded 1909 results, although 279 results were found in various scaffolds used during the sequencing of the D. magna genome. In this paper, however, the NUMT results from the scaffold were excluded due to the potential occurrence of artifacts created during sequencing, as observed by Shi et al. [27], where surprisingly, an entire mitochondrial genome was found in the scaffold AAPE02072785 in the M. lucigufus genome. It is also worth considering that, in this work, the BLAST search result took into account all search results, even those fragments whose length was only 16 bp ( Table 3). The NUMTs found in scaffolds and mitochondrial DNA fragments found in pseudogenes and transcriptome are listed in supplementary materials (Table S1).
In the case of various mtDNA gene fragments, such as TRNA-MET, COX1, and ND4 where their number is higher in the transcriptome than in the nuclear genome, it may be the result of the presence of multiple mRNA transcripts or ncRNA forms in the transcriptome. The variability of mRNA and ncRNA forms may be caused by alternative splicing. Moreover, the transcriptome is changing during the lifespan and may depend on the type of tissue [29]. It is not excluded that some NUMT are parts of protein-coding genes and/or participate in gene regulation through ncRNA formation. In this research, the authors focused on describing the presence of mitochondrial DNA in the nuclear genome, while the investigation of the role of NUMTs in gene expression and regulation in D. magna should be verified experimentally in further analyses.
NUMTs were used to define characteristics and to clarify phylogenetic inconsistencies suggested by paralog sequences [18,30]. The analysis performed by Mishmar et al. [31] revealed that mtDNA fragments, which were integrated into the nucleus before the radiation of modern human mtDNAs, confirming that mtDNAs similar to today's African macro-haplogroup L were the first human mtDNAs. The analysis of NUMTs in D. magna revealed that the latest evolutionary sequences are pseudogenes derived from the sequence TRNA-MET as the homology of the entire gene sequence was 100%. The homology of TRNA-MET gene fragments localized in transcriptome is below 100%, however, it could be a result of gene modifications after expression (Table S1). In addition, all mtDNA fragments in transcriptome might differ from each other due to different ways of modulating gene expression depending on the tissue. The lowest homology (70.33%) was characteristic for the pseudogene derived from the mtDNA D-loop sequence located on LG5 (Table S1). It may probably be the oldest element of mitochondrial DNA incorporated into the nuclear genome. However, due to the high degree of mutation in the D-loop, the thesis requires further verification. Similarly, NUMTs derived from TRNA-ALA and TRNA-SER1 may have been one of the first sequences derived from mtDNA. However, by assessing only sequence homology, the sequence of incorporation of mitochondrial pseudogenes into the nuclear genome cannot be determined. Nevertheless, as observed by Mishmar et al. [31], the nuclear genome accumulates mutational changes at a much slower rate than mtDNA. Hence, the sequences of "recent" NUMTs can provide valuable information about the mtDNA sequences of the earliest humans.
NUMTs exhibit different degrees of homology to their mitochondrial counterparts. They are variable in size, evenly distributed within and among chromosomes, and, in some cases, they are highly rearranged and/or fragmented [32]. However, the mitochondrial chromosome's size does not correlate with the NUMT frequency or size distribution [19]. The transfer to the nucleus can be influenced by mitochondria's vulnerability to stress and other factors that may cause the escape of mtDNA to the cytoplasm [32]. Mutations in mtDNA may occur in the entire mitochondrial genome; however, they are most frequently detected in the hypervariable regions of D-loops [33][34][35].
The number of mutations as well as their incidence in the D-loop area may be related to the number of NUMTs occurring in the nuclear genome. The higher the mutation rate, the greater the likelihood of transfer of the D-loop fragment into the cytoplasm followed by its incorporation into the nuclear genome as a pseudogene. In the D. magna genome, the highest numbers of pseudogenes from the D-loop (147) were observed, and only 23 of them had 100% sequence homology (Table 1).
No pseudogenes derived from the TRNA-ILE gene sequence were observed in the nuclear genome (Table 1). However, the detailed search of TRNA-ILE fragments in pseudogenes revealed the presence of three of them in LOC116920659, LOC116923942, and LOC116927475 and two fragments in the transcriptome (Table S1). The difference between the number of NUMTs and pseudogenes and transcriptome for the TRNA-ILE gene might be the result of the searching method. Pseudogenes were generated by automated computational analysis using gene prediction method: Gnomon [36]. Gnomon is a gene prediction HMM-based program. The core algorithm is based on Genscan which uses a 3-periodic fifth-order Hidden Markov Model (HMM) for the coding propensity score and incorporates descriptions of the basic transcriptional, translational, and splicing signals, as well as length distributions and compositional features of exons, introns, and intergenic regions. For the genes for which no experimental information is available, Gnomon is creating conventional ab initio predictions [36]. As a result of this prediction, the proposed gene contains the fragment of TRNA-ILE. The presence of this sequence fragments derived from mtDNA cannot be excluded, however, further experimental analysis of the protein should be held. Due to the lack of comparison with other genome sequences for D. magna the authors could not exclude any of the results. The parameters of the Gnomon prediction were widely used in the description of other model organisms' genomes i.e., Arabidopsis thaliana, Danio rerio, Mus musculus, and Drosophila melanogaster [36]. Thus, the authors recognize the possibility of occurrence of tRNA-ILE-derived NUMTs in the D. magna genome.
In the human genome, NUMTs are commonly associated with repetitive elements, suggesting a possible role for transposable elements in mtDNA integration in the nuclear genome [31]. Certain NUMTs are repeated multiple times within the human genome [32,37]. In the case of the D. magna genome, some NUMTs were also observed, which were repeated many times in different linkage groups (Table S1). However, their association with repetitive elements in the nuclear genome requires additional research. The average levels of NUMT sequence homology for the individual linkage groups do not differ significantly from each other, which may indicate a random and even inclusion of sequence fragments into each of them ( Table 4).
The cytochrome C oxidase subunit I (COI) has possibly been the most commonly studied marker. However, its popularity is mainly associated with its use as a marker for the DNA barcoding of animal diversity [38]. There are several factors causing inadequacy of mtDNA in general and COI individually, such as male-biased gene flow, the selection on any mtDNA nucleotide(s) (as the whole genome is one linkage group), retention of ancestral polymorphism, and introgression following hybridization [39]. Presently, there are huge numbers of COI sequences in public databases, and most of them have a limited length, generally close to the length of the barcoding region. It is known that the possibility of the presence of NUMTs in the existing data should not be ignored [40]. Since the success of taxonomic differentiation is positively correlated with the barcode length, the minibarcode length is usually kept above 100 bp. For example, an approximately 250-bp region of 16S rRNA can be successfully amplified from various medicinal preparations and food products. It provides the correct identification of animal species [41,42]. Gene fragments that are often used for species identification in D. magna are in the following ranges: COX1 (19-72 bp), CYTB (18-46 bp), 12s rRNA (18-72 bp), and 16s rRNA (19-78 bp); each of them constitutes less than 10% of the length of the entire gene (Table 3). Hence, the NUMT sequences of frequently analyzed genes are generally shorter than the respective mitochondrial sequence; thus, the possibility of NUMT co-amplification should decrease with an increased length of the targeted mitochondrial marker. However, it is worth paying attention to the coverage of NUMTs derived from the TRNA-ARG (95%), TRNA-GLU (97%), and TRNA-THR (95%) genes ( Table 3). The transcriptome analysis for TRNA-ARG revealed the existence of the NUMT in lncRNA sequence transcript variants of the gene LOC116925827 (Table S1). The presence of this gene fragment in a non-coding sequence may suggest its role in gene regulation. The TRNA-THR-derived NUMT was identified in mRNA gene transcript variant LOC116930084. On the other hand, the sequence fragment of TRNA-GLU was only partially found in the transcriptome in zinc finger CCHC domain-containing protein 7-like mRNA localized in LG2. Perhaps, in these cases, they are not NUMTs but functional gene fragments taking part in gene modulation. Nonetheless, to confirm it, a detailed analysis of the transcriptome on larger research groups should be held.
NUMTs are highly polymorphic in terms of the sequence, homo/heterozygosis status, and presence/absence at a specific locus [43]. These features facilitate the use of NUMTs as specific population markers, as proposed for the human population by Lang et al. [44] The biological importance of NUMTs may correlate with their location on the chromosome. Depending on the insertion location, NUMTs may perturb the function of the genes [45]. Additionally, de novo integration of NUMT pseudogenes into the nuclear genome has an adverse effect in some cases: promoting various disorders and aging, as observed in humans [46]. Chatre and Ricchetti [47] report that migratory mitochondrial DNA can also impact the replication of the nuclear region in Saccharomyces cerevisiae.

Materials and Methods
The whole sequence of the Daphnia magna genome, strain SK and annotation of the nuclear and mitochondrial genome of D. magna were obtained from GenBank (the accession numbers for the nuclear and mitochondrial genome are GCA_003990815.1 and NC_026914.1, respectively). Each of the mitochondrial genes were aligned separately to detect plausible NUMTs in the nuclear genome. The presence of mtDNA fragments in defined pseudogenes was estimated by the alignment of mtDNA sequences with the GCF_003990815.1_ASM399081v1_pseudogene_without_product.fna sequence. The presence of mtDNA fragments in the transcriptome was established by the search in GCF_003990815.1_ASM399081v1_rna.fna sequence. The detailed results were presented in Supplementary Table S1. The SK strain genome assembly was released on 02-Jan-2019 and was integrated with the high-density genetic linkage map of the strain Xinb3. The presence of NUMTs in the D. magna nuclear genome GCA_003990815.1 was evaluated using the BLAST (BLASTN 2.6.0) (Table S1) [48] program implemented within the CLC Genomics Workbench 12.0 software package (https://www.qiagenbioinformatics.com). In the presented study, we implemented a low complexity filter to mask off the query sequence segments with low compositional complexity. Such filtering was used to eliminate statistically significant but biologically uninteresting reports from the BLAST output, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. The expectation value (e-value = 10) describing the threshold for reporting matches against database sequences based on the assumption that ten matches are expected to be found merely by chance according to the stochastic model of Karlin and Altschul (1990) [49]. If the E-value ascribed to a match was greater than the expected threshold, no match was reported. The lower value of the threshold caused more stringent searching criteria, leading to fewer chance matches being reported. Higher threshold results in more matches being reported, but many may just match by chance, not due to any biological similarity. The value of match/mismatch, which assigned a score for aligning and evaluated the quality of a pairwise sequence alignment was set to 2/3. The penalty to open gap and penalty to extend gap (Gap Cost.) was set to 5 for existence and 2 for extension. The maximum number of database sequences, where BLAST found matches to a query sequence, to be included in the BLAST report was set to 100.5.

Conclusions
This article described the first occurrence of mitochondrial pseudogenic sequences (NUMTs) in the nuclear genome of D. magna. There was no full sequence homology for two genes: TRNA-SER1 as well as TRNA-ALA NUMTs. The total length of NUMTs in the nuclear genome was 44.391 bp (from 16 to 182 bp), which accounted for 0.042% of the entire genome. The best-matched NUMTs covering more than 90% of the mtDNA gene sequence were identified for the TRNA-ARG (95%), and TRNA-THR (95%) genes, and they may be included in the functional nuclear RNA molecules. The NUMT length varied from 16 to 63 bp for tRNA genes, from 17 to 108 bp for coding genes, from 18 to 78 bp for rRNA genes, and from 17 to 182 bp for the D-loop region. Therefore, using the product of total DNA isolation in mtDNA studies, co-amplification of nDNA fragments is unlikely especially in the case of amplification of the whole tRNA genes and fragments of other genes the D-loop with a length exceeding 200 bp. It was observed that fragments TRNA-MET (from 16 to 18 bp length) had the highest level of sequence homology, which means that they could be evolutionarily the youngest. The lowest degree of homology was found in the pseudogene derived from the mtDNA D-loop sequence. It may probably be the oldest element of mitochondrial DNA incorporated into the nuclear genome; however, due to the high degree of mutation in the D-loop, the thesis requires further analysis and elucidation.

Conflicts of Interest:
The authors declare no conflict of interest.