Stability across the Whole Nuclear Genome in the Presence and Absence of DNA Mismatch Repair

We describe the contribution of DNA mismatch repair (MMR) to the stability of the eukaryotic nuclear genome as determined by whole-genome sequencing. To date, wild-type nuclear genome mutation rates are known for over 40 eukaryotic species, while measurements in mismatch repair-defective organisms are fewer in number and are concentrated on Saccharomyces cerevisiae and human tumors. Well-studied organisms include Drosophila melanogaster and Mus musculus, while less genetically tractable species include great apes and long-lived trees. A variety of techniques have been developed to gather mutation rates, either per generation or per cell division. Generational rates are described through whole-organism mutation accumulation experiments and through offspring–parent sequencing, or they have been identified by descent. Rates per somatic cell division have been estimated from cell line mutation accumulation experiments, from systemic variant allele frequencies, and from widely spaced samples with known cell divisions per unit of tissue growth. The latter methods are also used to estimate generational mutation rates for large organisms that lack dedicated germlines, such as trees and hyphal fungi. Mechanistic studies involving genetic manipulation of MMR genes prior to mutation rate determination are thus far confined to yeast, Arabidopsis thaliana, Caenorhabditis elegans, and one chicken cell line. A great deal of work in wild-type organisms has begun to establish a sound baseline, but far more work is needed to uncover the variety of MMR across eukaryotes. Nonetheless, the few MMR studies reported to date indicate that MMR contributes 100-fold or more to genome stability, and they have uncovered insights that would have been impossible to obtain using reporter gene assays.


Overview
We considered 123 independent nuclear genome mutation rate measurements, gathered from over 90 studies performed in either wild-type or MMR-deficient strains of 48 eukaryotic species. We confined the analysis to whole-genome studies that explicitly report rates or, rarely, to studies from which rates may be easily calculated. We present mean mutation rates for similar systems that are either MMR-proficient (Table 1) or MMR-deficient (Table 2). Studies are listed in Table 3. Granular details and notes on each study may be found in Table S1. Where available, rates per generation and per cell division are both presented. We classify each estimate as using either germline or somatic cells, although there is no distinction for most unicellular eukaryotes, and some organisms, e.g., hyphal fungi and many plants, lack dedicated germlines. We highlight trends and extremes, and then comment on how whole-genome rates elucidate mechanisms of MMR.

A Brief History
Mutation accumulation (MA) experiments are a venerable approach for estimating spontaneous mutation rates (reviewed in [1]). Theorized in the 1920s and first implemented in the 1960s, MA experiments use replicate lines derived from an ancestral population that can evolve independently. The population is subjected to periodic artificial bottlenecks to fix mutations regardless of their effects on selective fitness. Originally, mutations were selected via phenotypic changes due to mutations in reporter loci. Sequencing a reporter locus in the final population allowed for mutation detection and counting, resulting in mutation spectra and mutation rate estimates for that locus. However, no reporter locus can simulate all possible contexts, transcription states, chromatin states, replication times, or proximity to various genomic features. The advent of whole-genome sequencing bypassed these restrictions by making the entire genome the reporter. Mutations may be called by comparing the parental sequence to the sequences of progeny populations.
The first successful whole-genome MA experiments were published in 2008, first in Saccharomyces cerevisiae (baker's yeast) [2] and then in the bacterium Salmonella typhimurium [3]. Wild-type, whole-genome mutation rates were previously compared across kingdoms [1], and therefore here we confine the discussion to the eukaryotic nuclear genome. Lynch et al. found 33 mutations in four wild-type haploid Saccharomyces cerevisiae lines that had each been propagated for approximately 4800 cell divisions [2]. They estimated the whole-genome mutation rate at 0.33 Gbp⁻¹ division⁻¹. The race was on to find rates in as many diverse species as possible. By the end of 2010, the list included model organisms such as Drosophila melanogaster (fruit or vinegar fly; 0.1 Gbp⁻¹ division⁻¹; [4]), Caenorhabditis elegans (a roundworm; 0.32 Gbp⁻¹ division⁻¹; [5]), and Arabidopsis thaliana (thale cress; 0.22 Gbp⁻¹ division⁻¹; [6]).
The first whole-genome mutation rate estimates for genetically manipulated eukaryotes were also published in 2010. Zanders et al. reported the first estimate for a DNA mismatch repair (MMR)-deficient organism, a baker's yeast strain with a temperature-sensitive variant of the MMR gene MLH1 (mlh1-7ts; 3.7 Gbp⁻¹ division⁻¹ [7]). Comparison with the wild-type rate of Lynch et al. implied that MMR repairs over 90% of replication errors (correction efficiency = MMR-deficient rate / MMR-proficient rate; 3.7/0.33 = 11.2-fold). This comported well with prior reporter locus estimates [8]. Larrea et al. then used MMR-deficient (msh2∆) baker's yeast with a variant of DNA polymerase (Pol) δ (pol3-L612M; [9]). The known mutation bias of pol3-L612M, found in previous experiments in vitro [10], showed the bulk of Pol δ synthesis to occur on the nascent lagging strand. This extended results from previous mutation accumulations in reporter genes [11]. Thus, whole-genome mutation collections were shown to be useful for revealing cellular mechanisms.
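The correction efficiency arithmetic above can be reproduced in a few lines. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and it assumes, as the text does, that every mutation arises from a replication error that MMR could in principle repair.

```python
def correction_efficiency(rate_deficient: float, rate_proficient: float) -> float:
    """Fold increase in mutation rate when MMR is removed (MMR- / MMR+)."""
    if rate_deficient <= 0 or rate_proficient <= 0:
        raise ValueError("mutation rates must be positive")
    return rate_deficient / rate_proficient


def fraction_repaired(efficiency: float) -> float:
    """Fraction of errors corrected, assuming all errors are MMR substrates."""
    return 1.0 - 1.0 / efficiency


# Rates in mutations per Gbp per division, as quoted in the text:
# mlh1-7ts yeast [7] versus wild-type yeast [2].
efficiency = correction_efficiency(3.7, 0.33)
repaired = fraction_repaired(efficiency)
print(f"{efficiency:.1f}-fold, {repaired:.1%} of errors repaired")
```

With the yeast rates quoted above, this gives an efficiency of about 11.2-fold and a repaired fraction just over 90%, in line with the text.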
A study in 2010 also reported the first whole-genome mutation rate estimate for humans (11 Gbp⁻¹ generation⁻¹ [12]). This estimate could not come from whole-genome MA experiments. Baker's yeast can reproduce through budding, a form of binary fission. Baker's yeast, roundworms, and thale cress can reproduce through selfing. Vinegar flies neither bud nor self-fertilize, but they can be inbred in order to fix mutations. None of these options are available for humans, and therefore Roach et al. sequenced the genomes of a nuclear family and inferred mutations by comparing children to parents. Such parent-offspring sets are now a standard method for finding whole-genome mutation rates in outcrossing species, including wild populations.
The following decade saw scores of whole-genome rate measurements, plus a host of mutation frequencies and spectra from tumor genomes (e.g., [13]). Note that tumor studies often use similar terminology and technology to the experiments listed here, but, lacking cell division counts, they may report mutation frequencies rather than rates. This restriction has been circumvented somewhat by measurements in cancer cell lines (e.g., the chicken DT40 tumor line [14,15] and the human cell line RPE1 [16]) and by raising organoids from tumor samples [17]. The latter is also useful for estimating mutation rates in normal somatic tissues [17,18]. Some progress has been made in calculating mutation rates given incomplete knowledge of ancestral states or generation counts. Where complete pedigrees are unknown or ancestral samples are unavailable but little selective pressure is expected, mutations may be inferred by deriving the genotype of the last common ancestor. This technique, known as identity by descent, limits analysis to certain highly conserved segments [19]. Likewise, the number of cell divisions in the stem line for a particular tissue may be unknown. Given a representative sample of the whole tissue, the mutation rate in the first few rounds of replication may be inferred from variant allele frequencies (VAF). VAF methods are easiest with blood and require sophisticated modelling to account for unequal contributions of early embryonic cells [20].

Nuclear Mutation Rates in MMR-Proficient Germ Cells
Mutation rates are, by necessity, conditional. There is little ab initio reason to expect mutation rates to remain constant across differing species, environmental conditions, stressors, exposures, tissues, and germline versus somatic status. For instance, mutation rates may vary with organismal, tissue, or parental ages. Human mutation counts increase with parental age, particularly paternal age, which affects the mutation rate per generation [21]. Wherever necessary, Table S1 uses assumed average parental age, as defined by the authors of the study in question. Somatic rates are averaged across estimates, including across tissues [18] and growth conditions [22], where rates vary little. Rates are not combined if conditions are known to cause large rate differences, such as different ploidies [23] or homozygous versus hybrid or otherwise highly heterozygous individuals [24]. Mutation rates are also conditional across individual genomes, a subject we address below in our discussion of MMR.
Wild-type generational mutation rates range from 0.00761 Gbp⁻¹ generation⁻¹ in the ciliate Tetrahymena thermophila [25] to 3380 Gbp⁻¹ generation⁻¹ in the hyphal fungus Neurospora crassa (red bread mold; [26]). These extremes are largely explainable by how these organisms transmit their genetic code through generations. Ciliates such as Tetrahymena and Paramecium tetraurelia [27] keep dozens of working copies of their genome in a transcriptionally active compartment called the macronucleus while sequestering a germ copy in a protected micronucleus. In contrast, red bread mold has no separate germline, undergoing an average of 300 asexual divisions per sexual generation [26]. The asexual rate is listed as "somatic" in Table 1, although this definition is debatable. However, for reasons that are not entirely clear, most mutations per sexual generation occur in the last few divisions, perhaps only during meiosis. Is this the case with meiosis in other organisms?
Table 1 notes: Rates are averaged (mean) over all experimental estimates (unweighted). Rates are rounded to two significant digits. Color code: "supergroup" and "lower clade" columns are colored to highlight related clades; green saturation increases linearly with experiment counts in column "ct."; a gradient from blue to red was applied across "Gbp⁻¹ div⁻¹" columns of Tables 1 and 2, with blue indicating the lowest rates and red the highest. Abbreviations: ct. = number of independent estimates; g = germline; s = somatic cells; gen. = generation; div. = cell division; - = not determined.
Another hyphal fungus, the fairy ring mushroom Marasmius oreades, has the lowest measured mutation rate per cell division at 0.0038 Gbp⁻¹ division⁻¹ [26]. This is even lower than in the ciliates, but without an obvious mechanistic explanation. Nonetheless, with over 19,000 divisions per generation, this still yields a relatively high mutation rate of 73 Gbp⁻¹ generation⁻¹. How is the low rate per division maintained and how is the high rate per generation tolerated? Is this situation common among hyphal fungi? The cell divisions per generation for the fairy ring mushroom in Table 1 are estimated from the ratio of rates, per-generation divided by per-division. This would be incorrect if it has a sexual rate dominated by mutations in later cell divisions, as in red bread mold. Perhaps clarity will emerge through testing more organisms with more diverse lifestyles and genetic architectures. The highest wild-type "germline" mutation rate per division is 0.98 Gbp⁻¹ division⁻¹, in the haploid unicellular alga Micromonas pusilla [28]. How does this organism deal with a rate per division over 250 times higher than in the fairy ring mushroom? This rate is in turn dwarfed by those in animal somatic cells.
Table 2 notes: Rates are averaged (mean) over all experimental estimates (unweighted). Rates and correction efficiencies are rounded to two significant digits. Color code: "supergroup" and "lower clade" columns are colored to highlight related clades; green saturation increases linearly with experiment counts in column "ct."; a gradient from blue to red was applied across "Gbp⁻¹ div⁻¹" columns of Tables 1 and 2, with blue indicating the lowest rates and red the highest. Notes: a = efficiencies calculated from mutation rates per generation; b = efficiencies calculated from mutation rates per cell division. Abbreviations: ct. = number of independent estimates; g = germline; s = somatic cells; gen. = generation; div. = cell division; - = not determined.
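The ratio-based estimate of cell divisions per generation for the fairy ring mushroom, and its fold difference from Micromonas pusilla, can be checked with a short calculation. This is a minimal sketch using only the rates quoted in the text; the function name is hypothetical, and the estimate assumes that mutations accrue evenly across all divisions of a generation, which the text notes may not hold.

```python
def divisions_per_generation(rate_per_generation: float,
                             rate_per_division: float) -> float:
    """Estimate cell divisions per generation as the ratio of the two rates.

    Only valid if mutations accumulate uniformly across divisions
    (an assumption; red bread mold appears to violate it).
    """
    if rate_per_division <= 0:
        raise ValueError("rate per division must be positive")
    return rate_per_generation / rate_per_division


# Rates in mutations per Gbp, as quoted in the text.
MARASMIUS_PER_GENERATION = 73.0     # Gbp^-1 generation^-1
MARASMIUS_PER_DIVISION = 0.0038     # Gbp^-1 division^-1
MICROMONAS_PER_DIVISION = 0.98      # Gbp^-1 division^-1

divisions = divisions_per_generation(MARASMIUS_PER_GENERATION,
                                     MARASMIUS_PER_DIVISION)
fold_difference = MICROMONAS_PER_DIVISION / MARASMIUS_PER_DIVISION

print(f"Estimated divisions per generation: {divisions:,.0f}")
print(f"Micromonas vs. Marasmius per-division rate: {fold_difference:.0f}-fold")
```

The results are consistent with the "over 19,000 divisions" and "over 250 times" figures given above.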

Nuclear Mutation Rate Trends in MMR-Proficient Organisms
Three trends in nuclear mutation rates appear in the data. First, as previously stated for humans, mutation rates increase with parental age. Second, in plants, highly heterozygous lines have higher mutation rates than homozygous lines. Third, in animals, somatic mutation rates exceed germline rates. Mutation counts increase with parental age in many species. In humans, paternal age has a particularly strong effect on offspring mutation counts, commensurate with continuing cell divisions in the male germline (reviewed in [21]). However, maternal age is also a factor, which is more difficult to explain. Although outside the scope of this review, whole mitochondrial genome sequencing studies also show age-dependent increases in both point mutations [29] and large deletions [30]. The situation is even more extreme in large, long-lived hyphal fungi [31,32] and trees [33–37]. Because they grow outward linearly, lack a dedicated germline, and tend to fruit near their maximum extent, each consecutive fruiting results in more offspring mutations. Will whole-genome mutation rate studies ever find age-related increases in shorter-lived or unicellular eukaryotes?
Only two whole-genome mutation rate studies were found that compared homozygous lines with highly heterozygous lines. Both were in plants, encompassing three species. Yang et al. found 3.6-fold higher rates in heterozygous thale cress and 3.4-fold higher rates in heterozygous rice (Oryza sativa) [24]. Likewise, Xie et al. found a more modest 1.6-fold increase in a hybrid peach tree (Prunus davidiana × P. persica) versus in a weakly heterozygous peach tree (P. persica) [33]. Both studies concluded that highly heterozygous lines have higher mutation rates than homozygous lines. The idea that heterozygosity is tied to plant mutation rates has been discussed [33] and is supported by previous reporter locus assays (e.g., [38]). Will the results of these few experiments be recapitulated in other plants or in other eukaryotic clades?
One study measured comparable somatic and germline mutation rates per cell division in two organisms: humans and house mice [39]. The highest measured wild-type mutation rate per cell division belongs to house mouse fibroblasts at 8.1 Gbp⁻¹ division⁻¹, roughly 70-fold higher than in the germline. Likewise, human fibroblast rates were 2.7 Gbp⁻¹ division⁻¹, roughly 80-fold higher than in the germline. Is this a general feature of multicellular organisms other than hyphal fungi, or is it limited to just animals or to mammals only? How are lower mutation rates maintained in the germline? Does MMR play a part or is it only a matter of protection from insult exposure? More information is needed in other animals and multicellular fungi, plants, and stramenopiles (e.g., kelp).
Table 2 lists overall mutation rates in MMR-deficient cells. These come from baker's yeast, fission yeast (Schizosaccharomyces pombe), thale cress, roundworms (C. elegans), and an immortalized chicken cell line (Gallus gallus domesticus DT40). The mean rates have non-overlapping ranges: 0.23–0.91 Gbp⁻¹ division⁻¹ for MMR-proficient cells and 13–72 Gbp⁻¹ division⁻¹ for MMR-deficient cells. Correction efficiencies are remarkably consistent, ranging from 50- to 130-fold, despite disparate species, ploidies, cellular lineages (i.e., somatic versus germline), and methods for ablating MMR (see Table S1 for genotypes and notes). The correction efficiencies are bimodally distributed, with fission yeast, chicken cells, and diploid baker's yeast clustered at 51–57× and thale cress, roundworms, and haploid baker's yeast at 100–130×. Is this a coincidental artefact of the few systems studied? Regardless, these whole-genome rate measurements have clearly shown that MMR is highly efficient, repairing at least 98% of replication errors. Indeed, this is probably an underestimate (see "More Future Questions").
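The step from a fold correction efficiency to a percentage of errors repaired can be written out explicitly. This is a sketch under the assumption that every mutation arises from a replication error that MMR could correct; the symbols E, μ, and f are introduced here for illustration only.

```latex
E = \frac{\mu_{\mathrm{MMR}^{-}}}{\mu_{\mathrm{MMR}^{+}}},
\qquad
f_{\text{repaired}} = 1 - \frac{1}{E}
% Lower end of the reported range:  E = 50   =>  f = 1 - 1/50  = 0.980
% Upper end of the reported range:  E = 130  =>  f = 1 - 1/130 \approx 0.992
```

The lower bound of the range therefore sets the "at least 98%" figure, while the upper bound corresponds to roughly 99% of errors repaired.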

Genome-Wide Mutations and the Mechanisms of MMR
For long-lived organisms, reporter locus experiments are an inefficient way to collect mutations. For shorter-lived organisms, given the expense of whole-genome sequencing and the time required for mutation accumulation experiments (ideally hundreds of generations), why not use reporter loci? First, reporter loci do not adequately model the sequence complexity of the genome (as discussed above). Second, reporter loci cannot replicate the diversity of selective pressures across the genome. Both factors are essential for the study of MMR.
For example, the baker's yeast genome is GC-poor, but certain AT-rich features are concentrated outside of regions that are translated into proteins (like most reporter loci). AT homopolymer tracts, particularly long tracts, are concentrated in untranslated regions (UTRs) that flank most genes [40]. This leads reporter locus assays to underestimate the rates of deletions in long homopolymers and the rates of multi-base insertions and deletions (indels) [41]. Whole-genome mutation accumulations show that these regions become indel hotspots upon removal of MMR [40], with rates and indel sizes increasing with tract length [40,42]. In fact, the shape of the curve of rate versus tract length is diagnostic of the degree to which mismatch extension is favored over proofreading. Extension could be driven by a proofreading defect [43] or by alteration of nucleotide concentrations [44].
Unlike in yeast, AT homopolymers in humans are concentrated in genes, where cancer genomes indicate strong transcriptional strand asymmetry for indels [45,46]. Studies of tumors with Pol δ proofreading defects suggest that MMR repairs about threefold more mismatches produced during lagging strand replication compared with leading [45]. Massive studies of cancer genomes have allowed the construction of mutation spectrum signatures that are diagnostic of such processes as MMR [47,48]. Tumors with mutations in DNA polymerase (Pol) ε have mutation spectra that resemble spectra from cell lines with defects in both Pol ε and MMR [49]. This suggests that MMR is somehow suppressed in those tumors. Conversely, there appears to be a mutational hotspot in the gene that encodes the catalytic subunit of Pol ε in MMR-deficient mouse lymphomas [50]. Spectra in MMR-deficient chicken cells allowed Németh et al. to collapse six MMR-associated COSMIC signatures into two [15]. They found no correlation between these signatures and the identity of the defective MMR genes in the tumors (i.e., MSH2, MSH6, or MLH1). This suggests that either modulation of transcription or translation or some form of inhibition is to blame for the MMR defects in these tumors. This is a profound revelation, given that MMR-deficient cancers generate mutant neoantigens that make them sensitive to immune checkpoint blockade [51]. Thus, whole-genome mutation rate experiments may affect cancer diagnosis and treatment.
Whole-genome experiments have revealed that MMR preferentially protects many genome features. In baker's yeast, it protects UTRs and inter-nucleosome linkers from indels, translated gene bodies from point mutations, and sequence-encoded nucleosome positions from substitutions [40]. Much of this is recapitulated in thale cress [52], and in humans, MMR selectively protects exons relative to introns [53]. In fission yeast, MMR selectively protects euchromatin [54]. Baker's yeast strains have slightly higher rates in early as opposed to late replicating regions, with some indication of higher MMR efficiency early in replication [40]. Likewise, variable human MMR is thought to cause elevated mutation rates in late replicating heterochromatin compared to early replicating euchromatin [55]. Are MMR proteins depleted or in some other way impaired later in replication? In humans, some MMR proteins are differentially expressed across the cell cycle [56]. In mice, histone modifications can target MMR to transcriptionally active regions [57], both locally and globally [58]. The extent of targeting elsewhere and in other organisms is unknown. Unfortunately, though these trends point in the same direction, only a few of these studies report rates [15,40,52,54], making it difficult to compare effects across organisms in a quantitative manner.
Why does MMR appear to selectively protect some features over others? Perhaps the extent of MMR targeting, as in mice, is underappreciated. Alternatively, MMR may operate at a similar rate across each genome, but some contexts are simply more mutable. This would be expected if natural selection effectively erases mutations missed by MMR. Over evolutionary timescales, mutable sequences would disappear in regions under little selection. Depletion of MMR would then reveal the fingerprints of past selection (discussed in [40]).

Summary
Herein, we have gathered known whole-genome mutation rates, encompassing 90 studies (Table 3). We hope that future researchers will expand the list and use the information to uncover new insights into the patterns of mutagenesis across eukaryotes and beyond. We have also outlined some advances in the understanding of mutagenesis since the advent of whole-genome experiments. These advances reveal variation in eukaryotic DNA mismatch repair mechanisms that were invisible to most reporter locus assays. Further progress requires more breadth in the organisms, tissues, and conditions. In particular, new strains are required to uncover the interplay between mismatch repair and other nuclear systems, such as nucleotide pool maintenance, exonucleolytic proofreading, and ribonucleotide excision repair.
Table 3 notes: Experiments are arranged by publication date. A study with multiple measurements in the same species with the same MMR genotype is listed only once. More details about each measurement are available in Table S1. Abbreviations: MMR = DNA mismatch repair.

More Future Questions
In addition to questions throughout this review, others arise due to the following. MMR efficiency calculations presented here assume that all mutations are due to replication and are subject to mismatch repair. The veracity of these assumptions is an outstanding question. For instance, most spontaneous mutations in wild-type yeast could be due to mutagenic repair of spontaneous lesions [118], which may not be amenable to MMR. Indeed, 40-85% of mutations in the wild-type baker's yeast CAN1 reporter are attributable to errors made by DNA polymerase ζ [119–122]. Is this true across the genome, in other organisms, other conditions, or in various tissues? How much of the remaining wild-type mutation rate is due to other assumption-breaking processes? Is MMR dependent on other systems, such that a mutation that affects MMR also alters, say, polymerase proofreading or ribonucleotide excision repair, thus causing additional complicating mutagenesis? Until such questions are answered, all MMR efficiency calculations are likely to be minimum estimates and should be treated as provisional.
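To see why the efficiencies are likely minimum estimates, consider a toy model in which some fraction of wild-type mutations is not amenable to MMR. The sketch below is illustrative only: the function name is hypothetical, it assumes this background occurs at the same absolute rate with and without MMR, and it applies the CAN1 reporter range (40-85%) genome-wide purely for the sake of the example, using the yeast rates quoted earlier in this review (3.7 and 0.33 Gbp⁻¹ division⁻¹).

```python
def adjusted_efficiency(rate_deficient: float,
                        rate_proficient: float,
                        fraction_not_amenable: float) -> float:
    """MMR efficiency on MMR-amenable errors only.

    fraction_not_amenable: share of the wild-type (MMR-proficient) rate that
    MMR cannot act on. Assumed, for illustration, to contribute the same
    absolute rate in the MMR-deficient background.
    """
    if not 0.0 <= fraction_not_amenable < 1.0:
        raise ValueError("fraction_not_amenable must be in [0, 1)")
    background = fraction_not_amenable * rate_proficient
    return (rate_deficient - background) / (rate_proficient - background)


# Illustrative inputs: yeast rates from the text, hypothetical fractions.
results = {f: adjusted_efficiency(3.7, 0.33, f) for f in (0.0, 0.40, 0.85)}
for fraction, efficiency in results.items():
    print(f"non-amenable fraction {fraction:.0%}: {efficiency:.1f}-fold")
```

With no non-amenable background, the calculation recovers the naive 11.2-fold ratio; at 40% and 85%, the implied efficiency on amenable errors rises to roughly 18-fold and 69-fold, respectively. Under these assumptions, any non-amenable background makes the naive ratio an underestimate.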
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/cells10051224/s1, Table S1: Individual nuclear genome mutation rates from whole-genome experiments. Abbreviations: concat = concatenation of select columns; PMID = PubMed identification number; g = germline; s = somatic cells; w = wild type; m = MMR-deficient. Genome reference sizes represent NCBI Genome median assembly lengths, where available. Otherwise, they are median reference lengths from cited studies.
Funding: This study was supported by Project Z01 ES065070 to T.A.K. from the Division of Intramural Research of the NIH, NIEHS.

Conflicts of Interest: The authors declare no conflict of interest.