Mutation Rate Model Used in the DNA VIEW Program

: The ﬁrst problem considered in this paper is the problem of correctness of a mutation model used in the DNA VIEW program. To this end, we theoretically predict population allele frequency changes in time according to this and similar models (we determine the limit frequencies of alleles—they are uniformly distributed). Furthermore, we evaluate the speed of the above changes using computer simulation applied to our DNA database. Comparing uniformly distributed allele frequencies with these existing in the population (for example, using entropy), we conclude that this mutation model is not correct. The evolution does not follow this direction (direction of uniformly distributed frequencies). The second problem relates to the determination of the extent to which an incorrect mutation model can disturb DNA VIEW program results. We show that in typical computations (simple paternity testing without maternal mutation) this inﬂuence is negligible, but in the case of maternal mutation, this should be taken into account. Furthermore, we show that this model is inconsistent from a theoretical viewpoint. Equivalent methods result in different error levels.


Introduction
Nowadays, DNA polymorphism is widely used in genetic expertise for fixing biological paternity and consanguinity of people. Results of DNA profiling are based on probabilistic and statistical interpretation. This interpretation leads to obtaining the probability of paternity or another relationship determined by Bayesian analysis (likelihood ratio). This probability depends on three factors: (i) a priori probability (usually assumed to be equal to 0.5), (ii) allele frequencies (main factor), and (iii) mutation rate.
The last factor, i.e., mutation rate, is usually small, but in some situations essential. From a viewpoint of population, mutation is the primary generator of genetic diversity, whereas drift tends to reduce the genetic diversity. Most of the human genome has a very low mutation rate: for a typical nucleotide site ω = 2 × 10 −8 (the probability that a mutation will occur during gene transmission from a parent to a child). Short Tandem Repeat (STR) (also known as a microsatellite) mutation rates are even higher (approximately 1 or 2 mutations per locus per thousand generations). However, the key problem is that statistical analysis of the observed mutation cases is therefore almost impossible, because there is not a sufficient amount of observations (cf. [1,2]). Some observations suggest taking the Stepwise Mutation Model (SMM; cf. [3][4][5]) in which a mutant allele has either k − 1 or k + 1 repeats, each with a probability of 1 2 , where k is the current repeat number. The SMM is known to be false because the mutation rate increases with allele length and occasionally two-step mutations occur. In [6][7][8], other defects of the SMM are shown.

•
organisms are diploid; • only sexual reproduction occurs; • generations are nonoverlapping; • mating is random; • population size is infinitely large; • each sex has the same allele frequencies; • there is no migration, gene flow, admixture, mutation, and selection.
Despite these assumptions, as we show below, there exist mutation models (Model 4) in which the Hardy-Weinberg hypothesis holds. Given these assumptions, the population's genotypes and allele frequencies remain unchanged over successive generations, and the population is said to be in the Hardy-Weinberg equilibrium (cf. [11]). In practice, if we consider loci with {A 1 , A 2 , . . . A k } possible alleles and we observe the frequencies of a pair of alleles (A i , A j ) in the population P i,j , 1 ≤ i ≤ j ≤ k, then the good approximation (as the right-hand side values are estimated only from our database sample, and, as a consequence, these values in the whole population can slightly differ) of allele frequencies is The Hardy-Weinberg equilibrium can be stated as The above approximation is usually (cf. [12] Chapter 3 and [13] Section 3.2) verified by Pearson's χ 2 goodness-of-fit test (cf. [14,15], here arises the problem of rare alleles whose frequencies must be bound together), Fisher's exact test [16][17][18], or the likelihood ratio test or permutation test (known as the G-test, cf. [19]). Large deviations from the Hardy-Weinberg equilibrium are a consequence of errors in the STR amplification procedure, whereas the small ones follow from the departure from the above assumptions. In most genetic laboratories, the interpretation of forensic investigations is obtained as a result of computations of Brenner's DNA VIEW computer program [20,21]. In this paper, we concentrate on the considerations of the consequences of the mutation model used in this program which is a generalization of the SMM model in three aspects: 1.
The problem of satisfying the Hardy-Weinberg hypothesis and some mathematical inconsistency as a consequence of the assumed mutation model 2.
Paternity testing application. We show that in some cases the assumed different mutation models affect different paternity results (exclusion or acceptance of a putative father).

3.
The influence of the assumed mutation model on the long-time simulations of a population's behavior.
The aspect of satisfying the Hardy-Weinberg hypothesis is essentially theoretical, whereas the other two are numeric. The considered models are described in Section 2; the materials and methods are described in Section 3; Sections 4-6 are devoted to Problems 1, 2, and 3, respectively; whereas the last section presents conclusions.
Our considerations are similar to those presented in [22]. The main differences are (i) they consider similar but not the same mutation models; (ii) they state nonstationarity on the basis of the computation of allele frequencies in the consecutive generations, whereas we give a mathematical proof of this assertion; (iii) they state some problems in the case of paternity testing, we evaluate them; (iv) we focus on some mathematical inconsistency, which they do not deal with; and (v) we consider the behavior of a population under the assumed mutation model in the long distance of sequential generations, while they do not consider this problem.
Thus, our approach is wider and deeper than that taken in [22].

Materials and Methods
All computations were performed using large database of allelic DNA frequencies obtained from the Medical University of Lublin. This database contains DNA results of 7076 individuals, including personal information such as names, dates, places of birth etc. From this database we took 4410 unrelated persons and computed the distribution of allele frequencies of this sample (for example Table 1 presents the frequencies for the D5S818 locus). Three-thousand-nine-hundred-and-seventy-three individuals were typed using the classical electrophoresis method and 437 by NGS (next generation sequencing). All individuals were genotyped within 17 STR (short tandem repeat) markers located on the autosomal chromosomes. The techniques of DNA amplification are well described in Butler's book [23] in Section 1.  The loci of 3973 individuals were amplified using commercial kits: PowerPlex 16Hs System and PowerPlex 17ESX System (Promega Corporation, Madison, WI, USA). PCR products were separated and detected on the ABI 3130 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). Data were analyzed with the Gene Mapper ID software (v3.2, Applied Biosystems, Foster City, CA, USA, 2005). Laser-induced fluorescence was used in CE systems (Capillary Electrophoresis) with detection (by sensors) as low as 10 −18 to 10 −21 mol. The sensitivity of the techniques is attributed to the high intensity of incident light and the ability to accurately focus the light on the capillary. Multi-color fluorescence detection can be achieved by including multiple dichroic mirrors and bandpass filters to separate the fluorescence emission among multiple detectors (e.g., photomultiplier tubes) or by using a prism or grating to project spectrally resolved fluorescence emission onto a position-sensitive detector (sensor) such as a CCD (Charge Coupled Device) array. CE systems with 4-and 5-color LIF (Laser-Induced Fluorescence) detection systems are used routinely for capillary DNA sequencing and genotyping (DNA fingerprint) applications.
The 437 individuals were profiled by next-generation sequences (NGS). Libraries were prepared using the ForenSeq DNA Signature Prep. Kit (Illumina) according to the manufacturer's protocol. Sequencing was performed on MiSeq FGx using the Miseq FGx Reagent Kit v.3 (600 cycles) and the ForenSeq Universal Analysis Software package. Following cluster generation, clusters were imaged using LED and filter combinations specific to each of four fluorescently labeled dideoxynucleotides. After the imaging of a tile was complete, the flow cell was moved into place to expose the nest tile. The process was repeated for each cycle of sequencing. Following image analysis, the software performed base calling, filtering, and quality scoring.
All amplification results were stored in a database. All computations were performed in the R-3.6.0 language on a RSTUDIO platform (v1.2.1335, RStudio, Boston, MA, USA, 2019).
The starting point for our considerations is the following allele frequencies table (for example, for locus D5S818).
Here, and in what follows, N denotes the number of alleles; n i denotes the number of gene repetitions; i denotes the number of alleles; and ω M , ω F , ω, v i denote the maternal, paternal, and overall mutation rate and the frequency of i-th allele, respectively. It is worthwhile to remark that n i is not a number but rather a name (17.3 means, 17 full repetitions of gene, and 3 additional pairs of alkali), but throughout the paper, as in [20,21], we will consider n i as a number.

Mutation Models
Throughout this paper, the notation (i → j) means allele i was given by parents and arises as allele j in descendant. Although the maternal and the paternal mutations are generally different, we do not take this aspect into account, therefore we do not specify how the parent gives allele i.
Some assumptions (usually assumed) on equal chance of every allele to be transmitted are fulfilled too. Depending on whether mutation takes place or not, the following formula is used to describe the transmission of alleles, where ω i,j , 1 ≤ i, j ≤ N indicates how frequently i mutates into j, provided that i mutates. The different mutation models are derived from a different evaluation of ω i,j , 1 ≤ i, j ≤ N. The value ω i,i means that there was mutation, but the number of gene repetition does not change (it is almost not possible to determine this value from observations). Historically, two models were considered: Model 1.
In the description of the DNA VIEW program (Mutation model implemented in DNA VIEW [20]), one can read the following.

Rule A.
Therefore, as a rule of thumb, I suggest assuming that 50% of all mutations increase by one step 50% decrease by one step 5% increase by two steps 5% decrease by two steps 0.5% increase by three steps 0.5% decrease by three steps ... etc.
Never mind that these numbers add to more than 100%.
This description leads us to the following model of mutation.

Model 3.
where Previously, in the case of Variable Number of Tandem Repeat (VNTR) investigations, the most popular model of mutation was Model 4.
What assumptions should the mutation model satisfy? At first, we remark that a correctly defined mutation model should be such that because the i-th allele must be transmitted "to somebody". On the other hand, the second condition deals with the allele frequencies in parent and child populations. At first, allele j has the frequencies ν j , whereas after transmission, it has the frequency ∑ The last condition is a formulation of the Hardy-Weinberg hypothesis described in the Introduction.
Using (3), these conditions can be formulated in an equivalent form: Furthermore, it is easy to check that in some models (1 and 4) the computed total mutation is not equal to the observed ω. In Model 1 it is (1 − 1 N )ω and for the i-th allele in Model 4 (1 − ν i )ω (instead of ω in both cases). This is a small disturbance in the observed mutation rate.

Hardy-Weinberg Equilibrium
In this section, we consider how the above mutation models satisfy conditions I and II (Hardy-Weinberg hypothesis).
Let us define a matrix then rank(A N ) = N − 1.
Proof of Theorem 1. Let A k , 1 ≤ k ≤ N denote the k-th main minor of the matrix A N . Assume contrary that det(A k ) = 0 for some 1 ≤ k < N (cf. Exercise 11 (e) p. 48 [25]). It means that there exist numbers λ 1 , λ 2 , ..., λ k−1 such that and where v i denotes the i-th column of the matrix A k . It can be rewritten (after renaming rows and columns in the minor) precisely as Among the numbers λ 1 , λ 2 , . . . , λ k−1 there are positive and non-positive ones. By I 1 we denote a set of positive numbers. This set is nonempty. Indeed, if I 1 = ∅, then summing up both sides of (16), we get and on the left-hand side we have a negative number, whereas on the right hand side it is non-negative. Let t be an index of the maximal λ i (if there is more than one such index, we take the smallest one), Then, taking the t-th equation in (16), we get and we have a positive number on the left-hand side of the above inequality and a non-positive number on its right-hand side. Thus, this contradiction leads us to det(A k ) = 0, for every 1 ≤ k < N. Now we prove that det(A N ) = 0. It follows from assumption (iii) which can be rewritten as The same result can be obtained if we multiply columns instead of rows.

Corollary 2.
Mutation models 1-3 satisfy Condition I, but Condition II is only satisfied when 1≤i≤N ν i = 1 N .

Mutation model 4 always satisfies both conditions.
Proof of Corrollary 2. Let us consider the following matrices. and Now, if v o denotes the vector of allele frequencies of parents' population, then Condition II (Hardy-Weinberg hypothesis) can be rewritten as where I N denotes the N × N unit matrix. From Corollary 1 we have rank(A Nv N =v N , k = 1, 2, 3, thus it is a unique solution (among the vectors with non-negative coordinates such that ||x|| l 1 = 1).

Theorem 2. Let us define the sequence of vectors
where v o denotes the vector of starting allele frequencies in some loci, N is the number of alleles, and i = 1, 2, 3 is the selected model of mutations (v k denotes the frequencies of alleles in the k-th population applying the i-th model of mutation). Then, Proof of Theorem 2. Let us consider the Banach space N with the norm ||x|| l 1 = ∑ N i=1 |x i |. Furthermore, let K = {x ∈ N : ||x|| l 1 = 1, 0 ≤ x i ≤ 1, i = 1, 2, 3, ..., N} be a compact, separable subset of the space N . Obviously, the fixed point of the mapping T i on K is the same as the fixed point of A (i) N and from Corollary 3 this point isv N . The mapping T i is not a contraction because ||T 1 || = ||A (i) N || = 1, as the matrix T i consists of non-negative numbers such that their sums are equal to 1 in each row and each column. The norm is achieved for an arbitrary vector x with ||x|| l 1 = 1 and all non-negative or all non-positive coordinates. Thus, we cannot apply the Banach Contraction Theorem. However, Theorem 9.4 (and Corollary 9.1) in [28] considers the matrix such as T i . As a conclusion we get a thesis.
More information about the determinants and algebra of matrices can be found in [29,30].

Mutation Models and Paternity Testing
Let us denote unordered pairs P = {p 1 , p 2 }, M = {m 1 , m 2 }, D = {d 1 , d 2 } of alleged paternal, maternal, and child's STR profile, respectively, (V = {P, D, M}). As usual in paternity testing, we consider two hypotheses: Hypothesis 1 (H 1 ). Man P is a father of D assuming that M is a mother of D.

Hypothesis 2 (H 2 ).
A man different than P is a father of D assuming that M is a mother of D.

Now, from the Bayesian formula we get
and by putting a priori probabilities equal to 0.5 we obtain (for more details of the computations see [10,23,31,32]). However, the computed probability depends on the mutation model (we will write P i [V|H 1 ], i = 1, 2, 3, 4). Thus, we compute The frequency of each result V is a frequency of alleles P and M and from Mendel's law, the probability of achieving from such a pair of parents the child's profile D (it also depends on a mutation, but the differences are negligible). We consider two different situations:

•
Maternal mutation situation, i.e., when {m 1 , We show the maximal values of ρ i,j in Table 2 and the distribution of ρ i,j in Figure 1 (the differences in the case without maternal mutation between all models and in the case when maternal mutation takes place between Models 1 and 2 were lower than 0.02, thus negligible). Table 2. Maximal difference of probability of paternity values between i-th and j-th model (ρ i,j ) in two cases: with maternal mutation and without it, at the D5S818 locus. For example, the maximal value 0.9985 was achieved for P = (15, 15), D = (7, 15), M = (14,14). In this case, we have the overall mutation rate ω = 0.001, and the frequencies of alleles 7, 14, and 15 are equal to 0.002718, 0.009399, and 0.000566, respectively. Thus, we have the Table 3. Table 3. Probability that we achieve DNA profiles V, providing man P is a father of D (P[V|H 1 ]), probability that we achieve DNA profiles V, providing P is not a father of D (P[V|H 2 ]), and probability that P is a father of D (P[H 1 |V]) in the case of (P, D, M) = ((15, 15), (7,15), (14,14)) at locus D5S818. The above result follows from the fact that in the case of Model 3, allele 15 is significantly more likely to be a mutation of allele 14 than a transfer from a putative father.

Value
Finally, we verify some remarks on the inconsistency of theoretical equations and their practical application. Let P(p, s) be the probability that two alleles p and s are transferred from the same allele (they are brothers). Then, we have but, on the other hand, we have In the formula (31), the assumed mutation model plays an important role, whereas in (30), only the overall mutation rate ω is important. The maximal differences for locus D5S818 between the values computed using these formulas for different p and s are equal to 0.000172, 0.000164, and 0.000123, and for the same p = s = 12, are equal to 0.000197, 0.000271, and 0.000271 for Models 1, 2, and 3, respectively. Model 4 always gives an error of 0. All of them are small, smaller than the mutation rate. Which equation should be chosen by the investigator? With very complicated computations, this difference may become considerable. What about other similar situations?

Speed of Convergence of Allele Frequencies
Obviously, the speed of convergence depends on the matrix A (i) N , i = 1, 2, 3, the starting point v o , and the number of alleles. We measure the speed of convergence by the entropy of frequencies in a population's sequential generations. By entropy we denote where for i = 1, 2, 3. The speed of convergence is described by an approximation in a mean-square sense of behavior of entropy in sequential generations. Another measure of distance between two probability measures is the so-called Kolmogorov distance. For frequencies of sequential generations, we define the Kolmogorov distance between the k-th generation and the 1st generation: and investigate the behavior of the sequence K (i) k . The third measure of distance of probability distribution is called a similarity index. We investigate The behavior of the investigated values for one typical locus D3S1358 is presented in the following Figure 2. k for i = 1, 2, 3, the best in a mean-square sense is the logarithmic function: The optimal values for H k belong to the intervals a ∈ (0.097(VWA), 0.197(FESFPS)), b ∈ (2.2(TPOX), 4.156(FESFPS)), for K k to a ∈ (0.021(TH01), 0.062(CSF1PO)), b ∈ (0.099(VWA), 0.243(D5S818)), and for I k to a ∈ (−0.107(CSF1PO), −0.037(TH01)), b ∈ (0.460(FGA), 0.824(VWA)). This means that after about 8000 generations, we obtain almost uniformly distributed allele frequencies.

Conclusions
(i) The mutation model assumed in the DNA VIEW program is not correct. After a long-time simulation (~8000 generations), this model results in almost uniformly distributed frequencies (the value of entropy in the D3S1358 locus case is over 93% its maximal value, much more than the value observed in human population, and the trend is still growing, however much slower than in earlier generations) of alleles but in reality no such uniform distribution at any locus can be observed. The uniformly distributed allele frequencies give the maximal entropy, but in reality, the entropy at different loci is smaller than the maximal one (between 62.9% of the maximal entropy in the case of LPL and 87.9% in the case of Penta E).
(ii) The speed of convergence H k is directly proportional to the number of allele N and inversely proportional to the difference between the uniform distribution and the distribution of ν (o) (expressed by H 1 ). A statistical analysis gives the formula H ∞ − H (3) k = 0.33923 + 0.013062 × N − 0.138466 × H 1 (p = 0.0146 for N and p = 0.00229 for H 1 ). In the case of the Kolmogorov distance and indices of similarity, this dependence is not statistically significant. (iii) Among the considered models: Models 1, 2, and 3, Model 3 is the best. However, from a viewpoint of the Hardy-Weinberg hypothesis, Model 4 is the best. On the other hand, Models 1 and 4 "change" the observed mutation rate; however, this change is very small. Model 3 is preferred from the viewpoint of genetic observation. (iv) The choice of model taken for paternity testing has an essential influence in the case of maternal mutation. The only cause why it is hard to detect is that maternal mutations occur rarely. However, if they do, the differences can be enormous, as it was shown in Table 3 (P[H 1 |V] value 0.0003674 in Model 3 suggests that the man almost sure is not a father while values over 0.9967 in other models contradict this statement). In this case, we may get a different conclusion dependent on the taken mutation model (the defendant is accepted or excluded as a father). (v) By choosing a model different than Model 4, the researcher can omit the difference between two theoretically identical equations (see (30) and (31)). From a theoretical viewpoint, this is a significant lack of cohesion.
Funding: This research received no external funding.