# Analytic Combinatorics for Computing Seeding Probabilities

## Abstract

## 1. Introduction

## 2. Related Work

## 3. Background

#### 3.1. Weighted Generating Functions

**Definition 1.**

**Example 1.**

**Proposition 1.**

**Proof.**

**Example 2.**

#### 3.2. Transfer Graphs and Transfer Matrices

**Definition 2.**

**Remark 1.**

**Theorem 1.**

**Proof.**

#### 3.3. Asymptotic Estimates

**Theorem 2.**

**Lemma 1.**

**Proof of Lemma 1.**

**Proof of Theorem 2.**

**Corollary 1.**

**Theorem 3.**

**Proof.**

## 4. Results

#### 4.1. Reads, Error Symbols and Error-Free Intervals

#### 4.2. Substitutions Only

**Proposition 2.**

**Example 3.**

#### 4.3. Substitutions and Deletions

**Proposition 3.**

**Example 4.**

#### 4.4. Substitutions, Deletions and Insertions

**Remark 2.**

**Proposition 4.**

**Example 5.**

#### 4.5. Accuracy of the Approximations

## 5. Discussion

## Acknowledgments

## Conflicts of Interest

## Appendix A

## References

1. Reuter, J.A.; Spacek, D.V.; Snyder, M.P. High-throughput sequencing technologies. Mol. Cell **2015**, 58, 586–597.
2. Quilez, J.; Vidal, E.; Dily, F.L.; Serra, F.; Cuartero, Y.; Stadhouders, R.; Graf, T.; Marti-Renom, M.A.; Beato, M.; Filion, G. Parallel sequencing lives, or what makes large sequencing projects successful. Gigascience **2017**, 6, 1–6.
3. Li, H.; Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. **2010**, 11, 473–483.
4. Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998.
5. Sun, Y.; Buhler, J. Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinform. **2006**, 7, 133.
6. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. **1990**, 215, 403–410.
7. Karlin, S.; Altschul, S.F. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA **1993**, 90, 5873–5877.
8. Karlin, S.; Altschul, S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA **1990**, 87, 2264–2268.
9. Ferragina, P.; Manzini, G. Opportunistic Data Structures with Applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA, 12–14 November 2000; pp. 390–398.
10. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics **2009**, 25, 1754–1760.
11. Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. **2009**, 10, R25.
12. Flajolet, P.; Odlyzko, A. Singularity analysis of generating functions. SIAM J. Discrete Math. **1990**, 3, 216–240.
13. Flajolet, P.; Sedgewick, R. An Introduction to the Analysis of Algorithms, 2nd ed.; Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1996.
14. Flajolet, P.; Sedgewick, R. Analytic Combinatorics, 1st ed.; Cambridge University Press: New York, NY, USA, 2009.
15. Lladser, M.E.; Betterton, M.D.; Knight, R. Multiple pattern matching: A Markov chain approach. J. Math. Biol. **2008**, 56, 51–92.
16. Fu, J.C.; Koutras, M.V. Distribution Theory of Runs: A Markov Chain Approach. J. Am. Stat. Assoc. **1994**, 89, 1050–1058.
17. Regnier, M.; Kirakossian, Z.; Furletova, E.; Roytberg, M. A word counting graph. In London Algorithmics 2008: Theory and Practice (Texts in Algorithmics); Chan, J., Daykin, J.W., Sohel, M., Eds.; Rahman London College Publications: London, UK, 2009; p. 31.
18. Nuel, G. Pattern Markov Chains: Optimal Markov Chain Embedding Through Deterministic Finite Automata. J. Appl. Prob. **2008**, 45, 226–243.
19. Nuel, G.; Delos, V. Counting Regular Expressions in Degenerated Sequences Through Lazy Markov Chain Embedding. In Forging Connections between Computational Mathematics and Computational Geometry: Papers from the 3rd International Conference on Computational Mathematics and Computational Geometry; Chen, K., Ravindran, A., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 235–246.
20. Chaisson, M.J.; Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): Application and theory. BMC Bioinform. **2012**, 13, 238.
21. Joyal, A. Une théorie combinatoire des séries formelles. Adv. Math. **1981**, 42, 1–82.
22. Bona, M. Handbook of Enumerative Combinatorics; CRC Press: Boca Raton, FL, USA, 2015.
23. Flajolet, P.; Gardy, D.; Thimonier, L. Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-organizing Search. Discrete Appl. Math. **1992**, 39, 207–229.
24. Pemantle, R.; Wilson, M.C. Analytic Combinatorics in Several Variables; Cambridge University Press: New York, NY, USA, 2013.
25. Bender, E.A. Asymptotic Methods in Enumeration. SIAM Rev. **1974**, 16, 485–515.
26. Nakamura, K.; Oshima, T.; Morimoto, T.; Ikeda, S.; Yoshikawa, H.; Shiwa, Y.; Ishikawa, S.; Linak, M.C.; Hirai, A.; Takahashi, H.; et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. **2011**, 39, e90.
27. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2015.

**Figure 1.** Read as a sequence of symbols. Reads consist of correct nucleotides (white boxes), substitutions (black boxes), deletions (vertical bars) and insertions (grey boxes). A single deletion can correspond to several missing nucleotides.

**Figure 2.** Read as a sequence of error-free intervals or error symbols. Consecutive correct nucleotides can be lumped together in error-free intervals. The representation of a read as a sequence of either error-free intervals or error symbols is unique.

**Figure 3.** Transfer graph of reads with uniform substitutions. Reads are viewed as sequences of error-free intervals (symbol ${\Delta}_{0}$) or substitutions (symbol S). $F\left(z\right)$ and $pz$ are the weighted generating functions of error-free intervals and individual substitutions, respectively. The head vertex is represented as a small white circle, and the tail vertex as a small black circle.
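The transfer graph described in this caption can be turned into a small computation. The following is a minimal sketch, not the paper's own code, under two assumptions that are not stated in the caption: a seed is an error-free interval of at least $\gamma$ nucleotides, and for seedless reads $F(z)$ is truncated to $qz + \dots + (qz)^{\gamma-1}$ with $q = 1-p$. Under these assumptions, alternating error-free intervals and substitutions gives the rational function $(1+F(z))/(1-pz(1+F(z)))$, whose coefficient of $z^k$ is the probability that a read of size $k$ contains no seed. The function name `seedless_coefficients` is hypothetical.

```python
def seedless_coefficients(p, gamma, kmax):
    """Return [s_0, ..., s_kmax], where s_k is the coefficient of z^k in
    N(z) / D(z), with N(z) = 1 + F(z) and D(z) = 1 - p*z*(1 + F(z)).

    Assumptions (not taken from the source text): substitutions occur
    independently with probability p at each position, and F(z) is the
    truncated series q*z + ... + (q*z)^(gamma-1), with q = 1 - p.
    """
    if not 0.0 <= p <= 1.0 or gamma < 1 or kmax < 0:
        raise ValueError("need 0 <= p <= 1, gamma >= 1 and kmax >= 0")
    q = 1.0 - p
    coefficients = []
    for k in range(kmax + 1):
        # Contribution of the numerator N(z): q^k for k < gamma, else 0.
        value = q ** k if k < gamma else 0.0
        # D(z) * S(z) = N(z) gives the linear recurrence
        # s_k = n_k + sum_{i < gamma} p * q^i * s_{k-1-i}.
        for i in range(min(gamma, k)):
            value += p * q ** i * coefficients[k - 1 - i]
        coefficients.append(value)
    return coefficients


if __name__ == "__main__":
    s = seedless_coefficients(p=0.10, gamma=17, kmax=100)
    # Probability that a read of size 100 contains a seed, under the
    # assumptions stated above.
    print(1.0 - s[100])
```

Expanding the series by recurrence avoids any symbolic algebra: each coefficient depends only on the previous $\gamma$ coefficients.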

**Figure 4.** Example estimates for substitutions only. The analytic combinatorics estimates of Proposition 2 are benchmarked against random simulations. Shown on both panels are the probabilities that a read of given size contains a seed, either estimated by 10,000,000 random simulations (dots), or by Proposition 2 (lines). The curves are drawn for $\gamma =17$ and $p=0.08$, $p=0.10$ or $p=0.12$ (from top to bottom).
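The kind of benchmark described in this caption can be sketched as follows. This is an illustration only, assuming that a seed is a run of at least $\gamma$ consecutive correct nucleotides and that each nucleotide is substituted independently with probability $p$. It compares a random simulation with an exact dynamic-programming value, not with the formula of Proposition 2, which is not reproduced here. The function names are hypothetical, and the number of trials is far smaller than the 10,000,000 used for the figure.

```python
import random


def seed_probability_exact(k, p, gamma):
    """Exact probability that a read of size k contains a seed, tracking the
    length of the current run of correct nucleotides (0 .. gamma-1)."""
    q = 1.0 - p
    state = [0.0] * gamma  # state[r]: no seed so far, current run length r
    state[0] = 1.0
    for _position in range(k):
        new_state = [0.0] * gamma
        new_state[0] = p * sum(state)  # a substitution resets the run
        for r in range(gamma - 1):
            new_state[r + 1] = q * state[r]  # a correct nucleotide extends it
        # Mass q * state[gamma - 1] reaches run length gamma: a seed exists.
        state = new_state
    return 1.0 - sum(state)


def seed_probability_simulated(k, p, gamma, trials, rng):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _trial in range(trials):
        run = 0
        for _position in range(k):
            if rng.random() < p:
                run = 0
            else:
                run += 1
                if run >= gamma:
                    hits += 1
                    break
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(2018)  # fixed seed so that the run is repeatable
    for p in (0.08, 0.10, 0.12):
        exact = seed_probability_exact(100, p, 17)
        simulated = seed_probability_simulated(100, p, 17, 100_000, rng)
        print(p, exact, simulated)
```

The exact value costs on the order of $k\gamma$ operations, so it is a convenient reference when checking a simulation of this type.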

**Figure 5.** Transfer graph of reads with uniform substitutions and deletions. Reads are viewed as sequences of error-free intervals (symbol ${\Delta}_{0}$) or substitutions (symbol S). Deletions are implicitly represented by the fact that an error-free interval can follow another one if a deletion is present in between. $F\left(z\right)$ and $pz$ are the weighted generating functions of error-free intervals and individual substitutions, respectively. $\delta F\left(z\right)$ is the weighted generating function of a deletion followed by an error-free interval. The head vertex is represented as a small white circle, and the tail vertex as a small black circle.

**Figure 6.** Example estimates for substitutions and deletions. The analytic combinatorics estimates of Proposition 3 are benchmarked against random simulations. Shown on both panels are the probabilities that a read of given size contains a seed, either estimated by 10,000,000 random simulations (dots), or by Proposition 3 (lines). The curves are drawn for $\gamma =17$, $p=0.05$ and $\delta =0.14$, $\delta =0.15$ or $\delta =0.16$ (from top to bottom).

**Figure 7.** Transfer graph of reads under the full error model. Reads are viewed as sequences of error-free intervals (symbol ${\Delta}_{0}$), substitutions (symbol S) or insertions (symbol I). The body of the transfer graph is shown on the left, and the head and tail edges are shown on the right. $F\left(z\right)$ and $pz$ are the weighted generating functions of error-free intervals and individual substitutions, respectively. $\delta F\left(z\right)$ is the weighted generating function of a deletion followed by an error-free interval. $rz$ and $\tilde{r}z$ are the weighted generating functions of the first and all subsequent insertions of a burst, respectively. $(1-\tilde{r})F\left(z\right)/(1-r)$ is the weighted generating function of an error-free interval following an insertion and $(1-\tilde{r})pz/(1-r)$ is the weighted generating function of a substitution following an insertion. The head vertex is represented as a small white circle, and the tail vertex as a small black circle.

**Figure 8.** Example estimates for substitutions, deletions and insertions. The analytic combinatorics estimates of Proposition 4 are benchmarked against random simulations. Shown on both panels are the probabilities that a read of given size contains a seed, either estimated by 10,000,000 random simulations (dots), or by Proposition 4 (lines). The curves are drawn for $\gamma =17$, $p=0.05$, $\delta =0.15$, $\tilde{r}=0.45$ and $r=0.04$, $r=0.05$ or $r=0.06$ (from top to bottom).

**Figure 9.** Example worst case for seeding with substitutions only. The analytic combinatorics estimates of Proposition 2 are benchmarked against 10,000,000 random simulations. Shown on both panels are the probabilities that a read of given size contains a seed of size $\gamma =17$, either estimated by random simulations (dots), or by Proposition 2. The curves are drawn for $p=0.005$, $p=0.010$ or $p=0.015$ (from top to bottom). The largest difference between the estimates and the simulations is around $0.015$.

**Figure 10.** Accuracy of the approximations in the uniform substitution model. The estimates of Proposition 2 were computed for $\gamma =17$ and different values of the substitution rate p ranging from $0.01$ to $0.50$ and for different read sizes k ranging from 15 to $10,000$. The reference target value was computed by recurrence through expression (12) and the relative error between the two terms was computed. All calculations were done in double precision arithmetic using R [27].
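An accuracy check of this type can be sketched as follows. This is a hedged illustration in Python rather than R, and it does not reproduce Proposition 2 or expression (12), neither of which appears in this text. It assumes that a seed is a run of at least $\gamma$ correct nucleotides, takes the seedless generating function to be $N(z)/D(z)$ with $N(z) = \sum_{i<\gamma} (qz)^i$ and $D(z) = 1 - pzN(z)$, and applies the standard first-order estimate for rational functions, $[z^k]\,N(z)/D(z) \approx -N(z_1) / \left(D'(z_1)\, z_1^{k+1}\right)$, where $z_1$ is the smallest positive root of $D$. The relative error is measured on the probability that the read contains no seed, which is a choice made for this sketch. All function names are hypothetical.

```python
def denominator(z, p, gamma):
    """D(z) = 1 - p*z*(1 + q*z + ... + (q*z)^(gamma-1)), with q = 1 - p."""
    q = 1.0 - p
    return 1.0 - p * z * sum((q * z) ** i for i in range(gamma))


def denominator_derivative(z, p, gamma):
    """D'(z) = -p * sum_{i < gamma} (i + 1) * (q*z)^i."""
    q = 1.0 - p
    return -p * sum((i + 1) * (q * z) ** i for i in range(gamma))


def dominant_root(p, gamma):
    """Smallest positive root of D, found by bisection.
    D(1) = q^gamma > 0 and D decreases on z > 0, so the root exceeds 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("need 0 < p < 1")
    lo, hi = 1.0, 2.0
    while denominator(hi, p, gamma) > 0.0:
        hi *= 2.0
    for _step in range(200):
        mid = 0.5 * (lo + hi)
        if denominator(mid, p, gamma) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def seedless_asymptotic(k, p, gamma):
    """First-order estimate of the probability of no seed in a read of size k."""
    q = 1.0 - p
    z1 = dominant_root(p, gamma)
    numerator = sum((q * z1) ** i for i in range(gamma))
    constant = -numerator / denominator_derivative(z1, p, gamma)
    return constant / z1 ** (k + 1)


def seedless_exact(k, p, gamma):
    """Reference value by recurrence on the current run of correct nucleotides."""
    q = 1.0 - p
    state = [0.0] * gamma
    state[0] = 1.0
    for _position in range(k):
        new_state = [0.0] * gamma
        new_state[0] = p * sum(state)
        for r in range(gamma - 1):
            new_state[r + 1] = q * state[r]
        state = new_state
    return sum(state)


def relative_error(k, p, gamma):
    exact = seedless_exact(k, p, gamma)
    return abs(seedless_asymptotic(k, p, gamma) - exact) / exact


if __name__ == "__main__":
    for k in (20, 50, 100, 1000):
        print(k, relative_error(k, 0.10, 17))
```

Bisection is used here only because it needs no external library; any root finder would serve the same purpose.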

© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Filion, G.J.
Analytic Combinatorics for Computing Seeding Probabilities. *Algorithms* **2018**, *11*, 3.
https://doi.org/10.3390/a11010003
