# Secondary Structure Libraries for Artificial Evolution Experiments


## Abstract


## 1. Introduction

^{6}-fold (and sometimes much more) have also been described [7]. Because nucleic acids (especially DNA molecules) are relatively inexpensive to synthesize and easy to work with, for some applications they represent promising alternatives to proteins. Examples include the use of aptamers as artificial antibodies [8], allosterically regulated ribozymes and deoxyribozymes as sensors [9,10], fluorescent aptamers as genetic reporters [11,12,13], and RNA-cleaving ribozymes and deoxyribozymes that recognize substrates by base pairing as artificial nucleases [14].

^{15} random sequences and purifying rare variants with a desired biochemical function by iterative cycles of selection and amplification (Figure 1A). Most selection experiments use libraries containing at least 40 randomized positions. The number of possible RNA or DNA sequences of this length is many orders of magnitude larger than 10^{15}. For example, the number of possible variants of a 40-nucleotide sequence is 4^{40} = 1.2 × 10^{24}. This means that functional sequences identified in an initial selection experiment are unlikely to include the most active variants of the motif. More efficient variants can typically be identified by generating a second library by randomly mutating a single example of the motif (usually at a rate of 15% to 25% per position) and performing another selection experiment (Figure 1A) [15,16]. However, such variants are still unlikely to represent global optima, because only a small fraction of the possible sequences with the secondary structure of the motif will have been present in either the initial random sequence library or the library used in the reselection. One way to appreciate this point is to consider the probability of obtaining variants of a secondary structure in a randomly mutagenized library in which all base pairs differ from those present in the starting sequence (see also [16,17]). For instance, in a library generated by randomly mutating a single variant of a motif made up of canonical base pairs (A-U, U-A, C-G, or G-C) at a standard rate of 20% per position, the probability of obtaining a canonical pair that differs from that present in the starting sequence is 0.013. For a secondary structure with 15 canonical pairs, the probability of obtaining a variant in which each of these pairs has changed to another canonical pair is therefore 0.013^{15} = 7.5 × 10^{−29}. When the original base pairs can also change to G.U or U.G wobble pairs, this probability is 0.071^{15} = 6.0 × 10^{−18}, which is considerably higher but still extremely low. These probabilities indicate that the 3^{15} = 1.4 × 10^{7} possible variants of the secondary structure in which each of the 15 base pairs present in the original variant has changed to a different canonical pair, or the 5^{15} = 3.1 × 10^{10} possible variants in which the original pair has changed to either a canonical or wobble pair, will be poorly represented, even in large randomly mutagenized libraries.
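The per-pair probabilities quoted in this paragraph can be checked with a short calculation. The sketch below assumes that each position mutates independently at the stated rate and that a mutated position is equally likely to become any of the three other nucleotides; the function names `base_prob` and `changed_pair_prob` are illustrative and do not come from the paper.

```python
from itertools import product

BASES = "ACGU"
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
WOBBLE = {("G", "U"), ("U", "G")}


def base_prob(start, end, rate):
    """Probability that a position starting as `start` reads `end` in the library.

    Assumption: a mutated position becomes each other nucleotide with equal chance.
    """
    return 1.0 - rate if start == end else rate / 3.0


def changed_pair_prob(start_pair, rate, allowed):
    """Probability that `start_pair` becomes a different pair drawn from `allowed`."""
    total = 0.0
    for new_pair in product(BASES, repeat=2):
        if new_pair in allowed and new_pair != start_pair:
            total += base_prob(start_pair[0], new_pair[0], rate) * base_prob(
                start_pair[1], new_pair[1], rate
            )
    return total


rate = 0.20  # 20% mutagenesis per position
p_canonical = changed_pair_prob(("A", "U"), rate, CANONICAL)
p_with_wobble = changed_pair_prob(("A", "U"), rate, CANONICAL | WOBBLE)

print(f"per pair, canonical only:       {p_canonical:.3f}")
print(f"per pair, canonical or wobble:  {p_with_wobble:.3f}")
print(f"15 pairs, canonical only:       {p_canonical ** 15:.1e}")
print(f"15 pairs, canonical or wobble:  {p_with_wobble ** 15:.1e}")
```

Under these assumptions the result is the same for any canonical starting pair. The 15-pair values match the text only when the unrounded per-pair probabilities are raised to the fifteenth power, rather than the rounded 0.013 and 0.071.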

## 2. Results

#### 2.1. Maximizing the Probability that Two Positions in a Library Will Form a Base Pair

#### 2.2. Maximizing the Number of Unique Sequences in a Library that Form a Specific Stem

^{15} sequences for each of the optimized coding strategies described in Figure 2 (Equations (1) and (2)). Our calculations took into account the probability of obtaining a sequence with the potential to form each of the pairs in the stem. This probability decreases exponentially as the number of pairs in the stem increases, and decreases more quickly when the probability of forming a pair is lower (Figure 4, green curves). They also considered the number of distinct sequences with the potential to form a given stem that are possible for different coding schemes. This number increases exponentially as the number of pairs in the stem increases, and increases more quickly when the number of pairs allowed by the coding scheme is higher (Figure 4, blue curves). These calculations demonstrate that the optimal coding scheme depends on both the number of pairs in the stem and the number of sequences in the starting library (Figure 4 and Figures S1–S3). For short stems and large libraries, using a scheme in which each of the six possible pairs can occur maximizes the number of unique sequences with the potential to form all of the pairs in the stem. In the case of libraries containing 10^{15} sequences, for example, such a strategy is best for stems containing up to 13 pairs (Figure 4A). As the number of pairs in the stem increases, however, the optimal strategy is to use a coding scheme that allows fewer pairs to occur but increases the probability that a pair can form. For stems containing 14 or 15 pairs, a coding scheme in which 5 pairs can occur is optimal (Figure 4B). For stems containing 16 to 19 pairs, a coding scheme in which 4 pairs can occur is best (Figure 4C). For stems containing 20 to 35 pairs, a coding scheme in which 3 pairs can occur is ideal (Figure 4D). And for stems containing more than 35 pairs, the best strategy is to use a scheme in which the probability of forming a pair is one and the number of possible pairs is two (for example, by encoding G at one position in the pair and Y at the other) (Figure 4E).
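The trade-off described here can be sketched numerically. The code below assumes, following the description of the blue and green curves in Figure 4, that the number of unique stem variants in a library is the smaller of the number of variants the coding scheme allows and the expected number of library members able to form every pair. The scheme table is taken from the Figure 4 panels (with 0.417 written as 5/12); `unique_variants` and `best_scheme` are hypothetical helper names.

```python
LIBRARY_SIZE = 1e15

# (example encoding, pairs allowed per position, probability that a pair can form)
SCHEMES = [
    ("N-N", 6, 0.375),
    ("D-N", 5, 5 / 12),  # reported as 0.417 in Figure 4B
    ("K-N", 4, 0.5),
    ("R-Y", 3, 0.75),
    ("G-Y", 2, 1.0),
]


def unique_variants(n_pairs, allowed_pairs, pair_prob, library_size=LIBRARY_SIZE):
    """Smaller of: variants the scheme can encode, and expected pair-forming sequences."""
    possible = allowed_pairs ** n_pairs
    expected = library_size * pair_prob ** n_pairs
    return min(possible, expected)


def best_scheme(n_pairs, library_size=LIBRARY_SIZE):
    """Scheme that maximizes the number of unique stem variants for this stem length."""
    return max(
        SCHEMES,
        key=lambda s: unique_variants(n_pairs, s[1], s[2], library_size),
    )


for n in (13, 14, 15, 16, 19, 20, 35, 36):
    name, allowed, prob = best_scheme(n)
    count = unique_variants(n, allowed, prob)
    print(f"{n:2d} pairs: {name} ({allowed} pairs allowed), ~{count:.1e} unique variants")
```

Passing a smaller `library_size` (for example `1e12` or `1e9`) shifts the breakpoints toward shorter stems, which is the library-size dependence the text attributes to Figures S1–S3.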

^{15} sequences that can form each of the pairs in a series of stems of different lengths (Equations (3)–(9), (12) and (13)). These calculations used stems that contained only canonical base pairs as a starting point, and allowed stem variants to contain canonical (A-U, U-A, C-G, and G-C) or wobble (G.U and U.G) pairs. We then compared the number of different variants of each stem that could be generated using the optimal rate of random mutagenesis with the number generated using the optimal coding scheme for pairs for a stem of the same length (Figure 5). The two methods gave similar results for stems of up to 12 pairs. Note that, for stems of this size, complete sampling is possible in both secondary structure libraries and randomly mutagenized libraries.

^{6}. This enrichment was also observed in smaller libraries (Figure S4). Taken together, these calculations show that secondary structure libraries can contain significantly more sequences with the potential to form each of the pairs in a stem than randomly mutagenized libraries. They also indicate that such libraries are likely to be more useful for the optimization of larger and more complex structures than for the optimization of simpler folds.

#### 2.3. Synthesis of Secondary Structure Libraries Using a Split-and-Pool Approach

^{3} = 216 different oligonucleotides would be needed to encode a hairpin structure containing three pairs. By encoding pairs using degenerate positions, this problem can be minimized to some extent (Figure 6).

^{3} = 216 to 4^{3} = 64 without reducing base pair diversity. Permitting mismatches can further reduce these numbers. For instance, by synthesizing one oligonucleotide containing R (A or G) at one position in a pair and Y (C or U) at the second, and a second oligonucleotide containing Y (C or U) at one position in a pair and R (A or G) at the second, it is possible to reduce the number of oligonucleotides needed from 6^{3} = 216 to 2^{3} = 8 (Figure 6). However, as the number of base pairs (N) in the secondary structure increases, even 2^{N} possibilities will eventually become limiting. For instance, in the case of the fifteen base pair self-thiophosphorylating ribozyme discussed in the introduction, this would require synthesis of 2^{15} = 32,768 oligonucleotides. An advantage of a split-and-pool approach, especially when used in combination with degenerate positions, is that it can increase the probability of obtaining pairs relative to strategies in which the secondary structure library is generated in a single synthesis. However, a disadvantage is that a large number of oligonucleotides are required even for modestly sized structures.
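The three-pair R-Y/Y-R example can be enumerated directly. This is a minimal sketch that treats a stem as a tuple of (5′ nucleotide, 3′ nucleotide) pairs and ignores loop and flanking sequence; the variable names are illustrative.

```python
from itertools import product

IUPAC = {"R": "AG", "Y": "CU"}
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}
N_PAIRS = 3

# Each oligonucleotide assigns either R-Y or Y-R to every pair position.
oligos = list(product([("R", "Y"), ("Y", "R")], repeat=N_PAIRS))

library = set()
for oligo in oligos:
    # Expand the degenerate codes at each pair position into concrete nucleotides.
    per_position = [list(product(IUPAC[five], IUPAC[three])) for five, three in oligo]
    library.update(product(*per_position))

# Keep only sequences in which every position is a canonical or wobble pair.
stems = {seq for seq in library if all(pair in PAIRS for pair in seq)}

print(f"oligonucleotides to synthesize: {len(oligos)}")
print(f"distinct sequences after pooling: {len(library)}")
print(f"sequences able to form all {N_PAIRS} pairs: {len(stems)}")
print(f"oligonucleotides needed for 15 pairs: {2 ** 15}")
```

The counts agree with the Figure 6 caption: eight oligonucleotides give 512 distinct sequences, of which 216 can form all three pairs.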

#### 2.4. Secondary Structure Libraries Based on Known Motifs

^{15}, but see Figures S1–S7 for calculations for smaller libraries).

^{9} = 10^{7} possible stem sequences. It also contains 32 possible combinations of mutations in unpaired positions, for a total of 3.2 × 10^{8} possible variants in the secondary structure library if each of the six possible pairs (A-T, T-A, C-G, G-C, G.T, or T.G) can occur. By encoding base pairs with N (A, C, G, or T) at both positions, and unpaired positions with degenerate bases consistent with its sequence requirements, each of these variants can be generated in a single synthesis. A library containing almost as many (5.5-fold fewer) variants can be generated by random mutagenesis using the optimal rate of 46% per position (Equations (3)–(9), (12) and (13)). In addition, libraries containing the same number of variants as that present in the secondary structure library can be generated by using a restricted random mutagenesis strategy (in which invariant unpaired nucleotides are not mutated during library synthesis; Equations (3)–(9), (12) and (13)) or a smart random mutagenesis strategy (in which only nucleotides known to be compatible with function can occur at unpaired positions during library synthesis; Equations (10)–(13)). When smaller libraries were used for the calculations, the advantage of using a secondary structure library became more pronounced (Figures S5–S7). In addition, the optimal coding scheme for base pairs changed (Figures S5–S7). This example shows that, for simple motifs and large pools, secondary structure libraries are not significantly enriched for the secondary structure of interest relative to randomly mutagenized libraries.

^{12} = 2.2 × 10^{9} possible stem sequences. It also contains 1.5 × 10^{5} possible combinations of mutations in unpaired positions, for a total of 3.2 × 10^{14} possible variants in the secondary structure library. By encoding base pairs with N (A, C, G, or U) at one position and K (G or U) at the other, and unpaired positions with degenerate bases consistent with its sequence requirements, a library containing 2.4 × 10^{11} of the 3.2 × 10^{14} possible variants in the secondary structure library can be generated in a single synthesis. In comparison, a library generated using the optimal level of random mutagenesis (26% per position) would contain 119-fold fewer unique variants of the motif than the number in the secondary structure library, while a library synthesized using restricted mutagenesis would contain 32-fold fewer sequences and a library generated using smart mutagenesis would contain 12-fold fewer sequences. As before, the advantage of using a secondary structure library was more pronounced for smaller libraries, and the optimal coding scheme for base pairs also changed based on the library size (Figures S5–S7).

^{15} = 4.7 × 10^{11} possible stem sequences. It also contains 3.1 × 10^{4} possible combinations of mutations in unpaired positions, for a total of 1.5 × 10^{16} possible variants in the secondary structure library. A library that maximizes the number of variants consistent with this secondary structure can be generated by encoding base pairs with Y (C or U) at one position and R (A or G) at the other. A library made in this way would contain 4.5 × 10^{11} of the 1.5 × 10^{16} possible variants encoded by the secondary structure library. In comparison, a library generated using the optimal level of random mutagenesis (19% per position) would contain 676-fold fewer unique variants of the motif than the number in the secondary structure library, while a library synthesized using restricted mutagenesis would contain 154-fold fewer sequences and a library generated using smart mutagenesis would contain 40-fold fewer sequences. As for the other motifs, the advantage of using a secondary structure library was generally more pronounced for smaller libraries, and the optimal coding strategy depended on the library size (Figures S5–S7). This example highlights that, for complex structures, only a small fraction of the sequence space of the secondary structure library can be sampled even using the methods described here.
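The library sizes quoted for the three motifs can be approximated with the same bookkeeping used for stems. The sketch below rests on two assumptions: every library member satisfies the unpaired-position constraints (because those positions are encoded with compatible degenerate bases), and all library members are distinct. The unpaired counts are the rounded values given in the text, so the last digit can differ slightly from the published figures; `summarize` is a hypothetical helper.

```python
LIBRARY_SIZE = 1e15

# motif: (base pairs, unpaired combinations, pairs allowed by scheme, pair probability)
MOTIFS = {
    "streptavidin aptamer, N-N": (9, 32, 6, 0.375),
    "ATP aptamer, K-N": (12, 1.5e5, 4, 0.5),
    "kinase ribozyme, R-Y": (15, 3.1e4, 3, 0.75),
}


def summarize(n_pairs, unpaired, allowed_pairs, pair_prob, library_size=LIBRARY_SIZE):
    """Return (size of the full secondary structure space, variants expected in the library)."""
    full_space = 6 ** n_pairs * unpaired                # all six pairs at every position
    encodable = allowed_pairs ** n_pairs * unpaired     # variants the scheme can produce
    pair_forming = library_size * pair_prob ** n_pairs  # members able to form every pair
    return full_space, min(encodable, pair_forming)


results = {name: summarize(*params) for name, params in MOTIFS.items()}
for name, (full_space, in_library) in results.items():
    print(f"{name}: space ~{full_space:.1e}, in library ~{in_library:.1e}")
```

For the streptavidin aptamer the whole space fits in the library. The ATP aptamer is limited by the number of pair-forming sequences (10^{15} × 0.5^{12}), and the kinase ribozyme by the number of variants that R-Y encoding can produce.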

## 3. Discussion

^{15} different sequences. Once the sequence of a functional motif is known, the sequence space around it is explored using a second library generated by randomly mutating a single variant of the motif at a rate of 15% to 25% per position. This library is usually generated by solid-phase synthesis, although mutagenic PCR can also be used when lower rates of mutagenesis (on the order of 1% per position) are desired [26]. The synthetic protocol can also be modified in various ways to incorporate deletions [27,28]. Selections using such libraries often yield variants with improved biochemical properties, and also provide valuable information about the sequence requirements and secondary structure of the motif [15,16,19,20]. However, such experiments are unlikely to identify the most active variant of the motif. This is due to incomplete sampling: sequence space is vast, and only a tiny fraction of the possible variants of a given secondary structure are likely to be present in the neighborhood of a single sequence. Here we describe a method to more effectively explore the sequence space of a secondary structure of interest. Our method uses biased nucleotide frequencies to increase the probability that paired positions in the secondary structure of the motif will also have the potential to form pairs in sequences in the library. It also uses information about the sequence requirements of the motif to determine which mutations can occur in unpaired regions [20]. By increasing the number of different variants of the secondary structure of a functional motif in the library, the likelihood of finding variants with improved properties should also increase.

^{15} sequences. As the complexity of the motif increases, however, our calculations show that secondary structure libraries will contain significantly more unique sequences with the potential to form the secondary structure than either random sequence libraries or randomly mutagenized libraries based on a single example of a motif. They also indicate that the optimal coding strategy for pairs (in this study defined as canonical A-U, U-A, C-G, and G-C base pairs as well as G.U and U.G wobble pairs) depends on the complexity of the motif (Figure 8). For less complex motifs such as the streptavidin aptamer, N-N (six possible pairs and a 0.375 probability of forming a base pair) is the optimal strategy. As the complexity of the motif increases, coverage can be maximized by encoding pairs with combinations of nucleotides that maximize the probability of obtaining a viable pair, although this comes at a cost of reducing the number of pairs that can occur. For instance, base pairs in a 40-nucleotide ATP aptamer can be optimally encoded using K-N (four possible pairs and a probability of 0.5 of forming a pair), while those in a 50-nucleotide kinase ribozyme are best encoded by R-Y (three possible pairs and a probability of 0.75 of forming a base pair). For more complex motifs, such as the 119-nucleotide b1-207 variant of the Class I ligase ribozyme (made up of 33 base pairs, 16 invariant unpaired positions, 10 unpaired positions at which two nucleotides are possible, 13 unpaired positions at which three nucleotides are possible, 8 positions at which four nucleotides are possible, and a six-nucleotide substrate binding site that was left constant for these calculations) [16,29,30], the optimal coding strategy is one in which the probability of obtaining a pair is one and two pairs are possible. This could be achieved by encoding C-G (and U.G) pairs by Y-G, G-C (and G.U) pairs by G-Y, A-U pairs by R-U, and U-A pairs by U-R (note that this coding scheme also ensures that the starting sequence will be present in the library). Every variant in a library made in this way would have the potential to form each of the 33 pairs in the secondary structure of the ribozyme, and ~10^{15} different variants of the secondary structure would be represented. In comparison, 62,707-fold fewer unique sequences consistent with the constraints of the secondary structure would be present in a library generated by randomly mutagenizing this ribozyme at an optimal rate of 6% per position. A library synthesized using restricted mutagenesis would also contain 62,707-fold fewer sequences (the same coverage is reachable by normal random mutagenesis), while a library generated using smart mutagenesis would contain 22,129-fold fewer sequences.
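The ligase example can be checked with the position counts given in the text. This is a rough sketch that assumes the library holds 10^{15} distinct molecules and that unpaired positions are encoded with exactly the nucleotides allowed at each position; the variable names are illustrative.

```python
LIBRARY_SIZE = 1e15

n_pairs = 33
invariant_unpaired = 16
substrate_site = 6
# number of nucleotides allowed at a variable unpaired position -> number of such positions
variable_unpaired = {2: 10, 3: 13, 4: 8}

# Sanity check on the motif length: each base pair contributes two nucleotides.
length = 2 * n_pairs + invariant_unpaired + sum(variable_unpaired.values()) + substrate_site

# Two pairs are possible at each paired position, each forming with probability one.
stem_variants = 2 ** n_pairs
unpaired_variants = 1
for allowed, count in variable_unpaired.items():
    unpaired_variants *= allowed ** count

encodable = stem_variants * unpaired_variants
represented = min(encodable, LIBRARY_SIZE * 1.0 ** n_pairs)

print(f"motif length: {length} nucleotides")
print(f"variants the coding scheme can encode: ~{encodable:.1e}")
print(f"variants expected in a library of 1e15: ~{represented:.1e}")
```

Because the encodable space far exceeds the library size and every member can form all 33 pairs, the library size itself sets the number of variants represented, consistent with the ~10^{15} quoted above.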

^{6} sequences. In comparison, our approach can be used to generate libraries of ~10^{15} sequences in a single synthesis. Another approach uses an algorithm to determine a mixing matrix (the nucleotide composition of degenerate positions) that can be used to synthesize a library which maximizes the fraction of sequences predicted to form a target structure [37,38,39]. Because this approach uses an RNA folding algorithm to evaluate library quality, it cannot be applied to motifs which contain structural elements that cannot be effectively predicted, such as pseudoknots, triplexes, and G-quadruplexes. In comparison, the approach described here can be applied to any structure for which the sequence requirements have been determined.

## 4. Materials and Methods

#### 4.1. Secondary Structure Library Design

^{15}) and ${p}_{0}^{{n}_{bp}}$ is the probability that the sequence will have the potential to form all of the pairs in the stem. Therefore, the maximum possible number of unique sequences in the library with the potential to form all base pairs in the stem is:

#### 4.2. Random Mutagenesis

^{15}). For the first two methods of mutagenesis, the rate with the highest number of allowed mutations is chosen without limiting the number of possible mutations to a whole number. For example, if we are comparing two rates and one allows a maximum of 5.0 mutations and the other allows a maximum of 5.5 mutations, the same types of sequences will be present in libraries generated at both rates. In the second case, however, the average copy number of sequences containing 5 mutations will be higher, so this rate is considered to be better. For the third method of mutagenesis, we compare results for rates which generate the highest number of distinct sequences consistent with the model by comparing the average copy number of sequences from the least abundant sequence type present in the library, and choose as the optimal rate that at which this minimum is the largest. For example, if we are comparing two rates for which the least abundant sequence type contains sequences with average copy numbers of 2.2 and 2.8, then the second rate is better.

#### 4.3. Normal Random Mutagenesis in Which the Entire Sequence Is Mutagenized

#### 4.4. Restricted Random Mutagenesis

#### 4.5. Smart Random Mutagenesis

_{2}, N_{3}, and N_{4} at which 2, 3, or 4 nucleotides are allowed. Positions which form base pairs are also classified as positions at which 4 nucleotides are allowed. The total number of nucleotides which are randomly mutagenized is $N = N_{2} + N_{3} + N_{4}$. If we look again at Equation (3) we see that it cannot be used directly in this case, because $t$ is different for positions at which different numbers of nucleotides are allowed. For this reason, we have to write separate terms for sequence groups with constant $t$:

#### 4.6. Calculating the Number of Unique Sequences in a Library

## 5. Conclusions

## Supplementary Materials

^{12} sequences, Figure S2: Maximizing the number of unique sequences that form a specific stem in libraries of 10^{9} sequences, Figure S3: Maximizing the number of unique sequences that form a specific stem in libraries of 10^{6} sequences, Figure S4: Enrichment of stem variants in secondary structure libraries relative to randomly mutagenized libraries, Figure S5: Secondary structure libraries based on known motifs for library sizes of 10^{12} sequences, Figure S6: Secondary structure libraries based on known motifs for library sizes of 10^{9} sequences, Figure S7: Secondary structure libraries based on known motifs for library sizes of 10^{6} sequences.


## References

1. Wilson, D.S.; Szostak, J.W. In vitro selection of functional nucleic acids. Annu. Rev. Biochem. **1999**, 68, 611–647.
2. Bartel, D.P.; Unrau, P.J. Constructing an RNA world. Trends Cell Biol. **1999**, 9, M9–M13.
3. Breaker, R.R. Natural and engineered nucleic acids as tools to explore biology. Nature **2004**, 432, 838–845.
4. Joyce, G.F. Directed evolution of nucleic acid enzymes. Annu. Rev. Biochem. **2004**, 73, 791–836.
5. Silverman, S.K. Catalytic DNA: Scope, applications, and biochemistry of deoxyribozymes. Trends Biochem. Sci. **2016**, 41, 595–609.
6. Gold, L.; Polisky, B.; Uhlenbeck, O.; Yarus, M. Diversity of oligonucleotide functions. Ann. Rev. Biochem. **1995**, 64, 763–797.
7. Chandrasekar, J.; Silverman, S.K. Catalytic DNA with phosphatase activity. Proc. Natl. Acad. Sci. USA **2013**, 110, 5315–5320.
8. Zhou, J.; Rossi, J. Aptamers as targeted therapeutics: Current potential and challenges. Nat. Rev. Drug Discov. **2017**, 16, 181–202.
9. Tang, J.; Breaker, R.R. Rational design of allosteric ribozymes. Chem. Biol. **1997**, 4, 453–459.
10. Koizumi, M.; Soukup, G.A.; Kerr, J.N.; Breaker, R.R. Allosteric selection of ribozymes that respond to the secondary messengers cGMP and cAMP. Nat. Struct. Biol. **1999**, 6, 1062–1071.
11. Babendure, J.R.; Adams, S.R.; Tsien, R.Y. Aptamers switch on fluorescence of triphenylmethane dyes. J. Am. Chem. Soc. **2003**, 125, 14716–14717.
12. Paige, J.S.; Wu, K.Y.; Jaffrey, S.R. RNA mimics of green fluorescent protein. Science **2011**, 333, 642–646.
13. Dolgosheina, E.V.; Jeng, S.C.; Panchapakesan, S.S.; Cojocaru, R.; Chen, P.S.; Wilson, P.D.; Hawkins, N.; Wiggins, P.A.; Unrau, P.J. RNA mango aptamer-fluorophore: A bright, high-affinity complex for RNA labeling and tracking. ACS Chem. Biol. **2014**, 9, 2412–2420.
14. Liu, M.; Chang, D.; Li, Y. Discovery and biosensing applications of diverse RNA-cleaving DNAzymes. Acc. Chem. Res. **2017**, 50, 2273–2283.
15. Ellington, A.D.; Szostak, J.W. In vitro selection of RNA molecules that bind specific ligands. Nature **1990**, 346, 818–822.
16. Ekland, E.H.; Bartel, D.P. The secondary structure and sequence optimization of an RNA ligase ribozyme. Nucleic Acids Res. **1995**, 23, 3231–3238.
17. Knight, R.; Yarus, M. Analyzing partially randomized nucleic acid pools: Straight dope on doping. Nucleic Acids Res. **2003**, 31, e30.
18. Gutell, R.R.; Power, A.; Hertz, G.Z.; Putz, E.J.; Stormo, G.D. Identifying constraints on the higher-order structure of RNA: Continued development and application of comparative sequence analysis methods. Nucleic Acids Res. **1992**, 20, 5785–5795.
19. Curtis, E.A.; Bartel, D.P. New catalytic structures from an existing ribozyme. Nat. Struct. Mol. Biol. **2005**, 12, 994–1000.
20. Curtis, E.A.; Bartel, D.P. Synthetic shuffling and in vitro selection reveal the rugged adaptive fitness landscape of a kinase ribozyme. RNA **2013**, 19, 1116–1128.
21. Ruff, K.M.; Snyder, T.M.; Liu, D.R. Enhanced functional potential of nucleic acid aptamer libraries patterned to increase secondary structure. J. Am. Chem. Soc. **2010**, 132, 9453–9464.
22. Bing, T.; Yang, X.; Mei, H.; Cao, Z.; Shangguan, D. Conservative secondary structure motif of streptavidin-binding aptamers generated by different laboratories. Bioorg. Med. Chem. **2010**, 18, 1798–1805.
23. Sassanfar, M.; Szostak, J.W. An RNA motif that binds ATP. Nature **1993**, 364, 550–553.
24. Jiang, F.; Kumar, R.A.; Jones, R.A.; Patel, D.J. Structural basis of RNA folding and recognition in an AMP-RNA aptamer complex. Nature **1996**, 382, 183–186.
25. Dieckmann, T.; Suzuki, E.; Nakamura, G.K.; Feigon, J. Solution structure of an ATP-binding RNA aptamer reveals a novel fold. RNA **1996**, 2, 628–640.
26. Cadwell, R.C.; Joyce, G.F. Randomization of genes by PCR mutagenesis. PCR Methods Appl. **1992**, 2, 28–33.
27. Wang, Q.S.; Unrau, P.J. Ribozyme motif structure mapped using random recombination and selection. RNA **2005**, 11, 404–411.
28. Arriola, J.T.; Muller, U.F. A combinatorial method to isolate short ribozymes from complex ribozyme libraries. Nucleic Acids Res. **2020**, 48, e116.
29. Johnston, W.K.; Unrau, P.J.; Lawrence, M.S.; Glasner, M.E.; Bartel, D.P. RNA-catalyzed RNA polymerization: Accurate and general RNA-templated primer extension. Science **2001**, 292, 1319–1325.
30. Wochner, A.; Attwater, J.; Coulson, A.; Holliger, P. Ribozyme-catalyzed transcription of an active ribozyme. Science **2011**, 332, 209–212.
31. Batey, R.T.; Rambo, R.P.; Doudna, J.A. Tertiary motifs in RNA structure and folding. Angew. Chem. Int. Ed. Engl. **1999**, 38, 2326–2343.
32. Duca, M.; Vekhoff, P.; Oussedik, K.; Halby, L.; Arimondo, P.B. The triple helix: 50 years later, the outcome. Nucleic Acids Res. **2008**, 36, 5123–5138.
33. Brown, J.A. Unraveling the structure and biological function of RNA triple helices. Wiley Interdiscip. Rev. RNA **2020**, 11, e1598.
34. Kinghorn, A.B.; Fraser, L.A.; Liang, S.; Shiu, S.C.C.; Tanner, J.A. Aptamer bioinformatics. Int. J. Mol. Sci. **2017**, 18, 2516.
35. Vorobyeva, M.A.; Davydova, A.S.; Vorobjev, P.E.; Pyshnyi, D.V.; Venyaminova, A.G. Key aspects of nucleic acid library design for in vitro selection. Int. J. Mol. Sci. **2018**, 19, 470.
36. Kinghorn, A.B.; Dirkzwager, R.M.; Liang, S.; Cheung, Y.W.; Fraser, L.A.; Shiu, S.C.C.; Tang, M.S.L.; Tanner, J.A. Aptamer affinity maturation by resampling and microarray selection. Anal. Chem. **2016**, 88, 6981–6985.
37. Kim, N.; Gan, H.H.; Schlick, T. A computational proposal for designing structured RNA pools for in vitro selection of RNAs. RNA **2007**, 13, 478–492.
38. Kim, N.; Shin, J.S.; Elmetwaly, S.; Gan, H.H.; Schlick, T. RagPools: RNA-as-graph-pools—A web server for assisting the design of structured RNA pools for in vitro selection. Bioinformatics **2007**, 23, 2959–2960.
39. Kim, N.; Izzo, J.A.; Elmetwaly, S.; Gan, H.H.; Schlick, T. Computational generation and screening of RNA motifs in large nucleotide sequence pools. Nucleic Acids Res. **2010**, 38, e139.

**Figure 1.** Concept and design of a secondary structure library. (**A**) Typical workflow to identify and optimize a functional nucleic acid motif. The starting library usually contains ~10^{15} random sequences flanked by primer binding sites. After identifying functional motifs by selection, a second library is prepared by randomly mutagenizing a single sequence corresponding to one of the most active variants at a rate of 15% to 25% per position. Additional rounds of selection are performed to identify active variants of this sequence, most of which will adopt the same fold. Information from these variants can be used to design a secondary structure library, which is the topic of this paper. (**B**) Design of a secondary structure library. In this hypothetical example, variants 1–3 are three variants of a functional RNA motif from the “active variants” step of the workflow in panel A. A secondary structure library combining information from these three variants is shown on the right. Nucleotides that differ from variant 1 are shown in purple. X_{1}-X_{2} = A-U, U-A, C-G, G-C, G.U, or U.G; R = A or G; W = A or U; Y = C or U; K = G or U.

**Figure 2.** Encoding base pairs with degenerate positions. (**A**) The ten possible architectures for encoding base pairs by solid-phase synthesis. The number of possible nucleotides at each position in the base pair in each architecture is shown on the left, and an example is shown on the right. (**B**) Tradeoff between the number of possible pairs that can be encoded in each of the ten architectures (x axis) and the maximum probability of forming a pair in the architecture (y axis). Y = C or U; R = A or G; K = G or U; B = C, G, or U; D = A, G, or U; N = A, C, G, or U.

**Figure 3.** Encoding stems with degenerate positions. (**A**) A hypothetical stem made up of 20 base pairs. The sequence is arbitrary and does not affect the calculations in this panel. (**B**) Number of variants of this stem (including canonical A-U, U-A, C-G, and G-C base pairs as well as G.U and U.G wobble pairs) at various mutational distances from the starting sequence in a library in which base pairs are encoded by N-N. The total number of possible stem variants is 6^{20} = 3.7 × 10^{15}. (**C**) Probability distribution of sequences in a library based on this stem in which base pairs are encoded by N-N (N = A, C, G, or U; probability of forming a pair = 0.375). The y axis indicates the probability that a sequence in the library will have the potential to form each of the 20 base pairs in the stem. (**D**) Same as panel B, but for a library in which base pairs are encoded by R-Y (R = A or G; Y = C or U; probability of forming a pair = 0.75). Because only three of the six possible pairs can occur with this coding scheme, the number of possible stem variants is 3^{20} = 3.5 × 10^{9}. (**E**) Same as panel C, but for a library in which base pairs are encoded by R-Y.

**Figure 4.** Maximizing the number of unique sequences that form a specific stem in libraries of 10^{15} sequences. The graphs show the relationship between the number of base pairs in a stem, the number of possible variants of the stem for the indicated coding scheme (blue curves), and the expected number of variants in a library of 10^{15} sequences with the potential to form all of the pairs in the stem for the indicated coding scheme (green curves). The number of unique variants in the library at each point on the x axis is indicated by the curve with the lower value, and the average copy number of library members is greater than one to the left of each intersection point and less than one to the right of each intersection point. (**A**) Coding scheme in which 6 pairs can occur. An example is N (A, C, G, or U) and N. The probability of forming a pair is 0.375. (**B**) Coding scheme in which 5 pairs can occur. An example is D (A, G, or U) and N (A, C, G, or U). The probability of forming a pair is 0.417. (**C**) Coding scheme in which 4 pairs can occur. An example is K (G or U) and N (A, C, G, or U). The probability of forming a pair is 0.5. (**D**) Coding scheme in which 3 pairs can occur. An example is R (A or G) and Y (C or U). The probability of forming a pair is 0.75. (**E**) Coding scheme in which 2 pairs can occur. An example is G and Y (C or U). The probability of forming a pair is 1. (**F**) Coding scheme in which 1 pair can occur. An example is G and C. The probability of forming a base pair is 1.

**Figure 5.** Enrichment of stem variants in secondary structure libraries relative to randomly mutagenized libraries. The optimal coding strategy for base pairs and the optimal rate of random mutagenesis were determined for a series of stems containing 10 to 50 base pairs. Enrichment of distinct variants of the stem in the secondary structure library (y axis) was calculated by dividing the number of different variants of the stem expected to occur in a secondary structure library (generated using the optimal coding strategy for base pairs) by the number expected to occur in a randomly mutagenized library (generated using the optimal rate of mutagenesis). Calculations were performed for a library of 10^{15} sequences. The breakpoints in this graph are due to changes in the maximum number of mutations a sequence can contain to be present in the library.

**Figure 6.** Synthesis of secondary structure libraries using a split-and-pool approach. In this example, a library containing all possible variants of a stem is constructed by synthesizing eight different oligonucleotides in which base pairs are encoded by different combinations of R-Y and Y-R. These oligonucleotides are mixed to generate the final library containing 512 different sequences, including each of the 216 possible stem sequences. Z_{1}-Z_{2} = R-Y or Y-R.

**Figure 7.** Secondary structure libraries based on known motifs for library sizes of 10^{15} sequences. (**A**) Expected number of unique variants with the potential to form the secondary structure of a DNA aptamer that binds streptavidin [22] in a library of 10^{15} sequences using different coding strategies to encode base pairs. The column labeled “Ran” indicates the number for a library generated at the optimal rate of random mutagenesis using the method described in Section 4.3. (**B**) Possible secondary structure library for this motif. (**C**,**D**) The same, but for an RNA aptamer that binds ATP [23,24,25]. (**E**,**F**) The same, but for a kinase ribozyme that thiophosphorylates itself using GTPγS as a substrate [19,20]. Y = C or T (U); R = A or G; K = G or T (U); W = A or T (U); S = C or G; D = A, G, or T (U); H = A, C, or T (U); V = A, C, or G; N = A, C, G, or T (U).

**Figure 8.** Relationship between the complexity of the secondary structure and the optimal coding scheme for base pairs in secondary structure libraries. Y = C or U (T); R = A or G; K = G or U (T); D = A, G, or U (T); N = A, C, G, or U (T).


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sgallová, R.; Curtis, E.A. Secondary Structure Libraries for Artificial Evolution Experiments. *Molecules* **2021**, *26*, 1671.
https://doi.org/10.3390/molecules26061671
