Comparison of Online Searching Algorithms for IUPAC Nucleotide Sequences

Tarhio, Jorma

doi:10.3390/a19010030

by Jorma Tarhio

Department of Computer Science, Aalto University, P.O. Box 1540, FI-00076 Aalto, Finland

Algorithms 2026, 19(1), 30; https://doi.org/10.3390/a19010030

Submission received: 17 November 2025 / Revised: 11 December 2025 / Accepted: 22 December 2025 / Published: 29 December 2025

(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

We consider online pattern matching algorithms for IUPAC nucleotide sequences. We present an experimental comparison of search algorithms while allowing 3% degenerate symbols. In addition, we introduce two new algorithms, one utilizing SIMD instructions for short patterns and another for long patterns. A BNDM variation with 6-grams turned out to be the best general method to search IUPAC sequences.

Keywords: bioinformatics; search algorithms; IUPAC nucleotide symbols; degenerate symbols; comparison of algorithms

1. Introduction

DNA sequencing has become a fundamental tool in modern medicine and microbiology. The remarkable development of successive sequencing technologies has enabled the rapid, cost-effective, and large-scale sequencing of entire genomes and metagenomes. This massive generation of data has created significant computational challenges, particularly regarding search efficiency and analysis of the resulting sequences.

A common way to represent the resulting consensus nucleotide sequences, particularly when there is uncertainty or variation at a specific position, is to use the IUPAC (International Union of Pure and Applied Chemistry) nucleotide alphabet [1] for degenerate symbols. Each IUPAC symbol represents a certain subset of the four standard base symbols of DNA. The presence of these degenerate symbols means that classical algorithms for exact string matching [2] (which assume a perfect one-to-one match between characters) cannot directly handle the comparison, leading to incorrect or incomplete results. Therefore, algorithms that can handle degenerate symbols are needed.

We specifically focus on pattern matching algorithms for IUPAC nucleotide sequences, where the search space is expanded because both the text (the DNA sequence being searched) and the pattern (the query sequence) can contain degenerate symbols. The methods we consider are online methods, which operate by reading the text and searching for the pattern without preprocessing the text or building any index structure on it (such as a suffix tree or a suffix array). The Backward Nondeterministic DAWG Matching (BNDM) algorithm [3] was one of the first practical solutions capable of handling degenerate symbols as character classes defined by the IUPAC symbols. In addition to BNDM, we consider several previous solutions to the problem: the brute-force algorithm, SBNDMq [4], BADPM [5], and SSB2 [6]. Although SBNDMq and SSB2 are earlier algorithms, they have not previously been applied to character classes or IUPAC sequences. In addition, we introduce two new algorithms, SSB6I and N32I. According to our experimental comparison of the algorithms, N32I is the best algorithm for very short patterns, SSB6I for very long patterns, and SBNDM6 otherwise.

The rest of this paper is organized as follows: Section 2 explains the background and notations. Section 3 reviews four previous algorithms, and Section 4 introduces our new algorithms. Section 5 presents the results of our practical experiments, and Section 6 concludes this article.

2. Background

The IUPAC nucleotide alphabet [1] consists of 15 symbols. Four of them {A, C, G, T} are base symbols, and the others are degenerate symbols, which represent 2, 3, or 4 base symbols. Table 1 shows the interpretation of each IUPAC symbol. With the IUPAC nucleotide alphabet, a single sequence can represent a set of sequences.

We consider pattern matching in sequences as a string matching problem. In the following, we use mostly string matching terminology. Assume that we are given a finite alphabet $\Sigma$, a text $T = T_{0} \dots T_{n - 1}$ of length n, and a pattern $P = P_{0} \dots P_{m - 1}$ of length m. In the string matching problem, the task is to find all the occurrences of P in T, that is, all possible i such that for all j in $[0, m - 1]$, $T_{i + j} = P_{j}$. A string of q characters is called a q-gram. Most algorithms examine potential matches in alignment windows of m-grams in the text. The string matching problem can be extended considering degenerate strings, which means that each $T_{i}$ and $P_{j}$ is a character class, i.e., a non-empty set of characters belonging to $\Sigma$. In the matching problem of degenerate strings, the task is to find all the occurrences of P in T, that is, all possible i such that for all j in $[0, m - 1]$, $T_{i + j} \cap P_{j} \neq \emptyset$ holds. As a special case, P or T is allowed to contain only singletons. In the literature, strings containing symbols that represent sets of characters are also called indeterminate strings [7,8].
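As an illustration of the degenerate matching condition, the following small worked example checks every alignment of a two-symbol pattern against a four-symbol text. The strings T = AMGT and P = RY are chosen here only for illustration and do not come from the experiments; the subsets are those of Table 1.

```latex
\begin{align*}
T &= \mathtt{AMGT}, \qquad P = \mathtt{RY}, \qquad
  \mathtt{M} = \{A, C\},\ \mathtt{R} = \{A, G\},\ \mathtt{Y} = \{C, T\} \\
i = 0 &: \quad T_{0} \cap P_{0} = \{A\} \cap \{A, G\} = \{A\}, \quad
  T_{1} \cap P_{1} = \{A, C\} \cap \{C, T\} = \{C\}
  \quad \Rightarrow \text{occurrence} \\
i = 1 &: \quad T_{1} \cap P_{0} = \{A, C\} \cap \{A, G\} = \{A\}, \quad
  T_{2} \cap P_{1} = \{G\} \cap \{C, T\} = \emptyset
  \quad \Rightarrow \text{no occurrence} \\
i = 2 &: \quad T_{2} \cap P_{0} = \{G\} \cap \{A, G\} = \{G\}, \quad
  T_{3} \cap P_{1} = \{T\} \cap \{C, T\} = \{T\}
  \quad \Rightarrow \text{occurrence}
\end{align*}
```

In this example, P occurs in T at positions 0 and 2.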

In the pseudocode of algorithms, the operators ‘&’, ‘∣’, and ‘<<’ denote bit-parallel and, or, and left shift, respectively. The size of a computer word, which is typically 64, is indicated by w.

Several algorithms presented in Section 3 apply loop peeling. In loop peeling, a number of iterations are moved in front of the loop. As a result, the code becomes faster due to fewer loop tests.

3. Earlier Algorithms

We will review existing solutions to string matching in IUPAC sequences.

3.1. Brute Force Algorithm BF

We start with the brute force (BF) algorithm. As an individual algorithm, BF (shown as Algorithm 1) is not useful because it is inefficient, but we will apply it as a check part of other algorithms. The array f is an implementation of the bit encoding shown in Table 1.

Algorithm 1 BF

1: $count \leftarrow 0$
2: for $j \leftarrow 0$ to $n - m$ do
3:     $i \leftarrow 0$
4:     while $i < m$ and $f[P_{i}] \& f[T_{i + j}]$ do $i \leftarrow i + 1$
5:     if $i = m$ then $count \leftarrow count + 1$
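Algorithm 1 can be sketched in C as follows. This is a minimal illustration rather than the code used in the experiments; the function names `f` and `bf_count` are chosen here, and the encoding is the one given in Table 1.

```c
#include <assert.h>
#include <stddef.h>

/* Bit encoding f of Table 1; returns 0 for bytes outside the IUPAC alphabet. */
static unsigned f(char c)
{
    switch (c) {
    case 'A': return 0x1; case 'C': return 0x2;
    case 'G': return 0x4; case 'T': return 0x8;
    case 'M': return 0x3; case 'R': return 0x5;
    case 'W': return 0x9; case 'S': return 0x6;
    case 'Y': return 0xA; case 'K': return 0xC;
    case 'V': return 0x7; case 'H': return 0xB;
    case 'D': return 0xD; case 'B': return 0xE;
    case 'N': return 0xF;
    default:  return 0;
    }
}

/* Counts the alignments j at which every pattern symbol shares at least
   one base symbol with the aligned text symbol. */
size_t bf_count(const char *T, size_t n, const char *P, size_t m)
{
    size_t count = 0;
    assert(m > 0);
    for (size_t j = 0; j + m <= n; j++) {
        size_t i = 0;
        while (i < m && (f(P[i]) & f(T[i + j])) != 0)
            i++;
        if (i == m)
            count++;
    }
    return count;
}
```

For example, `bf_count("AMGT", 4, "RY", 2)` counts the alignments 0 and 2.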

3.2. BNDM

One of the fundamental algorithms for exact string matching is the Backward Nondeterministic DAWG Matching algorithm (BNDM) developed by Navarro and Raffinot [3,9]. BNDM is a bit-parallel algorithm and works for patterns with up to w characters. BNDM simulates a nondeterministic automaton without explicitly constructing it. A trivial extension of BNDM can handle patterns of any length by searching for the prefix of length w of the pattern and then checking whether each occurrence of the prefix extends to an occurrence of the whole pattern.

With a small change in preprocessing, BNDM also works for IUPAC sequences. Algorithm 2 shows the preprocessing of the pattern $P_{0} \dots P_{m - 1}$. In Algorithm 2, the symbol E represents a set containing characters $c_{1}, \dots, c_{k_{E}}$ where $k_{E} > 1$. For example, in the case of the IUPAC alphabet, if $P_{i}$ is M, both $B[A]$ and $B[C]$ are updated in line 2, and $B[A] ∣ B[C]$ is assigned to $B[M]$ in line 4.

Algorithm 2 Enhanced BNDM preprocessing

1: for $i \leftarrow 0$ to $m - 1$ do
2:     for $c \in P_{i}$ do $B[c] \leftarrow B[c] ∣ (1 << (m - i - 1))$
3: for each degenerate symbol E do
4:     for $i \leftarrow 1$ to $k_{E}$ do $B[E] \leftarrow B[E] ∣ B[c_{i}]$
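The preprocessing can be sketched in C as below. Instead of the two passes of Algorithm 2, this sketch sets the bit of position i directly for every IUPAC symbol whose subset intersects that of the pattern symbol, which should yield the same table. The names `iupac_code` and `bndm_preprocess` are chosen here, and the sketch assumes m ≤ 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit encoding of Table 1; 0 for bytes outside the IUPAC alphabet. */
static unsigned iupac_code(char c)
{
    switch (c) {
    case 'A': return 0x1; case 'C': return 0x2;
    case 'G': return 0x4; case 'T': return 0x8;
    case 'M': return 0x3; case 'R': return 0x5;
    case 'W': return 0x9; case 'S': return 0x6;
    case 'Y': return 0xA; case 'K': return 0xC;
    case 'V': return 0x7; case 'H': return 0xB;
    case 'D': return 0xD; case 'B': return 0xE;
    case 'N': return 0xF;
    default:  return 0;
    }
}

/* Fills B so that bit (m - i - 1) of B[c] is set exactly when the text
   symbol c is compatible with the pattern symbol P[i]. */
void bndm_preprocess(const char *P, size_t m, uint64_t B[256])
{
    static const char symbols[] = "ACGTMRWSYKVHDBN";
    assert(m >= 1 && m <= 64);
    for (int c = 0; c < 256; c++)
        B[c] = 0;
    for (size_t i = 0; i < m; i++) {
        uint64_t bit = (uint64_t)1 << (m - i - 1);
        for (const char *s = symbols; *s != '\0'; s++)
            if ((iupac_code(*s) & iupac_code(P[i])) != 0)
                B[(unsigned char)*s] |= bit;
    }
}
```

For the pattern AM, the sketch gives B[A] = 11 and B[C] = 01 in binary, and B[M] equals their bitwise or, as in the example above.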

3.3. SBNDMq

SBNDM [10] is a simplified variation of BNDM without prefix recognition. Based on SBNDM, Ďurian et al. [4] developed SBNDMq, which applies loop peeling to speed up search by processing q characters before entering a loop. SBNDMq works with degenerate strings in the same way as BNDM. Algorithm 3 works for any pattern length. $F(i, q)$ denotes the result of processing characters $T_{i + q - 1}, \dots, T_{i}$. For example, $F(i, 4)$ is

$(B[T_{i + 3}] << 3) \& (B[T_{i + 2}] << 2) \& (B[T_{i + 1}] << 1) \& B[T_{i}]$

In our experiments we use variations $q = 4$, 6, and 8 with 16-bit reading. These variations utilize an auxiliary table B2 representing a resulting bitvector for a 2-gram. B2 is indexed by the 16-bit value read at a text position, so that $B2[T_{i}]$ stands for the entry of the 2-gram $T_{i} T_{i + 1}$. Then $F(i, 4)$ is $(B2[T_{i + 2}] << 2) \& B2[T_{i}]$. B2 is computed in Algorithm 4 for the IUPAC symbols.

Algorithm 3 SBNDMq

1: $m' \leftarrow \min(m, w)$
2: preprocess $P_{0} \dots P_{m' - 1}$ with Algorithm 2
3: $i \leftarrow m' - q$
4: while $i \leq n - m + m' - q$ do
5:     $d \leftarrow F(i, q)$
6:     if $d \neq 0$ then
7:         $j \leftarrow i - (m' - q + 1)$
8:         do $i \leftarrow i - 1$
9:             $d \leftarrow (d << 1) \& B[T_{i}]$
10:         until $d = 0$
11:         if $i = j$ then
12:             if $m > w$ then check $T_{i + 1} \dots T_{i + m}$
13:             else report occurrence at $i + 1$
14:             $i \leftarrow i + 1$
15:     $i \leftarrow i + m' - q + 1$
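A simplified C sketch of the search loop is given below. It is restricted to m ≤ 64, reads single characters instead of 16-bit units, and counts occurrences instead of reporting them. It also tests for i = j before reading the text, which avoids an access to the position before the text when the window is at the start; this guard is an addition of the sketch. The names `iupac_code` and `sbndmq_count` are chosen here.

```c
#include <assert.h>
#include <stdint.h>

/* Bit encoding of Table 1; 0 for bytes outside the IUPAC alphabet. */
static unsigned iupac_code(char c)
{
    switch (c) {
    case 'A': return 0x1; case 'C': return 0x2;
    case 'G': return 0x4; case 'T': return 0x8;
    case 'M': return 0x3; case 'R': return 0x5;
    case 'W': return 0x9; case 'S': return 0x6;
    case 'Y': return 0xA; case 'K': return 0xC;
    case 'V': return 0x7; case 'H': return 0xB;
    case 'D': return 0xD; case 'B': return 0xE;
    case 'N': return 0xF;
    default:  return 0;
    }
}

/* Counts the occurrences of P (length m <= 64) in T (length n). */
long sbndmq_count(const char *T, long n, const char *P, long m, int q)
{
    uint64_t B[256] = {0};
    long count = 0;
    assert(q >= 1 && q <= m && m <= 64);

    /* Preprocessing: bit (m - p - 1) of B[c] marks compatibility with P[p]. */
    for (int c = 0; c < 256; c++)
        for (long p = 0; p < m; p++)
            if ((iupac_code((char)c) & iupac_code(P[p])) != 0)
                B[c] |= (uint64_t)1 << (m - p - 1);

    long i = m - q;
    while (i <= n - q) {
        /* F(i, q): the q peeled characters T[i..i+q-1] of the window. */
        uint64_t d = B[(unsigned char)T[i]];
        for (int k = 1; k < q; k++)
            d &= B[(unsigned char)T[i + k]] << k;
        if (d != 0) {
            long j = i - (m - q + 1);
            do {
                i--;
                if (i == j) {      /* the whole window matched */
                    count++;
                    i++;
                    break;
                }
                d = (d << 1) & B[(unsigned char)T[i]];
            } while (d != 0);
        }
        i += m - q + 1;
    }
    return count;
}
```

With the text ACGTACGNACGT and the pattern ACGT, the sketch counts three occurrences for q = 1, 2, and 4, because the N at position 7 is compatible with T.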

Algorithm 4 Computation of B2

1: for $i \leftarrow$ A to Y do
2:     for $j \leftarrow$ A to Y do
3:         $B2[(j << 8) + i] \leftarrow (B[j] << 1) \& B[i]$
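The following C sketch builds B2 and evaluates F(i, 4) in two ways, once from four single-character lookups and once from two 2-gram lookups, so that the two results can be compared. The helper names (`build_B2`, `index16`, `F4_bytes`, `F4_pairs`) are chosen here. The index is composed from two bytes explicitly, which equals a plain 16-bit load on a little-endian machine; unlike Algorithm 4, the loops visit only the 15 IUPAC symbols.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t B2[65536];   /* one entry per possible 16-bit value */

/* B2[(j << 8) + i] = (B[j] << 1) & B[i] for all pairs of IUPAC symbols. */
void build_B2(const uint64_t B[256])
{
    static const char sym[] = "ACGTMRWSYKVHDBN";
    for (const char *i = sym; *i != '\0'; i++)
        for (const char *j = sym; *j != '\0'; j++) {
            unsigned lo = (unsigned char)*i;
            unsigned hi = (unsigned char)*j;
            B2[(hi << 8) + lo] = (B[hi] << 1) & B[lo];
        }
}

/* Index of the 2-gram t[0] t[1]: the first character is the low byte. */
static unsigned index16(const char *t)
{
    return (unsigned)(unsigned char)t[0]
         | ((unsigned)(unsigned char)t[1] << 8);
}

/* F(i, 4) from four single-character lookups; t points to T_i. */
uint64_t F4_bytes(const uint64_t B[256], const char *t)
{
    return (B[(unsigned char)t[3]] << 3) & (B[(unsigned char)t[2]] << 2)
         & (B[(unsigned char)t[1]] << 1) &  B[(unsigned char)t[0]];
}

/* F(i, 4) from two 2-gram lookups; t points to T_i. */
uint64_t F4_pairs(const char *t)
{
    return (B2[index16(t + 2)] << 2) & B2[index16(t)];
}
```

The two functions agree because a left shift distributes over bitwise and, so the entry of the 2-gram starting at position i + 2 shifted by two positions covers the terms of the characters at positions i + 2 and i + 3.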

3.4. BADPM

Procházka and Holub [5] present a specialized algorithm BADPM (Byte-Aligned Degenerate Pattern Matching) for string matching in the IUPAC alphabet. BADPM is not a pure online method, because the text is preprocessed into a compressed form. The base symbols A, C, G, and T are encoded with two bits, and another data structure holds degenerate symbols and their locations. Before the search, all possible 8-gram factors of the pattern are tabulated. If the rate of degenerate symbols is low, BADPM works in sublinear time.

3.5. SSB2

LBNDM [10] is an algorithm for searching long patterns with BNDM. LBNDM is only suitable for large alphabets. Sparse SBNDM (SSB for short) [6] is an extension of LBNDM handling q-grams instead of single characters and works for small alphabets as well. In SSB, a pattern is partitioned into r consecutive segments of a characters, and each segment corresponds to a position in a superimposed pattern. The set of q-grams of the pattern ending in a segment is accepted in the corresponding position of the superimposed pattern. A fingerprint or a hash value of a q-gram in SSB is a bit vector formed from the q-gram. SSB is a filtering algorithm: it produces potential matches that are then checked. SSB reads q-grams backward at fixed distances in an alignment window, i.e., q-grams ending at $m - 1, m - 1 - a, m - 1 - 2a$, and so on, and the algorithm works like the standard SBNDM when matching strings of character classes. The maximum shift of the alignment window is $r \cdot a$. SSB does not fix how the fingerprint of a q-gram is formed or computed, and any method could be applied.

For $q = 2$, 16-bit reading can be applied to SSB. This preserves the ability to handle character classes. We call this version SSB2. Let $x_{1} x_{2}$ be a 2-gram in the pattern. In plain exact string matching, $x_{1} x_{2}$ is used as is. When character classes are allowed, all 2-grams $y_{1} y_{2}$ in the text satisfying $f[x_{1}] \cap f[y_{1}] \neq \emptyset$ and $f[x_{2}] \cap f[y_{2}] \neq \emptyset$ correspond to $x_{1} x_{2}$ of the pattern. As above, f is the bit encoding of the IUPAC symbols shown in Table 1. We added a short code to the preprocessing phase of the pattern to implement this.
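The expansion of a pattern 2-gram into its compatible text 2-grams can be sketched in C as follows. This only illustrates the enumeration; how SSB2 stores the result in its tables is not shown, and the function name `compatible_2grams` is chosen here.

```c
#include <assert.h>
#include <stddef.h>

/* Bit encoding f of Table 1; 0 for bytes outside the IUPAC alphabet. */
static unsigned f(char c)
{
    switch (c) {
    case 'A': return 0x1; case 'C': return 0x2;
    case 'G': return 0x4; case 'T': return 0x8;
    case 'M': return 0x3; case 'R': return 0x5;
    case 'W': return 0x9; case 'S': return 0x6;
    case 'Y': return 0xA; case 'K': return 0xC;
    case 'V': return 0x7; case 'H': return 0xB;
    case 'D': return 0xD; case 'B': return 0xE;
    case 'N': return 0xF;
    default:  return 0;
    }
}

/* Writes every text 2-gram y1 y2 that corresponds to the pattern 2-gram
   x1 x2 into out and returns their number. out needs room for 225 pairs. */
size_t compatible_2grams(char x1, char x2, char out[][2])
{
    static const char sym[] = "ACGTMRWSYKVHDBN";
    size_t k = 0;
    assert(f(x1) != 0 && f(x2) != 0);
    for (const char *y1 = sym; *y1 != '\0'; y1++) {
        if ((f(x1) & f(*y1)) == 0)
            continue;
        for (const char *y2 = sym; *y2 != '\0'; y2++) {
            if ((f(x2) & f(*y2)) == 0)
                continue;
            out[k][0] = *y1;
            out[k][1] = *y2;
            k++;
        }
    }
    return k;
}
```

Even a 2-gram of base symbols such as AC corresponds to 64 text 2-grams, because eight IUPAC symbols contain A and eight contain C.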

4. New Algorithms

We will introduce two new search algorithms for IUPAC sequences.

4.1. SSB6I

Because SSB2 is slower than SBNDM6 for DNA data, we developed SSB6I, a special version to search IUPAC strings. We set $q = 6$. In preprocessing the pattern, we go through all the 6-grams. If a 6-gram contains degenerate symbols, we process all combinations of base symbols that can represent that 6-gram. For each such 6-gram of base symbols, we form a bit string of 12 bits, where each base symbol is represented by two bits. Because degenerate symbols in real sequences are rare, we skip 6-grams that contain degenerate symbols during searching. We compute $\sum_{j = 0}^{5} g(T_{i + j})$ for a 6-gram $T_{i} \dots T_{i + 5}$, where $g(c)$ is 0 for the base symbols and 1 for the degenerate symbols. If the sum is zero, we know that the 6-gram does not contain degenerate symbols. The candidates produced by SSB6I are checked with the brute-force algorithm.

To make SSB6I faster, our implementation applies 16-bit reading in searching, i.e., each 6-gram is processed as three 2-grams.
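The test for degenerate symbols and the 12-bit string of a 6-gram can be sketched in C as follows. The assignment of 2-bit codes to the base symbols (A = 0, C = 1, G = 2, T = 3) and the names `base2`, `g`, and `fingerprint6` are assumptions of this sketch; the source does not specify them, and the sketch reads single characters rather than 2-grams.

```c
#include <assert.h>

/* 2-bit code of a base symbol, or -1 for any other symbol.
   The assignment A=0, C=1, G=2, T=3 is an assumption of this sketch. */
static int base2(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* g(c): 0 for a base symbol and 1 for any other symbol. */
static int g(char c)
{
    return base2(c) < 0;
}

/* Returns the 12-bit string of the 6-gram s[0..5], or -1 when the 6-gram
   contains a degenerate symbol and is therefore skipped. */
int fingerprint6(const char *s)
{
    int sum = 0, fp = 0;
    for (int j = 0; j < 6; j++)
        sum += g(s[j]);
    if (sum != 0)
        return -1;
    for (int j = 0; j < 6; j++)
        fp = (fp << 2) | base2(s[j]);
    assert(fp >= 0 && fp < 4096);
    return fp;
}
```

With these codes, the 6-gram ACGTAC maps to 000110110001 in binary, and ACGNAC is skipped.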

4.2. N32I

For exact string matching, Tarhio et al. [11] presented a naive search algorithm which uses the SIMD instruction architecture. The algorithm compares α characters in parallel, where α is 16 or 32. The name N32 is used for the variation α = 32 that uses the AVX2 instruction set [12]. We present N32I, a modification of N32, as Algorithm 5 to search IUPAC sequences. In N32I, only line 6 is different from the corresponding line of N32.

Algorithm 5 N32I

1: construct vector$(c)$ for each $c \in \Sigma$
2: $count \leftarrow 0$; $i \leftarrow 0$
3: while $i \leq n - m$ do
4:     $found \leftarrow 2^{α} - 1$
5:     for $j \leftarrow 0$ to $m - 1$ do
6:         $found \leftarrow found$ and SIMDtest(vector$(f(P_{j}))$, $T_{i + j}$, α)
7:         if $found = 0$ then goto out
8:     $count \leftarrow count$ + popcount($found$)
9:     out: $i \leftarrow i + α$

The key idea of Algorithm 5 is to test α consecutive potential occurrences of the pattern in parallel. For that purpose, a comparison vector is constructed in line 1 for each character of the alphabet. The comparison vector contains α copies of the bit encoding of the character. The algorithm first compares the vector of $P_{0}$ with $T_{0} \dots T_{31}$, then compares the vector of $P_{1}$ with $T_{1} \dots T_{32}$ and so on. The bitvector found of 32 bits keeps track of active match candidates. The intrinsic function _mm_popcnt_u32 [12] is used to count matches in line 8. The SIMDtest function uses six intrinsic functions [12], and is shown as Algorithm 6. The purpose of lines 2–5 of SIMDtest is to replace each character in a chunk of α characters with its IUPAC bit representation.
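The control flow of Algorithm 5 can be imitated without SIMD instructions. In the C sketch below, `lane_test` is a scalar stand-in for SIMDtest that inspects the lanes one by one, so the sketch shows only the bookkeeping with the bitvector found and not the speed of the AVX2 version. The masking of lanes beyond the last valid alignment is an addition of this sketch, and the names `lane_test`, `popcount32`, and `n32i_count` are chosen here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALPHA 32   /* number of alignments tested together */

/* Bit encoding f of Table 1; 0 for bytes outside the IUPAC alphabet. */
static unsigned f(char c)
{
    switch (c) {
    case 'A': return 0x1; case 'C': return 0x2;
    case 'G': return 0x4; case 'T': return 0x8;
    case 'M': return 0x3; case 'R': return 0x5;
    case 'W': return 0x9; case 'S': return 0x6;
    case 'Y': return 0xA; case 'K': return 0xC;
    case 'V': return 0x7; case 'H': return 0xB;
    case 'D': return 0xD; case 'B': return 0xE;
    case 'N': return 0xF;
    default:  return 0;
    }
}

/* Scalar stand-in for SIMDtest: bit k of the result is set when the
   pattern code shares a base symbol with the text symbol t[k]. */
static uint32_t lane_test(unsigned pcode, const char *t, size_t lanes)
{
    uint32_t mask = 0;
    for (size_t k = 0; k < lanes; k++)
        if ((pcode & f(t[k])) != 0)
            mask |= (uint32_t)1 << k;
    return mask;
}

static unsigned popcount32(uint32_t x)
{
    unsigned c = 0;
    for (; x != 0; x &= x - 1)   /* clears the lowest set bit */
        c++;
    return c;
}

size_t n32i_count(const char *T, size_t n, const char *P, size_t m)
{
    size_t count = 0;
    assert(m >= 1);
    for (size_t i = 0; i + m <= n; i += ALPHA) {
        size_t lanes = n - m - i + 1;   /* alignments left in the text */
        if (lanes > ALPHA)
            lanes = ALPHA;
        uint32_t found = (lanes == ALPHA) ? 0xFFFFFFFFu
                                          : (((uint32_t)1 << lanes) - 1);
        for (size_t j = 0; j < m && found != 0; j++)
            found &= lane_test(f(P[j]), T + i + j, lanes);
        count += popcount32(found);
    }
    return count;
}
```

Bit k of found stays set only while the alignment starting at position i + k has matched every pattern symbol processed so far, so the remaining set bits after the inner loop are the occurrences in the chunk.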

Algorithm 6 SIMDtest(x, y, 32)

1: yp = _mm256_loadu_si256(y)
2: ap = _mm256_blendv_epi8(
3:     _mm256_shuffle_epi8(m1,yp),
4:     _mm256_shuffle_epi8(m2,yp),
5:     _mm256_cmpgt_epi8(yp,tp))
6: xp = _mm256_loadu_si256(x)
7: return _mm256_movemask_epi8(
8:     _mm256_cmpgt_epi8(
9:         _mm256_and_si256(xp, ap),zp))

The variables yp, ap, and xp, as well as the constants m1, m2, zp, and tp are of type __m256i. The constant zp contains 32 null characters. The constant tp contains 32 ‘O’ characters. The constants m1 and m2 for the shuffle are defined as follows:

m1: (0, 15, 3, 0, 12, 0, 0, 11, 4, 0, 0, 13, 2, 14, 1, 0,
     0, 15, 3, 0, 12, 0, 0, 11, 4, 0, 0, 13, 2, 14, 1, 0)
m2: (0, 0, 0, 0, 0, 0, 10, 0, 9, 7, 0, 8, 6, 5, 0, 0,
     0, 0, 0, 0, 0, 0, 10, 0, 9, 7, 0, 8, 6, 5, 0, 0)

Because shuffling occurs in units of 16 bytes, both m1 and m2 contain two identical chunks of 16 bytes.

SIMDtest (Algorithm 6) works as follows. In line 1, the next 32 bytes in the text are assigned to yp. The shuffle instructions switch the characters to bit encoding. Because the shuffle instructions operate on lower half-bytes and the character codes of some IUPAC characters share the same lower half-byte (e.g., D and T), we need two shuffles. Line 5 forms a mask for the occurrences of R, S, T, V, W, and Y, which have higher character codes. The blend instruction in lines 2–5 collects the results of the two shuffles. Let us examine how D is transformed. The lower half-byte of D is 4, which corresponds to the fifth byte of m1 from the right. That byte is 13, which is 1101, the IUPAC bit encoding of D. The result of the and operation on the result of the blend operation and the pattern vector is tested against zero, and a resulting vector of 32 bits is formed with the movemask instruction.
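The effect of the two shuffles and the blend on a single character can be reproduced with two small lookup tables. In the C sketch below, the arrays `M1` and `M2` hold one 16-byte chunk of m1 and m2 in index order, i.e., reversed with respect to the listing above, where the byte of index 0 is the rightmost one. The function name `to_iupac_bits` is chosen here, and ASCII character codes are assumed.

```c
#include <assert.h>

/* One 16-byte chunk of m1 and m2, indexed by the lower half-byte of the
   character code (index 0 is the rightmost byte of the listing). */
static const unsigned char M1[16] =
    {0, 1, 14, 2, 13, 0, 0, 4, 11, 0, 0, 12, 0, 3, 15, 0};
static const unsigned char M2[16] =
    {0, 0, 5, 6, 8, 0, 7, 9, 0, 10, 0, 0, 0, 0, 0, 0};

/* Scalar equivalent of lines 2-5 of SIMDtest for one character:
   codes greater than 'O' are taken from M2, the others from M1. */
unsigned to_iupac_bits(char c)
{
    unsigned low = (unsigned char)c & 0x0F;
    assert(c >= 'A' && c <= 'Y');
    return (c > 'O') ? M2[low] : M1[low];
}
```

For example, D (code 0x44) and T (code 0x54) share the lower half-byte 4, and the comparison with ‘O’ selects 13 from M1 for D and 8 from M2 for T.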

In addition to SIMD instructions, loop peeling plays a key role in the efficiency of N32I. The term ‘peeling factor’ refers to the number of iterations of line 6 that are moved outside the inner loop.

5. Experimental Results

We tested the algorithms presented in the previous sections and compared their efficiency. The results of the experiments are given below.

5.1. Setting

The experiments were run on an Intel Core i7-4578U processor with 16 GB RAM. The algorithms were implemented (the codes are available at https://users.aalto.fi/tarhio/codes/iupac.tgz (accessed on 19 December 2025)) in the C programming language and compiled with gcc 5.4.0 using the -O3 optimization level. The tests were carried out within the framework of Hume and Sunday [13]. As the text, we used the genome of the fruit fly (15 MB), which contains only base symbols of DNA. Because the cache size of the i7-4578U processor is 4 MB, the text of 15 MB is long enough to avoid cache interference with search times [14]. The patterns of lengths $4, 6, \dots, 4096$ were randomly selected from the text. The set of patterns of each length contains 200 patterns.

5.2. Results with Fixed Rate 3% of Degenerate Symbols

Procházka and Holub [5] experimentally tested VCF (Variant Call Format) files from the 1000 Genomes Project, and they found that the frequency of the degenerate symbols varies from 2% to 3%. Thus, we decided to apply the rate of 3% in our experiments. Before testing, both text and patterns were artificially degenerated so that 3% of characters were randomly changed to one of the seven degenerate symbols holding the original base symbol. For example, because the IUPAC symbol classes M, R, W, V, H, D, and N contain A, the base symbol A could be changed to any of these.
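One way to degenerate a sequence in this manner is sketched in C below. The sketch assumes that each base symbol is changed independently with the given probability and that the replacement is drawn uniformly from the seven candidates; the source does not state these details, and the names `containing` and `degenerate` are chosen here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The seven degenerate symbols of Table 1 that contain the given base
   symbol, or NULL when the argument is not a base symbol. */
static const char *containing(char base)
{
    switch (base) {
    case 'A': return "MRWVHDN";
    case 'C': return "MSYVHBN";
    case 'G': return "RSKVDBN";
    case 'T': return "WYKHDBN";
    default:  return NULL;
    }
}

/* Replaces each base symbol of s with probability rate by a random
   degenerate symbol containing it; returns the number of changes. */
size_t degenerate(char *s, size_t n, double rate)
{
    size_t changed = 0;
    assert(rate >= 0.0 && rate <= 1.0);
    for (size_t i = 0; i < n; i++) {
        const char *cands = containing(s[i]);
        if (cands == NULL)
            continue;               /* already degenerate or unknown */
        if ((double)rand() / ((double)RAND_MAX + 1.0) < rate) {
            s[i] = cands[rand() % 7];
            changed++;
        }
    }
    return changed;
}
```

A call such as `degenerate(text, n, 0.03)` would then produce a rate of about 3%. Every replaced symbol still matches the original base symbol, so the original occurrences of a pattern are preserved.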

We tested the following algorithms: BF, BNDM [3], BADPM [5], SBNDMq [4] for q = 4, 6, 8, SSB2 [6], SSB6I, and N32I. For BNDM and its derivatives, we used 64 for w. The peeling factor of N32I was 5, except 4 for $m = 4$.

Table 2 shows the search times of the algorithms for single patterns in milliseconds. In Table 2 the best time is underlined for each m as well as the times at most 20% more than the best time. Algorithms with underlined times can be considered good methods.

Figure 1 shows the search times of six algorithms of Table 2. BADPM, SBNDM6, SSB2, and SSB6I showed clear sublinear behavior. However, SSB2 was slightly slower for $m > 512$ because the number of 2-gram collisions grew.

In Table 2, N32I was good up to $m = 8$, SBNDM6 was good from $m = 12$ to $m = 256$, and SSB6I was good from $m = 256$. In addition, SBNDM4 was good for $m = 8$, and SBNDM8 and SSB2 were good for $m = 256$. Clearly, SBNDM6 was the best general algorithm among the tested methods.

When testing pattern lengths between 256 and 4096, we noticed that SSB6I was faster than the other algorithms for $m \geq 300$. Although SSB6I is very fast for long patterns, the preprocessing time of the pattern increases with m. The preprocessing times for 256, 1024, and 4096 were 0.04, 0.09, and 0.27 ms. However, SSB6I is the fastest algorithm for $m \leq 1024$ even when the preprocessing of a pattern is included in the case of a text of 15 MB.

It is not fair to compare the search time of BADPM with those of the other algorithms because BADPM requires preprocessing of the text, which takes approximately 16 ms/MB and thus reduces the applicability of BADPM.

Dehghani et al. [7] present two pattern matching algorithms on indeterminate strings. These algorithms are suitable for our problem, but they are not competitive. The search time of the best algorithm was about 50% of that of the BF algorithm in our tests.

5.3. Additional Experiments

We conducted additional experiments on SBNDM6a, which is SBNDM6 without 16-bit reading. SBNDM6a was clearly slower than SBNDM6. The difference was approximately 80% for $m \leq 20$.

We also performed experiments with N32Ib, which is a variation of N32I. We removed the transformation part (lines 2–5) from SIMDtest and replaced ap with yp and preprocessed the text to hold the bit encoding of the IUPAC symbols given in Table 1. The search times for N32Ib were approximately 2.2 ms for $m \geq 6$. Thus, N32Ib was about 40% faster than N32I. The preprocessing of the text takes approximately 8 ms/MB using the tr command of Linux.

When the frequency of degenerate symbols grows, all algorithms become slower. As an example, Table 3 shows the search times for SBNDM6 and N32I for $m = 12$ and for various frequencies of degenerate symbols. Note that N32I worked faster than SBNDM6 for frequencies greater than 20%. This might not have practical significance because the number of found occurrences is very high. The advantage of N32I decreases and finally disappears as m grows.

6. Conclusions

We have presented an experimental comparison of online search algorithms for IUPAC nucleotide sequences, allowing 3% degenerate symbols. In addition, we introduced two new algorithms, N32I and SSB6I. N32I utilizes SIMD instructions based on AVX2 technology, and it is faster than other algorithms for patterns shorter than eight characters. SSB6I is faster than other algorithms for patterns at least 300 characters long. SBNDM6 turned out to be the best general method to search IUPAC sequences.

Funding

This research received no external funding.

Data Availability Statement

The data is available with the codes.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Cornish-Bowden, A. Nomenclature for incompletely specified bases in nucleic acid sequences: Recommendations 1984. Nucleic Acids Res. 1985, 13, 3021–3030.
2. Faro, S.; Lecroq, T. The exact online string matching problem: A review of the most recent results. ACM Comput. Surv. 2013, 45, 1–42.
3. Navarro, G.; Raffinot, M. A bit-parallel approach to suffix automata: Fast extended string matching. In Annual Symposium on Combinatorial Pattern Matching; Farach-Colton, M., Ed.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 14–33.
4. Ďurian, B.; Holub, J.; Peltola, H.; Tarhio, J. Improving practical exact string matching. Inf. Process. Lett. 2010, 110, 148–152.
5. Procházka, P.; Holub, J. On-line searching in IUPAC nucleotide sequences. In Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019), Volume 3: BIOINFORMATICS, Prague, Czech Republic, 22–24 February 2019; Maria, E.D., Fred, A.L.N., Gamboa, H., Eds.; SciTePress: Setúbal, Portugal, 2019; pp. 66–77.
6. Tarhio, J. Searching long patterns with BNDM. Softw. Pract. Exp. 2024, 54, 2160–2169.
7. Dehghani, H.; Lecroq, T.; Mhaskar, N.; Smyth, W.F. Practical KMP/BM style pattern-matching on indeterminate strings. Discret. Appl. Math. 2025, 370, 22–33.
8. Holub, J.; Smyth, W.F. Algorithms on indeterminate strings. In Proceedings of the 14th Australasian Workshop on Combinatorial Algorithms, AWOCA, Seoul, Republic of Korea, 13–16 July 2003; pp. 36–45.
9. Navarro, G.; Raffinot, M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algorithmics 2000, 5, 4.
10. Peltola, H.; Tarhio, J. Alternative algorithms for bit-parallel string matching. In String Processing and Information Retrieval. SPIRE 2003; Nascimento, M.A., de Moura, E.S., Oliveira, A.L., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003; Volume 2857, pp. 80–94.
11. Tarhio, J.; Holub, J.; Giaquinta, E. Technology beats algorithms (in exact string matching). Softw. Pract. Exp. 2017, 47, 1877–1885.
12. Intel. Intel Intrinsics Guide. Available online: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html# (accessed on 11 June 2025).
13. Hume, A.; Sunday, D. Fast string searching. Softw. Pract. Exp. 1991, 21, 1221–1248.
14. Pakalén, W.; Peltola, H.; Tarhio, J.; Watson, B.W. Pitfalls of algorithm comparison. In Prague Stringology Conference 2021; Holub, J., Zdárek, J., Eds.; Czech Technical University: Prague, Czech Republic, 2021; pp. 16–29.

Figure 1. Search times of selected algorithms of Table 2.

Table 1. Mapping of the IUPAC symbols.

Symbol	Subset	Bit Encoding
A	{A}	0001
C	{C}	0010
G	{G}	0100
T	{T}	1000
M	{A,C}	0011
R	{A,G}	0101
W	{A,T}	1001
S	{C,G}	0110
Y	{C,T}	1010
K	{G,T}	1100
V	{A,C,G}	0111
H	{A,C,T}	1011
D	{A,G,T}	1101
B	{C,G,T}	1110
N	{A,C,G,T}	1111

Table 2. Search times in milliseconds in the DNA of the fruit fly with 3% degenerate symbols. The best time is underlined for each m as well as the times at most 20% more than the best time.

Algorithm∖m	4	6	8	12	16	64	256	1024	4096
BF	65	65	65	65	65	65	65	65	65
BNDM	46	32	25	18	14	4.1	4.2	4.1	4.1
BADPM	—	—	—	20	13	3.9	1.5	1.02	—
SBNDM4	11	5.4	4.0	3.1	2.8	1.8	1.8	1.8	1.8
SBNDM6	—	16	5.4	2.5	1.9	1.12	1.20	1.19	1.21
SBNDM8	—	—	21	4.4	2.6	1.8	1.08	1.14	1.08
SSB2	23	19	17	12	10	3.1	1.25	1.29	14
SSB6I	—	68	45	21	12	2.5	1.19	0.28	0.08
N32I	3.9	3.9	4.0	4.0	4.1	4.1	4.0	4.0	4.0

Table 3. Search times of SBNDM6 and N32I in milliseconds for $m = 12$ and for various frequencies of degenerate symbols.

	0%	5%	10%	15%	20%	40%	60%
SBNDM6	2.6	2.7	3.0	3.5	4.1	13	30
N32I	3.8	4.1	4.9	5.4	6.6	9.5	10

