Special Issue "Algorithms for Sequence Analysis and Storage"

A special issue of Algorithms (ISSN 1999-4893).

Deadline for manuscript submissions: closed (15 March 2013)

Special Issue Editor

Guest Editor
Prof. Dr. Veli Mäkinen

Department of Computer Science, University of Helsinki, P.O. Box 68, FI-00014 Helsinki, Finland
Interests: algorithms and data structures; computational molecular biology; sequence analysis; string algorithms; data compression; algorithm engineering

Special Issue Information

Dear Colleagues,

Analysis of high-throughput sequencing data has become a crucial component of genome research. For example, methods based on the latest developments in compressed data structures, namely index structures exploiting the Burrows-Wheeler transform, are widely deployed in the discovery of disease-causing mutations. Such approaches succeed because they resolve the dilemma of an index requiring more space than the data itself when the data is already enormous. Parallel computation and the use of special hardware such as GPUs have also proven to be important paradigms for scalable analysis methods. With our increased knowledge of the genomic structure of the whole human population, and with the development of sequencing techniques and their applications in studying RNAs, metapopulations, and epigenetics, the field seeks new, innovative, and universal algorithmic approaches that scale to current and future needs in the analysis and storage of biological sequences. This special issue is dedicated to approaches to biological sequence analysis that have algorithmic novelty and potential for fundamental impact on methods used for genome research. Theoretical studies that increase our understanding of the limits of indexing, compression, and parallel computation in this context are also welcome.

Prof. Dr. Veli Mäkinen
Guest Editor

Submission

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. Papers will be published continuously (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are refereed through a peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed Open Access quarterly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 300 CHF (Swiss Francs). English correction and/or formatting fees of 250 CHF (Swiss Francs) will be charged in certain cases for those articles accepted for publication that require extensive additional formatting and/or English corrections.


Keywords

  • high-throughput sequencing
  • compressed data structures
  • parallel computation
  • sequence alignment
  • fragment assembly
  • genomics
  • transcriptomics
  • metagenomics
  • epigenomics

Published Papers (7 papers)

Editorial

Open Access Editorial Editorial: Special Issue on Algorithms for Sequence Analysis and Storage
Algorithms 2014, 7(1), 186-187; doi:10.3390/a7010186
Received: 14 March 2014 / Revised: 19 March 2014 / Accepted: 19 March 2014 / Published: 25 March 2014
PDF Full-text (71 KB) | HTML Full-text | XML Full-text
Abstract This special issue of Algorithms is dedicated to approaches to biological sequence analysis that have algorithmic novelty and potential for fundamental impact in methods used for genome research. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)

Research

Open Access Article Modeling Dynamic Programming Problems over Sequences and Trees with Inverse Coupled Rewrite Systems
Algorithms 2014, 7(1), 62-144; doi:10.3390/a7010062
Received: 19 March 2013 / Revised: 6 February 2014 / Accepted: 14 February 2014 / Published: 7 March 2014
Cited by 7 | PDF Full-text (613 KB)
Abstract
Dynamic programming is a classical algorithmic paradigm, which often allows the evaluation of a search space of exponential size in polynomial time. Recursive problem decomposition, tabulation of intermediate results for re-use, and Bellman’s Principle of Optimality are its well-understood ingredients. However, algorithms often lack abstraction and are difficult to implement, tedious to debug, and delicate to modify. The present article proposes a generic framework for specifying dynamic programming problems. This framework can handle all kinds of sequential inputs, as well as tree-structured data. Biosequence analysis, document processing, molecular structure analysis, comparison of objects assembled in a hierarchic fashion, and generally, all domains come under consideration where strings and ordered, rooted trees serve as natural data representations. The new approach introduces inverse coupled rewrite systems. They describe the solutions of combinatorial optimization problems as the inverse image of a term rewrite relation that reduces problem solutions to problem inputs. This specification leads to concise yet translucent specifications of dynamic programming algorithms. Their actual implementation may be challenging, but eventually, as we hope, it can be produced automatically. The present article demonstrates the scope of this new approach by describing a diverse set of dynamic programming problems which arise in the domain of computational biology, with examples in biosequence and molecular structure analysis. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)
Open Access Article Sublinear Time Motif Discovery from Multiple Sequences
Algorithms 2013, 6(4), 636-677; doi:10.3390/a6040636
Received: 11 June 2013 / Revised: 30 September 2013 / Accepted: 1 October 2013 / Published: 14 October 2013
Cited by 1 | PDF Full-text (479 KB) | HTML Full-text | XML Full-text
Abstract
In this paper, a natural probabilistic model for motif discovery has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet, Σ. A motif G = g1g2 ... gm is a string of m characters. In each background sequence is implanted a probabilistically-generated approximate copy of G. For a probabilistically-generated approximate copy b1b2 ... bm of G, every character, bi, is probabilistically generated, such that the probability for bi ≠ gi is at most α. We develop two new randomized algorithms and one new deterministic algorithm. They make advancements in the following aspects: (1) The algorithms are much faster than those before. Our algorithms can even run in sublinear time. (2) They can handle any motif pattern. (3) The restriction for the alphabet size is a lower bound of four. This gives them potential applications in practical problems, since gene sequences have an alphabet size of four. (4) All algorithms have rigorous proofs about their performances. The methods developed in this paper have been used in the software implementation. We observed some encouraging results that show improved performance for motif detection compared with other software. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)
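The implanted-motif model described in the abstract above can be illustrated with a small generator. This is only a sketch under stated assumptions, not the authors' software: the function name `implant_motif`, the sequence length `n`, the uniform choice of implant position, and the decision to mutate each character with probability exactly α (which satisfies the "at most α" condition) are all hypothetical choices made for illustration.

```python
import random


def implant_motif(k, n, motif, alpha, alphabet="ACGT", seed=0):
    """Return k (sequence, start) pairs following the implanted-motif model.

    Each sequence has length n, is drawn uniformly from `alphabet`, and
    carries one approximate copy of `motif` beginning at index `start`.
    """
    if n < len(motif):
        raise ValueError("sequence length must be at least the motif length")
    rng = random.Random(seed)  # seeded so runs are reproducible
    result = []
    for _ in range(k):
        # Background: every character is a random character from the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Approximate copy: each character differs from the motif with
        # probability alpha (an assumption; the model only bounds it by alpha).
        copy = []
        for g in motif:
            if rng.random() < alpha:
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)
        # Implant position chosen uniformly (also an assumption).
        start = rng.randrange(n - len(motif) + 1)
        background[start:start + len(motif)] = copy
        result.append(("".join(background), start))
    return result
```

With `alpha=0.0` every sequence contains an exact copy of the motif at the reported position, which gives a simple sanity check before testing a motif finder on noisier instances.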
Open Access Article Efficient in silico Chromosomal Representation of Populations via Indexing Ancestral Genomes
Algorithms 2013, 6(3), 430-441; doi:10.3390/a6030430
Received: 18 March 2013 / Revised: 18 July 2013 / Accepted: 23 July 2013 / Published: 30 July 2013
Cited by 5 | PDF Full-text (165 KB) | HTML Full-text | XML Full-text
Abstract
One of the major challenges in handling realistic forward simulations for plant and animal breeding is the sheer number of markers. Due to advancing technologies, the requirement has quickly grown from hundreds of markers to millions. Most simulators are lagging behind in handling these sizes, since they do not scale well. We present a scheme for representing and manipulating such realistic size genomes, without any loss of information. Usually, the simulation is forward and over tens to hundreds of generations with hundreds of thousands of individuals at each generation. We demonstrate through simulations that our representation can be two orders of magnitude faster and handle at least two orders of magnitude more markers than existing software on realistic breeding scenarios. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)
Open Access Article Filtering Degenerate Patterns with Application to Protein Sequence Analysis
Algorithms 2013, 6(2), 352-370; doi:10.3390/a6020352
Received: 29 March 2013 / Revised: 30 April 2013 / Accepted: 3 May 2013 / Published: 22 May 2013
Cited by 2 | PDF Full-text (427 KB) | HTML Full-text | XML Full-text
Abstract
In biology, the notion of degenerate pattern plays a central role for describing various phenomena. For example, protein active site patterns, like those contained in the PROSITE database, e.g., [FY]DPC[LIM][ASG]C[ASG], are, in general, represented by degenerate patterns with character classes. Researchers have developed several approaches over the years to discover degenerate patterns. Although these methods have been exhaustively and successfully tested on genomes and proteins, their outcomes often far exceed the size of the original input, making the output hard to manage and to interpret through refined analysis requiring manual inspection. In this paper, we discuss a characterization of degenerate patterns with character classes, without gaps, and we introduce the concept of pattern priority for comparing and ranking different patterns. We define the class of underlying patterns for filtering any set of degenerate patterns into a new set that is linear in the size of the input sequence. We present some preliminary results on the detection of subtle signals in protein families. Results show that our approach drastically reduces the number of patterns in output for a tool for protein analysis, while retaining the representative patterns. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)
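To make the idea of a gap-free degenerate pattern with character classes concrete, the sketch below matches a pattern written in the bracket notation used in the abstract above, such as [FY]DPC[LIM][ASG]C[ASG], against a protein sequence. It does not implement the paper's filtering or pattern-priority method; the function names `parse_pattern` and `find_occurrences` and the example sequence are hypothetical, and the parser assumes well-formed brackets.

```python
def parse_pattern(pattern):
    """Split a gap-free pattern into one set of allowed characters per position.

    A bracketed group such as [LIM] is a character class; any other
    character stands for itself.
    """
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.index("]", i)  # raises ValueError if unbalanced
            positions.append(frozenset(pattern[i + 1:close]))
            i = close + 1
        else:
            positions.append(frozenset(pattern[i]))
            i += 1
    return positions


def find_occurrences(pattern, sequence):
    """Return every start index where the pattern matches the sequence."""
    positions = parse_pattern(pattern)
    m = len(positions)
    return [
        start
        for start in range(len(sequence) - m + 1)
        if all(sequence[start + j] in positions[j] for j in range(m))
    ]


# Made-up sequence containing two different strings that match the pattern.
hits = find_occurrences("[FY]DPC[LIM][ASG]C[ASG]", "MKFDPCLACAYDPCMGCS")
```

Here `hits` lists the start index of each match; two distinct substrings satisfy the same pattern, which is exactly what makes such patterns degenerate.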
Open Access Article Practical Compressed Suffix Trees
Algorithms 2013, 6(2), 319-351; doi:10.3390/a6020319
Received: 18 March 2013 / Revised: 24 April 2013 / Accepted: 26 April 2013 / Published: 21 May 2013
Cited by 6 | PDF Full-text (560 KB)
Abstract
The suffix tree is an extremely important data structure in bioinformatics. Classical implementations require much space, which renders them useless to handle large sequence collections. Recent research has obtained various compressed representations for suffix trees, with widely different space-time tradeoffs. In this paper we show how the use of range min-max trees yields novel representations achieving practical space/time tradeoffs. In addition, we show how those trees can be modified to index highly repetitive collections, obtaining the first compressed suffix tree representation that effectively adapts to that scenario. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)
Open Access Article Multi-Sided Compression Performance Assessment of ABI SOLiD WES Data
Algorithms 2013, 6(2), 309-318; doi:10.3390/a6020309
Received: 18 March 2013 / Revised: 23 April 2013 / Accepted: 27 April 2013 / Published: 21 May 2013
Cited by 2 | PDF Full-text (288 KB) | HTML Full-text | XML Full-text
Abstract
Data storage has been a major and growing part of IT budgets for research for many years. Especially in biology, the amount of raw data products is growing continuously, and the advent of the so-called "next-generation" sequencers has made things worse. Affordable prices have pushed scientists to massively sequence whole genomes and to screen large cohorts of patients, thereby producing tons of data as a side effect. The need to fit as much data as possible into the available storage volumes has encouraged new compression algorithms and tools. We focus here on state-of-the-art compression tools and measure their compression performance on ABI SOLiD data. Full article
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)

Journal Contact

MDPI AG
Algorithms Editorial Office
St. Alban-Anlage 66, 4052 Basel, Switzerland
algorithms@mdpi.com
Tel. +41 61 683 77 34
Fax: +41 61 302 89 18