Bacterial DNA Sequence Compression Models Using Artificial Neural Networks

It is widely accepted that the advances in DNA sequencing techniques have contributed to an unprecedented growth of genomic data. This fact has increased the interest in DNA compression, not only from the information theory and biology points of view, but also from a practical perspective, since such sequences require storage resources. Several compression methods exist, and particularly, those using finite-context models (FCMs) have received increasing attention, as they have been proven to effectively compress DNA sequences with low bits-per-base, as well as low encoding/decoding time-per-base. However, the amount of run-time memory required to store high-order finite-context models may become impractical, since a context-order as low as 16 requires a maximum of 17.2 x 109 memory entries. This paper presents a method to reduce such a memory requirement by using a novel application of artificial neural networks (ANN) to build such probabilistic models in a compact way and shows how to use them to estimate the probabilities. Such a system was implemented, and its performance compared against state-of-the art compressors, such as XM-DNA (expert model) and FCM-Mx (mixture of finite-context models) , as well as with general-purpose compressors. Using a combination of order-10 FCM and ANN, similar encoding results to those of FCM, up to order-16, are obtained using only 17 megabytes of memory, whereas the latter, even employing hash-tables, uses several hundreds of megabytes.


Introduction
The motivation behind specialization on DNA compression comes from the fact that general purpose compression algorithms are far from optimum and are not aware of the particularities of DNA sequences.The huge growth in DNA databases urges for better compression algorithms, and the increasing knowledge on DNA structure allows one to design compression algorithms that take advantage of its particularities and redundancies, as, for example, the so-called complemented inverted repeats.
The first encoder specialized in DNA compression was proposed in 1993 [1].Biocompress, as the authors named it, uses a substitution approach similar to the well-known Lempel-Ziv [2] coding scheme, but featured a mechanism to consider not only repetitions, but also complementary palindromes, as well.A second version, Biocompress-2, includes an order-2 finite-context model (FCM) as a fall-back scheme [3].
Besides other substitutional encoders, such as [4], an innovative model was proposed in [5], where several predictors based on inexact matches are combined and are used to estimate the entropy of DNA.Three years later, an algorithm employing context-tree weighting and substitutional compression was published [6], which used context-tree weighting whenever the substitutional approach yielded worse results.Furthermore, using a substitutional approach, GenCompress-1 and GenCompress-2 [7] generalized this concept, relying on operations, such as replacements, insertions and deletions, in order to transform a given past sub-string into the sub-string to be encoded.
The use of finite-context models regained attention in [8], again, as a fall-back algorithm, where the main algorithm, a normalized maximum likelihood substitutional approach, is replaced by an order-2 FCM whenever better and, as a last-resort, by a transparent mode.
Moreover, in [9], order-2 and order-3 FCMs were also used as fall-back mechanisms, in a substitutional-like compressor aimed at speed, giving more attention to the substitution of large sequences, rather than to small ones.
This approach was mainly improved by [10,11].The state-of-the-art compressor expert model (XM) [12] uses various experts consisting of order-1 and order-2 FCMs and a copy-expert that creates a symbolic dependence between the next symbol and a previously occurring one.The data-fusion of this expert mixture is made using Bayesian averaging and is followed by an arithmetic encoder.As such, up to this point, much of the work was based on the substitutional method, and if low performance is obtained, a low-order FCM is used.
Finally, FCM gained protagonism in [13][14][15][16], implemented up to order-16.As in the previous cases, the FCMs were followed by an arithmetic encoder, and several different-order FCMs compete.However, now, only this type of predictor was employed.This architecture lays on what is generally described as a probabilistic model estimator and an arithmetic encoder, as is depicted in Figure 1.The results show that, despite the simplicity of the algorithm, some sub-sequences can be accurately modeled by such models, which suggests that a short-term memory system such as this can make reasonable predictions in comparison to long-term memory systems as those employing the substitutional paradigm.Moreover, the encoder is very fast, in addition to achieving an encoding ratio similar to that obtained by the state-of-the-art Expert Model [11].The speed and the inherent simplicity of such an algorithm are great features, which makes it suitable for hardware implementations, such as field-programmable gate arrays (FPGAs), which would translate into high-performance compressors for high throughput sequencers.However, a major challenge of such an implementation is that, when the order of such FCMs is increased, the memory requirements increase exponentially, as order-N requires 4 N +1 memory entries (the actual memory required is dependent on the size of each entry).This problem can be minimized by using hash-tables, thus allocating memory only when a new conditioning context is encountered.However, this not only increases the complexity of such a system if it is intended to be implemented in FPGA, but also fails to address the memory requirement problem if very large sequences are to be encoded, as a large variety of conditioning contexts exist.
To tackle this memory/complexity problem, this paper proposes a novel architecture that has lower memory requirements and uses a well-known technology.To achieve that, the high-order memory intensive FCMs are replaced by artificial neural networks (ANNs), which have the particularity of being a compact way to interpolate data and for which a FPGA implementation is feasible.Such ANNs are used simultaneously with low-order FCMs, in a competitive fashion, using a forward-adaptive method, like the one presented in [14], which chooses which of the models is best suited to encode the next block and sends an identifier of the selected model as side information.
This work was particularly focused on bacterial DNA, due to their high entropy and few repeating regions, being thus harder to effectively encode, strengthening the need for specialized compressors.
To implement a DNA compressor using ANNs, several aspects must be addressed.After this introductory section, Section II presents how the data should be formatted at the input of the ANN and what targets should be used as training data.Section III formalizes the problem and specifies how should the training be performed in an effective way.Then, Section IV presents experimental results against other existing DNA compression methods using bacterial DNA sequences from the National Center for Biotechnology Information (NCBI) database, and finally, Section V draws some conclusions and points out some future work.

Data Representation
DNA sequences are coded by four bases, commonly represented within the alphabet, A = {A, C, G, T }.Due to the symbolic (hence, unordered) nature of these bases, care should be taken to keep the system independent of any relabelling between symbols when representing and manipulating the data in a computer environment.To cope with such a requirement, the data must be kept symbolic across the ANN rather than assigning numeric values to each symbol, which is commonly referred to as a classification problem [17,18].In this scenario, each base is coded as an N -tuple, where each element refers to a symbol, and all elements are zero, except the one corresponding to the coding symbol, which is set to one.Usually, in practical implementations, the values, 0.1 and 0.9, are used instead, since the sigmoidal activation functions have limits on these two values, thus requiring high weight values on the input branches.
Regarding the output, regular ANNs relate the input with the output deterministically.As such, since finite-context modelling is a non-deterministic problem-for the same preceding sequence, the next symbol, x n+1 , has a probabilistic model associated with it, P (s|x n−k+1...n )-such deterministic output is likely to be wrong.Moreover, when an ANN is trained, at each iteration, the weights of all connecting edges, usually denoted by a single weight vector, w, or a set of matrices, are optimized to minimize the result of a given cost function for a set-training set-of input/output (or target) pairs.However, if the next symbol, x n+1 , is not related to the input, x n−k+1...n , deterministically, the neural network is likely to fail converging during training and on making useful estimations.
To overcome this, the target outputs used for training should not be the symbols per se, but their probabilistic model instead.This method not only provides a deterministic output (for the same conditioning context, the same probabilistic model should be used), but also simplifies the implementation, as the output of the ANN can be used, after normalization, directly to feed the arithmetic encoder, as depicted in Figure 2.This configuration was used along with FCMs.To do that, an approach similar to [14] was taken: the sequence data is split in blocks, and for each block, all the probabilistic models are evaluated.The best one, the one which yields a lower entropy, is selected to encode the block, and an identifier of the model is coded as side information.A simple probabilistic model of the side information is maintained to compress it, too, using the arithmetic encoder.This competitive fashion allows the use of several models, and the final result is the minimum entropy of all models at each block, plus the side information.

Probabilistic Models Using Artificial Neural Networks
Within the alphabet, A = {A, C, G, T }, consider P (s|X) as the order-k conditional probability of symbol, s ∈ A, given X ∈ A k past outcomes.In these conditions, the error of the ANN over all the training set, E( w), for a given weight vector, w, and neural-network output, o s ( w, X), is defined as: where P (s|X) plays the role of target t s,X .This assumes that the finite-context probability models are known a priori, thus bringing no advantages, since such models must then be stored somehow.This obstacle will be overcome in the next two subsections.

Updating Probabilistic Models
Known applications of FCMs applied to the compression of highly non-stationary data, like DNA sequences, do not know the probabilistic models a priori.Instead, they start with a uniformly distributed model, and update it continuously, for example, through the use of counters, to estimate the probabilistic models based on incoming symbols [15].
Formally, this continuous update can be understood as a set of FCMs that change over the sequence index, n, and is here denoted by the set of probability models, P n (s|X) ∀ X ∈ A k , ∀ s ∈ A, shortened by C n .If the model set is updated every time a new symbol, x n+1 , is processed, then C n+1 is computed using not only x n+1 and the conditioning context, x n−k+1...n , but also C n , so that: where L is a short-hand notation to denote the number of occurrences of the conditioning context, x n−k+1...n , the parameter, α, is used to balance between a maximum likelihood estimator and a uniform distribution and |A| is the cardinality of the alphabet, which, in this case, is four.Note that for α = 1, this is the well known Laplace estimator.
As far as ANNs are concerned, each C n corresponds to a weight vector, w n .In analogy, considering an already trained network, w n , the training set composed by input/output target pairs used to obtain w n+1 through training is computed using x n+1 , the conditioning context, x n−k+1...n , and w n , as well.This way, the training targets used to obtain w n+1 are defined as: With this method, there is no longer the need to store all probabilistic models.Instead, we use the ANN with the former weight vector, w n , to retrieve such models and allow one to train w n+1 .However, L still poses a problem, as it requires |A| k entries, corresponding to the counters of each conditioning context.Moreover, it is not guaranteed that, for a given input, the sum of the ANN output is one.To overcome these two problems, α is omitted, and L is considered to be: where L is constant, equal for all conditioning contexts and quantifies the weight of a new symbol in the probabilistic model.However, while retraining the ANN with the training set consisting of all conditioning contexts could be manageable for low order models, that is not true for high order FCMs, due to the amount of different contexts.This makes the retraining time impractical.It is then mandatory to simplify such a training method.

Training Artificial Neural Networks
Several methods exist to train ANNs; yet, the most widely used is backpropagation with gradient descent, using a fixed-size network [17,18].This method is based on the calculation of the error, E( w), over the whole training set, which uses as the error function half the Euclidean distance between the output and target and backpropagating it towards the input layer, using only the first derivatives.Each weight is then updated according to the error contribution of each node, the error of each output unit and a learning rate, η.This is then repetitively done until the training error converges to a minimum or the cross-validation error starts to rise, due to over-fitting.Regarding the activation functions for each node, the logistic sigmoid was used.
However, when dealing with a large training set, iterating over all data before each parameter update, the so-called batch gradient descent could be too expensive in a computational sense; as such, some alternatives exist, such as stochastic and mini-batch gradient descent [17].The former method calculates the error and updates the weights accordingly for each training example.The latter splits the training set in smaller subsets, calculates the error and updates the weights for each subset.In a general sense, it is not guaranteed that the computed error decreases after each training iteration, but these stochastic methods may converge after one single run over all the training data.Moreover, stochastic gradient descent and mini-batch gradient descent both have the capability to deal with non-stationary data.If the ANN is being trained with input/output data relative to a time-varying nonlinear function and each training example is relative to a different moment in time, these methods are very robust.This makes this feature particularly useful for our application, due to the known non-stationarity of the DNA sequences.
From these two methods, the latter was selected, because it is the least computationally-intensive method and because it takes advantage of the fact that a forward-adaptive method was used, and as such, the data is already, by default, split into subsets.Note that this is equivalent to training the ANN using exclusively Equations ( 6) and (7).The effect on the ANN in regard to those other contexts referred to in Equation ( 5) is affected by its generalization capability and learning rate.However, it is not trivial to quantify this dependence individually.Instead, we only measured its coding efficiency, and it was able to slightly surpass current state-of-the-art encoders.
An important aspect worth mentioning is that the weight vector in a neural network must be initialized to random values prior to any training.Due to the symmetry imposed by an encoder/decoder system, such random values should be the same on both systems, and as such, should not be guaranteed by sending the random weight vector through the channel, as it is likely to jeopardize the coding efficiency, due to excessive side information.To overcome this, both systems use the same pseudo-random number generator and initialize their seed to a previously defined value.Moreover, since no information is transported between each encoded sequence and the seed used for the generation of the random values was kept constant, the encoding is deterministic, and as such, each individual sequence was only encoded once.
Three encoder configurations employing the proposed model were chosen.Two of them were based on FCMs with order-2, -4, -6, -8 and -10, using α = 1 for all contexts, except order-10, which used α = 0.1.Additionally, competing with these finite-context models, one neural network was added to each encoder.The encoder, hereafter referred to as NN1 (neural-network 1), used an ANN with a memory depth of 40 and 15 hidden nodes in its single hidden layer.The encoder, hereafter referred to as NN2, used an ANN with a memory depth of 16 and 15 hidden nodes.The third encoder, hereafter referred to as NN3, was composed of FCMs similar to FCM-M16, but with an additional artificial neural network similar to the one used in NN2.This last configuration was aimed at improving just the coding efficiency, as it brings no advantages regarding run-time memory requirements.Due to the nature of these compressors, they were all implemented over the original FCM source code, in the C language, and compiled using GCC (GNU C Compiler).Additionally, 7-Zip version 9.04 and PAQ12a were used for comparison against the special-purpose encoders.
Since this compressor is aimed at DNA sequences, it was evaluated using a large database of bacterial DNA sequences, namely, the one from the National Center for Biotechnology Information [19], which has 3, 652 complete sequences (without uncertainty) ranging from 1, 285 to 13, 033, 779 bases and totalling 6, 836, 141, 677 bases.
Unfortunately, there is no proven method to choose the number of layers and their size for an artificial neural network.As such, these values were selected by sweeping the memory depth and hidden node count from two to 40, and five to 30, respectively, and choosing the best compromise, taking into account not only the encoding ratio, but also the encoding time, as for larger networks the encoding time becomes impractical.The learning rate, η, used was five, which was the value that yielded the best results for a subset of NCBI bacteria DNA database and which have proven to give satisfying results over the rest of the set.Nevertheless, these results are not highly dependent on small variations of the node count.To illustrate that, Figure 3 shows the overall bits-per-base and the usage-ratio of the ANN model, when sweeping the memory depth from five to 60 and the hidden node count from five to 45 on the NN1 encoder.This was carried out on two uncorrelated random subsets of the complete data-set used in the rest of the work.It can be seen that there are no discontinuities in the plot and that the bits-per-base and usage surfaces are smooth.
Three parameters were measured during simulations: the overall bits-per-base (bpb), the overall encoding time in microseconds per base (µspb) and the run-time memory required.These results are summarized in Table 1, which represents the performance over all 3, 652 sequences.Figure 4 depicts the memory usage dependency over the number of bases for each sequence.In regards to memory, due to the 64-bit nature of the machine where these simulations were carried out, each finite-context memory entry uses four bytes and each ANN weight uses eight bytes.Regarding the execution time, all implementations were run in the same machine, an Intel Core 2 Duo P8600 at a constant clock frequency of 2.4 GHz and with 4 GiB of RAM.
In Figure 4, the discrepancy between the reference and the proposed encoders is clear.Moreover, note that the 17 MB used by encoders FCM-M10, NN1 and NN2 are mainly due to the order-10 FCM, which requires 4 10+1 memory entries, since the largest ANN is estimated to occupy the negligible amount of 20 k floating-point memory entries.  .Run-time memory usage of the encoders over the number of bases of each sequence.This dependency arises from the fact that, for models of an order higher than 10, hash-tables are used to minimize the memory allocated.Regarding these results, all three encoders, NN1, NN2 and NN3, have particular properties.In regards to computing time, NN1 is the slowest, due to the size of the neural network used.Moreover, its advantage over NN2 is not significant.NN3, on the other hand, has the best encoding ratio, with the same memory requirements as FCM-M16 and slightly less encoding time than XM-500.Finally, NN2 represents a good compromise between coding efficiency, time and memory used, as its coding efficiency is almost the same as XM-500, but slightly worse than FCM-M16.Additionally, its computation time is the third best among all encoders used.Moreover, it uses slightly less than 17 MB, against the average 195 MB (maximum 738 MB) of FCM-M16.FCM-M10, although the fastest, yielded the worst encoding efficiency, which reinforces the fact that an additional neural network indeed increases the coding performance.
Additionally, the competition among the various models was also measured, as it indicates the importance of each model in the encoding results presented previously.These are presented in Tables 2-4.Table 5 is used to compare the effect of the additional ANN in NN1 and NN2 over the FCM.
Inspecting Table 2, for example, the first line indicates that the order-2 FCM was used in 8.52% of the 100-bases blocks, and from those, the mean bpb is 1.9211.As such, up to order-6, all FCMs have an average bpb higher than ANN 40:15, which was used 45.19% of the time, with a mean average of 1.8388 bpb.Note, although, that the low-order FCMs were used whenever all the other models yielded worse results for the same block.
In Table 4, the model usage of ANN 16:15 decreases slightly, but keeps close to 40%.Note that although the order-16 FCM exhibits an impressive 1.18 bpb, it was only used 4% of the time, which means that it was able to accurately model small regions of the sequences.However, in all the other 96% of the blocks, the other models were better.In conclusion, from these tables, we see that although ANNs do not have an overall low bpb, its usage percentage indicates that they are better than all the FCMs 40% of the time and, as such, contribute to the reduction of the average bpb, as can be confirmed by the model usage and overall bpb of the plain FCM-M10 encoder.

Conclusions
In this work, it was verified that artificial neural networks are an effective way to compress data using a fraction of the run-time memory of the other encoders and, as such, to compact probabilistic models of data.Moreover, this architecture opens the way for the use of more complex machine learning techniques for probability-model estimation, which could be used in other areas of information theory.Despite the fact that FCMs have the ability to detect very low entropy regions where ANNs do not, two conclusions may be drawn from this.On the one hand, ANNs should be used along with FCMs, but, on the other hand, the finite-context order does not need to be too large.It is important to note that although large ANNs have encoding time as major a disadvantage, their simplicity and low memory requirement make them suitable for high-performance hardware, such as FPGAs.
Regarding the dimensioning of the ANN, unfortunately, there is no proven method to select an optimal ANN architecture.As such, other configurations could prove to be better.More work could be developed in this area.Moreover, if advances are made in this application of ANNs and a low-entropy is obtained, it is easier to draw biological conclusions from the strength of each neural network connection at the input layer, as it indicates the importance of each previous base to predict the next one, which is commonly referred to as dimensionality reduction.
Future work may consist in the evaluation of the performance of the encoder using fixed-point arithmetic within the ANN, so that the encoder could be effectively implemented in an FPGA.Moreover, since artificial neural networks were used only as a proof-of-concept, other machine learning techniques, such as support vector machines or deep-learning, can be used in this encoder architecture.Therefore, further investigation of different methods would be very interesting.

Figure 1 .
Figure 1.Structure of an encoder architecture, which employs a probabilistic model estimator and an arithmetic encoder for the symbols {A, C, G, T}.

Figure 2 .
Figure 2. Structure of the Time-Series Artificial Neural Network proposed.The bias units within the hidden layers are omitted for readability.

Figure 3 .
Figure 3. Bits-per-base and artificial neural network (ANN) model usage, for the NN1 encoder, by sweeping the memory depth (L1) and the hidden node count (L2).Two different subsets were used.

Table 1 .
Simulation results of the seven encoders used, over all sequences, depicting bits-per-base, µs per base and the average run-time memory required.bpb, bits-per-base; µspb, overall encoding time in microseconds per base; FCM, finite-context model; NN, neural-network; XM, expert model.

Table 2 .
Average model usage and compression ratio for NN1 over all sequences.

Table 3 .
Average model usage and compression ratio for NN2 over all sequences.

Table 4 .
Average model usage and compression ratio for NN3 over all sequences.

Table 5 .
Average model usage and compression ratio for FCM-M10 over all sequences.