Self-Organizing Map for Characterizing Heterogeneous Nucleotide and Amino Acid Sequence Motifs

A self-organizing map (SOM) is an artificial neural network algorithm that can learn from the training data consisting of objects expressed as vectors and perform non-hierarchical clustering to represent input vectors into discretized clusters, with vectors assigned to the same cluster sharing similar numeric or alphanumeric features. SOM has been used widely in transcriptomics to identify co-expressed genes as candidates for co-regulated genes. I envision SOM to have great potential in characterizing heterogeneous sequence motifs, and aim to illustrate this potential by a parallel presentation of SOM with a set of numerical vectors and a set of equal-length sequence motifs. While there are numerous biological applications of SOM involving numerical vectors, few studies have used SOM for heterogeneous sequence motif characterization. This paper is intended to encourage (1) researchers to study SOM in this new domain and (2) computer programmers to develop user-friendly motif-characterization SOM tools for biologists.


Introduction
A self-organizing map or SOM [1] is a grid of artificial neurons that are used to learn patterns from training data and use the learned pattern to perform non-hierarchical clustering to represent input vectors as discretized clusters, with vectors in the same cluster sharing similar features.It is used extensively in analyzing transcriptomic data, especially with gene expression measured over a number of time points [2][3][4][5][6][7].While SOM has almost always been presented as a non-hierarchical clustering method for numerical vectors (e.g., [1] and pp.231-250 of [8]), it theoretically can be adapted to any set of objects from which a pairwise distance between two objects can be computed.Thus, it should be applicable for fixed-length strings such as nucleotide or amino acid sequences.For example, the Kozak consensus [9,10] as a motif signal for ribosomes to locate mammalian translation start codon is often represented as a 7mer (RccAUGG).Similarly, 5 and 3 splicing sites are typically represented as exon-intron junctions with m nucleotides on the exon side and n nucleotides on the intron side, with a fixed length of m + n nucleotide per sequence [11,12].Splice sites are often heterogeneous, with some for major-class spliceosomes, some for minor-class spliceosomes, and some for splicing by special mechanisms.For example, the transcript of yeast gene, HAC1, is processed by a non-spliceosomal splicing mechanism [13][14][15][16], and splice sites of HAC1 introns are drastically different from those of regular introns [11].One may use SOM to characterize these heterogeneous splice sites, quantify similarities among them, and potentially discover new type of introns.Several studies have demonstrated the values of using SOM to characterize sequence motifs [17][18][19][20][21][22], but their efforts do not seem sufficiently appreciated by biologists.Given the many advantages of SOM over conventional clustering [1], biologists should gain a new perspective on the relationship among sequence motifs through the SOM extension in this paper.
SOM involves setting up a grid of artificial neurons, initializing them either with random values or with values from routine multidimensional scaling methods such as PCA, computing a distance (or similarity) between an input vector and each neuron to identify the winning neuron (which has the shortest distance or greatest similarity to the input vector), revising the features of the winning neuron and its neighbors as a learning process, and continuing with other input vectors until the process is converged (i.e., when the vector values of neurons no longer change).Such a trained SOM can then be used to classify input vectors that are not in the training data.
There are various ways of applying SOM to motif characterization, with the main difference among them being the measurement of distance/similarity between a SOM node and an input motif.I present this new extension of SOM in parallel with the familiar SOM application involving numerical vectors.In this way, the readers will find it easy to understand SOM with sequence motifs as input.

Distance or Similarity between Two Vectors
A key element in SOM is to compute a distance or similarity between two vectors, one representing the neuron and the other representing the input.For input with numeric vectors, the neuron is also a numeric vector, and there are many distances defined between two numeric vectors [7].For vectors x and y in N-dimensions, one of the simplest distances is the Euclidean distance (d) defined as For SOM on fixed-length sequences, the neuron will be represented either by a sequence or a set of sequences.I present two candidate distances, one for homologous input sequences and one for non-homologous input sequences.

Distance for Homologous Input Sequences
For homologous input sequences, one may use evolutionary distances derived from substitution models such as JC69 [23], K80 [24], F84 [25,26], HKY85 [27], TN93 [28], and GTR [29,30] models.Distances derived from these models are respectively referred to as JC69 distances, K80 distances, and so on.Each of these distances can be estimated by either independent estimation (IE) or simultaneous estimation (SE).IE uses information in two sequences only, and SE uses information in all pairs of sequences [31,32].As many biologists, including some practising phylogeneticists, are unaware of the difference between IE and SE methods, I offer a numeric illustration of these two estimation methods by using data in Table 1 and the K80 model.
The K80 model has two parameters which can be expressed either as αt and βt, or as D and κ, where D is evolutionary distance that we are interested in.The expected proportions of transitions and transversions between the two sequences, designated by E(P) and E(Q) and expressed in D and κ, respectively, are (2) Suppose we have three aligned sequences, S1, S2, and S3, with sequence length of L (=100).The observed number of sites with transitional and transversional differences are shown in Table 1.
For IE estimation, one simply substitutes E(P) and E(Q) in Equation (2) by the observed P and Q (which equal N s /L and N v /L, respectively (Table 1), and then solves for D and κ.Note that D cannot be estimated between S1 and S3 by this IE approach (Table 1 We take partial derivative of lnL with respect to D 12 , D 13 , D 14 , and κ, set them to zero and solve the four simultaneous equations for the four unknowns.This leads to κ = 6.2385 and the three D ij values shown in Table 1.There is no problem for the SE method to estimate D 13 which was not possible with the IE method.SE distances were implemented in DAMBE [33,34] as MLCompositeF84, MLCompositeTN93.For SOM, this κ is estimated from all sequence pairs and does not change during the learning process. Table 1.Observed number of sites between two sequences that are (1) identical (N i ), (2) different by a transition (N s ), and (3) different by a transversion (N v ) from pairwise comparisons among three sequences (S1, S2, and S3) of length 100.Independently estimated K80 distances (D IE ) and simultaneously estimated K80 distances (D SE ) from the maximum likelihood (ML) framework are included.Note that K80 distance cannot be computed for S1 and S3 using independent estimation method (labelled as 'inapplicable').An alternative, and simpler, approach is simply to use the alignment score as a similarity index.Alignment score can be obtained from either global or local alignment by dynamic programming, or by local string-matching algorithms such as BLAST and FASTA, the latter having been used in SOM for motif characterization [22].In addition to fixed-length string, SOM has also been used with variable-length strings [35,36].Such studies reduce sequences into word frequencies (e.g., trinucleotide frequencies) so that each sequence, regardless of its length, is represented as a fixed-length vector for use with conventional SOM.The distances derived from these non-model-based approaches do not correct for multiple substitutions and are not expected to increase linearly with time.For this reason, they are not appropriate for molecular phylogenetics.For unaligned sequences of different lengths, PhyPA [37] is statistically sound and rivals the maximum likelihood method in phylogenetic accuracy when highly divergent sequences are involved.

Distance for Non-Homologous Sequences
The input sequences used in SOM for motif characterization typically are not homologous.For example, splice sites of introns mentioned before share similarities but are typically non-homologous.In such cases, a position weight matrix (PWM, [38][39][40]), numerically illustrated in detail [41], is often used to represent a SOM neuron [17][18][19][20].I will illustrate the PWM approach in conjunction with the SOM algorithm.The shortcoming in PWM is that it does not take site-dependence into account.In contrast, Markov models can overcome this problem ( [8] pp.110-114) and have been used in conjunction with SOM [21].
Various other distance/similarity measures have been used for sequence motifs in the SOM context.The simplest ones use word frequencies (e.g., dinucletide, trinucleotide, or longer oligonucleotide frequencies, [35,36,42]).Nucleotide motifs can also be codified into a 3D entity and Euclidian distance can be computed between two such recoded motifs [43,44].
Lorenzo-Redondo et al. [44] mapped each nucleotide to a tetrahedron in which the four vertices are occupied by four nucleotides so that a nucleotide can be represented as {x, y, z}.A nucleotide sequence of length L can then be represented as {x 1 , y 1 , z 1 , x 2 , y 2 , z 2 , . . ., x L , y L , z L }, making it easy to compute a distance between two sequences of the same length.The tetrahedron could be shaped in such a way that the distance between A and G, and that between C and T, are smaller than that between a purine and a pyrimidine to accommodate the frequently observed transition bias where observed number of transitions can be nearly 80 times of the observed number of transversions [45].Delgado et al. [43] used exactly the same approach to convert a sequence to a vector.
One may also use alignment scores derived from BLAST or FASTA, with the latter having been used in the SOM context [22].FASTA algorithms with different word lengths for local string matching have been illustrated in detail before ([8] pp. .
Regardless of what distance/similarity one may use for fixed-length strings, the underlying SOM algorithm is the same.In what follows, I will illustrate the approach with PWM.The software SOMBRERO [18,19] implements SOM with PWM.

Training Data
Training SOM with a set of numeric vectors has been illustrated in detail before ( [8] pp.231-250).Suppose that we have 150 genes whose expression is measured at seven time points, so an input vector could be {2, 1, 34, 37, 3, 2, 66}.For using SOM with sequence motifs, suppose we have a set of 150 nucleotide sequences of a length of 7.

SOM Grid Size and Initialization
We first need to decide the size of SOM, i.e., the number of nodes (neurons) to have and whether the nodes should be arranged in one dimension, two dimensions, or a higher number of dimensions.Most SOMs use a two-dimensional grid of nodes.Choosing the right number of nodes may be tricky.Too many nodes increase computation and may not achieve sufficient data reduction/simplification, but too few nodes may not provide sufficient fit to the data.The minimum is to have two nodes, which leads to binary classification, i.e., the input vectors will be assigned to one of the two nodes.The rule of thumb is to use the following equation to determine the number of nodes in SOM where N node is the number of nodes in SOM rounded to integer, and N in is the number of input vectors.After the first run, one may decide to increase or decrease N node depending on how well the trained SOM fits the input data.We increase the number of neurons to improve the fit to the input data.
For our data with 150 sequences, N node = 36.7,so we may just use 36 nodes in a 6 × 6 configuration.
If we work with a set of numerical vectors, we would generate 36 random vectors for the 36 nodes as initial values.These nodes will change with the training process.For fixed-length nucleotide sequences, we will generate 36 random sequences of length 7 (same length as input sequences), with each nucleotide drawn randomly from the pool of nucleotides with frequencies being the same as that of the pooled input sequences.Table 2 (a) shows one such randomly generated sequence, together with the position weight matrix (PWM) derived from the sequences (Table 2 where i is site index (=1, 2, . . ., 7), S i is the nucleotide (A, C, G, or T) at site i, P i,S i is the site-specific frequency of S i at site i, and P S is the background frequency of S. P S values are typically the pooled frequencies from all input sequences, but can also be specified in different ways for different purposes [41].Most SOM programs with this approach would demand that the user input background frequencies which could include not only 4 nucleotide or 20 amino acid (AA) frequencies, but also dinucleotide and trinucleotide frequencies or di-AA frequencies.For example, SOMBRERO [18,19], which is such a SOM program, would ask use to input a file containing these frequencies.
PWM in Table 2 (b) is computed after adding a pseudocount of 0.01 to each cell in Table 2 (a), with the background frequencies being 0.3, 0.2, 0.2, and 0.3 for A, C, G, and T, respectively.The pseudocount is necessary to avoid taking logarithm of zero.Note that we have 36 nodes, with each node represented in the form of Table 2.The same binary coding of sequences has been used in several previous studies [17][18][19][20]46,47].

Update SOM
SOM is updated for each input vector or input sequence motif, and the updating involves three steps.The first is to identify the winning node for the input.The second is for the winning node to learn from the input.The third is for the neighbors of the winning node to learn from the input.This learning process continues until input vectors or sequence motifs result in negligible learning by SOM.

Identify the Winning Node
For input with numeric vectors, we compute the distance (e.g., Euclidean distance) between an input vector and each of the 36 node vectors, and the node with the smallest distance (or largest similarity index) to the input vector is the winning node and will learn from this input vector.Identification of the winning node with fixed-length nucleotide sequences is the same in principle but differs in operational details.Suppose our input sequence is "GCCATTA".We need to compute a distance or a similarity between this input sequence and each of the 36 nodes.We will use a similarity index called PWM score (PWMS) defined as Given the input sequence of "GCCATTA", the PWMS from the PWM in Table 2 (b) is Note that PWMS is a similarity index.The larger the PWMS, the more similarity between the input sequence and the node.As we have 36 nodes, we need to compute 36 PWMSs to find the winning node which is the one with the largest PWMS.Also note that the input sequences do not need to be of the same length but need to be at least as long as the node site dimension.Take nucleotide frequencies for example.If the node is represented as a 4 × 7 matrix, then the input sequence should be at least seven bases long.Longer input sequences are fine because we can scan the input sequence with a sliding window of seven-i.e., we obtain PWMS 1 for sites 1-7, PWMS 2 for sites 2-8, and so on-and use the largest PWMS i as the similarity between the input sequence and the node.

Learning by Revising the Winning Node and Its Neighbors: Numeric Vectors
The winning node and its neighbors will learn from the input vector through the following learning function where w i refers to values in the winning node's vector, and p i refers to values in the input vector.One could devise alternative learning functions different from that specified in Equation ( 8).The α parameter in Equation ( 8) is the learning rate.An α equal to 0 implies w i = w i , which means that the winning node does not learn from the input vector and will never change.An α equal to 1 implies w i = p i , which means that the winning node cannot retain any learned knowledge and will always mirror the features of the input vector.Thus, the α value should be greater than 0 but smaller than 1.In practise, α will initially be large (close to 1), but will diminish with each iteration.
The w i values will replace the original w i values of the winning node, which completes the update of the winning node.The next step is to modify the neighbors of the winning node because the neighboring nodes will also learn from the input vector.Updating the values of neighbors is governed by the following learning function where α n is the learning rate for the neighbors.Again, one can use one of many possible alternative learning functions, but we will just use Equation ( 9) to keep things simple.
In practice, the learning function of neighbors depends on how we define neighbors, and α n will be larger for immediate neighbors than for remote neighbors.For obvious reasons, they should also be smaller than α.If we designate α n1 as the learning rate for the immediate neighbors (i.e., nodes in physical contact with the winning nodes), α n2 as the learning rate for the neighbors of the immediate neighbors, and so on, then one simple way of choosing α n values is to set α n1 = α/2, α n2 = α n1 /2, and so on.
For our illustration, we will just define each node to have a maximum of four neighbors, i.e., the one to its left, the one to its right, the one above it, and the one below it.Thus defined, we need only one α n value which we will set to α/2.A winning node in a corner will have only two neighbors, with one above it and one to its right.We repeat this with decreasing α and α n values with each cycle of iteration until α equals a preset α min > 0 (We do not want to decrease α to zero because SOM will stop learning when α = 0).
How should we decrease α and α n with each cycle of updating?There is no optimal way of decreasing α.In my SOM implementation in AMIADA [7], I used the following equation where N v is the number of input vectors.In our case, N v = 150 and Q = 0.9933, i.e., α will be multiplied by Q after each cycle of iteration.Continuing the learning process will eventually lead to convergence, i.e., when the values in the nodes do not change any more or the change is smaller than a pre-fixed small value in two consecutive cycles of iteration.This final set of nodes is a trained SOM.Different nodes in the final trained SOM have different properties reflecting our data structure.For example, some vectors may have high values, some low values, some increasing with time (if the vector values represent measurements at specific time points), some decreasing with time, etc.
The trained SOM can now be used for classification.During the classification stage, the node values do not change.An input vector is assigned to a node if its distance is the smallest to this node which is often referred to as the host node of the input vector.However, before we use the trained SOM to do the classification of new genes, it is crucial to check how well the SOM fits the training data.This is typically done by first assigning the input vectors to their host nodes, and then computing the distance (or its square) between each gene and its host node.

Learning by Revising the Winning Node and its Neighbors: Fixed-Length Sequences
The handling of fixed-length nucleotide sequences is the same in principle as numeric vectors.Suppose that the input sequence is "GCCATTA" and the winning node is the one in Table 2 (a), then we simply add the new sequence to the original to obtain Table 3 (a), with the corresponding new PWM shown in Table 3 (b).Thus, the winning node, starting from a random sequence, will change to become a sequence profile with site-specific frequencies depending on the input sequences.Its PWM will depend on the changing site-specific frequencies and the constant background frequencies.For updating the neighboring nodes, we simply modify them by adding half of the input sequences, i.e., 0.5G, 0.5C, and so on.The rest of the computation is the same as with numeric vectors as input.

The Fit of SOM to Input Data
Measuring the fit of SOM to the input data has an easy side and a difficult side.The easy side is to characterize the deviation of each input vector to the node vector.This can be measured by Kohonen's quantization error [1] or by the conventional residual sum of squares (RSS).Such deviations between input vectors and SOM nodes can be reduced by increasing the grid size of SOM at the risk of overfitting.There are model selection protocols [48] that can help determine the optimal SOM grid size.The difficult side is to characterize the "data configuration" and the "SOM configuration" [49][50][51].Specifically, what information is lost when reducing the multi-dimensional data to two dimensions in a typical SOM grid, and which dimension has lost significant information?
One approximation to this problem is the topographic product [50] based on a simple rationale.That is, if pairs of points are the closest to each other in the original multidimensional space, and if they are also mapped to the same (or neighboring) SOM nodes, then at least the neighborhood relationship is preserved.Operationally, if pairwise distances of points in the original multi-dimensional space correlate highly with pairwise distance of points mapped to the SOM nodes, then the mapping is good, otherwise it is not.Topological measures based on this rationale are termed the 'neighborhood preserving approach' [52][53][54].SOM performs well in preserving neighborhood relationships [1].However, this neighborhood preservation approach is associated with two types of uncertainties.The first is how well a particular characterization of neighborhood relationships-e.g., the neighborhood function expressed in Equation (4) in Villman [54]-captures the true neighborhood relationships.The second is how well neighborhood relationships can summarize the global topology.Increasing SOM dimension from a 2-D grid to higher dimensions will preserve neighborhood relationships better [49] but this is somewhat self-defeating because the main objective of SOM is to visualize spatial relationships in low-dimensions.
Assuming that data topology is well represented by SOM, Kaski and Lagus [51] further developed an index for measuring similarity in topology between two data sets.The rationale can be simplified as follows.Suppose we train and map data set 1 to SOM1 of size (m, n) and designate the number of data points mapped to SOM1 nodes as N 1,1 , N 1,2 , . . ., N 2,1 , N 2,2 , . . ., N m,n .We can now map points in data set 2 to SOM1 nodes and obtain another set of N i,j values for data set 2. The correlation between the two sets of N i,j values is a measure of topological similarity between the two data sets.We can also start with data set 2, and train and map the data points to SOM2, and then use SOM2 to obtain two sets of N i,j values and compute their correlations.Such a topological similarity index has a simple interpretation.For example, if our two data sets are transcription factor binding sites from human and mouse, and we found nearly perfect correlation in N i,j values between the two data sets, then we may conclude that gene regulation mechanisms are nearly identical between the two mammalian species.
All these developments are interesting, but their significance to biologists is not clear.Most computational methods adopted by biologists are simply by trial and error and not by mathematical justification.In the case of SOM, mathematical justification is difficult anyway [55].Computational methods in biology are filtered by 'natural selection', i.e., methods generating biological insights are preserved and those not are eliminated.

Software Implementing SOM with PWM
The software tools implementing SOM with the PWM approach are already available, e.g., SOMBRERO [18,19].Note that SOMBRERO is different from SOMbrero [56] which is an R package for SOM but does not deal specifically with sequence motif discovery and characterization.There are several other papers reporting SOM implementation involve PWM or variations of PWM, e.g., SOMEA [21], READ [47], and READ CSF [46], but I was not able to find a working program.

Conclusions
SOM is a versatile neural network algorithm that can be used in a wide variety of biological data analysis.My illustration of its application for characterizing fixed-length sequences should pave the way for using SOM to analyze a variety of sequence motifs involved in gene regulation, facilitating the identification of new drug targets.

Table 2 .
A matrix representation of sequence "ACCGTTA" (a).The resulting position weight matrix (b) is obtained after adding a pseudocount of 0.01 to each cell in (a), with background frequencies being 0.3, 0.2, 0.2, and 0.3 for A, C, G, and T, respectively.
(a) by the new input sequence "GCCATTA" and the resulting position weight matrix (PWM) obtained in (b) after adding a pseudocount of 0.01 to each cell in (a).

Table 3 .
Updating the node in Table2