
Algorithms 2008, 1(2), 43-51; https://doi.org/10.3390/a1020043

Article
A PTAS For The k-Consensus Structures Problem Under Squared Euclidean Distance
1
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada N2L 3G1
2
Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka 819-0395, Japan
3
Department of Mathematics, National University of Singapore, Singapore 117543
*
Author to whom correspondence should be addressed.
Received: 5 September 2008; in revised form: 1 October 2008 / Accepted: 9 October 2008 / Published: 9 October 2008

Abstract

In this paper we consider a basic clustering problem that has uses in bioinformatics. A structural fragment is a sequence of $\ell$ points in 3D space, where $\ell$ is a fixed natural number. Two structural fragments $f_1$ and $f_2$ are equivalent if and only if $f_1 = f_2 \cdot R + \tau$ under some rotation R and translation τ. We consider the distance between two structural fragments to be the sum of the squared Euclidean distances between all corresponding points of the structural fragments. Given a set of n structural fragments, we consider the problem of finding k (or fewer) structural fragments $g_1, g_2, \ldots, g_k$, so as to minimize the sum of the distances between each of $f_1, f_2, \ldots, f_n$ and its nearest structural fragment in $g_1, \ldots, g_k$. In this paper we show a polynomial-time approximation scheme (PTAS) for the problem through a simple sampling strategy.
Keywords:
Clustering 3D point sequences; squared Euclidean distance; algorithm; polynomial-time approximation scheme.

1. Introduction

In this paper we consider the problem of clustering similar sequences of 3D points. Two such sequences of points are considered the same if they are equivalent under rotation and translation. The scenario which we consider is as follows. Suppose there is an original sequence of points that gave rise to a few variations of itself, through slight changes in some or all of its points. Given these variations of the sequence, we are to reconstruct the original sequence. A likely candidate for such an original sequence would be a sequence which is “nearest” to the variations in terms of some distance measure.
A more complicated scenario involves k original sequences of the same length. Formally, we formulate the problem as follows. Given n sequences of points $f_1, f_2, \ldots, f_n$, we are to find a set of k sequences $g_1, \ldots, g_k$, such that the sum of distances
$\sum_{1 \le i \le n} \min_{1 \le j \le k} dist(f_i, g_j)$ (1)
is minimized. In this paper we consider the case where $dist$ is the minimum sum of squared Euclidean distances between each of the points in the two sequences $f_i$ and $g_j$, under all possible rigid transformations on the sequences of points. A cost function in the form of the squared Euclidean distance is used in many techniques for clustering 3D points . Since our clustering problem is quite different from those previously studied, it calls for a new technique. (The “square” in the distance measure is to fulfill a condition needed by the method in this paper. The method does not work, for example, in the case of the root mean squared Euclidean distance. On the other hand, the method easily adapts to other distance measures that fulfill the required condition.)
Such a problem has potential use in clustering protein structures. A protein structure is typically given as a sequence of points in 3D space, and for various reasons, there are typically minor variations in their measured structures. The problem can be considered a model of the situation where we have a set of measurements of a few protein structures, and are to reconstruct the original structures.
In this paper, we show that there is a polynomial-time approximation scheme (PTAS) for the problem, through a sampling strategy. More precisely, we show that an optimal solution obtained by sampling smaller subsets of the input suffices to give us an approximate solution, and the approximation ratio improves as we increase the size of the subsets we sample.

2. Preliminaries

Throughout this paper we let $\ell$ be a fixed non-zero natural number. A structural fragment is a sequence of $\ell$ 3D points. The mean square distance ($MS$) between two structural fragments $f = (f[1], \ldots, f[\ell])$ and $g = (g[1], \ldots, g[\ell])$ is defined to be
$MS(f, g) = \min_{R \in \mathcal{R}, \tau \in \mathcal{T}} \sum_{i=1}^{\ell} \| f[i] - (R \cdot g[i] + \tau) \|^2$ (2)
where $\mathcal{R}$ is the set of all rotation matrices, $\mathcal{T}$ the set of all translation vectors, and $\| x - y \|$ is the Euclidean distance between $x, y \in \mathbb{R}^3$.
The root of the $MS$ measure, $RMS(f, g) = \sqrt{MS(f, g)}$, is a measure that has been extensively studied. Note that the $R \in \mathcal{R}$ and $\tau \in \mathcal{T}$ that minimize $\sum_{i=1}^{\ell} \| f[i] - (R \cdot g[i] + \tau) \|^2$ to give us $MS(f, g)$ will also give us $RMS(f, g)$, and vice versa. Since, given any f and g, there are closed-form equations [2,3] for finding the R and τ that give $RMS(f, g)$, $MS(f, g)$ can be computed efficiently for any f and g.
Furthermore, it is known that to minimize $\sum_{i=1}^{\ell} \| f[i] - (R \cdot g[i] + \tau) \|^2$, the centroids of f and g must coincide . Due to this, without loss of generality we assume that all structural fragments have their centroids at the origin. Such transformations can be done in $O(n\ell)$ time. After such transformations, in computing $MS(f, g)$, only the parameter $R \in \mathcal{R}$ needs to be considered, that is,
$MS(f, g) = \min_{R \in \mathcal{R}} \sum_{i=1}^{\ell} \| f[i] - R \cdot g[i] \|^2$ (3)
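The centering step can be sketched as follows. This is a minimal illustration, assuming fragments are stored as lists of (x, y, z) tuples; the helper names `centroid`, `center` and `sq_dist` are hypothetical and do not come from the paper. The example only checks, for the identity rotation, that matching the centroids does not increase the summed squared distance; it does not perform the minimization over rotations.

```python
def centroid(fragment):
    """Mean of the points of a fragment (a list of (x, y, z) tuples)."""
    n = len(fragment)
    return tuple(sum(p[c] for p in fragment) / n for c in range(3))

def center(fragment):
    """Translate a fragment so that its centroid lies at the origin."""
    cx, cy, cz = centroid(fragment)
    return [(x - cx, y - cy, z - cz) for (x, y, z) in fragment]

def sq_dist(f, g):
    """Sum of squared Euclidean distances between corresponding points."""
    return sum((a - b) ** 2 for p, q in zip(f, g) for a, b in zip(p, q))

# Two hypothetical fragments with three points each.
f = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
g = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (6.0, 6.5, 5.0)]

before = sq_dist(f, g)                 # no translation applied
after = sq_dist(center(f), center(g))  # centroids moved to the origin
```

Centering each of the n fragments touches every point once, which is where the $O(n\ell)$ bound comes from.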
Suppose that given a set of n structural fragments $f_1, f_2, \ldots, f_n$, we are to find k structural fragments $g_1, \ldots, g_k$, such that each structural fragment $f_i$ is “near”, in terms of the $MS$, to at least one of the structural fragments in $g_1, \ldots, g_k$. We formulate such a problem as follows:
k-Consensus Structural Fragments Problem Under $MS$
Input: n structural fragments $f_1, \ldots, f_n$, and a non-zero natural number $k < n$.
Output: k structural fragments $g_1, \ldots, g_k$, minimizing the cost $\sum_{i=1}^{n} \min_{1 \le j \le k} MS(f_i, g_j)$.
In this paper we will demonstrate that there is a PTAS for the problem.
We use the following notation. The cardinality of a set A is written $|A|$. For a set A and non-zero natural number n, $A^n$ denotes the set of all length-n sequences of elements of A. Let the elements of a set A be indexed, say $A = \{f_1, f_2, \ldots, f_n\}$; then $A^{m!}$ denotes the set of all length-m sequences $f_{i_1}, f_{i_2}, \ldots, f_{i_m}$, where $1 \le i_1 \le i_2 \le \ldots \le i_m \le n$. For a sequence S, $S(i)$ denotes the i-th element in S, and $|S|$ denotes its length.

3. PTAS for the k-Consensus Structural Fragments

The following lemma, from , is central to the method.
Lemma 1 () Let $a_1, a_2, \ldots, a_n$ be a sequence of real numbers and let $r \in \mathbb{N}$, $1 \le r \le n$. Then the following equation holds:
$\frac{1}{n^r} \sum_{1 \le i_1, i_2, \ldots, i_r \le n} \sum_{i=1}^{n} \left( \frac{a_{i_1} + a_{i_2} + \cdots + a_{i_r}}{r} - a_i \right)^2 = \frac{r+1}{r} \sum_{i=1}^{n} \left( \frac{a_1 + a_2 + \cdots + a_n}{n} - a_i \right)^2$ (4)
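Let $P_1 = (x_1, y_1, z_1), P_2 = (x_2, y_2, z_2), \ldots, P_n = (x_n, y_n, z_n)$ be a sequence of 3D points. Applying Lemma 1 to each coordinate separately gives
$\frac{1}{n^r} \sum_{1 \le i_1, i_2, \ldots, i_r \le n} \sum_{i=1}^{n} \left\| \frac{P_{i_1} + P_{i_2} + \cdots + P_{i_r}}{r} - P_i \right\|^2$
$= \frac{1}{n^r} \sum_{1 \le i_1, \ldots, i_r \le n} \sum_{i=1}^{n} \left[ \left( \frac{x_{i_1} + \ldots + x_{i_r}}{r} - x_i \right)^2 + \left( \frac{y_{i_1} + \ldots + y_{i_r}}{r} - y_i \right)^2 + \left( \frac{z_{i_1} + \ldots + z_{i_r}}{r} - z_i \right)^2 \right]$
$= \frac{r+1}{r} \sum_{i=1}^{n} \left[ \left( \frac{x_1 + \ldots + x_n}{n} - x_i \right)^2 + \left( \frac{y_1 + \ldots + y_n}{n} - y_i \right)^2 + \left( \frac{z_1 + \ldots + z_n}{n} - z_i \right)^2 \right]$
$= \frac{r+1}{r} \sum_{i=1}^{n} \left\| \frac{P_1 + P_2 + \cdots + P_n}{n} - P_i \right\|^2$ (5)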
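One can similarly extend the equation to structural fragments. Let $f_1, \ldots, f_n$ be n structural fragments; the equation becomes:
$\frac{1}{n^r} \sum_{1 \le i_1, \ldots, i_r \le n} \sum_{i=1}^{n} \left\| \frac{f_{i_1} + \cdots + f_{i_r}}{r} - f_i \right\|^2 = \frac{r+1}{r} \sum_{i=1}^{n} \left\| \frac{f_1 + \cdots + f_n}{n} - f_i \right\|^2$ (6)
The left-hand side is an average over all $n^r$ choices of $i_1, \ldots, i_r$, and at least one choice must do no worse than the average. The equation therefore says that there exists a sequence of r structural fragments $f_{i_1}, f_{i_2}, \ldots, f_{i_r}$ such that
$\sum_{i=1}^{n} \left\| \frac{f_{i_1} + \cdots + f_{i_r}}{r} - f_i \right\|^2 \le \frac{r+1}{r} \sum_{i=1}^{n} \left\| \frac{f_1 + \cdots + f_n}{n} - f_i \right\|^2$ (7)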
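The identity of Lemma 1 and the existence statement that follows from it can be checked numerically on a small scalar instance. The sketch below is only an illustration: the function name `lemma1_sides` and the sample values are hypothetical, and exact rational arithmetic is used so that the two sides can be compared for equality.

```python
from fractions import Fraction
from itertools import product

def lemma1_sides(a, r):
    """Return (lhs, rhs, best) for the scalar identity of Lemma 1.

    lhs  : average, over all r-tuples of indices, of the cost of the tuple mean
    rhs  : (r + 1) / r times the cost of the overall mean
    best : smallest cost achieved by any single r-tuple
    """
    n = len(a)
    a = [Fraction(x) for x in a]
    mean = sum(a) / n
    rhs = Fraction(r + 1, r) * sum((mean - x) ** 2 for x in a)
    costs = []
    for idx in product(range(n), repeat=r):  # all n**r index tuples
        m = sum(a[i] for i in idx) / r
        costs.append(sum((m - x) ** 2 for x in a))
    lhs = sum(costs) / n ** r
    return lhs, rhs, min(costs)

lhs, rhs, best = lemma1_sides([1, 4, 6, 9], r=2)
```

For this instance both sides equal 51, while the best pair (for example 4 and 6, whose mean is the overall mean 5) has cost 34, which is below the bound.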
Our strategy uses this fact, in essentially the same way as in , to approximate the optimal solution for the k-consensus structural fragments problem. That is, we exhaustively sample every combination of k sequences, each of r elements from the space $\mathcal{R}' \times \{f_1, \ldots, f_n\}$, where $f_1, \ldots, f_n$ is the input and $\mathcal{R}'$ is a fixed selected set of rotations, which we discuss next.

3.1. Discretized Rotation Space

Any rotation can be represented by a normalized vector u and a rotation angle θ, where u is the axis about which an object is rotated by θ. If we apply $(u, \theta)$ to a vector v, we obtain the vector $\hat{v}$, which is:
$\hat{v} = u (v \cdot u) + (v - u (v \cdot u)) \cos\theta + (u \times v) \sin\theta$ (8)
where · represents the dot product and × represents the cross product.
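As an illustration of the axis-angle formula, the sketch below applies a rotation $(u, \theta)$ to a vector. It assumes that the axis is the unit vector u throughout and that the sine term is $(u \times v) \sin\theta$, which is the counterclockwise (right-hand) convention; the function names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    """Cross product a x b of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(v, u, theta):
    """Rotate v by angle theta about the unit axis u (right-hand rule)."""
    d = dot(v, u)                      # length of the component of v along u
    c, s = math.cos(theta), math.sin(theta)
    uxv = cross(u, v)
    return tuple(u[i] * d + (v[i] - u[i] * d) * c + uxv[i] * s
                 for i in range(3))

# A quarter turn about the z axis sends the x axis to the y axis.
v_hat = rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2)
```

The component of v along u is left unchanged, and only the perpendicular component is turned, so the length of v is preserved.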
By the equation, one can verify that a change of ϵ in u will displace $\hat{v}$ by at most $\alpha_1 \epsilon |v|$ for some computable $\alpha_1 \in \mathbb{R}$, and that a change of ϵ in θ will displace $\hat{v}$ by at most $\alpha_2 \epsilon |v|$ for some computable $\alpha_2 \in \mathbb{R}$. Now any rotation about an axis through the origin can be written in the form $(\theta_1, \theta_2, \theta_3)$, where $\theta_1, \theta_2, \theta_3 \in [0, 2\pi)$ are respectively rotations about each of the x, y, z axes. Similarly, changes of ϵ in $\theta_1$, $\theta_2$ and $\theta_3$ will result in a displacement of at most $\alpha \epsilon |v|$, for some computable $\alpha \in \mathbb{R}$.
We discretize the values that each $\theta_i$, $1 \le i \le 3$, may take within the range $[0, 2\pi)$ into a series of angles of angular difference ϑ. There are hence at most $O(1/\vartheta)$ such values for each $\theta_i$, $1 \le i \le 3$. Let $\mathcal{R}'$ denote the set of all possible discretized rotations $(\theta_1, \theta_2, \theta_3)$. Note that $|\mathcal{R}'|$ is of order $O(1/\vartheta^3)$.
Let $d$ be the diameter of a ball that is able to encapsulate each of $f_1, f_2, \ldots, f_n$. Hence any distance between two points among $f_1, \ldots, f_n$ is at most $d$. In this paper we assume d to be constant with respect to the input size. Note that for a protein structure, $d$ is of order $O(\ell)$ . For any $b \in \mathbb{R}$, we can choose ϑ so small that for any rotation R and any point $p \in \mathbb{R}^3$, there exists $R' \in \mathcal{R}'$ such that $\| R \cdot p - R' \cdot p \| \le \alpha \vartheta d \le b$.
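The discretized rotation set can be sketched as a grid of angle triples. The code below is an illustrative assumption rather than the construction used in the paper: it fixes one composition order (x, then y, then z), the names `discretized_rotations`, `apply` and `approx_ms` are hypothetical, and `approx_ms` simply minimizes over the grid, which for centered fragments gives an upper bound on $MS$ rather than its exact value.

```python
import math
from itertools import product

def rot_x(p, t):
    x, y, z = p
    c, s = math.cos(t), math.sin(t)
    return (x, y * c - z * s, y * s + z * c)

def rot_y(p, t):
    x, y, z = p
    c, s = math.cos(t), math.sin(t)
    return (x * c + z * s, y, -x * s + z * c)

def rot_z(p, t):
    x, y, z = p
    c, s = math.cos(t), math.sin(t)
    return (x * c - y * s, x * s + y * c, z)

def discretized_rotations(step):
    """All angle triples on a grid of spacing `step` within [0, 2*pi)."""
    m = math.ceil(2 * math.pi / step)   # number of values per angle
    angles = [i * step for i in range(m)]
    return list(product(angles, repeat=3))

def apply(angles, p):
    """Rotate p about x, then y, then z (an assumed composition order)."""
    t1, t2, t3 = angles
    return rot_z(rot_y(rot_x(p, t1), t2), t3)

def approx_ms(f, g, step):
    """Minimum over the grid of sum_i ||f[i] - R g[i]||^2."""
    best = float("inf")
    for angles in discretized_rotations(step):
        cost = sum(sum((a - b) ** 2 for a, b in zip(p, apply(angles, q)))
                   for p, q in zip(f, g))
        best = min(best, cost)
    return best

# A centered fragment and a copy turned by a quarter turn about z.
g = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (-1.0, -2.0, 0.0)]
f = [rot_z(q, math.pi / 2) for q in g]
grid = discretized_rotations(math.pi / 2)
```

With a step of π/2 there are 4 values per angle and hence 64 grid rotations, one of which maps g exactly onto f. Halving the step multiplies the grid size by 8, which is the $O(1/\vartheta^3)$ growth noted above.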

3.2. A Polynomial-time Algorithm With Cost $((1 + \epsilon) D_{opt} + c)$

Our algorithm for the k-consensus structural fragments problem is summarized in Table 1.
This is what the algorithm does. In (2), we explore m distinct subsets $A_1, \ldots, A_m$ of $f_1, \ldots, f_n$, in the hope that each subset is from a distinct cluster in the optimal clustering. Since we explore all possible such subsets, this is bound to happen. We then try to evaluate the score of each subset $A_j$ by sampling up to r structural fragments (allowing repeats) from it (from (2.1) onwards). Such an evaluation is possible due to Equation 7. The evaluation also requires us to exhaustively try out all possible transformations in $\mathcal{R}'$, which is what we do in (2.2). Each of these samplings of $A_j$ produces a consensus structural fragment $u_j$ for $A_j$ in (2.3), the score of which is evaluated in (2.4). Finally, in (3), we output the consensus patterns $u_1, \ldots, u_m$ which give us the best score.
We now analyze the runtime complexity of the algorithm. Consider the number of $F_1, F_2, \ldots, F_m$ in (2.1) that are possible. Let each $F_j$ be represented by a length-r string over $n + 1$ symbols, n of which each represent one of $f_1, \ldots, f_n$, while the remaining symbol represents “nothing”. It is clear that for any $A_j$, any $F_j \in A_j^{r!}$, or $F_j \in A_j^{|A_j|!}$ (where $|A_j| \le r$), can be represented by one such string. Furthermore, any $F_1, F_2, \ldots, F_m$ can be completely represented by k such strings; to represent the case where $m < k$, $k - m$ of the strings can be set to “nothing” completely. From this, we can see that there are at most $(n + 1)^{rk} = O(n^{rk})$ possible combinations of $F_1, F_2, \ldots, F_m$.
For each of these combinations, there are $|\mathcal{R}'|^{rk}$ possible combinations of $\Theta_1, \Theta_2, \ldots, \Theta_m$ at (2.2), hence resulting in $O((n |\mathcal{R}'|)^{rk})$ iterations to run for (2.3) to (2.5). Since (2.3) can be done in $O(rk\ell)$, (2.4) in $O(nk |\mathcal{R}'| \ell)$, and (2.5) in $O(n)$ time, the algorithm completes in $O(k\ell (r + n |\mathcal{R}'|) (n |\mathcal{R}'|)^{rk})$ time.
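The counting argument can be made concrete with a short sketch. The function names below are hypothetical, and the numbers are toy values chosen only to show how quickly the search space grows; `None` plays the role of the “nothing” symbol.

```python
from itertools import product

def num_sample_encodings(n, r, k):
    """Number of ways to write k strings of length r over n + 1 symbols."""
    return (n + 1) ** (r * k)

def num_iterations_bound(n, r, k, num_rotations):
    """Upper bound on the (sample, rotation) combinations that are tried."""
    return ((n + 1) * num_rotations) ** (r * k)

# Enumerate the encodings explicitly for a toy instance.
n, r, k = 3, 2, 1
symbols = list(range(n)) + [None]      # n fragments plus "nothing"
encodings = list(product(symbols, repeat=r * k))
```

For n = 3, r = 2, k = 1 there are 16 encodings, and with 64 grid rotations the bound on the number of iterations is already 65536. The exponent rk is constant for fixed r and k, which keeps the bound polynomial in n.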
We argue that $D_{min}$ is eventually at most $(r + 1)/r$ times the cost of the optimal solution plus an additive term. Suppose the optimal solution results in the $m \le k$ disjoint clusters $A_1, A_2, \ldots, A_m \subseteq \{f_1, \ldots, f_n\}$.
For each $A_j$, $1 \le j \le m$, let $u_j$ be a structural fragment which minimizes $\sum_{f \in A_j} MS(u_j, f)$. Furthermore, for each $f \in A_j$, let $R_f$ be a rotation where
$R_f \in \arg\min_{R \in \mathcal{R}} \| u_j - R \cdot f \|^2$ (9)
and let
$D_j = \sum_{f \in A_j} \| u_j - R_f \cdot f \|^2$ (10)
Table 1. Polynomial-time algorithm for the problem.
By the property of the $MS$ measure, it can be shown that $u_j$ is the average of $\{ R_f \cdot f \mid f \in A_j \}$. For each $A_j$ where $|A_j| > r$, by Equation 6,
$\frac{1}{|A_j|^r} \sum_{F_j \in A_j^r} \sum_{f \in A_j} \left\| \frac{R_{F_j(1)} \cdot F_j(1) + \cdots + R_{F_j(r)} \cdot F_j(r)}{r} - R_f \cdot f \right\|^2 = \frac{r+1}{r} D_j$ (11)
For each such $A_j$, let $F_j \in A_j^r$ be such that
$\sum_{f \in A_j} \left\| \frac{R_{F_j(1)} \cdot F_j(1) + \cdots + R_{F_j(r)} \cdot F_j(r)}{r} - R_f \cdot f \right\|^2 \le \frac{r+1}{r} D_j$ (12)
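Without loss of generality assume that each $F_j \in A_j^{r!}$. For a cluster with $|A_j| \le r$, take $F_j$ to consist of all the elements of $A_j$. Let
$\mu_j = \frac{1}{|F_j|} \sum_{i=1}^{|F_j|} R_{F_j(i)} \cdot F_j(i)$ (13)
Then we may write,
$\sum_{j=1}^{m} \sum_{f \in A_j} \| \mu_j - R_f \cdot f \|^2 \le \frac{r+1}{r} D$ (14)
where $D = \sum_{j=1}^{m} D_j$ is the cost of the optimal solution.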
For each rotation $R_f$, let $R'_f$ be a closest rotation to $R_f$ within $\mathcal{R}'$. Also, let
$\mu'_j = \frac{1}{|F_j|} \sum_{i=1}^{|F_j|} R'_{F_j(i)} \cdot F_j(i)$ (15)
Since we exhaustively sample all possible $F_j \in A_j^{r!}$ for all possible $A_j$ and for all $R \in \mathcal{R}'$, it is clear that:
$D_{min} \le \sum_{j=1}^{m} \sum_{f \in A_j} \| \mu'_j - R'_f \cdot f \|^2$ (16)
We will now relate the LHS of Equation 14 with the RHS of Equation 16. The RHS of Equation 16 is
$\sum_{j=1}^{m} \sum_{f \in A_j} \| \mu'_j - R'_f \cdot f \|^2$
$= \sum_{j=1}^{m} \sum_{f \in A_j} \| \mu_j + (\mu'_j - \mu_j) + (R_f \cdot f - R'_f \cdot f) - R_f \cdot f \|^2$
$\le \sum_{j=1}^{m} \sum_{f \in A_j} \left( \| \mu_j - R_f \cdot f \| + ( \| \mu'_j - \mu_j \| + \| R_f \cdot f - R'_f \cdot f \| ) \right)^2$
$= \sum_{j=1}^{m} \sum_{f \in A_j} \left[ \| \mu_j - R_f \cdot f \|^2 + ( \| \mu'_j - \mu_j \| + \| R_f \cdot f - R'_f \cdot f \| )^2 \right.$
$\left. + 2 \| \mu_j - R_f \cdot f \| ( \| \mu'_j - \mu_j \| + \| R_f \cdot f - R'_f \cdot f \| ) \right]$
$\le \sum_{j=1}^{m} \sum_{f \in A_j} \| \mu_j - R_f \cdot f \|^2 + 8 n \ell b$
Hence by Equation 14, $D_{min}$ is at most $(r + 1)/r = 1 + 1/r$ times the cost of the optimal solution plus an additive term $c = 8 n \ell b$. Letting $\epsilon = 1/r$, we obtain the following.
Theorem 2 For any $c, \epsilon \in \mathbb{R}$, a $((1 + \epsilon) D_{opt} + c)$-approximation solution for the k-consensus structural fragments problem can be computed in
$O\left( k \ell \left( \frac{1}{\epsilon} + n |\mathcal{R}'| \right) (n |\mathcal{R}'|)^{k/\epsilon} \right)$
time.
The term c in Theorem 2 is due to the error introduced by the discretization of rotations. If we are able to estimate a lower bound for $D_{opt}$, we can control this error by refining the discretization such that c is an arbitrarily small fraction of $D_{opt}$. To do so, in the next section we show a lower bound for $D_{opt}$.

3.3. A Polynomial-time 4-approximation Algorithm

We now show a 4-approximation algorithm for the k-consensus structural fragments problem. We first show the case for $k = 1$, and then generalize the result to all $k \ge 2$.
Let the n input structural fragments be $f_1, f_2, \ldots, f_n$. Let $f_a$, $1 \le a \le n$, be the structural fragment where
$\sum_{1 \le j \le n \wedge j \ne a} MS(f_a, f_j)$
is minimized. Note that $f_a$ can be found in time $O(n^2 \ell)$, since for any $1 \le i, j \le n$, $MS(f_i, f_j)$ (more precisely, $RMS(f_i, f_j)$) can be computed in time $O(\ell)$ using closed-form equations from .
We argue that $f_a$ is a 4-approximation. Let the optimal structural fragment be $f_{opt}$, the corresponding distance $D_{opt}$, and let $f_b$ ($1 \le b \le n$) be the fragment where $MS(f_b, f_{opt})$ is minimized.
We first note that, by the choice of $f_a$, the cost of using $f_a$ as the solution satisfies $\sum_{i \ne a} MS(f_a, f_i) \le \sum_{i \ne b} MS(f_b, f_i)$. To continue we first establish the following claim.
Claim 1 $MS(f, f') \le 2 ( MS(f, f'') + MS(f'', f') )$.
PROOF. In , it is shown that
$RMS(f, f') \le RMS(f, f'') + RMS(f'', f')$
Squaring both sides gives
$MS(f, f') \le MS(f, f'') + MS(f'', f') + 2 \, RMS(f, f'') \, RMS(f'', f')$
Since $( RMS(f, f'') - RMS(f'', f') )^2 \ge 0$, we have
$2 \, RMS(f, f'') \, RMS(f'', f') \le MS(f, f'') + MS(f'', f')$
and therefore $MS(f, f') \le 2 ( MS(f, f'') + MS(f'', f') )$. ▮
By the above claim,
$\sum_{i \ne b} MS(f_b, f_i) \le 2 \sum_{i \ne b} ( MS(f_b, f_{opt}) + MS(f_{opt}, f_i) )$
$= 2 \sum_{i \ne b} MS(f_b, f_{opt}) + 2 \sum_{i \ne b} MS(f_i, f_{opt})$
$\le 2 \sum_{i \ne b} MS(f_b, f_{opt}) + 2 D_{opt}$
$\le 2 \sum_{j \ne b} MS(f_j, f_{opt}) + 2 D_{opt}$
$\le 2 D_{opt} + 2 D_{opt} = 4 D_{opt}$
Hence $\sum_{i \ne a} MS(f_a, f_i) \le 4 D_{opt}$.
We now extend this to k structural fragments. We first pre-compute $MS(f, f')$ for every pair of $f, f' \in S$, which takes time $O(n^2 \ell)$. Then, at step (1), there are at most $O(n^k)$ combinations of A, each of which takes $O(nk)$ time to compute at step (2). Hence in total we can perform the computation in $O(n^2 \ell + k n^{k+1})$ time. To see that the solution is a 4-approximation, let $S_1, S_2, \ldots, S_m$, where $m \le k$, be an optimal clustering. Then, by our earlier argument, there exist $f_{i_1} \in S_1$, $f_{i_2} \in S_2$, $\ldots$, $f_{i_m} \in S_m$ such that each $f_{i_x}$ is a 4-approximation for $S_x$, and hence $f_{i_1}, f_{i_2}, \ldots, f_{i_m}$ is a 4-approximation for the k-consensus structural fragments problem. Since the algorithm exhaustively searches every combination of up to k fragments, it gives a solution at least as good as $f_{i_1}, f_{i_2}, \ldots, f_{i_m}$, and hence is a 4-approximation algorithm.
Theorem 3 A 4-approximation solution for the k-consensus structural fragments problem can be computed in $O(n^2 \ell + k n^{k+1})$ time.
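The exhaustive search behind Theorem 3 can be sketched as follows. This is a minimal illustration, assuming the pairwise $MS$ values have already been computed and stored in a symmetric matrix; the name `best_centers` and the toy matrix are hypothetical, and the code does not compute $MS$ itself.

```python
from itertools import combinations

def best_centers(ms, k):
    """Try every set of at most k input fragments as centers.

    ms[i][j] is the precomputed MS distance between fragments i and j.
    Returns (centers, cost), where cost is the sum over all fragments of
    the distance to the nearest chosen center.
    """
    n = len(ms)
    best_cost, best = float("inf"), ()
    for size in range(1, k + 1):
        for centers in combinations(range(n), size):   # O(n^k) candidates
            cost = sum(min(ms[i][c] for c in centers) for i in range(n))
            if cost < best_cost:
                best_cost, best = cost, centers
    return best, best_cost

# Toy distance matrix: fragments 0 and 1 are close, as are 2 and 3.
ms = [[0, 1, 50, 52],
      [1, 0, 49, 51],
      [50, 49, 0, 2],
      [52, 51, 2, 0]]
centers_1, cost_1 = best_centers(ms, 1)
centers_2, cost_2 = best_centers(ms, 2)
```

With one center the best cost on this matrix is 101, while two centers, one from each tight pair, bring it down to 3. Each candidate set is scored in $O(nk)$ time, matching the per-combination cost stated above.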

3.4. A $(1 + \epsilon)$ Polynomial-time Approximation Scheme

Recall that the algorithm in Section 3.2 has cost $D \le (1 + \epsilon) D_{opt} + 8 n \ell b$, where $b = \alpha \vartheta d$. From Section 3.3 we have a lower bound $D'_{opt}$ of $D_{opt}$. We want $8 n \ell b \le \epsilon D'_{opt} \le \epsilon D_{opt}$. To do so, it suffices that we set $\vartheta \le \epsilon D'_{opt} / (8 n \ell \alpha d)$. This results in an $|\mathcal{R}'|$ of order $O(1/\vartheta^3) = O((n \ell d)^3)$. Substituting this in Theorem 2, and combining with Theorem 3, we get the following.
Theorem 4 For any $\epsilon \in \mathbb{R}$, a $((1 + \epsilon) D_{opt})$-approximation solution for the k-consensus structural fragments problem can be computed in
$O\left( n^2 \ell + k n^{k+1} + k \ell \left( \frac{2}{\epsilon} + n \lambda \right) (n \lambda)^{2k/\epsilon} \right)$
time, where $\lambda = (n \ell d)^3$.
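The choice of the angular step can be illustrated with a small calculation. The sketch below assumes that the lower bound on the optimal cost is taken as one quarter of the cost returned by the 4-approximation algorithm; this reading, the function names, and all numeric values (including α = 1) are hypothetical.

```python
import math

def angular_step(eps, d_opt_lower, n, ell, alpha, d):
    """Largest step with 8 * n * ell * (alpha * step * d) <= eps * d_opt_lower."""
    return eps * d_opt_lower / (8 * n * ell * alpha * d)

def grid_size(step):
    """Number of discretized rotations for a given angular step."""
    return math.ceil(2 * math.pi / step) ** 3

# Toy numbers: a 4-approximation of cost 40 implies D_opt >= 10.
four_approx_cost = 40.0
d_opt_lower = four_approx_cost / 4
step = angular_step(eps=0.5, d_opt_lower=d_opt_lower,
                    n=10, ell=5, alpha=1.0, d=2.0)
b = 1.0 * step * 2.0                   # b = alpha * step * d
additive_error = 8 * 10 * 5 * b        # the term 8 * n * ell * b
```

Here the step comes out as 0.00625 and the additive error equals ε times the lower bound, so it is absorbed into the multiplicative factor. The grid size grows as the cube of 1/step, which is the source of the factor λ in Theorem 4.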

4. Discussions

The method in this paper depends on Lemma 1. For this reason, the technique does not extend to the problem under distance measures where Lemma 1 cannot be applied, for example, the $RMS$ measure. However, should Lemma 1 apply to a distance measure, it should be easy to adapt the method here to solve the problem for that distance measure.
One can also formulate variations of the k-consensus structural fragments problem. For example, while the cost function of the k-consensus structural fragments problem resembles that of the k-means problem, the cost function of the k-closest structural fragments problem resembles that of the (absolute) k-center problem. One interesting problem for future study is whether this problem has a PTAS or not. It is not clear how to generalize the technique employed in this paper to the k-closest structural fragments problem under $MS$.

References

1. Jain, A. K.; Murty, M. N.; Flynn, P. J. Data clustering: a review. ACM Computing Surveys 1999, 31(3), 264–323.
2. Arun, K. S.; Huang, T. S.; Blostein, S. D. Least-squares fitting of two 3-d point sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 9(5), 698–700.
3. Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13(4), 376–380.
4. Qian, J.; Li, S. C.; Bu, D.; Li, M.; Xu, J. Finding compact structural motifs. In Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007, Proceedings; Ma, B., Zhang, K.Z., Eds.; Springer, 2007; Vol. 4580 of Lecture Notes in Computer Science, pp. 142–149.
5. Hao, M.; Rackovsky, S.; Liwo, A.; Pincus, M.; Scheraga, H. Effects of compact volume and chain stiffness on the conformations of native proteins. Proc. Natl. Acad. Sci. 1992, 89, 6614–6618.
6. Boris, S. A revised proof of the metric properties of optimally superimposed vector sets. Acta Crystallographica Section A 2002, 58(5), 506.