4.3. Tags
A tag of length k, or k-tag, is defined by k + 1 peaks from a spectrum S such that every two neighboring peaks are separated by the mass of an amino acid. Thus, a k-tag t has an amino acid sequence and an offset equal to the mass of its leftmost peak.
A set of 4-tags, to serve as the input for tag convolution, was generated with the method implemented in the Twister software tool [31, 32] for de novo sequencing of peptides from top-down tandem mass spectra. The default parameters of Twister were used: tag length k = 4, a mass tolerance specified in mDa, peak reflection applied to individual deconvoluted spectra, and water loss ions eliminated. For a preprocessed spectrum S, a spectrum graph was constructed, the vertices of which corresponded to the peaks of S; for two vertices u and w, an edge from u to w was introduced if $m(w) - m(u)$ matched the mass of some amino acid within the mass tolerance, where $m(v)$ denotes the mass of the peak from S that gave rise to the vertex v. The vertices of the spectrum graph were scored with the intensities of their underlying peaks, and an optimal path with respect to the vertex scores was extracted from each connected component of the graph. Finally, from each obtained path of length at least 4, all possible 4-tags were derived.
An important point here is that applying a small constant mass tolerance when generating the edges of the spectrum graph ensures that the resulting k-tags are highly accurate. A detailed description of the above procedure can be found in [31].
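The edge rule and the tag derivation can be illustrated with a minimal Python sketch. It is not the Twister implementation: it enumerates every chain of k edges instead of extracting an intensity-optimal path per connected component, the residue table is a small illustrative subset, and the names `match_residue` and `k_tags`, as well as the tolerance passed by the caller, are hypothetical.

```python
# Monoisotopic residue masses (Da); an illustrative subset, not the full alphabet.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768}


def match_residue(delta, tol):
    """Return a residue whose mass lies within tol of delta, or None."""
    for aa, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tol:
            return aa
    return None


def k_tags(peaks, k, tol):
    """Enumerate all k-tags as (sequence, offset) pairs.

    Simplification (assumption): every chain of k edges is reported,
    whereas the text extracts one optimal path per connected component.
    """
    peaks = sorted(peaks)
    # edges[i] holds (j, residue) for every peak j reachable from peak i.
    edges = {i: [] for i in range(len(peaks))}
    for i, mass_u in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            aa = match_residue(peaks[j] - mass_u, tol)
            if aa is not None:
                edges[i].append((j, aa))

    tags = []

    def extend(start, node, seq):
        if len(seq) == k:
            # The offset of a tag is the mass of its leftmost peak.
            tags.append((seq, peaks[start]))
            return
        for nxt, aa in edges[node]:
            extend(start, nxt, seq + aa)

    for i in range(len(peaks)):
        extend(i, i, "")
    return tags
```

For five peaks spaced by the masses of S, G, A, and T starting at 500 Da, the sketch yields the single 4-tag SGAT with offset 500.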
4.4. Sequence Fragments
The first part of the input of the proposed method is a set of amino acid strings that are supposed to represent sequence fragments of the proteins from the sample being analyzed. In our experiments, this set comprised the amino acid sequences of the aggregated paths generated with Twister, as described in [32], from the set of MS/MS spectra acquired from the respective sample.
In brief, Twister takes a set of deisotoped and charge state deconvoluted MS/MS spectra as input, and first generates from them a set of highly accurate k-tags using the strategy described in the previous section. Next, it assembles a number of de novo strings from the tags consistent with each other in terms of both amino acid sequences and offsets, each assigned a mass offset equal to the smallest offset among those of the tags contributing to it. (For example, if we have two 4-tags derived from HCD spectra, with the amino acid strings SGAT and GATF and offsets 500 and 587, respectively, we note that 500 + m(S) ≈ 587, where m(S) ≈ 87 Da is the residue mass of serine; therefore, those tags may be due to the same protein, e.g., one with a subsequence SGATF preceded by an N-terminal fragment of mass 500. Having glued the two tags, we obtain a de novo string SGATF with the offset of 500.) Finally, Twister combines the derived de novo strings into a number of aggregated strings endowed with direct and reversed offsets; the amino acid sequence of an aggregated string typically represents a longer sequence fragment of a protein contained in the sample, and its associated offsets reflect the location of the respective fragment within the entire sequence.
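The consistency test behind the SGAT/GATF example can be sketched as follows. This is an illustration under stated assumptions, not Twister's assembly routine: the function name `glue`, the tolerance of 0.05 Da, and the restriction to two tags are all hypothetical choices made for the example.

```python
# Monoisotopic residue masses (Da); an illustrative subset.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "T": 101.04768, "F": 147.06841}


def glue(tag1, tag2, tol=0.05):
    """Glue two (sequence, offset) tags if they are mutually consistent.

    The tags are consistent when a suffix of the first sequence is a prefix
    of the second one and the offset difference matches the mass of the
    unshared prefix of the first sequence. Returns the glued tag or None.
    """
    (seq1, off1), (seq2, off2) = tag1, tag2
    for shift in range(1, len(seq1)):
        if not seq2.startswith(seq1[shift:]):
            continue
        prefix_mass = sum(RESIDUE_MASS[aa] for aa in seq1[:shift])
        if abs((off2 - off1) - prefix_mass) <= tol:
            # The glued string inherits the smaller of the two offsets.
            return seq1[:shift] + seq2, off1
    return None
```

With the values from the example, `glue(("SGAT", 500.0), ("GATF", 587.0))` returns `("SGATF", 500.0)`, because 587 − 500 agrees with the serine residue mass within the chosen tolerance.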
To generate the aggregated strings, we ran Twister with the default parameters (see above) on the CAH2 and alemtuzumab data sets. The amino acid sequences of the 70 and 92 aggregated strings obtained for CAH2 and alemtuzumab, respectively, which served as input for the algorithm being described, are listed in the supplementary file Aggregated-strings-Twister.xls. Their correct fragments of length at least four are highlighted in color, and for each of those, its first and last position in the corresponding protein sequence is indicated; if the former exceeds the latter, the respective fragment occurs in the sequence in reversed form.
4.5. Tag Convolution
For an amino acid sequence s, let $\bar{s}$ denote its reversed copy.
Tag convolution was defined in [33] as follows. For a set T of k-tags, let L(T) denote the set of all their amino acid sequences. Given two k-mers $l_1, l_2 \in L(T)$, tag convolution considers all pairs $(t_1, t_2)$ of tags from T such that $t_1$ has the amino acid sequence $l_1$ and $t_2$ has the amino acid sequence $l_2$, and computes the difference d of their offsets. For each difference encountered thereby (up to a predefined tolerance), tag convolution records how many times it occurred. Thus, its output comprises a set of pairs, each composed of a registered offset difference $d_i$ and its multiplicity $\mu_i$: $\{(d_1, \mu_1), \ldots, (d_h, \mu_h)\}$, where h is the number of distinct offset difference values observed.
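A minimal Python sketch of tag convolution for two k-mers is given below. Several details are assumptions made for illustration only: tags are modeled as (sequence, offset) pairs, the difference is taken as the offset of the second tag minus that of the first, and "up to a predefined tolerance" is approximated by snapping each difference to a grid of width `tol`. The function name `tag_convolution` is hypothetical.

```python
from collections import Counter


def tag_convolution(tags, l1, l2, tol=0.01):
    """Count offset differences between tags labeled l1 and tags labeled l2.

    Returns a sorted list of (offset difference, multiplicity) pairs.
    Assumption: differences are merged by rounding to a grid of width tol.
    """
    offsets1 = [off for seq, off in tags if seq == l1]
    offsets2 = [off for seq, off in tags if seq == l2]
    counts = Counter()
    for o1 in offsets1:
        for o2 in offsets2:
            counts[round((o2 - o1) / tol)] += 1
    return sorted((key * tol, mult) for key, mult in counts.items())
```

On a toy set in which the pair SGAT/GATF occurs twice with the same spacing, the spacing of about 87.03 Da receives multiplicity 2 and every other difference receives multiplicity 1.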
Subsequently, the above concept was generalized to the case of strings, and slightly adjusted so that for two subsequences $s_1$ and $s_2$ of s, where $s_1$ precedes $s_2$, the value contributed to the output of tag convolution by the pairs of tags matching either $s_1$ and $s_2$ or their reversed copies would equal the mass of the subsequence separating $s_1$ and $s_2$ in s. This was formalized in the following way.
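For a real δ, a shift of a tag convolution output by δ is obtained by adding δ to every offset difference in it while preserving the multiplicities. To compute the tag convolution of two amino acid strings $s_1$ and $s_2$, we first iterate over all pairs of k-mers taken from $s_1$ and $s_2$, respectively; each such pair contributes the appropriately shifted output of its tag convolution to an auxiliary set. Next, we analogously form a second auxiliary set from the reversed copies of the strings. Merging the two auxiliary sets yields the tag convolution of $s_1$ and $s_2$.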
In [33], we described a procedure for validating de novo peptide sequences. In particular, for an amino acid a of a candidate sequence s, it computes the tag convolution of the k-mers adjacent to a in s and checks whether the mass of a occurs in the output with a high enough multiplicity. According to our experiments, for a correct peptide sequence s, this multiplicity usually clearly dominates that of the other values present in the output. This suggests that a similar idea might be applied to check whether two amino acid strings $s_1$ and $s_2$ are subsequences of a longer sequence s: to this end, one would compute the tag convolution of $s_1$ and $s_2$ and verify whether the multiplicity of the most frequently observed offset difference is significantly greater than the second-highest multiplicity. If so, this offset difference would be reported as the mass of the subsequence separating $s_1$ and $s_2$ in s; otherwise, the verdict would be that $s_1$, $s_2$, and s are not related in that way.
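The dominance test described above can be sketched in a few lines of Python. The text does not quantify "significantly greater", so the ratio parameter below is purely an assumption, as is the function name `dominant_difference`; the input is a list of (offset difference, multiplicity) pairs.

```python
def dominant_difference(conv, ratio=2.0):
    """Return the most frequent offset difference if it clearly dominates.

    conv  -- list of (offset difference, multiplicity) pairs
    ratio -- hypothetical factor by which the top multiplicity must exceed
             the second-highest one (not specified in the text)
    """
    if not conv:
        return None
    ranked = sorted(conv, key=lambda pair: pair[1], reverse=True)
    top_diff, top_mult = ranked[0]
    second_mult = ranked[1][1] if len(ranked) > 1 else 0
    return top_diff if top_mult >= ratio * second_mult else None
```

A `None` result corresponds to the verdict that the two strings are not related in the assumed way.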
However, such an approach works well only for a rather short peptide sequence s whose subsequences $s_1$ and $s_2$ are separated by at most a few amino acids; it turns out to be inapplicable to the top-down case, with long protein sequences and large gaps between their retrieved fragments. The underlying issues, along with the means to resolve them, are discussed in the next section.
4.6. Gap Estimation
Given two amino acid strings $s_1$ and $s_2$, we aim to verify whether they represent two disjoint fragments of the same protein sequence s, and if the answer is positive, report an approximate mass of the sequence separating them in s. To this end, we compute their tag convolution based on a set of k-tags extracted from top-down MS/MS spectra; however, only pairs of tags from the same spectrum are allowed to contribute to it, and its output needs to be treated differently than in the bottom-up case.
To generate the set of k-tags, we again apply the strategy that is part of the Twister approach, which ensures high accuracy of the resulting tags (see Section 4.3). In particular, we use a stringent mass tolerance, specified in mDa, when deciding whether the difference between two peak masses matches the mass of some amino acid, thereby relying upon the observation that the errors in close masses tend to be similar.
However, when we switch to the differences between tag offsets, which can be quite large, this kind of assumption can no longer be made. Moreover, the same value can appear as a difference of two relatively small offsets and also as that of two large offsets, and in the latter case the error in it may be substantially larger than in the former. To avoid the need to keep track of the way in which concrete values were obtained, we apply a binning strategy similar to the one introduced in [32] for analyzing the offsets of aggregated strings. Namely, each offset difference d is first scaled through multiplication by a constant scaling factor and rounded to the nearest integer; subsequently, each obtained scaled difference is assigned a multiplicity equal to the number of the offset differences that got transformed into it. In addition, an integral binned difference is calculated for d by rounding it to the nearest integer; its multiplicity is defined through the multiplicities of the scaled counterparts of the offset differences that got transformed into this binned difference.
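The two levels of binning can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the scaling factor of 10 is a placeholder, and the multiplicity of a binned difference is taken to be the sum of the multiplicities of its scaled counterparts. The function name `bin_differences` is hypothetical. Note also that Python's built-in `round` resolves exact halves to the nearest even integer.

```python
from collections import Counter, defaultdict


def bin_differences(diffs, factor=10):
    """Compute scaled and binned offset differences with multiplicities.

    factor -- hypothetical scaling factor (placeholder value)
    Returns (scaled, binned, groups):
      scaled -- scaled difference -> multiplicity
      binned -- binned difference -> multiplicity (assumed to be a sum)
      groups -- binned difference -> set of its scaled counterparts
    """
    scaled = Counter(round(d * factor) for d in diffs)
    groups = defaultdict(set)
    for d in diffs:
        groups[round(d)].add(round(d * factor))
    binned = {b: sum(scaled[s] for s in members)
              for b, members in groups.items()}
    return dict(scaled), binned, dict(groups)
```

For the differences 1000.04, 1000.02, 1000.31, and 2000.6, the scaled values 10000, 10003, and 20006 receive multiplicities 2, 1, and 1, while the binned values 1000 and 2001 receive multiplicities 3 and 1.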
Let our hypothesis be that $s_1$ and $s_2$ are two disjoint subsequences of the same (unknown) protein sequence s, and $s_1$ precedes $s_2$ in s. In order to disprove it, we proceed as follows. First, we calculate the tag convolution of $s_1$ and $s_2$, along with the respective sets of scaled and binned offset differences endowed with multiplicities. Next, we focus on the binned differences, and select the non-negative ones not exceeding a predefined threshold. Further, from the binned differences still under consideration whose multiplicity reaches a predefined minimum, we pick those with the highest multiplicity. For each such difference b, we calculate its score as the sum of the multiplicities of $b - 1$, $b$, and $b + 1$, assuming that a value that does not appear as a binned difference has zero multiplicity. In this way, we account for the well-known ±1 Da errors in large enough deconvoluted masses. Finally, the top-scoring binned difference is selected (the smallest one is picked in case of ties), then its corresponding scaled difference with the highest multiplicity is detected, and this scaled difference, divided by the scaling factor, is reported as a candidate estimate of the gap between $s_1$ and $s_2$.
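The selection steps of this paragraph can be put together in one self-contained Python sketch. All parameter values (`factor`, `max_gap`, `min_mult`) are hypothetical placeholders, the multiplicity of a binned difference is assumed to be the sum over its scaled counterparts, and ties among scaled differences are broken toward the smaller value, which the text does not specify. The function name `estimate_gap` is likewise an assumption.

```python
from collections import Counter, defaultdict


def estimate_gap(diffs, factor=10, max_gap=5000, min_mult=2):
    """Return a candidate gap estimate from offset differences, or None."""
    scaled = Counter(round(d * factor) for d in diffs)
    groups = defaultdict(set)
    for d in diffs:
        groups[round(d)].add(round(d * factor))
    # Assumption: binned multiplicity = sum over scaled counterparts.
    binned = {b: sum(scaled[s] for s in members)
              for b, members in groups.items()}

    # Keep non-negative binned differences within the threshold
    # whose multiplicity reaches the required minimum.
    kept = [b for b, mult in binned.items()
            if 0 <= b <= max_gap and mult >= min_mult]
    if not kept:
        return None
    top = max(binned[b] for b in kept)
    best = [b for b in kept if binned[b] == top]

    def score(b):
        # Neighbors absorb +/-1 Da errors; absent values count as zero.
        return binned.get(b - 1, 0) + binned[b] + binned.get(b + 1, 0)

    # Highest score first; the smallest difference wins ties.
    chosen = min(best, key=lambda b: (-score(b), b))
    best_scaled = max(groups[chosen], key=lambda s: (scaled[s], -s))
    return best_scaled / factor
```

In the accompanying check, the binned values 1000 and 2001 both have multiplicity 3, but 1000 has a populated neighbor at 1001 and therefore scores higher; its most frequent scaled counterpart, 10000, gives the estimate 1000.0.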
As a last step, we check whether the tags that contributed to the binned counterpart of the estimate together cover at least a certain number of amino acids in both $s_1$ and $s_2$. To this end, we introduce a threshold on the number of supporting tags, and note that n k-tags with distinct labels all corresponding to the same string will always cover at least $n + k - 1$ amino acids in it. The estimate is accepted if the number of tags supporting it reaches the threshold for each of $s_1$ and $s_2$, or if prescribed numbers of such tags are observed for one and the other string, respectively. If neither is the case, we check whether the respective numbers both reach a lower bound, and if so, whether one of the two neighboring values of the binned difference occurred among the binned differences and was supported by sufficiently many tags for the two strings, respectively. If this holds, the estimate is accepted. Otherwise, we conclude that the hypothesis was wrong.
Since the protein sequence fragments may appear in the output of Twister in direct as well as reversed form, when processing the amino acid sequences $s_1$ and $s_2$ of two aggregated paths, we apply the above procedure to up to four pairs of strings, each combining the direct or reversed copy of $s_1$ with the direct or reversed copy of $s_2$. If a gap estimate is obtained for some pair, the two strings are joined to form a gapped path, and the remaining pairs are not considered.
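The orientation loop can be sketched as below. The order in which the four pairs are tried is an assumption, since the text only states that the remaining pairs are skipped once an estimate is found; the callback `estimate` stands in for the gap estimation procedure, and the function name `first_joinable_orientation` is hypothetical.

```python
from itertools import product


def first_joinable_orientation(s1, s2, estimate):
    """Try direct and reversed copies of two strings until a gap is found.

    estimate -- callable taking two strings and returning a gap estimate
                or None (a stand-in for the procedure described above)
    Returns (first string, second string, gap) for the first successful
    pair, or None if no orientation yields an estimate.
    """
    for a, b in product((s1, s1[::-1]), (s2, s2[::-1])):
        gap = estimate(a, b)
        if gap is not None:
            return a, b, gap
    return None
```

Returning on the first success mirrors the rule that the remaining pairs are not considered after two strings have been joined.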
To enable iterative construction of gapped paths, we proceed as follows. The gapped paths are initialized with the input strings, and further examined pairwise. As in the case of regular amino acid strings, for a pair $P_1$, $P_2$ of gapped paths, we consider four combinations comprising the direct and/or reversed versions of those. Without loss of generality, let us discuss in more detail the case in which both paths are taken in direct form.
When processing two gapped paths $P_1$ and $P_2$, we first try to append $P_2$ to $P_1$. To decide whether this is possible, we pick the last sequence fragment $f_1$ of $P_1$ and the first sequence fragment $f_2$ of $P_2$, and verify as stated above whether $f_1$ and $f_2$ represent two fragments of the same protein sequence. If the answer is positive, $P_2$ is appended to $P_1$; otherwise, we consecutively examine the gaps of $P_1$, and for each gap large enough to potentially accommodate $P_2$, perform a similar check for the sequence fragment $f$ of $P_1$ immediately preceding this gap and $f_2$. If, according to its outcome, $f$ precedes $f_2$ in some protein sequence, we additionally verify whether, upon embedding $P_2$ into this gap, its tail would overlap the fragment $f'$ of $P_1$ immediately after the gap. If not, $P_2$ is appropriately merged into $P_1$ after $f$. The overlap check amounts to a comparison of the mass offset of the end of $P_2$ upon embedding and that of the beginning of $f'$ (the offsets may be calculated, e.g., with respect to the beginning of $P_1$), which is carried out using a tolerance specified in ppm. If $P_2$ could not be embedded into $P_1$, a similar procedure is applied with the goal of embedding $P_1$ into $P_2$.
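The overlap check can be illustrated with a short Python sketch. The tolerance value of 10 ppm and the choice to compute the tolerance relative to the offset of the following fragment are assumptions, as is the function name `fits_before`; both offsets are taken with respect to the beginning of the enclosing gapped path.

```python
def fits_before(end_of_embedded, start_of_next, ppm=10.0):
    """Check that an embedded path ends before the next fragment starts.

    end_of_embedded -- mass offset of the end of the embedded path
    start_of_next   -- mass offset of the beginning of the fragment
                       that follows the gap
    ppm             -- hypothetical tolerance in parts per million,
                       applied relative to start_of_next (assumption)
    """
    tolerance = start_of_next * ppm * 1e-6
    return end_of_embedded <= start_of_next + tolerance
```

At an offset of 12,000 Da, 10 ppm corresponds to 0.12 Da, so an embedded path ending at 12,000.05 Da is still accepted, whereas one ending at 12,001 Da is rejected as overlapping.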