#### 4.3. Tags

A tag of length k, or k-tag, is defined by $k+1$ peaks ${p}_{1},\dots ,{p}_{k+1}$ from a spectrum S, such that every two neighboring peaks are separated by the mass of an amino acid. Thus, a k-tag t has an amino acid sequence $s\left(t\right)={a}_{1}\dots {a}_{k}$ and an offset $o\left(t\right)$ equal to the mass $Mass\left({p}_{1}\right)$ of the leftmost peak ${p}_{1}$.

A set $\mathcal{T}$ of 4-tags, to become the input for tag convolution, was generated with the method implemented within the Twister software tool [31,32] for de novo sequencing of peptides from top-down tandem mass spectra. The default parameters of Twister were used: tag length $k=4$, mass tolerance $\epsilon =4$ mDa, peak reflection applied to individual deconvoluted spectra, and water loss ions eliminated. For a preprocessed spectrum S, a spectrum graph $G\left(S\right)$ was constructed, the vertices of which corresponded to the peaks of S; for two vertices u and w, an edge from u to w was introduced if $m\left(w\right)-m\left(u\right)$ matched the mass of some amino acid within $2\epsilon $, where $m\left(v\right)$ denotes the mass of the peak from S that gave rise to the vertex v. The vertices of $G\left(S\right)$ were scored with the intensities of their underlying peaks, and an optimal path with respect to the vertex scores was extracted from each connected component of $G\left(S\right)$. Finally, from each obtained path of length at least $k=4$, all possible 4-tags were derived.
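The edge rule of the spectrum graph can be illustrated with a minimal Python sketch. The function name `spectrum_graph_edges` and the four-entry residue table are assumptions made for this illustration (a real table would list all amino acids), and the sketch does not reproduce Twister's preprocessing, vertex scoring, or path extraction.

```python
# Monoisotopic residue masses in Da for a few amino acids (illustrative subset).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768}


def spectrum_graph_edges(peak_masses, eps=0.004):
    """Return (m(u), m(w), amino_acid) for every pair of peaks u, w such that
    m(w) - m(u) matches a residue mass within 2 * eps (eps given in Da)."""
    masses = sorted(peak_masses)
    edges = []
    for i, mu in enumerate(masses):
        for mw in masses[i + 1:]:
            diff = mw - mu
            for aa, m in RESIDUE_MASS.items():
                if abs(diff - m) <= 2 * eps:
                    edges.append((mu, mw, aa))
    return edges


edges = spectrum_graph_edges([500.0, 587.03203, 644.05349, 900.0])
```

Here the first three peaks form a path labeled S, G, while the peak at 900.0 stays isolated because no residue mass in the table matches its distance to any other peak.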

An important point here is that the application of a small constant mass tolerance at the time of generating the edges of $G\left(S\right)$ assures that the resulting k-tags are highly accurate. A detailed description of the above procedure can be found in [31].

#### 4.4. Sequence Fragments

The first part of the input of the proposed method is a set $\mathcal{A}$ of amino acid strings supposed to represent sequence fragments of the proteins from the sample being analyzed. In our experiments, we used as $\mathcal{A}$ the amino acid sequences of the aggregated paths generated with Twister, as described in [32], from the set of MS/MS spectra acquired from the respective sample.

In brief, Twister takes a set of deisotoped and charge state deconvoluted MS/MS spectra as input, and first generates from them a set of highly accurate k-tags using the strategy described in the previous section. Next, it assembles a number of de novo strings from tags that are consistent with each other in terms of both amino acid sequences and offsets; each de novo string is assigned a mass offset equal to the smallest offset among those of the tags contributing to it. (For example, suppose we have two 4-tags derived from HCD spectra, with the amino acid strings SGAT and GATF and the offsets 500 and 587, respectively. Since $587=500+Mass\left(S\right)$, those tags may be due to the same protein, e.g., one with a subsequence SGATF preceded by an N-terminal fragment of mass 500; gluing the two tags yields a de novo string SGATF with an offset of 500.) Finally, Twister combines the derived de novo strings into a number of aggregated strings endowed with direct and reversed offsets; the amino acid sequence of an aggregated string typically represents a longer sequence fragment of a protein contained in the sample, and its associated offsets reflect the location of the respective fragment within the entire sequence.
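The gluing step from the example can be sketched as follows. This is a simplified illustration rather than Twister's actual assembly procedure: the function `try_glue`, the tolerance `tol`, and the residue table are assumptions, and the tolerance is deliberately loose so that the rounded offsets 500 and 587 from the example are accepted.

```python
# Monoisotopic residue masses in Da (illustrative subset).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "T": 101.04768, "F": 147.06841}


def try_glue(tag1, tag2, tol=0.05):
    """Glue two (sequence, offset) tags if a suffix of the first sequence is a
    prefix of the second and the offsets agree; return (sequence, offset) or None."""
    (seq1, off1), (seq2, off2) = tag1, tag2
    for shift in range(1, len(seq1)):
        if not seq2.startswith(seq1[shift:]):
            continue
        # The second tag must start after the first `shift` residues of the first tag.
        expected = off1 + sum(RESIDUE_MASS[a] for a in seq1[:shift])
        if abs(off2 - expected) <= tol:
            overlap = len(seq1) - shift
            return seq1 + seq2[overlap:], off1
    return None


glued = try_glue(("SGAT", 500.0), ("GATF", 587.0))
```

The glued string inherits the smaller of the two offsets, in line with the rule stated above.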

To generate the aggregated strings, we ran Twister with the default parameters (see above) on the CAH2 and alemtuzumab data sets. The amino acid sequences of the 70 and 92 aggregated strings obtained for CAH2 and alemtuzumab, respectively, which served as input for the algorithm being described, are listed in the supplementary file `Aggregated-strings-Twister.xls`. Their correct fragments of length at least four are highlighted in color, and for each of those, its first and last position in the corresponding protein sequence is indicated; if the former exceeds the latter, the respective fragment occurs in the sequence in reversed form.

#### 4.5. Tag Convolution

For an amino acid sequence s, let $\overline{s}$ denote its reversed copy.

Tag convolution was defined in [33] as follows. For a set of k-tags $\mathcal{T}$, let $\mathcal{K}\left(\mathcal{T}\right)=\{w \mid \exists t\in \mathcal{T}:s\left(t\right)=w\}$ denote the set of all their amino acid sequences. Given two k-mers ${w}_{1},{w}_{2}\in \mathcal{K}\left(\mathcal{T}\right)$, tag convolution $\tau ({w}_{1},{w}_{2})$ considers all pairs $({t}_{1},{t}_{2})$ of tags from $\mathcal{T}$, such that $s\left({t}_{1}\right)={w}_{1}$ and $s\left({t}_{2}\right)={w}_{2}$, and computes the difference $o\left({t}_{2}\right)-o\left({t}_{1}\right)$ of their offsets. For each difference encountered thereby (up to a predefined tolerance), tag convolution records how many times it occurred. Thus, its output comprises a set of pairs, each composed of a registered offset difference ${d}_{i}$ and its multiplicity ${m}_{i}$: $\tau ({w}_{1},{w}_{2})=\{({d}_{i},{m}_{i}) \mid 1\le i\le h\}$, where h is the number of distinct offset difference values observed.
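A minimal Python sketch of $\tau ({w}_{1},{w}_{2})$ is given below. Tags are modeled as `(sequence, offset)` pairs, and rounding to `decimals` places stands in for the "predefined tolerance"; both the representation and the rounding are assumptions of this sketch, not the definition used in [33].

```python
from collections import Counter


def tag_convolution(tags, w1, w2, decimals=2):
    """Map each (rounded) offset difference o(t2) - o(t1) to its multiplicity,
    over all tag pairs with s(t1) == w1 and s(t2) == w2."""
    offsets1 = [o for s, o in tags if s == w1]
    offsets2 = [o for s, o in tags if s == w2]
    return Counter(round(o2 - o1, decimals) for o1 in offsets1 for o2 in offsets2)


tags = [("SGAT", 500.0), ("SGAT", 1200.0), ("PEPT", 1000.0), ("PEPT", 1700.0)]
conv = tag_convolution(tags, "SGAT", "PEPT")
```

In this toy input, the difference 500 is observed twice (1000 - 500 and 1700 - 1200), whereas 1200 and -200 occur once each.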

Subsequently, the above concept was generalized to the case of strings, and slightly adjusted so that for two subsequences ${s}_{1}={a}_{i}\dots {a}_{i+q}$ and ${s}_{2}={a}_{j}\dots {a}_{j+r}$ of s, where $1\le i\le i+q<j\le n-r$, the value contributed to the output of tag convolution $T({s}_{1},{s}_{2})$ by the pairs of tags matching either ${s}_{1}$ and ${s}_{2}$ or $\overline{{s}_{2}}$ and $\overline{{s}_{1}}$ would equal $Mass({a}_{i+q+1}\dots {a}_{j-1})$, i.e., the mass of the subsequence separating ${s}_{1}$ and ${s}_{2}$ in s. This was formalized in the following way.

For a real δ, a shift of $\tau ({w}_{1},{w}_{2})$ by δ is defined as ${\tau}_{\delta}({w}_{1},{w}_{2})=\left\{(d+\delta ,m)\right|(d,m)\in \tau ({w}_{1},{w}_{2})\}$. To compute $T({s}_{1},{s}_{2})$ for two amino acid strings ${s}_{1}={x}_{1}\dots {x}_{e}$ and ${s}_{2}={y}_{1}\dots {y}_{f}$, we first iterate over all the pairs of k-mers from ${s}_{1}$ and ${s}_{2}$, respectively; thereby, a pair $({x}_{i}\dots {x}_{i+k-1},{y}_{j}\dots {y}_{j+k-1})$ contributes the output of ${\tau}_{-Mass({x}_{i}\dots {x}_{e})-Mass({y}_{1}\dots {y}_{j-1})}({x}_{i}\dots {x}_{i+k-1},{y}_{j}\dots {y}_{j+k-1})$ to an auxiliary set $\tau ({s}_{1},{s}_{2})$. Next, we analogously form a set $\tau (\overline{{s}_{2}},\overline{{s}_{1}})$. Having merged together $\tau ({s}_{1},{s}_{2})$ and $\tau (\overline{{s}_{2}},\overline{{s}_{1}})$, we obtain $T({s}_{1},{s}_{2})$. Note that $T({s}_{1},{s}_{2})=T(\overline{{s}_{2}},\overline{{s}_{1}})$.
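The construction of $T({s}_{1},{s}_{2})$ can be sketched in Python as follows. The helper names, the `(sequence, offset)` tag representation, the rounding used in place of a tolerance, and the toy residue table are assumptions made for illustration only.

```python
from collections import Counter

# Monoisotopic residue masses in Da (illustrative subset).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768}


def mass(seq):
    return sum(RESIDUE_MASS[a] for a in seq)


def shifted_convolution(tags, s1, s2, k, decimals=2):
    """Auxiliary set tau(s1, s2): offset differences of matching k-mer pairs,
    shifted by -Mass(x_i..x_e) - Mass(y_1..y_{j-1})."""
    out = Counter()
    for i in range(len(s1) - k + 1):
        for j in range(len(s2) - k + 1):
            w1, w2 = s1[i:i + k], s2[j:j + k]
            delta = -mass(s1[i:]) - mass(s2[:j])
            for seq_a, off_a in tags:
                if seq_a != w1:
                    continue
                for seq_b, off_b in tags:
                    if seq_b == w2:
                        out[round(off_b - off_a + delta, decimals)] += 1
    return out


def string_convolution(tags, s1, s2, k):
    """T(s1, s2): tau(s1, s2) merged with tau(reversed s2, reversed s1)."""
    return (shifted_convolution(tags, s1, s2, k)
            + shifted_convolution(tags, s2[::-1], s1[::-1], k))


# Toy protein: prefix of mass 100, then SGAT, a gap "A", then TSAG; 2-tags only.
tags = [("SG", 100.0), ("AT", 244.05349), ("TS", 487.17539)]
T = string_convolution(tags, "SGAT", "TSAG", k=2)
```

Both tag pairs (SG, TS) and (AT, TS) contribute the same shifted value, which equals the mass of the separating residue A (71.04 after rounding).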

In [33], we described a procedure for validating de novo peptide sequences. In particular, for an amino acid ${a}_{i}$ of a candidate sequence $s={a}_{1}\dots {a}_{n}$, where $k<i\le n-k$, it computes $T({a}_{1}\dots {a}_{i-1},{a}_{i+1}\dots {a}_{n})$ and checks whether $Mass\left({a}_{i}\right)$ occurs in it with a high enough multiplicity. According to our experiments, for a correct peptide sequence s, the multiplicity of $Mass\left({a}_{i}\right)$ usually clearly dominates that of the other values present in $T({a}_{1}\dots {a}_{i-1},{a}_{i+1}\dots {a}_{n})$. This suggests that a similar idea might be applied to check whether two amino acid strings ${s}_{1}$ and ${s}_{2}$ are subsequences of a longer sequence s: to this end, one would compute $T({s}_{1},{s}_{2})$ and verify whether the multiplicity of the most frequently observed offset difference ${d}^{*}$ is significantly greater than the second-highest multiplicity. If so, ${d}^{*}$ would be reported as the mass of the subsequence separating ${s}_{1}$ and ${s}_{2}$ in s; otherwise, the verdict would be that ${s}_{1}$, ${s}_{2}$ and s are not related in that way.

However, such an approach would work fine only for a rather short peptide sequence s, and its subsequences ${s}_{1}$ and ${s}_{2}$ separated by at most a few amino acids, and turns out to be inapplicable to the top-down case, with long protein sequences and large gaps between the retrieved fragments of those. The underlying issues, along with the means to resolve them, are discussed in the next section.

#### 4.6. Gap Estimation

Given two amino acid strings ${s}_{1}$ and ${s}_{2}$, we aim to verify whether they represent two disjoint fragments of the same protein sequence s, and if the answer is positive, report an approximate mass of the sequence separating them in s. To this end, we compute $T({s}_{1},{s}_{2})$ based on a set $\mathcal{T}$ of k-tags extracted from top-down MS/MS spectra; however, only pairs of tags from the same spectrum are allowed to contribute to $T({s}_{1},{s}_{2})$, and its output needs to be treated in a different way, as compared to the bottom-up case.

To generate the set $\mathcal{T}$, we again apply the strategy that is part of the Twister approach, which assures high accuracy of the resulting tags (see Section 4.3). In particular, we use a stringent mass tolerance $\epsilon =4$ mDa when deciding whether the difference between two peak masses matches the mass of some amino acid, thereby relying upon the observation that the errors in close masses tend to be similar.

However, when we switch to the differences between tag offsets, which can be quite large, this kind of assumption can no longer be made. Moreover, the same value can appear as a difference of two relatively small offsets, and also as that of two large offsets, and in the latter case, the error in it may be substantially larger than in the former case. To avoid the need to keep track of the way in which concrete values were obtained, we apply a binning strategy similar to the one introduced in [32] for analyzing the offsets of aggregated strings. Namely, each offset difference d is first scaled through multiplication by ${10}^{h}$ (in our experiments, $h=4$), and rounded to the nearest integer; subsequently, each obtained scaled difference ${d}^{s}$ is assigned a multiplicity $\mu \left({d}^{s}\right)$ equal to the number of the offset differences that got transformed into it. In addition, an integral binned difference ${d}^{b}$ is calculated for d by rounding it to the nearest integer; its multiplicity is defined as $\mu \left({d}^{b}\right)=\mu \left({d}_{1}^{s}\right)+\dots +\mu \left({d}_{g}^{s}\right)$, where ${d}_{i}^{s}$, $1\le i\le g$, are the scaled counterparts of the offset differences that got transformed into ${d}^{b}$.
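The two levels of binning can be sketched in Python as follows. The function name `bin_differences` is an assumption; note also that Python's built-in `round` resolves exact halves towards the even integer, which may differ from the rounding convention used in the actual implementation.

```python
from collections import Counter


def bin_differences(diffs, h=4):
    """Return (scaled, binned): multiplicities of the scaled differences
    round(d * 10**h) and of the integral binned differences round(d)."""
    scaled = Counter(round(d * 10 ** h) for d in diffs)
    binned = Counter(round(d) for d in diffs)
    return scaled, binned


scaled, binned = bin_differences([71.03711, 71.03712, 71.0371, 128.09496])
```

The three values close to 71.0371 collapse into one scaled difference (710371) and one binned difference (71), each with multiplicity 3.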

Let our hypothesis be that ${s}_{1}$ and ${s}_{2}$ are two disjoint subsequences of the same (unknown) protein sequence s, and ${s}_{1}$ precedes ${s}_{2}$ in s. In order to disprove it, we proceed as follows. First, we calculate $T({s}_{1},{s}_{2})$, along with the respective sets of scaled and binned offset differences endowed with multiplicities. Next, we focus on the binned differences, and select the non-negative ones not exceeding a predefined threshold ${G}_{max}$. Further, from the binned differences still under consideration that have a multiplicity of at least ${B}_{min}$, we pick those with the highest multiplicity ${b}^{max}$. For each such difference ${d}^{{b}_{max}}$, we calculate its score as $Score\left({d}^{{b}_{max}}\right)=\mu \left({d}^{{b}_{max}}\right)+\mu ({d}^{{b}_{max}}-1)+\mu ({d}^{{b}_{max}}+1)$, assuming that a value ${d}^{\prime}$ that does not appear as a binned difference has a zero multiplicity. In this way, we account for the well-known $\pm 1$ Da errors in large enough deconvoluted masses. Finally, the top-scoring binned difference ${d}_{top}^{{b}_{max}}$ is selected (the smallest one is picked in case of ties), then its corresponding scaled difference ${d}_{0}^{s}$ with the highest multiplicity is detected, and the value of $\widehat{d}={d}_{0}^{s}\cdot {10}^{-h}$ is reported as a candidate estimate of the gap between ${s}_{1}$ and ${s}_{2}$.
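The selection of the candidate estimate $\widehat{d}$ can be sketched as below. The input `per_bin` (a mapping from each binned difference to the multiplicities of the scaled differences it absorbed) and the function name `pick_gap` are assumptions of this sketch. The text does not say how ties among scaled differences within a bin are broken; the sketch picks the smallest one.

```python
from collections import Counter


def pick_gap(per_bin, g_max, b_min, h=4):
    """Return the candidate gap estimate, or None if no binned difference qualifies."""
    mu = {b: sum(c.values()) for b, c in per_bin.items()}
    # Non-negative binned differences up to G_max with multiplicity >= B_min.
    candidates = [b for b, m in mu.items() if 0 <= b <= g_max and m >= b_min]
    if not candidates:
        return None
    top = max(mu[b] for b in candidates)
    best = [b for b in candidates if mu[b] == top]

    def score(b):
        # Neighboring bins account for +-1 Da errors; absent bins count as zero.
        return mu[b] + mu.get(b - 1, 0) + mu.get(b + 1, 0)

    # Highest score first; the smallest difference wins ties.
    d_top = min(best, key=lambda b: (-score(b), b))
    in_bin = per_bin[d_top]
    # Most frequent scaled difference in the chosen bin (smallest on ties: an assumption).
    d0 = min(in_bin, key=lambda ds: (-in_bin[ds], ds))
    return d0 / 10 ** h


per_bin = {
    71: Counter({710371: 3, 710372: 1}),
    72: Counter({720400: 1}),
    128: Counter({1280950: 4}),
    -5: Counter({-50000: 10}),
}
estimate = pick_gap(per_bin, g_max=2000, b_min=2)
```

In this toy input, bins 71 and 128 share the highest multiplicity 4 among the admissible candidates, but bin 71 scores 5 thanks to its neighbor 72, so the estimate is derived from the scaled difference 710371.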

As a last step, we check whether the tags that contributed to the binned counterpart ${\widehat{d}}^{b}$ of the estimate $\widehat{d}$ together would cover at least a certain number of amino acids in both ${s}_{1}$ and ${s}_{2}$. To this end, we introduce a threshold ${A}_{min}$, and note that ${m}^{*}={A}_{min}-k+1$ k-tags with distinct labels all corresponding to the same string will always cover at least ${A}_{min}$ amino acids in it. The estimate $\widehat{d}$ is accepted if at least ${m}^{*}$ tags that support ${\widehat{d}}^{b}$ are observed for each of ${s}_{1}$ and ${s}_{2}$, or ${m}^{*}+1$ and ${m}^{*}-1$ tags are observed for one and the other string, respectively. If neither is the case, we check whether the respective numbers are both at least ${m}^{*}-1$, and if so, whether either ${\widehat{d}}^{b}-1$ or ${\widehat{d}}^{b}+1$ occurred among the binned differences, and was supported by at least ${m}^{*}-1$ and ${m}^{*}$ tags for the two strings, respectively. In case this holds, the estimate $\widehat{d}$ is accepted. Otherwise, we conclude that the hypothesis was wrong.

Since the protein sequence fragments may appear in the output of Twister in a direct as well as reversed form, when processing the amino acid sequences ${s}_{1}$ and ${s}_{2}$ of two aggregated paths, we apply the above procedure to up to four pairs of strings, namely, ${s}_{1}$ and ${s}_{2}$, ${s}_{1}$ and $\overline{{s}_{2}}$, $\overline{{s}_{1}}$ and ${s}_{2}$, and $\overline{{s}_{1}}$ and $\overline{{s}_{2}}$. If a gap estimate is obtained for some pair, the two strings are joined to form a gapped path, and the remaining pairs are not considered.

To enable iterative construction of gapped paths, we proceed as follows. The gapped paths are initialized with the input strings, and further examined pairwise. As in the case of regular amino acid strings, for a pair ${g}_{1}$, ${g}_{2}$ of gapped paths, we consider four combinations comprising the direct and/or reversed versions of those: ${g}_{1}$ and ${g}_{2}$, ${g}_{1}$ and $\overline{{g}_{2}}$, $\overline{{g}_{1}}$ and ${g}_{2}$, and $\overline{{g}_{1}}$ and $\overline{{g}_{2}}$. Without loss of generality, let us discuss in more detail the first case.

When processing ${g}_{1}$ and ${g}_{2}$, we first try to append ${g}_{2}$ to ${g}_{1}$. To decide whether it is possible, we pick the last sequence fragment ${s}_{1}^{last}$ of ${g}_{1}$ and the first sequence fragment ${s}_{2}^{first}$ of ${g}_{2}$, and verify as stated above whether ${s}_{1}^{last}$ and ${s}_{2}^{first}$ represent two fragments of the same protein sequence. If the answer is positive, ${g}_{2}$ is appended to ${g}_{1}$; otherwise, we consecutively examine the gaps from ${g}_{1}$, and for each gap large enough to potentially accommodate ${g}_{2}$, perform a similar check for the sequence fragment ${s}^{\prime}$ of ${g}_{1}$ immediately preceding this gap, and ${s}_{2}^{first}$. If, according to its outcome, ${s}^{\prime}$ precedes ${s}_{2}^{first}$ in some protein sequence, we additionally verify whether upon embedding of ${g}_{2}$ into this gap, its tail would overlap the fragment ${s}^{\prime\prime}$ of ${g}_{1}$ immediately after the gap. If not, ${g}_{2}$ is appropriately merged into ${g}_{1}$ after ${s}^{\prime}$. The overlap check amounts to a comparison of the mass offset of the end of ${g}_{2}$ upon embedding, and that of the beginning of ${s}^{\prime\prime}$ (the offsets may be calculated, e.g., with respect to the beginning of ${g}_{1}$), which is carried out using a tolerance ${\epsilon}_{abs}$ specified in ppm. In case ${g}_{2}$ could not be embedded into ${g}_{1}$, a similar procedure is applied with the goal of embedding ${g}_{1}$ into ${g}_{2}$.
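The overlap check can be illustrated with a short Python sketch. The text only states that the two offsets are compared with a ppm tolerance, so the function name `fits_before` and the choice of the second offset as the reference mass for converting ppm into Da are assumptions.

```python
def fits_before(embedded_end_offset, next_fragment_offset, eps_ppm):
    """True if the end of the embedded path does not run past the beginning
    of the next fragment by more than eps_ppm (both offsets in Da)."""
    tolerance_da = next_fragment_offset * eps_ppm * 1e-6
    return embedded_end_offset <= next_fragment_offset + tolerance_da


ok = fits_before(1000.005, 1000.0, eps_ppm=10)       # 5 mDa excess, 10 mDa allowed
overlap = fits_before(1000.02, 1000.0, eps_ppm=10)   # 20 mDa excess
```

With both offsets measured from the beginning of ${g}_{1}$, the first call accepts the embedding, while the second reports an overlap.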