Validation of De Novo Peptide Sequences with Bottom-Up Tag Convolution

De novo sequencing is indispensable for the analysis of proteins from organisms with unknown genomes, novel splice variants, and antibodies. However, despite a variety of methods developed to this end, distinguishing between the correct interpretation of a mass spectrum and a number of incorrect alternatives often remains a challenge. Tag convolution is computed for a set of peptide sequence tags of a fixed length k generated from the input tandem mass spectra and can be viewed as a generalization of the well-known spectral convolution. We demonstrate its utility for validating de novo peptide sequences by using a set of those generated by the algorithm PepNovo+ from high-resolution bottom-up data sets for carbonic anhydrase 2 and the Fab region of alemtuzumab and indicate its further potential applications.


Introduction
Tandem mass spectrometry (MS/MS) has established itself as the dominant technique in proteomics. First recognized as such was the more elaborated bottom-up technology, which analyzes peptides resulting from protein enzymatic digestion; however, the recently emerged top-down approach that analyzes intact proteins is nowadays rapidly gaining popularity.
Analysis of MS/MS spectra acquired from peptides or proteins often amounts to considering pairwise differences of peak masses rather than the masses themselves. For instance, pairs of peaks separated by amino acid masses give rise to edges in a spectrum graph [1,2], and ladders of such peaks define peptide sequence tags [3], which have become the basis of several methods for peptide and protein identification by database search [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18] and have also proved useful for limiting the number of de novo sequence possibilities [19,20]. The key step that precedes deisotoping and charge state deconvolution of MS/MS spectra is the detection of (candidate) isotopomer envelopes, the theoretical counterparts of which are represented by groups of equally spaced peaks [21][22][23][24][25][26][27]. A more sophisticated example is given by spectral convolution [28], which examines pairwise differences between the masses of peaks taken from two distinct spectra, along with their multiplicities (i.e., the number of times each difference is observed), in order to estimate the similarity between the two spectra. The key underlying observation is that the multiplicity of zero equals the number of peaks the two spectra have in common (their shared peaks count), and the presence of a few non-zero values with high multiplicities likely indicates that the spectra were acquired from two peptides that are a few mutations apart.
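As a minimal illustrative sketch of spectral convolution (not the implementation of [28]), the following snippet counts pairwise mass differences between two toy peak lists. The function name, the rounding-based binning, and the peak values are assumptions made for illustration only.

```python
from collections import Counter

def spectral_convolution(spec_a, spec_b, decimals=2):
    """Count pairwise mass differences (b - a), binned by rounding (an assumed binning)."""
    diffs = Counter()
    for a in spec_a:
        for b in spec_b:
            diffs[round(b - a, decimals)] += 1
    return diffs

# Two hypothetical spectra: the last two peaks of s2 are shifted by 14 Da.
s1 = [100.0, 200.0, 300.0, 400.0]
s2 = [100.0, 200.0, 314.0, 414.0]
conv = spectral_convolution(s1, s2)

shared_peaks = conv[0.0]   # multiplicity of zero = shared peaks count
shifted_peaks = conv[14.0] # multiplicity of the 14 Da offset
```

Here the multiplicity of zero gives the shared peaks count, and the second highly populated difference hints at a mass shift affecting part of the fragment ladder.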
In [29], we introduced the notion of tag convolution for a top-down LC-MS/MS dataset, which may be viewed as a generalization of spectral convolution and is computed across the entire set of input spectra or, more precisely, over a set of sequence tags generated from those spectra. We demonstrated that tag convolution can be efficiently used for combining protein sequence fragments generated by the Twister de novo sequencing algorithm from top-down MS/MS data; as a result, we obtained so-called gapped strings, in which the missing portions of the sequence were substituted by their masses. We also mentioned that the concept of tag convolution could be adapted to the bottom-up case and applied for validating de novo peptide sequences. However, to this end, it is essential to take into account that the number of tags derived from a typical bottom-up data set will be a few orders of magnitude larger than in the top-down case and that the pairs of tags matching the same peptide will have relatively close mass offsets.
Recall that for a spectrum S, a peptide sequence tag of length k, or k-tag, is defined by k + 1 peaks of S separated by amino acid masses; the respective amino acids spell out the tag string, and the mass of the leftmost peak determines the mass offset, or simply offset, assigned to the resulting tag. Given a set $\mathcal{T}$ of k-tags extracted from a set of input MS/MS spectra and two k-mers $w_1$ and $w_2$, tag convolution computes offset differences for the pairs of tags from $\mathcal{T}$ labeled with $w_2$ and $w_1$, respectively, and along with each encountered value, it reports its multiplicity, equal to the number of pairs of tags that contributed to it. The underlying intuition is that if $w_1$ precedes $w_2$ in the sequence s of a protein or peptide subject to analysis, then the mass of the subsequence of s starting at the beginning of $w_1$ and ending right before $w_2$ will be registered with high multiplicity.
For this approach to work as expected, it is crucial that most of the tags composing $\mathcal{T}$ be correct. In order to ensure this holds, we employ the tag generation strategy introduced in [30] for the case of top-down MS/MS spectra and apply it to bottom-up MS/MS spectra collected at a high resolution [31]. It first deconvolutes the input spectra with MS-Deconv [27] and subsequently generates k-tags applying an ultra-low constant mass tolerance of 4 mDa, thus profiting from the fact that while the absolute error in a peak mass (especially a large one) can be accordingly large, the difference between the masses of peaks corresponding to consecutive fragment ions tends to be substantially more accurate.
In what follows, we provide a formal definition of bottom-up tag convolution, describe a procedure that uses it for validating de novo amino acid sequences, and illustrate its performance on bottom-up datasets for carbonic anhydrase 2 (CAH2) and alemtuzumab. We conclude by indicating future methods for developing this concept.

Generation of k-Tags
The input MS/MS spectra acquired at a high resolution are first deconvoluted, to which end we use MS-Deconv [27]. Let $\mathcal{S}$ denote the resulting set of deconvoluted spectra.
Subsequently, we extract from each spectrum $S \in \mathcal{S}$ a number of high-quality k-tags for a fixed length k. This is accomplished by means of the method first proposed in [30] for top-down MS/MS spectra and later successfully applied to bottom-up data [31]. First, a spectrum graph $G_S$ is constructed for S. Its vertices correspond to the peaks of S and are scored with the underlying peak intensities; a directed edge uv is introduced between two vertices u and v if Mass(v) > Mass(u) and Mass(v) − Mass(u) equals the mass of some amino acid a up to a predefined tolerance, where Mass(u) and Mass(v) denote the masses of the peaks from S that gave rise to u and v, respectively. Here we rely on the observation that peaks with nearby masses typically bear similar mass errors; therefore, the mass difference between two peaks corresponding to consecutive fragment ions should be highly accurate. Thus, for a small ε denoting the allowable deviation from the "anticipated" peak mass (which we expect to differ from the theoretical one by a certain value depending on the absolute mass), we check whether $|\mathrm{Mass}(v) - \mathrm{Mass}(u) - \mathrm{Mass}(a)| < 2\varepsilon$, and, if so, create an edge uv and label it with a. Based on automated and manual analysis of a few datasets, we set ε to 4 mDa and kept this value throughout all our experiments.
Next, an optimal path (with respect to the vertex scores) is computed for each connected component of $G_S$, and all the possible k-tags are derived from it; note that two tags with the same amino acid string and offset originating from distinct spectra are considered different. In this manner, we obtain a set $\mathcal{T} = \mathcal{T}(\mathcal{S})$ of k-tags, based on which tag convolution is further computed.
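The edge criterion and the notion of a k-tag can be illustrated with the following simplified sketch. It is not the procedure of [30]: instead of computing an optimal path per connected component, it enumerates every ladder of k + 1 peaks, and peak intensities are ignored. The function names and the toy peak list are hypothetical; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da; I is omitted because it is isobaric with L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def build_edges(peaks, eps=0.004):
    """Map each peak u to the (amino acid, v) pairs with |v - u - Mass(a)| < 2*eps."""
    peaks = sorted(peaks)
    edges = {}
    for idx, u in enumerate(peaks):
        for v in peaks[idx + 1:]:
            for aa, m in RESIDUE_MASS.items():
                if abs(v - u - m) < 2 * eps:
                    edges.setdefault(u, []).append((aa, v))
    return edges

def k_tags(peaks, k, eps=0.004):
    """Enumerate all (tag string, offset) pairs spelled by ladders of k + 1 peaks."""
    edges = build_edges(peaks, eps)
    tags = []

    def extend(node, string, offset):
        if len(string) == k:
            tags.append((string, offset))
            return
        for aa, nxt in edges.get(node, []):
            extend(nxt, string + aa, offset)

    for p in sorted(peaks):
        extend(p, "", p)
    return tags

# Hypothetical deconvoluted peaks: a D-P-V ladder starting at 200.0 plus one noise peak.
ladder = [200.0]
for aa in "DPV":
    ladder.append(ladder[-1] + RESIDUE_MASS[aa])
peaks = ladder + [250.0]
tags3 = k_tags(peaks, 3)
tags2 = k_tags(peaks, 2)
```

With ε = 4 mDa, the noise peak does not connect to any other peak, so only the ladder contributes tags.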

Bottom-Up Tag Convolution
When describing the computation of bottom-up tag convolution, we will generally follow the scheme from [29]. However, the masses potentially separating pairs of tags under consideration will be analyzed and processed differently from the top-down case for the following reasons:
• The number of tags originating from the same peptide is typically quite large;
• The mass offsets of tags matching the same peptide are usually close; thus, their differences are accurate;
• Unlike in the top-down case, ±1 Da deconvolution errors are rarely observed in the bottom-up MS/MS spectra.
For a tag $t \in \mathcal{T}$, let s(t) and o(t) denote its amino acid string and offset, respectively. Moreover, let $\mathcal{K} = \mathcal{K}(\mathcal{T})$ denote the set of tag strings induced by $\mathcal{T}$:

$$\mathcal{K} = \{ s(t) \mid t \in \mathcal{T} \}.$$

For two k-mers $w_1, w_2 \in \mathcal{K}$, tag convolution $\tau(w_1, w_2)$ examines each pair $(t_1, t_2)$ of tags from $\mathcal{T}$ labeled with $w_1$ and $w_2$, respectively, and computes the difference $o(t_2) - o(t_1)$ of their associated offsets. Its output represents the set of observed values $d_i$, each endowed with the multiplicity $m_i$ equal to the number of pairs of tags that produced it (up to a predefined tolerance):

$$\tau(w_1, w_2) = \{ (d_1, m_1), \ldots, (d_h, m_h) \},$$

where h is the number of offset differences encountered. If either $w_1$, or $w_2$, or both do not belong to $\mathcal{K}$, the output of $\tau(w_1, w_2)$ is an empty set. A toy example illustrating this concept is provided in Figure 1.

Observe that spectral convolution [28] of two spectra $S_1$ and $S_2$ constitutes a special case of tag convolution for $\mathcal{S} = \{S_1, S_2\}$ and all the possible 0-tags (i.e., peaks from $S_1$ and $S_2$), upon a convention that each 0-tag derived from $S_1$ and $S_2$ has been assigned an artificial label $z^*_1$ and $z^*_2$, respectively, where $z^*_1$ and $z^*_2$ are distinct.

Intuitively, it should be expected that if the k-mers $w_1 = a_i \ldots a_{i+k-1}$ and $w_2 = a_j \ldots a_{j+k-1}$ represent two substrings of the sequence $s = a_1 \ldots a_n$ of a target peptide P, where $1 \le i < j \le n - k + 1$, and are unique with respect to the sequences of all the peptides subject to analysis (up to reversal), then the offset difference approximately equal to $\mathrm{Mass}(a_i \ldots a_{j-1})$ will appear in the output of $\tau(w_1, w_2)$ with high multiplicity, while the other observed differences will have substantially lower multiplicities. This mass represents, in particular, the difference between the offsets of the tags labeled with $w_1$ and $w_2$, respectively, defined by the peaks from the theoretical spectrum of P that correspond to the ladders of N-terminal ions of the same type. Here, we implicitly assume that fragmentation does produce ladders of ions leading to the tags labeled with $w_1$ and $w_2$, respectively.
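A minimal sketch of tag convolution for a pair of k-mers is given below. Tags are represented as hypothetical (string, offset) tuples, and "up to a predefined tolerance" is realized by a simple greedy grouping of sorted differences, which is an assumption of this sketch rather than the grouping used by the TagConvolution tool.

```python
def tag_convolution(tags, w1, w2, tol=0.02):
    """Return [(d, m)]: offset differences o(t2) - o(t1) and their multiplicities.

    Differences are sorted and greedily grouped: a value joins the current
    group if it lies within `tol` of the group's first member (an assumption).
    """
    diffs = sorted(
        o2 - o1
        for s1, o1 in tags if s1 == w1
        for s2, o2 in tags if s2 == w2
    )
    groups = []
    for d in diffs:
        if groups and d - groups[-1][0] <= tol:
            groups[-1].append(d)
        else:
            groups.append([d])
    # Report each group by its mean difference and its size (the multiplicity).
    return [(sum(g) / len(g), len(g)) for g in groups]

# Hypothetical 3-tags pooled from several spectra.
tags = [
    ("DPV", 100.0), ("DPV", 350.5),
    ("SMP", 600.0), ("SMP", 850.51), ("SMP", 700.0),
]
result = tag_convolution(tags, "DPV", "SMP")
best = max(result, key=lambda pair: pair[1])
empty = tag_convolution(tags, "AAA", "SMP")
```

In this toy input, two pairs of tags agree on a difference of about 500 Da, so this value stands out with multiplicity 2, while a k-mer absent from the tag set yields an empty output.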

An important point is that even spectra with very few peaks or an incorrect precursor mass, which could be neither interpreted de novo nor identified by means of a database search (should an appropriate database be available), may give rise to tags that will contribute to the "correct" offset difference (Figure 2a,b). In addition, such a pair of tags can originate from two distinct spectra, which may have been acquired from different peptides, provided that those start at the same residue of the underlying sequence (Figure 2c). Thus, tag convolution makes remarkably extensive use of the information encapsulated in the input dataset, capturing the details commonly missed by existing tools for analyzing MS/MS data.

Figure 2. Four spectra acquired from a toy protein with the amino acid sequence AVTDPVLSGNATSMPGST. Tag convolution is computed for the strings TDPVL and ATSMP. The two tags composing a pair that contributes the "correct" (i.e., equal to m(SGN)) value can be derived from the following: (a) a spectrum acquired from the entire protein; (b) a spectrum acquired from a fragment of the underlying protein; (c) two distinct spectra acquired from possibly different protein fragments starting at the same amino acid residue.

However, in practice, $w_1$ and/or $w_2$ may happen not to be unique with respect to the protein sequence(s) contained in a sample, and, if so, pairs of tags corresponding to their non-correlated occurrences may produce an irrelevant offset difference endowed with a convincingly high multiplicity. A straightforward method for preventing the appearance of such spurious values is an appropriate selection of the tag length k, which should be large enough to ensure that a k-mer is unlikely to occur more than once in the sequence(s) being analyzed (note that an occurrence of its reversed copy would also count). Nevertheless, the use of short tags is often beneficial despite the fact that they can be duplicated: for instance, 3-tags turn out to be particularly handy in analyzing poorly covered regions of the underlying sequence(s). On the other hand, it is often clear from the context which offset differences are more likely correct, and then incorrect values can be safely ignored regardless of their associated multiplicities. For example, when seeking to decide whether a sequence s represents a correct de novo interpretation of an input spectrum (see also Sections 2.3 and 3.2), for two k-mers $w_1$ and $w_2$ defined as above, we would expect $\mathrm{Mass}(a_i \ldots a_{j-1})$ to show up with high multiplicity. If this is the case, but some other values occur with comparably high, or even higher, multiplicities, their presence can be attributed to the fact that at least one of $w_1$ and $w_2$ occurred at least once more (possibly in a reversed form) in the sequences of the peptides contained in the sample.
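When the sequences are known (e.g., in a benchmark), the dependence of k-mer uniqueness on the tag length can be checked directly. The sketch below is purely illustrative; the function name is hypothetical, and counting a reversed copy as an additional occurrence follows the remark above.

```python
from collections import Counter

def ambiguous_kmers(sequences, k):
    """Return the k-mers occurring more than once, counting reversed copies as occurrences."""
    counts = Counter()
    for seq in sequences:
        # Scan the sequence and its reversal; a palindromic k-mer is thus
        # counted twice and reported as ambiguous.
        for strand in (seq, seq[::-1]):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return {w for w, c in counts.items() if c > 1}

toy = "AVTDPVLSGNATSMPGST"
ambiguous_3 = ambiguous_kmers([toy], 3)
ambiguous_2 = ambiguous_kmers([toy], 2)
```

For the toy sequence used in the figures, all 3-mers are unique even up to reversal, whereas four 2-mers are not, which illustrates the trade-off in choosing k.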
Another issue to be taken into account is that even for a highest-quality dataset, there is little hope to encounter a tag for every k-mer in the protein sequence(s). In order to overcome potential complications caused by the absence of some tags, we extend the concept of tag convolution from k-mers to longer strings, and the procedures outlined below capitalize on this generalization.
In order to define tag convolution for strings, we need to introduce two auxiliary operations that apply to tag convolution for k-mers. The first one is a shift by a value of δ, which transforms $\tau(w_1, w_2)$ into the set

$$\tau_\delta(w_1, w_2) = \{ (d_1 + \delta, m_1), \ldots, (d_h + \delta, m_h) \}.$$

The second operation is a merge of the outputs of tag convolution for two pairs of k-mers; typically, at least one of those will be appropriately shifted so that the two sets of offset differences would presumably match each other. For example, merging the outputs of $\tau(w_1, w_2)$ and $\tau_\delta(u_1, u_2)$ comprises merging the respective two sets of offset differences; for a difference that occurs in both sets, its multiplicity in the resulting set $\tau(w_1, w_2) \bullet \tau_\delta(u_1, u_2)$ is calculated as the sum of those in the original sets, while a difference contained in precisely one set simply inherits its corresponding multiplicity.
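The two operations can be sketched as follows, assuming that a tag convolution output is stored as a list of (difference, multiplicity) pairs, that a shift adds δ to every difference, and that two differences are identified when they agree within a tolerance. All names and numeric values are hypothetical.

```python
def shift(conv, delta):
    """Shift every offset difference by delta (assumed sign convention: d + delta)."""
    return [(d + delta, m) for d, m in conv]

def merge(conv_a, conv_b, tol=0.02):
    """Merge two outputs; multiplicities of differences matching within tol are summed."""
    merged = [[d, m] for d, m in conv_a]
    for d, m in conv_b:
        for entry in merged:
            if abs(entry[0] - d) <= tol:
                entry[1] += m      # difference present in both sets
                break
        else:
            merged.append([d, m])  # difference present in one set only
    return sorted((d, m) for d, m in merged)

# Hypothetical outputs for two pairs of k-mers.
tau_w = [(114.04, 3), (200.0, 1)]
tau_u = [(301.10, 2), (50.0, 1)]
merged = merge(tau_w, shift(tau_u, -187.06))
```

After the shift, the dominant value of the second output coincides with that of the first one, and their multiplicities add up to 5 in the merged set.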
Assuming that $s_1$ and $s_2$ are substrings of s and $s_1$ precedes $s_2$ in s, let $s^*$ denote the substring of s separating $s_1$ and $s_2$. Then, $\tau_\delta(w_1, w_2)$ essentially provides us with a set of weighted estimates of the mass $\mathrm{Mass}(s^*)$ of $s^*$ computed from $w_1$ and $w_2$, and $T(s_1, s_2)$ combines them all together, thus providing a set of such estimates obtained from the entire strings $s_1$ and $s_2$ and their reversed copies $\overline{s}_1$ and $\overline{s}_2$. Suppose we trust the correctness of $s_1$ and $s_2$ but doubt that of $s^*$. Then, the presence of $\mathrm{Mass}(s^*)$ in $T(s_1, s_2)$ with a high multiplicity would serve as an argument that $s^*$ is correct, while its absence from $T(s_1, s_2)$ or occurrence in $T(s_1, s_2)$ with a low multiplicity would be a "warning alarm." This simple idea underlies the sequence validation procedures outlined in the next section.
Observe that the above-mentioned drawback of using short tags is significantly reduced for tag convolution applied to long enough amino acid strings $s_1$ and $s_2$. Indeed, even though for a pair of k-mers $w_1$ and $w_2$ cut out from $s_1$ and $s_2$, or $\overline{s}_2$ and $\overline{s}_1$, respectively, an incorrect offset difference may dominate in $\tau(w_1, w_2)$, it is unlikely that the same value will also appear with a high multiplicity in the output of tag convolution for other pairs of k-mers contributing to $T(s_1, s_2)$. On the contrary, the correct value should be produced with a relatively high multiplicity for each pair of k-mers from $s_1$ and $s_2$, or $\overline{s}_2$ and $\overline{s}_1$, respectively, that both belong to $\mathcal{K}$; consequently, it is expected to dominate in $T(s_1, s_2)$.

Sequence Validation
It is not uncommon that the amino acid strings generated by a de novo sequencing algorithm contain erroneous amino acids or even are entirely wrong. We propose the following method for validating de novo strings using bottom-up tag convolution.
Let $s = a_1 \ldots a_n$ be an amino acid string subject to validation. With each amino acid $a_g$ of s except for the first and last k ones, we associate its tag score $\theta(a_g)$, where $k < g \le n - k$. With each amino acid $a_h$ of s, we associate its k-mer score $\kappa(a_h)$, where $1 \le h \le n$.
The tag score $\theta(a_g)$ equals the multiplicity of $\mathrm{Mass}(a_g)$ in the tag convolution $T(s_l, s_r)$ of the two substrings $s_l$ and $s_r$ of s located immediately to the left and right of $a_g$, respectively. It should be noted, however, that the farther apart the two k-mers $w_1$ and $w_2$ are within s, the less accurate the output of $\tau(w_1, w_2)$ might be and, consequently, the less reliable the contribution of the pair $(w_1, w_2)$ to $T(s_l, s_r)$. In order to prevent potential errors introduced by such pairs of tags, we impose an upper bound L on the length of $s_l$ and $s_r$, thus setting $s_l = a_{\max\{1, g-L\}} \ldots a_{g-1}$ and $s_r = a_{g+1} \ldots a_{\min\{n, g+L\}}$.
The k-mer score of an amino acid of s can be either 0 or 1. Initially, all of those are set to zero. Now, suppose that at the time of calculating $\theta(a_g)$, a pair $w_1 = a_i \ldots a_{i+k-1}$ and $w_2 = a_j \ldots a_{j+k-1}$ of k-mers from $s_l$ and $s_r$, respectively, contributed the value of $\mathrm{Mass}(a_g)$ to $T(s_l, s_r)$, where $1 \le i \le g - k$ and $g < j \le n - k + 1$. On the one hand, this boosts confidence in $a_g$; on the other hand, this also favours the amino acids composing $w_1$ and $w_2$. To recognize this fact, the k-mer score of each of $a_i, \ldots, a_{i+k-1}, a_j, \ldots, a_{j+k-1}$, if still zero, is raised to 1.
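The scoring scheme can be sketched as below under several simplifying assumptions: only pairs of k-mers read in the forward direction are considered (reversed copies are omitted), each pair of tags whose offset difference matches $\mathrm{Mass}(a_i \ldots a_{j-1})$ within a tolerance counts as one contribution of $\mathrm{Mass}(a_g)$, and indices are 0-based. The function names, the parameter `max_len` (standing for L), and the toy tags are hypothetical.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def mass(seq):
    return sum(RESIDUE_MASS[a] for a in seq)

def validation_scores(s, tags, k=3, max_len=10, tol=0.02):
    """Return (theta, kappa): tag scores keyed by 0-based index and k-mer scores as a list."""
    n = len(s)
    offsets = {}
    for string, offset in tags:
        offsets.setdefault(string, []).append(offset)
    theta, kappa = {}, [0] * n
    for g in range(k, n - k):                 # 1-based: k < g <= n - k
        lo = max(0, g - max_len)              # start of s_l
        hi = min(n, g + 1 + max_len)          # end (exclusive) of s_r
        score = 0
        for i in range(lo, g - k + 1):        # k-mers inside s_l
            for j in range(g + 1, hi - k + 1):  # k-mers inside s_r
                # Matching Mass(a_i ... a_{j-1}) is equivalent to the shifted
                # difference being equal to Mass(a_g).
                expected = mass(s[i:j])
                hits = sum(
                    1
                    for o1 in offsets.get(s[i:i + k], [])
                    for o2 in offsets.get(s[j:j + k], [])
                    if abs(o2 - o1 - expected) <= tol
                )
                if hits:
                    score += hits
                    for p in list(range(i, i + k)) + list(range(j, j + k)):
                        kappa[p] = 1
        theta[g] = score
    return theta, kappa

# Toy input: three forward (b-ion-like) 3-tags from one hypothetical spectrum.
s = "AVTDPVLSGNATSMPGST"
tags = [(s[p:p + 3], mass(s[:p])) for p in (3, 4, 12)]  # DPV, PVL, SMP
theta, kappa = validation_scores(s, tags)
```

Because reversed copies are ignored here, the tag score obtained for N-10 in this sketch is 2 rather than the value reported for the full toy example in the text.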
As an example, consider a toy protein with the amino acid sequence s = AVTDPVLSGNATSMPGST from which four spectra were acquired (see Figure 3). The red and blue peaks correspond to b-ions and y-ions, respectively. In total, there are six 3-tags, out of which three are based on b-ions (those labeled with DPV, SMP, and PVL, respectively) and the other three are based on y-ions (those labeled with STA, VPD, and GSL, respectively). For each amino acid of s, the tag and 3-mer scores calculated from those 3-tags are listed in Table 1. In particular, the tag score of N-10 is obtained as the multiplicity of its mass 114 in the output of T(AVTDPVLSG, ATSMPGST). Since 114 occurs precisely once in the output of each of the four contributing (appropriately shifted) tag convolutions for pairs of 3-mers, the tag score of N-10 equals 4 (see Figures 3 and 4). Furthermore, the 3-mer score of N-10 is 0: this can be deduced immediately since it is not covered by any 3-tag. On the contrary, the 3-mer score of each amino acid covered by some tag that together with another one contributed to the tag score of N-10 (namely, D-4, P-5, V-6, L-7, S-8, G-9, A-11, T-12, S-13, M-14, and P-15) can be immediately set to 1.

Table 1. The tag and 3-mer score for each amino acid of the protein sequence from the toy example provided in Figure 3.

For a small enough tag length k, the introduced scores of the amino acids composing a correct string s are usually all positive, except for the k-mer score of the middle amino acid $a_{k+1}$ of a string s of length 2k + 1, which is necessarily zero (in this case, $a_{k+1}$ is the only amino acid of s for which the tag score is defined). Should a few similar interpretations have been proposed, e.g., for some spectrum, incorrect interpretations may occasionally also possess this property; however, the correct one will typically have a larger sum of the tag scores of its amino acids.

Datasets
We benchmarked our algorithms on bottom-up datasets acquired from carbonic anhydrase 2 (CAH2) and alemtuzumab [32]; brief details are provided below.
CAH2 solution was reduced with dithiothreitol (DTT), alkylated with iodoacetamide, digested overnight with trypsin, GluC or Lys-C, and analyzed using a nanoLC system coupled to a Thermo Q-Exactive mass spectrometer. MS and MS/MS spectra were collected at a resolution of 70,000 and 17,500, respectively. In total, 177,741 HCD MS/MS spectra were acquired (trypsin: 91,747 spectra; GluC: 43,026 spectra; Lys-C: 42,968 spectra).
Alemtuzumab solution was reduced with DTT, alkylated with iodoacetamide, digested overnight with trypsin, proteinase K or pepsin, and analyzed by a nanoLC system coupled with a Thermo LTQ Orbitrap XL mass spectrometer. MS spectra were collected at a resolution of 15,000. For every precursor, both HCD and CAD ion-trap spectra were recorded; HCD MS/MS spectra were collected at a resolution of 7500. In total, 3695 pairs of HCD and CAD MS/MS spectra were collected (trypsin: 1358 spectra; proteinase K: 1052 spectra; pepsin: 1285 spectra). Only HCD MS/MS spectra were used to compute tag convolution and perform de novo sequence validation.

Sequence Validation
The input spectra were deconvoluted with MS-Deconv [27] using the default parameters and preprocessed; the latter amounted to reflecting peaks and merging nearby ones, as described in [30]. Subsequently, we applied the Twister approach [30], initially developed for the top-down case, to generate from them a set of de novo strings, and through searching those with BLAST against the non-redundant database, again following [30], detected and identified 32 and 2 contaminants in the CAH2 and alemtuzumab sample, respectively. The lists of contaminants are provided in Appendices A and B.
Subsequently, we ran PepNovo+ [33][34][35] on either dataset, with the fragment and precursor mass tolerance of 0.01 and 0.05 Da, respectively, and a fixed post-translational modification C+57. For CAH2 and alemtuzumab, 55,156 and 2471 spectra were thereby interpreted, respectively, in up to 20 ways each. A total of 806,934 and 38,936 de novo sequences of length at least seven were generated for CAH2 and alemtuzumab, respectively, among which 90,891 and 1765 were correct, respectively (i.e., represented a sequence fragment of either a target protein or contaminant).
Furthermore, we generated from either dataset a set of 3-tags as described in Section 2.1 using the mass tolerance of ε = 4 mDa. The obtained 419,136 and 7945 3-tags for CAH2 and alemtuzumab, respectively, were then used by the sequence validation procedure to evaluate de novo strings. When comparing the values output by tag convolution with the corresponding amino acid masses, we used an error tolerance of 0.02 Da.
When validating the de novo strings, we first restricted our attention to those with associated scores that are all positive. Next, for each spectrum, we sorted such strings (if any) by decreasing sum of the tag scores of their amino acids and iteratively eliminated for each string s all the subsequent strings $s'$ such that the following is the case: Here, Length(s) and Length($s'$) denote the lengths of the strings s and $s'$, respectively. As a final step, all the strings of length 7 with the middle tag score less than h were eliminated. For CAH2, the threshold h was set to 300, implying that approximately 37.67% of the sequences having length 7 were retained. However, for alemtuzumab, since the number of 3-tags was rather small, we set h to 1 so that all the strings of length 7 still under consideration were actually retained.
In this manner, we were left with 104,211 and 1559 sequences for CAH2 and alemtuzumab, respectively, among which 79,451 and 1323 were correct, respectively. Thus, approximately 87.41% and 74.96% of the correct sequences were retained for CAH2 and alemtuzumab, respectively, while the fraction of those (in a corresponding set) increased from 11.26% and 4.53% to 76.24% and 84.86%, respectively.
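The percentages quoted above follow directly from the reported counts, as the short arithmetic check below illustrates; the dictionary layout and variable names are chosen for this illustration only.

```python
# Counts reported in the text: (CAH2, alemtuzumab).
generated = {"CAH2": 806_934, "alemtuzumab": 38_936}       # de novo strings of length >= 7
correct_before = {"CAH2": 90_891, "alemtuzumab": 1_765}    # correct among the generated ones
retained = {"CAH2": 104_211, "alemtuzumab": 1_559}         # strings that passed validation
correct_after = {"CAH2": 79_451, "alemtuzumab": 1_323}     # correct among the retained ones

def pct(part, whole):
    return round(100.0 * part / whole, 2)

summary = {
    name: {
        "correct_retained": pct(correct_after[name], correct_before[name]),
        "precision_before": pct(correct_before[name], generated[name]),
        "precision_after": pct(correct_after[name], retained[name]),
    }
    for name in generated
}
```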
The detailed statistics on the de novo strings generated from either dataset are provided in Table 2.

Table 2. Statistics on the de novo strings for CAH2 and alemtuzumab. During validation, first the strings with associated scores that were all positive (necessarily of length above 7) were selected and made subject to filtration based on the alignment procedure described in the main text. Furthermore, the strings of length 7 were handled separately, and those with the middle tag score at least h were selected. The strings with length above and precisely 7 were retained upon alignment-based and middle tag score-based filtration, respectively, and they composed the set of strings that passed the validation procedure. The threshold h on the middle tag score was set to 300 and 1 for CAH2 and alemtuzumab, respectively. The details on the strings selected at some stage of the validation procedure are highlighted in bold. The percentage of the correct strings is given with respect to the total number of strings available upon completion of the respective stage.

The TagConvolution Software Tool
The proposed approach was implemented in a Java tool TagConvolution, which is freely available at http://bioinf.spbau.ru/en/twister/tag-convolution (accessed on 8 November 2021), along with the sample input and output files.
The program takes as input two directories: one storing the file(s) containing the tandem mass spectra deconvoluted with MS-Deconv, which will be used by the validation procedure for tag generation, and the other storing the file(s) with the amino acid sequences to be validated. The sequence files are either generated as output by PepNovo+ [34] or contain lists of candidate interpretations of the input spectra in a very simple format illustrated in the sample file TagConvolutionSampleInput.txt.
The tag generation strategy is the same as that used within the Twister de novo sequencing approaches [30,36]. Consequently, the TagConvolution tool inherits the following input parameters of Twister: the tag length k, the mass tolerance applied when retrieving tags, and two flags indicating whether peak reflection and water-loss peak elimination should be performed. Further details can be found in [30].
Moreover, the user specifies the mass tolerance used by the sequence validation procedure when matching tag convolution values to the respective amino acid masses and the threshold on the minimum tag score of the middle amino acid in a string of length 2k + 1.
For each input file InputFileName.txt, two output files InputFileName.valid.txt and InputFileName.scores.txt are produced. For each MS/MS spectrum, at least one interpretation of which was classified as valid, all such candidate sequences are listed in the former file, and their associated tag and k-mer scores are provided in the latter.
The TagConvolution tool performs quite fast: in particular, on a modern laptop, the entire CAH2 dataset was processed in approximately 90 s.

Discussion
We have introduced the concept of tag convolution and demonstrated its utility in validating candidate tryptic peptide sequences based on a set of bottom-up MS/MS spectra collected at a high resolution. In practice, enzymes of any specificity can be used for digesting the target protein. Neither the protein size nor the peptide amino acid composition matters after digestion. The developed method can process sets of CID/CAD, ETD/ECD, or HCD MS/MS spectra acquired from the peptides subject to analysis.
In particular, this approach represents an elegant method for verifying de novo sequencing results using the same data, from which they were derived, yet it differs in processing. The proposed procedure can be easily adapted for localizing and identifying post-translational modifications (PTMs) in proteins or peptides: If for two disjoint sequence fragments, the value with the highest multiplicity output by tag convolution is not consistent with the sum of masses of the amino acid residues in-between, this likely points to one or a few PTMs that occurred on (some of) those, and the difference between the theoretically expected and observed value can be used to characterize the putative PTMs.
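As a rough sketch of how such a mass discrepancy could be interpreted (this is not part of the TagConvolution tool), the snippet below compares the dominant tag convolution value with the theoretically expected one and looks the difference up in a small table of common modification masses. The function name, the table contents, the tolerance, and the example values are all assumptions made for illustration.

```python
# Monoisotopic mass shifts (Da) of a few common modifications (illustrative subset).
PTM_MASS = {
    "methylation": 14.01565,
    "oxidation": 15.99491,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}

def putative_ptm(conv, expected, tol=0.02):
    """Compare the highest-multiplicity difference with the expected mass.

    Returns None if they agree within tol; otherwise returns the mass
    discrepancy and the names of modifications consistent with it.
    """
    observed, _ = max(conv, key=lambda pair: pair[1])
    delta = observed - expected
    if abs(delta) <= tol:
        return None
    names = [n for n, m in PTM_MASS.items() if abs(delta - m) <= tol]
    return delta, names

# Hypothetical case: the residues S, G, N are expected between two fragments.
expected = 87.03203 + 57.02146 + 114.04293
modified = putative_ptm([(338.0628, 7), (258.0964, 1)], expected)
unmodified = putative_ptm([(258.0964, 5)], expected)
```

In this made-up example, the dominant value exceeds the expected mass by about 79.966 Da, which is consistent with a single phosphorylation; a discrepancy matching no table entry would yield an empty list of names and would require further inspection.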
Additionally, bottom-up tag convolution can be applied for appropriately gluing together overlapping aggregated strings (protein sequence fragments derived from top-down spectra as described in [30,36]), assuming that bottom-up data were collected as well. We will use this to further extend the Twister algorithm for de novo sequencing of proteins.
We implemented the sequence validation procedure in a standalone computer program freely available at http://bioinf.spbau.ru/en/twister/tag-convolution (accessed on 8 November 2021), along with the sample input and output for the computational experiments described in this paper (however, the underlying tag generation strategy is the same as that used within Twister [30,36]). Another direction for future work is the development of a more sophisticated software system for validating and possibly correcting amino acid sequences subject to examination.
Finally, we note that top-down deconvolution tools, including MS-Deconv, may not recognize some "good" isotopic envelopes in bottom-up MS/MS spectra because they differ in shape from those in top-down spectra. Consequently, several tags present in the original spectra may become lost at time of deconvolution. Therefore, it would be beneficial to adapt the scoring function employed by MS-Deconv for evaluating candidate isotopic envelopes in the case of high-resolution bottom-up mass spectrometry data so as to further enhance reliability of the proposed approach.
Funding: This research was funded by Ministry of Science and Higher Education of the Russian Federation (project 0791-2020-0011).

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
MS/MS: Tandem mass spectrometry
CAH2: Carbonic anhydrase 2
DTT: Dithiothreitol
HCD: Higher-energy C-trap dissociation
CAD: Collisionally activated dissociation