1. Introduction
Post-translational modifications (PTMs) are changes in protein properties after they are synthesized, affecting both structure and function. These modifications add new properties that go beyond the initial role set by amino acids’ basic characteristics, greatly influencing various cellular processes, such as modulations in protein and molecular interactions, the regulation of gene expression, localization, and signaling [
1] (pp. 255–261).
There are two categories of PTMs. The first includes changes resulting from the covalent addition of a modifying group, such as methyl, phosphoryl, acetyl, or glycosyl. Notably, many of these modifications are reversible. The second category refers to alterations resulting from proteolytic cleavage (proteolysis), in which peptide bonds are broken to yield smaller peptides or amino acids [
2,
3]. According to Bradley [
4], advances in mass spectrometry have contributed to the discovery of over 600 different PTM classes, and Ramazi and Zahiri [
2] highlighted the ten most studied PTMs: phosphorylation, acetylation, ubiquitination, methylation, glycosylation, SUMOylation, palmitoylation, myristylation, prenylation, and sulfation. This study focuses on phosphorylation, methylation, acetylation, and ubiquitination, which are the PTMs most frequently occurring in inherited human diseases.
Phosphorylation, the process of adding a phosphate group (PO₄³⁻) from ATP, is one of the most studied types of PTM. Target residues for phosphorylation include S (serine), T (threonine), and Y (tyrosine) [
5]. Phosphorylation controls cellular functions, such as signal transduction, metabolism, and cell division [
6]. In the context of cardiomyopathy, phosphorylation has an impact on calcium sensitivity in the heart tissue that leads to slow muscle contraction and eventually heart failure [
7].
Methylation, like phosphorylation, is also a reversible process that is characterized by the addition of a methyl group (-CH3) to target amino acids. The process is facilitated by methyltransferases, and the donor of the methyl group is S-Adenosyl Methionine (SAM) [
2,
3]. Candidate residues for methylation are K (lysine) and R (arginine). Histones are the prime target for methylation. From a biomedical perspective, methylation affects chromatin organization and gene expression, with a primary focus on gene regulation. Genetic changes associated with cardiomyopathy may lead to alterations in heart tissue structure and malfunction [
8].
Acetylation is a process of adding an acetyl group (-COCH3) to the target amino acid, which is usually lysine (K). This process is catalyzed by lysine acetyltransferases (KATs), and acetyl-CoA serves as the donor. Similar to methylation, acetylation targets histones and affects gene expression and chromatin structure [
9]. Altered acetylation has been associated with cardiovascular diseases, such as cardiomyopathy. The presence of acetylation on proteins involved in mitochondrial function may alter energy metabolism, resulting in an energy deficit in heart cells. Acetylation also has an impact on the transcription factors responsible for cardiac hypertrophy and fibrosis [
10].
Ubiquitination is a PTM in which ubiquitin, a protein of 76 amino acids, is attached to a target protein. This is a cascade process carried out by three enzyme types: E1 (ubiquitin-activating enzyme), E2 (ubiquitin-conjugating enzyme), and E3 (ubiquitin ligase) [
11]. Ubiquitination occurs mostly on K (lysine) and can trigger proteasomal degradation or regulate DNA repair, endocytosis, and signal transduction [
12]. In the context of cardiomyopathy, these processes contribute to heart inflammation and fibrosis, which affect heart muscle remodeling and failure [
13].
The biological importance of PTMs has led to the creation of computational and AI-driven approaches for their identification. Several machine learning (ML) tools, including MusiteDeep, PTMGPT2, and SiteTack, have been designed for the computational detection of PTMs. Nonetheless, it is crucial to note that the results of these models do not always align with experimental evidence.
Hypertrophic cardiomyopathy (HCM) is a genetic cardiac disorder characterized by unexplained thickening of the myocardium, most often the interventricular septum, in the absence of other causes, such as hypertension or valvular disease [
14]. The pathology is characterized by several changes in the heart tissue, including myocyte hypertrophy and disarray (enlargement and disorganization of heart muscle cells), interstitial fibrosis (scarring between cells), and small-vessel disease (impairment of small blood vessels) [
15]. In some cases, left ventricular outflow tract (LVOT) obstruction is also present, leading to obstruction of the heart’s main blood flow [
16]. Some patients are asymptomatic, whereas others experience dyspnea, chest pain, or fainting. Potential complications include atrial fibrillation (raising the risk of stroke), ventricular arrhythmias, and ultimately, sudden cardiac death [
17]. Although HCM is primarily caused by mutations in sarcomeric proteins, PTMs can strongly modify disease severity, progression, and phenotype, thereby representing an important area of investigation [
18].
In this paper, we aimed to benchmark the PTM prediction performance of three ML-based applications (MusiteDeep, PTMGPT2, and SiteTack) for phosphorylation, methylation, acetylation, and ubiquitination on a small panel of sarcomeric proteins commonly associated with HCM [
19]. The analysis of PTM distribution in the human body, based on PTMD 2.0 [
20], supports our selection of PTM types: phosphorylation is the most frequent modification, followed by ubiquitination, acetylation, and methylation (
Figure 1).
The experiments utilized a dataset comprising protein sequences from the
MYH7,
MYBPC3,
TNNT2, and
TNNI3 genes, which encode key cardiac sarcomere proteins. MYH7 encodes the β-myosin heavy chain, a major motor protein of the cardiac thick filament vital for cardiac force generation.
MYBPC3 encodes cardiac myosin-binding protein C, which supports sarcomere structure and regulates cardiac contraction and relaxation.
TNNT2 (cardiac troponin T) and
TNNI3 (cardiac troponin I) are components of the troponin complex on the thin filament, which modulates calcium-dependent contraction [
21]. Additionally, this study examined whether any novel PTM sites are consistently predicted by three machine learning (ML) tools but not yet experimentally confirmed in PhosphoSitePlus. We identified five high-priority PTM candidates suitable for targeted experimental validation: MYH7 (K1451 acetylation, K129 methylation) and MYBPC3 (T705 phosphorylation, K14 acetylation, R44 methylation).
2. Materials and Methods
In this study, we aim to evaluate the performance of three ML applications: MusiteDeep, developed by researchers at University of Missouri-Columbia, USA and Jilin University, China (
https://www.musite.net/, accessed on 5 September 2025), PTMGPT2, hosted via the National Supercomputing Center for Life Sciences, Jeonbuk National University, South Korea (
https://nsclbio.jbnu.ac.kr/tools/ptmgpt2/, accessed on 5 September 2025), and SiteTack, developed by researchers at Massachusetts Institute of Technology and Broad Institute of MIT and Harvard (
https://sitetack.net/, accessed on 5 September 2025), through a case study of four proteins associated with HCM: MYH7, MYBPC3, TNNT2, and TNNI3. Additionally, we explored their ability to identify previously unreported PTM sites that could be designated as high-priority candidates for further experimental validation. In
Table 1, we provide details on protein full names, NCBI RefSeq ID, length, gene-coding symbol and chromosomal location. We aimed to benchmark the PTM predictive performance of MusiteDeep, PTMGPT2, and SiteTack across four PTM types: phosphorylation, acetylation, ubiquitination, and methylation. MusiteDeep and SiteTack are trained on large collections of PTM-annotated protein sequences (primarily from UniProt/Swiss-Prot and previously curated PTM datasets) rather than on specific protein families, covering multiple modification types [
22,
23]. PTMGPT2 reports a defined large-scale training dataset of roughly 388,000 annotated PTM instances used to fine-tune its language model [
24]. We aim to provide comprehensive predictive performance evaluations for each sarcomeric HCM-associated protein and for each PTM type.
We used PhosphoSitePlus v6.8.1 (
https://www.phosphosite.org/) as a benchmarking reference, which is a manually curated, interactive database that focuses on experimentally observed PTMs [
25]. PhosphoSitePlus is a widely used PTM annotation resource that covers a vast number of PTM sites across various types. It includes hundreds of thousands of unique, non-redundant modification sites, curated from more than 22,000 articles and numerous mass spectrometry datasets [
26].
Figure 2 shows an example visual output of the PhosphoSitePlus tool for the MYH7 protein.
The selection of the PTM prediction tools included in the benchmarking process was based on three criteria:
- (1) They can provide coverage of multiple PTMs at once.
- (2) They can be accessed directly as a web application.
- (3) They employ contemporary ML methodologies.
This approach is intended to ensure that selected tools are both efficient and easy to use. Based on criteria 1–3, we have chosen: MusiteDeep [
22], PTMGPT2 [
24] and SiteTack [
23]. MusiteDeep utilizes deep learning, specifically convolutional neural networks (CNNs) with attention mechanisms, to predict PTM sites in proteins [
22]. PTMGPT2 is based on the GPT2 architecture and employs a transformer-based approach, fine-tuned with prompt-based learning to predict PTM sites [
24], while SiteTack incorporates deep learning models that utilize known PTM sites as part of the input encoding, enhancing prediction accuracy [
23].
Table 2 provides an overview of the selected tools.
PhosphoSitePlus records HTPs (High Throughput Papers), representing modification sites reported solely through proteomic discovery mass spectrometry, and LTPs (Low Throughput Papers), indicating sites identified by other methods. To enhance confidence and reliability, we treated PTM sites from PhosphoSitePlus that appeared in more than one HTP study (HTP > 1) as the reference set of true positives for evaluation.
We used a probability threshold of 0.6 to predict positives in both the MusiteDeep and SiteTack models, aligning with the models’ confidence levels to facilitate comparison with the PhosphoSitePlus reference sites (HTP > 1). The PTMGPT2 tool employs a generative language model, which does not produce explicit prediction scores; thus, we relied on its default output without thresholding. The ML tools were compared using the following metrics: precision, recall, and F1-score. Precision indicates the proportion of correct positives among all predicted positives, recall shows the percentage of actual positives correctly identified, and F1-score balances both. These metrics offer a comprehensive performance assessment, especially in datasets with class imbalance.
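The three metrics can be computed directly from sets of residue positions. The following is a minimal sketch, assuming that the predicted and reference sites for one protein and one PTM type are available as sets of 1-based positions. The function name `site_metrics` and the example positions are hypothetical and are not taken from the study's pipeline.

```python
def site_metrics(predicted, reference):
    """Return (precision, recall, f1) for predicted vs. reference site sets."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and experimentally supported
    fp = len(predicted - reference)   # predicted but absent from the reference set
    fn = len(reference - predicted)   # reference sites the tool missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical positions, for illustration only.
p, r, f1 = site_metrics(predicted={10, 20, 30, 40}, reference={20, 40, 50})
print(round(p, 3), round(r, 3), round(f1, 3))
```

Returning 0.0 when a denominator is empty is one possible convention for proteins without predictions or reference sites; whether the study handled such cases this way is not stated.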
Our methodological framework consists of several steps, performed for each protein of interest, including: downloading FASTA files from NCBI, obtaining prediction sites from PhosphoSitePlus, as well as from the three ML tools, gathering and preprocessing the data, and at the end calculating the statistics and visualizing the findings (
Figure 3). The pipeline has been implemented in Python 3.10, using pandas, seaborn and matplotlib as key libraries.
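One preprocessing step that such a pipeline typically needs is a check that each reported site label agrees with the protein sequence. The sketch below is illustrative only: it uses the Python standard library rather than the libraries named above, the helper names `read_fasta` and `site_matches` are hypothetical, and the sequence is a toy example, not one of the studied proteins.

```python
import re


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def site_matches(sequence, label):
    """Check that a label such as 'K4' names the residue at that 1-based position."""
    m = re.fullmatch(r"([A-Z])(\d+)", label)
    if not m:
        return False
    residue, pos = m.group(1), int(m.group(2))
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == residue


# Toy sequence, not a real protein record.
toy = read_fasta(">toy_protein\nMSTKA\nYRK\n")
seq = toy["toy_protein"]
print(site_matches(seq, "K4"), site_matches(seq, "K5"))
```

A check of this kind helps catch numbering mismatches between tools, for example when outputs refer to different isoforms of the same protein.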
3. Results
Four PTM types (phosphorylation, acetylation, ubiquitination, and methylation) and four proteins (MYH7, MYBPC3, TNNT2, and TNNI3) are the focus of this study. In addition to computing precision, recall, and F1-scores, we analyzed the number of experimentally verified PTM locations at different HTP cutoffs, as well as the number of predicted PTM locations at varying score thresholds. Finally, we performed an analysis of the false positives.
3.1. PTM Identification with Threshold Adjustment
Figure 4 shows the number of experimentally verified PTM locations for each target protein at different HTP cutoffs. For the MYH7 protein, in the absence of any HTP filtering, PTM sites were identified for phosphorylation (105 sites), acetylation (42 sites), ubiquitination (2 sites), and methylation (1 site) (
Figure 4a). When an HTP cutoff greater than 1 was applied, the number of detected sites decreased to 83 for phosphorylation, 34 for acetylation, and 1 for ubiquitination, while methylation sites were no longer detected (
Figure 4a). With further increases in the HTP cutoff threshold, the number of PTM sites continued to decline, retaining only phosphorylation and methylation sites. In MYBPC3, only acetylation and phosphorylation sites were observed. One acetylation site has been experimentally verified for MYBPC3, independent of the HTP cutoff, whereas the number of experimentally detected phosphorylation sites decreased as the HTP cutoff increased (
Figure 4b). A total of 22 phosphorylation sites were validated by at least one HTP, decreasing to 19 sites at HTP > 2 and 15 sites at HTP > 3, and continuing to decline with increasing HTP cutoff, with only 9 phosphorylation sites at HTP > 10 (
Figure 4b). For the TNNT2 protein, two experimentally verified PTM types were identified: acetylation and phosphorylation (
Figure 4c). As the HTP cutoff increased, only phosphorylation sites remained, with 5 sites detected at HTP > 1 and only a single phosphorylation site at HTP > 2 (
Figure 4c). In the TNNI3 protein, 3 types of PTMs were observed: phosphorylation, acetylation, and methylation, corresponding to 17, 6, and 2 sites, respectively (
Figure 4d). Applying an HTP cutoff greater than 1 reduced these numbers to 15 phosphorylation, 5 acetylation, and 1 methylation site (
Figure 4d). As the HTP threshold increased further, a decline in modification sites was observed, leaving 12 phosphorylation and 3 acetylation sites at HTP > 2, and ultimately only phosphorylation sites at higher cutoffs (
Figure 4d).
The results from PhosphoSitePlus, with cutoff HTP > 1, which was selected as the threshold for further analysis to ensure higher confidence in site verification, showed that not every PTM type can be found in all four proteins. Phosphorylation was the most frequent PTM type and was observed in all four proteins (
Figure 4). Acetylation was identified in MYH7, MYBPC3, and TNNI3, as shown in
Figure 4a,b,d. Ubiquitination was identified only in MYH7 (
Figure 4a), while methylation was identified only in TNNI3 (
Figure 4d).
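The counts per HTP cutoff described above amount to a simple filter over the reference records. The sketch below assumes each record carries a PTM type and the number of HTP reports; the field names `ptm` and `htp`, the function name, and the records themselves are hypothetical and do not reproduce the PhosphoSitePlus export format.

```python
from collections import Counter


def count_sites_by_cutoff(records, cutoffs):
    """Count sites per PTM type whose HTP count exceeds each cutoff."""
    table = {}
    for cutoff in cutoffs:
        kept = Counter(r["ptm"] for r in records if r["htp"] > cutoff)
        table[cutoff] = dict(kept)
    return table


# Hypothetical records, for illustration only.
records = [
    {"site": "S10", "ptm": "phosphorylation", "htp": 5},
    {"site": "T22", "ptm": "phosphorylation", "htp": 2},
    {"site": "K31", "ptm": "acetylation", "htp": 1},
    {"site": "R40", "ptm": "methylation", "htp": 0},
]
counts = count_sites_by_cutoff(records, cutoffs=[0, 1, 2])
for cutoff, per_type in counts.items():
    print(f"HTP > {cutoff}: {per_type}")
```

Because the filter is monotonic, a PTM type that disappears at one cutoff cannot reappear at a stricter one.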
MusiteDeep predicted 54 phosphorylation sites, 12 acetylation sites, 32 ubiquitination sites, and 1 methylation site for the MYH7 protein, at a threshold of 0.5. For a threshold of 0.6, the numbers changed to 48, 4, 13, and 1, respectively (
Figure 5a). As the threshold increased, only phosphorylation sites (showing a gradual decline in number) and a single methylation site remained (
Figure 5a). Notably, the methylation site was not reported for this protein in PhosphoSitePlus (
Figure 4a). For the MYBPC3 protein, MusiteDeep predicted 28 phosphorylation, 14 acetylation, 10 ubiquitination, and 2 methylation sites at a threshold of 0.5 (
Figure 5b). When the score threshold was set to 0.6, the corresponding counts were 20, 8, 2, and 2, and they continued to decrease as the threshold increased (
Figure 5b). Although all four PTM types were detected for TNNT2 at a 0.5 threshold, increasing the threshold resulted in a reduction to only phosphorylation sites: 10 at 0.6, 6 at 0.7, and 1 at 0.8. For TNNI3, MusiteDeep predicted 17 phosphorylation sites, 1 methylation site, and 1 ubiquitination site at a cutoff score of 0.5 (
Figure 5d). For a PTM cutoff score of 0.6 and higher, only phosphorylation sites were predicted (
Figure 5d).
Figure 6 shows the number of PTM sites predicted by SiteTack. For MYH7, phosphorylation, acetylation, ubiquitination, and methylation sites were identified at a probability score cutoff of 0.6, with counts of 78, 40, 21, and 13, respectively (
Figure 6a). The number of predicted sites was higher at a score of 0.5 and decreased with increasing probability thresholds. Similar patterns were observed for MYBPC3, TNNT2, and TNNI3. At a probability score of 0.6, MYBPC3 was predicted to have 36 phosphorylation, 24 acetylation, 17 ubiquitination, and 14 methylation sites (
Figure 6b). TNNT2 has 12 phosphorylation, 6 acetylation, 2 ubiquitination, and 10 methylation sites (
Figure 6c), while TNNI3 was predicted to contain 15 phosphorylation, 6 acetylation, 8 ubiquitination, and 7 methylation sites at the same threshold.
The previously observed trend is consistent across both experimental and ML approaches: applying more rigorous cutoffs reduces the number of identified PTM sites.
Figure 7 shows the number of PTM sites predicted by PTMGPT2 for the target proteins. For MYH7, 38 phosphorylation sites, 58 acetylation sites, 6 ubiquitination sites, and 4 methylation sites were detected. MYBPC3 was predicted to contain 17 phosphorylation, 3 acetylation, 1 ubiquitination, and 3 methylation sites (
Figure 7). No methylation sites were detected for TNNT2 or TNNI3 (
Figure 7). For these proteins, PTMGPT2 predicted 5 phosphorylation, 5 acetylation, and 1 ubiquitination site for TNNT2, while TNNI3 had 8 phosphorylation, 2 acetylation, and 1 ubiquitination site.
3.2. Evaluation of Predictive Performance
For benchmarking, we took the PhosphoSitePlus experimentally verified PTM locations at HTP > 1 as true positives, while the threshold for predicted PTM positives was set to 0.6 for MusiteDeep and SiteTack. For PTMGPT2, we used the model’s default output, as it does not take a cutoff score.
We calculated the precision, recall, and F1-score for each protein and then averaged them across proteins; the averages are presented in
Figure 8. For phosphorylation, the dominant PTM type, all tools showed moderate precision with no statistically significant differences: MusiteDeep achieved the highest value, 0.554, 95% CI (0.137, 0.920), followed by PTMGPT2 with 0.527, 95% CI (0.099, 0.926), and SiteTack with 0.511, 95% CI (0.209, 0.806), as shown in
Figure 8a. SiteTack demonstrated the highest recall of 0.691 (
Figure 8a). Accordingly, SiteTack showed the best balance between precision and recall, achieving an F1-score of 0.565 (
Figure 8a). MusiteDeep failed to identify acetylation sites, while PTMGPT2 showed better performance compared to SiteTack (
Figure 8b). Average statistics for ubiquitination were relevant only for SiteTack, which demonstrated a very low precision of 0.012 and maximum recall (
Figure 8c). The summary statistics for methylation, which was represented by only one experimentally verified PTM location (in the TNNI3 protein), were not informative, since all tools failed to detect it (
Figure 8d).
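The text does not state how the 95% confidence intervals around the per-protein averages were obtained. One common option for a handful of per-protein values is a percentile bootstrap of the mean, sketched below purely as an assumption and not as the method used in this study. The function name, the resampling settings, and the four input values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a small sample."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Hypothetical per-protein precision values, for illustration only.
per_protein_precision = [0.2, 0.5, 0.7, 0.8]
mean, lower, upper = bootstrap_mean_ci(per_protein_precision)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With only four proteins, any such interval is wide, which is consistent with the overlapping intervals reported for the three tools.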
Figure 9 and
Figure 10 provide a closer insight into the results for phosphorylation and acetylation per protein, and the detailed information is presented in
Table 3, which contains the calculated statistics for the relevant proteins and PTM types. As we can see from the results in
Figure 9, MYBPC3 and TNNT2 were more problematic for prediction compared to MYH7 and TNNI3. MusiteDeep had the highest precision for MYBPC3 and MYH7, PTMGPT2 was the most precise for TNNI3, while SiteTack was the most precise for TNNT2 (
Figure 9a). In general, SiteTack showed the highest recall, reaching 1 for TNNT2 (
Figure 9b). Based on the F1-scores for phosphorylation (
Figure 9c), MusiteDeep and SiteTack outperformed PTMGPT2. The highest F1-score was observed for TNNI3, 0.786 by MusiteDeep, but for all other proteins, SiteTack’s F1-score was higher (
Table 3).
Figure 10 confirms that the best results for acetylation were provided by PTMGPT2 with an F1-score of 0.652 for MYH7, and 0.571 for TNNI3, compared with the results from SiteTack, with an F1-score of 0.162 and 0.363, respectively (
Table 3).
Ubiquitination was only identified by SiteTack for MYH7, with a precision of 0.048, a recall of 1 and an F1-score of 0.09 (
Table 3).
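These ubiquitination values can be reproduced from counts reported earlier: SiteTack predicted 21 ubiquitination sites for MYH7 at the 0.6 threshold, and PhosphoSitePlus lists one ubiquitination site for MYH7 at HTP > 1. The short check below assumes that this single reference site is among the 21 predictions, as the reported recall of 1 implies.

```python
predicted_sites = 21   # SiteTack ubiquitination predictions for MYH7 at threshold 0.6
reference_sites = 1    # PhosphoSitePlus ubiquitination sites for MYH7 at HTP > 1
true_positives = 1     # assumption: the reference site is among the predictions

precision = true_positives / predicted_sites
recall = true_positives / reference_sites
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.048 recall=1.000 f1=0.091
```

Averaging this precision with a value of 0 for each of the other three proteins gives 0.012, which matches the average reported for SiteTack, although the text does not state how proteins without ubiquitination reference sites were treated.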
Because MusiteDeep and SiteTack output probabilistic scores for PTM predictions, precision–recall (PR) and Receiver Operating Characteristic (ROC) curves for the main PTM types, phosphorylation and acetylation, were computed.
Figure 11 presents aggregated PR and ROC curves across the four analyzed proteins (MYH7, MYBPC3, TNNT2 and TNNI3), along with the corresponding AUC–ROC and AUC–PR values for each tool. Aggregated ROC and precision–recall analyses across all proteins revealed moderate predictive performance for phosphorylation sites for both tools. ROC analysis showed AUC–ROC values of 0.752 and 0.757 for MusiteDeep and SiteTack, respectively (
Figure 11a,c), indicating comparable discriminative ability. Consistently, precision–recall analysis (
Figure 11b,d) showed moderate performance for phosphorylation prediction (AUC–PR = 0.50–0.57), with MusiteDeep showing a slight improvement over SiteTack. In contrast, acetylation prediction performance was weak for both tools. ROC analysis produced AUC–ROC values close to random, 0.497 and 0.452 (
Figure 11a,c), and precision–recall analysis similarly indicated poor predictive performance of AUC–PR ≈ 0.11 (
Figure 11b,d). These results suggest that, while both predictors can moderately discriminate phosphorylation sites, they have limited ability to distinguish acetylated residues from non-modified sites in this dataset, likely reflecting the extreme class imbalance and the limited availability of experimentally validated acetylation sites.
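AUC–ROC can be read as the probability that a randomly chosen modified residue receives a higher score than a randomly chosen unmodified one, so a value near 0.5 means the ranking is close to random. The sketch below computes it by pairwise comparison, assuming each candidate residue has a tool score and a binary reference label. The function name and the toy data are hypothetical, and the text does not describe the implementation used in the study.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]
auc = auc_roc(scores, labels)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of residues, which is acceptable for four proteins; library implementations use a rank-based equivalent.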
We have additionally computed F1-optimal thresholds based on aggregated predictions across the analyzed proteins. In
Table 4, we summarize the AUC–ROC and AUC–PR values, as well as the best F1 thresholds, computed for both PTM types (phosphorylation and acetylation) and for both tools (MusiteDeep and SiteTack).
Although the threshold that maximizes the F1-score is lower (around 0.3), we have selected a classification threshold of 0.6 to prioritize prediction confidence. In the context of PTM site prediction, false positives can lead to unnecessary experimental validation, which is costly and time-consuming. Using a higher threshold increases precision at the expense of recall. This more conservative setting is further justified by the definition of the ground truth used in this study, which required true positive PTM sites to be supported by more than one independent experiment. Because this criterion favors highly reliable annotations and may exclude real but less frequently observed modifications, we opted for a stricter threshold to focus on the most confident predictions and produce a smaller, more reliable set of candidate PTM sites for downstream validation.
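The trade-off described here can be illustrated with a small threshold sweep. The sketch below assumes aggregated scores with binary reference labels and treats a score greater than or equal to the threshold as a positive prediction. The function name, the threshold grid, and the toy data are hypothetical.

```python
def best_f1_threshold(scores, labels, thresholds):
    """Return (best_f1, threshold); ties keep the first threshold in the list."""
    best = (0.0, None)
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[0]:
            best = (f1, t)
    return best


# Toy scores and labels, for illustration only.
scores = [0.95, 0.7, 0.65, 0.35, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0]
best_f1, best_t = best_f1_threshold(scores, labels, [0.3, 0.4, 0.5, 0.6, 0.7])
print(best_f1, best_t)
```

In this toy example the F1-optimal threshold is 0.3 (precision 3/5, recall 1), whereas a threshold of 0.6 gives higher precision (2/3) at lower recall (2/3), mirroring the rationale for the more conservative setting.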
3.3. Analysis of False Positive Predictions
We additionally analyzed the false positives, that is, the predicted PTM sites that were not supported by the high-confidence experimental reference data.
Figure 12 shows that SiteTack tends to produce the highest number of false positives (97 in MYH7, 81 in MYBPC3, 25 in TNNT2, and 23 in TNNI3), while MusiteDeep is the least prone to false positives.
False positives predicted by consensus of all three applications (MusiteDeep, PTMGPT2, and SiteTack) were also analyzed. We found a total of 11 consensus false positives for HTP > 1, of which 7 are phosphorylation sites, 2 are acetylation sites, and 2 are methylation sites. The phosphorylation sites are T1513 and S1735 in MYH7, and S18, Y79, S297, T307, and T705 in MYBPC3. The acetylation residues are K1451 (MYH7) and K14 (MYBPC3), while the methylation residues are K129 (MYH7) and R44 (MYBPC3). There were no consensus false positives in the TNNT2 and TNNI3 proteins. Of these 11 consensus false positives (for HTP > 1), five residues are completely absent from the experimentally validated set, regardless of the HTP value. Two are found in the MYH7 protein, K1451 (acetylation) and K129 (methylation), and three are found in MYBPC3, T705 (phosphorylation), K14 (acetylation), and R44 (methylation), as summarized in
Table 5 and visualized in
Figure 13.
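The two-stage selection described above, consensus false positives at HTP > 1 followed by sites absent at any HTP value, reduces to set operations. The sketch below assumes each site is keyed by a (PTM type, residue label) pair. The function name, the variable names, and the toy sites are hypothetical and do not correspond to the reported candidates.

```python
def consensus_false_positives(predictions_by_tool, reference):
    """Sites predicted by every tool that are absent from the reference set."""
    tool_sets = [set(sites) for sites in predictions_by_tool.values()]
    consensus = set.intersection(*tool_sets) if tool_sets else set()
    return consensus - set(reference)


# Toy predictions for a single hypothetical protein.
predictions = {
    "MusiteDeep": {("phos", "S10"), ("phos", "T22"), ("acet", "K31"), ("meth", "R40")},
    "PTMGPT2": {("phos", "S10"), ("phos", "T22"), ("acet", "K31"), ("phos", "Y50")},
    "SiteTack": {("phos", "S10"), ("phos", "T22"), ("acet", "K31"), ("meth", "R40")},
}
reference_htp_gt1 = {("phos", "S10")}               # high-confidence reference
reference_any = {("phos", "S10"), ("phos", "T22")}  # any experimental record

fp_htp_gt1 = consensus_false_positives(predictions, reference_htp_gt1)
novel = consensus_false_positives(predictions, reference_any)
print(sorted(fp_htp_gt1), sorted(novel))
```

Here `T22` is a consensus false positive only under the stricter reference, while `K31` has no experimental record at all and would be the kind of site flagged as a candidate for validation.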
MYH7 (β-myosin heavy chain) and
MYBPC3 (cardiac myosin-binding protein C) are the two most common sarcomere genes linked to HCM, responsible for ≈50% of familial HCM cases [
16]. To investigate the potential functional impact of the five PTM candidates: K129 and K1451 in MYH7 and K14, R44, T705 in MYBPC3 (
Table 5), we have analyzed publicly available protein databases, including UniProt, Protein Data Bank (PDB), and AlphaFold Protein Structure Database. MYH7 K1451 is in the coiled-coil tail of the protein (839–1935), which is involved in thick filament assembly. Acetylation at this residue would neutralize the positive charge of the lysine, which could alter thick filament backbone interactions or affect crucial binding mechanisms, such as titin binding or binding to other myosin tails. MYH7 K129 is located in the myosin motor (head) domain (85–778), which binds ATP, interacts with actin, and undergoes large conformational changes during force generation. The domain is also highly conserved. The functional impact of a PTM in this region is expected to be more significant than that of a PTM in the tail. MYBPC3 T705 lies within the central Ig/Fn domain region, or more precisely, in Ig-like C2-type 5 (645–771). Phosphorylation adds a negative charge to the structured domain, which may alter how MYBPC3 interacts with myosin heads or actin, with a possible impact on tension regulation. MYBPC3 K14 is in the N-terminal region, which is important for binding myosin and modulating actin interaction. Acetylation neutralizes the positive charge, which may weaken or shift regulatory binding interactions at the extreme N-terminus, potentially resulting in dysregulation. MYBPC3 R44 is also in the N-terminal region; methylation at this site could affect protein stability or the local interaction network. All five PTM sites were identified as priority candidates for further experimental validation, given their potential functional roles, which suggest they may significantly impact protein activity and contribute to key biological processes.
3.4. Proposed Experimental Validation
To experimentally validate the predicted post-translational modification (PTM) sites in MYH7 (K1451 acetylation, K129 methylation) and MYBPC3 (T705 phosphorylation, K14 acetylation, R44 methylation), several approaches can be employed. First, site-directed mutagenesis can generate non-modifiable or PTM-mimetic variants (e.g., MYH7 K1451R/K1451Q or MYBPC3 T705A/T705D), which can be introduced into human-induced pluripotent stem cell-derived cardiomyocytes (iPSC-CMs) to evaluate effects on sarcomere organization, calcium handling, and contractility using functional assays such as engineered heart tissues or traction force microscopy. iPSC-CM models are widely used to reproduce sarcomeric defects and contractile abnormalities observed in hypertrophic cardiomyopathy (HCM) caused by MYH7 or MYBPC3 alterations [
27]. Second, targeted proteomics approaches, including parallel reaction monitoring (PRM) or selected reaction monitoring (SRM), can be applied to cardiac tissue samples from HCM patients and controls to detect and quantify peptides containing these modified residues, thereby confirming whether these PTMs occur in vivo. Third, structural and interaction-based studies could be performed by combining AlphaFold structural modeling with biochemical assays, such as co-immunoprecipitation, to determine whether PTM-mimetic mutations alter interactions among MYH7, MYBPC3, and other sarcomeric proteins.
Confirming these PTMs could significantly refine our mechanistic understanding of sarcomere regulation in HCM. For example, phosphorylation of cardiac myosin-binding protein C is known to regulate the availability of myosin heads for actin interaction and influence calcium-dependent force generation, demonstrating how PTMs can directly modulate cross-bridge cycling and contractile dynamics [
28]. Similarly, alterations in MYH7 or MYBPC3 interactions within the thick filament can destabilize the super-relaxed state of myosin and increase the number of active cross-bridges, a mechanism thought to contribute to hypercontractility in HCM [
29]. Therefore, validating these candidate PTMs could reveal an additional regulatory layer in which dynamic biochemical modifications influence thick filament assembly, cross-bridge availability, and calcium sensitivity, thereby shaping the molecular mechanisms underlying HCM pathophysiology.
4. Discussion
In this study, we have selected three contemporary ML-based tools (MusiteDeep, PTMGPT2, and SiteTack), all accessible as web applications, capable of predicting post-translational modification (PTM) sites across four modification types of interest: phosphorylation, methylation, ubiquitination, and acetylation. We evaluated their performance using four protein sequences associated with HCM: MYH7, MYBPC3, TNNT2, and TNNI3. Although numerous PTM prediction tools have been reported in the literature, most are designed to detect a single modification type—for example, NetPhos [
30] and DeepNphos [
31] for phosphorylation, DeepAcet [
32] and DeepUbi [
33] for acetylation and ubiquitination, respectively, and GPS-MSP for methylation [
34]—thereby limiting their applicability for comprehensive multi-PTM analyses. Additionally, there are several tools for predicting residue-targeted PTMs, such as RMTLysPTM [
35] for predicting lysine-targeted PTMs: acetylation, crotonylation, methylation, and succinylation, and MUscADEL [
36], which is a deep bidirectional LSTM framework designed to predict eight lysine-targeted PTMs in human and mouse proteomes.
Our evaluation was based on predefined thresholds selected prior to testing, ensuring consistent and comparable results. Specifically, we required more than one HTP record for the experimentally verified PTMs obtained from PhosphoSitePlus and applied a prediction score threshold of 0.6 to the tools that report scores (MusiteDeep and SiteTack). The results from MYH7, MYBPC3, TNNT2 and TNNI3, representing a small panel of sarcomeric proteins implicated in HCM, showed that the tools behaved differently across the PTM types. Phosphorylation was the only PTM type for which all three ML tools could be compared: MusiteDeep made more conservative predictions, with higher precision but lower recall, while SiteTack favored recall, making it more suitable for scenarios where detecting all PTM sites is more important than avoiding false positives. MusiteDeep could not accurately detect acetylation sites, while PTMGPT2 showed better performance than SiteTack. SiteTack was also the only tool to identify ubiquitination sites. All three tools failed to identify any true positive methylation sites. However, with a sample size restricted to four proteins, performance metrics exhibit considerable protein-specific variability. Although mean metrics are reported, this variation prevents robust statistical comparison across tools. Consequently, we focus our conclusions on broad, qualitative insights into overall tool behavior.
According to the study by Gutierrez et al., SiteTack + PTMs outperformed MusiteDeep in terms of AUC and area under the precision–recall curve [
23]. Although SiteTack showed better overall performance, MusiteDeep performed better for acetylation and achieved a higher AUC for tyrosine phosphorylation (Y) [
23]. In the study where PTMGPT2 was introduced, the tool was compared with MusiteDeep and outperformed it for methylation, ubiquitination, acetylation, and phosphorylation (Y), showing better precision, recall, and F1-score [
24]. For phosphorylation (S, T), PTMGPT2 showed better precision and F1-score, but MusiteDeep had better MCC and recall [
24].
The use of MusiteDeep, SiteTack, and PTMGPT2 as primary PTM analysis tools remains underrepresented in the literature, resulting in a lack of comparable results from other studies. MusiteDeep, however, is widely referenced in later papers as a benchmark model in PTM prediction research, especially in studies proposing new deep learning methods [
37].
We additionally explored false positives consistently predicted by the three tools, identifying five residues not experimentally supported by PhosphoSitePlus: K1451 (acetylation) and K129 (methylation) for MYH7, and T705 (phosphorylation), K14 (acetylation), and R44 (methylation) for MYBPC3. Given that PTMs play essential roles in regulating sarcomere contractility, they can significantly influence disease severity, progression, and phenotypic manifestation. Therefore, the identified high-priority candidates present potential targets for future biological exploration.
Finally, it is important to acknowledge that the absence of a PTM annotation in PhosphoSitePlus (used as a baseline) does not imply true absence in vivo, particularly for less-studied PTM types or poorly characterized proteins. Consequently, the reported performance metrics and false positives are conditional on this specific reference dataset and may underestimate the performance of tools that correctly predict biologically real but not yet annotated modification sites.
5. Conclusions
In this study, we analyzed the ability of three machine learning tools (MusiteDeep, PTMGPT2, and SiteTack) to predict phosphorylation, acetylation, methylation, and ubiquitination sites on a small, specific panel of sarcomeric proteins. The proteins were selected for their biological relevance to HCM, enabling us to evaluate tool performance in a focused, disease-relevant context. However, because our study is restricted to a small protein set, the analysis may not reflect the full predictive capabilities of the tools. We analyzed the number of detected PTMs under different prediction score thresholds and evaluated the tools’ performance using precision, recall, and F1-score. The results showed that MusiteDeep can be used when precise detection of phosphorylation is needed for the major HCM-associated proteins. For the same target proteins, PTMGPT2 showed the best performance for acetylation. On the other hand, SiteTack can be suitable for phosphorylation site screening and for exploring other PTM types in HCM-related proteins, but it tends to produce the highest number of false positives. Finally, we report five previously unreported predicted PTM sites in the MYH7 and MYBPC3 proteins and suggest them as high-priority candidates for experimental validation. Although these modifications have not yet been confirmed experimentally, their predicted functional impact makes them promising targets for future studies aimed at understanding sarcomeric regulation in HCM. Future functional assays will be critical to determine their ultimate PTM status.