Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity

El-Awadi, Radwa; Gomez, Oscar D.; Castillo-Secilla, Daniel; Torres, Carolina; Herrera, Luis J.; Rojas, Ignacio; Ortuño, Francisco M.

doi:10.3390/biomedicines14020378

Open AccessArticle

Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity

by

Radwa El-Awadi

^1,†

,

Oscar D. Gomez

^1,†,

Daniel Castillo-Secilla

²

,

Carolina Torres

³

,

Luis J. Herrera

¹

,

Ignacio Rojas

¹

and

Francisco M. Ortuño

^1,*

¹

Department of Computer Engineering, Automation and Robotics, University of Granada, 18071 Granada, Spain

²

CoE Data Intelligence, Fujitsu Technology Solutions S.A., 28224 Madrid, Spain

³

Department of Biochemistry and Molecular Biology III and Immunology, University of Granada, 18071 Granada, Spain

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Biomedicines 2026, 14(2), 378; https://doi.org/10.3390/biomedicines14020378

Submission received: 11 December 2025 / Revised: 21 January 2026 / Accepted: 4 February 2026 / Published: 6 February 2026

(This article belongs to the Special Issue The Human Proteinomics in Disease, Diagnostics and Translation into Precision Medicine: Current Status and Future Prospects—2nd Edition)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Background: Comparing biological properties among related proteins has traditionally benefited research in areas such as biomedicine, phylogeny and evolution. Moreover, these kinds of properties have significantly increased as a result of the development of open-access resources and databases. In this context, the multiple sequence alignment (MSA) methods continue to be extensively applied to compare protein sequences and to identify evolutionarily conserved regions. Methods: In this work, we present INPROF, a novel web server designed to centralize and automate the computation of a large collection of features derived from protein sequences. This tool allows us to deeply analyze protein relationships and their conserved information by comparing them through their MSA. Specifically, this platform computes up to 46 different metrics including information at several proteomic levels (categories) like sequences, structures, domains or ontological terms. INPROF was designed to enable bioinformaticians and researchers to create a complete catalogue of features for subsequent prediction and artificial intelligence modeling based on proteins. The INPROF web server and source code are freely available. Results: INPROF were validated by predicting disease’s severity in several RNA-Seq datasets from COVID-19 patients. Specifically, INPROF were extracted from canonical proteins related to differentially expressed genes. Classification performance proved that INPROF were able to accurately classify COVID-19 severity, even outperforming classical classification with expression data. Conclusions: INPROF web server is a flexible platform designed to compute 46 quantitative metrics describing protein interactions which provide biologically meaningful characteristics applicable to downstream classification and prediction algorithms.

Keywords:

proteins; web server; protein interrelation; multiple sequence alignment (MSA); COVID-19; SARS-CoV-2; classification; feature extraction

Graphical Abstract

1. Introduction

The study of the relationships among genes or proteins has traditionally been considered a key task in understanding the underlying biological processes. However, it is still under discussion what the most adequate properties are to describe and quantify such relationships [1]. Consequently, several studies have recently proposed different measurements for this purpose [2]. These metrics are often calculated based on different biological properties: from the structural level (primary to quaternary) to functional level (domains, ontologies, etc.). For example, ATLAS [3], SoluProt [4], GTalign [5], ShiftCrypt [6], ProteoVision [7], or MAFFT-DASH [8], and the more recent EFG-CS [9] are web servers that leverage protein sequence and structural information to compute quantitative descriptors. Similarly, several tools implement approaches that integrate heterogeneous data at the protein structural level including PASSer [10], IntFOLD [11], SWISS-MODEL [12] and the latest developments of AlphaFold 3 [13], which provide highly accurate structural predictions and further incorporate the ability to model protein interactions with ligands (small molecules, DNA/RNA, and other proteins). Furthermore, newly developed platforms such as ProFeatX [14] and ESMFold [15] provide advanced structure prediction capabilities using deep learning and protein language models. Finally, several tools focus on functional-level properties, including g:Profiler [16], as well as recent updates such as ShinyGO [17] and Metascape [18], which integrate pathway enrichment and interpretation of protein interaction.

Regardless of the biological level considered, sequence conservation is a key indicator of protein interrelationships [19]. Thus, evolutionary conservation in biological sequences is commonly analyzed using pairwise alignment strategies. Moreover, these alignments need to be extended when three or more sequences are contrasted by applying multiple sequence alignment (MSA) approaches [20,21]. In this sense, MSAs are of great relevance in determining sequence similarities and can further complement information extracted from other biological levels, such as protein structure or functional annotations. For example, recent studies have successfully applied MSA tools to predict evolutionary relationships [22], phylogenetic trees [23,24], or infer protein–protein interactions [25,26]. Importantly, beyond their classical use for evolutionary inference, MSAs can also be considered as quantitative tools to summarize sequence-level similarity, heterogeneity, and structural compatibility within sets of proteins, even when they are less evolutionarily related. Consequently, the application of MSAs has proven to be extremely valuable for downstream analysis, including domain identification [21], structure prediction [13], and functional inference. In fact, recent studies have shown that incorporating information captured by MSAs can substantially enhance RNA- and protein-level modeling, motivating integrative approaches that bridge transcriptomic data with protein sequence, structural, and functional features [20].

In light of the interest of these features for the assessment of protein and gene relationships, we present a new web-based platform, named INterrelational PROtein Features (INPROF). Although simplistic, this web server allows users to select multiple similarity measures based on several protein properties. Specifically, users can extract over 46 relationship features among several sets of proteins. Consequently, every user can create their own customized protein-feature datasets for downstream analysis purposes. Several metrics computed by INPROF are directly derived from MSA results, as alignments are considered essential for the understanding of the relationship among proteins. In this sense, the exact feature dataset produced by this web server was already considered in subsequent bioinformatics machine learning tools related to sequences and MSA [27,28]. Moreover, some related feature subsets have been built based on the protein relevant knowledge for other recent machine learning studies. For example, similar protein features were found to be significantly important for predictions such as protein–protein interactions (PPIs) [29], evolutionary phylogenetic trees [30] or prediction of substrate specialties in membrane transport proteins [31] or palmitoylation sites [32]. More recently, deep learning models have reinforced the importance of integrating structural and evolutionary protein features for accurate predictive analysis [26]. However, to the best of our knowledge, no existing tool integrates such a simple and comprehensive collection of interrelated sequence quantitative features.

Finally, we conducted a real-world case study as a proof of concept illustrating how INPROF-derived features can be applied to explore disease severity patterns among COVID-19 patients. Since March 2020, SARS-CoV-2 has shaken life, and several studies have been developed to discover how the virus works. This virus is characterized by the heterogeneity of its symptoms, depending on the patients’ conditions [33]. Although most cases are asymptomatic or mild [34], there are COVID-19 patients who develop severe pulmonary symptoms [35] such as acute respiratory distress syndrome (ARDS), pulmonary edema, intense kidney injury and multiorgan failure, even leading to death [35,36]. Several studies have shown that Differentially Expression Genes (DEGs) are useful for studying the disease and its symptoms. More specifically, prior research has reported links between COVID-19 patient gene expression and the specific disease, investigating how changes in gene expression influence COVID-19 outcomes [37,38,39]. In this study, we show that INPROF-derived features can be applied to severity classification using proteins associated with DEGs, achieving similar performance and providing complementary predictive information compared to models based solely on DEGs.

2. Materials and Methods

2.1. Implementation and Usage

The INPROF web server can be directly consulted online without any installation requirement (https://www.ugr.es/~fortuno/inprof/inprof.php, accessed on 22 January 2026). Users can interact through the designed web server using a simple and intuitive web interface. Additionally, features can also be extracted from the web server through an API by using standard programming languages like Python (v3.12.3), Perl (v5.34.0) or R (v4.5.1). An example of a request using a Python script is available on the website. INPROF requires as input a set of at least two proteins, each unambiguously identified by its UniProtKB accession or name, in order to compute interrelational features.

This web server is made up of four well-defined software modules (see Figure 1): web interface, Apache web server, protein feature calculation, and local database of protein features. Each of those components, together with a detailed description of the features retrieved, is presented in detail in the following subsections.

2.1.1. Protein Features

The protein features computed by INPROF are classified into different protein categories (see complete list of features in Supplementary Table S1). Each category refers to the group of metrics that users can calculate separately to assess the relationships between proteins. These feature categories are:

Sequences: Basic statistical properties of amino acid chains, such as average length or length variance. For MSA, this category performs statistics about matches in the alignment.
Amino-acid Types: Percentages of different chemical classes of amino acids in the sequence (e.g., uncharged, nonpolar aliphatic, positively charged, etc.) composing the chain.
Domains: Properties related to functional domains as defined by Pfam [40]. Both Pfam domains and their corresponding clans are considered.
Secondary Structure: Percentage of amino acids located in different protein’s secondary structures (α-helix, β-strand, hydrogen-bonded turn). This information is retrieved from Uniprot [41].
Tertiary Structure: Properties regarding annotated structural data according to the Protein Data Bank (PDB) [42]. Among others, structural contacts are computed here. A contact is considered when any pair of atoms in two different amino acids, separated by more than five residues in the sequence, are close enough that no solvent molecule can occupy the space between them (similarly defined in [43]).
Ontological Terms: Features related to shared terms in the three ontologies defined by Gene Ontology (GO) [44]: biological process (BP), molecular function (MF) or cellular component (CC). Note that metrics in this category cannot be referred to MSA. Since GO terms are not associated to a specific location along sequences, alignment metrics cannot be computed from them.

When selected by the user, these features can be optionally expanded using multiple sequence alignment (MSA) among proteins. In this case, the resulting metrics are reinterpreted based on both the protein sequences and their alignments. These MSA-based features are not explicitly intended to infer evolutionary relationships among proteins, but rather to quantify complementary descriptors (sequence-level similarity, heterogeneity, and alignment coherence) within a group of proteins enriching protein representations for downstream classification tasks. Additionally, most INPROF are computed as normalized or aggregated statistics to ensure comparability across protein sets of different sizes. However, the total number of input proteins is still included as a separate feature to explicitly capture potential biological differences that may be associated with it.

Although INPROF requires valid UniProtKB identifiers, custom/partial protein sequences may be provided by users associated to those identifiers, not necessarily using those sequences already available in UniProtKB. In this case, all sequence- and alignment-based features as well as annotations related to those (domains, structures, etc.) are computed based on the overlapping with the supplied sequences. Table 1 summarizes the number of features included in each category. A detailed description of each individual feature and its computational procedure is provided in the tool’s documentation (https://www.ugr.es/~fortuno/inprof/help.html, accessed on 22 January 2026).

2.1.2. Web Interface

Protein queries can be provided directly by users specifying their Uniprot entry names or accession numbers [41]. They can be delivered both directly from a textbox at the web interface or uploaded as a list in a text file. Alternatively, protein sequences can be imported from a FASTA format file. In that case, sequences in the file are considered preferential for the computation, instead of downloading sequences from Uniprot.

Users can interactively select the desired feature categories (see above) through multiple checkboxes in the web interface. Moreover, if desired, they may choose to apply an MSA approach to obtain an extended set of features (see Table 1). In this case, three alignment tools are available: CLustalW [45], T-Coffee [46], and MUSCLE [47]. Although these tools represent classical approaches, they remain the most widely used MSA tools in current bioinformatics workflows. Finally, the computed similarity measures are provided in a downloadable tabular format, whereas the resulting sequence alignments can also be exported in MSF format if this option is selected.

In addition, the web interface enhances the feature table by linking each metric to its corresponding external resource. This allows users to directly access the original data sources from which each metric was derived, including databases such as UniProt, Pfam, PDB, and Gene Ontology, as previously described.

2.1.3. Apache Web Server

The back-end web server runs through a PHP module by managing the user query and generating the output results. This module includes SOAP-based web services, which can be queried from standard RESTful POST functions. INPROF are then internally processed through an in-house Perl script based on user preferences. After features have been computed, a response in JSON format is served to the front-end (or the querying user) with every requested feature, including their identifiers (labels), their metric values, and a list of complementary identifiers from external resources.

2.1.4. Metrics Calculation (Perl Module)

The 46 features are computed according to the categories and options selected by the user. Feature calculation is carried out in the back-end module by querying annotation data from a local MySQL database (see below for details). When a list of proteins is submitted, two scenarios may occur. If all proteins are already stored in the local database, the requested metrics are immediately computed, and the results are returned to the user. If some proteins are missing, the system automatically updates the local database by retrieving the missing information from external sources, which may temporarily increase the response time. To minimize such delays, the database is periodically updated and maintained to ensure complete protein coverage and quick query processing. Once the database has been refreshed, all features are computed consistently using newly retrieved protein annotations.

2.1.5. Protein Feature Database

The INPROF database is built in a MySQL framework with three main tables, each one connected to a different external source. The idea behind this layout is to combine several biological sources while keeping the structure manageable and consistent. A brief description of these tables is given below:

UniProt_Feats: This table includes all annotation details retrieved from Uniprot [41]. It also serves as the central table, linking each protein to its corresponding entries in the Pfam and PDB tables, which helps integrate both structural and functional information.
Pfam_Feats: This table contains information on functional domain obtained from the Pfam database [40]. It includes domain names, their positions along the protein sequence, and their associated families or clans.
PDB_Feats: This table stores the tertiary structure information provided by the Protein Data Bank [42], such as chain identifiers, the sequence regions to which they map, and several additional descriptors commonly used to describe protein structure.

The database is periodically updated with the latest external annotations. At the moment, it includes information for more than 560,000 proteins. Anyway, when a user requests a new protein with a valid UniProtKB accession which is not yet available locally, the system retrieves the necessary annotations in real time and incorporates them into the local database, to guarantee an optimal response time.

2.2. Case Study: COVID-19 Severity Classification

To evaluate the functionality of INPROF, we analyzed publicly available RNA-Seq datasets from patients infected with COVID-19. Specifically, we reproduced the severity classification study presented in [39]. In that study, the authors used the same GEO datasets and the severity classification relied exclusively on differential gene expression values. By applying INPROF to the same patient groups, we ensured that both approaches were evaluated under comparable conditions. Our objective was to assess whether the protein interrelated features created by INPROF can serve as an alternative and complement to gene expression models for severity classification. The complete validation workflow is summarized in Figure 2. All code used to reproduce the study, including data preprocessing, INPROF feature extraction, and model training, is available in our GitHub (git v2.43.0) repository: https://github.com/fortuno/INPROF/tree/main/INPROF-COVID19-Severity, accessed on 22 January 2026.

2.2.1. Data Acquisition

The datasets were obtained from the NCBI Gene Expression Omnibus (GEO) [48], corresponding to the series GSE156063 [49], GSE152075 [50] and GSE162835 [51]. Data was initially retrieved as raw count matrices. Severity metadata was explicitly available only for GSE162835; for the remaining series, clinical labels were retrieved by contacting the authors of the datasets via email.

The patients in these series were labeled into four classes: Outpatient (asymptomatic or mild cases), Inpatient (hospitalized, non-ICU), ICU (intensive-care patients) and Control. In this study, the Control group is used as the reference baseline, comparing each Outpatient, Inpatient, and ICU sample with all Control samples to compute the metrics of expressed genes in terms of log2 fold-change and p-value. After a careful preprocessing and outliers removal with the R-package KnowSeq [52] (see details below), the resulting cohort comprised 206 patients: 131 Outpatients, 18 Inpatients, 10 ICUs, and 47 Controls, each with expression profiles of 15,526 genes.

2.2.2. Differential Expression Analysis

The gene expression matrix for each patient group (Control, Outpatient, Inpatient, ICU) was built using KnowSeq package [52]. KnowSeq comprises all standard steps for transcriptomic gene expression analysis including mapping, preprocessing, outlier removal, normalization or differential expression analysis (underlaying standard packages like limma, cqn or edgeR are applied). Since original data was obtained as raw counts, outliers were initially removed using both an inter-quartile range (IQR) technique and MA-plot. MA-plot were represented by calculating Hoeffding’s statistic on the joint distribution of A and M for each of the samples [53]. Later, the batch effect, which is an intrinsic deviation of data produced on its extraction, was removed through the Surrogate Variable Analysis (SVA) method [54].

After preprocessing and normalization, differentially expressed genes (DEGs) were individually identified for each COVID-19 patient group (Outpatient, Inpatient and ICU) compared to the Control group. A refined subset of DEGs was obtained for each patient applying the Benjamini–Hochberg correction to control the False Discovery Rate (FDR) (adjusted p-value < 0.001) and imposing a threshold on the log₂ fold-change (LFC > 1.5). To limit the computational load during INPROF feature generation, a maximum of 170 top-ranked DEGs per patient were retained.

2.2.3. Retrieval of Canonical Proteins

The objective of this step is to generate a list of proteins associated with the DEGs identified for each COVID-19 patient. Once DEG lists were obtained, the corresponding protein products were determined by mapping each gene to its encoded protein. The canonical protein, defined as the amino acid sequence most likely to be expressed, was selected for further analysis.

The list of DEGs was first obtained using their gene names and subsequently annotated with their respective Ensembl identifiers [55]. Once this mapping was completed, the canonical protein sequences were obtained through the UniProtID mapping service. In this way, each patient could be represented by a defined set of proteins associated with their specific DEG profile.

2.2.4. INPROF Feature Extraction

The resulting protein datasets were then analyzed using INPROF for feature extraction. For the protein list of each patient, INPROF computed the 46 features describing both individual protein characteristics and properties derived from sequence alignments. In this study, MUSCLE was selected as the alignment tool; however, comparable results were obtained with the alternative options, ClustalW and T-Coffee. From the initial INPROF dataset, two alignment-related features (MSA_TC and MSA_TN) were excluded, as they yielded zero values across all samples and therefore did not provide discriminatory information. Consequently, the final dataset used for the analysis contained 44 informative features for the 159 patients with COVID-19 (the entire INPROF-based feature dataset is provided in Supplementary Table S2).

2.2.5. Feature Selection

After retrieving the final feature dataset, several machine learning algorithms were applied to evaluate the ability of these features to predict the severity of COVID-19. Initially, a preliminary graphical perspective about patient’s distribution on a bi-dimensional space was performed by two complementary dimensionality reduction approaches: t-SNE (t-Distributed Stochastic Neighbor Embedding) [56] and UMAP [57]. These models were performed with the R language Rtsne [58] and umap [57], respectively.

Nevertheless, increasing the complexity of the predictive models is expected to substantially enhance the separation observed in the t-SNE representation. To this end, the minimum Redundancy Maximum Relevance (mRMR) algorithm [59] was applied using the FeatureSelection function from KnowSeq. This approach generates a ranked list of features based on their relevance and redundancy. Thus, subsets of increasing size were progressively evaluated to identify the optimal feature combination for classification.

2.2.6. Classification Model

Although several classical classification models (KNN, Random Forest and SVM) were assessed, SVM yielded the best overall performance for this task. To determine the minimal number of features required to accurately discriminate among patient groups, a separate SVM classifier was trained for each progressively expanded subset of features ordered according to the previous mRMR ranking. Because our INPROF-derived model was compared against a DEG-based approach applied to the same datasets [39], we adopted an equivalent training and feature selection strategy to ensure comparability under identical conditions. Briefly, each model was trained using an 80/20 train–test split, followed by a 5-fold cross-validation on the training set to determine both the optimal number of features and the best performing hyperparameter configuration. Every model was evaluated using standard classification metrics: accuracy, sensitivity, specificity and F1-score.

3. Results

The previously described case study was conducted to assess the utility of the INPROF web server and its interrelated proteomic features. Initially, a subset of proteins was obtained for each patient from the list of DEGs obtained comparing against Control (see Section 2.2 for details). INPROF were then calculated for the list of protein of each patient. Features were initially reduced to a bi-dimensional space by both t-SNE and UMAP approaches, for an easier visualization purpose (Figure 3).

As shown in t-SNE (Figure 3, right panel), Inpatient and ICU groups are mostly clustered toward more distinct regions of the map, while the Outpatient samples form a broader and more dispersed distribution. Although the t-SNE representation may visually suggest a gradual separation among severity groups, such patterns should be interpreted with caution, as t-SNE may introduce noisy and artificial gradients not reflecting true biological trajectories [60]. To provide a complementary and more robust visual assessment, we additionally applied UMAP dimensionality reduction, which may preserve global structure better, representing more clinically meaningful variation. As shown in the UMAP projection (Figure 3, left panel), the three severity groups still form clearly separated clusters, reinforcing the notion that INPROF-derived features capture differences associated with COVID-19 severity. Notably, the Outpatient samples are further subdivided into multiple subclusters in UMAP, suggesting the presence of latent heterogeneity within this group that may be related to sample date, host response, comorbidities, or technical biases (for details, see limitations in Section 5). Although both visualizations indicate an acceptable separation among mild, moderate, and severe patient profiles, more sophisticated modeling approaches are required to capture complex non-linear patterns which better classify samples that are ambiguously positioned near incorrect classes

In order to improve the separation of the severity groups, a more complex feature ranking based on the mRMR algorithm was then performed. Although 44 INPROF were initially retrieved, those exhibiting very low variability across samples (specifically, features with fewer than 10 non-zero values) were excluded from this ranking to prevent training models on features with no discriminatory information. More specifically, a total of six INPROF, primarily related to 3D structural similarity and Pfam alignment metrics, were excluded from the ranking: the number of shared 3D structures (SEQ_CS), the number of structural contacts (SEQ_NC), the number of contacts correctly aligned (MSA_3D), the STRIKE score for alignment structural accuracy (MSA_SK) and the percentage of matches in the same domains (MSA_PT) and the same clans (MSA_PC). These metrics were found less informative for the heterogeneous protein sets derived from COVID-19 severity-associated DEGs, given their lower degree of functional and structural similarity. Nevertheless, the discarded features were found to be ranked near the bottom of the ranking. The final mRMR ranking was then obtained for the remaining highly variable 38 INPROF.

After ordering the INPROF feature, several SVM classification models were performed. Specifically, a total of 38 SVM models were constructed (one per feature ranked by mRMR), where the n-th model incorporated the top n-ranked features. As shown in Figure 4, the performance of the SVM classifier in the training subset improves rapidly as the first features are incorporated, stabilizing around approximately 10 features. Beyond this point, all metrics exhibit only minor fluctuations, indicating that most discriminative information is captured within the top-ranked features.

Detailed metric values for selected subsets (2, 8, 10, and 14 features) are reported in Table 2, providing a quantitative summary of the classifier’s behavior at key stages of the feature selection process. As shown in the table, classification performance generally increases with the number of features used for model training, accompanied by a clear reduction in standard deviation across the 5-fold cross-validation. This decreasing variability indicates that models become more stable and robust as additional informative INPROF are incorporated. Although the test subset performance is consistently lower than training performance, the results remain strong overall. Most of this reduction, particularly in Sensitivity and F1-score metrics, was anticipated due to the pronounced class imbalance, which makes correctly identifying the minority class (ICU samples) substantially more challenging. For this reason, and to remain consistent with the DEG-based study used for comparison [39], a leave-one-out cross-validation (LOOCV) scheme was additionally applied for a selected number of features to enhance the statistical robustness of the evaluation, given the limited number of ICU samples available. Specifically, as already observed in Figure 4, we considered the top-10 INPROF, as they provide a sufficient balance between high accuracy and overall performance across metrics. These features, which were found to be meaningful in explaining differences between COVID-19 patients with varying severity grades, are briefly presented here in the same order as they were ranked by mRMR:

Percentage of polar uncharged amino acids in protein sequences (SEQ_PL).
Number of Pfam clans shared between each protein pair (SEQ_CK).
Variance of the length in the protein sequences (SEQ_VA).
Maximum length in the protein sequences (SEQ_MX).
Number of Pfam domains shared between each two proteins (SEQ_CT).
Percentage of amino acids in protein sequences with atypical or unknown secondary structures (SEQ_SU).
Minimum length in protein sequences (SEQ_MN).
Percentage of amino acids in protein sequences that are included in any Pfam clan (SEQ_PC).
Percentage of matches in the alignment between pairs of amino acids included in β-strand secondary structures (MSA_TD).
Percentage of gaps in the alignment. This is a measure of how similar the protein sequences are (MSA_GP).

The confusion matrix obtained and its corresponding performance metrics applying the LOOCV evaluation scheme over the top-10 selected features is shown in Figure 5. The model achieves an overall accuracy of 97.48%, with strong performance across all three groups. Inpatient and Outpatient cases are classified with high reliability, with only one patient misclassified per group. Furthermore, despite the limited number of ICU samples, the model correctly identifies most of those patients (8 out of 10), reflecting balanced and robust behavior. The corresponding performance metrics (F1-score: 93.28%, Sensitivity: 91.23%, Specificity: 96.20%) indicate that the model is able to generalize reliably even though the validation setup was intentionally strict.

As mentioned earlier, we compared the INPROF-based model with the DEG approach published by Bajo et al. [39]. To match the structure of our analysis, the Control group was removed from their confusion matrix, keeping the three classes of clinical severity: Outpatient, Inpatient, and ICU. It is worth noting that the number of samples in each category differs slightly between the two studies due to variations in preprocessing and outlier removal. After comparing under the same evaluation conditions, INPROF model achieved slightly higher overall performance than the DEG model, with modest improvements in accuracy (97.5% vs. 96.8%) as well as in F1-score, sensitivity, and specificity. The most noticeable difference was observed in the Outpatient group, where the DEG model misclassified three samples to the ICU class. In contrast, both models correctly identified most of the ICU samples, even when they are the smallest and most challenging group. Collectively, these findings suggest that the proteomic interaction features generated by INPROF provide a robust and discriminative representation of patient molecular profiles. These results can offer a reliable complement to DEG-based markers with a comparable performance.

4. Discussion

The main purpose of this study was to introduce the INPROF web server and to assess whether it can produce biologically informative features based on protein interactions. To address this goal, we built a unified analysis pipeline that combined several publicly accessible and heterogeneous COVID-19 RNA-Seq datasets. For each patient, we identified differentially expressed genes (DEGs) and mapped them to their respective canonical proteins. This step enabled the extraction of interaction-level features through INPROF, which were subsequently used for disease severity classification. This analysis allowed us to explore the discriminative potential of these features and to assess whether INPROF captures patterns relevant to clinical outcomes for this specific case study.

The findings of our analysis suggest that the features generated by INPROF yield a robust and biologically coherent representation of the molecular states of the patients. Specifically, classification performance improved progressively as more features were incorporated, reflecting both an increasing accuracy and a reduction in variability across cross-validation folds. This pattern shows that these interaction-driven proteomic features may capture complementary biological signals distributed across multiple proteins, rather than being constrained to the influence of individual genes. Interestingly, once the model incorporated around ten features, its performance largely stabilized, indicating that only a small number of interaction-based measures are sufficient to reliably separate severity groups. This observation is consistent with the broader view that the severity of COVID-19 arises from coordinated molecular disruptions rather than changes confined to isolated genes.

The group of ten features ranked by the mRMR procedure provides interesting insights about the molecular attributes most closely related to the severity of disease. Several of the top-ranked features captured differences in protein sequence length (SEQ_VA, SEQ_MX, SEQ_MN). Interestingly, higher length variances were usually found in proteins from Inpatient and ICU patients (see SEQ_VA in Figure 6), suggesting that the diversity of protein size may indicate functional heterogeneity associated with severe disease. In fact, variation in protein length often reflects changes in domain architecture or structural modularity, influencing functions such as inflammation and antiviral activity [61]. This observation aligns with other analyses of the SARS-CoV-2 interactome, which show that many host proteins involved in viral entry, replication, and immune control strongly depend on their structural configuration and the way they interact with other molecules [62].

Consistent with this interpretation, other top-ranked features were also associated with Pfam domains (SEQ_CK, SEQ_CT, SEQ_PC), indicating that conserved functions play an important role in distinguishing between severity groups [63]. Moreover, recent studies suggest that many host proteins interacting with SARS-CoV-2 share common Pfam domains, reinforcing the relevance of domain-level organization in viral–host interactions and its possible influence on clinical outcomes [64]. Within our dataset, proteins from Inpatient and ICU patients repeatedly showed a higher number of shared domains compared to those from Outpatients (see SEQ_CT in Figure 6). Specifically, two domains were more frequently shared among proteins from severe patients: the fibronectin type III (FN3) domain (PF00041) and the Immunoglobulin I-set domain (PF07679). FN3 domains have been associated with cytokine binding and immune-modulatory roles [65], while Immunoglobulin I-set domains contribute to cell adhesion, cell–cell recognition, and immune receptor functions. All these observations support the broader understanding that severe COVID-19 is strongly influenced by dysregulated immune pathways, including exaggerated cytokine activity [66]. The entire list of shared domains in the three classes is provided in the Supplementary Table S3.

Additional informative features included metrics based on amino acid composition, such as the proportion of uncharged polar residues (SEQ_PL) and residues linked to unusual or undefined secondary structure annotations (SEQ_SU). Proteins with high sequence disorder or atypical residue composition often create flexible binding regions and support dynamic interaction behavior, a pattern well-documented in structural biology [67]. This type of conformational flexibility is increasingly viewed as an important factor in host responses to viral infection, shaping immune signaling pathways and cellular stress responses [68]. Taken together, these findings show that these INPROF again capture biologically coherent patterns linked to well-established mechanisms of COVID-19 severity, such as immune modulation, inflammation, and cellular stress responses. The fact that these features remain consistent across validation folds and datasets shows that INPROF can detect stable disease-relevant signals despite heterogeneity in transcriptomic profiles.

On the other side, INPROF derived from 3D annotations in PDB generally showed limited variability and low discriminative power in the COVID-19 severity case study. In fact, four of these features were excluded before feature selection (see Section 3) whereas the remaining two were ranked under the least informative features. This behaviour likely reflects weak structural and evolutionary relationships among DEG-derived protein sets, with rare common structures or structural similarities. Importantly, this does not imply these features are generally uninformative, but they are expected to increase their importance when more evolutionarily related or structurally conserved protein sets are considered.

Finally, it is important to acknowledge that the INPROF tool was inherently designed as disease-agnostic and generalizable to other diseases and independent cohorts, beyond the presented COVID-19 severity use case presented here. In fact, the COVID-19 analysis presented here should be regarded as a proof-of-concept case study rather than a comprehensive benchmark. Since INPROF are derived from fundamental protein properties (sequence composition, domain organization, structural annotations or protein similarity, they are not tied to a particular pathology, but rather reflect system-level characteristics of protein sets associated with a given condition. Therefore, INPROF can be widely applied to protein lists derived from differential expression, co-expression modules, or mutational analyses in diverse contexts, including cancer subtyping and staging, autoimmune disorders, neurodegenerative diseases, and other infectious diseases. Further evaluations across independent cohorts, diseases, and classification tasks are being performed to more extensively validate the general utility and performance of INPROF-derived features. Moreover, the use of interaction-level and set-based descriptors makes INPROF particularly well suited for cross-cohort analyses, where individual gene signatures may vary but functional patterns remain conserved. These properties make INPROF a versatile tool for proteomics-driven machine learning applications in translational and precision medicine.

Nonetheless, some limitations should be considered when interpreting our case study results. The datasets were collected in different countries and during different phases of the pandemic, which can differ in the circulating SARS-CoV-2 variant at those locations and dates. Such variability can substantially influence transcriptomic patterns in ways not explicitly captured by our pipeline. Another consideration involves the way clinical labels were assigned. This assignment was based on the patient’s condition at the time of the PCR test, without information on the disease progression afterward. As suggested by the t-SNE plot (Figure 3), some individuals labeled as Outpatient might have later developed more severe symptoms, which could introduce some bias. Also, contextual factors, including hospital capacity, admission policies, and demographic considerations, may have affected decisions about hospital or ICU admission. These decisions may not always reflect biological severity and can introduce additional uncertainty in the reference labels. Despite that, the results were consistent across the datasets and validation schemes. The reproducibility of the selected features and the robust classification outcomes suggest that these confounders had only a limited effect on the main conclusions, reinforcing the reliability of INPROF-based representations in this specific use case.

5. Conclusions

This study presents the INPROF web server, a flexible platform designed to compute quantitative metrics based on interactions that describe how proteins relate to each different biological dimension. INPROF integrates information from several sources, including protein sequences, amino acid classifications, Pfam domains, and structural attributes such as secondary and tertiary structures, together with annotation descriptors. Because INPROF relies on these external annotation resources, proteins must be correctly identified in UniProtKB to enable feature computation, which may restrict its use for unidentified or insufficiently annotated protein sequences. When required, the platform can also incorporate additional metrics derived from multiple sequence alignments (MSAs) computed for these same biological properties. In total, INPROF provides up to 46 distinct metrics, enabling a comprehensive and multidimensional characterization of protein similarity. Unlike approaches that focus primarily on enrichment analysis or functional annotation, the key strength of INPROF lies in its ability to produce broad quantitative features describing how proteins relate to each other in a reliable and biologically meaningful manner. Thus, these features can be used directly for downstream applications, including classification, clustering, and incorporation into machine learning workflows.

As a specific use case, INPROF was applied to a COVID-19 RNA-Seq dataset, demonstrating that these proteomic features accurately predict disease severity. This use case validated both the functionality and the biological relevance of the platform. INPROF can therefore be categorized as a general-purpose tool for generating complete, informative feature datasets derived from groups of proteins, suitable for diverse classification, prediction or machine learning applications. Its flexibility and quantitative nature make it a valuable resource for studies aiming to integrate protein-level information into computational analyses.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedicines14020378/s1. Table S1: List of features computed by INPROF web server; Table S2: Matrix with the entire dataset retrieved from INPROF; Table S3: Top-50 Pfam domains found shared by each paired proteins.

Author Contributions

R.E.-A., F.M.O. and D.C.-S. are responsible for the web server design and implementation. F.M.O., L.J.H. and I.R. are involved in the conception and design of the study. R.E.-A. and O.D.G. are responsible for data acquisition and data analysis for validation. C.T. is responsible for the biological interpretation of results. R.E.-A., F.M.O. and C.T. are responsible for the writing of the article. All authors have read and agreed to the published version of the manuscript.

Funding

Grant PID2024-160318OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU awarded to I.R.; Grant PID2023-152099OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU awarded to C.T.; and Grant C-ING-172-UGR23 funded by FEDER/Junta de Andalucía/Consejería de Economía y Conocimiento awarded to F.M.O. The R.E. predoctoral contract was funded by Junta de Andalucía (PREDOC_01429).

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Additionally, source codes and original raw data is available in the repository https://github.com/fortuno/INPROF (accessed on 22 January 2026). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Daniel Castillo-Secilla was employed by the company Fujitsu Technology Solutions S.A. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Bendel, A.M.; Faure, A.J.; Klein, D.; Shimada, K.; Lyautey, R.; Schiffelholz, N.; Kempf, G.; Cavadini, S.; Lehner, B.; Diss, G. The genetic architecture of protein interaction affinity and specificity. Nat. Commun. 2024, 15, 8868. [Google Scholar] [CrossRef]
Baryshev, A.; La Fleur, A.; Groves, B.; Michel, C.; Baker, D.; Ljubetič, A.; Seelig, G. Massively parallel measurement of protein–protein interactions by sequencing using MP3-seq. Nat. Chem. Biol. 2024, 20, 1514–1523. [Google Scholar] [CrossRef]
Vander Meersche, Y.; Cretin, G.; Gheeraert, A.; Gelly, J.C.; Galochkina, T. ATLAS: Protein flexibility description from atomistic molecular dynamics simulations. Nucleic Acids Res. 2024, 52, D384–D392. [Google Scholar] [CrossRef]
Hon, J.; Marusiak, M.; Martinek, T.; Kunka, A.; Zendulka, J.; Bednar, D.; Damborsky, J. SoluProt: Prediction of Soluble Protein Expression in Escherichia coli. Bioinformatics 2021, 37, 23–28. [Google Scholar] [CrossRef]
Margelevicius, M. GTalign: Spatial index-driven protein structure alignment, superposition, and search. Nat. Commun. 2024, 15, 7305. [Google Scholar] [CrossRef]
Orlando, G.; Raimondi, D.; Kagami, L.P.; Vranken, W.F. ShiftCrypt: A web server to understand and biophysically align proteins through their NMR chemical-shift values. Nucleic Acids Res. 2020, 48, W36–W40. [Google Scholar] [CrossRef]
Penev, P.I.; McCann, H.M.; Meade, C.D.; Alvarez-Carreño, C.; Maddala, A.; Bernier, C.R.; Chivukula, V.L.; Ahmad, M.; Gulen, B.; Sharma, A.; et al. ProteoVision: Web server for advanced visualization of ribosomal proteins. Nucleic Acids Res. 2021, 49, W578–W588. [Google Scholar] [CrossRef]
Rozewicki, J.; Li, S.; Amada, K.M.; Standley, D.M.; Katoh, K. MAFFT-DASH: Integrated protein sequence and structural alignment. Nucleic Acids Res. 2019, 47, W5–W10. [Google Scholar] [CrossRef]
Gu, X.; Myung, Y.; Rodrigues, C.H.M.; Ascher, D.B. EFG-CS: Predicting chemical shifts from amino acid sequences with protein structure prediction using machine learning and deep learning models. Protein Sci. 2024, 33, e5096. [Google Scholar] [CrossRef]
Tian, H.; Jiang, X.; Tao, P. PASSer: Prediction of allosteric sites server. Mach. Learn. Sci. Technol. 2021, 2, 035015. [Google Scholar] [CrossRef]
McGuffin, L.J.; Adiyaman, R.; Maghrabi, A.H.; Shuid, A.N.; Brackenridge, D.A.; Nealon, J.O.; Philomina, L.S. IntFOLD: An integrated web resource for high performance protein structure and function prediction. Nucleic Acids Res. 2019, 47, W408–W413. [Google Scholar] [CrossRef]
Waterhouse, A.; Bertoni, M.; Bienert, S.; Studer, G.; Tauriello, G.; Gumienny, R.; Heer, F.; Beer, T.; Rempfer, C.; Bordoli, L.; et al. SWISS-MODEL: Homology modelling of protein structures and complexes. Nucleic Acids Res. 2018, 46, W296–W303. [Google Scholar] [CrossRef]
Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.; Bambrick, J.; et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 630, 493–500. [Google Scholar] [CrossRef]
Guevara-Barrientos, D.; Kaundal, R. ProFeatX: A parallelized protein feature extraction suite for machine learning. Comput. Struct. Biotechnol. J. 2022, 21, 796–801. [Google Scholar] [CrossRef]
Manfredi, M.; Vazzana, G.; Savojardo, C.; Martelli, P.L.; Casadio, R. AlphaFold2 and ESMFold: A large-scale pairwise model comparison of human enzymes upon Pfam functional annotation. Comput. Struct. Biotechnol. J. 2025, 27, 461–466. [Google Scholar] [CrossRef]
Kolberg, L.; Raudvere, U.; Kuzmin, I.; Adler, P.; Vilo, J.; Peterson, H. g:Profiler—Interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 2023, 51, W207–W212. [Google Scholar] [CrossRef]
Ge, S.X.; Jung, D.; Yao, R. ShinyGO: A graphical gene-set enrichment tool for animals and plants. Bioinformatics 2020, 36, 2628–2629. [Google Scholar] [CrossRef]
Zhou, Y.; Zhou, B.; Pache, L.; Chang, M.; Khodabakhshi, A.H.; Tanaseichuk, O.; Benner, C.; Chanda, S.K. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 2019, 10, 1523. [Google Scholar] [CrossRef] [PubMed]
Hu, B.; Tan, C.; Xu, Y.; Gao, Z.; Xia, J.; Wu, L.; Li, S.Z. ProtGO: Function-guided protein modeling for unified representation learning. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 9–15 December 2024; Neural Information Processing Systems Foundation, Inc.: Vancouver, BC, Canada, 2024. [Google Scholar]
Zhang, Y.; Lang, M.; Jiang, J.; Gao, Z.; Xu, F.; Litfin, T.; Chen, K.; Singh, J.; Huang, X.; Song, G.; et al. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Res. 2024, 52, e3. [Google Scholar] [CrossRef]
Zhang, C.; Wang, Q.; Li, Y.; Teng, A.; Hu, G.; Wuyun, Q.; Zheng, W. The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction. Biomolecules 2024, 14, 1531. [Google Scholar] [CrossRef]
Javed, F.; Ahmed, J.; Hayat, M. ML-RBF: Predict protein subcellular locations in a multi-label system using evolutionary features. Chemom. Intell. Lab. Syst. 2020, 203, 104055. [Google Scholar] [CrossRef]
Kapli, P.; Yang, Z.; Telford, M.J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 2020, 21, 428–444. [Google Scholar] [CrossRef]
Zou, Y.; Zhang, Z.; Zeng, Y.; Hu, H.; Hao, Y.; Huang, S.; Li, B. Common Methods for Phylogenetic Tree Construction and Their Implementation in R. Bioengineering 2024, 11, 480. [Google Scholar] [CrossRef] [PubMed]
Yugandhar, K.; Gupta, S.; Yu, H. Inferring protein–protein interaction networks from mass spectrometry-based proteomic approaches: A mini-review. Comput. Struct. Biotechnol. J. 2019, 17, 805–811. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Qian, W.; Cai, W.; Song, W.; Wang, W.; Maharjan, D.T.; Cheng, W.; Chen, J.; Wang, H.; Xu, D.; et al. Inferring the effects of protein variants on protein–protein interactions with interpretable transformer representations. Research 2023, 6, 0219. [Google Scholar] [CrossRef]
Ortuño, F.M.; Valenzuela, O.; Pomares, H.; Rojas, F.; Florido, J.P.; Urquiza, J.M.; Rojas, I. Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques. Nucleic Acids Res. 2013, 41, e26. [Google Scholar] [CrossRef]
Ortuño, F.M.; Valenzuela, O.; Prieto, B.; Saez-Lara, M.J.; Torres, C.; Pomares, H.; Rojas, I. Comparing different machine learning and mathematical regression models to evaluate multiple sequence alignments. Neurocomputing 2015, 164, 123–136. [Google Scholar] [CrossRef]
Hu, J.; Li, Z.; Rao, B.; Thafar, M.A.; Arif, M. Improving protein–protein interaction prediction using protein language model and protein network features. Anal. Biochem. 2024, 693, 115550. [Google Scholar] [CrossRef]
Mutti, G.; Ocaña-Pallarès, E.; Gabaldón, T. Newly developed structure-based methods do not outperform standard sequence-based methods for large-scale phylogenomics. Mol. Biol. Evol. 2025, 42, msaf149. [Google Scholar] [CrossRef]
Kroll, A.; Niebuhr, N.; Butler, G.; Lercher, M.J. SPOT: A machine learning model that predicts specific substrates for transport proteins. PLoS Biol. 2024, 22, e3002807. [Google Scholar] [CrossRef]
Li, W.; Shen, J.; Zhuang, A.; Wang, R.; Li, Q.; Rabata, A.; Zhang, Y.; Cao, D. Palmitoylation: An emerging therapeutic target bridging physiology and disease. Cell. Mol. Biol. Lett. 2025, 30, 98. [Google Scholar] [CrossRef] [PubMed]
Castro-Pearson, S.; Samorodnitsky, S.; Yang, K.; Lotfi-Emran, S.; Ingraham, N.E.; Bramante, C.; Jones, E.K.; Greising, S.; Yu, M.; Steffen, B.T.; et al. Development of a proteomic signature associated with severe disease for patients with COVID-19 using data from five multicenter, randomized, controlled, and prospective studies. Sci. Rep. 2023, 13, 20315. [Google Scholar] [CrossRef]
Redondo-Calvo, F.J.; Rabanal-Ruíz, Y.; Verdugo-Moreno, G.; Bejarano, N.; Bodoque-Villar, R.; Durán-Prado, M.; Illescas, S.; Chicano-Gálvez, E.; Gómez-Romero, F.J.; Martínez-Alarcón, J.; et al. Longitudinal assessment of nasopharyngeal biomarkers post-COVID-19: Unveiling persistent markers and severity correlations. J. Proteome Res. 2024, 23, 5064–5084. [Google Scholar] [CrossRef] [PubMed]
Harriott, N.C.; Ryan, A.L. Proteomic profiling identifies biomarkers of COVID-19 severity. Heliyon 2024, 10, e23320. [Google Scholar] [CrossRef]
Gabarre, P.; Dumas, G.; Dupont, T.; Darmon, M.; Azoulay, E.; Zafrani, L. Acute kidney injury in critically ill patients with COVID-19. Intensive Care Med. 2020, 46, 1339–1348. [Google Scholar] [CrossRef]
Quiroga, S.A.; Hernández, C.; Castañeda, S.; Jimenez, P.; Vega, L.; Gomez, M.; Martinez, D.; Ballesteros, N.; Muñoz, M.; Cifuentes, C.; et al. Contrasting SARS-CoV-2 RNA copies and clinical symptoms in a large cohort of Colombian patients during the first wave of the COVID-19 pandemic. Ann. Clin. Microbiol. Antimicrob. 2021, 20, 39. [Google Scholar] [CrossRef]
Choudhary, S.; Sreenivasulu, K.; Mitra, P.; Misra, S.; Sharma, P. Role of genetic variants and gene expression in the susceptibility and severity of COVID-19. Ann. Lab. Med. 2021, 41, 129–138. [Google Scholar] [CrossRef]
Bajo-Morales, J.; Castillo-Secilla, D.; Herrera, L.J.; Caba, O.; Prados, J.C.; Rojas, I. Predicting COVID-19 Severity Integrating RNA-Seq Data Using Machine Learning Techniques. Curr. Bioinform. 2023, 18, 221–231. [Google Scholar] [CrossRef]
Paysan-Lafosse, T.; Andreeva, A.; Blum, M.; Chuguransky, S.R.; Grego, T.; Pinto, B.L.; Salazar, G.A.; Bileschi, M.L.; Llinares-López, F.; Meng-Papaxanthos, L.; et al. The Pfam protein families database: Embracing AI/ML. Nucleic Acids Res. 2025, 53, D523–D534. [Google Scholar] [CrossRef]
UniProt Consortium. UniProt: The Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609–D617. [Google Scholar] [CrossRef]
Burley, S.K.; Piehl, D.W.; Vallat, B.; Zardecki, C. RCSB Protein Data Bank: Supporting research and education worldwide through explorations of experimentally determined and computationally predicted atomic level 3D biostructures. IUCrJ 2024, 11, 279–286. [Google Scholar] [CrossRef]
Kagaya, Y.; Flannery, S.T.; Jain, A.; Kihara, D. ContactPFP: Protein function prediction using predicted contact information. Front. Bioinform. 2022, 2, 896295. [Google Scholar] [CrossRef]
The Gene Ontology Consortium; Aleksander, S.A.; Balhoff, J.; Carbon, S.; Cherry, J.M.; Drabkin, H.J.; Ebert, D.; Feuermann, M.; Gaudet, P.; Harris, N.L.; et al. The Gene Ontology knowledgebase in 2023. Genetics 2023, 224, iyad031. [Google Scholar] [CrossRef]
Larkin, M.A.; Blackshields, G.; Brown, N.P.; Chenna, R.; McGettigan, P.A.; McWilliam, H.; Valentin, F.; Wallace, I.M.; Wilm, A.; Lopez, R.; et al. Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23, 2947–2948. [Google Scholar] [CrossRef]
Notredame, C.; Higgins, D.G.; Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000, 302, 205–217. [Google Scholar] [CrossRef]
Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32, 1792–1797. [Google Scholar] [CrossRef] [PubMed]
Clough, E.; Barrett, T.; Wilhite, S.E.; Ledoux, P.; Evangelista, C.; Kim, I.F.; Tomashevsky, M.; Marshall, K.A.; Phillippy, K.H.; Sherman, P.M.; et al. NCBI GEO: Archive for gene expression and epigenomics data sets: 23-year update. Nucleic Acids Res. 2024, 52, D138–D144. [Google Scholar] [CrossRef] [PubMed]
Mick, E.; Kamm, J.; Pisco, A.O.; Ratnasiri, K.; Babik, J.M.; Castañeda, G.; DeRisi, J.L.; Detweiler, A.M.; Hao, S.L.; Kangelaris, K.N.; et al. Upper airway gene expression reveals suppressed immune responses to SARS-CoV-2 compared with other respiratory viruses. Nat. Commun. 2020, 11, 5854. [Google Scholar] [CrossRef]
Lieberman, N.A.P.; Peddu, V.; Xie, H.; Shrestha, L.; Huang, M.L.; Mears, M.C.; Cajimat, M.N.; Bente, D.A.; Shi, P.Y.; Bovier, F.; et al. In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age. PLoS Biol. 2020, 18, e3000849. [Google Scholar] [CrossRef]
Jain, R.; Ramaswamy, S.; Harilal, D.; Uddin, M.; Loney, T.; Nowotny, N.; Alsuwaidi, H.; Varghese, R.; Deesi, Z.; Alkhajeh, A.; et al. Host transcriptomic profiling of COVID-19 patients with mild, moderate, and severe clinical outcomes. Comput. Struct. Biotechnol. J. 2020, 19, 153–160. [Google Scholar] [CrossRef]
Castillo-Secilla, D.; Gálvez, J.M.; Carrillo-Perez, F.; Verona-Almeida, M.; Redondo-Sánchez, D.; Ortuño, F.M.; Herrera, L.J.; Rojas, I. KnowSeq R-Bioc package: The automatic smart gene expression tool for retrieving relevant biological knowledge. Comput. Biol. Med. 2021, 133, 104387. [Google Scholar] [CrossRef]
Dallah, D.; Sulieman, H.; Zaatreh, A.A.; Kamalov, F. Empirical evaluation of the relative range for detecting outliers. Entropy 2025, 27, 731. [Google Scholar] [CrossRef]
Lazar, C.; Meganck, S.; Taminau, J.; Steenhoff, D.; Coletta, A.; Molter, C.; Weiss-Solís, D.Y.; Duque, R.; Bersini, H.; Nowé, A. Batch effect removal methods for microarray gene expression data integration: A survey. Brief. Bioinform. 2013, 14, 469–490. [Google Scholar] [CrossRef]
Howe, K.L.; Achuthan, P.; Allen, J.; Allen, J.; Alvarez-Jarreta, J.; Amode, M.R.; Armean, I.M.; Azov, A.G.; Bennett, R.; Bhai, J.; et al. Ensembl 2021. Nucleic Acids Res. 2021, 49, D884–D891. [Google Scholar] [CrossRef]
Skrodzki, M.; van Geffen, H.; Chaves-de-Plaza, N.F.; Höllt, T.; Eisemann, E.; Hildebrandt, K. Accelerating hyperbolic t-SNE. IEEE Trans. Vis. Comput. Graph. 2024, 30, 4403–4415. [Google Scholar] [CrossRef] [PubMed]
McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform manifold approximation and projection. J. Open Source Softw. 2018, 3, 861. [Google Scholar] [CrossRef]
Krijthe, J.H. R Package, version 0.17; Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes–Hut Implementation. 2015. Available online: https://github.com/jkrijthe/Rtsne (accessed on 12 November 2025).
De Jay, N.; Papillon-Cavanagh, S.; Olsen, C.; El-Hachem, N.; Bontempi, G.; Haibe-Kains, B. mRMRe: An R package for parallelized Minimum Redundancy Maximum Relevance feature selection. Bioinformatics 2013, 29, 2365–2368. [Google Scholar] [CrossRef] [PubMed]
Mittal, M.; Gujjar, P.J.; Guru Prasad, M.S.; Devadas, R.M.; Ambreen, L.; Kumar, V. Dimensionality reduction using UMAP and t-SNE technique. In Proceedings of the Second International Conference on Advances in Information Technology (ICAIT 2024), Chikkamagaluru, India, 24–27 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
Vitting-Seerup, K. Most protein domains exist as variants with distinct functions across cells, tissues and diseases. NAR Genom. Bioinform. 2023, 5, lqad084. [Google Scholar] [CrossRef]
Zhou, Y.; Liu, Y.; Gupta, S.; Paramo, M.I.; Hou, Y.; Mao, C.; Luo, Y.; Judd, J.; Wierbowski, S.; Bertolotti, M.; et al. A comprehensive SARS-CoV-2–human protein–protein interactome reveals COVID-19 pathobiology and potential host therapeutic targets. Nat. Biotechnol. 2023, 41, 128–139. [Google Scholar] [CrossRef]
Savojardo, C.; Babbi, G.; Martelli, P.L.; Casadio, R. Mapping OMIM disease-related variations on protein domains reveals an association among variation type, Pfam models, and disease classes. Front. Mol. Biosci. 2021, 8, 617016. [Google Scholar] [CrossRef]
Kim, D.K.; Weller, B.; Lin, C.W.; Sheykhkarimli, D.; Knapp, J.J.; Dugied, G.; Zanzoni, A.; Pons, C.; Tofaute, M.J.; Maseko, S.B.; et al. A proteome-scale map of the SARS-CoV-2–human contactome. Nat. Biotechnol. 2023, 41, 140–149. [Google Scholar] [CrossRef]
Dyakov, I.N.; Mavletova, D.A.; Chernyshova, I.N.; Snegireva, N.A.; Gavrilova, M.V.; Bushkova, K.K.; Dyachkova, M.S.; Alekseeva, M.G.; Danilenko, V.N. FN3 protein fragment containing two type III fibronectin domains from B. longum GT15 binds to human tumor necrosis factor alpha in vitro. Anaerobe 2020, 65, 102247. [Google Scholar] [CrossRef]
Walls, A.C.; Park, Y.J.; Tortorici, M.A.; Wall, A.; McGuire, A.T.; Veesler, D. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell 2020, 183, 1735. [Google Scholar] [CrossRef] [PubMed]
Arai, M.; Suetaka, S.; Ooka, K. Dynamics and interactions of intrinsically disordered proteins. Curr. Opin. Struct. Biol. 2024, 84, 102734. [Google Scholar] [CrossRef] [PubMed]
Shah, P.S.; Beesabathuni, N.S.; Fishburn, A.T.; Kenaston, M.W.; Minami, S.A.; Pham, O.H.; Tucker, I. Systems biology of virus–host protein interactions: From hypothesis generation to mechanisms of replication and pathogenesis. Annu. Rev. Virol. 2022, 9, 397–415. [Google Scholar] [CrossRef]

Figure 1. INPROF Web Server scheme is composed of: (i) Web Interface; (ii) Apache/PHP Web Server; (iii) Protein Feature Calculation (Perl module); (iv) Protein Feature Database.

Figure 2. INPROF pipeline for COVID-19 severity classification. The workflow includes: (1) data extraction from GEO; (2) preprocessing with outlier removal, normalization, and batch effect correction; (3) patient-level DEGs computation; (4) INPROF feature extraction (highlighted in green as the main methodological contribution), where DEGs are mapped to canonical proteins to generate INPROF; and (5) feature selection and SVM classification using mRMR ranking, t-SNE visualization, and LOOCV.

Figure 3. Dimensionality reduction algorithms t-SNE (left) and UMAP (right) applied to 44 INPROF for COVID-severity dataset.

Figure 4. Evolution of individual SVM classifier performance across progressively selected features with mRMR using 5-fold cross-validation in training subset. Four classical classification performance metrics (Accuracy, F1-Score, Sensitivity, and Specificity) are shown.

Figure 5. Confusion matrix and performance metrics (Accuracy, F1-Score, Sensitivity and Specificity) obtained with SVM and the top 10 ranked features applying a LOOCV evaluation scheme.

Figure 6. Distribution of the top six INPROF-derived features distinguishing COVID-19 severity groups. Boxplots show the variation in the six most informative features (SEQ_PL, SEQ_CK, SEQ_VA, SEQ_MX, SEQ_CT and SEQ_SU) across the three clinical categories (Outpatient, Inpatient and ICU).

Table 1. Summary of INPROF by categories. A full descriptive list is provided in Supplementary Table S1.

Feature Category	Sequence-Based	MSA-Based	Total
Sequence Statistics	5	3	8
Amino Acid Type	5	5	10
Functional Domains	6	2	8
Secondary Structure	4	5	9
Tertiary Structure	4	2	6
Ontological Terms	5	0	5
Total	29	17	46

Table 2. Average classification performance obtained using the 80% training split with 5-fold cross-validation, evaluated for selected numbers of top-ranked features. Values represent mean ± standard deviation (across folds in training).

Training (80%)—5-Fold CV
#	Accuracy	Sensitivity	Specificity	F1-Score
2	0.937 ± 0.021	0.761 ± 0.181	0.893 ± 0.085	0.515 ± 0.471
8	0.929 ± 0.052	0.789 ± 0.168	0.904 ± 0.073	0.677 ± 0.383
10	0.936 ± 0.054	0.855 ± 0.049	0.926 ± 0.029	0.854 ± 0.060
14	0.976 ± 0.022	0.900 ± 0.091	0.954 ± 0.050	0.928 ± 0.065
Test (20%)
#	Accuracy	Sensitivity	Specificity	F1-Score
2	0.910	0.654	0.878	0.648
8	0.879	0.642	0.867	0.642
10	0.879	0.642	0.867	0.642
14	0.910	0.654	0.878	0.648

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

El-Awadi, R.; Gomez, O.D.; Castillo-Secilla, D.; Torres, C.; Herrera, L.J.; Rojas, I.; Ortuño, F.M. Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity. Biomedicines 2026, 14, 378. https://doi.org/10.3390/biomedicines14020378

AMA Style

El-Awadi R, Gomez OD, Castillo-Secilla D, Torres C, Herrera LJ, Rojas I, Ortuño FM. Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity. Biomedicines. 2026; 14(2):378. https://doi.org/10.3390/biomedicines14020378

Chicago/Turabian Style

El-Awadi, Radwa, Oscar D. Gomez, Daniel Castillo-Secilla, Carolina Torres, Luis J. Herrera, Ignacio Rojas, and Francisco M. Ortuño. 2026. "Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity" Biomedicines 14, no. 2: 378. https://doi.org/10.3390/biomedicines14020378

APA Style

El-Awadi, R., Gomez, O. D., Castillo-Secilla, D., Torres, C., Herrera, L. J., Rojas, I., & Ortuño, F. M. (2026). Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity. Biomedicines, 14(2), 378. https://doi.org/10.3390/biomedicines14020378

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Interrelational Proteomic Sequence Features Enhance Predictive Modeling: Application to COVID-19 Severity

Abstract

1. Introduction

2. Materials and Methods

2.1. Implementation and Usage

2.1.1. Protein Features

2.1.2. Web Interface

2.1.3. Apache Web Server

2.1.4. Metrics Calculation (Perl Module)

2.1.5. Protein Feature Database

2.2. Case Study: COVID-19 Severity Classification

2.2.1. Data Acquisition

2.2.2. Differential Expression Analysis

2.2.3. Retrieval of Canonical Proteins

2.2.4. INPROF Feature Extraction

2.2.5. Feature Selection

2.2.6. Classification Model

3. Results

4. Discussion

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI