MDPI - Publisher of Open Access Journals

21 pages, 4191 KB

Open Access Article

Classifying Protein-DNA/RNA Interactions Using Interpolation-Based Encoding and Highlighting Physicochemical Properties via Machine Learning

by Jesús Guadalupe Cabello-Lima, Patricio Adrián Zapata-Morín and Juan Horacio Espinoza-Rodríguez

Information 2025, 16(11), 947; https://doi.org/10.3390/info16110947 (registering DOI) - 1 Nov 2025

Abstract

Protein–DNA and protein–RNA interactions are central to gene regulation and genetic disease, yet experimental identification remains costly and complex. Machine learning (ML) offers an efficient alternative, though challenges persist in representing protein sequences due to residue variability, dimensionality issues, and the risk of losing biological context. Traditional approaches such as k-mer counting or neural network encodings provide standardized sequence representations but often demand high computational resources and may obscure functional information. To address these limitations, a novel encoding method based on interpolation of physicochemical properties (PCPs) is introduced. Discrete PCP values are transformed into continuous functions using logarithmic enhancement, highlighting residues that contribute most to nucleic acid interactions while preserving biological relevance across variable sequence lengths. Statistical features extracted from the resulting spectra via Tsfresh are then used for binary classification of DNA- and RNA-binding proteins. Six classifiers were evaluated, and the proposed method achieved up to 99% accuracy, precision, recall, and F1 score when amino acid highlighting was applied, compared with 66% without highlighting. Benchmarking against k-mer and neural network approaches confirmed superior efficiency and reliability, underscoring the potential of this method for protein interaction prediction. The framework may be extended to multiclass problems and applied to the study of protein variants, offering a scalable tool for broader protein interaction prediction.
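As a rough illustration of the interpolation idea in this abstract, the sketch below maps a protein sequence to one physicochemical profile and resamples it to a fixed number of points, so that sequences of different lengths yield vectors of equal size. It is an assumption-laden sketch, not the authors' method: the Kyte-Doolittle hydropathy scale is chosen here only as an example PCP, plain linear interpolation stands in for whatever interpolation the paper uses, and the logarithmic highlighting step and the Tsfresh feature extraction are not reproduced. The function names `pcp_profile` and `resample_linear` are hypothetical.

```python
# Kyte-Doolittle hydropathy index, used here only as an example PCP (assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pcp_profile(sequence, scale=KYTE_DOOLITTLE):
    """Map each residue to its property value (one discrete value per position)."""
    return [scale[aa] for aa in sequence.upper()]


def resample_linear(values, n_points):
    """Linearly interpolate `values` onto n_points evenly spaced positions."""
    if not values or n_points < 1:
        raise ValueError("need at least one value and one output point")
    if len(values) == 1 or n_points == 1:
        return [values[0]] * n_points
    step = (len(values) - 1) / (n_points - 1)
    out = []
    for i in range(n_points):
        x = i * step
        lo = min(int(x), len(values) - 2)  # left neighbour, clamped at the end
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


if __name__ == "__main__":
    # Two sequences of different length end up as vectors of the same size.
    for seq in ("MKVLA", "MKVLAAGIRST"):
        print(len(resample_linear(pcp_profile(seq), 16)))
```

Fixed-length profiles of this kind could then be passed to any feature extractor or classifier; which statistics are informative is a question the abstract leaves to the full article.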

(This article belongs to the Special Issue Applications of Deep Learning in Bioinformatics and Image Processing)

22 pages, 1926 KB

Open Access Review

Biological Sequence Representation Methods and Recent Advances: A Review

by Hongwei Zhang, Yan Shi, Yapeng Wang, Xu Yang, Kefeng Li, Sio-Kei Im and Yu Han

Biology 2025, 14(9), 1137; https://doi.org/10.3390/biology14091137 - 27 Aug 2025

Cited by 1 | Viewed by 1112

Abstract

Biological-sequence representation methods are pivotal for advancing machine learning in computational biology, transforming nucleotide and protein sequences into formats that enhance predictive modeling and downstream task performance. This review categorizes these methods into three developmental stages: computational-based, word embedding-based, and large language model (LLM)-based, detailing their principles, applications, and limitations. Computational-based methods, such as k-mer counting and position-specific scoring matrices (PSSM), extract statistical and evolutionary patterns to support tasks like motif discovery and protein–protein interaction prediction. Word embedding-based approaches, including Word2Vec and GloVe, capture contextual relationships, enabling robust sequence classification and regulatory element identification. Advanced LLM-based methods, leveraging Transformer architectures like ESM3 and RNAErnie, model long-range dependencies for RNA structure prediction and cross-modal analysis, achieving superior accuracy. However, challenges persist, including computational complexity, sensitivity to data quality, and limited interpretability of high-dimensional embeddings. Future directions prioritize integrating multimodal data (e.g., sequences, structures, and functional annotations), employing sparse attention mechanisms to enhance efficiency, and leveraging explainable AI to bridge embeddings with biological insights. These advancements promise transformative applications in drug discovery, disease prediction, and genomics, empowering computational biology with robust, interpretable tools.

(This article belongs to the Special Issue Machine Learning Applications in Biology—2nd Edition)

13 pages, 1385 KB

Open Access Article

HPTAS: An Alignment-Free Haplotype Phasing Algorithm Focused on Allele-Specific Studies Using Transcriptome Data

by Jianan Wang, Zhenyuan Sun, Guohua Wang and Yan Miao

Int. J. Mol. Sci. 2025, 26(12), 5700; https://doi.org/10.3390/ijms26125700 - 13 Jun 2025

Viewed by 794

Abstract

Haplotype phasing refers to determining the haplotype sequences inherited from each parent in a diploid organism. It is a critical process for various downstream analyses, and numerous haplotype phasing methods for genomic single nucleotide polymorphisms (SNPs) have been developed. Allele-specific (AS) expression and alternative splicing play key roles in diverse biological processes. AS studies usually focus more on exonic SNPs, and multiple phased SNPs need to be combined to obtain better inferences. In this paper, we introduce HPTAS, an alignment-free algorithm for haplotype phasing in AS studies. Instead of using sequence alignment to count the number of reads covering SNPs, HPTAS constructs a mapping structure from transcriptome annotations and SNPs and employs a k-mer-based approach to derive phasing counts from RNA-seq data. Using both next-generation sequencing (NGS) and third-generation sequencing (TGS) NA12878 RNA-seq data and comparing with the most advanced algorithm in the field, we have demonstrated that HPTAS achieves high phasing accuracy and performance and that transcriptome data indeed facilitate the phasing of exonic SNPs. With the continued advancement of sequencing technology and the improvement in transcriptome annotations, HPTAS may serve as a foundation for future haplotype phasing methods.

(This article belongs to the Special Issue New Computational Methodologies for Biomolecule Sequence, Structure and Function Discovery)

15 pages, 7678 KB

Open Access Article

Predicting Salmonella MIC and Deciphering Genomic Determinants of Antibiotic Resistance and Susceptibility

by Moses B. Ayoola, Athish Ram Das, B. Santhana Krishnan, David R. Smith, Bindu Nanduri and Mahalingam Ramkumar

Microorganisms 2024, 12(1), 134; https://doi.org/10.3390/microorganisms12010134 - 10 Jan 2024

Cited by 4 | Viewed by 3734

Abstract

Salmonella spp., a leading cause of foodborne illness, is a formidable global menace due to escalating antimicrobial resistance (AMR). The evaluation of minimum inhibitory concentration (MIC) for antimicrobials is critical for characterizing AMR. The current whole genome sequencing (WGS)-based approaches for predicting MIC are hindered by both computational and feature identification constraints. We propose an innovative methodology called the “Genome Feature Extractor Pipeline” that integrates traditional machine learning (random forest, RF) with deep learning models (multilayer perceptron (MLP) and DeepLift) for WGS-based MIC prediction. We used a dataset from the National Antimicrobial Resistance Monitoring System (NARMS), comprising 4500 assembled genomes of nontyphoidal Salmonella, each annotated with MIC metadata for 15 antibiotics. Our pipeline involves the batch downloading of annotated genomes, the determination of feature importance using RF, Gini-index-based selection of crucial 10-mers, and their expansion to 20-mers. This is followed by an MLP network, with four hidden layers of 1024 neurons each, to predict MIC values. Using DeepLift, key 20-mers and associated genes influencing MIC are identified. The 10 most significant 20-mers for each antibiotic are listed, showcasing our ability to discern genomic features affecting Salmonella MIC prediction with enhanced precision. The methodology replaces binary indicators with k-mer counts, offering a more nuanced analysis. The combination of RF and MLP addresses the limitations of the existing WGS approach, providing a robust and efficient method for predicting MIC values in Salmonella that could potentially be applied to other pathogens.

(This article belongs to the Special Issue Antimicrobial Resistance in the Food Chain)

16 pages, 8774 KB

Open Access Article

KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis

by Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin and Juan Fu

Algorithms 2022, 15(4), 107; https://doi.org/10.3390/a15040107 - 24 Mar 2022

Viewed by 3448

Abstract

Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their best applicable scenarios and potential improvements using multiple hardware contexts and datasets. Results: KMC3 uses less memory and runs faster than CHTKC on a regular configuration server. CHTKC is efficient on high-performance computing platforms with high available memory, multithreading, and low IO bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 in counting tasks with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by IO bandwidth, and decreasing the influence of the IO bottleneck is critical, as our tests show improvement by filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for k-mer counting on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing the k-mer counting algorithm, and filtering and compressing low-frequency k-mers is critical in relieving IO impact.

(This article belongs to the Special Issue Performance Optimization and Performance Evaluation)

12 pages, 967 KB

Open Access Article

Genome-Wide Mutation Scoring for Machine-Learning-Based Antimicrobial Resistance Prediction

by Peter Májek, Lukas Lüftinger, Stephan Beisken, Thomas Rattei and Arne Materna

Int. J. Mol. Sci. 2021, 22(23), 13049; https://doi.org/10.3390/ijms222313049 - 2 Dec 2021

Cited by 19 | Viewed by 4291

Abstract

The prediction of antimicrobial resistance (AMR) based on genomic information can improve patient outcomes. Genetic mechanisms have been shown to explain AMR with accuracies in line with standard microbiology laboratory testing. To translate genetic mechanisms into phenotypic AMR, machine learning has been successfully applied. AMR machine learning models typically use nucleotide k-mer counts to represent genomic sequences. While k-mer representation efficiently captures sequence variation, it also results in high-dimensional and sparse data. With limited training data available, achieving acceptable model performance or model interpretability is challenging. In this study, we explore the utility of feature engineering with several biologically relevant signals. We propose to predict the functional impact of observed mutations with PROVEAN to use the predicted impact as a new feature for each protein in an organism’s proteome. The addition of the new features was tested on a total of 19,521 isolates across nine clinically relevant pathogens and 30 different antibiotics. The new features significantly improved the predictive performance of trained AMR models for Pseudomonas aeruginosa, Citrobacter freundii, and Escherichia coli. The balanced accuracy of the respective models of those three pathogens improved by 6.0% on average.

(This article belongs to the Special Issue Microbioinformatics)

15 pages, 3959 KB

Open Access Article

K-mer Content Changes with Node Degree in Promoter–Enhancer Network of Mouse ES Cells

by Kinga Szyman, Bartek Wilczyński and Michał Dąbrowski

Int. J. Mol. Sci. 2021, 22(15), 8067; https://doi.org/10.3390/ijms22158067 - 28 Jul 2021

Viewed by 2330

Abstract

Maps of Hi-C contacts between promoters and enhancers can be analyzed as networks, with cis-regulatory regions as nodes and their interactions as edges. Using the published promoter–enhancer network of mouse embryonic stem (ES) cells, we checked whether differences in node type (promoter or enhancer) and node degree (number of regions interacting with a given promoter or enhancer) are reflected in the sequence composition or sequence similarity of the interacting nodes. We used counts of all k-mers (k = 4) to analyze the sequence composition and the Euclidean distance between the k-mer count vectors (k-mer distance) as the measure of sequence (dis)similarity. The results we obtained with 4-mers are interpretable in terms of dinucleotides. Promoters are GC-rich as compared to enhancers, which is known. Enhancers are enriched in scaffold/matrix attachment region (S/MAR) patterns and depleted of CpGs. Furthermore, we show that promoters are more similar to their interacting enhancers than vice versa. Most notably, in both promoters and enhancers, the GC content and the CpG count increase with the node degree. As a consequence, enhancers of higher node degree become more similar to promoters, whereas higher degree promoters become less similar to enhancers. We also confirmed the key results for human keratinocytes.
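The k-mer distance mentioned in this abstract can be sketched as follows: count every possible 4-mer in each sequence and take the Euclidean distance between the two count vectors. This is a minimal sketch under stated assumptions. The function names are hypothetical, windows containing characters outside ACGT (such as N) are simply skipped, and no normalization for sequence length or strand is applied; the abstract does not say how the authors handle any of these details.

```python
from itertools import product
from math import sqrt


def kmer_count_vector(seq, k=4, alphabet="ACGT"):
    """Return counts of all len(alphabet)**k possible k-mers, in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows with ambiguous bases such as N
            counts[pos] += 1
    return counts


def kmer_distance(seq_a, seq_b, k=4):
    """Euclidean distance between the k-mer count vectors of two sequences."""
    va = kmer_count_vector(seq_a, k)
    vb = kmer_count_vector(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


if __name__ == "__main__":
    promoter = "GCGCGGCGCCGCGGGC"  # made-up GC-rich example
    enhancer = "ATATTTAAATATTTAA"  # made-up AT-rich example
    print(kmer_distance(promoter, enhancer))
```

Because raw counts grow with sequence length, comparing regions of very different sizes would probably call for dividing each vector by its total before taking the distance.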

(This article belongs to the Special Issue Functions of Non-coding DNA Regions)

91 pages, 1773 KB

Open Access Article

Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights

by Taha ValizadehAslani, Zhengqiao Zhao, Bahrad A. Sokhansanj and Gail L. Rosen

Biology 2020, 9(11), 365; https://doi.org/10.3390/biology9110365 - 28 Oct 2020

Cited by 29 | Viewed by 7678

Abstract

Machine learning algorithms can learn mechanisms of antimicrobial resistance from DNA sequence data without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.
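Counting amino acid k-mers (oligopeptides), as described in this abstract, might look roughly like the sketch below: each protein sequence becomes a table of overlapping k-mer counts, and a set of proteins becomes a count matrix over a shared vocabulary. The function names, the choice k = 3, and the sorted-vocabulary layout are assumptions for illustration only; gene calling, translation, and the authors' feature selection pipeline are outside its scope.

```python
from collections import Counter


def aa_kmer_counts(protein, k=3):
    """Count overlapping amino acid k-mers in one protein sequence."""
    protein = protein.upper()
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


def feature_matrix(proteins, k=3):
    """Build a shared k-mer vocabulary and one row of counts per protein."""
    counts = [aa_kmer_counts(p, k) for p in proteins]
    vocab = sorted(set().union(*counts))  # union over the Counter keys
    rows = [[c[kmer] for kmer in vocab] for c in counts]
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = feature_matrix(["MKVMKV", "MKVA"], k=3)
    print(vocab)
    for row in rows:
        print(row)
```

Each column of the matrix corresponds to a readable peptide fragment, which is one reason a model trained on such features can be easier to interpret than one trained on nucleotide k-mers.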

(This article belongs to the Special Issue Computational Biology)

14 pages, 496 KB

Open Access Article

Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method

by Yuanlin Ma, Zuguo Yu, Runbin Tang, Xianhua Xie, Guosheng Han and Vo V. Anh

Entropy 2020, 22(2), 255; https://doi.org/10.3390/e22020255 - 23 Feb 2020

Cited by 16 | Viewed by 4310

Abstract

HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson–Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.

(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)

14 pages, 477 KB

Open Access Article

Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities

by Wolfgang Kaisers, Holger Schwender and Heiner Schaal

Int. J. Mol. Sci. 2018, 19(11), 3687; https://doi.org/10.3390/ijms19113687 - 21 Nov 2018

Cited by 6 | Viewed by 5336

Abstract

We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. In order to provide a simple, applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicates the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
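To make the clustering step in this abstract concrete, the sketch below runs a small agglomerative clustering over per-sample k-mer count profiles and reports the order in which samples are merged. It is not the seqTools implementation, which is an R package: Python is used here only for illustration, the average-linkage rule and Euclidean distance are assumptions, and each Fastq file is assumed to have already been summarized as one count vector. The function name `average_linkage` is hypothetical.

```python
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length count profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    Returns the merges in order as (members_a, members_b, distance),
    where members are tuples of sample indices into `profiles`.
    """
    def linkage(pair):
        a, b = pair
        total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


if __name__ == "__main__":
    # Toy profiles: samples 0 and 1 resemble each other, as do 2 and 3.
    profiles = [[10, 0], [9, 1], [0, 10], [1, 9]]
    for a, b, d in average_linkage(profiles):
        print(a, b, round(d, 3))
```

If samples merge by preparation batch before they merge by experimental group, that ordering is the kind of signal the abstract describes as a possible batch effect.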

(This article belongs to the Section Biochemistry)

Search Results (10)