Biological Sequence Representation Methods and Recent Advances: A Review

Zhang, Hongwei; Shi, Yan; Wang, Yapeng; Yang, Xu; Li, Kefeng; Im, Sio-Kei; Han, Yu

doi:10.3390/biology14091137

Open Access Review

by Hongwei Zhang ¹, Yan Shi ²,*, Yapeng Wang ¹,*, Xu Yang ¹, Kefeng Li ³, Sio-Kei Im ¹ and Yu Han ⁴

¹ Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China
² State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
³ Center for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China
⁴ Faculty of Civil Engineering, Southwest Forestry University, Kunming 650224, China
* Authors to whom correspondence should be addressed.

Biology 2025, 14(9), 1137; https://doi.org/10.3390/biology14091137

Submission received: 14 July 2025 / Revised: 10 August 2025 / Accepted: 21 August 2025 / Published: 27 August 2025

(This article belongs to the Special Issue Machine Learning Applications in Biology—2nd Edition)

Simple Summary

This paper examines how biological sequences (protein and nucleotide sequences) are converted into formats that computers can process, with the aim of advancing computational biology research. Our goal is to explain the principles and technical details of these methods and to illustrate their application areas, advantages, and disadvantages. We review early techniques that relied on counting patterns, later methods that borrowed language-processing techniques to capture context, and recent state-of-the-art methods based on large models. Our review shows that newer methods are more accurate but require more computing power. The development and improvement of these methods will help scientists design effective drugs, predict diseases, and reveal the connection between genetic material and proteins. As a foundational survey, this review offers practical guidance for researchers in computational biology, especially those new to the field.

Abstract

Biological-sequence representation methods are pivotal for advancing machine learning in computational biology, transforming nucleotide and protein sequences into formats that enhance predictive modeling and downstream task performance. This review categorizes these methods into three developmental stages: computational-based, word embedding-based, and large language model (LLM)-based, detailing their principles, applications, and limitations. Computational-based methods, such as k-mer counting and position-specific scoring matrices (PSSM), extract statistical and evolutionary patterns to support tasks like motif discovery and protein–protein interaction prediction. Word embedding-based approaches, including Word2Vec and GloVe, capture contextual relationships, enabling robust sequence classification and regulatory element identification. Advanced LLM-based methods, leveraging Transformer architectures like ESM3 and RNAErnie, model long-range dependencies for RNA structure prediction and cross-modal analysis, achieving superior accuracy. However, challenges persist, including computational complexity, sensitivity to data quality, and limited interpretability of high-dimensional embeddings. Future directions prioritize integrating multimodal data (e.g., sequences, structures, and functional annotations), employing sparse attention mechanisms to enhance efficiency, and leveraging explainable AI to bridge embeddings with biological insights. These advancements promise transformative applications in drug discovery, disease prediction, and genomics, empowering computational biology with robust, interpretable tools.

Keywords:

biological sequence; computational; word embedding; large language model; machine learning

1. Introduction

The primary aim of biological-sequence representation methods is to convert nucleotide and protein sequences into formats that can be interpreted by computing systems. These methods form the backbone of computational biology, enabling efficient processing and in-depth analysis of complex biological data. The emergence of large-scale genome sequencing, especially since the Human Genome Project [1,2,3,4], has generated vast and complex datasets, and using these datasets to uncover new biological insights is a significant and influential effort [5,6,7,8,9,10,11]. Biological-sequence representation methods capture sequence information along multiple dimensions, including statistical patterns, physical and chemical properties, structural features, contextual relationships between nucleotides and amino acids, and long-range dependencies. By providing a comprehensive and robust framework for data representation, these methods lay a solid foundation for downstream machine learning applications, thereby advancing biomedical research.

The evolution of sequence representation methods can be categorized into three developmental stages. Early computational-based methods, such as k-mer analysis and position-specific scoring matrices (PSSM), extracted statistical features (e.g., nucleotide composition), physicochemical properties (e.g., hydrophobicity, polarity, and charge of proteins), and evolutionary features, and were often paired with shallow machine learning models such as support vector machine (SVM) [12] and random forest (RF) [13] for genome assembly, structure prediction, and protein–protein interaction (PPI) prediction [14,15,16]. Word embedding-based methods, such as Word2Vec and ProtVec, are combined with deep learning models such as the convolutional neural network (CNN) [17] and long short-term memory (LSTM) [18] to capture contextual relationships for sequence classification and protein function annotation [19,20]. Recent large language model (LLM)-based methods employ attention mechanisms [21] and, together with models like AlphaFold3 and RoseTTAFold All-Atom, model complex sequence–structure–function relationships for structure prediction and functional annotation [22,23]. These methods achieve high accuracy but come with increased computational demands. Each stage addresses diverse biological tasks by meeting specific representation needs: computational-based methods excel in efficient local pattern capture for tasks like sequence classification, word embedding-based methods model contextual relationships for functional annotation, and LLM-based methods capture long-range dependencies for complex tasks like 3D structure prediction.

Despite these advancements, challenges persist, including computational complexity, data quality, and model interpretability. Figure 1 illustrates a machine learning framework for biological data analysis, outlining the stages of data encoding, preprocessing, model training, and evaluation.

This review systematically traces the development of these methods, highlighting their principles, applications, and limitations. Subsequent sections detail computational-based methods (Section 2), word embedding-based methods (Section 3), LLM-based techniques (Section 4), challenges and future directions (Section 5), and conclude with insights into their transformative potential (Section 6).

2. Computational-Based Methods

Computational-based methods represent the earliest stage of biological-sequence representation, focusing on the extraction of statistical, physicochemical, and structural features from nucleotide and protein sequences. This review covers thirteen commonly used methods, categorized into five groups as shown in Table 1.

2.1. K-Mer-Based Methods

The k-mer-based methods, a cornerstone of computational-based methods, transform biological sequences into numerical vectors by counting k-mer frequencies. These methods capture local sequence patterns through statistical analysis of contiguous and gapped k-mers [24,25].

2.1.1. Overview

k-mer methods encode biological sequences by counting the frequencies of k-mers, producing vectors with dimensions determined by the sequence alphabet size ($|\Sigma| = 4$ for nucleotides, $|\Sigma| = 20$ for proteins) and the $k$ value. For example, nucleotide sequences yield 4-dimensional vectors for mononucleotide composition (MNC) ($k = 1$), 16-dimensional vectors for dinucleotide composition (DNC) ($k = 2$), and 64-dimensional vectors for trinucleotide composition (TNC) ($k = 3$), while protein sequences produce 20, 400, and 8000 dimensions for amino acid composition (AAC), dipeptide composition (DPC), and tripeptide composition (TPC), respectively. Extending this approach, gapped k-mer [26] methods introduce gaps within subsequences, enabling the capture of non-contiguous patterns critical for regulatory sequence analysis. The gkm kernel enhances this by measuring sequence similarity through gapped k-mer frequencies, using efficient tree-based data structures to manage high-dimensional feature spaces. These methods provide robust features for machine learning, with encoding length balancing predictive power and computational cost. Detailed mathematical formulations are provided in Supplementary Materials (Equations (S1)–(S4)).
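The counting scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not the formulation from the Supplementary Materials: the function names `kmer_frequencies` and `gapped_kmer_counts`, the lexicographic feature ordering, and the use of `-` as a gap symbol are all assumptions made for this example, and the gapped variant is a plain counter, not the tree-based gkm kernel.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(seq, k, alphabet=DNA):
    """Return the normalized k-mer frequency vector of seq.

    The vector has len(alphabet) ** k entries, ordered lexicographically.
    Windows containing symbols outside the alphabet are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def gapped_kmer_counts(seq, length, gap_positions):
    """Count windows of the given length with gap_positions masked as '-'.

    Simplified illustration of a gapped k-mer: masked positions match
    any symbol, so "ACG" and "ATG" both count towards "A-G".
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        masked = "".join("-" if j in gap_positions else c
                         for j, c in enumerate(window))
        counts[masked] += 1
    return counts


dnc = kmer_frequencies("ACGTAC", 2)                  # 16 entries (DNC)
dpc = kmer_frequencies("MKVLA", 2, alphabet=PROTEIN)  # 400 entries (DPC)
gapped = gapped_kmer_counts("ACGTAC", 3, {1})         # keys such as "A-G"
```

The vector length follows directly from the alphabet size and $k$, which is why the same function yields 16 entries for dinucleotides and 400 for dipeptides.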

2.1.2. Applications

Due to their simplicity, flexibility, and ability to capture biologically significant sequence patterns [27,28], k-mer-based methods excel in sequence comparison, genome assembly, sequence classification, and motif discovery by encoding the frequency of short subsequences. For nucleotide sequences, they facilitate the identification of functional sequence elements in regulatory DNA and support population genetic analyses [29,30]. Gapped k-mer methods enhance these capabilities by modeling non-adjacent sequence patterns, enabling regulatory sequence prediction (such as transcription factor binding sites) and variant effect prediction, especially for the impact of non-coding variants on gene expression and disease risk. These applications provide a new perspective on how sequence variation affects gene expression regulation and disease [31,32]. For protein sequences, gapped k-mers aid subcellular localization prediction (e.g., cytoplasm, nucleus) and functional prediction of proteins such as antioxidant proteins by capturing discontinuous physicochemical patterns [33,34]. They also support protein function prediction by identifying hemolytic or antimicrobial peptides, and PPI analysis by capturing compositional and local sequence order information [35,36]. These methods integrate readily with machine learning models, such as SVM, RF, and deep neural networks, to achieve robust predictive performance.

2.1.3. Advantages and Limitations

Their key advantages include the flexibility to adjust the $k$ value, balancing fine-grained local patterns (small $k$) with broader sequence contexts (larger $k$), and their straightforward implementation, which supports diverse computational biology applications. However, challenges include high-dimensional feature spaces, particularly for larger $k$ values or gapped k-mers, which can lead to sparsity in large-scale datasets, and parameter sensitivity (e.g., $k$ value or gap size), requiring careful optimization [28]. Feature selection or dimensionality reduction techniques, such as principal component analysis, are critical to improving computational efficiency and scalability in computational biology tasks.
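The sparsity concern can be made concrete with a small worked calculation. The sketch below is illustrative only; the helper names `feature_dimension` and `max_nonzero_fraction` are hypothetical. It relies on the fact that a sequence of length $L$ contains at most $L - k + 1$ distinct k-mers, so at most that many entries of its $|\Sigma|^{k}$-dimensional vector can be nonzero.

```python
def feature_dimension(alphabet_size, k):
    """Number of possible contiguous k-mers, i.e. the feature vector length."""
    return alphabet_size ** k


def max_nonzero_fraction(alphabet_size, k, seq_len):
    """Upper bound on the fraction of k-mer features that can be nonzero
    for a single sequence of length seq_len."""
    dim = feature_dimension(alphabet_size, k)
    windows = max(seq_len - k + 1, 0)
    return min(windows, dim) / dim


# A 300-residue protein has 298 tripeptide windows, so at most
# 298 of the 8000 TPC features can be nonzero.
bound = max_nonzero_fraction(20, 3, 300)
```

Under these assumptions, the bound shrinks quickly as $k$ grows, which is one reason feature selection or dimensionality reduction is usually applied for larger $k$.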

2.2. Group-Based Methods

Group-based methods first group sequence elements (nucleotides or amino acids) based on physicochemical properties (such as hydrophobicity, polarity, charge), analyze the position, combination, and frequency of the grouped patterns, and generate low-dimensional and biologically significant feature vectors to represent sequences. Compared with k-mer methods, group-based methods have significant advantages in dimension control, biological relevance, and computational efficiency.

2.2.1. Overview

The Composition, Transition, and Distribution (CTD) [37] method groups amino acids into three categories (polar, neutral, and hydrophobic) [38] (Supplementary Materials Table S1), producing a fixed 21-dimensional vector. This includes 3 composition features (group frequencies), 3 transition features (frequencies of switches between groups, e.g., polar to hydrophobic), and 15 distribution features (for each of the three groups, the sequence positions of its first occurrence and of 25%, 50%, 75%, and 100% of its occurrences). Another method, the Conjoint Triad (CT) [36], groups amino acids into seven categories based on properties like dipole and side chain volume and forms triads of three consecutive amino acids. This results in a 343-dimensional vector ($7^{3}$) capturing the frequency of each triad type (Supplementary Materials Figure S1). The difference between the 21 dimensions of CTD and the 343 dimensions of CT reflects a trade-off between capturing detailed sequence patterns and maintaining computational tractability, making these methods suitable for tasks requiring biologically relevant features. Detailed mathematical formulations are provided in Supplementary Materials (Equations (S5)–(S7)).
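To show where the 21 and 343 dimensions come from, the following sketch computes both descriptors. It is written under several assumptions: the three-group and seven-class amino acid assignments below are commonly used ones and may not match Table S1 or Figure S1 exactly, and the distribution features use five positional descriptors per group (first occurrence plus the 25%, 50%, 75%, and 100% marks), which is the usual way 3 × 5 = 15 is reached. The function names `ctd` and `conjoint_triad` are chosen for this example.

```python
import math

# Assumed hydrophobicity-based grouping: polar, neutral, hydrophobic.
CTD_GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")
CTD_GROUP_OF = {aa: g for g, members in enumerate(CTD_GROUPS) for aa in members}

# Assumed seven-class grouping for the conjoint triad.
CT_CLASSES = ("AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C")
CT_CLASS_OF = {aa: c for c, members in enumerate(CT_CLASSES) for aa in members}


def ctd(seq):
    """Return 21 CTD features: 3 composition, 3 transition, 15 distribution."""
    labels = [CTD_GROUP_OF[aa] for aa in seq.upper() if aa in CTD_GROUP_OF]
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two standard residues")

    # Composition: fraction of residues in each group.
    composition = [labels.count(g) / n for g in range(3)]

    # Transition: fraction of adjacent pairs that switch between two groups,
    # counted in either direction.
    adjacent = list(zip(labels, labels[1:]))
    transition = [
        sum(1 for x, y in adjacent if {x, y} == {a, b}) / (n - 1)
        for a, b in ((0, 1), (0, 2), (1, 2))
    ]

    # Distribution: relative position of the first, 25%, 50%, 75%, and 100%
    # occurrence of each group (0.0 if the group is absent).
    distribution = []
    for g in range(3):
        positions = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            rank = max(math.ceil(frac * len(positions)), 1)
            distribution.append(positions[rank - 1] / n)

    return composition + transition + distribution


def conjoint_triad(seq):
    """Return 343 triad counts over the seven residue classes.

    Non-standard residues are dropped before triads are formed, which is a
    simplification made for this sketch.
    """
    labels = [CT_CLASS_OF[aa] for aa in seq.upper() if aa in CT_CLASS_OF]
    counts = [0] * (7 ** 3)
    for a, b, c in zip(labels, labels[1:], labels[2:]):
        counts[a * 49 + b * 7 + c] += 1
    return counts


ctd_vector = ctd("RKGACL")        # 21 values
ct_vector = conjoint_triad("AGVC")  # 343 values
```

Both vectors have a fixed length regardless of sequence length, which is the property that makes these descriptors convenient inputs for classifiers such as SVM or RF.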

2.2.2. Applications

Group-based methods, including CTD and CT, excel in protein function prediction, sequence analysis, and PPI prediction by leveraging physicochemical properties to generate biologically relevant features. For instance, DeepTP utilized CTD features to predict thermophilic proteins in Thermus thermophilus, achieving 87.2% accuracy in cross-validation datasets, supporting the identification of thermally stable enzymes for industrial applications [39]. Similarly, PreTP-Stack employed CTD to classify therapeutic peptides, including anticancer peptides, achieving an Area Under the Curve (AUC) of 99.0% in multiple datasets, contributing to novel cancer therapy development [40]. CT enhances these capabilities by capturing triad-based patterns, improving accuracy in protein function annotation and PPI prediction across diverse datasets [36,41].

2.2.3. Advantages and Limitations

Group-based methods face challenges, including the need for parameter optimization (e.g., grouping criteria for amino acids), which increases computational complexity, and sparsity in longer sequences, particularly for CTD and CT fixed-dimensional vectors, necessitating feature selection to support tasks like protein function prediction. The limited exploration of CTD in subcellular localization suggests untapped potential for enhancing predictive accuracy in such tasks.

2.3. Correlation-Based Methods

Correlation-based methods represent biological sequences by analyzing the relationships among their elements to capture complex patterns. These methods focus on modeling dependencies between physicochemical properties or sequence positions, enhancing feature extraction for predictive tasks in computational biology. By quantifying correlations, such as auto-covariance for single-property dependencies or cross-covariance for interactions between different properties, these methods provide robust representations that integrate local and partial global sequence information, supporting tasks like RNA classification and epigenetic analysis.

2.3.1. Overview

Correlation-based methods encode biological sequences by quantifying relationships between physicochemical properties across sequence positions. The auto-covariance (AC) [42,43] method measures the correlation of a single property, such as hydrophobicity, between sequence elements separated by a lag, producing vectors with $P \times G$ dimensions, where $P$ is the number of properties and $G$ is the maximum lag (e.g., 50 dimensions for 5 properties and $G = 10$). For nucleotide sequences, this extends to dinucleotide auto-covariance (DAC, $k = 2$) [44] and trinucleotide auto-covariance (TAC, $k = 3$) [45] variants, capturing local and extended patterns. Building on this, the cross-covariance (CC) [43,46] method analyzes interactions between different properties, such as polarity and volume, yielding higher-dimensional vectors of $P \times (P - 1) \times G$ dimensions, e.g., 200 dimensions for 5 properties and $G = 10$. CC extends to dinucleotide cross-covariance (DCC, $k = 2$) [45] and trinucleotide cross-covariance (TCC, $k = 3$) [45] variants, enhancing multi-property analysis. In addition, the combination of DAC and DCC forms dinucleotide auto-cross-covariance (DACC), which encodes vectors with $P \times P \times G$ dimensions. The feature encoding length, lower for AC and higher for CC and DACC, balances detailed dependency capture with computational cost, making these methods suitable for tasks like RNA classification and epigenetic analysis. Detailed mathematical formulations are provided in Supplementary Materials (Equations (S8)–(S12)).
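A minimal sketch of the two covariance descriptors follows. It assumes the common definition in which each property is mean-centered over the sequence and products at lag $g$ are averaged over the $L - g$ available position pairs; the exact normalization in Equations (S8)–(S12) may differ. The input is a hypothetical $L \times P$ table of per-position property values, and the function names are chosen for this example.

```python
def _centered_columns(values):
    """Split an L x P table into P columns, each centered on its mean."""
    length = len(values)
    columns = list(zip(*values))
    return [[x - sum(col) / length for x in col] for col in columns]


def auto_covariance(values, max_lag):
    """Return P * max_lag features: one per (property, lag) pair."""
    length = len(values)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for col in _centered_columns(values):
        for lag in range(1, max_lag + 1):
            total = sum(col[i] * col[i + lag] for i in range(length - lag))
            features.append(total / (length - lag))
    return features


def cross_covariance(values, max_lag):
    """Return P * (P - 1) * max_lag features: one per ordered pair of
    distinct properties and lag."""
    length = len(values)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    columns = _centered_columns(values)
    features = []
    for p, col_p in enumerate(columns):
        for q, col_q in enumerate(columns):
            if p == q:
                continue
            for lag in range(1, max_lag + 1):
                total = sum(col_p[i] * col_q[i + lag]
                            for i in range(length - lag))
                features.append(total / (length - lag))
    return features


# Toy table with made-up values: 12 positions, 5 properties.
toy = [[float((i * (j + 1)) % 7) for j in range(5)] for i in range(12)]
ac = auto_covariance(toy, 10)    # 5 * 10 = 50 features
cc = cross_covariance(toy, 10)   # 5 * 4 * 10 = 200 features
```

Concatenating the two outputs gives $P \times P \times G$ features, which matches the DACC dimension stated above.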

2.3.2. Applications

Correlation-based methods, including DAC, TAC, AC, and CC, capture sequence dependencies to advance RNA classification, epigenetic analysis, and sequence analysis with high accuracy. For instance, StackCirRNAPred incorporated DACC features for predicting long circular RNAs, achieving 83.9% accuracy on a mouse dataset [47]. Similarly, Deep-N6mA employed DCC and TAC features to predict N6-methyladenine sites, achieving 94.23% accuracy and helping elucidate regulatory mechanisms of gene expression [48]. Additionally, Uddin et al. used DAC with XGBoost to predict 5-hydroxymethylcytosine modifications, achieving 89.97% accuracy, further supporting epigenetic analysis [49]. An et al. applied DAC, TAC, TCC, and DACC to achieve 83.3% accuracy in DNA barcode classification of poppy species, aiding forensic and ecological applications [50]. Integrated with ML models like LSTM and XGBoost, these methods deliver robust performance for genomic and epigenetic studies.

2.3.3. Advantages and Limitations

Correlation-based methods offer significant advantages, including the ability to capture intricate sequence dependencies and multi-property interactions, enabling robust performance in RNA classification, epigenetic analysis, and sequence analysis when integrated with machine learning models like deep learning frameworks and ensemble classifiers. However, the high-dimensional feature spaces generated by these methods pose computational challenges, increasing processing time and resource demands for large-scale datasets. Additionally, their performance is sensitive to data quality, as noisy or incomplete datasets may weaken feature robustness.

2.4. PSSM-Based Methods

PSSM is a method used to describe the positional specificity of protein sequences. It reflects the probabilities of amino acids appearing at various positions within a protein sequence. This method was initially introduced by Gribskov [51] as a mathematical model for sequence alignment and structural analysis, supporting tasks like protein function prediction and interaction analysis.

2.4.1. Overview

PSSM-based methods encode protein sequences by quantifying the likelihood of amino acid substitutions at each position, reflecting evolutionary patterns. Introduced by Gribskov, the PSSM is generated with tools like PSI-BLAST [52] as an $L \times 20$ matrix for a protein sequence of length $L$, capturing substitution probabilities for each of the 20 amino acids at each position (e.g., a $100 \times 20$ matrix for a 100-residue sequence). Building on this, in 2001, Chou et al. introduced the Pseudo PSSM (Pse-PSSM) [53], which incorporates sequence-order relationships, producing a $(20 + 20 \times G)$-dimensional vector (e.g., 60 dimensions for $G = 2$) by combining average substitution scores with correlation factors across positional lags up to a maximum lag $G$. Further advancing this, the k-Tuple Composition PSSM method normalizes PSSM scores to extract k-tuple amino acid composition, yielding vectors of $20^{k}$ dimensions: 20 for $k = 1$, known as amino acid composition PSSM (AAC-PSSM) [54], and 400 for $k = 2$, known as dipeptide composition PSSM (DPC-PSSM) [54]. In summary, these methods produce features of size $L \times 20$ for PSSM, $20 + 20 \times G$ for Pse-PSSM, and $20^{k}$ for k-Tuple Composition PSSM, balancing detail and computational cost, and are suitable for protein function prediction and interaction analysis. Detailed mathematical formulations are provided in Supplementary Materials (Equations (S13)–(S22)).
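The dimension counts above can be illustrated with a short sketch that derives fixed-length vectors from an $L \times 20$ matrix. Several assumptions apply: the input matrix here is a toy stand-in (a real one would come from a tool such as PSI-BLAST and would usually be normalized first), the Pse-PSSM correlation factor is taken as the mean squared difference between rows separated by a lag, and DPC-PSSM as the mean product of scores at adjacent positions. These follow commonly used definitions and may differ in detail from Equations (S13)–(S22).

```python
def aac_pssm(pssm):
    """Column means of an L x 20 matrix: 20 features (k = 1)."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(len(pssm[0]))]


def pse_pssm(pssm, max_lag):
    """Column means plus lagged correlation factors: 20 + 20 * max_lag features."""
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    n_cols = len(pssm[0])
    features = aac_pssm(pssm)
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum((pssm[i][j] - pssm[i + lag][j]) ** 2
                        for i in range(length - lag))
            features.append(total / (length - lag))
    return features


def dpc_pssm(pssm):
    """Mean products of scores at adjacent positions: 20 * 20 features (k = 2)."""
    length = len(pssm)
    n_cols = len(pssm[0])
    return [
        sum(pssm[i][a] * pssm[i + 1][b] for i in range(length - 1)) / (length - 1)
        for a in range(n_cols)
        for b in range(n_cols)
    ]


# Toy 5 x 20 matrix with made-up scores, standing in for a real PSSM.
toy_pssm = [[float((i + j) % 3) for j in range(20)] for i in range(5)]
pse = pse_pssm(toy_pssm, 2)   # 20 + 20 * 2 = 60 features
dpc = dpc_pssm(toy_pssm)      # 400 features
```

Unlike the raw $L \times 20$ matrix, all three outputs are independent of sequence length, so proteins of different sizes can share one feature space.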

2.4.2. Applications

PSSM-based methods, including PSSM, k-Tuple Composition PSSM, and Pse-PSSM, leverage evolutionary and sequence information to advance protein function prediction, PPI prediction, and structural analysis with high accuracy across diverse domains. For instance, DeepRank-GNN uses PSSM to predict protein–protein interfaces with 82% accuracy, aiding the understanding of interaction mechanisms [55]. PLM-ATG employs k-Tuple Composition PSSM to identify autophagy-related proteins (ATGs) with 99.98% AUC, helping reveal how these proteins coordinate the autophagy process [56]. Ensemble deep learning with PSSM-derived features classifies lipocalin sequences at 97.65% accuracy, which supports the study of their multiple functions and the development of new disease treatments [57]. A protein structural class prediction method combining PSSM and Pse-PSSM features achieved accuracies of 81.0–89.5% on different datasets, helping reveal folding patterns [58]. Integrated with advanced machine learning models like graph neural networks and XGBoost, these methods deliver robust performance for genomic and proteomic studies.

2.4.3. Advantages and Limitations

Their primary advantages include the ability to capture intricate evolutionary conservation and sequence-order patterns, providing rich feature representations that enhance predictive accuracy in protein function prediction and PPI prediction. However, the high-dimensional nature of their feature vectors also increases computational complexity, often necessitating dimensionality reduction techniques, such as PCA, to optimize efficiency and scalability in large-scale bioinformatics.

2.5. Structure-Based Methods

Structure refers to the local folding and spatial arrangement of RNA and protein sequences. DNA features a double helix with paired nucleotide chains, whereas RNA is more complex [59], exhibiting single-stranded secondary structures (SSs) and various loop formations. Protein secondary structure arises from hydrogen bonds between amino acids and includes the α-helix, β-sheet, random coil, β-turn, and π-helix. Structure-based methods capture the local folding features of RNA and protein sequences in three-dimensional space and represent these features in a format that ML methods can process, thereby providing a novel perspective for sequence analysis and supporting tasks like RNA and protein function prediction. Given the relatively simple secondary structure of DNA, structure-based methods are primarily applied to RNA and protein. It is essential to note that these structural features are typically predicted using specialized tools, such as RNAFold [60] and Mfold [61]. RNA SSs are represented in dot-bracket format [60,62] and connectivity table (CT) format [63] (not to be confused with the Conjoint Triad of Section 2.2), which provide a concise representation of base positions and pairings.

2.5.1. Overview

Structure-based methods encode RNA and protein sequences by modeling their secondary structure patterns; DNA is largely excluded because of its simple double-helix structure. Introduced by Xue et al. [64], the triplet structure (TS) method represents RNA sequences as 32-dimensional vectors, combining sequence and structural information for trinucleotides (4 nucleotides × 8 combinations of paired/unpaired states, e.g., “A(((” or “U(..”). Su [65] proposed the split protein secondary structure composition (SPSSC) method, which uses tools like RaptorX to predict three-state secondary structure (helix, sheet, coil) [66] and divides protein sequences into $m$ subsequences, each represented by k-tuple frequencies ($m \times 3^{k}$ dimensions, e.g., $m \times 9$ for $k = 2$). Further advancing this, Liu et al. [67] introduced the Pseudo Structure Status Composition (PseSSC) method, which combines RNA secondary structure states (10 states: 4 unpaired, 6 paired) with sequence-order correlations, yielding a $(10^{k} + \lambda)$-dimensional vector (e.g., $100 + \lambda$ for $k = 2$). In summary, structure-based methods produce feature vectors of 32 dimensions for TS, $m \times 3^{k}$ for SPSSC, and $10^{k} + \lambda$ for PseSSC, balancing detail and computational cost, and are suitable for RNA modification prediction and protein function analysis. Detailed mathematical formulations are provided in Supplementary Materials (Equations (S23)–(S29)).
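As an illustration of the 32-dimensional triplet encoding, the sketch below counts, for every interior position, the nucleotide at that position together with the paired/unpaired states of the position and its two neighbors. It assumes the dot-bracket string has already been predicted by an external tool and is passed in as plain text, treats both bracket directions as "paired", and normalizes counts to frequencies; the function name `triplet_structure` and these conventions are assumptions for this example and may not match Equations (S23)–(S29) exactly.

```python
from itertools import product


def triplet_structure(seq, dot_bracket):
    """Return a dict of 32 triplet-structure frequencies.

    Keys look like "A(((" or "U(..": the middle nucleotide followed by the
    structure states of the left neighbor, the position, and the right
    neighbor.
    """
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")

    # Only paired vs. unpaired matters here, so ")" is folded into "(".
    struct = dot_bracket.replace(")", "(")

    keys = [nt + "".join(states)
            for nt in "ACGU"
            for states in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)

    for i in range(1, len(seq) - 1):
        key = seq[i] + struct[i - 1:i + 2]
        if key in counts:  # skip ambiguous bases or other structure symbols
            counts[key] += 1

    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


# Toy hairpin with a hand-written structure string.
features = triplet_structure("GGGAAACCC", "(((...)))")
```

The 4 × 8 layout is visible in the key construction: four nucleotides times the eight possible three-state patterns of "(" and ".".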

2.5.2. Applications

Structure-based methods effectively capture structural and sequence patterns for RNA modification prediction, protein function prediction, and RNA–protein interaction (RPI) prediction. For instance, AutoBioSeqpy employs TS to predict N7-methylguanosine (m7G) sites with 92.6% accuracy, helping reveal RNA modification mechanisms [68]. XGEM employs TS to identify miRNA precursors with 90.0% accuracy, aiding gene regulation studies [69]. PseSSC enhances pre-miRNA prediction with 85.76% accuracy and RPI prediction with 97.8% accuracy, clarifying the important role of structural features in model representation [67,70]. Integrated with ML models, these methods advance genomic and proteomic studies.

2.5.3. Advantages and Limitations

Structure-based methods capture intricate local structural and sequence patterns, particularly for RNA and protein sequences. These methods generate biologically meaningful feature representations that reflect secondary structure motifs, such as RNA base pairings or protein helix and sheet configurations, enhancing compatibility with machine learning frameworks like deep neural networks and ensemble classifiers [64,65]. This enables precise modeling of RNA and protein sequences critical for RNA modification prediction and protein function analysis. However, their focus on local structural features limits their ability to capture global sequence contexts, potentially reducing effectiveness in tasks requiring long-range dependencies. Additionally, high-dimensional outputs, especially for TS and PseSSC, increase computational demands, particularly for large-scale datasets. These methods also rely heavily on accurate structural prediction tools, such as RNAFold or RaptorX, and their performance may be compromised by noisy or incomplete structural data [60,66].

3. Word Embedding-Based Methods

Computational-based methods, while effective for local patterns, are limited by high dimensionality and their local focus. To address these limitations, word embedding-based methods were developed to capture contextual relationships using neural networks. Originally developed in natural language processing (NLP) and driven by advances in ML, particularly deep neural networks, these methods have emerged as powerful tools for biological-sequence representation by mapping sequences to continuous vector spaces that capture semantic and contextual relationships [71,72]. They treat biological sequences as language, representing k-mers or residues as “words” and sequences as “sentences.” By training neural networks on large-scale biological datasets, they generate dense, low-dimensional vectors that encode meaningful features, facilitating tasks such as sequence classification, function prediction, and cross-species analysis. These methods are summarized in Table 2.

3.1. Local Feature Embedding-Based Methods

These methods focus on local feature extraction at the k-mer or sub-k-mer level, suitable for capturing local patterns and contextual relationships, and are applicable to tasks such as sequence analysis and protein function prediction. They perform well in fine-grained analysis, but struggle to capture long-range dependencies.

3.1.1. Overview

Local feature embedding-based methods encode biological sequences by mapping k-mers into continuous vector spaces using neural network-based methods. Introduced in 2013 by Mikolov et al. [73], Word2Vec uses shallow neural networks with two models: Skip-gram, which predicts context k-mers given a target k-mer, and Continuous Bag-of-Words (CBOW), which predicts a target k-mer from its context. By treating sequences as “sentences” and k-mers as “words,” it generates fixed-length vectors (e.g., 100 dimensions) that capture local sequence patterns. Building on this, in 2014, Asgari and Mofrad proposed BioVec [74], tailored for biological sequences. BioVec includes ProtVec, which segments protein sequences into non-overlapping 3-mers and trains a Skip-gram model on the Swiss-Prot database to produce 100-dimensional embeddings encoding sequence patterns and BioVec-specific physicochemical properties, and GeneVec, which applies a similar method to nucleotide sequences. Further advancing this, Ng et al. introduced DNA2Vec [75], which extends Word2Vec to handle variable-length k-mers (e.g., 3 to 8 nucleotides) in DNA sequences. Using a Skip-gram model trained on large genomic datasets, DNA2Vec generates fixed-length vectors (e.g., 100 dimensions) that capture sequence motifs and contextual relationships. These methods produce configurable fixed-length vectors, typically 100 dimensions, ensuring robustness to sequence length variability and computational efficiency for tasks like sequence analysis and protein function prediction.
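The "k-mers as words, sequences as sentences" idea can be sketched without any training library. The example below only prepares training examples: it tokenizes a sequence into k-mers and builds the (target, context) pairs a Skip-gram model would learn from and the (context, target) examples a CBOW model would use. Fitting the embedding itself is omitted, and the function names, the window size, and the offset-based tokenization are assumptions made for this illustration.

```python
def overlapping_kmers(seq, k):
    """Sliding-window tokenization: every k-mer, step 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq, k, offset=0):
    """Tokenize into consecutive non-overlapping k-mers starting at offset."""
    return [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]


def skipgram_pairs(tokens, window):
    """(target, context) pairs: the target k-mer predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """(context list, target) examples: the neighbors predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


protein = "MKTAYIAKQ"
sentence = non_overlapping_kmers(protein, 3)       # ['MKT', 'AYI', 'AKQ']
shifted = non_overlapping_kmers(protein, 3, 1)     # ['KTA', 'YIA']
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
```

Shifting the offset yields alternative "sentences" from the same protein, and variable-length tokenization in the style of DNA2Vec could be approximated by calling `overlapping_kmers` with several values of `k`.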

3.1.2. Applications

Local feature embedding methods map k-mers or residues into continuous vector spaces, capturing local sequence patterns and contextual relationships. These methods excel in sequence analysis, protein function prediction, and binding site prediction. Word2Vec, via Skip-gram models, classifies protein sequences with 81.14% accuracy, helping promote the recognition of snake venom proteins and the development of related drugs [76]. BioVec, including ProtVec and GeneVec, can be used for functional classification, structural prediction, and PPI prediction of protein families [74,77]. DNA2Vec identifies transcription factor (TF) binding sites with 83.10% accuracy, supporting gene regulation studies [78]. Integrated with deep learning, these methods enhance genomic and proteomic analysis.

3.1.3. Advantages and Limitations

The key advantages of these methods include their ability to automatically learn features from sequence data, eliminating the need for manually defined features common in earlier computational-based methods, and their production of low-dimensional, fixed-length embeddings that ensure robustness to sequence length variability and improve computational efficiency for tasks like sequence analysis and protein function prediction. These embeddings integrate effectively with ML methods, such as support vector machines and deep neural networks, to achieve robust predictive performance. However, these methods have notable limitations. Their effectiveness relies heavily on large, high-quality training datasets, with smaller or noisy datasets leading to suboptimal embeddings [74]. Additionally, the embeddings, derived from local k-mer patterns, are challenging to interpret biologically, as they do not directly correspond to specific functional or structural properties, limiting their use in tasks requiring mechanistic insights. Furthermore, their focus on local patterns hinders their ability to capture long-range dependencies, a challenge addressed by later methods.

3.2. Global Feature Embedding-Based Methods

Global feature embedding-based methods transform biological sequences into continuous vector representations by modeling statistical relationships across entire sequences or datasets, capturing holistic context for tasks like sequence analysis and protein and RNA function prediction. These methods leverage fixed-length encodings, typically in low-dimensional spaces, to balance comprehensive pattern capture with computational efficiency.

3.2.1. Overview

Global feature embedding-based methods encode nucleotide and protein sequences by capturing statistical relationships across entire sequences using neural network-based or matrix factorization methods. Introduced in 2014 by Le and Mikolov as an extension of Word2Vec, Doc2Vec [73,79] treats sequences as “documents” and k-mers as “words,” generating fixed-length vectors (e.g., 100–300 dimensions). It employs two architectures: Distributed Memory (DM), which combines a document vector with context k-mer vectors to predict a target k-mer, and Distributed Bag-of-Words (DBOW), which uses the document vector alone to predict k-mers sampled from the sequence, capturing sequence-wide patterns. Also proposed in 2014, by Pennington et al., Global Vectors (GloVe) [80] builds a co-occurrence matrix from k-mers within a fixed context window (e.g., 5–15 units), counting pairwise co-occurrences across large biological datasets. A weighting function caps the influence of very frequent pairs, and weighted least squares optimization fits the vectors to the logarithm of the co-occurrence counts, factorizing the matrix into fixed-length vectors (e.g., 50–300 dimensions). Because the output length is fixed regardless of input length, both methods are robust to sequence length variability and remain computationally efficient for tasks like sequence analysis and protein and RNA function prediction.
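
The counting step behind GloVe can be sketched in a few lines of Python. The example below is a simplified illustration rather than the reference implementation: it counts unweighted k-mer co-occurrences within a symmetric window and evaluates the weighting function of the original GloVe formulation, in which the logarithm is applied to the counts while the weight itself is a capped power function. The function names and toy input are assumptions; `x_max = 100` and `alpha = 0.75` are the defaults reported by Pennington et al.

```python
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cooccurrence(sequences, k=3, window=5):
    """Count how often two k-mers occur within `window` positions of each other."""
    counts = Counter()
    for seq in sequences:
        toks = kmers(seq, k)
        for i, w in enumerate(toks):
            lo = max(0, i - window)
            hi = min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, toks[j])] += 1
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max)^alpha, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Hypothetical toy corpus of two short DNA sequences.
counts = cooccurrence(["ATGATGCC", "GATGATGA"], k=3, window=2)
for pair, x in counts.most_common(3):
    print(pair, x, round(glove_weight(x), 3))
```

Each pair then contributes a term of the form f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)² to the least-squares objective, so rare pairs are down-weighted and very frequent pairs cannot dominate.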

3.2.2. Applications

Global embedding-based methods, including Doc2Vec and GloVe, capture contextual sequence features for precise sequence analysis, protein and RNA function prediction, and PPI prediction. Doc2Vec predicts PPIs in human herpesvirus and cross-species datasets with high accuracy, revealing interaction networks [81,82]. It has also been used to predict circRNA–RNA-binding protein (RBP) interaction sites with an average AUC of 90.76%, supporting research on circRNAs as biomarkers for diagnosing various diseases [83]. GloVe-based models predict RNA 5-methyluridine (m5U) sites and DNA-binding proteins (DBPs) with strong performance, supporting epigenetic analysis [84,85].

3.2.3. Advantages and Limitations

The primary advantages of global feature embedding-based methods include their ability to capture global sequence patterns, providing a comprehensive representation of sequence context that is more robust to sequence length variability than traditional computational-based methods. The low-dimensional embeddings generated by Doc2Vec and GloVe enhance computational efficiency, while their reliance on statistical relationships enables adaptability to sequence analysis and protein and RNA function prediction. However, these methods have notable limitations. They depend on large, high-quality datasets to build effective co-occurrence matrices or document vectors, with performance declining for smaller or noisy datasets. Their emphasis on global sequence patterns often sacrifices local details, reducing suitability for fine-grained tasks like motif discovery, which local embedding methods handle more effectively. Moreover, the embeddings’ abstract nature complicates mapping to specific biological properties, such as functional motifs or interaction sites, hindering mechanistic understanding.

4. Large Language Model-Based Methods

With the advancement of sequencing technology, feature models of biological data have evolved, leading to models that integrate multi-task learning and transfer learning and adapt to constantly changing computational tasks, including sequence analysis, protein structure prediction, and drug discovery. A significant advancement is Protein Language Models (PLMs) [86], which leverage Transformer architectures to model protein sequences with unprecedented accuracy [87]. PLMs apply natural language processing techniques to protein sequences, treating them as language-like data [87]. The attention mechanism, a cornerstone of Transformers, enables PLMs to weigh the importance of different residues in a sequence, effectively modeling contextual relationships, long-range dependencies, and evolutionary patterns [87]. These methods are summarized in Table 3.
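
To make the role of attention concrete, the sketch below implements single-head scaled dot-product attention, softmax(QKᵀ/√d)V, in plain Python. It is a didactic simplification under stated assumptions: real PLMs use multiple heads, learned projection matrices, and tensor libraries, and the toy vectors here are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q, K, V are lists of vectors (one per residue). Returns the attended
    outputs and the attention weights (one row of weights per query).
    """
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Hypothetical 2-dimensional representations of three residues.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X)
for row in w:
    print([round(a, 3) for a in row])
```

Every residue receives a weight for every other residue regardless of their distance in the sequence, which is why attention can model long-range dependencies that windowed embeddings miss.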

4.1. Self-Supervised Learning Methods

Self-supervised learning (SSL) methods are pre-trained on self-supervised tasks, typically masked language modeling (MLM), to capture context and long-range dependencies in biological sequences for tasks like sequence analysis and protein and RNA structure prediction. In MLM, 15% of sequence tokens (e.g., nucleotides or amino acids) are selected for prediction; of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The model predicts the original tokens under a cross-entropy loss, enabling it to learn intrinsic patterns from large, unlabeled datasets (e.g., RNAcentral database, UniProt).
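
The masking scheme described above can be sketched as follows. This is a minimal illustration of BERT-style corruption applied to amino-acid tokens, not the data pipeline of any cited model; the function name `mask_tokens`, the fixed seed, and the example peptide are assumptions made for the example.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def mask_tokens(tokens, vocab=AMINO_ACIDS, mask_rate=0.15, seed=0):
    """Apply BERT-style MLM corruption to a token list.

    Selects about `mask_rate` of the positions; each selected position is
    replaced by "[MASK]" (80%), by a random vocabulary token (10%), or left
    unchanged (10%). Returns the corrupted tokens and a dict mapping each
    selected position to the original token the model must predict.
    """
    rng = random.Random(seed)
    n_select = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    corrupted = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = "[MASK]"
        elif r < 0.9:
            corrupted[pos] = rng.choice(vocab)
        # otherwise the token is kept as-is but still predicted
    return corrupted, labels


# Hypothetical peptide fragment used only for demonstration.
tokens = list("MKTAYIAKQRQISFVKSHFS")
corrupted, labels = mask_tokens(tokens)
print(corrupted)
print(labels)
```

The cross-entropy loss is then computed only at the positions stored in `labels`, so the unchanged 10% still contribute to training even though the input looks intact there.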

4.1.1. Overview

Self-supervised learning methods encode nucleotide and protein sequences using Transformer-based architectures trained on large datasets to capture contextual and long-range dependencies. Protein-focused models emerged with the Evolutionary Scale Modeling (ESM) series, notably ESM1b [88], proposed by Rives et al. in 2021, which uses 33 Transformer encoder layers with 12 attention heads and scaled dot-product attention. Trained on 250 million protein sequences from the UniRef50 database, ESM1b applies MLM to predict masked amino acids, producing 1280-dimensional embeddings. Elnaggar et al. proposed ProtTrans [89], a family of Transformer-based models that includes ProtBERT, which uses 30 Transformer encoder layers and 16 attention heads. Pre-trained on UniRef100 with 15% MLM, ProtBERT tokenizes sequences into the 20 standard amino acids plus special tokens, with inputs standardized to 512 residues, yielding 1024-dimensional embeddings. In 2024, Shen et al. introduced RNA-FM [90], which employs 12 Transformer encoder layers with 20 self-attention heads to model RNA sequences. RNA-FM is trained on 23.7 million non-coding RNA sequences from RNAcentral, with T replaced by U. It applies MLM to predict 15% randomly masked nucleotides, generating 640-dimensional embeddings via mean pooling. Subsequently, in 2025, ProLLama [91] extended protein modeling by adapting the Llama-2-7B architecture. It is trained on UniRef50 using autoregressive modeling and leverages residual connections, layer normalization, and low-rank adaptation to generate embeddings optimized for multi-task protein language processing, including sequence generation and superfamily prediction.
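
Mean pooling, mentioned above as the step that turns per-token outputs into one sequence-level vector, can be sketched as below. The helper `mean_pool` and the toy values are hypothetical; in practice the per-token embeddings would come from a pre-trained encoder, and padding positions are excluded with an attention mask so that sequences of different lengths yield comparable vectors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors into one fixed-length sequence embedding.

    token_embeddings: list of equal-length vectors, one per token position.
    attention_mask:   list of 1/0 flags; 0 marks padding to be ignored.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    n_real = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for c in range(dim):
                total[c] += vec[c]
    if n_real == 0:
        raise ValueError("attention_mask excludes every position")
    return [t / n_real for t in total]


# Toy example: three real tokens and one padding position, 4 dimensions each.
embeddings = [
    [0.2, 0.0, 1.0, -1.0],
    [0.4, 0.5, 1.0, 0.0],
    [0.6, 1.0, 1.0, 1.0],
    [9.9, 9.9, 9.9, 9.9],  # padding vector, ignored by the mask
]
print(mean_pool(embeddings, [1, 1, 1, 0]))
```

The output dimension equals the per-token dimension (for example, 640 for RNA-FM as stated above), independent of sequence length.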

4.1.2. Applications

Self-supervised learning (SSL) methods mark a transformative advancement in biological-sequence representation, capturing contextual relationships and long-range dependencies. ESM1b supports protein function prediction and binding site identification by encoding evolutionary and contextual patterns, achieving high accuracy in diverse protein analysis tasks [88,92]. ProtBERT enhances protein function and domain classification, as well as binding site identification, by modeling complex residue interactions [89]. RNA-FM excels in RNA 3D structure prediction and classification; by leveraging large-scale unlabeled data, it enhances performance in scenarios with limited labeled data [90]. Similarly, ProLLama strengthens protein structure prediction and interaction analysis, capturing intricate sequence relationships [91]. These methods integrate seamlessly with machine learning frameworks, leveraging large, unlabeled datasets to deliver robust performance across sequence analysis, protein and RNA structure prediction, and binding site prediction.

4.1.3. Advantages and Limitations

The primary advantages of SSL methods include their ability to model contextual relationships through attention mechanisms, surpassing the local focus of word embedding-based methods, and their generalizability across sequence analysis and protein and RNA structure prediction due to pre-training on expansive datasets like UniProt and RNAcentral. The high-dimensional, contextualized embeddings provide rich feature representations, adaptable through optional fine-tuning for specific applications, and the data-driven approach reduces reliance on predefined features compared to computational-based methods. However, these methods face significant challenges. Their high computational costs, driven by large-scale Transformer architectures and extensive training datasets, restrict accessibility and scalability in resource-constrained settings [59,89,91]. Additionally, their complex, high-dimensional embeddings obscure direct connections to biological properties, such as secondary structures or binding affinities, limiting interpretability and mechanistic insight. Performance is heavily dependent on large, high-quality datasets, with smaller or noisy datasets risking suboptimal embeddings.

4.2. Multi-Task Learning Methods

Multi-task learning (MTL) methods integrate multiple pre-training objectives, such as sequence analysis, structure prediction, and functional annotation, to generate universal embeddings for biological sequences. Unlike SSL, which focuses on single-task pre-training like MLM, MTL jointly optimizes diverse tasks [93] (e.g., MLM, contact map prediction, knowledge embedding) using composite loss functions (e.g., weighted sums of task-specific losses). MTL models, typically based on Transformer architectures, leverage large, annotated datasets (e.g., UniProt, ProteinKG25) to encode sequence, structural, and functional relationships. This multi-objective method increases training complexity but produces versatile embeddings suited for diverse biomolecular tasks, contrasting with SSL’s task-agnostic, sequence-focused embeddings.
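
The composite loss referred to above, a weighted sum of task-specific losses L = Σ_t λ_t L_t, can be written down directly. The sketch below is illustrative only: the task names, loss values, and weights are hypothetical and are not taken from any of the cited models, which choose and tune their own objectives and weights.

```python
def composite_loss(task_losses, task_weights):
    """Weighted sum of task-specific losses: L = sum_t weight_t * loss_t."""
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Hypothetical per-batch losses for three pre-training objectives.
losses = {"mlm": 2.0, "contact_map": 0.5, "knowledge_embedding": 1.0}
weights = {"mlm": 1.0, "contact_map": 0.5, "knowledge_embedding": 0.1}
print(composite_loss(losses, weights))
```

The weights control how strongly each objective shapes the shared encoder; setting all but one weight to zero recovers single-objective pre-training.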

4.2.1. Overview

Multi-task learning methods encode nucleotide and protein sequences using Transformer-based architectures trained on large datasets to model sequence, structural, and functional properties. Protein-focused models emerged first with Tasks Assessing Protein Embeddings (TAPE), introduced in 2019 by Rao et al. [93], which tokenizes protein sequences into the 20 standard amino acids plus special tokens and encodes them with 12 Transformer encoder layers with 8 self-attention heads. TAPE optimizes multiple tasks (secondary structure prediction, contact map prediction, and remote homology detection) on independent data sources such as the Pfam database, CB513, and ProteinNet. In 2022, Zhang et al. proposed OntoProtein [94], combining MLM with knowledge graph embedding using ProteinKG25 (612,483 entities, 4,990,097 triplets). A hybrid encoder, combining ProtBERT-based protein encoding with BERT-based Gene Ontology (GO) term encoding, employs contrastive learning to align the semantic associations between proteins and GO terms and generate embeddings. RNAErnie [95], introduced in 2024, employs 12 Transformer encoder layers with multi-head self-attention to process RNA sequences. Pre-trained on 23.7 million non-coding RNA sequences from the RNAcentral database, RNAErnie uses a motif-aware MLM strategy, masking 15% of nucleotides (80% [MASK], 10% random, 10% unchanged) at base, subsequence, and motif levels to predict structural motifs and sequence context, generating 768-dimensional embeddings via mean pooling. It also employs type-guided fine-tuning, predicting RNA types (e.g., microRNA (miRNA), long non-coding RNA (lncRNA)) to enhance adaptability, with flexible fine-tuning architectures (FBTH, TBTH, STACK). Also in 2024, Abramson et al. introduced AlphaFold3 [22], a diffusion-based model integrating Transformer-based sequence encoding to predict interactions among proteins, small molecules, and nucleic acids. Trained on PDB and chemical databases, it generates embeddings for biomolecular complexes. Similarly, Krishna et al. proposed RoseTTAFold All-Atom [23] in 2024, a Transformer-based model trained on PDB and UniProt for protein-ligand and protein-nucleic acid interactions, using multi-task objectives for structure prediction and functional annotation. In 2025, ESM3 [96] introduced multimodal generative modeling, encoding sequences, structures, and functions as discrete tokens in a 98-billion-parameter Transformer. ESM3 is trained on 3.15 billion sequences, 236 million structures, and 539 million annotations from curated datasets (e.g., UniProt, PDB) and uses generative MLM to predict masked tokens across modalities, producing high-dimensional embeddings [96]. These methods produce feature encodings ranging from 256 to 1024 dimensions, balancing comprehensive multimodal representation with computational efficiency for tasks like sequence analysis, structure prediction, and functional annotation.

4.2.2. Applications

Multi-task learning (MTL) methods capture sequence, structural, and functional properties for sequence analysis, structure prediction, and functional annotation. TAPE supports protein function prediction and structural analysis, effectively modeling complex protein characteristics [93]. OntoProtein enhances protein function prediction and annotation by integrating biological knowledge with sequence data, improving accuracy in functional analysis [94]. RNAErnie excels in RNA structure prediction and type classification, such as identifying miRNA and lncRNA, by enhancing structural and functional insights [95]. AlphaFold3 predicts the structures and interactions of protein-small molecule and protein-nucleic acid complexes, aiding drug discovery [22]. RoseTTAFold All-Atom supports protein-ligand binding prediction and functional genomics [23]. ESM3 enables protein structure generation, function prediction, and cross-modal reasoning, offering exceptional scalability across diverse computational biology tasks [96]. These methods integrate seamlessly with machine learning frameworks, leveraging diverse data to deliver robust performance in sequence analysis, structure prediction, and functional annotation.

4.2.3. Advantages and Limitations

The primary advantages of MTL methods include their ability to jointly model sequence, structural, and functional properties, yielding versatile embeddings that outperform single-task SSL methods in sequence analysis, structure prediction, and functional annotation. The incorporation of external knowledge (e.g., GO terms in OntoProtein) and multimodal data (e.g., ESM3) enhances generalizability, while fine-tuning strategies improve adaptability to specific tasks. However, MTL methods face significant challenges. Their high computational complexity, stemming from multi-objective optimization and large-scale datasets, requires substantial resources, limiting accessibility [22,23,89,94,96]. Moreover, the multifaceted embeddings blur direct associations with specific biological features, such as binding sites or functional motifs, limiting mechanistic interpretation. Performance heavily depends on large, high-quality annotated datasets, with noisy or limited data degrading embedding quality, a challenge shared with SSL methods.

5. Challenges and Future Directions

The development of biological-sequence representation methods, from computational-based methods to word embedding-based methods and LLM-based methods, has advanced tasks like sequence analysis, structure prediction, functional annotation, and interaction prediction. Each stage, however, faces specific challenges tied to feature encoding length and computational demands, guiding targeted improvements for computational biology applications.

Computational-based structure methods, such as TS and PseSSC, encode RNA and protein sequences into high-dimensional vectors (e.g., 64 dimensions for TS, $10^{k} + \lambda$ for PseSSC) to capture local structural patterns like RNA base pairings or protein secondary structures. These high-dimensional outputs increase computational complexity for tasks like N7-methylguanosine site prediction and pre-miRNA identification, while their focus on local features often misses global sequence context, limiting performance in tasks requiring long-range dependencies [67,68]. Moving to word-embedding methods, local feature embeddings (Section 3.1), like Word2Vec and DNA2Vec, generate fixed-length 100-dimensional vectors using shallow neural networks. Their reliance on extensive training data leads to suboptimal embeddings for smaller or noisy datasets, impacting transcription factor binding prediction [74,75]. Global feature embeddings (Section 3.2), such as Doc2Vec and GloVe, produce 100–300-dimensional vectors by modeling sequence-wide statistical patterns but often overlook local motifs, reducing effectiveness in RNA methylation site prediction [83,84]. Progressing to LLM-based methods, self-supervised learning methods (Section 4.1), like RNA-FM and ESM1b, use Transformer architectures to generate high-dimensional embeddings. These models, trained on large datasets like RNAcentral database and UniProt, demand significant computational resources for RNA 3D structure prediction and protein function annotation [88,90]. Multi-task learning methods (Section 4.2), such as RNAErnie and ESM3, integrate objectives like MLM and knowledge embedding, producing 256–1024-dimensional embeddings. Their complex architectures and dependence on annotated datasets like ProteinKG25 increase computational costs and sensitivity to data quality, affecting tasks like RNA classification, RNA-RNA interaction, and protein structure prediction.

Common challenges include data imbalances (e.g., under-represented RNA types in the RNAcentral database), sequencing errors that introduce noise, and overfitting risks in high-dimensional embeddings due to limited labeled data. Computational complexity is a particularly noteworthy issue, because it strains scarce computing resources. In current research, strategies such as sparse attention mechanisms [97] are used to reduce memory and time requirements by focusing on key sequence interactions, while model compression techniques such as knowledge distillation and low-rank adaptation [98,99] have yielded lightweight models with comparable performance. Mixed-precision training and recent combinations of k-mer features with small language models have further improved accessibility, as demonstrated by recent models for plant genome annotation and prediction of regulatory element strength [100]. These optimization strategies alleviate the high resource requirements of LLMs, enabling them to be deployed on standard hardware for tasks such as sequence analysis and functional annotation and directly addressing the accessibility challenges faced by resource-constrained researchers.
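
The saving offered by low-rank adaptation can be made concrete with a small calculation. In LoRA, a frozen weight matrix W is augmented by a trainable low-rank product BA, so only the two thin factors are updated. The sketch below is a plain-Python illustration under stated assumptions: the layer size of 1280 echoes the ESM1b embedding dimension quoted earlier, while the rank of 8, the function names, and the toy matrices are hypothetical choices for the example.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a rank-`rank` LoRA update."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W (d_out x d_in) stays frozen; A (rank x d_in) and B (d_out x rank)
    are the only trainable matrices.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(1280, 1280, rank=8)
print(full, lora, f"{100 * lora / full:.2f}% of the full update")

# With B initialised to zeros the adapted layer reproduces the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0]]
print(lora_forward([1.0, 2.0], W, A, B))
```

For a single 1280 × 1280 projection, the rank-8 update trains 20,480 parameters instead of 1,638,400, which is the kind of reduction that makes fine-tuning feasible on modest hardware.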

Another noteworthy challenge is interpretability: the high-dimensional embeddings generated by LLMs are difficult to relate to biological mechanisms, which limits the biological insight they can provide. To address this issue, feature attribution methods such as Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME) can identify the sequence features that contribute most to a prediction, as shown in protein functional annotation [101,102]. Visualization techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) map high-dimensional embeddings to low-dimensional spaces for functional annotation, interaction prediction, and gene function prediction [103]. Integrating biological priors, such as Gene Ontology (GO) terms or known pathways, can enhance understanding of mechanisms, as in OntoProtein for protein function annotation. In addition, analyzing Transformer attention weights, as in RNAErnie for RNA structure prediction, highlights the key sequence regions that drive prediction [95]. These methods collectively bridge the gap between complex LLM outputs and specific biological insights, enhancing their practicality in tasks such as functional annotation and interaction prediction.

To address these challenges, future advancements in biological-sequence representation will focus on unifying the strengths of computational, word embedding, and LLM-based methods to enhance their applicability across sequence analysis, structure prediction, functional annotation, and interaction prediction. Integrating multimodal data, such as sequences, secondary structures, and functional annotations from datasets like ProteinKG25, will enable richer representations for tasks like interaction prediction and sequence analysis. For instance, combining single-cell RNA sequencing (scRNA-seq) and Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq) data with cross-modal generative models like ESM3 can capture complex biological patterns, improving the accuracy of function annotation and drug discovery. Likewise, combining sequence-order correlations with structural features, as seen in PseSSC, and functional annotations, as in OntoProtein, can better model complex biological interactions [67,94]. Enhancing interpretability by mapping high-dimensional embeddings to biological properties, such as binding affinities for RNA methylation sites or structural motifs for protein functions, will improve mechanistic insights for functional annotation and interaction prediction. Emerging explainable AI (XAI) techniques, such as SHAP and attention visualization, along with knowledge graph-based embeddings like OntoProtein, can elucidate feature contributions, bridging the gap between embeddings and biological mechanisms [101]. Optimizing computational efficiency through efficient attention mechanisms (e.g., Linformer and Performer) and model compression methods (e.g., knowledge distillation, quantization) can significantly lower the computational burden, enabling scalability on standard hardware [87]. Additionally, addressing data imbalances through synthetic data generation and cross-species transfer learning will enhance model robustness for tasks like non-coding RNA classification. Developing robust denoising techniques to mitigate sequencing errors will further improve performance in tasks like interaction prediction [104,105]. Optimizing feature encoding length, such as balancing the high-dimensional outputs of PseSSC and ESM with low-dimensional embeddings like Word2Vec through dimensionality reduction, will ensure a trade-off between capturing complex patterns and computational tractability. These advancements will drive the next phase of computational biology, enabling more accurate and interpretable sequence analysis.

6. Conclusions

This paper systematically reviews biological-sequence representation methods, tracing their development through computational-based methods, word embedding-based methods, and LLM-based methods, which collectively underpin the extraction of biological insights from sequence data. Computational-based methods, such as k-mer, PSSM, and structure-based methods like TS and SPSSC, excel in capturing local patterns and structural features, making them ideal for sequence analysis and structure prediction, though limited by high-dimensionality and local focus [28]. Word embedding-based methods, including Word2Vec, BioVec, and Doc2Vec, marked a shift toward contextual feature learning, enhancing performance in applications like transcription factor binding prediction and protein–protein interaction analysis, but constrained by data requirements and interpretability challenges [88,94]. LLM-based methods, such as ESM3, ProtBERT, and ProLLama, leverage attention mechanisms to model long-range dependencies and evolutionary patterns, achieving state-of-the-art results in structure prediction, functional annotation, and drug discovery, despite high computational costs [59,91]. While computational-based methods dominate in efficiency and quantity, structure-based methods are increasingly recognized for their ability to reflect the intricate interplay between sequence, structure, and function. Emerging trends in representation learning and multi-source, multimodal data integration are enhancing feature extraction, with studies exploring these to improve robustness [106,107]. Although LLM-based methods are still developing, they demonstrate significant potential to transform computational biology. Future progress will hinge on addressing computational, data, and interpretability challenges, ensuring these methods continue to evolve and drive transformative discoveries in biological research.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biology14091137/s1, Table S1: Amino acids grouping in the CTD method; Figure S1: The detail of the definition and description for CT method; Equations (S1)–(S4): The detailed mathematical formulations for the k-mer-based methods; Equations (S5)–(S7): The detailed mathematical formulations for the group-based methods; Equations (S8)–(S12): The detailed mathematical formulations for the correlation-based methods; Equations (S13)–(S22): The detailed mathematical formulations for the PSSM-based methods; Equations (S23)–(S29): The detailed mathematical formulations for the structure-based methods.

Author Contributions

Conceptualization, H.Z.; Funding acquisition, Y.W.; Investigation, X.Y.; Methodology, K.L., S.-K.I. and Y.H.; Resources, X.Y. and S.-K.I.; Software, Y.H.; Supervision, Y.S. and Y.W.; Validation, K.L.; Writing—original draft, H.Z.; Writing—review, Y.S., Y.W. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the fund from Macao Polytechnic University (RP/FCA-14/2023) and The Science and Technology Development Funds (FDCT) of Macao (0033/2023/RIB2).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ML	Machine Learning
SVM	Support Vector Machine
RF	Random Forest
NN	Neural Network
LR	Linear Regression
AUC-ROC	Area Under the Receiver Operating Characteristic Curve
CNN	Convolutional Neural Network
LSTM	Long Short-Term Memory
LLM	Large Language Model
PPI	Protein–Protein Interaction
XGBOOST	eXtreme Gradient Boosting
PCA	Principal Component Analysis
PSSM	Position Specific Scoring Matrix
MNC	Mononucleotide Composition
DNC	Dinucleotide Composition
TNC	Trinucleotide Composition
AAC	Amino Acid Composition
DPC	Dipeptide Composition
TPC	Tripeptide Composition
CTD	Composition, Transition, and Distribution
CT	Conjoint Triad
AC	Auto-Covariance
CC	Cross-Covariance
DAC	Dinucleotide Auto-Covariance
TAC	Trinucleotide Auto-Covariance
DCC	Dinucleotide Cross-Covariance
TCC	Trinucleotide Cross-Covariance
DACC	Dinucleotide Auto- and Cross-Covariance
Pse-PSSM	Pseudo Position Specific Scoring Matrix
SPSSC	Split Protein Secondary Structure Composition
TS	Triplet Structure
MLM	Masked Language Modeling
PLM	Protein Language Model
GO	Gene Ontology
TAPE	Tasks Assessing Protein Embeddings
SHAP	SHapley Additive exPlanations
LIME	Local Interpretable Model-agnostic Explanations

References

Olson, M.V. The human genome project. Proc. Natl. Acad. Sci. USA 1993, 90, 4338–4344.
Lander, E.S.; Linton, L.M.; Birren, B.; Nusbaum, C.; Zody, M.C.; Baldwin, J.; Morris, W.; Meldrim, J.; Devon, K.; Santos, R.; et al. Initial sequencing and analysis of the human genome. Nature 2001, 409, 860–921.
Venter, J.C.; Adams, M.D.; Myers, E.W.; Li, P.W.; Mural, R.J.; Sutton, G.G.; Smith, H.O.; Yandell, M.; Evans, C.A.; Holt, R.A.; et al. The sequence of the human genome. Science 2001, 291, 1304–1351.
Moraes, F.; Góes, A. A decade of human genome project conclusion: Scientific diffusion about our genome knowledge. Biochem. Mol. Biol. Educ. 2016, 44, 215–223.
Aganezov, S.; Yan, S.M.; Soto, D.C.; Kirsche, M.; Zarate, S.; Avdeyev, P.; Taylor, D.J.; Shafin, K.; Shumate, A.; Xiao, C.; et al. A complete reference genome improves analysis of human genetic variation. Science 2022, 376, eabl3533.
Nurk, S.; Koren, S.; Rhie, A.; Rautiainen, M.; Bzikadze, A.V.; Mikheenko, A.; Vollger, M.R.; Altemose, N.; Uralsky, L.; Gershman, A.; et al. The complete sequence of a human genome. Science 2022, 376, 44–53.
Altemose, N.; Logsdon, G.A.; Bzikadze, A.V.; Sidhwani, P.; Langley, S.A.; Caldas, G.V.; Hoyt, S.J.; Uralsky, L.; Ryabov, F.D.; Shew, C.J.; et al. Complete genomic and epigenetic maps of human centromeres. Science 2022, 376, eabl4178.
Gershman, A.; Sauria, M.E.; Guitart, X.; Vollger, M.R.; Hook, P.W.; Hoyt, S.J.; Jain, M.; Shumate, A.; Razaghi, R.; Koren, S.; et al. Epigenetic patterns in a complete human genome. Science 2022, 376, eabj5089.
Hoyt, S.J.; Storer, J.M.; Hartley, G.A.; Grady, P.G.; Gershman, A.; de Lima, L.G.; Limouse, C.; Halabian, R.; Wojenski, L.; Rodriguez, M.; et al. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science 2022, 376, eabk3112.
Vollger, M.R.; Guitart, X.; Dishuck, P.C.; Mercuri, L.; Harvey, W.T.; Gershman, A.; Diekhans, M.; Sulovari, A.; Munson, K.M.; Lewis, A.P.; et al. Segmental duplications and their variation in a complete human genome. Science 2022, 376, eabj6965.
Sravani, C.; Pavani, P.; Vybhavi, G.Y.; Ramesh, G.; Farman, A. Decoding the Human Genome: Machine Learning Techniques for DNA Sequencing Analysis. In Proceedings of the E3S Web of Conferences, Newcastle, UK, 5–8 September 2023; EDP Sciences. Volume 430, p. 01067.
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Formenti, G.; Rhie, A.; Walenz, B.P.; Thibaud-Nissen, F.; Shafin, K.; Koren, S.; Myers, E.W.; Jarvis, E.D.; Phillippy, A.M. Merfin: Improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat. Methods 2022, 19, 696–704.
Xu, R.; Zhou, J.; Wang, H.; He, Y.; Wang, X.; Liu, B. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Syst. Biol. 2015, 9 (Suppl. 1), S10.
Zahiri, J.; Yaghoubi, O.; Mohammad-Noori, M.; Ebrahimpour, R.; Masoudi-Nejad, A. PPIevo: Protein–protein interaction prediction from PSSM based evolutionary information. Genomics 2013, 102, 237–242.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
Yilmaz, S.; Toklu, S. A deep learning analysis on question classification task using Word2vec representations. Neural Comput. Appl. 2020, 32, 2909–2928.
Qiu, J.; Bernhofer, M.; Heinzinger, M.; Kemper, S.; Norambuena, T.; Melo, F.; Rost, B. ProNA2020 predicts protein–DNA, protein–RNA, and protein–protein binding proteins and residues from sequence. J. Mol. Biol. 2020, 432, 2428–2443.
Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 630, 493–500.
Krishna, R.; Wang, J.; Ahern, W.; Sturmfels, P.; Venkatesh, P.; Kalvet, I.; Lee, G.R.; Morey-Burrows, F.S.; Anishchenko, I.; Humphreys, I.R.; et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 2024, 384, eadl2528.
Burge, C.; Campbell, A.M.; Karlin, S. Over-and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 1992, 89, 1358–1362.
Karlin, S.; Ladunga, I. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. USA 1994, 91, 12832–12836.
Ghandi, M.; Lee, D.; Mohammad-Noori, M.; Beer, M.A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 2014, 10, e1003711.
Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, W.M. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol. 2017, 18, 186.
Moeckel, C.; Mareboina, M.; Konnaris, M.A.; Chan, C.S.; Mouratidis, I.; Montgomery, A.; Chantzi, N.; Pavlopoulos, G.A.; Georgakopoulos-Soares, I. A survey of k-mer methods and applications in bioinformatics. Comput. Struct. Biotechnol. J. 2024, 23, 2289–2303.
Ghandi, M.; Mohammad-Noori, M.; Ghareghani, N.; Lee, D.; Garraway, L.; Beer, M.A. gkmSVM: An R package for gapped-kmer SVM. Bioinformatics 2016, 32, 2205–2207.
Wood, D.E.; Salzberg, S.L. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014, 15, R46.
Yan, J.; Qiu, Y.; Ribeiro dos Santos, A.M.; Yin, Y.; Li, Y.E.; Vinckier, N.; Nariai, N.; Benaglio, P.; Raman, A.; Li, X.; et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 2021, 591, 147–151.
Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934.
Yao, Y.H.; Lv, Y.P.; Li, L.; Xu, H.M.; Ji, B.B.; Chen, J.; Li, C.; Liao, B.; Nan, X.Y. Protein sequence information extraction and subcellular localization prediction with gapped k-Mer method. BMC Bioinform. 2019, 20 (Suppl. 22), 719.
Ahmad, A.; Akbar, S.; Hayat, M.; Ali, F.; Khan, S.; Sohail, M. Identification of antioxidant proteins using a discriminative intelligent model of k-space amino acid pairs based descriptors incorporating with ensemble feature selection. Biocybern. Biomed. Eng. 2022, 42, 727–735. [Google Scholar] [CrossRef]
Bae, D.; Kim, M.; Seo, J.; Nam, H. AI-guided discovery and optimization of antimicrobial peptides through species-aware language model. Brief. Bioinform. 2025, 26, bbaf343. [Google Scholar] [CrossRef]
Shen, J.; Zhang, J.; Luo, X.; Zhu, W.; Yu, K.; Chen, K.; Li, Y.; Jiang, H. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA 2007, 104, 4337–4341. [Google Scholar] [CrossRef] [PubMed]
Dubchak, I.; Muchnik, I.; Mayor, C.; Dralyuk, I.; Kim, S.H. Recognition of a protein fold in the context of the SCOP classification. Proteins Struct. Funct. Bioinform. 1999, 35, 401–407. [Google Scholar] [CrossRef]
Chothia, C.; Finkelstein, A.V. The classification and origins of protein folding patterns. Annu. Rev. Biochem. 1990, 59, 1007–1035. [Google Scholar] [CrossRef] [PubMed]
Zhao, J.; Yan, W.; Yang, Y. DeepTP: A deep learning model for thermophilic protein prediction. Int. J. Mol. Sci. 2023, 24, 2217. [Google Scholar] [CrossRef]
Yan, K.; Lv, H.; Wen, J.; Guo, Y.; Xu, Y.; Liu, B. PreTP-Stack: Prediction of therapeutic peptide based on the stacked ensemble learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 20, 1337–1344. [Google Scholar] [CrossRef]
Sureyya Rifaioglu, A.; Doğan, T.; Jesus Martin, M.; Cetin-Atalay, R.; Atalay, V. DEEPred: Automated protein function prediction with multi-task feed-forward deep neural networks. Sci. Rep. 2019, 9, 7344. [Google Scholar] [CrossRef]
Wold, S.; Jonsson, J.; Sjöström, M.; Sandberg, M.; Rännar, S. DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Anal. Chim. Acta 1993, 277, 239–253. [Google Scholar] [CrossRef]
Guo, Y.; Yu, L.; Wen, Z.; Li, M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 2008, 36, 3025–3030. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Xiao, X.; Yu, D.J.; Jia, J.; Qiu, W.R.; Chou, K.C. pRNAm-PC: Predicting N6-methyladenosine sites in RNA sequences via physical–chemical properties. Anal. Biochem. 2016, 497, 60–67. [Google Scholar] [CrossRef] [PubMed]
Liu, B.; Liu, F.; Fang, L.; Wang, X.; Chou, K.C. repDNA: A Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics 2015, 31, 1307–1309. [Google Scholar] [CrossRef]
Dong, Q.; Zhou, S.; Guan, J. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 2009, 25, 2655–2662. [Google Scholar] [CrossRef]
Wang, X.; Liu, Y.; Li, J.; Wang, G. StackCirRNAPred: Computational classification of long circRNA from other lncRNA based on stacking strategy. BMC Bioinform. 2022, 23, 563. [Google Scholar] [CrossRef]
Khan, S.; Uddin, I.; Noor, S.; AlQahtani, S.A.; Ahmad, N. N6-methyladenine identification using deep learning and discriminative feature integration. BMC Med. Genom. 2025, 18, 58. [Google Scholar] [CrossRef]
Uddin, I.; Awan, H.H.; Khalid, M.; Khan, S.; Akbar, S.; Sarker, M.R.; Abdolrasol, M.G.; Alghamdi, T.A. A hybrid residue based sequential encoding mechanism with XGBoost improved ensemble model for identifying 5-hydroxymethylcytosine modifications. Sci. Rep. 2024, 14, 20819. [Google Scholar] [CrossRef]
An, H.E.; Mun, M.H.; Malik, A.; Kim, C.B. Development of a two-layer machine learning model for the forensic application of legal and illegal poppy classification based on sequence data. Forensic Sci. Int. Genet. 2024, 71, 103061. [Google Scholar] [CrossRef] [PubMed]
Gribskov, M.; McLachlan, A.D.; Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 1987, 84, 4355–4358. [Google Scholar] [CrossRef] [PubMed]
Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402. [Google Scholar] [CrossRef]
Chou, K.C.; Shen, H.B. MemType-2L: A web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. 2007, 360, 339–345. [Google Scholar] [CrossRef]
Liu, T.; Zheng, X.; Wang, J. Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie 2010, 92, 1330–1334. [Google Scholar] [CrossRef]
Réau, M.; Renaud, N.; Xue, L.C.; Bonvin, A.M. DeepRank-GNN: A graph neural network framework to learn patterns in protein–protein interfaces. Bioinformatics 2023, 39, btac759. [Google Scholar] [CrossRef]
Wang, Y.; Wang, C. PLM-ATG: Identification of Autophagy Proteins by Integrating Protein Language Model Embeddings with PSSM-Based Features. Molecules 2025, 30, 1704. [Google Scholar] [CrossRef]
Zhang, Y.; Yu, L.; Xue, L.; Liu, F.; Jing, R.; Luo, J. Optimizing lipocalin sequence classification with ensemble deep learning models. PLoS ONE 2025, 20, e0319329. [Google Scholar] [CrossRef]
Zhang, S. Accurate prediction of protein structural classes by incorporating PSSS and PSSM into Chou’s general PseAAC. Chemom. Intell. Lab. Syst. 2015, 142, 28–35. [Google Scholar] [CrossRef]
Shen, B.; Zhang, H.; Li, C.; Zhao, T.; Liu, Y. Deep Learning Method for RNA Secondary Structure Prediction with Pseudoknots Based on Large-Scale Data. J. Healthc. Eng. 2021, 2021, 1–9. [Google Scholar] [CrossRef]
Hofacker, I.L.; Fontana, W.; Stadler, P.F.; Bonhoeffer, L.S.; Tacker, M.; Schuster, P. Fast folding and comparison of RNA secondary structures. Monatshefte Chem. Chem. Mon. 1994, 125, 167–188. [Google Scholar] [CrossRef]
Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003, 31, 3406–3415. [Google Scholar] [CrossRef] [PubMed]
Zuker, M.; Stiegler, P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981, 9, 133–148. [Google Scholar] [CrossRef]
Zuker, M.; Jaeger, J.A.; Turner, D.H. A comparison of optimal and suboptimal RNA secondary structures predicted by free energy minimization with structures determined by phylogenetic comparison. Nucleic Acids Res. 1991, 19, 2707–2714. [Google Scholar] [CrossRef]
Xue, C.; Li, F.; He, T.; Liu, G.P.; Li, Y.; Zhang, X. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinform. 2005, 6, 310. [Google Scholar] [CrossRef] [PubMed]
Pengcheng, S. Prediction of membrane protein types based on fusion features and voting ensemble learning. Hans J. Comput. Biol. 2022, 11, 49. [Google Scholar]
Cuff, J.A.; Barton, G.J. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins Struct. Funct. Bioinform. 1999, 34, 508–519. [Google Scholar] [CrossRef]
Liu, B.; Fang, L.; Liu, F.; Wang, X.; Chen, J.; Chou, K.C. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS ONE 2015, 10, e0121501. [Google Scholar] [CrossRef]
Zhang, Y.; Yu, L.; Jing, R.; Han, B.; Luo, J. Fast and efficient design of deep neural networks for predicting N7-methylguanosine sites using autoBioSeqpy. ACS Omega 2023, 8, 19728–19740. [Google Scholar] [CrossRef]
Min, H.; Xin, X.H.; Gao, C.Q.; Wang, L.; Du, P.F. XGEM: Predicting essential miRNAs by the ensembles of various sequence-based classifiers with XGBoost algorithm. Front. Genet. 2022, 13, 877409. [Google Scholar] [CrossRef]
Yu, B.; Wang, X.; Zhang, Y.; Gao, H.; Wang, Y.; Liu, Y.; Gao, X. RPI-MDLStack: Predicting RNA–protein interactions through deep learning with stacking strategy and LASSO. Appl. Soft Comput. 2022, 120, 108676. [Google Scholar] [CrossRef]
Iuchi, H.; Matsutani, T.; Yamada, K.; Iwano, N.; Sumi, S.; Hosoda, S.; Zhao, S.; Fukunaga, T.; Hamada, M. Representation learning applications in biological sequence analysis. Comput. Struct. Biotechnol. J. 2021, 19, 3198–3208. [Google Scholar] [CrossRef]
Hamid, M.N.; Friedberg, I. Identifying antimicrobial peptides using word embedding with deep recurrent neural networks. Bioinformatics 2019, 35, 2009–2016. [Google Scholar] [CrossRef]
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 1–9. [Google Scholar]
Asgari, E.; Mofrad, M.R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 2015, 10, e0141287. [Google Scholar] [CrossRef]
Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. arXiv 2017, arXiv:1701.06279. [Google Scholar]
Zulfiqar, H.; Guo, Z.; Ahmad, R.M.; Ahmed, Z.; Cai, P.; Chen, X.; Zhang, Y.; Lin, H.; Shi, Z. Deep-STP: A deep learning-based approach to predict snake toxin proteins by using word embeddings. Front. Med. 2024, 10, 1291352. [Google Scholar] [CrossRef]
Li, H.; Zhang, S.; Chen, L.; Pan, X.; Li, Z.; Huang, T.; Cai, Y.D. Identifying functions of proteins in mice with functional embedding features. Front. Genet. 2022, 13, 909040. [Google Scholar] [CrossRef]
Cao, L.; Liu, P.; Chen, J.; Deng, L. Prediction of transcription factor binding sites using a combined deep learning approach. Front. Oncol. 2022, 12, 893520. [Google Scholar] [CrossRef]
Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 18 June 2014; pp. 1188–1196. [Google Scholar]
Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
Yang, X.; Wuchty, S.; Liang, Z.; Ji, L.; Wang, B.; Zhu, J.; Zhang, Z.; Dong, Y. Multi-modal features-based human-herpesvirus protein–protein interaction prediction by using LightGBM. Brief. Bioinform. 2024, 25, 1–13. [Google Scholar] [CrossRef]
Tran, H.N.; Nguyen, P.X.; Guo, F.; Wang, J. Prediction of protein–protein interactions based on integrating deep learning and feature fusion. Int. J. Mol. Sci. 2024, 25, 5820. [Google Scholar] [CrossRef]
Yang, Y.; Hou, Z.; Ma, Z.; Li, X.; Wong, K.C. iCircRBP-DHN: Identification of circRNA-RBP interaction sites using deep hierarchical network. Brief. Bioinform. 2021, 22, bbaa274. [Google Scholar] [CrossRef]
Ji, Y.; Sun, J.; Xie, J.; Wu, W.; Shuai, S.C.; Zhao, Q.; Chen, W. m5UMCB: Prediction of RNA 5-methyluridine sites using multi-scale convolutional neural network with BiLSTM. Comput. Biol. Med. 2024, 168, 107793. [Google Scholar] [CrossRef]
Mahmud, S.H.; Goh, K.O.; Hosen, M.F.; Nandi, D.; Shoombuatong, W. Deep-WET: A deep learning-based approach for predicting DNA-binding proteins using word embedding techniques with weighted features. Sci. Rep. 2024, 14, 2961. [Google Scholar] [CrossRef]
Ruffolo, J.A.; Madani, A. Designing proteins with language models. Nat. Biotechnol. 2024, 42, 200–202. [Google Scholar] [CrossRef] [PubMed]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef]
Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112–7127. [Google Scholar] [CrossRef] [PubMed]
Shen, T.; Hu, Z.; Sun, S.; Liu, D.; Wong, F.; Wang, J.; Chen, J.; Wang, Y.; Hong, L.; Xiao, J.; et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat. Methods 2024, 21, 2287–2298. [Google Scholar] [CrossRef]
Lv, L.; Lin, Z.; Li, H.; Liu, Y.; Cui, J.; Chen, C.Y.; Yuan, L.; Tian, Y. Prollama: A protein large language model for multi-task protein language processing. IEEE Trans. Artif. Intell. Early Access 2025, 1–12. [Google Scholar] [CrossRef]
Jang, Y.J.; Qin, Q.Q.; Huang, S.Y.; Peter, A.T.; Ding, X.M.; Kornmann, B. Accurate prediction of protein function using statistics-informed graph networks. Nat. Commun. 2024, 15, 6601. [Google Scholar] [CrossRef]
Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; Song, Y. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 2019, 32, 9689–9701. [Google Scholar]
Zhang, N.; Bi, Z.; Liang, X.; Cheng, S.; Hong, H.; Deng, S.; Lian, J.; Zhang, Q.; Chen, H. Ontoprotein: Protein pretraining with gene ontology embedding. arXiv 2022, arXiv:2201.11147. [Google Scholar] [CrossRef]
Wang, N.; Bian, J.; Li, Y.; Li, X.; Mumtaz, S.; Kong, L.; Xiong, H. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nat. Mach. Intell. 2024, 6, 548–557. [Google Scholar] [CrossRef]
Hayes, T.; Rao, R.; Akin, H.; Sofroniew, N.J.; Oktay, D.; Lin, Z.; Verkuil, R.; Tran, V.Q.; Deaton, J.; Wiggert, M.; et al. Simulating 500 million years of evolution with a language model. Science 2025, 387, 850–858. [Google Scholar] [CrossRef]
Lam, H.Y.; Ong, X.E.; Mutwil, M. Large language models in plant biology. Trends Plant Sci. 2024, 29, 1145–1155. [Google Scholar] [CrossRef]
Li, W.; Mao, Z.; Xiao, Z.; Liao, X.; Koffas, M.; Chen, Y.; Tang, Y.J. Large language model for knowledge synthesis and AI-enhanced biomanufacturing. Trends Biotechnol. 2025, 43, 1864–1875. [Google Scholar] [CrossRef]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar]
Suzuki, S.; Horie, K.; Amagasa, T.; Fukuda, N. Genomic language models with k-mer tokenization strategies for plant genome annotation and regulatory element strength prediction. Plant Mol. Biol. 2025, 115, 100. [Google Scholar] [CrossRef]
Silva, J.C.; Schuster, L.; Sexson, N.; Erdem, M.; Hulk, R.; Kirst, M.; Resende, M.F.; Dias, R. InteracTor: Feature Engineering and Explainable AI for Profiling Protein Structure-Interaction-Function Relationships. bioRxiv 2025. [Google Scholar] [CrossRef]
Dickinson, Q.; Meyer, J.G. Positional SHAP (PoSHAP) for Interpretation of machine learning models trained from biological sequences. PLoS Comput. Biol. 2022, 18, e1009736. [Google Scholar] [CrossRef]
Clarke, D.J.; Marino, G.B.; Deng, E.Z.; Xie, Z.; Evangelista, J.E.; Ma’ayan, A. Rummagene: Massive mining of gene sets from supporting materials of biomedical research publications. Commun. Biol. 2024, 7, 482. [Google Scholar] [CrossRef]
Song, B.; Li, Z.; Lin, X.; Wang, J.; Wang, T.; Fu, X. Pretraining model for biological sequence data. Brief. Funct. Genom. 2021, 20, 181–195. [Google Scholar] [CrossRef]
Stock, M.; Van Criekinge, W.; Boeckaerts, D.; Taelman, S.; Van Haeverbeke, M.; Dewulf, P.; De Baets, B. Hyperdimensional computing: A fast, robust, and interpretable paradigm for biological data. PLoS Comput. Biol. 2024, 20, e1012426. [Google Scholar] [CrossRef] [PubMed]
Tam, B.; Qin, Z.; Zhao, B.; Sinha, S.; Lei, C.L.; Wang, S.M. Classification of MLH1 Missense VUS Using Protein Structure-Based Deep Learning-Ramachandran Plot-Molecular Dynamics Simulations Method. Int. J. Mol. Sci. 2024, 25, 850. [Google Scholar] [CrossRef] [PubMed]
Tang, Z.; Huang, J.; Chen, G.; Chen, C.Y. Comprehensive View Embedding Learning for Single-Cell Multimodal Integration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 24 March 2024; Volume 38, pp. 15292–15300. [Google Scholar]

Figure 1. Process framework for machine learning-based biological data analysis. Arrows indicate the order of the stages. The process is divided into four stages. Stage A encodes the input sequence data, transforming raw biological sequences into a format suitable for analysis. Stage B covers preprocessing steps, including outlier handling, normalization, feature selection, dimensionality reduction, and data balancing, to enhance data quality and consistency. In Stage C, ML methods such as neural networks (NN), RF, SVM, and linear regression (LR) are used for model training, enabling the development of predictive models. Stage D evaluates model performance using metrics including accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) to assess predictive reliability and generalizability for computational biology tasks such as sequence classification and function prediction.
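
The Stage D metrics named in the caption can be illustrated with a minimal sketch. It assumes a binary task with 0/1 labels and a 0.5 decision threshold; the function names (`binary_metrics`, `roc_auc`) and the example scores are hypothetical and are not taken from the reviewed studies. AUC-ROC is computed here through its pairwise-ranking interpretation, which is one of several equivalent formulations.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC-ROC")
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


# Hypothetical model scores for eight sequences (label 1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Threshold-based metrics (accuracy, precision, recall, F1-score) depend on the chosen cut-off, whereas AUC-ROC depends only on how the scores rank positives against negatives, which is why the two are usually reported together.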

Table 1. Summary of computational-based methods.

Methods	Core Applications	Advantages	Limitations
k-mer-based	Genome assembly, motif discovery, sequence classification	Computationally efficient, captures local patterns	High dimensionality, limited long-range dependency capture
Group-based	Protein function prediction, protein annotation, protein–protein interaction prediction	Encodes physicochemical properties, biologically interpretable	Sparsity in long sequences, parameter optimization needed
Correlation-based	RNA classification, epigenetic modification prediction	Models complex dependencies, robust for multi-property interactions	High computational cost, limited for RNA trinucleotide correlations
PSSM-based	Protein structure/function prediction, PPI prediction	Leverages evolutionary conservation, robust feature extraction	Dependent on alignment quality, computationally intensive
Structure-based	RNA modification prediction, protein function prediction	Captures local structural motifs, biologically meaningful	Relies on accurate structural predictions, limited global context

Table 1 summarizes the computational-based methods. They share a common advantage: all integrate effectively with machine learning models (such as SVM, RF, deep neural networks, and extreme gradient boosting (XGBoost)), improving prediction performance across tasks. They also share a common limitation: most produce high-dimensional output, so dimensionality reduction techniques (such as principal component analysis (PCA)) are usually required for efficient processing.
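
The high-dimensionality limitation is easiest to see for the k-mer-based row of Table 1. The following minimal sketch assumes a DNA alphabet of A, C, G, and T and a plain normalized k-mer frequency vector; the function names (`kmer_vocabulary`, `kmer_frequencies`) are hypothetical, and windows containing other symbols (for example N) are simply skipped, which is an illustrative choice rather than a standard.

```python
from itertools import product
from typing import List


def kmer_vocabulary(k: int, alphabet: str = "ACGT") -> List[str]:
    """All possible k-mers over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_frequencies(seq: str, k: int, alphabet: str = "ACGT") -> List[float]:
    """Normalized k-mer frequency vector of length len(alphabet) ** k."""
    vocab = kmer_vocabulary(k, alphabet)
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    seq = seq.upper()
    # Slide a window of width k over the sequence with step 1.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # skip windows with symbols outside the alphabet
            counts[position] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocab)


# Feature vector for a short example sequence with k = 2 (16 dimensions).
vec = kmer_frequencies("ACGTACGT", 2)

# The feature dimension grows as 4 ** k for DNA.
dims = [len(kmer_vocabulary(k)) for k in range(1, 7)]
```

For DNA the vector length is 4^k, so `dims` grows from 4 at k = 1 to 4096 at k = 6; for the 20-letter protein alphabet the growth is 20^k, which is the reason dimensionality reduction such as PCA is commonly applied before model training.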

Table 2. Summary of word embedding-based methods.

Methods	Core Applications	Advantages	Limitations
Local feature embedding-based	Protein sequence classification, transcription factor binding prediction, gene annotation	Captures short-range patterns, robust to sequence length variability, computationally efficient	Limited to local dependencies, requires large training datasets, lacks direct biological interpretability
Global feature embedding-based	Protein function prediction, RNA methylation site prediction, regulatory RNA identification	Models sequence-wide context, adaptable to diverse tasks, robust to variable sequence lengths	Misses fine-grained local patterns, computationally intensive, sensitive to dataset quality

Table 2 summarizes the word embedding-based methods. They share a common advantage: all integrate effectively with machine learning models (e.g., SVM, deep neural networks, ensemble classifiers), enhancing predictive performance across computational biology tasks. They also share two common limitations. First, embeddings lack direct interpretability because they are not explicitly tied to biological features (e.g., binding affinities, structural motifs), which limits mechanistic insight. Second, performance depends on large, high-quality datasets; smaller or noisy datasets risk producing suboptimal embeddings.

Table 3. Summary of large language model-based methods.

Methods	Core Applications	Advantages	Limitations
Self-supervised learning	RNA/protein structure prediction, binding site identification, functional annotation	Models complex sequence relationships, generalizable with unlabeled data, high predictive accuracy	High computational cost, requires large datasets
Multi-task learning	RNA type/structure classification, protein function/structure prediction, cross-modal analysis	Integrates sequence, structure, and function data, adaptable to diverse tasks, enhanced by multi-modal learning	Complex training process, resource-intensive, sensitive to annotated dataset quality

Table 3 summarizes the large language model-based methods. They share a common advantage: all integrate effectively with machine learning frameworks (e.g., deep neural networks, ensemble classifiers), enhancing predictive performance in computational biology tasks. They also share two common limitations. First, the learned embeddings lack direct interpretability because they are not explicitly tied to biological features (e.g., binding affinities, structural motifs). Second, performance relies on large, high-quality datasets; smaller or noisy datasets risk suboptimal results.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

