Search Results (94)

Search Parameters:
Keywords = k-mer analysis

17 pages, 7393 KB  
Article
Deciphering 6-mer Spectra Distribution Rules in Coronavirus Genomes: Application to Comparative Genomic Analysis
by Zhenhua Yang, Hong Li, Xiaolong Li and Guojun Liu
Int. J. Mol. Sci. 2026, 27(8), 3604; https://doi.org/10.3390/ijms27083604 - 18 Apr 2026
Viewed by 450
Abstract
Given the rapid mutation and high transmissibility of coronaviruses, especially SARS-CoV-2, comparative genomic studies are crucial for understanding viral evolution, transmission dynamics, and therapeutic development. In prior work, we analyzed and compared the spectral distribution patterns of various k-mer subsets across 920 genome sequences, spanning from primates to prokaryotes. This revealed an evolutionary mechanism in genome sequences, indicating the presence of both CG and TA-specific selection modes. In the present study, we further investigate the specific selection modes in coronavirus genomic sequences by examining the intrinsic distribution rules of 32 XYi 6-mer subset spectra. Our results show that coronavirus genomes exhibit only the CG-specific selection mode, with no evidence of TA-specific selection. Using the CG-specific selection mode, we identified CG1 6-mers as the fundamental subset underlying coronavirus genome evolution. To validate the CG1 subset, we constructed phylogenetic relationships for a set of coronaviruses and SARS-CoV-2 variant genomes. Comparative analysis confirmed that the resulting phylogenetic relationships align more closely with established knowledge. This study thus provides a theoretical framework for inferring phylogenetic relationships at the whole-genome level. Full article
(This article belongs to the Section Molecular Genetics and Genomics)

14 pages, 5203 KB  
Article
Machine Learning Prediction of Listeria monocytogenes Serogroups and Biofilm Formation from Infrared Spectra: A Comparative Study with Genomic Analysis
by Martine Denis, Stéphanie Bougeard, Virginie Allain, Mélanie Guy, Emmanuelle Houard, Arnaud Felten, Jean Lagarde, Benoit Gassilloud, Evelyne Boscher and Pierre-Emmanuel Douarre
Appl. Microbiol. 2026, 6(4), 54; https://doi.org/10.3390/applmicrobiol6040054 - 16 Apr 2026
Viewed by 665
Abstract
This study evaluated the performance of Fourier-transform infrared (FTIR) spectroscopy for identifying spectral signatures associated with two key traits of Listeria monocytogenes: serogroup classification and biofilm-forming capacity. A total of 100 strains, previously serogrouped by PCR and categorized as high, intermediate, or low biofilm producers, were analyzed. Whole-genome sequencing was performed, and comparative genomics was conducted at core-genome, pangenome, and whole-genome (k-mer) levels to determine which genomic representation best reflected the phenotypes. Strains were typed by FTIR spectroscopy (FTIR Biotyper® system from Bruker Daltonics GmbH and Co., Bremen, Germany) with five technical replicates. Spectral data from the polysaccharide region (1300–800 cm⁻¹) were extracted and used to train twelve statistical models within a machine learning pipeline combined with cross-validation to predict four serogroups and three biofilm clusters from 501 spectral variables. Genomic analyses showed strong concordance between population structure and serogroup, whereas biofilm formation displayed only weak genomic association, explaining less than 0.1% of genomic variance (PERMANOVA R² ≤ 0.001). Penalized discriminant analysis achieved the highest performance for serogroup prediction (overall accuracy 97.2%), while the k-nearest neighbor model performed best for biofilm prediction (74.8%). Two dedicated R Shiny applications were developed to facilitate model use. Overall, FTIR spectroscopy coupled with machine learning can provide a rapid and cost-effective alternative to PCR, genomic analyses, and in vitro assays for phenotypic trait prediction. Full article

21 pages, 3313 KB  
Article
MGF-DTA: A Multi-Granularity Fusion Model for Drug–Target Binding Affinity Prediction
by Zheng Ni, Bo Wei and Yuni Zeng
Int. J. Mol. Sci. 2026, 27(2), 947; https://doi.org/10.3390/ijms27020947 - 18 Jan 2026
Viewed by 723
Abstract
Drug–target affinity (DTA) prediction is one of the core components of drug discovery. Despite considerable advances in previous research, DTA tasks still face several limitations with insufficient multi-modal information of drugs, the inherent sequence length limitation of protein language models, and single attention mechanisms that fail to capture critical multi-scale features. To alleviate the above limitations, we developed a multi-granularity fusion model for drug–target binding affinity prediction, termed MGF-DTA. This model is composed of three fusion modules, specifically as follows. First, the model extracts deep semantic features of SMILES strings through ChemBERTa-2 and integrates them with molecular fingerprints by using gated fusion to enhance the multi-modal information of drugs. In addition, it employs a residual fusion mechanism to integrate the global embeddings from ESM-2 with the local features obtained by the k-mer and principal component analysis (PCA) method. Finally, a hierarchical attention mechanism is employed to extract multi-granularity features from both drug SMILES strings and protein sequences. Comparative analysis with other mainstream methods on the Davis, KIBA, and BindingDB datasets reveals that the MGF-DTA model exhibits outstanding performance advantages. Further, ablation studies confirm the effectiveness of the model components and case study illustrates its robust generalization capability. Full article

18 pages, 3607 KB  
Article
Construction of Phylogenetic Relationships Based on 8-mer Spectra Distribution Characteristics of Vertebrate Whole Genome Sequences
by Zhenhua Yang, Li Wang, Guojun Liu, Dongsheng Yu and Xiangjun Cui
Genes 2026, 17(1), 39; https://doi.org/10.3390/genes17010039 - 31 Dec 2025
Cited by 1 | Viewed by 616
Abstract
Background/Objectives: With advances in sequencing technology, whole genome sequences have become a valuable resource for deciphering species evolution. However, efficiently extracting phylogenetic information from such data remains a major challenge. Traditional multiple sequence alignment methods are computationally intensive and perform poorly for distantly related species, while k-mer analysis offers a new direction for efficiently capturing genomic composition and evolutionary signatures. Methods: Feature extraction based on 8-mer spectra from 16 XYi subsets. Results: This study found that the distribution characteristics of whole genome sequences 8-mer spectra are closely related to species evolution. Building on this, we developed a dual-feature strategy for genome-scale phylogenetics. The strategy incorporates two distinct feature types: (a) 186 class-level phylogenetic features (comprising 93 for separability and 93 for conservatism), identified from 8-mer spectrum distributions of 16 XYi subsets, which capture macroevolutionary patterns; and (b) order-level phylogenetic features, designated as rank information, which are generated by ranking all 65,536 8-mers by frequency based on the CGi subset’s long-tail distribution and thereby capture microevolutionary patterns. Validation across vertebrate genomes confirmed that the class-level features establish the phylogenetic backbone, whereas the order-level features enable finer-resolution discrimination at the ordinal level. Conclusions: This study proposes a new method for constructing phylogenetic relationships at the genomic level. Full article
(This article belongs to the Section Bioinformatics)

18 pages, 2591 KB  
Article
Tracking Down the Evolution of Microorganisms by Exhaustive Bottom-Up Analysis of Proteomes
by Dmitrii O. Kostenko, Natalya S. Bogatyreva and Alexey N. Fedorov
Int. J. Mol. Sci. 2026, 27(1), 109; https://doi.org/10.3390/ijms27010109 - 22 Dec 2025
Cited by 1 | Viewed by 698
Abstract
Proteomes are typically analyzed at the level of individual proteins or protein families. In this study, we introduce a bottom-up approach that treats proteomes as holistic entities by examining the properties of k-mers within entire proteomes and protein groups. We performed a comprehensive analysis of short amino acid k-mer (k = 1, 2, 3) distributions across all proteins in a given proteome. Using 86 bacterial proteomes representing 18 clades, we evaluated whether k-mer frequencies characterize uniquely the analyzed organisms. Remarkably, in a post hoc analysis, we found that the k-mer frequency vector unambiguously coevolves with the entire proteome—a pattern not observed even within specific protein groups, such as conserved ribosomal proteins or more variable nucleotide-binding proteins. This finding holds regardless of the k-mer calculation parameters or the distance metrics employed. Our results show that even a simple analysis based on tripeptide frequencies can precisely position proteomes within the k-mer space. Moreover, relationships derived from k-mer comparisons highly correlate with evolutionary relationships derived from phylogenetic trees, reaching up to 99% match with reference classification of the proteomes within major bacterial clades. These findings establish k-mer-based proteomic analysis as an additional robust and powerful feature for characterizing evolutionary relationships, opening new pathways in phylogenetics and evolutionary genomics. Full article
(This article belongs to the Section Molecular Informatics)
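The tripeptide-frequency comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `kmer_frequencies` and `euclidean_distance`, the restriction to the 20 standard amino acids, and the choice of Euclidean distance are all assumptions made for illustration (the abstract only says that several distance metrics were tried).

```python
from collections import Counter
import math

# Assumption: only the 20 standard amino acids are counted; k-mers
# containing any other symbol (X, B, *, ...) are skipped.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def kmer_frequencies(proteins, k=3):
    """Return relative frequencies of amino acid k-mers over a whole proteome.

    `proteins` is an iterable of protein sequences (strings).
    """
    counts = Counter()
    for seq in proteins:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if all(c in AMINO_ACIDS for c in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two sparse k-mer frequency vectors."""
    keys = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) ** 2 for x in keys)
    )
```

A pairwise distance matrix built this way over many proteomes could then be fed to any standard clustering or tree-building routine.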

28 pages, 2198 KB  
Systematic Review
Bioinformatics Tools and Approaches for Virus Discovery in Genomic Data: A Systematic Review
by Julia Galeeva, Polina Kuzmichenko, Alexander Manolov, Alexander Lukashev and Elena Ilina
Viruses 2025, 17(12), 1538; https://doi.org/10.3390/v17121538 - 24 Nov 2025
Cited by 1 | Viewed by 2865
Abstract
The exponential growth of viral metagenomic data has created an urgent need for accurate and scalable tools for virus discovery, yet the extreme diversity, rapid evolution, and limited reference databases for viruses pose unique computational challenges that traditional sequence comparison methods struggle to address. This systematic review, conducted in accordance with PRISMA 2020, examines current trends and methodological advances in virus discovery tools from 1990 to 2025. As virus discovery is a broad and multi-dimensional topic, this review focuses on the first-line tools used to analyze the results of high-throughput sequencing. The review was conducted using the PubMed database with a snowballing approach, with over 54 key studies selected for the analysis. These studies encompass the following approaches: alignment-based methods, rapid similarity estimation techniques, profile hidden Markov model methods, combination pipelines, k-mer-based approaches, and machine learning-based methods. The transition from alignment-based to machine learning methods has dramatically improved the detection of divergent viruses, yet challenges remain in interpreting model decisions and handling incomplete viral genomes. This review summarizes current knowledge and potential future directions for the development of virus detection capabilities. Full article
(This article belongs to the Section General Virology)

17 pages, 1973 KB  
Article
Analysis of the Relationship Between the Charge Increment of the SARS-CoV-2 Spike Protein and Evolution
by Yingxue Ma, Ying Zhang, Menghao Chen, Kun Wang and Jun Lv
Viruses 2025, 17(11), 1483; https://doi.org/10.3390/v17111483 - 8 Nov 2025
Viewed by 1157
Abstract
The changes in charge distribution caused by mutations in the spike protein may play a crucial role in balancing infectivity and immune evasion during the evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To explore how charge increments in spike protein variants influence viral evolution, a statistical analysis was conducted on 57 SARS-CoV-2 variants, examining relationships between charge distribution, lineage divergence, angiotensin-converting enzyme 2 (ACE2) affinity, immune evasion, and receptor-binding domain (RBD) expression. A phylogenetic tree was also reconstructed using only the charge properties of mutation sites. Results indicated that with increasing lineage divergence, overall positive charge initially rose sharply and then more gradually. Partitioning the spike protein into three domains—the RBD, the N-terminal flanking region (B-RBD), and the C-terminal flanking region (A-RBD)—revealed distinct patterns: positive charge increased in the RBD and A-RBD, whereas the B-RBD accumulated negative charge. Charge increments were negatively associated with ACE2 affinity and RBD expression but positively correlated with immune evasion. The k-mer-based tree derived from charge-reduced sequences showed a topology consistent with the whole-genome tree. These findings suggest that charge distribution in spike proteins is closely linked to viral evolution, with the opposing trends in the RBD and B-RBD potentially reflecting a balance between infectivity and immune escape. Full article
(This article belongs to the Section Coronaviruses)

15 pages, 1865 KB  
Article
Molecular Epidemiological Surveillance of Carbapenem-Resistant Gram-Negative Bacteria in Southern Lebanon
by Anwar Al Souheil, Hadi Hussein, Ziad Jabbour, Sara Barada, Jose-Rita Gerges, Ghada Derbaj, Abdallah Kurdi, Hassan Jamil Kazma, Nour Nahouli, Ali Hasan Najem, Abdallah Medlej, Wael Zorkot, Rana El Hajj, Mahmoud I. Khalil, Ghassan M. Matar and Antoine Abou Fayad
Antibiotics 2025, 14(11), 1124; https://doi.org/10.3390/antibiotics14111124 - 7 Nov 2025
Cited by 1 | Viewed by 1404
Abstract
Introduction: Carbapenem-resistant Gram-negative bacteria (CR-GNB) are rapidly spreading pathogens that increase morbidity and mortality in hospital settings and significantly restrict available treatment options worldwide. The lack of molecular epidemiological data and the limited use of next-generation sequencing (NGS) in South Lebanon have hindered comprehensive surveillance efforts. This study represents the first molecular characterization of CR-GNB in this region. Methods: A total of 477 clinical Gram-negative bacterial isolates were collected from intensive care unit (ICU) patients admitted to various hospitals in South Lebanon in 2023. Of these, 131 CR-GNB were subjected to whole-genome sequencing using the Illumina MiSeq platform. K-mer-based species identification, multilocus sequence typing (MLST), antimicrobial resistance (AMR) gene profiling, and plasmid analysis were performed using multiple bioinformatic tools. Phylogenetic analysis was conducted using SaffronTree. Results: K-mer-based identification revealed that the predominant species among CR-GNB isolates were Pseudomonas spp. and Escherichia coli (26.7% each), followed by Klebsiella pneumoniae (19.8%), Acinetobacter baumannii (17.6%), Proteus mirabilis (4.6%), Enterobacter cloacae (2.3%), Achromobacter spp. (1.5%), and Citrobacter freundii (0.8%). Based on antimicrobial susceptibility testing, isolates were classified as follows: 0.8% as pan-drug-resistant (PDR), 40.5% as extensively drug-resistant (XDR), 52.7% as multidrug-resistant (MDR), and 6.1% as antimicrobial-resistant (AMR). All isolates harbored AMR genes, with the following distribution: 2% blaVIM, 5% blaNDM-1, 27% blaNDM-5, 65% blaOXA-type, and 1% blaDIM-1. Plasmid-associated AMR genes were detected in 58% of isolates; among these, 96% carried Inc-family plasmids, 57% Col plasmids, and 11% replication-associated elements (rep). Phylogenetic analysis demonstrated that certain isolates exhibited both hospital-specific and shared genetic profiles, indicating widespread dissemination across multiple healthcare facilities, as well as evidence of local emergence and ongoing transmission. Conclusions: The high prevalence of CR-GNB harboring resistance genes and plasmids underscores the urgent need for NGS-based genomic surveillance in South Lebanon. Implementing such strategies is essential for tracking resistance genes, identifying clonal outbreaks, and guiding effective infection control interventions to mitigate the spread of CR-GNB. Full article

17 pages, 12144 KB  
Article
The Genome Survey Analysis of Female and Male Sepiella japonica
by Yuting Ren, Yinquan Qu, Fenglin Wang, Tianxiang Gao and Xiumei Zhang
Genes 2025, 16(10), 1215; https://doi.org/10.3390/genes16101215 - 15 Oct 2025
Viewed by 1518
Abstract
Background/Objectives: Sepiella japonica is a highly adaptable cephalopod with an advanced nervous system and complex reproductive behavior, capable of reproducing two to three generations annually depending on water temperature. However, the absence of a complete genome assembly has limited molecular investigations of its unique biological characteristics. This study aimed to perform a genome survey of female and male S. japonica and to systematically characterize and compare key genomic characteristics. Methods: Quality-filtered short reads were used for K-mer-based estimation of genome size, heterozygosity, repeat content, and GC content; generation of draft genome assemblies; SSR identification from the draft assemblies; complete mitogenome assembly and annotation, with ML phylogeny based on 13 concatenated PCGs; and PSMC-based demographic inference. Results: The estimated genome sizes were 4317 Mb (female) and 4222 Mb (male), with revised estimates of 4310 Mb and 4215 Mb, respectively. K-mer analysis revealed heterozygosity rates of 0.85% (female) and 0.77% (male) and repeat content of 76.05% (female) and 75.91% (male). The assembled genome sizes were 4197 Mb for females (N50: 508 bp) and 4206 Mb for males (N50: 511 bp); the GC content was 34.15% for both genomes. Deduplicated data showed GC content of 35.16% (female) and 35.27% (male). Microsatellite analysis revealed that mononucleotide repeats were the most abundant simple sequence repeat motif. The mitochondrial genome sequences measured 16,729 bp for the female genome and 16,725 bp for the male genome. Conclusions: This study provides fundamental data for subsequent high-quality whole-genome assembly and comparative analysis of female and male genomes. Full article
(This article belongs to the Section Animal Genetics and Genomics)
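The abstract does not state how genome size was derived from the K-mer data. A common genome-survey convention, assumed here, divides the total number of K-mers by the depth of the main peak of the K-mer depth histogram. The sketch below shows that arithmetic on a toy histogram; the function name `estimate_genome_size`, the `min_depth` cutoff for discarding likely error K-mers, and the input format are all hypothetical, not taken from the paper.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a k-mer depth histogram.

    `histogram` maps depth -> number of distinct k-mers seen at that depth.
    K-mers below `min_depth` are treated as sequencing errors and ignored
    (an assumed cutoff; real tools fit this boundary from the data).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth at which the most distinct k-mers occur.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: a low-depth error tail plus one peak centred on depth 20.
toy = {1: 1000, 2: 200, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50}
size = estimate_genome_size(toy)  # 7200 k-mer occurrences / peak depth 20 = 360.0
```

This simple ratio ignores heterozygous and repeat peaks, which matters for a genome reported as roughly 76% repetitive, so dedicated model-fitting tools are normally preferred in practice.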

20 pages, 1372 KB  
Article
α-Linolenic Acid Production in Aspergillus oryzae via the Overexpression of an Endogenous Omega-3 Desaturase Gene
by Hiroki Kikuta, Hirotoshi Sushida, Tsuyoshi Tanaka, Eiichi Kotake, Wakako Tsuzuki, Ryota Hattori, Satoshi Suzuki, Ken-Ichi Kusumoto and Junichi Mano
Fermentation 2025, 11(10), 585; https://doi.org/10.3390/fermentation11100585 - 11 Oct 2025
Cited by 1 | Viewed by 2603
Abstract
α-Linolenic acid (ALA) is an important essential omega-3 (ω-3) polyunsaturated fatty acid for the maintenance of human health. Although ALA has traditionally been obtained from plant sources, microbial fermentation has emerged as a promising alternative for its sustainable and cost-effective production. However, most of the present approaches rely on genetically modified organisms, which present regulatory and consumer-acceptance concerns. In this study, we aimed to develop a high-ALA-producing strain of Aspergillus oryzae, a Generally Recognized As Safe (GRAS) microorganism widely used in food production in Japan, through self-cloning, a form of genetic engineering that utilizes only the host’s own DNA. To achieve this, an endogenous ω-3 desaturase gene (fad3), which catalyzes the conversion of linoleic acid to ALA, was identified via BLASTP analysis. Subsequently, a multicopy A. oryzae strain (Aofad3-MC) overexpressing fad3 was constructed. This strain increased ALA production, with ALA comprising 30.7% of the total lipids. Furthermore, k-mer analysis confirmed the absence of foreign vector sequences, verifying that Aofad3-MC was constructed through self-cloning. In addition to the identification of the A. oryzae ω-3 desaturase gene, this study provides a microbial platform for the sustainable production of ALA, with potential applications across the food, feed, and related industries. Full article
(This article belongs to the Special Issue Metabolic Engineering, Strain Modification and Industrial Application)

16 pages, 2426 KB  
Article
First Insights into Ploidy and Genome Size Estimation in Choerospondias axillaris (Roxb.) B.L.Burtt & A.W.Hill (Anacardiaceae) Using Flow Cytometry and Genome Survey Sequencing
by Fangdi Li, Zhuolong Shen, Tianhe Zhang, Xiaoge Gao, Huashan Ling, Hequn Gu, Zhigao Liu, Jiyan Liu, Chaokai Lin and Qirong Guo
Plants 2025, 14(19), 3094; https://doi.org/10.3390/plants14193094 - 7 Oct 2025
Viewed by 1840
Abstract
The genome size of Choerospondias axillaris (Roxb.) B.L.Burtt & A.W.Hill, an economically significant tree in the Anacardiaceae family with industrial, medicinal, and ecological value, remains unreported. Here, we optimized a flow cytometry-based method for ploidy analysis and found WPB lysis solution to be the most effective. Analysis of 58 C. axillaris accessions identified 47 diploids and 11 triploids. The average genome size of diploids was estimated at 450.36 Mb. Illumina sequencing of a diploid (No.22) generated 81.98 Gb of high-quality data (224.44× depth). K-mer analysis estimated the genome size at 365.25 Mb, with 0.91% genome heterozygosity, 34.17% GC content, and 47.74% repeated sequences, indicating high heterozygosity and duplication levels in the genome. Genome assembly may necessitate a combination of second- and third-generation sequencing technologies. Comparative analysis with the NT database revealed that C. axillaris exhibited the highest similarity to C. axillaris (3.01%) and Pistacia vera (2.5%). This study establishes a crucial theoretical framework for C. axillaris genome sequencing and molecular genetics. Full article
(This article belongs to the Section Plant Genetics, Genomics and Biotechnology)

25 pages, 7550 KB  
Article
CG-Based Stratification of 8-mers Highlights Functional Roles and Phylogenetic Divergence Markers
by Guojun Liu, Hu Meng, Zhenhua Yang, Guoqing Liu, Yongqiang Xing and Ningkun Xiao
Int. J. Mol. Sci. 2025, 26(19), 9477; https://doi.org/10.3390/ijms26199477 - 27 Sep 2025
Cited by 1 | Viewed by 1128
Abstract
K-mer analysis is a powerful tool for understanding genome structure and evolution. A “k-mer” refers to a short DNA sequence made up of k nucleotides (where k is a specific integer), while an “m-mer” is a similar concept but with a shorter sequence length. The functional mechanisms of CG-containing k-mers, as well as their potential role in evolutionary processes, remain unclear. To explore this issue, we analyzed 8-mers in several species with varying genomic complexities and evolutionary divergences: Homo sapiens, Saccharomyces cerevisiae, Bombyx mori, Ciona intestinalis, Danio rerio, and Caenorhabditis elegans, which were grouped by CG dinucleotide content (0CG, 1CG, and 2CG). We examined the relative frequencies of shorter m-mers (with m = 3 and 4) within each CG-defined group, using information-theoretic, distance-based, and angular metrics. Our results show that 0CG motifs follow random patterns, while 1CG and 2CG motifs display significant deviations, likely due to functional constraints such as nucleosome-binding and CpG island association. The observed unimodal distribution of 8-mers arises from the convergence of the three CG-defined groups. Among them, the 2CG group shows the highest divergence in m-mer composition, followed by 1CG, reflecting varying degrees of selective pressure. Furthermore, species-specific differences in CG-classified 8-mer patterns could provide valuable insights into phylogenetic relationships. Through extensive comparison, we explore how CG content and sequence composition influence genomic organization and contribute to evolutionary divergence across different taxa. These findings deepen our understanding of short motif functions, genome organization, and sequence evolution. Full article
(This article belongs to the Special Issue Statistical Approaches to Omics Data: Searching for Biological Truth)
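The CG-based grouping of 8-mers described in this abstract can be sketched as follows. This is an illustrative reading, not the authors' code: the function names `cg_group` and `stratified_kmer_counts` are hypothetical, and labelling an 8-mer simply by its number of CG dinucleotides (which can reach 4, beyond the 0CG, 1CG, and 2CG groups the abstract names) is an assumption.

```python
from collections import Counter, defaultdict

DNA = frozenset("ACGT")


def cg_group(kmer):
    """Label a k-mer by how many CG dinucleotides it contains, e.g. '2CG'."""
    # 'CG' cannot overlap itself, so str.count gives the exact number.
    return f"{kmer.upper().count('CG')}CG"


def stratified_kmer_counts(sequence, k=8):
    """Count k-mers in `sequence`, split into groups by CG dinucleotide count.

    Returns a dict mapping a group label ('0CG', '1CG', ...) to a Counter
    of the k-mers in that group. K-mers with non-ACGT symbols are skipped.
    """
    groups = defaultdict(Counter)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if DNA.issuperset(kmer):
            groups[cg_group(kmer)][kmer] += 1
    return groups


groups = stratified_kmer_counts("AAAAAAAACG")
# Three 8-mers: AAAAAAAA and AAAAAAAC fall in '0CG', AAAAAACG in '1CG'.
```

The per-group counters could then be compared with whichever information-theoretic or distance metric is of interest.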

22 pages, 1926 KB  
Review
Biological Sequence Representation Methods and Recent Advances: A Review
by Hongwei Zhang, Yan Shi, Yapeng Wang, Xu Yang, Kefeng Li, Sio-Kei Im and Yu Han
Biology 2025, 14(9), 1137; https://doi.org/10.3390/biology14091137 - 27 Aug 2025
Cited by 3 | Viewed by 3100
Abstract
Biological-sequence representation methods are pivotal for advancing machine learning in computational biology, transforming nucleotide and protein sequences into formats that enhance predictive modeling and downstream task performance. This review categorizes these methods into three developmental stages: computational-based, word embedding-based, and large language model (LLM)-based, detailing their principles, applications, and limitations. Computational-based methods, such as k-mer counting and position-specific scoring matrices (PSSM), extract statistical and evolutionary patterns to support tasks like motif discovery and protein–protein interaction prediction. Word embedding-based approaches, including Word2Vec and GloVe, capture contextual relationships, enabling robust sequence classification and regulatory element identification. Advanced LLM-based methods, leveraging Transformer architectures like ESM3 and RNAErnie, model long-range dependencies for RNA structure prediction and cross-modal analysis, achieving superior accuracy. However, challenges persist, including computational complexity, sensitivity to data quality, and limited interpretability of high-dimensional embeddings. Future directions prioritize integrating multimodal data (e.g., sequences, structures, and functional annotations), employing sparse attention mechanisms to enhance efficiency, and leveraging explainable AI to bridge embeddings with biological insights. These advancements promise transformative applications in drug discovery, disease prediction, and genomics, empowering computational biology with robust, interpretable tools. Full article
(This article belongs to the Special Issue Machine Learning Applications in Biology—2nd Edition)

25 pages, 9347 KB  
Article
Phylogroup Homeostasis of Escherichia coli in the Human Gut Reflects the Physiological State of the Host
by Maria S. Frolova, Sergey S. Kiselev, Valery V. Panyukov and Olga N. Ozoline
Microorganisms 2025, 13(7), 1584; https://doi.org/10.3390/microorganisms13071584 - 4 Jul 2025
Viewed by 1324
Abstract
The advent of alignment-free k-mer barcoding has revolutionized taxonomic analysis, enabling bacterial identification at phylogroup resolution within natural communities. We applied this approach to characterize Escherichia coli intraspecific diversity in human gut microbiomes using publicly available datasets representing diverse human physiological states. By estimating the relative abundance of eight E. coli phylogroups defined by their 18-mer markers in 558 fecal samples, we compared their distribution between gut microbiomes of healthy individuals, patients with chronic bowel diseases, and volunteers subjected to various external interventions. Across all datasets, phylogroups exhibited bidirectional abundance shifts in response to host physiological changes, indicating an inherent bimodality in their adaptive strategies. Correlation analysis of phylogroup persistence revealed positive intraspecific connectivity networks and dependence of their patterns on both acute interventions like antibiotic or probiotic treatment and chronic bowel disorders. Along with predominantly negative correlations with Bacteroides, we observed a transition from positive to negative associations with Prevotella in Prevotella-rich microbiomes. Several interspecific correlations individually established by E. coli phylogroups with dominant taxa suggest their potential role in shaping intraspecific networks. Machine learning techniques statistically confirmed the ability of phylogroup patterns to discriminate the physiological state of the host, and virtual diagnostic assays opened a way to optimize intraspecific phylotyping for medical applications.

20 pages, 2342 KB  
Article
Comparing Strategies for Optimal Pumps as Turbines Selection in Pressurised Irrigation Networks Using Particle Swarm Optimisation: Application in Canal del Zújar Irrigation District, Spain
by Mariana Akemi Ikegawa Bernabé, Miguel Crespo Chacón, Juan Antonio Rodríguez Díaz, Pilar Montesinos and Jorge García Morillo
Technologies 2025, 13(6), 233; https://doi.org/10.3390/technologies13060233 - 5 Jun 2025
Cited by 1 | Viewed by 1126
Abstract
The modernisation of irrigation networks has enhanced water use efficiency but increased energy demand and costs in agriculture. Energy recovery (ER) is possible by utilising excess pressure to generate electricity with pumps as turbines (PATs), offering a cost-effective alternative to traditional turbines. This study assesses the use of PATs in pressurised irrigation networks for recovering wasted hydraulic energy, employing the particle swarm optimisation (PSO) algorithm for PAT sizing based on two single-objective functions. The analysis focuses on minimising the payback period (MPP) and maximising energy recovery (MER) at specific excess pressure points (EPPs). A comparative analysis of values for each EPP and objective function is conducted independently in Sector II of the Canal del Zújar Irrigation District (CZID) in Extremadura, Spain. A sensitivity analysis on energy prices and installation costs is also performed to assess socioeconomic trends and volatility, examining their effects on both objective functions. Using 2015 data for an average irrigation season, the optimisation process predicts an annual ER ranging from 9554.86 kWh to 43,992.15 kWh per PAT for the MER function, and payback periods (PPs) from 12.92 years to 3.01 years for the MPP function. The sensitivity analysis replicated the optimisation for the years 2022 and 2023, showing potential annual ER of up to 54,963.21 kWh and PPs ranging from 0.88 to 5.96 years for the year 2022.
(This article belongs to the Special Issue Sustainable Water and Environmental Technologies of Global Relevance)