Benchmarking Datasets in Bioinformatics, 2nd Edition

A special issue of Data (ISSN 2306-5729). This special issue belongs to the section "Computational Biology, Bioinformatics, and Biomedical Data Science".

Deadline for manuscript submissions: 31 July 2025

Special Issue Information

Dear Colleagues,

Over the last few years, computational prediction and identification have gained importance in the life and medical sciences. Considerable effort has gone into developing algorithms and computational models that identify molecular structures, functions, interactions, and evolution, and their relationships with complex disorders. To validate these methods, many benchmarking datasets have been constructed, applied, and released to the public domain. These datasets form the basis for the fair comparison and validation of computational methods, so a thorough discussion and comparison of them is necessary. In this Special Issue, we aim to provide deep insights into the construction procedures and characteristics of different benchmarking datasets that address the same or similar biological topics.

We are looking for manuscripts that discuss different benchmarking datasets covering a single bioinformatics topic or a specific category of topics. These manuscripts can discuss and compare the construction procedures, data sources, and statistics of different datasets, as well as the computational methods that are developed and evaluated using these datasets. There is no limit or fixed boundary to these comparisons. All kinds of discussions, comments, and comparisons are welcome. In particular, collections of different datasets covering a single topic or similar topics are welcome, as they will facilitate the further development of computational methods. In general, all contributions related to bioinformatics benchmarking datasets may be included in this Special Issue.

Dr. Pufeng Du
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Data is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • bioinformatics datasets
  • dataset construction
  • dataset comparisons
  • dataset qualities
  • dataset comments
  • dataset collections
  • comparison of computational methods based on datasets

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information is available in MDPI's Special Issue policies.

Published Papers (4 papers)

13 pages, 726 KiB  
Data Descriptor
A Non-Binary Approach to Super-Enhancer Identification and Clustering: A Dataset for Tumor- and Treatment-Associated Dynamics in Mouse Tissues
by Ekaterina D. Osintseva, German A. Ashniev, Alexey V. Orlov, Petr I. Nikitin, Zoia G. Zaitseva, Vladimir V. Volkov and Natalia N. Orlova
Data 2025, 10(5), 74; https://doi.org/10.3390/data10050074 - 14 May 2025
Viewed by 162
Abstract
Super-enhancers (SEs) are large clusters of highly active enhancers that play key regulatory roles in cell identity, development, and disease. While conventional methods classify SEs in a binary fashion (super-enhancer or not), this threshold-based approach can overlook significant intermediate states of enhancer activity. Here, we present a dataset and accompanying framework that facilitate a more nuanced, non-binary examination of SE activation across mouse tissue types (mammary gland, lung tissue, and NMuMG cells) and various experimental conditions (normal, tumor, and drug-treated samples). By consolidating overlapping SE intervals and capturing continuous enhancer activity metrics (e.g., ChIP-seq signal intensities), our dataset reveals gradual transitions between moderate and high enhancer activity levels that are not captured by strictly binary classification. Additionally, the data include extensive functional annotations, linking SE loci to nearby genes and enabling immediate downstream analyses such as clustering and gene ontology enrichment. The flexible approach supports broader investigations of enhancer landscapes, offering a comprehensive platform for understanding how SE activation underpins disease mechanisms, therapeutic response, and developmental processes.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

14 pages, 4526 KiB  
Data Descriptor
A Complementary Dataset of Scalp EEG Recordings Featuring Participants with Alzheimer’s Disease, Frontotemporal Dementia, and Healthy Controls, Obtained from Photostimulation EEG
by Aimilia Ntetska, Andreas Miltiadous, Markos G. Tsipouras, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Dimitrios G. Tsalikakis, Konstantinos Sakkas, Emmanouil D. Oikonomou, Nikolaos Grigoriadis, Pantelis Angelidis, Nikolaos Giannakeas and Alexandros T. Tzallas
Data 2025, 10(5), 64; https://doi.org/10.3390/data10050064 - 29 Apr 2025
Viewed by 352
Abstract
Research interest in the application of electroencephalogram (EEG) as a non-invasive diagnostic tool for the automated detection of neurodegenerative diseases is growing. Open-access datasets have become crucial for researchers developing such methodologies. Our previously published open-access dataset of resting-state (eyes-closed) EEG recordings from patients with Alzheimer’s disease (AD), frontotemporal dementia (FTD), and cognitively normal (CN) controls has attracted significant attention. In this paper, we present a complementary dataset consisting of eyes-open photic stimulation recordings from the same cohort. The dataset includes recordings from 88 participants (36 AD, 23 FTD, and 29 CN) and is provided in Brain Imaging Data Structure (BIDS) format, promoting consistency and ease of use across research groups. Additionally, a fully preprocessed version is included, using EEGLAB-based pipelines that involve filtering, artifact removal, and Independent Component Analysis, preparing the data for machine learning applications. This new dataset enables the study of brain responses to visual stimulation across different cognitive states and supports the development and validation of automated classification algorithms for dementia detection. It offers a valuable benchmark for both methodological comparisons and biological investigations, and it is expected to significantly contribute to the fields of neurodegenerative disease research, biomarker discovery, and EEG-based diagnostics.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

7 pages, 407 KiB  
Data Descriptor
Draft Genome Sequence Data of the Ensifer sp. P24N7, a Symbiotic Bacteria Isolated from Nodules of Phaseolus vulgaris Grown in Mining Tailings from Huautla, Morelos, Mexico
by José Augusto Ramírez-Trujillo, Maria Guadalupe Castillo-Texta, Mario Ramírez-Yáñez and Ramón Suárez-Rodríguez
Data 2025, 10(3), 34; https://doi.org/10.3390/data10030034 - 27 Feb 2025
Viewed by 751
Abstract
In this work, we report the draft genome sequence of Ensifer sp. P24N7, a symbiotic nitrogen-fixing bacterium isolated from nodules of Phaseolus vulgaris var. Negro Jamapa planted in pots that contained mining tailings from Huautla, Morelos, México. The genomic DNA was sequenced on an Illumina NovaSeq 6000 using the 250 bp paired-end protocol, obtaining 1,188,899 reads. An assembly generated with SPAdes v. 3.15.4 resulted in a genome length of 7,165,722 bp composed of 181 contigs with an N50 of 323,467 bp, a coverage of 76X, and a GC content of 61.96%. The genome was annotated with the NCBI Prokaryotic Genome Annotation Pipeline and contains 6631 protein-coding sequences, 3 complete rRNAs, 52 tRNAs, and 4 non-coding RNAs. The Ensifer sp. P24N7 genome has 59 genes related to heavy metal tolerance, as predicted by the RAST server. These data may be useful to the scientific community because they can be used as a reference for other works related to heavy metals, including works in Huautla, Morelos.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

11 pages, 1926 KiB  
Data Descriptor
Minisatellite Isolation and Minisatellite Molecular Marker Development in Citrus limon (L.) Osbeck
by Oleg S. Alexandrov and Dmitry V. Romanov
Data 2025, 10(1), 2; https://doi.org/10.3390/data10010002 - 28 Dec 2024
Viewed by 893
Abstract
Minisatellites are widespread tandem DNA repeats in the genome with a monomer length of 10 to 100 bp. The high variability of minisatellite loci makes them attractive for the development of molecular markers. Minisatellites are used as markers according to three strategies: marking of digested genomic DNA with minisatellite-based probes; amplification with primers based on the sequences of the minisatellites themselves; and amplification with primers designed for borders upstream and downstream of the minisatellite locus. In this study, a microsatellite dataset was obtained from the analysis of the Citrus limon (L.) Osbeck genome using Tandem Repeat Finder (TRF) and GMATA software. The minisatellite loci found were used to develop molecular markers that were tested in GMATA using electronic PCR (e-PCR). The obtained dataset includes sequences of extracted minisatellites and their characteristics (start and end nucleotide positions on the chromosome, length of monomer, number of repetitions, and length of array), as well as sequences of developed primers, expected lengths of amplicons, and e-PCR results. The presented dataset can be used for the marking of lemon samples according to any of the three strategies. It provides a useful basis for lemon variety certification, identification of samples, verification of collections, lemon genome mapping, saturation of already created maps, and the study of lemon genome architecture, among other uses.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)