Benchmarking Datasets in Bioinformatics, 2nd Edition

A special issue of Data (ISSN 2306-5729). This special issue belongs to the section "Computational Biology, Bioinformatics, and Biomedical Data Science".

Deadline for manuscript submissions: 31 July 2025

Special Issue Information

Dear Colleagues,

Over the last few years, computational prediction and identification have gained importance in modern life science and medical science. Many efforts have been made to develop algorithms and computational models that can be used to identify molecular structures, functions, interactions, and evolution, as well as their relationships with complex disorders. To validate these methods, many benchmarking datasets have been constructed, applied, and released to the public domain. These benchmarking datasets form the basis of the fair comparison and validation of computational methods. A thorough discussion and comparison of these datasets is necessary. In this Special Issue, we aim to provide deep insights into the construction procedures and characteristics of different benchmarking datasets with the same, or similar, biological topics.

We are looking for manuscripts that discuss different benchmarking datasets covering a single bioinformatics topic or a specific category of topics. These manuscripts can discuss and compare the construction procedures, data sources, and statistics of different datasets, as well as the computational methods that are developed and evaluated using these datasets. There is no limit or fixed boundary to these comparisons. All kinds of discussions, comments, and comparisons are welcome. In particular, collections of different datasets covering a single topic or similar topics are welcome, as these will facilitate the further development of different computational methods. In general, all contributions related to bioinformatics benchmarking datasets may be included in this Special Issue.

Dr. Pufeng Du
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Data is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • bioinformatics datasets
  • dataset construction
  • dataset comparisons
  • dataset qualities
  • dataset comments
  • dataset collections
  • comparison of computational methods based on datasets

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information is available in MDPI's Special Issue policies.

Published Papers (7 papers)

Research

15 pages, 3594 KiB  
Article
Macao-ebird: A Curated Dataset for Artificial-Intelligence-Powered Bird Surveillance and Conservation in Macao
by Xiaoyuan Huang, Silvia Mirri and Su-Kit Tang
Data 2025, 10(6), 84; https://doi.org/10.3390/data10060084 - 30 May 2025
Abstract
Artificial intelligence (AI) currently exhibits considerable potential within the realm of biodiversity conservation. However, high-quality regionally customized datasets remain scarce, particularly within urban environments. The existing large-scale bird image datasets often lack a dedicated focus on endangered species endemic to specific geographic regions, as well as a nuanced consideration of the complex interplay between urban and natural environmental contexts. Therefore, this paper introduces Macao-ebird, a novel dataset designed to advance AI-driven recognition and conservation of endangered bird species in Macao. The dataset comprises two subsets: (1) Macao-ebird-cls, a classification dataset with 7341 images covering 24 bird species, emphasizing endangered and vulnerable species native to Macao; and (2) Macao-ebird-det, an object detection dataset generated through AI-agent-assisted labeling using grounding DETR with improved denoising anchor boxes (DINO), significantly reducing manual annotation effort while maintaining high-quality bounding-box annotations. We validate the dataset’s utility through baseline experiments with the You Only Look Once (YOLO) v8–v12 series, achieving a mean average precision (mAP50) of up to 0.984. Macao-ebird addresses critical gaps in the existing datasets by focusing on region-specific endangered species and complex urban–natural environments, providing a benchmark for AI applications in avian conservation.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)
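
The mAP50 figure quoted in the abstract above counts a predicted bounding box as correct when its intersection over union (IoU) with a ground-truth box reaches 0.5. The following is a minimal sketch of that overlap test only, not the authors' evaluation code; the function names `iou` and `is_match_at_50` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an
    assumption for the sketch, not taken from the dataset.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(predicted, truth):
    """True when a prediction overlaps the ground truth with IoU >= 0.5."""
    return iou(predicted, truth) >= 0.5
```

A full mAP50 computation additionally ranks predictions by confidence and averages precision over recall levels and classes; that part is omitted here.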

Other

16 pages, 3375 KiB  
Data Descriptor
ICA-Based Resting-State Networks Obtained on Large Autism fMRI Dataset ABIDE
by Sjir J. C. Schielen, Jesper Pilmeyer, Albert P. Aldenkamp, Danny Ruijters and Svitlana Zinger
Data 2025, 10(7), 109; https://doi.org/10.3390/data10070109 - 3 Jul 2025
Abstract
Functional magnetic resonance imaging (fMRI) has become instrumental in researching the functioning of the brain. One application of fMRI is investigating the brains of people with autism spectrum disorder (ASD). The Autism Brain Imaging Data Exchange (ABIDE) facilitates this research through its extensive data-sharing initiative. While ABIDE offers raw data and data preprocessed with various atlases, independent component analysis (ICA) for dimensionality reduction remains underutilized. ICA is a data-driven way to reduce dimensionality without prior assumptions on delineations. Additionally, ICA separates the noise from the signal, and the signal components correspond well to functional brain networks called resting-state networks (RSNs). Currently, no large, readily available dataset preprocessed with ICA exists. Here, we address this gap by presenting ABIDE’s data preprocessed to extract ICA-based resting-state networks, which are publicly available. These RSNs unveil neural activation clusters without atlas constraints, offering a perspective on ASD analyses that complements the predominantly atlas-based literature. This contribution provides a resource for further research into ASD, benchmarking between methodologies, and the development of new analytical approaches.

10 pages, 687 KiB  
Data Descriptor
A DNA Barcode Dataset for the Aquatic Fauna of the Panama Canal: Novel Resources for Detecting Faunal Change in the Neotropics
by Kristin Saltonstall, Rachel Collin, Celestino Aguilar, Fernando Alda, Laura M. Baldrich-Mora, Victor Bravo, María Fernanda Castillo, Sheril Castro, Luis F. De León, Edgardo Díaz-Ferguson, Humberto A. Garcés, Eyda Gómez, Rigoberto G. González, Maribel A. González-Torres, Hector M. Guzman, Alexandra Hiller, Roberto Ibáñez, César Jaramillo, Klara L. Kaiser, Yulang Kam, Mayra Lemus Peralta, Oscar G. Lopez, Maycol E. Madrid C., Matthew J. Miller, Natalia Ossa-Hernandez, Ruth G. Reina, D. Ross Robertson, Tania E. Romero-Gonzalez, Milton Sandoval, Oris Sanjur, Carmen Schlöder, Ashley E. Sharpe, Diana Sharpe, Jakob Siepmann, David Strasiewsky, Mark E. Torchin, Melany Tumbaco, Marta Vargas, Miryam Venegas-Anaya, Benjamin C. Victor and Gustavo Castellanos-Galindo
Data 2025, 10(7), 108; https://doi.org/10.3390/data10070108 - 2 Jul 2025
Abstract
DNA metabarcoding is a powerful biodiversity monitoring tool, enabling simultaneous assessments of diverse biological communities. However, its accuracy depends on the reliability of reference databases that assign taxonomic identities to obtained sequences. Here we provide a DNA barcode dataset for aquatic fauna of the Panama Canal, a region that connects the Western Atlantic and Eastern Pacific oceans. This unique setting creates opportunities for trans-oceanic dispersal while acting as a modern physical dispersal barrier for some terrestrial organisms. We sequenced 852 specimens from a diverse array of taxa (e.g., fishes, zooplankton, mollusks, arthropods, reptiles, birds, and mammals) using COI, and in some cases, 12S and 16S barcodes. These data were collected for a variety of studies, many of which have sought to understand recent changes in aquatic communities in the Panama Canal. The DNA barcodes presented here are all from captured specimens, which confirms their presence in Panama and, in many cases, inside the Panama Canal. Both native and introduced taxa are included. This dataset represents a valuable resource for environmental DNA (eDNA) work in the Panama Canal region and across the Neotropics aimed at monitoring ecosystem health, tracking non-native and potentially invasive species, and understanding the ecology and distribution of these freshwater and euryhaline taxa.

13 pages, 726 KiB  
Data Descriptor
A Non-Binary Approach to Super-Enhancer Identification and Clustering: A Dataset for Tumor- and Treatment-Associated Dynamics in Mouse Tissues
by Ekaterina D. Osintseva, German A. Ashniev, Alexey V. Orlov, Petr I. Nikitin, Zoia G. Zaitseva, Vladimir V. Volkov and Natalia N. Orlova
Data 2025, 10(5), 74; https://doi.org/10.3390/data10050074 - 14 May 2025
Abstract
Super-enhancers (SEs) are large clusters of highly active enhancers that play key regulatory roles in cell identity, development, and disease. While conventional methods classify SEs in a binary fashion (super-enhancer or not), this threshold-based approach can overlook significant intermediate states of enhancer activity. Here, we present a dataset and accompanying framework that facilitate a more nuanced, non-binary examination of SE activation across mouse tissue types (mammary gland, lung tissue, and NMuMG cells) and various experimental conditions (normal, tumor, and drug-treated samples). By consolidating overlapping SE intervals and capturing continuous enhancer activity metrics (e.g., ChIP-seq signal intensities), our dataset reveals gradual transitions between moderate and high enhancer activity levels that are not captured by strictly binary classification. Additionally, the data include extensive functional annotations, linking SE loci to nearby genes and enabling immediate downstream analyses such as clustering and gene ontology enrichment. The flexible approach supports broader investigations of enhancer landscapes, offering a comprehensive platform for understanding how SE activation underpins disease mechanisms, therapeutic response, and developmental processes.
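
The abstract above mentions consolidating overlapping SE intervals. One common way to do this is a sort-and-sweep merge per chromosome, sketched below. This is an illustrative assumption about the general technique, not the authors' pipeline; the function name `merge_intervals`, the `(chrom, start, end)` tuple layout, and the example coordinates are hypothetical.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping genomic intervals chromosome by chromosome.

    `intervals` is an iterable of (chrom, start, end) tuples.
    Intervals that overlap or touch (start <= previous end) are fused;
    treating touching intervals as mergeable is a choice of this sketch.
    Returns a list of merged tuples sorted by chromosome and start.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                # First interval on this chromosome opens a new region.
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlap: extend the open region if needed.
                current_end = max(current_end, end)
            else:
                # Gap: close the open region and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical coordinates, for illustration only.
example = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
consolidated = merge_intervals(example)
```

Continuous activity metrics, such as a per-interval signal intensity, would then be attached to each merged region; how that is aggregated is not specified in the abstract and is left out here.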

14 pages, 4526 KiB  
Data Descriptor
A Complementary Dataset of Scalp EEG Recordings Featuring Participants with Alzheimer’s Disease, Frontotemporal Dementia, and Healthy Controls, Obtained from Photostimulation EEG
by Aimilia Ntetska, Andreas Miltiadous, Markos G. Tsipouras, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Dimitrios G. Tsalikakis, Konstantinos Sakkas, Emmanouil D. Oikonomou, Nikolaos Grigoriadis, Pantelis Angelidis, Nikolaos Giannakeas and Alexandros T. Tzallas
Data 2025, 10(5), 64; https://doi.org/10.3390/data10050064 - 29 Apr 2025
Abstract
Research interest in the application of electroencephalogram (EEG) as a non-invasive diagnostic tool for the automated detection of neurodegenerative diseases is growing. Open-access datasets have become crucial for researchers developing such methodologies. Our previously published open-access dataset of resting-state (eyes-closed) EEG recordings from patients with Alzheimer’s disease (AD), frontotemporal dementia (FTD), and cognitively normal (CN) controls has attracted significant attention. In this paper, we present a complementary dataset consisting of eyes-open photic stimulation recordings from the same cohort. The dataset includes recordings from 88 participants (36 AD, 23 FTD, and 29 CN) and is provided in Brain Imaging Data Structure (BIDS) format, promoting consistency and ease of use across research groups. Additionally, a fully preprocessed version is included, using EEGLAB-based pipelines that involve filtering, artifact removal, and Independent Component Analysis, preparing the data for machine learning applications. This new dataset enables the study of brain responses to visual stimulation across different cognitive states and supports the development and validation of automated classification algorithms for dementia detection. It offers a valuable benchmark for both methodological comparisons and biological investigations, and it is expected to significantly contribute to the fields of neurodegenerative disease research, biomarker discovery, and EEG-based diagnostics.

7 pages, 407 KiB  
Data Descriptor
Draft Genome Sequence Data of the Ensifer sp. P24N7, a Symbiotic Bacteria Isolated from Nodules of Phaseolus vulgaris Grown in Mining Tailings from Huautla, Morelos, Mexico
by José Augusto Ramírez-Trujillo, Maria Guadalupe Castillo-Texta, Mario Ramírez-Yáñez and Ramón Suárez-Rodríguez
Data 2025, 10(3), 34; https://doi.org/10.3390/data10030034 - 27 Feb 2025
Abstract
In this work, we report the draft genome sequence of Ensifer sp. P24N7, a symbiotic nitrogen-fixing bacterium isolated from nodules of Phaseolus vulgaris var. Negro Jamapa planted in pots that contained mining tailings from Huautla, Morelos, México. The genomic DNA was sequenced by an Illumina NovaSeq 6000 using the 250 bp paired-end protocol, obtaining 1,188,899 reads. An assembly generated with SPAdes v. 3.15.4 resulted in a genome length of 7,165,722 bp composed of 181 contigs with an N50 of 323,467 bp, a coverage of 76X, and a GC content of 61.96%. The genome was annotated with the NCBI Prokaryotic Genome Annotation Pipeline and contains 6631 protein-coding sequences, 3 complete rRNAs, 52 tRNAs, and 4 non-coding RNAs. The Ensifer sp. P24N7 genome has 59 genes related to heavy metal tolerance predicted by the RAST server. These data may be useful to the scientific community because they can be used as a reference for other works related to heavy metals, including works in Huautla, Morelos.
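
The assembly statistics reported above (N50 and GC content) follow standard definitions that can be sketched in a few lines. The code below is a minimal illustration under the usual conventions, not the tooling used by the authors; the function names `n50` and `gc_content` and the toy inputs are assumptions. N50 is taken here as the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def gc_content(sequences):
    """Return the percentage of G and C bases across all sequences.

    Every character counts toward the denominator, including any
    ambiguous bases such as N; some tools exclude them, so results
    may differ slightly from reported values.
    """
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        raise ValueError("sequences are empty")
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total


# Toy inputs for illustration only; not data from the P24N7 assembly.
toy_n50 = n50([2, 3, 4, 5, 6])
toy_gc = gc_content(["GGCC", "atat"])
```

With the toy lengths, the total is 20 bp, and the two longest contigs (6 and 5) already cover at least half of it, so the N50 is 5.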

11 pages, 1926 KiB  
Data Descriptor
Minisatellite Isolation and Minisatellite Molecular Marker Development in Citrus limon (L.) Osbeck
by Oleg S. Alexandrov and Dmitry V. Romanov
Data 2025, 10(1), 2; https://doi.org/10.3390/data10010002 - 28 Dec 2024
Abstract
Minisatellites are widespread tandem DNA repeats in the genome with a monomer length of 10 to 100 bp. The high variability of minisatellite loci makes them attractive for the development of molecular markers. Minisatellites are used as markers according to three strategies: marking of digested genomic DNA with minisatellite-based probes; amplification with primers based on the sequences of the minisatellites themselves; amplification with primers designed for borders upstream and downstream of the minisatellite locus. In this study, a microsatellite dataset was obtained from the analysis of the Citrus limon (L.) Osbeck genome using Tandem Repeat Finder (TRF) and GMATA software. The minisatellite loci found were used to develop molecular markers that were tested in GMATA using electronic PCR (e-PCR). The obtained dataset includes sequences of extracted minisatellites and their characteristics (start and end nucleotide positions on the chromosome, length of monomer, number of repetitions and length of array), as well as sequences of developed primers, expected lengths of amplicons, and e-PCR results. The presented dataset can be used for the marking of lemon samples according to any of the three strategies. It provides a useful basis for lemon variety certification, identification of samples, verification of collections, lemon genome mapping, saturation of already created maps, studying of the lemon genome architecture etc.
