Special Issue "Benchmarking Datasets in Bioinformatics"

A special issue of Data (ISSN 2306-5729). This special issue belongs to the section "Computational Biology, Bioinformatics, and Biomedical Data Science".

Deadline for manuscript submissions: closed (31 July 2020).

Special Issue Editor

Dr. Pufeng Du
Guest Editor
School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
Interests: big data analysis; biomedical data analysis; biomedical data management; pattern recognition; knowledge discovery

Special Issue Information

Dear Colleagues,

Over the last few years, computational prediction and identification have become important methods in modern life science and medical science. Much effort has gone into developing algorithms and computational models to identify molecular structures, functions, interactions, and evolution, and their relationships with complex disorders. To validate these methods, many benchmarking datasets have been constructed, applied, and released to the public domain. Benchmarking datasets are the basis of fair comparison and validation of computational methods, so a thorough discussion and comparison of the datasets is necessary. In this Special Issue, we aim to provide deep insights into the construction procedures and the characteristics of different benchmarking datasets for the same or similar biological topics.

We expect manuscripts that discuss different benchmarking datasets for a single bioinformatics topic or for a specific category of topics. Manuscripts can discuss and compare the construction procedures, data sources, and statistics of different datasets, as well as the computational methods that are developed and evaluated on them. There is no limit or fixed boundary to these comparisons. All kinds of discussions, comments, and comparisons are welcome. In particular, a collection of different datasets for a single topic or similar topics is welcome, as this will facilitate further development of computational methods. In general, all contributions regarding bioinformatics benchmarking datasets can be included in this Special Issue.

Dr. Pufeng Du
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Data is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Bioinformatics dataset
  • Dataset constructions
  • Dataset comparisons
  • Dataset qualities
  • Dataset comments
  • Dataset collections
  • Computational methods comparison based on datasets

Published Papers (6 papers)

Data Descriptor
First Draft Genome Assembly of the Malaysian Stingless Bee, Heterotrigona itama (Apidae, Meliponinae)
Data 2020, 5(4), 112; https://doi.org/10.3390/data5040112 - 30 Nov 2020
Viewed by 1437
Abstract
The Malaysian stingless bee industry is hugely dependent on wild colonies. Nevertheless, the availability of new queens to establish new colonies is insufficient to meet the growing demand for hives in the industry. Heterotrigona itama is primarily utilized for honey production in the region and the major source of stingless bee colonies comes from the wild. To propagate new colonies domestically, a fundamental understanding of the biology of queen development, especially from the genomics aspect, is necessary. The whole genome was sequenced using a paired-end 150 strategy on the Illumina HiSeq X platform. The shotgun sequencing generated approximately 89 million raw paired-end reads with a total output of 13.37 Gb and a GC content of 37.31%. The genome size of the species was estimated to be approximately 272 Mb. Phylogenetic analysis showed that H. itama is much more closely related to the bumble bee (Bombus spp.) than it is to the modern honey bee (Apis spp.). The genome data provided here are expected to contribute to a better understanding of the genetic aspect of queen differentiation as well as of important molecular pathways which are crucial for stingless bee biology, management and conservation.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)

Data Descriptor
Bioinformatics Analysis Identifying Key Biomarkers in Bladder Cancer
Data 2020, 5(2), 38; https://doi.org/10.3390/data5020038 - 16 Apr 2020
Cited by 1 | Viewed by 1390
Abstract
Our goal was to find new diagnostic and prognostic biomarkers in bladder cancer (BCa), and to predict molecular mechanisms and processes involved in BCa development and progression. Notably, data collection is an unavoidable and time-consuming step. Furthermore, identification of complementary results and considerable literature retrieval were required. Here, we provide detailed information on the datasets used, the study design, and the data mining. We analyzed differentially expressed genes (DEGs) in the different datasets and the most important hub genes were retrieved. We report on the meta-data information of the population, such as gender, race, tumor stage, and the expression levels of the hub genes. We include comprehensive information about the gene ontology (GO) enrichment analyses and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. We also retrieved information about the up- and down-regulation of genes. All in all, the presented datasets can be used to evaluate potential biomarkers and to predict the performance of different preclinical biomarkers in BCa.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)

Data Descriptor
Intracranial Hemorrhage Segmentation Using a Deep Convolutional Model
Data 2020, 5(1), 14; https://doi.org/10.3390/data5010014 - 01 Feb 2020
Cited by 27 | Viewed by 3557
Abstract
Traumatic brain injuries may cause intracranial hemorrhages (ICH). ICH could lead to disability or death if it is not accurately diagnosed and treated in a timely manner. The current clinical protocol to diagnose ICH is for radiologists to examine Computerized Tomography (CT) scans to detect ICH and localize its regions. However, this process relies heavily on the availability of an experienced radiologist. In this paper, we designed a study protocol to collect a dataset of 82 CT scans of subjects with a traumatic brain injury. Next, the ICH regions were manually delineated in each slice by a consensus decision of two radiologists. The dataset is publicly available online at the PhysioNet repository for future analysis and comparisons. In addition to publishing the dataset, which is the main purpose of this manuscript, we implemented a deep fully convolutional network (FCN), known as U-Net, to segment the ICH regions from the CT scans in a fully automated manner. The method as a proof of concept achieved a Dice coefficient of 0.31 for the ICH segmentation based on 5-fold cross-validation.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)
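
For readers unfamiliar with the Dice coefficient reported in the abstract above, the following is a minimal sketch of how it can be computed for two binary segmentation masks. It is not the authors' implementation: the function name `dice_coefficient`, the flat-list mask representation, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    Masks are flat sequences of 0/1 values of equal length
    (an assumed representation chosen for this sketch).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Count pixels marked positive in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of positive pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy masks (hypothetical values, not taken from the dataset).
predicted = [0, 1, 1, 0, 1, 0]
reference = [0, 1, 0, 0, 1, 1]
score = dice_coefficient(predicted, reference)
print(f"Dice = {score:.3f}")
```

A score of 1.0 means the masks overlap exactly and 0.0 means they share no positive pixels, which puts the reported proof-of-concept value of 0.31 in context.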

Data Descriptor
The Fundamental Clustering and Projection Suite (FCPS): A Dataset Collection to Test the Performance of Clustering and Data Projection Algorithms
Data 2020, 5(1), 13; https://doi.org/10.3390/data5010013 - 30 Jan 2020
Cited by 1 | Viewed by 1090
Abstract
In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS). The FCPS contains 10 datasets with the names “Atom”, “Chainlink”, “EngyTime”, “Golfball”, “Hepta”, “Lsun”, “Target”, “Tetra”, “TwoDiamonds”, and “WingNut”. Common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters in the FCPS suite. Likewise, common data projection methods could only partially reproduce the data structure correctly on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms such as lack of linear separability, different or small inner class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact. The suite is designed to address specific problems of structure discovery in high-dimensional spaces.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)

Data Descriptor
Matrix Metalloproteinases as Markers of Acute Inflammation Process in the Pulmonary Tuberculosis
Data 2019, 4(4), 137; https://doi.org/10.3390/data4040137 - 05 Oct 2019
Cited by 4 | Viewed by 1372
Abstract
The main factors of pathogenesis in pulmonary tuberculosis are not only the bacterial virulence and sensitivity of the host immune system to the pathogen, but also the degree of destruction of the lung tissue. Such destruction processes lead to the development of caverns, in most cases requiring surgical interventions besides the drug therapy. Identification of special biochemical markers that allow assessment of the necessity of surgery or therapy prolongation remains a challenge. We consider metalloproteinases as promising markers, analyzing the data obtained from patients with pulmonary tuberculosis infected by different strains of Mycobacterium tuberculosis. We argue that the presence of drug-resistant strains in lungs leading to complicated clinical prognosis could be justified not only by the difference in medians of biomarker concentrations (as determined by the Mann–Whitney test for small samples), but also by the qualitative difference in their probability distributions (as detected by the Kolmogorov–Smirnov test). Our results and the provided raw data could be used for further development of precise biochemical data-based diagnostic and prognostic tools for pulmonary tuberculosis.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)
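
The distinction the abstract above draws between a difference in medians and a difference in whole distributions can be illustrated with a small sketch. The code below computes only the two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical distribution functions), with no p-value; the function name `ks_statistic` and the sample values are hypothetical and are not taken from the paper's data.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|.

    F_a and F_b are the empirical distribution functions of the samples.
    Both are step functions, so the maximum gap is reached at an observed value.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(f_a - f_b))
    return largest_gap


# Hypothetical concentrations: same median, different spread.
group_1 = [4, 5, 5, 5, 6]
group_2 = [1, 3, 5, 7, 9]
print("medians:", median(group_1), median(group_2))
print("KS statistic:", ks_statistic(group_1, group_2))
```

In this toy case both groups have a median of 5, so a comparison of medians sees no difference, while the KS statistic is clearly above zero because the two distributions differ in shape.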

Data Descriptor
Database for Gene Variants and Metabolic Networks Implicated in Familial Gastroschisis
Data 2019, 4(3), 97; https://doi.org/10.3390/data4030097 - 11 Jul 2019
Cited by 1 | Viewed by 1191
Abstract
Gastroschisis is one of the most prevalent human birth defects concerning the ventral body wall development. Recent research has given a better understanding of gastroschisis pathogenesis through the identification of multiple novel pathogenetic pathways implicated in ventral body wall closure. Deciphering the underlying genetic factors segregating among familial gastroschisis allows better detection of novel susceptibility variants than the screening of pooled unrelated cases and controls, whereas bioinformatic-aided analysis can help to provide new insights into human biology and molecular mechanisms involved in gastroschisis. Technological advances in DNA sequencing (Next Generation Sequencing), computing power, and machine learning techniques provide opportunities to the scientific communities to assess significant gaps in research and clinical practice. Thus, in an effort to study the role of gene variation in gastroschisis, we employed whole exome sequencing in a Mexican family with recurrence for gastroschisis. Stringent bioinformatic analyses were implemented to identify and predict pathogenetic networks comprised of potential gastroschisis predispositions. This is the first database for gene variants and metabolic networks implicated in familial gastroschisis. The dataset provides information on gastroschisis annotated genes, gene variants, and metabolic networks and constitutes a useful source to enhance further investigations in gastroschisis.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)