Open Access Article

Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data

by Jörn Lötsch 1,2,* and Alfred Ultsch 3
1 Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
2 Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Project Group Translational Medicine and Pharmacology TMP, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
3 DataBionics Research Group, University of Marburg, Hans-Meerwein-Straße, 35032 Marburg, Germany
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2020, 21(1), 79; https://doi.org/10.3390/ijms21010079
Received: 4 November 2019 / Revised: 9 December 2019 / Accepted: 16 December 2019 / Published: 20 December 2019
(This article belongs to the Collection Technical Pitfalls and Biases in Molecular Biology)
Advances in flow cytometry enable the acquisition of large, high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, ultimately, the identification of relevant subgroups. Projections from the high-dimensional space to the visualization plane are only useful if they represent the structures in the data correctly. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points from different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach correctly identified homogeneous data as such, while in data sets containing distance- or density-based subgroup structures, the number of subgroups and the assignment of data points to them were displayed correctly. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
Keywords: flow cytometry; high-dimensional data sets; computational techniques; machine-learning; data science; t-distributed stochastic neighbor embedding; emergent self-organizing maps; immunological research
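
The kind of check described in the abstract, projecting data that contain no subgroup structure by construction and then inspecting the result for apparent clusters, could be set up roughly as follows. This is a minimal sketch and not the authors' procedure: the uniform unit-cube data, the sample size, the perplexity, and the helper name `project_homogeneous_data` are all assumptions chosen for illustration, and scikit-learn's `TSNE` stands in for whichever t-SNE implementation the article used.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_homogeneous_data(n_points=200, n_dims=3, perplexity=30.0, seed=0):
    """Draw structureless uniform data and project it to 2D with t-SNE.

    All parameter values are illustrative assumptions, not settings
    taken from the article.
    """
    rng = np.random.default_rng(seed)
    # Uniform samples in the unit cube: no subgroups exist by construction.
    data = rng.uniform(0.0, 1.0, size=(n_points, n_dims))
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="random",
        random_state=seed,
    )
    # Rows of the embedding correspond to rows of the input data.
    embedding = tsne.fit_transform(data)
    return data, embedding


data, embedding = project_homogeneous_data()
print(data.shape, embedding.shape)
```

Plotting `embedding` as a scatter plot and repeating the run with different `perplexity` and `seed` values is one way to see whether apparent groupings in the projection are stable or merely products of the method, since the input here is known to be homogeneous.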

MDPI and ACS Style

Lötsch, J.; Ultsch, A. Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data. Int. J. Mol. Sci. 2020, 21, 79.
