
Statistical Inference from High Dimensional Data

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Statistical Physics".

Deadline for manuscript submissions: closed (31 December 2020) | Viewed by 54914

Printed Edition Available!
A printed edition of this Special Issue is available.

Special Issue Editor


Guest Editor: Dr. Carlos Fernandez-Lozano
Department of Computer Science, Faculty of Computer Science, University of A Coruña, CITIC, 15071 A Coruña, Spain
Interests: machine learning; feature selection; complex biological systems; cancer systems; bioinformatics; biomedical data science; computational biology

Special Issue Information

Dear Colleagues,

Continuous improvement and cost reduction in next-generation sequencing platforms are enabling a better understanding of multifactorial and complex pathologies such as cancer. This is the typical problem in which the amount of data matters and in which, in addition, the so-called curse of dimensionality occurs (the number of variables is many orders of magnitude greater than the number of cases). In this Special Issue, we welcome contributions that apply different approaches of Statistical Inference or Machine Learning to the characterization of complex pathologies using -omic data. We strongly encourage interdisciplinary works with real data (TCGA, HMP, clinicogenomic data, or related datasets) and heterogeneous data integration (clinical, genomic, proteomic, and so on).

This Special Issue solicits submissions in, but is not limited to, the following areas:

  • Applications based on statistical inference from high dimensional data;
  • Dimensionality reduction with imbalanced biological datasets;
  • Applications based on feature selection (e.g., text processing, bioinformatics, medical informatics and natural language processing);
  • Applications based on Information Theory for data integration (e.g., semantic interoperability, clustering, classification);
  • Applications based on feature selection methods using meta-heuristic search methods such as genetic algorithms, particle swarm optimization and so on;
  • Applications based on feature extraction (e.g., PCA, LDA);
  • Applications based on prior knowledge (e.g., ontologies, pathways).

Volume II: Special Issue "Statistical Inference from High Dimensional Data II"

Dr. Carlos Fernandez-Lozano
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Feature selection
  • Machine learning
  • Statistical inference
  • Dimensionality
  • Complex biological systems
  • Multifactorial diseases
  • Computational biology
  • Bioinformatics
  • Information theory
  • Large-scale data analysis
  • Data mining

Published Papers (17 papers)


Research


14 pages, 1826 KiB  
Article
Set-Wise Differential Interaction between Copy Number Alterations and Gene Expressions of Lower-Grade Glioma Reveals Prognosis-Associated Pathways
by Seong Beom Cho
Entropy 2020, 22(12), 1434; https://doi.org/10.3390/e22121434 - 18 Dec 2020
Cited by 4 | Viewed by 2107
Abstract
The integrative analysis of copy number alteration (CNA) and gene expression (GE) is an essential part of cancer research considering the impact of CNAs on cancer progression and prognosis. In this research, an integrative analysis was performed with generalized differentially coexpressed gene sets (gdCoxS), which is a modification of dCoxS. In gdCoxS, set-wise interaction is measured using the correlation of sample-wise distances with Renyi’s relative entropy, which requires an estimation of sample density based on omics profiles. To capture correlations between the variables, multivariate density estimation with covariance was applied. In the simulation study, the power of gdCoxS outperformed dCoxS that did not use the correlations in the density estimation explicitly. In the analysis of the lower-grade glioma of the cancer genome atlas program (TCGA-LGG) data, the gdCoxS identified 577 pathway CNAs and GEs pairs that showed significant changes of interaction between the survival and non-survival group, while other benchmark methods detected lower numbers of such pathways. The biological implications of the significant pathways were well consistent with previous reports of the TCGA-LGG. Taken together, the gdCoxS is a useful method for an integrative analysis of CNAs and GEs. Full article
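The abstract above bases its set-wise interaction statistic on Rényi's relative entropy between multivariate densities estimated with covariance. As a hedged illustration (not the paper's code), the closed form of the Rényi divergence between two multivariate Gaussians, a natural stand-in for such density comparisons, can be sketched as:

```python
import numpy as np

def renyi_divergence_gaussian(mu0, cov0, mu1, cov1, alpha=0.5):
    """Renyi divergence D_alpha(N0 || N1) between two multivariate Gaussians.

    Valid while (1 - alpha) * cov0 + alpha * cov1 stays positive definite;
    this is a generic closed form, not the gdCoxS estimator itself.
    """
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    cov_a = (1.0 - alpha) * cov0 + alpha * cov1
    diff = mu1 - mu0
    # Mean term: quadratic form under the alpha-mixed covariance.
    quad = 0.5 * alpha * diff @ np.linalg.solve(cov_a, diff)
    logdet = lambda m: np.linalg.slogdet(m)[1]
    # Covariance term: log-determinant ratio.
    log_term = logdet(cov_a) - (1.0 - alpha) * logdet(cov0) - alpha * logdet(cov1)
    return float(quad - log_term / (2.0 * (alpha - 1.0)))
```

For identical distributions the divergence is zero, and shifting one mean makes it strictly positive, which is the behavior the interaction statistic relies on.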
(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)

20 pages, 3755 KiB  
Article
Low Entropy Sub-Networks Prevent the Integration of Metabolomic and Transcriptomic Data
by Krzysztof Gogolewski, Marcin Kostecki and Anna Gambin
Entropy 2020, 22(11), 1238; https://doi.org/10.3390/e22111238 - 31 Oct 2020
Cited by 2 | Viewed by 2252
Abstract
The constantly and rapidly increasing amount of biological data gained from many different high-throughput experiments opens up new possibilities for data- and model-driven inference. Yet alongside these opportunities emerges a problem of risks related to data integration techniques, which are not widely taken into account. In particular, approaches based on flux balance analysis (FBA) are sensitive to the structure of the metabolic network, in which low-entropy clusters can prevent inference of the activity of the metabolic reactions. In this article, we set forth problems that may arise during the integration of metabolomic data with gene expression datasets. We analyze common pitfalls, provide possible solutions, and exemplify them with a case study of renal cell carcinoma (RCC). Using the proposed approach, we provide a metabolic description of the known morphological RCC subtypes and suggest the possible existence of a poor-prognosis cluster of patients, commonly characterized by low activity of the drug-transporting enzymes crucial in chemotherapy. This discovery fits and extends the already known poor-prognosis characteristics of RCC. Finally, the goal of this work is also to point out the problem that arises from the integration of high-throughput data with inherently nonuniform, manually curated low-throughput data: in such cases, over-represented information may overshadow non-trivial discoveries. Full article
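The notion of a "low-entropy" sub-network can be made concrete with Shannon entropy over a normalized activity distribution: a cluster in which one reaction dominates has entropy near zero, while evenly spread activity maximizes it. A minimal sketch (the function name and example weights are illustrative, not from the paper):

```python
import math

def shannon_entropy(weights):
    """Shannon entropy (bits) of a normalized non-negative weight vector,
    e.g. relative activities of the reactions inside one metabolic
    sub-network. Zero-weight entries contribute nothing."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A sub-network with activities `[1, 1, 1, 1]` has the maximal 2 bits of entropy, whereas `[8, 0, 0, 0]` has 0 bits, the degenerate case the abstract warns can block FBA-based inference.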
(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)

23 pages, 3623 KiB  
Article
Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data
by Samarendra Das and Shesh N. Rai
Entropy 2020, 22(11), 1205; https://doi.org/10.3390/e22111205 - 25 Oct 2020
Cited by 9 | Viewed by 2562
Abstract
Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most available gene selection methods are based on either a relevancy or a redundancy measure, usually adjudged through post-selection classification accuracy. Through these methods, genes are ranked on a single high-dimensional expression dataset, which leads to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, genes are selected through statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification and biologically relevant criteria based on quantitative trait loci and gene ontology. Our analytical results show that the proposed approach selects genes that are more biologically relevant, and performs better overall, than the existing competitive methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection. Full article
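The Maximum Relevance and Minimum Redundancy (mRMR) component referred to above can be sketched with a histogram-based mutual-information estimate. This is the generic greedy mRMR ranking under stated simplifications (8-bin histograms, mean redundancy), not the authors' bootstrap-based significance test:

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Histogram-based mutual information estimate (nats) for two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mrmr_rank(X, y, k):
    """Greedy max-relevance min-redundancy ranking of the columns of X."""
    n_feat = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], y) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            # Penalize features that duplicate already-selected ones.
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

A feature that nearly copies the target is ranked first, which is the filter behavior the paper's framework wraps in a statistical test.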

27 pages, 3308 KiB  
Article
FASTENER Feature Selection for Inference from Earth Observation Data
by Filip Koprivec, Klemen Kenda and Beno Šircelj
Entropy 2020, 22(11), 1198; https://doi.org/10.3390/e22111198 - 23 Oct 2020
Cited by 2 | Viewed by 2886
Abstract
In this paper, a novel feature selection algorithm for inference from high-dimensional data (FASTENER) is presented. With its multi-objective approach, the algorithm tries to maximize the accuracy of a machine learning algorithm with as few features as possible. The algorithm exploits entropy-based measures, such as mutual information in the crossover phase of the iterative genetic approach. FASTENER converges to a (near) optimal subset of features faster than other multi-objective wrapper methods, such as POSS, DT-forward and FS-SDS, and achieves better classification accuracy than similarity and information theory-based methods currently utilized in earth observation scenarios. The approach was primarily evaluated using the earth observation data set for land-cover classification from ESA’s Sentinel-2 mission, the digital elevation model and the ground truth data of the Land Parcel Identification System from Slovenia. For land cover classification, the algorithm gives state-of-the-art results. Additionally, FASTENER was tested on open feature selection data sets and compared to the state-of-the-art methods. With fewer model evaluations, the algorithm yields comparable results to DT-forward and is superior to FS-SDS. FASTENER can be used in any supervised machine learning scenario. Full article
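The abstract mentions exploiting mutual information in the crossover phase of the genetic search. One plausible reading, offered only as a sketch and not as FASTENER's actual operator, is a crossover over feature-subset bitmasks in which disagreements between parents are resolved in favor of more informative features:

```python
import random

def informed_crossover(parent_a, parent_b, mi_scores, rng=None):
    """One crossover step of a feature-subset GA: bits on which the parents
    agree are copied; disagreements keep the feature with probability
    proportional to its mutual information with the target. Illustrative
    only -- the real FASTENER operator may differ."""
    rng = rng or random.Random(0)
    max_mi = max(mi_scores) or 1.0
    child = []
    for a, b, mi in zip(parent_a, parent_b, mi_scores):
        if a == b:
            child.append(a)
        else:
            child.append(1 if rng.random() < mi / max_mi else 0)
    return child
```

Bits shared by both parents always survive, so the multi-objective search never loses consensus features during recombination.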

11 pages, 243 KiB  
Article
Approximate Learning of High Dimensional Bayesian Network Structures via Pruning of Candidate Parent Sets
by Zhigao Guo and Anthony C. Constantinou
Entropy 2020, 22(10), 1142; https://doi.org/10.3390/e22101142 - 10 Oct 2020
Cited by 6 | Viewed by 2504
Abstract
Score-based algorithms that learn Bayesian Network (BN) structures provide solutions ranging from different levels of approximate learning to exact learning. Approximate solutions exist because exact learning is generally not applicable to networks of moderate or higher complexity. In general, approximate solutions tend to sacrifice accuracy for speed, where the aim is to minimise the loss in accuracy and maximise the gain in speed. While some approximate algorithms are optimised to handle thousands of variables, these algorithms may still be unable to learn such high dimensional structures. Some of the most efficient score-based algorithms cast the structure learning problem as a combinatorial optimisation of candidate parent sets. This paper explores a strategy towards pruning the size of candidate parent sets, and which could form part of existing score-based algorithms as an additional pruning phase aimed at high dimensionality problems. The results illustrate how different levels of pruning affect the learning speed relative to the loss in accuracy in terms of model fitting, and show that aggressive pruning may be required to produce approximate solutions for high complexity problems. Full article
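As background for the pruning strategy discussed above, the standard safe rule in score-based structure learning discards any candidate parent set dominated by one of its own subsets (the paper explores more aggressive pruning on top of such rules). A minimal sketch, with higher scores meaning better local fit:

```python
from itertools import combinations

def prune_candidate_parent_sets(scored_sets):
    """Keep a candidate parent set only if no proper subset scores at least
    as well; a dominated set can never appear in an optimal network, so it
    is safe to discard.

    scored_sets: dict mapping frozenset(parent names) -> local score.
    """
    pruned = {}
    for parents, score in scored_sets.items():
        dominated = any(
            frozenset(sub) in scored_sets
            and scored_sets[frozenset(sub)] >= score
            for r in range(len(parents))          # proper subsets only
            for sub in combinations(parents, r)
        )
        if not dominated:
            pruned[parents] = score
    return pruned
```

For instance, if {A} scores -8 while {A, B} scores -9, the larger set is pruned: adding B costs fit without any gain.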

26 pages, 5775 KiB  
Article
A Blended Artificial Intelligence Approach for Spectral Classification of Stars in Massive Astronomical Surveys
by Carlos Dafonte, Alejandra Rodríguez, Minia Manteiga, Ángel Gómez and Bernardino Arcay
Entropy 2020, 22(5), 518; https://doi.org/10.3390/e22050518 - 1 May 2020
Cited by 3 | Viewed by 3149
Abstract
This paper analyzes and compares the sensitivity and suitability of several artificial intelligence techniques applied to the Morgan–Keenan (MK) system for the classification of stars. The MK system is based on a sequence of spectral prototypes that allows classifying stars according to their effective temperature and luminosity through the study of their optical stellar spectra. Here, we include the method description and the results achieved by the different intelligent models developed thus far in our ongoing stellar classification project: fuzzy knowledge-based systems, backpropagation, radial basis function (RBF) and Kohonen artificial neural networks. Since one of today’s major challenges in this area of astrophysics is the exploitation of large terrestrial and space databases, we propose a final hybrid system that integrates the best intelligent techniques, automatically collects the most important spectral features, and determines the spectral type and luminosity level of the stars according to the MK standard system. This hybrid approach truly emulates the behavior of human experts in this area, resulting in higher success rates than any of the individual implemented techniques. In the final classification system, the most suitable methods are selected for each individual spectrum, which implies a remarkable contribution to the automatic classification process. Full article

12 pages, 1548 KiB  
Article
Residue Cluster Classes: A Unified Protein Representation for Efficient Structural and Functional Classification
by Fernando Fontove and Gabriel Del Rio
Entropy 2020, 22(4), 472; https://doi.org/10.3390/e22040472 - 20 Apr 2020
Cited by 7 | Viewed by 3248
Abstract
Proteins are characterized by their structures and functions, and these two fundamental aspects of proteins are assumed to be related. To model such a relationship, a single representation to model both protein structure and function would be convenient, yet so far, the most effective models for protein structure or function classification do not rely on the same protein representation. Here we provide a computationally efficient implementation for large datasets to calculate residue cluster classes (RCCs) from protein three-dimensional structures and show that such representations enable a random forest algorithm to effectively learn the structural and functional classifications of proteins, according to the CATH and Gene Ontology criteria, respectively. RCCs are derived from residue contact maps built from different distance criteria, and we show that 7 or 8 Å with or without amino acid side-chain atoms rendered the best classification models. The potential use of a unified representation of proteins is discussed and possible future areas for improvement and exploration are presented. Full article
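RCCs are built from residue contact maps thresholded at a distance cutoff (the abstract reports 7 or 8 Å working best). The contact-map step itself can be sketched as follows, assuming one 3-D coordinate per residue (e.g. the C-alpha atom); the full RCC construction on top of this is not shown:

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary residue contact map: True where the distance between two
    residue coordinates (shape (n, 3), in angstroms) is below `cutoff`.
    Self-contacts on the diagonal are excluded."""
    coords = np.asarray(coords, float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    cmap = dist < cutoff
    np.fill_diagonal(cmap, False)
    return cmap
```

The resulting symmetric boolean matrix is the graph from which residue cluster classes would then be derived.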

16 pages, 15166 KiB  
Article
Improved Practical Vulnerability Analysis of Mouse Data According to Offensive Security based on Machine Learning in Image-Based User Authentication
by Kyungroul Lee and Sun-Young Lee
Entropy 2020, 22(3), 355; https://doi.org/10.3390/e22030355 - 18 Mar 2020
Cited by 2 | Viewed by 2901
Abstract
The objective of this study was to verify the feasibility of mouse data exposure by deriving features that improve the accuracy of a mouse data attack technique using machine learning models. To improve accuracy, we analyzed the characteristics of the mouse coordinates input by the user and defined them as features for machine learning models. As a result, we found that the distance between consecutive coordinates is concentrated in a specific range, and we verified that mouse data can be stolen more accurately when this distance is used as a feature. An accuracy of over 99% was achieved, which means that the proposed method almost completely distinguishes the mouse data input by the user from the mouse data generated by the defender. Full article
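The key feature in this abstract, the distance between consecutive mouse coordinates, is straightforward to compute; a minimal sketch (the classifier that consumes these distances is not shown):

```python
import math

def pointer_step_distances(points):
    """Euclidean distance between consecutive mouse coordinates (a list of
    (x, y) pairs). The abstract's observation is that genuine user input
    concentrates these distances in a narrow range, which distinguishes it
    from defender-generated decoy data."""
    return [math.dist(p, q) for p, q in zip(points, points[1:])]
```

Feeding these per-step distances (rather than raw coordinates) to a model is what reportedly pushed the attack's accuracy above 99%.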

13 pages, 306 KiB  
Article
A Two-Stage Mutual Information Based Bayesian Lasso Algorithm for Multi-Locus Genome-Wide Association Studies
by Hongping Guo, Zuguo Yu, Jiyuan An, Guosheng Han, Yuanlin Ma and Runbin Tang
Entropy 2020, 22(3), 329; https://doi.org/10.3390/e22030329 - 13 Mar 2020
Cited by 5 | Viewed by 3157
Abstract
Genome-wide association study (GWAS) has turned out to be an essential technology for exploring the genetic mechanism of complex traits. To reduce the complexity of computation, it is well accepted to remove unrelated single nucleotide polymorphisms (SNPs) before GWAS, e.g., by using the iterative sure independence screening expectation-maximization Bayesian Lasso (ISIS EM-BLASSO) method. In this work, a modified version of ISIS EM-BLASSO is proposed, which reduces the number of SNPs by a screening methodology based on Pearson correlation and mutual information, then estimates the effects via EM-Bayesian Lasso (EM-BLASSO), and finally detects the true quantitative trait nucleotides (QTNs) through a likelihood ratio test. We call our method two-stage mutual information based Bayesian Lasso (MBLASSO). Under three simulation scenarios, MBLASSO improves the statistical power and retains higher effect estimation accuracy compared with three other algorithms. Moreover, MBLASSO performs best in model fitting, the accuracy of detected associations is the highest, and 21 genes can be detected only by MBLASSO in Arabidopsis thaliana datasets. Full article
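The screening stage described above combines Pearson correlation with mutual information. A hedged sketch of that idea follows; the threshold `r_min`, the bin count, and `top_k` are illustrative placeholders, not the paper's values:

```python
import numpy as np

def screen_snps(X, y, r_min=0.1, top_k=100):
    """Stage-one SNP screening in the spirit of MBLASSO: drop SNP columns
    whose absolute Pearson correlation with the trait y falls below r_min,
    then keep the top_k survivors ranked by a histogram mutual-information
    estimate. Returns column indices."""
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    corr = (X_c * y_c[:, None]).sum(0) / (
        np.sqrt((X_c ** 2).sum(0) * (y_c ** 2).sum()) + 1e-12)
    survivors = np.flatnonzero(np.abs(corr) >= r_min)

    def mi(j):  # crude plug-in mutual information estimate
        joint, _, _ = np.histogram2d(X[:, j], y, bins=8)
        pxy = joint / joint.sum()
        px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
        nz = pxy > 0
        return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

    return sorted(survivors, key=mi, reverse=True)[:top_k]
```

Only the SNPs surviving both filters would then be passed to the EM-Bayesian Lasso effect estimation.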

12 pages, 839 KiB  
Article
Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
by Hang Wang and David Miller
Entropy 2020, 22(3), 326; https://doi.org/10.3390/e22030326 - 12 Mar 2020
Cited by 2 | Viewed by 2307
Abstract
In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM. Full article
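All of the PTM extensions above ultimately compare complexity-penalized fits. The plain BIC they customize has the familiar form BIC = k·ln(n) − 2·ln(L̂); a minimal sketch (the paper's customized coding-cost terms are not reproduced here):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Plain Bayesian information criterion (lower is better): penalizes
    the maximized log-likelihood by n_params * ln(n_obs). The PTM work
    customizes this trade-off with topic-model-specific coding costs."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```

The point of the sparsity-increasing modification is visible even here: a model with 50 parameters and log-likelihood -1000 beats one with 80 parameters and log-likelihood -990 on 10,000 observations, because the fit gain does not pay for the extra parameters.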

14 pages, 496 KiB  
Article
Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method
by Yuanlin Ma, Zuguo Yu, Runbin Tang, Xianhua Xie, Guosheng Han and Vo V. Anh
Entropy 2020, 22(2), 255; https://doi.org/10.3390/e22020255 - 23 Feb 2020
Cited by 12 | Viewed by 3134
Abstract
HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson–Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis. Full article
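The core PWkmer idea, combining k-mer counts with positional information, can be sketched in a few lines. This simplified reading records each k-mer's count together with its mean start position; the paper's actual weighting and distance metric are not reproduced:

```python
from collections import defaultdict

def position_weighted_kmers(seq, k=3):
    """For each k-mer in a genome sequence, return (count, mean 1-based
    start position). Two genomes with identical k-mer counts but shifted
    k-mer placements then get different signatures -- a simplified reading
    of the PWkmer combination of counts and position distributions."""
    counts = defaultdict(int)
    pos_sum = defaultdict(int)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] += 1
        pos_sum[kmer] += i + 1
    return {km: (counts[km], pos_sum[km] / counts[km]) for km in counts}
```

Such signatures, computed alignment-free over complete genomes, are what make the method fast for whole-genome phylogenetics.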

18 pages, 2552 KiB  
Article
Sub-Graph Regularization on Kernel Regression for Robust Semi-Supervised Dimensionality Reduction
by Jiao Liu, Mingbo Zhao and Weijian Kong
Entropy 2019, 21(11), 1125; https://doi.org/10.3390/e21111125 - 15 Nov 2019
Cited by 2 | Viewed by 2450
Abstract
Dimensionality reduction has always been a major problem in handling huge-dimensionality datasets. Owing to their use of labeled data, supervised dimensionality reduction methods such as Linear Discriminant Analysis tend to achieve better classification performance than unsupervised methods. However, supervised methods need sufficient labeled data to achieve satisfying results, so semi-supervised learning (SSL) methods can be a practical choice when labeled data are scarce. In this paper, we develop a novel SSL method by extending anchor graph regularization (AGR) for dimensionality reduction. In detail, AGR is an accelerated semi-supervised learning method that propagates class labels to unlabeled data; however, it cannot handle new incoming samples. We therefore improve AGR by adding kernel regression to the basic objective function of AGR. The proposed method can thus not only estimate the class labels of unlabeled data but also achieve dimensionality reduction. Extensive simulations on several benchmark datasets are conducted, and the simulation results verify the effectiveness of the proposed work. Full article

18 pages, 3745 KiB  
Article
Radiomics Analysis on Contrast-Enhanced Spectral Mammography Images for Breast Cancer Diagnosis: A Pilot Study
by Liliana Losurdo, Annarita Fanizzi, Teresa Maria A. Basile, Roberto Bellotti, Ubaldo Bottigli, Rosalba Dentamaro, Vittorio Didonna, Vito Lorusso, Raffaella Massafra, Pasquale Tamborra, Alberto Tagliafico, Sabina Tangaro and Daniele La Forgia
Entropy 2019, 21(11), 1110; https://doi.org/10.3390/e21111110 - 13 Nov 2019
Cited by 46 | Viewed by 4275
Abstract
Contrast-enhanced spectral mammography (CESM) is one of the latest diagnostic tools for breast care; therefore, the literature offers little radiomics image analysis to drive the development of automatic diagnostic support systems. In this work, we propose a preliminary exploratory analysis to evaluate the impact of different sets of textural features on the discrimination of benign and malignant breast lesions. The analysis is performed on 55 ROIs extracted from 51 patients referred to Istituto Tumori “Giovanni Paolo II” of Bari (Italy) from the breast cancer screening phase between March 2017 and June 2018. We extracted feature sets by calculating statistical measures on the original ROIs, gradiented images, Haar decompositions of the same original ROIs, and the gray-level co-occurrence matrices (GLCMs) of each sub-ROI obtained by the Haar transform. First, we evaluated the overall impact of each feature set on the diagnosis through a principal component analysis by training a support vector machine classifier. Then, to identify a subset of each feature set with higher diagnostic power, we developed a feature importance analysis by means of wrapper and embedded methods. Finally, we trained an SVM classifier on each subset of previously selected features to compare their classification performance with that of the overall set. We found a subset of significant features extracted from the original ROIs with a diagnostic accuracy greater than 80%. The features extracted from each sub-ROI decomposed by two levels of the Haar transform were predictive only when they were all used without any selection, reaching a best mean accuracy of about 80%. Moreover, most of the significant features calculated from Haar decompositions and their GLCMs were extracted from recombined CESM images. Our pilot study suggests that textural features can provide complementary information for the characterization of breast lesions. In particular, we found a subset of significant features extracted from the original ROIs, gradiented ROI images, and GLCMs calculated from each sub-ROI previously decomposed by the Haar transform. Full article
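The GLCM texture features referred to above start from a co-occurrence count over a fixed pixel offset. A minimal sketch for one offset follows (statistics such as contrast or homogeneity, which the radiomics pipeline would compute on top of this matrix, are omitted):

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Gray-level co-occurrence matrix for a single pixel offset (dx, dy).
    `image` is a 2-D array of integer gray levels in [0, levels); entry
    (i, j) counts how often level i is followed by level j at the offset."""
    img = np.asarray(image)
    mat = np.zeros((levels, levels), dtype=int)
    h, w = img.shape
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            mat[img[y, x], img[y + dy, x + dx]] += 1
    return mat
```

In a radiomics setting, such matrices are built per sub-ROI (here, the Haar-decomposed sub-ROIs) and over several offsets, then summarized into textural features.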

14 pages, 2604 KiB  
Article
Identify Risk Pattern of E-Bike Riders in China Based on Machine Learning Framework
by Chen Wang, Siyuan Kou and Yanchao Song
Entropy 2019, 21(11), 1084; https://doi.org/10.3390/e21111084 - 6 Nov 2019
Cited by 7 | Viewed by 2976
Abstract
In this paper, the risk pattern of e-bike riders in China was examined based on tree-structured machine learning techniques. Three years of crash/violation data were acquired from the Kunshan traffic police department, China. Firstly, high-risk (HR) electric bicycle (e-bike) riders were defined as those with at-fault crash involvement, while the others (i.e., non-at-fault or without crash involvement) were considered non-high-risk (NHR) riders, based on quasi-induced exposure theory. Then, demographic and previous violation-related features were derived for each e-bike rider from the crash/violation records. After that, a systematic machine learning (ML) framework was proposed to capture the complex risk patterns of those e-bike riders. An ensemble sampling method was selected to deal with the imbalanced datasets. Four tree-structured machine learning methods were compared, and the gradient boosting decision tree (GBDT) performed best. Feature importance and partial dependence were further examined. Interesting findings include the following: (1) tree-structured ML models are able to capture complex risk patterns and interpret them properly; (2) spatial-temporal violation features were found to be important indicators of high-risk e-bike riders; and (3) violation behavior features appeared to be more effective than violation punishment-related features in identifying high-risk e-bike riders. In general, the proposed ML framework is able to identify the complex crash risk pattern of e-bike riders. This paper provides useful insights for policy-makers and traffic practitioners regarding e-bike safety improvement in China. Full article
(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)
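The ensemble sampling step for the imbalanced HR/NHR data can be illustrated as follows; the dataset sizes, the number of base learners, and the feature values are synthetic assumptions, not the paper's data:

```python
import numpy as np

# Ensemble under-sampling: every base learner sees all minority (HR) samples
# plus a fresh random subset of the majority (NHR) class of equal size.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))            # 200 riders x 5 features (synthetic)
y = np.array([1] * 20 + [0] * 180)       # 20 high-risk vs 180 non-high-risk

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

subsets = []
for _ in range(10):                      # 10 balanced training subsets
    sampled = rng.choice(majority, size=len(minority), replace=False)
    subsets.append(np.concatenate([minority, sampled]))

# Each balanced subset would train one tree-structured base learner (e.g., a
# GBDT); the ensemble then averages the learners' predicted probabilities.
print([len(s) for s in subsets][:3])     # [40, 40, 40]
```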

Review


15 pages, 1188 KiB  
Review
Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data
by Malik Yousef, Abhishek Kumar and Burcu Bakir-Gungor
Entropy 2021, 23(1), 2; https://doi.org/10.3390/e23010002 - 22 Dec 2020
Cited by 41 | Viewed by 5228
Abstract
In the last two decades, massive advancements in high-throughput technologies have resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This task corresponds to a well-studied problem in the machine learning domain: feature selection. In biological data analysis, most computational feature selection methodologies were taken from other fields without considering the nature of the biological data, so integrative approaches that utilize biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to shed light on disease-state dynamics and the mechanisms of disease onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to encourage the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions. Full article
(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)
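The integrative gene selection idea described above (merging a statistical score from the expression data with external biological knowledge into one ranked list) can be sketched with toy numbers; the gene names, scores, and the 0.5 weight are illustrative assumptions, not taken from any specific method in the review:

```python
# Combine a per-gene statistical score (e.g., a differential expression
# statistic) with a biological relevance score (e.g., pathway membership).
stat_score = {"TP53": 4.2, "BRCA1": 3.1, "GAPDH": 0.4, "EGFR": 2.7}
bio_score = {"TP53": 0.9, "BRCA1": 0.8, "GAPDH": 0.1, "EGFR": 0.6}

w = 0.5                                   # weight between the two sources
combined = {g: w * stat_score[g] + (1 - w) * bio_score[g] for g in stat_score}
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)  # ['TP53', 'BRCA1', 'EGFR', 'GAPDH']
```

Real integrative methods differ mainly in where the two scores come from and how they are merged, but they share this ranked-list structure.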

16 pages, 219 KiB  
Review
Models of the Gene Must Inform Data-Mining Strategies in Genomics
by Łukasz Huminiecki
Entropy 2020, 22(9), 942; https://doi.org/10.3390/e22090942 - 27 Aug 2020
Cited by 5 | Viewed by 2155
Abstract
The gene is a fundamental concept of genetics, which emerged with the Mendelian paradigm of heredity at the beginning of the 20th century. However, the concept has since diversified: somewhat different narratives and models of the gene developed in several sub-disciplines of genetics, that is, in classical genetics, population genetics, molecular genetics, genomics, and, recently, also in systems genetics. Here, I ask how the diversity of the concept impacts data-integration and data-mining strategies for bioinformatics, genomics, statistical genetics, and data science. I also consider the theoretical background of the concept of the gene in the ideas of empiricism and experimentalism, as well as reductionist and anti-reductionist narratives on the concept. Finally, a few analysis strategies from published data-mining projects are discussed, and the examples are re-interpreted in the light of the theoretical material. I argue that the choice of an optimal level of abstraction for the gene is vital for a successful genome analysis. Full article
(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)
23 pages, 1996 KiB  
Review
Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges
by Samarendra Das, Craig J. McClain and Shesh N. Rai
Entropy 2020, 22(4), 427; https://doi.org/10.3390/e22040427 - 10 Apr 2020
Cited by 31 | Viewed by 5631
Abstract
Over the last decade, gene set analysis has become the first choice for gaining insights into the underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome-wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview of the statistical structure and steps of gene set analysis approaches used for microarray, RNA-sequencing, and genome-wide association data analysis. Further, we classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model, and nature of the test statistic. Rather than reviewing the gene set analysis approaches individually, we describe the generation-wise evolution of such approaches for microarrays, RNA-sequencing, and genome-wide association studies and discuss their relative merits and limitations. We also identify the key biological and statistical challenges in current gene set analysis, which will need to be addressed collectively by statisticians and biologists in order to develop the next generation of gene set analysis approaches. Finally, this study will serve as a catalog and provide guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors. Full article
(This article belongs to the Special Issue Statistical Inference from High Dimensional Data)
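A minimal, self-contained example of the over-representation test at the root of many first-generation gene set analysis approaches is the hypergeometric tail probability; the gene counts below are illustrative, not taken from the review:

```python
from math import comb

# P(observing k or more gene-set members among n differentially expressed
# genes), for a genome of N genes containing a set of K genes.
N, K = 20000, 100      # genes in the genome; genes in the set
n, k = 500, 10         # differentially expressed genes; overlap with the set

p_value = sum(comb(K, i) * comb(N - K, n - i)
              for i in range(k, min(n, K) + 1)) / comb(N, n)
print(p_value < 0.05)  # True: an overlap of 10 vs. ~2.5 expected is significant
```

Later generations of approaches replace this simple cutoff-based test with whole-distribution statistics (e.g., ranking-based enrichment scores), but the null-hypothesis structure is the classification axis the review uses.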
