Entropy

Research

17 pages, 478 KB

Open Access Article

A Bayesian Model for Paired Data in Genome-Wide Association Studies with Application to Breast Cancer

by Yashi Bu, Min Chen, Zhenyu Xuan and Xinlei Wang

Entropy 2025, 27(10), 1077; https://doi.org/10.3390/e27101077 - 18 Oct 2025

Viewed by 547

Complex human diseases, including cancer, are linked to genetic factors. Genome-wide association studies (GWASs) are powerful for identifying genetic variants associated with cancer but are limited by their reliance on case–control data. We propose approaches to expanding GWASs by using tumor and paired normal tissues to investigate somatic mutations. We apply penalized maximum likelihood estimation for single-marker analysis and develop a Bayesian hierarchical model to integrate multiple markers, identifying SNP sets grouped by genes or pathways and improving detection of moderate-effect SNPs. Applied to breast cancer data from The Cancer Genome Atlas (TCGA), both single- and multiple-marker analyses identify associated genes, with multiple-marker analysis providing results more consistent with external resources. The Bayesian model significantly increases the chance of new discoveries.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

23 pages, 723 KB

Open Access Article

Multivariate Modeling of Some Datasets in Continuous Space and Discrete Time

by Rigele Te and Juan Du

Entropy 2025, 27(8), 837; https://doi.org/10.3390/e27080837 - 6 Aug 2025

Viewed by 829

Abstract

Multivariate space–time datasets are often collected at discrete, regularly monitored time intervals and are typically treated as components of time series in environmental science and other applied fields. To effectively characterize such data in geostatistical frameworks, valid and practical covariance models are essential. In this work, we propose several classes of multivariate spatio-temporal covariance matrix functions to model underlying stochastic processes whose discrete temporal margins correspond to well-known autoregressive and moving average (ARMA) models. We derive sufficient and/or necessary conditions under which these functions yield valid covariance matrices. By leveraging established methodologies from time series analysis and spatial statistics, the proposed models are straightforward to identify and fit in practice. Finally, we demonstrate the utility of these multivariate covariance functions through an application to Kansas weather data, using co-kriging for prediction and comparing the results to those obtained from traditional spatio-temporal models.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

19 pages, 380 KB

Open Access Article

Bootstrap Confidence Intervals for Multiple Change Points Based on Two-Stage Procedures

by Li Hou, Baisuo Jin, Yuehua Wu and Fangwei Wang

Entropy 2025, 27(5), 537; https://doi.org/10.3390/e27050537 - 17 May 2025

Viewed by 1959

Abstract

This paper investigates the construction of confidence intervals for multiple change points in linear regression models. First, we detect multiple change points by performing variable selection on blocks of the input sequence; second, we re-estimate their exact locations in a refinement step. Specifically, we exploit an orthogonal greedy algorithm to recover the number of change points consistently in the cutting stage, and employ the sup-Wald-type test statistic to determine the locations of multiple change points in the refinement stage. Based on a two-stage procedure, we propose bootstrapping the estimated centered error sequence, which can accommodate unknown magnitudes of changes and ensure the asymptotic validity of the proposed bootstrapping method. This enables us to construct confidence intervals using the empirical distribution of the resampled data. The proposed method is illustrated with simulations and real data examples.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

22 pages, 7837 KB

Open Access Article

Online Monitoring and Fault Diagnosis for High-Dimensional Stream with Application in Electron Probe X-Ray Microanalysis

by Tao Wang, Yunfei Guo, Fubo Zhu and Zhonghua Li

Entropy 2025, 27(3), 297; https://doi.org/10.3390/e27030297 - 13 Mar 2025

Viewed by 1190

Abstract

This study introduces an innovative two-stage framework for monitoring and diagnosing high-dimensional data streams with sparse changes. The first stage utilizes an exponentially weighted moving average (EWMA) statistic for online monitoring, identifying change points through extreme value theory and multiple hypothesis testing. The second stage involves a fault diagnosis mechanism that accurately pinpoints abnormal components upon detecting anomalies. Through extensive numerical simulations and electron probe X-ray microanalysis applications, the method demonstrates exceptional performance. It rapidly detects anomalies, often within one or two sampling intervals post-change, achieves near 100% detection power, and maintains type-I error rates around the nominal 5%. The fault diagnosis mechanism shows a 99.1% accuracy in identifying components in 200-dimensional anomaly streams, surpassing principal component analysis (PCA)-based methods by 28.0% in precision and controlling the false discovery rate within 3%. Case analyses confirm the method’s effectiveness in monitoring and identifying abnormal data, aligning with previous studies. These findings represent significant progress in managing high-dimensional sparse-change data streams over existing methods.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)
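
For readers unfamiliar with the EWMA statistic named in the abstract above, the following is a minimal sketch of the standard univariate recursion z_t = lam * x_t + (1 - lam) * z_{t-1}. It is not the authors' procedure: the function names `ewma` and `first_alarm`, the smoothing weight `lam`, and the fixed alarm `limit` are all illustrative assumptions, and the paper's thresholds based on extreme value theory and multiple hypothesis testing are not reproduced here.

```python
from typing import List, Optional, Sequence


def ewma(values: Sequence[float], lam: float = 0.2, start: float = 0.0) -> List[float]:
    """Return the EWMA sequence z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = start."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    smoothed = []
    z = start
    for x in values:
        # Blend the new observation with the running average.
        z = lam * x + (1.0 - lam) * z
        smoothed.append(z)
    return smoothed


def first_alarm(values: Sequence[float], lam: float = 0.2, limit: float = 1.0) -> Optional[int]:
    """Index of the first observation whose |EWMA| exceeds `limit`, or None.

    `limit` is a hypothetical fixed control limit chosen for illustration only.
    """
    for t, z in enumerate(ewma(values, lam)):
        if abs(z) > limit:
            return t
    return None
```

A larger `lam` reacts faster to a shift but smooths less; in a high-dimensional stream, one such statistic would typically be tracked per component.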

28 pages, 413 KB

Open Access Article

Penalized Exponentially Tilted Likelihood for Growing Dimensional Models with Missing Data

by Xiaoming Sha, Puying Zhao and Niansheng Tang

Entropy 2025, 27(2), 146; https://doi.org/10.3390/e27020146 - 1 Feb 2025

Viewed by 1102

Abstract

This paper develops a penalized exponentially tilted (ET) likelihood to simultaneously estimate unknown parameters and select variables for growing dimensional models with responses missing at random. The inverse probability weighted approach is employed to compensate for missing information and to ensure the consistency of parameter estimators. Based on the penalized ET likelihood, we construct an ET likelihood ratio statistic to test the contrast hypothesis of parameters. Under some mild conditions, we obtain the consistency, asymptotic properties, and oracle properties of parameter estimators and show that the constrained penalized ET likelihood ratio statistic for testing the contrast hypothesis possesses the Wilks’ property. Simulation studies are conducted to validate the finite sample performance of the proposed methodologies. Thyroid data taken from the First People’s Hospital of Yunnan Province are used to illustrate the proposed methodologies.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

31 pages, 13420 KB

Open Access Article

Subspace Learning for Dual High-Order Graph Learning Based on Boolean Weight

by Yilong Wei, Jinlin Ma, Ziping Ma and Yulei Huang

Entropy 2025, 27(2), 107; https://doi.org/10.3390/e27020107 - 22 Jan 2025

Cited by 1 | Viewed by 1468

Abstract

Subspace learning has achieved promising performance as a key technique for unsupervised feature selection. The strength of subspace learning lies in its ability to identify a representative subspace encompassing a cluster of features that are capable of effectively approximating the space of the original features. Nonetheless, most existing unsupervised feature selection methods based on subspace learning are constrained by two primary challenges. (1) Many methods focus predominantly on the relationships between samples in the data space but ignore the correlated information between features in the feature space, which makes them unreliable for exploiting the intrinsic spatial structure. (2) Graph-based methods typically consider only first-order neighborhood structures, neglecting the high-order neighborhood structures inherent in the original data, and thereby fail to accurately preserve the local geometric characteristics of the data. To fill this gap, we take dual high-order graph learning into account and propose a framework called subspace learning for dual high-order graph learning based on Boolean weight (DHBWSL). First, a framework for unsupervised feature selection based on subspace learning is proposed, which is extended by dual-graph regularization to fully investigate geometric structure information on dual spaces. Second, the dual high-order graph is designed by embedding Boolean weights to learn a more extensive node from the original space such that the appropriate high-order adjacency matrix can be selected adaptively and flexibly. Experimental results on 12 public datasets demonstrate that the proposed DHBWSL outperforms nine recent state-of-the-art algorithms.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

24 pages, 74482 KB

Open Access Article

Bayesian Regression Analysis for Dependent Data with an Elliptical Shape

by Yian Yu, Long Tang, Kang Ren, Zhonglue Chen, Shengdi Chen and Jianqing Shi

Entropy 2024, 26(12), 1072; https://doi.org/10.3390/e26121072 - 9 Dec 2024

Viewed by 1827

Abstract

This paper proposes a parametric hierarchical model for functional data with an elliptical shape, using a Gaussian process prior to capture the data dependencies that reflect systematic errors while modeling the underlying curved shape through a von Mises–Fisher distribution. The model definition, Bayesian inference, and MCMC algorithm are discussed. The effectiveness of the model is demonstrated through the reconstruction of curved trajectories using both simulated and real-world examples. The discussion in this paper focuses on two-dimensional problems, but the framework can be extended to higher-dimensional spaces, making it adaptable to a wide range of applications.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

17 pages, 1232 KB

Open Access Article

Optimizing Prognostic Predictions in Liver Cancer with Machine Learning and Survival Analysis

by Kaida Cai, Wenzhi Fu, Zhengyan Wang, Xiaofang Yang, Hanwen Liu and Ziyang Ji

Entropy 2024, 26(9), 767; https://doi.org/10.3390/e26090767 - 7 Sep 2024

Cited by 3 | Viewed by 2498

Abstract

This study harnesses RNA sequencing data from The Cancer Genome Atlas to unearth pivotal genetic markers linked to the progression of liver hepatocellular carcinoma (LIHC), a major contributor to cancer-related deaths worldwide, characterized by a dire prognosis and limited treatment avenues. We employ advanced feature selection techniques, including sure independence screening (SIS) combined with the least absolute shrinkage and selection operator (Lasso), smoothly clipped absolute deviation (SCAD), information gain (IG), and permutation variable importance (VIMP) methods, to effectively navigate the challenges posed by ultra-high-dimensional data. Through these methods, we identify critical genes like MED8 as significant markers for LIHC. These markers are further analyzed using advanced survival analysis models, including the Cox proportional hazards model, survival tree, and random survival forests. Our findings reveal that SIS-Lasso demonstrates strong predictive accuracy, particularly in combination with the Cox proportional hazards model. However, when coupled with the random survival forests method, the SIS-VIMP approach achieves the highest overall performance. This comprehensive approach not only enhances the prediction of LIHC outcomes but also provides valuable insights into the genetic mechanisms underlying the disease, thereby paving the way for personalized treatment strategies and advancing the field of cancer genomics.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

21 pages, 3074 KB

Open Access Article

Tail Risk Dynamics under Price-Limited Constraint: A Censored Autoregressive Conditional Fréchet Model

by Tao Xu, Lei Shu and Yu Chen

Entropy 2024, 26(7), 555; https://doi.org/10.3390/e26070555 - 28 Jun 2024

Viewed by 1544

Abstract

This paper proposes a novel censored autoregressive conditional Fréchet (CAcF) model with a flexible evolution scheme for the time-varying parameters, which allows deciphering tail risk dynamics constrained by price limits from the viewpoints of different risk preferences. The proposed model can well accommodate many important empirical characteristics of financial data, such as heavy-tailedness, volatility clustering, extreme event clustering, and price limits. We then investigate tail risk dynamics via the CAcF model in the price-limited stock markets, taking entropic value at risk (EVaR) as a risk measurement. Our findings suggest that tail risk will be seriously underestimated in price-limited stock markets when the censored property of limit prices is ignored. Additionally, the evidence from the Chinese Taiwan stock market shows that widening price limits would lead to a decrease in the incidence of extreme events (hitting limit-down) but a significant increase in tail risk. Moreover, we find that investors with different risk preferences may make opposing decisions about an extreme event. In summary, the empirical results reveal the effectiveness of our model in interpreting and predicting time-varying tail behaviors in price-limited stock markets, providing a new tool for financial risk management.

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data: Second Edition)

Published Papers (9 papers)
