Preface to the Special Issue on “Statistical Analysis and Data Science for Complex Data”

Chen, Li-Pang

doi:10.3390/math13162646


Department of Statistics, National Chengchi University, Taipei City 116, Taiwan

Mathematics 2025, 13(16), 2646; https://doi.org/10.3390/math13162646

Submission received: 28 July 2025 / Accepted: 14 August 2025 / Published: 18 August 2025

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)


MSC: 62Nxx; 62N02

Data science has become a prominent field in recent decades, closely intertwined with modern statistical analysis. Its primary objective is to tackle challenges posed by noisy data and complex model structures, such as high dimensionality, measurement error, censoring, and genetic data. To reflect recent advancements in the field, this Special Issue, entitled “Statistical Analysis and Data Science for Complex Data”, presents a collection of high-quality research papers that introduce novel statistical methods for addressing these challenges across various domains. Fourteen papers authored by researchers from diverse countries and regions are featured, including Taiwan, the U.S.A., Canada, Japan, and China, showcasing both topical breadth and geographic diversity.

The first paper, Contribution 1, introduces a fused lasso approach for data-adaptive testing in multivariate settings. A key advantage of the proposed method is its ability to account for the effects of adjacent variants. Numerical results demonstrate its superior performance compared to several existing tests.

The authors of Contribution 2 address the analysis of survival data involving two potentially dependent groups. Their primary goal is to compare the survival distributions between these groups while accounting for their dependence. To this end, they propose a copula-based method to estimate the Mann–Whitney effect under parametric survival models. Additionally, they develop a Shiny web application to facilitate the practical implementation of their approach.
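For readers unfamiliar with the estimand, the Mann–Whitney effect can be written down in one line. The notation here is an assumption made for illustration ($T_1$ and $T_2$ denote the survival times in the two groups), and the direction of the inequality varies between authors, so this is one common convention rather than necessarily the one used in Contribution 2:

```latex
p \;=\; P(T_1 < T_2) \;+\; \tfrac{1}{2}\, P(T_1 = T_2)
```

Under this convention, $p = 1/2$ indicates that neither group tends to survive longer than the other, and values further from $1/2$ indicate a stronger ordering between the two groups.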

In medical research, the receiver operating characteristic (ROC) curve is a widely used statistical tool for distinguishing between individuals with and without a disease. Two common summary measures for evaluating a biomarker’s diagnostic accuracy are the area under the ROC curve (AUC) and the Youden index (J). To provide a more comprehensive understanding of the ROC curve’s characteristics, the authors of Contribution 3 propose a semiparametric density ratio model that links the biomarker distributions for healthy and diseased populations, enabling simultaneous inference on both AUC and J. They also establish several theoretical properties, including the joint asymptotic normality of the maximum empirical likelihood estimator of (AUC, J) and an asymptotically valid confidence region for these indices.
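To make the two summary measures concrete, the following minimal sketch computes plain empirical (nonparametric) versions of AUC and J from two samples. It is not the density ratio method of Contribution 3; the function names, the toy data, and the convention that larger biomarker values indicate disease are all assumptions made for illustration.

```python
def empirical_auc(healthy, diseased):
    """Empirical AUC: P(X < Y) + 0.5 * P(X = Y), X healthy, Y diseased."""
    score = 0.0
    for x in healthy:
        for y in diseased:
            if y > x:
                score += 1.0
            elif y == x:
                score += 0.5
    return score / (len(healthy) * len(diseased))


def empirical_youden(healthy, diseased):
    """Return (J, cutoff) maximising sensitivity + specificity - 1.

    A subject is classified as diseased when the biomarker exceeds the cutoff.
    Only observed values are tried as cutoffs, which is enough to find the
    maximum of the difference between the two empirical distribution functions.
    """
    best_j, best_cutoff = -1.0, None
    for c in sorted(set(healthy) | set(diseased)):
        sensitivity = sum(y > c for y in diseased) / len(diseased)
        specificity = sum(x <= c for x in healthy) / len(healthy)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best_cutoff = j, c
    return best_j, best_cutoff


# Hypothetical toy biomarker values, not taken from any of the papers.
healthy = [1.0, 2.0, 3.0, 4.0]
diseased = [3.0, 5.0, 6.0, 7.0]
auc = empirical_auc(healthy, diseased)
youden, cutoff = empirical_youden(healthy, diseased)
print(auc, youden, cutoff)
```

The two indices answer different questions: AUC summarises discrimination over all cutoffs, whereas J reports the performance at the single best cutoff, which is why joint inference on the pair can be informative.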

The authors of Contribution 4 investigate interactions among multiple biomarkers, such as gene–environment interactions, which are of significant importance in bioinformatics. A central challenge in their study lies in model adaptation errors caused by imbalanced data types. To address this, the authors employ the SMOTE-Tomek procedure to correct for data imbalance. Using datasets from The Cancer Genome Atlas on lung adenocarcinoma and breast invasive carcinoma, they demonstrate that the SMOTE-Tomek approach improves prediction performance over both untreated data and data processed with SMOTE alone, across various imbalance ratios. Furthermore, the study identifies biomarkers associated with gene–environment interactions in cancer and reports their estimated odds ratios, offering valuable insights from the data analysis.

Estimating treatment or exposure effects is a central topic in causal inference. When multiple data sources are available, it is natural to consider integrating information across studies to improve estimation. However, a major challenge lies in the heterogeneity and incompleteness of covariate sets across datasets. To address this issue, the authors of Contribution 5 propose a generalized meta-analysis approach that integrates summary statistics, such as regression coefficients, from outcome and treatment models across studies with differing covariate structures. They establish the asymptotic distribution of the proposed integrated estimator, and simulation studies confirm the method’s validity and effectiveness.

The authors of Contribution 6 investigate the joint analysis of interval-censored data and panel count data, two commonly encountered forms of incomplete data in survival studies. To accommodate both data types, they propose a novel semiparametric joint regression model, where the failure time follows an additive–multiplicative hazards model and recurrent events are modeled using a nonhomogeneous Poisson process. They develop a sieve maximum likelihood estimation procedure based on Bernstein polynomials to estimate the model parameters efficiently.

The authors of Contribution 7 propose a nonparametric framework for estimating instrumental variable treatment effects. To address the curse of dimensionality inherent in high-dimensional data, they introduce two types of sufficient dimension reduction techniques: the partial central subspace for the outcome and treatment variables and the central subspace for the instrumental variable. They further demonstrate that combining these two approaches allows the estimator to attain the semiparametric efficiency bound for the marginal version of local average treatment effects, underscoring the framework’s potential to improve both efficiency and robustness in nonparametric causal inference.

Contribution 8 was motivated by the global COVID-19 pandemic, which posed a significant public health challenge in recent years. The authors’ primary goal is to efficiently monitor the number of newly confirmed daily cases. To this end, they propose using exponentially weighted moving average (EWMA) control charts integrated with two time series models: the autoregressive integrated moving average (ARIMA) model and the vector autoregressive moving average (VARMA) model. Their data analysis demonstrates that the proposed method can detect early signals of disease outbreaks more effectively than conventional control charts.
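The EWMA idea can be sketched in a few lines. The example below applies a textbook EWMA chart with time-varying control limits to a residual series, which is assumed to come from an already fitted time series model; the model fitting itself is not shown. The function name, the default values of the smoothing weight `lam` and limit width `L`, and the toy residuals are illustrative assumptions, not the settings used in Contribution 8.

```python
import math


def ewma_signals(residuals, lam=0.2, L=3.0, mu0=0.0, sigma=1.0):
    """Return the 1-based time points at which the EWMA statistic signals.

    z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = mu0.
    Control limits: mu0 +/- L * sigma * sqrt(lam / (2 - lam) * (1 - (1 - lam)^(2t))).
    """
    z = mu0
    signals = []
    for t, x in enumerate(residuals, start=1):
        z = lam * x + (1.0 - lam) * z
        variance_factor = lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        half_width = L * sigma * math.sqrt(variance_factor)
        if abs(z - mu0) > half_width:
            signals.append(t)
    return signals


# Hypothetical standardised residuals: in control for 5 days, then a sustained shift.
residuals = [0.0] * 5 + [3.0] * 5
print(ewma_signals(residuals))
```

Because the statistic accumulates information over time, a smaller `lam` makes the chart more sensitive to small sustained shifts at the cost of a slower reaction to abrupt ones.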

The authors of Contribution 9 investigate nonlinear time series models in the presence of missing data. To address the challenges posed by such complex structures and to improve prediction accuracy, they extend the LightGBM algorithm to model nonlinear interactions and coupling relationships among variables. Their approach reconstructs variables with missing values, restoring data completeness and enhancing the accuracy and reliability of time series forecasting by effectively capturing temporal dependencies.

Contribution 10 focuses on the estimation of both time-dependent and time-independent ROC curves under a cured survival model with covariate measurement error. To correct for the bias introduced by measurement error in estimating the AUC, the author proposes the use of insertion methods and regression calibration, combined with the EM algorithm for parameter estimation. Theoretical properties of the proposed estimators, such as consistency and asymptotic normality, are established.

The authors of Contribution 11 present a Bayesian hierarchical model incorporating Gaussian processes to account for spatial random effects in the analysis of storm surge data. Their results demonstrate that the proposed method improves the accuracy of storm surge estimates, particularly for long return periods, offering valuable insights for environmental risk assessment.

The authors of Contribution 12 investigate the optimal quantization of discrete probability distributions when part of the support set is pre-specified. Their study covers both finite and infinite discrete distributions, deriving optimal conditional quantizers and corresponding quantization errors. The authors provide theoretical insights that lay the groundwork for future research in quantization under structural constraints.

The authors of Contribution 13 apply CEEMDAN to decompose carbon emission time series into intrinsic mode functions (IMFs) that represent different frequency components. A hybrid CNN–Transformer framework is then employed to capture both local features and long-range temporal dependencies. Based on four real-world datasets comprising over 133,000 observations, their approach significantly outperforms existing methods, achieving approximately 13% lower RMSE and 12–13% improvements in MAE and CRPS.

The final paper in this Special Issue addresses the assumption of log-linear effects in Cox proportional hazards models. The authors of Contribution 14 propose a partial-likelihood-based goodness-of-fit test to assess the validity of the log-linear effect assumption in univariate Cox models. When this assumption is rejected, it indicates the presence of monotonic but non-log-linear covariate effects on the hazard function. The proposed method is applied to breast cancer and lung cancer datasets, revealing valuable insights into the suitability of log-linearity in Cox model applications.
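The hypothesis being tested can be stated compactly. The notation below is an assumption made for illustration ($h_0$ is the baseline hazard, $x$ a single covariate, $\beta$ a regression coefficient, and $g$ an unspecified monotone function) and need not match the notation of Contribution 14:

```latex
H_0:\; h(t \mid x) = h_0(t)\,\exp(\beta x)
\qquad \text{versus} \qquad
H_1:\; h(t \mid x) = h_0(t)\,\exp\{g(x)\},\quad g \text{ monotone and not of the form } \beta x .
```

Under $H_0$ the log-hazard is linear in $x$, so each unit increase in the covariate multiplies the hazard by the same factor $e^{\beta}$; under $H_1$ the direction of the effect is preserved but its size may change across the range of $x$.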

As the Guest Editor, I would like to express my sincere gratitude to all the authors for their valuable contributions to this Special Issue. I also extend my heartfelt thanks to the reviewers for their thoughtful and constructive feedback, which has greatly enhanced the quality of the accepted papers. The aim of this Special Issue was to gather innovative research across diverse areas of data science. I hope that the published works will capture the interest of readers and have a meaningful impact on the international scientific community.

Conflicts of Interest

The author declares no conflicts of interest.

List of Contributions

1. Ueki, M. Data-Adaptive Multivariate Test for Genomic Studies Using Fused Lasso. Mathematics 2024, 12, 1422.
2. Nakazono, K.; Lin, Y.-C.; Liao, G.-Y.; Uozumi, R.; Emura, T. Computation of the Mann–Whitney Effect under Parametric Survival Copula Models. Mathematics 2024, 12, 1453.
3. Liu, S.; Tian, Q.; Liu, Y.; Li, P. Joint Statistical Inference for the Area under the ROC Curve and Youden Index under a Density Ratio Model. Mathematics 2024, 12, 2118.
4. Wang, J.-H.; Liu, C.-Y.; Min, Y.-R.; Wu, Z.-H.; Hou, P.-L. Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data. Mathematics 2024, 12, 2209.
5. Chen, Y.-H.; Hsu, S.-Y.; Wang, J.-H.; Su, C.-C. Analyzing Treatment Effect by Integrating Existing Propensity Score and Outcome Regressions with Heterogeneous Covariate Sets. Mathematics 2024, 12, 2265.
6. Wang, T.; Li, Y.; Sun, J.; Wang, S. Semiparametric Analysis of Additive–Multiplicative Hazards Model with Interval-Censored Data and Panel Count Data. Mathematics 2024, 12, 3667.
7. Huang, M.-Y.; Chan, K.C.G. Adaptive Reduction of Curse of Dimensionality in Nonparametric Instrumental Variable Estimation. Mathematics 2025, 13, 106.
8. Hsu, C.-R.; Wang, H. EWMA Control Chart Integrated with Time Series Models for COVID-19 Surveillance. Mathematics 2025, 13, 115.
9. Lv, J.; Mao, H.; Wang, Y.; Yao, Z. Reconstruction and Prediction of Chaotic Time Series with Missing Data: Leveraging Dynamical Correlations Between Variables. Mathematics 2025, 13, 152.
10. Chen, L.-P. Analysis of Receiver Operating Characteristic Curves for Cure Survival Data and Mismeasured Biomarkers. Mathematics 2025, 13, 424.
11. Scott, M.; Huang, H.-H. Generalizable Storm Surge Risk Modeling. Mathematics 2025, 13, 486.
12. Gonzalez, E.A.; Roychowdhury, M.K.; Salinas, D.A.; Veeramachaneni, V. Conditional Quantization for Some Discrete Distributions. Mathematics 2025, 13, 1717.
13. Sun, Y.; Qu, Z.; Liu, Z.; Li, X. Hierarchical Multi-Scale Decomposition and Deep Learning Ensemble Framework for Enhanced Carbon Emission Prediction. Mathematics 2025, 13, 1924.
14. Chen, H.; Tang, C.-F. A Goodness-of-Fit Test for Log-Linearity in Cox Proportional Hazards Model Under Monotonic Covariate Effects. Mathematics 2025, 13, 2264.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
