Reprint

Statistical Methods in Data Science and Applications

Edited by
April 2024
302 pages
  • ISBN 978-3-7258-0747-5 (Hardback)
  • ISBN 978-3-7258-0748-2 (PDF)

This book is a reprint of the Special Issue Statistical Methods in Data Science and Applications that was published in

Computer Science & Mathematics
Engineering
Physical Sciences
Public Health & Healthcare
Summary

The rise of big data has greatly increased the importance of data science, catalyzing extensive research across multiple fields, including mathematics, statistics, computer science, and artificial intelligence. Data science encompasses modeling, computation, and learning processes that transform data into information, information into knowledge, and knowledge into actionable decisions. However, big data poses numerous challenges, such as missing data, high- and ultra-high-dimensional data, response dependencies, time series analysis, and distributed storage. Existing theories, methods, and algorithms for analyzing big data encounter significant hurdles, especially in fundamental statistical tasks such as estimation, hypothesis testing, confidence interval construction, and variable selection, under both frequentist and Bayesian approaches. This reprint offers an array of data science tools aimed at tackling these challenges. Its topics include handling measurement errors or missing data, cognitive diagnosis modeling, constructing credit risk scorecards using logistic regression models, geographically weighted regression modeling, privacy protection practices in data mining, clustering methods, and model selection for high-dimensional datasets. It also covers predicting sensitive features under indirect questioning. Together, these discussions aim to provide valuable tools and examples for the practical application of data science.

Format
  • Hardback
License
© 2024 by the authors; CC BY-NC-ND license
Keywords
meta learning; data classification; hybrid sine and cosine algorithm; Wilcoxon signed rank test; multiple application scenario datasets; model selection; nonparametric additive models; nonparametric smoothing; ridge estimation; data masking; multiplicative noise; data mining; sample size calculation; clustering; correlation; REML; multivariate linear mixed models; GWNR; linear estimator; mixed estimator; spatial data; unbiased; bootstrap resampling; imputation; non-inferiority assessment; non-ignorable missing data; three-arm trial; bootstrap; expectation-maximization (EM) algorithm; latent class; likelihood ratio test; maximum likelihood; randomized response; sensitive attribute; credit risk scorecards; hypothesis testing; population stability; simulation; biomarkers; correction for attenuation; measurement error; Poisson binomial distribution; logistic regression; data aggregation; likelihood; numerical optimization; indirect questioning; non-randomized response technique; randomized response technique; sensitive attribute; statistical methods; model averaging; asymptotic optimality; HRCp; varying-coefficient partially linear model; missing data; otsfeatures; ordinal time series; feature extraction; cumulative probabilities; R package; cognitive diagnosis model; DINA model; penalized likelihood; Shannon entropy; EM algorithm; measurement error; surrogate; zero-inflated data