Topic Editors

Prof. Dr. Jonathan Cepeda-Negrete
División de Ciencias de la Vida (DICIVA), Universidad de Guanajuato, Campus Irapuato-Salamanca, Carretera Irapuato-Silao km 9 ap 311, Irapuato 36500, Mexico
Prof. Dr. Qianmu Li
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Machine Learning and Data Mining: Theory and Applications

Abstract submission deadline: 31 October 2026
Manuscript submission deadline: 31 December 2026
Viewed by: 4105

Topic Information

Dear Colleagues,

Recent decades have witnessed a rapid evolution of machine learning (ML) and data mining (DM) techniques, driven by advances in computing power, big data, and artificial intelligence. These technologies have transformed diverse domains such as healthcare, finance, manufacturing, transportation, and smart environments. This Topic aims to gather high-quality contributions that explore both the theoretical foundations and practical applications of ML and DM. We welcome research on emerging algorithms, optimization strategies, and innovative architectures, as well as studies addressing real-world challenges through predictive modeling, pattern recognition, and knowledge discovery. Particular attention will be given to works that integrate ML and DM with cutting-edge paradigms such as deep learning, IoT, cloud computing, and ethical AI.

Prof. Dr. Jonathan Cepeda-Negrete
Prof. Dr. Qianmu Li
Topic Editors

Keywords

  • machine learning
  • data mining
  • artificial intelligence
  • deep learning
  • big data analytics
  • pattern recognition
  • predictive modeling
  • computer vision
  • IoT and smart systems

Participating Journals

Journal Name      Impact Factor   CiteScore   Launched Year   First Decision (median)   APC
Algorithms        2.1             4.5         2008            19.2 days                 CHF 1800
Applied Sciences  2.5             5.5         2011            16 days                   CHF 2400
AppliedMath       0.7             1.1         2021            20.6 days                 CHF 1200
Data              2.0             5.0         2016            25 days                   CHF 1600
Information       2.9             6.5         2010            20.9 days                 CHF 1800
Symmetry          2.2             5.3         2009            15.8 days                 CHF 2400

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

  1. Share your research immediately: Disseminate your ideas prior to publication and establish priority for your work.
  2. Safeguard your intellectual contribution: Protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
  3. Boost visibility and impact: Increase the reach and influence of your research by making it accessible to a global audience.
  4. Gain early feedback: Receive valuable input and insights from peers before submitting to a journal.
  5. Ensure broad indexing: Preprints are indexed by Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit, and Europe PMC.

Published Papers (8 papers)

27 pages, 3420 KB  
Article
BRB-Based Classification of Imbalanced Cybersecurity Data in the Industrial Internet
by Yang Zhao, Yanbin Yuan, Yuhe Wang, Qun Han and Shiming Li
Symmetry 2026, 18(6), 916; https://doi.org/10.3390/sym18060916 - 27 May 2026
Viewed by 144
Abstract
Class distribution asymmetry (imbalanced data) is a prevalent problem in the field of Industrial Internet cybersecurity, where normal data far outnumber abnormal data. This causes traditional machine learning classifiers to be biased towards the majority class, severely degrading their attack detection capability. To address this issue while meeting the requirement for traceability of the decision-making process in industrial scenarios, this paper proposes an imbalanced data classification method based on the Belief Rule Base (BRB). First, the Cluster-Based Oversampling (CBO) algorithm is employed to restore the symmetry of class distribution at the data level. Then, the Evidential Reasoning (ER) iterative algorithm is used to perform attribute fusion, which reduces the number of antecedent attributes of BRB while maintaining the information, effectively alleviating the rule explosion problem. Finally, interpretable classification is realized based on BRB, and the Circle chaotic mapping Gray Wolf Optimizer (Circle-GWO) algorithm is introduced to complete model construction, parameter optimization and fine-tuning. Experimental results on the UNSW-NB15 and TON_IoT datasets demonstrate that the proposed method can effectively handle imbalanced data classification tasks in this field, providing a practical technical solution to improve the accuracy and efficiency of cybersecurity decision-making in the Industrial Internet.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

25 pages, 8836 KB  
Article
Dual-Tensor Constrained Multi-View Subspace Clustering
by Guanghui Li, Yue Qian, Yong Cheng, You Huang, Lingbin Zeng, Shixin Yao and Xingkong Ma
Appl. Sci. 2026, 16(10), 4766; https://doi.org/10.3390/app16104766 - 11 May 2026
Viewed by 164
Abstract
Existing multi-view clustering approaches based on matrix factorization often fail to jointly capture global high-order correlations and local view-specific characteristics, and they typically suffer from instability in generating final clustering labels. To overcome these limitations, this paper presents a multi-view subspace clustering method termed dual-tensor constrained multi-view subspace clustering (DTCMVSC). Specifically, for each view, we learn an independent latent representation matrix, a projection matrix, and a basis matrix. The latent representations and projection matrices are stacked into third-order tensors, upon which tensor nuclear norm regularization is imposed to simultaneously exploit consensus structures and complementary information across views. Additionally, a consensus regularization term and adaptive view weights are introduced to align the latent representations of different views toward a unified consensus subspace. The resulting optimization problem is efficiently solved under the ADMM framework, after which a similarity matrix is constructed from the consensus representation and spectral clustering is performed to obtain the final labels. Experimental evaluations on six benchmark datasets demonstrate the superiority of DTCMVSC. Specifically, it achieves an ACC of 86.10% on CMU and an NMI of 94.17% on ORL, surpassing even the lowest-performing state-of-the-art baselines by 63.08 and 18.53 percentage points, respectively.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

18 pages, 708 KB  
Article
NSCH-Flourishing-ML: A Curated Dataset and Reproducible Pipeline for Machine Learning Analysis of Child Flourishing
by Miguel Arcos-Argudo, Rodolfo Bojorque, Fernando Pesántez and Kely Nieto-Andrade
Data 2026, 11(5), 103; https://doi.org/10.3390/data11050103 - 3 May 2026
Viewed by 408
Abstract
Large-scale population surveys provide valuable information for studying child well-being, yet their structure often limits the direct application of machine-learning methods. The National Survey of Children’s Health (NSCH) is one of the most comprehensive datasets for monitoring children’s health and development in the United States, but the raw survey files contain logical skip patterns, categorical variables, and complex survey-design elements that require substantial preprocessing before predictive analysis can be performed. This study presents a curated machine-learning-ready benchmark dataset derived from the 2023 NSCH together with a fully reproducible computational pipeline for studying school-age child flourishing. The workflow constructs a binary flourishing outcome from four survey items related to curiosity, task persistence, emotional self-regulation, and interest in doing well in school. After restricting the sample to children aged 6–17 years and retaining only records with valid responses in all four outcome items, the final analytical dataset contained 32,934 observations. Feature selection based on mutual information computed on the training partition, combined with cross-validated subset-size selection, yielded a final benchmark subset of 150 predictors. Baseline experiments using logistic regression and random forest showed stable and reasonably strong predictive performance, with held-out ROC-AUC values around 0.84–0.85 and closely aligned cross-validation results. An exploratory comparison between weighted and unweighted learning further showed that survey weighting did not improve discriminative performance in this benchmark setting, although the magnitude of the effect was modest and model-dependent. By releasing both the curated benchmark dataset and the reproducible pipeline, this study provides a reusable resource for machine-learning research on child well-being and survey-based computational benchmarking.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)
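
The abstract above mentions feature selection by mutual information computed on the training partition only. A minimal sketch of that general idea follows, assuming discrete features and a binary label; the function names (`mutual_information`, `select_top_k`) and the toy data are hypothetical and are not taken from the paper's pipeline.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_top_k(train_rows, train_labels, k):
    """Rank feature columns by MI with the label, using training rows only."""
    n_features = len(train_rows[0])
    scores = [
        mutual_information([row[j] for row in train_rows], train_labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: feature 0 copies the label, feature 1 is random,
# feature 2 is constant and therefore carries no information.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
rows = [[y, rng.randint(0, 2), 7] for y in labels]

# Scores are computed on the training partition only, so the held-out rows
# never influence which features are kept.
split = 150
train_rows, train_labels = rows[:split], labels[:split]
test_rows = rows[split:]

selected, scores = select_top_k(train_rows, train_labels, k=1)
test_reduced = [[row[j] for j in selected] for row in test_rows]
```

Restricting the scoring step to the training rows is what keeps held-out evaluation free of selection leakage; the subset size `k` would be chosen by cross-validation in a fuller pipeline.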

23 pages, 13014 KB  
Article
Seasonal Estimation of Net Surface Shortwave Radiation Using Multiple Machine Learning Algorithms, Remote Sensing Observation, and In-Situ Station
by Nuan Wang, Shisong Cao, Mingyi Du, Jingyi Chen, Ling Li, Yang Liu and Huiping Sun
Appl. Sci. 2026, 16(9), 4370; https://doi.org/10.3390/app16094370 - 29 Apr 2026
Viewed by 284
Abstract
Net surface shortwave radiation (NSSR) is a key parameter in the Earth’s energy cycle, greatly affecting global water and heat balance. Currently, a comprehensive comparative analysis regarding the accuracy of different models remains severely lacking, and there is also a notable deficiency in the systematic exploration of seasonal radiative drivers. Therefore, we developed a machine learning-based seasonal NSSR estimation model. By integrating in-situ observational data with multi-source remote sensing datasets, we achieved precise quantification of radiative fluxes. This proposed model framework employed three cutting-edge algorithms, namely Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), to capture the non-linear interactions among radiative drivers across the four seasons. Through mechanistic sensitivity analysis, we quantified the impacts of key variables on NSSR prediction. The results showed that the RF algorithm performed best. Its seasonal R² values were 0.95 (spring), 0.89 (summer), 0.95 (autumn), and 0.96 (winter). The Solar Zenith Angle (SZA) dominated in spring and winter; its absence reduced R² by 0.23 and raised RMSE by 20.66–26.42 W/m². Meteorological factors mattered most in summer; excluding them cut R² by 0.17 and hiked RMSE by 23.82 W/m². This study provides actionable insights for terrestrial radiation budget research.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)
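
The sensitivity results in the abstract above are stated as changes in R² and RMSE when a driver is excluded. As a minimal sketch of how such deltas are computed, assuming the standard definitions of both metrics: the flux values below are made up for illustration and are not the paper's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical fluxes in W/m^2: observations, predictions from a model with
# all drivers, and predictions from a model with one driver excluded.
observed = [100.0, 200.0, 300.0, 400.0]
full_model = [110.0, 190.0, 310.0, 390.0]
ablated_model = [130.0, 170.0, 330.0, 370.0]

# A drop in R^2 and a rise in RMSE after exclusion indicate that the
# excluded driver carried predictive information.
delta_r2 = r2(observed, full_model) - r2(observed, ablated_model)
delta_rmse = rmse(observed, ablated_model) - rmse(observed, full_model)
```

With these toy numbers, excluding the driver lowers R² by 0.064 and raises RMSE by 20 W/m².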

19 pages, 522 KB  
Article
Halpin’s Differential Test Functioning via Robust Linking: A Comparison of Bisquare and L0 Loss Functions
by Alexander Robitzsch
Information 2026, 17(5), 428; https://doi.org/10.3390/info17050428 - 29 Apr 2026
Viewed by 392
Abstract
Differential test functioning (DTF) assesses, within an item response model, whether differential item functioning (DIF) affects the test as a whole. A recent contribution by Halpin (2025, arXiv) introduced a DTF statistic defined as the difference between a robust linking method based on the bisquare loss function and a nonrobust linking method such as mean–mean linking. The present article applies this statistic in the context of robust mean–geometric mean linking using the L0 loss function and compares it with Halpin’s original bisquare-loss approach. Alternative confidence interval estimation methods are evaluated for statistical inference for the DTF statistic. The findings indicate that the L0 loss function yields a smaller bias in the group mean estimate under several conditions than the bisquare loss function. However, the DTF statistic is estimated more precisely with the bisquare than with the L0 loss function. Moreover, the most satisfactory statistical inference is obtained from bias-corrected bootstrap and basic bootstrap confidence intervals based on a parametric rather than nonparametric bootstrap.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

24 pages, 4544 KB  
Article
DualGAD: A Generalist Graph Anomaly Detection Method via Dual-Encoder Architecture
by Jizhao Liu, Shuo Mao, Shuqin Zhang, Fangfang Shan and Jun Li
Information 2026, 17(5), 416; https://doi.org/10.3390/info17050416 - 27 Apr 2026
Viewed by 361
Abstract
Due to the capability of graph structures to model complex relationships, graph anomaly detection has significant application value in various domains, including financial fraud detection, network security, and fake account identification. Traditional graph anomaly detection methods follow a specialized paradigm of “one dataset, one model”, which requires retraining or fine-tuning models for each new domain. This approach faces critical challenges in practical applications, namely high deployment costs and limited generalization capability. To address this problem, generalist graph anomaly detection aims to achieve the goal of “train once, apply across domains”. However, existing generalist methods primarily rely on graph neural networks to implicitly learn structural information, where the learned structural representations are tightly coupled with specific topology distributions, resulting in limited structural stability under domain shifts. To address this limitation, we propose DualGAD, a generalist graph anomaly detection method via a dual-encoder architecture. In particular, DualGAD introduces explicit structural modeling that characterizes the relative topological deviation of nodes with respect to the overall graph structure, thereby enhancing structural invariance across heterogeneous domains. This method separately models node attribute information and explicit graph structural information via an attribute feature encoder and an explicit structural feature encoder, and adopts an “attribute-dominant, structure-complementary” fusion strategy to achieve collaborative modeling. Experiments on eight real datasets demonstrate that DualGAD achieves an average improvement of 3.12% in AUROC compared to the strongest baseline methods, exhibiting significant cross-domain generalization capability.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

20 pages, 1426 KB  
Review
Profiling Decision-Making Styles Under Healthcare Resource Scarcity: An Interdisciplinary Clustering Approach
by Micaela Pinho, Fátima Leal and Isabel Miguel
Information 2026, 17(3), 287; https://doi.org/10.3390/info17030287 - 14 Mar 2026
Cited by 1 | Viewed by 665
Abstract
Scarcity of healthcare resources requires prioritisation decisions that raise complex ethical, economic, and social challenges. While normative frameworks provide guidance on how such decisions ought to be made, growing evidence suggests that individuals differ substantially in how they approach morally charged allocation choices. This study investigates heterogeneity in decision-making styles and support for healthcare prioritisation criteria using an interdisciplinary approach that integrates health economics, social psychology, and computational methods to identify latent decision-making profiles among a sample of adults residing in Portugal. Data were collected from adults residing in Portugal using a structured online questionnaire comprising socio-demographic characteristics, decision-making styles, and preferences elicited through twenty hypothetical healthcare rationing scenarios. The results reveal three meaningful decision-making profiles characterised by different combinations of cognitive styles and ethical prioritisation patterns: analytically oriented decision-makers prioritising health gains; intuitive, context-sensitive decision-makers balancing clinical and social criteria; heuristic-driven decision-makers relying on simpler or less differentiated heuristics. These findings demonstrate that, within this sample, healthcare prioritisation preferences are shaped by systematic variations in decision style rather than a single moral or rational framework. By linking behavioural heterogeneity with ethical decision-making, this study contributes to theoretical debates on healthcare rationing and demonstrates the value of clustering techniques for uncovering latent structures in complex decision data. The results provide insights relevant for the design of decision-support systems and rationing policies, which may be adapted to accommodate heterogeneous decision styles in comparable settings.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

25 pages, 639 KB  
Article
A Sparse L∞-Norm Regularized Least Squares Support Vector Regression
by Xiaoyong Liu, Dong Li and Chengbin Zeng
Algorithms 2026, 19(2), 160; https://doi.org/10.3390/a19020160 - 18 Feb 2026
Cited by 1 | Viewed by 526
Abstract
Although Least Squares Support Vector Regression (LSSVR) reduces the hyperparameter space to two, it sacrifices sparsity, causing all training samples to become support vectors and increasing storage costs. In contrast, standard Support Vector Regression (SVR) preserves sparsity but requires tuning three highly coupled hyperparameters, leading to higher computational burden. To address these limitations, this paper proposes a sparse L∞-norm regularized least squares SVR framework that incorporates the infinity norm of approximation errors into both the objective function and inequality constraints. The resulting optimization problem minimizes model complexity while controlling the maximum prediction deviation through a single slack variable, thereby transforming the conventional three-hyperparameter SVR tuning task into a two-parameter problem involving only the regularization coefficient and kernel width. This formulation restores sparsity by enabling a compact support vector set, while preserving the stability and convexity advantages of LSSVR. Experiments on both static and dynamic datasets demonstrate that the proposed method consistently achieves higher predictive accuracy and improved robustness compared with standard SVR and LSSVR. These results indicate that the proposed L∞-norm regularized framework offers a mathematically principled and computationally efficient alternative for sparse, robust, and scalable regression modeling.
(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)