This is the preface for the Special Issue “Big Data Mining and Analytics with Applications”, published in the MDPI journal Mathematics. Of the 26 submissions received for this Special Issue, nine articles and one review paper passed the peer-review process and were selected by the editors. They present original research ideas that advance the theory of big data mining and analytics and its applications. In particular, the topics of these 10 papers span predictive modeling for spatiotemporal data (e.g., traffic, air quality, crowd flow), data quality and repair, automated data understanding (e.g., table type prediction), advanced clustering and user segmentation, scalable statistical learning (e.g., Bayesian variable selection), and text mining applications (e.g., document classification, fake review detection). Together, they highlight a strong focus on efficient, robust, and interpretable methods for big data mining and analysis across domains.
Contribution 1 presents a two-stage hybrid extreme learning machine for short-term traffic flow forecasting. By combining particle swarm optimization with the gravitational search algorithm to optimize the parameters of the extreme learning machine, the model improves prediction accuracy and stability while retaining the fast training that characterizes extreme learning machines. Contribution 2 introduces a modified γ-Sutte indicator for air quality index prediction, which reduces computational complexity by using a shorter sliding window while maintaining high accuracy. Validated on datasets from Taiwan and Laos, the method outperforms the existing α-Sutte, β-Sutte, and ARIMA models across multiple metrics, demonstrating both efficiency and transferability. Contribution 3 proposes an automated method for predicting web table column types by integrating a convolutional neural network (CNN) with a knowledge-base voting mechanism. Entities in the table are looked up in a knowledge base to identify their categories; this expands the dataset and generates synthetic column samples for data augmentation, thereby improving the model’s generalization capability. The CNN predicts each sub-column and the resulting probabilities are averaged, while entity description texts extracted from the knowledge base are used to compute keyword coverage, which serves as the voting probability. The final result is determined by comparing these two probabilities. Contribution 4 implements a two-step cluster analysis methodology to empirically segment travel application users. By employing a log-likelihood distance measure that handles both continuous and categorical variables, and by using information criteria (AIC/BIC) to determine the optimal number of clusters automatically, this study overcomes the limitations of traditional distance metrics. The resulting identification of four distinct user profiles provides a data-driven framework for understanding real-world usage patterns, offering a methodological contribution to user segmentation in Mobility-as-a-Service research.
Contribution 5 proposes the VBSSLQR method, which efficiently performs variable selection in high-dimensional quantile regression by integrating the Spike-and-Slab Lasso prior with Variational Bayesian inference. This approach avoids the computational burden of MCMC sampling and non-convex optimization, enabling scalable and uncertainty-aware variable selection without relying on sub-Gaussian error assumptions. Contribution 6 provides a comprehensive survey of multi-source data repair, systematically categorizing error types into entity overlapping, attribute conflicts, and value inconsistencies. By organizing existing methods into a unified taxonomy covering entity resolution, truth discovery, and inconsistency repair, this review establishes a clear methodological framework for the field. The survey further identifies key challenges in compound-error repair and semantic-aware detection, offering valuable guidance for future research in multi-source data quality management. Contribution 7 applies gradient descent to clustering problems to improve the effectiveness of clustering methods. A generic clustering objective function is proposed, enabling more intelligent updates of cluster centroids. To address the slow convergence of gradient descent, Nesterov momentum is incorporated. The method offers greater flexibility by allowing any differentiable distance function to be used in the clustering objective function. Contribution 8 addresses the challenge of efficiently classifying and managing the massive volume of documents generated in educational reform by designing and implementing an automated classification system. After the classification criteria are defined, the dataset is constructed, and the data are preprocessed (including tokenization, stop word removal, and part-of-speech tagging), text features are extracted using the TF-IDF algorithm, and the Naïve Bayes algorithm is selected for classification. A software tool is developed to facilitate user interaction and to visualize the proportion of documents in each category.
Contribution 9 aims to eliminate ‘water army’ reviews (fake reviews posted by paid commenters) from the comment sections of e-commerce platforms by combining TF-IDF and LSI models to identify water army accounts based on behavioral and content features. The processed reviews are then classified as positive or negative by a TextCNN sentiment analysis model. LDA topic modeling is subsequently applied to analyze the themes in both positive and negative reviews. Contribution 10 introduces spatial heterogeneity into large-scale crowd flow prediction by employing two fully connected networks to cluster spatial units into different types a posteriori and adaptively. Convolutional layers compress redundant information in historical crowd flow to extract the shared patterns of the clusters. The temporal evolution of these shared patterns is predicted, and the patterns are mapped back to physical space to derive trend flows. The difference between the trend flows and the actual flows is then learned to enhance prediction accuracy.
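To make the TF-IDF and Naïve Bayes pipeline summarized for Contribution 8 more concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The toy documents, the category names ("policy", "assessment"), the function names, and the smoothed IDF formula are all hypothetical choices made for illustration; the preprocessing steps described in the contribution (tokenization, stop word removal, part-of-speech tagging) are assumed to have been applied already, so each document is given as a list of tokens.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every training token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; unseen tokens are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}


def train_nb(docs, labels, idf, alpha=1.0):
    """Multinomial Naive Bayes fitted on TF-IDF weights, Laplace-smoothed."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    weights = {label: defaultdict(float) for label in class_counts}
    for doc, label in zip(docs, labels):
        for tok, w in tfidf(doc, idf).items():
            weights[label][tok] += w
    model = {}
    for label, count in class_counts.items():
        total = sum(weights[label].values()) + alpha * len(idf)
        log_lik = {
            tok: math.log((weights[label][tok] + alpha) / total) for tok in idf
        }
        model[label] = (math.log(count / n_docs), log_lik)
    return model


def predict(doc, idf, model):
    """Return the label with the highest log-posterior score."""
    feats = tfidf(doc, idf)
    scores = {
        label: prior + sum(w * log_lik[tok] for tok, w in feats.items())
        for label, (prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical, already-preprocessed training documents (illustration only).
train_docs = [
    ["curriculum", "reform", "plan"],
    ["curriculum", "policy", "notice"],
    ["exam", "grading", "rubric"],
    ["exam", "score", "report"],
]
train_labels = ["policy", "policy", "assessment", "assessment"]

idf = fit_idf(train_docs)
model = train_nb(train_docs, train_labels, idf)
```

On this toy data, predict(["exam", "grading"], idf, model) returns "assessment" and predict(["reform", "curriculum"], idf, model) returns "policy". A real system of the kind described would train on a labeled corpus and would typically rely on an established text-mining library rather than hand-written counting.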
Looking ahead, future research should prioritize the integration of semantic awareness, cross-domain generalization, and uncertainty quantification in big data mining and analytics. Building on the advances in this Special Issue, key directions include the following: (1) developing unified frameworks for compound error repair that jointly resolve entity, attribute, and value inconsistencies using deep semantic representations; (2) designing adaptive spatiotemporal models that explicitly account for heterogeneous dynamics (e.g., in traffic, crowd flow, or air quality) through interpretable, physics-informed architectures; (3) advancing scalable Bayesian and optimization-based methods for high-dimensional data that provide reliable uncertainty estimates without restrictive distributional assumptions; and (4) creating knowledge-enhanced learning systems that seamlessly fuse structured knowledge bases with neural models for robust table understanding, text mining, and anomaly detection in real-world applications. Such efforts will bridge the gap between algorithmic innovation and practical deployment in complex, multi-source data ecosystems.
We, the Guest Editors, extend our sincere appreciation to all of the authors for their valuable contributions to this Special Issue. We are also deeply grateful to the anonymous reviewers for their insightful and professional evaluation reports, which have significantly enhanced the quality of the submitted manuscripts. Furthermore, we acknowledge the excellent collaboration with the publisher, the constant assistance provided by the MDPI associate editors in bringing this project to completion, and the support of the Managing Editor of this Special Issue, Ms. Helene Hu.