Recent Developments in Data Science and Knowledge Discovery

A special issue of Applied System Innovation (ISSN 2571-5577). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 30 December 2025 | Viewed by 5224

Special Issue Editor


Guest Editor
Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, OH 44115, USA
Interests: human-centered systems; machine learning; data science; distributed computing; blockchain

Special Issue Information

Dear Colleagues,

The Special Issue aims to highlight the latest advancements and emerging trends in the fields of data science and knowledge discovery. This issue brings together a collection of high-quality research papers, reviews, and case studies that explore innovative methodologies, tools, and applications driving the field forward.

Methodologies: We are inviting examinations of techniques for managing, processing, and analyzing large-scale datasets, which includes advancements in distributed computing, cloud-based analytics, and real-time data processing frameworks; explorations of new methods for extracting meaningful patterns and insights from complex datasets, including clustering, association rule mining, and sequential pattern mining; and papers that explore advancements in supervised, unsupervised, and reinforcement learning, with applications ranging from natural language processing to image recognition and beyond.

Topics include improvements in algorithm efficiency, scalability, and interpretability, as well as the development of hybrid models that combine different machine-learning approaches for enhanced performance.

Tools: We welcome introductions to new tools, platforms, and frameworks that facilitate data science workflows; an emphasis on open-source solutions and platforms that enhance collaboration, reproducibility, and scalability in data science projects; and reviews of integrated development environments, data visualization tools, and data management systems that support efficient data analysis and knowledge discovery.

Applications: We are interested in showcasing how data science and knowledge discovery techniques are being applied across various disciplines; papers that illustrate the impact of data-driven approaches in fields such as environmental science, education, and public policy; and innovative applications that demonstrate the transformative potential of data science in solving real-world problems and improving societal outcomes.

Prof. Dr. Wenbing Zhao
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied System Innovation is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • big data analytics
  • data mining
  • knowledge discovery
  • natural language processing
  • image recognition
  • predictive analytics
  • anomaly detection
  • decision-making

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)


Research

26 pages, 3622 KiB  
Article
Shear Strength Prediction for RCDBs Utilizing Data-Driven Machine Learning Approach: Enhanced CatBoost with SHAP and PDPs Analyses
by Imad Shakir Abbood, Noorhazlinda Abd Rahman and Badorul Hisham Abu Bakar
Appl. Syst. Innov. 2025, 8(4), 96; https://doi.org/10.3390/asi8040096 - 10 Jul 2025
Viewed by 599
Abstract
Reinforced concrete deep beams (RCDBs) provide significant strength and serviceability for building structures. However, a simple, general, and universally accepted procedure for predicting their shear strength (SS) has yet to be established. This study proposes a novel data-driven approach to predicting the SS of RCDBs using an enhanced CatBoost (CB) model. For this purpose, a new comprehensive database of RCDBs with shear failure, comprising 950 experimental specimens, was established and adopted. The model was developed through a customized procedure including feature selection, data preprocessing, hyperparameter tuning, and model evaluation. The CB model was further evaluated against three data-driven models (Random Forest, Extra Trees, and AdaBoost) as well as three prominent mechanics-driven models (ACI 318, CSA A23.3, and EU2). Finally, the SHAP algorithm was employed for interpretation to increase the model’s reliability. The results revealed that the CB model yielded superior accuracy and outperformed all other models. In addition, the interpretation results showed similar trends between the CB model and the mechanics-driven models. The geometric dimensions and concrete properties are the most influential input features on the SS, followed by reinforcement properties. Specifically, the SS can be significantly improved by increasing beam width and concrete strength, and by reducing the shear span-to-depth ratio. Thus, the proposed interpretable data-driven model has high potential as an alternative approach for design practice in structural engineering.
(This article belongs to the Special Issue Recent Developments in Data Science and Knowledge Discovery)

20 pages, 2918 KiB  
Article
Randomized Feature and Bootstrapped Naive Bayes Classification
by Bharameeporn Phatcharathada and Patchanok Srisuradetchai
Appl. Syst. Innov. 2025, 8(4), 94; https://doi.org/10.3390/asi8040094 - 2 Jul 2025
Viewed by 760
Abstract
Naive Bayes (NB) classifiers are widely used for their simplicity, computational efficiency, and interpretability. However, their predictive performance can degrade significantly in real-world settings where the conditional independence assumption is often violated. More complex NB variants address this issue but typically introduce structural complexity or require explicit dependency modeling, limiting their scalability and transparency. This study proposes two lightweight ensemble-based extensions, randomized feature naive Bayes (RF-NB) and randomized feature bootstrapped naive Bayes (RFB-NB), designed to enhance robustness and predictive stability without altering the underlying NB model. By integrating randomized feature selection and bootstrap resampling, these methods implicitly reduce feature dependence and noise-induced variance. Evaluation across twenty real-world datasets spanning medical, financial, and industrial domains demonstrates that RFB-NB consistently outperformed classical NB, RF-NB, and k-nearest neighbor in several cases. Although random forest achieved higher average accuracy overall, RFB-NB demonstrated comparable accuracy with notably lower variance and improved predictive stability, specifically in datasets characterized by high noise levels, large dimensionality, or significant class imbalance. These findings underscore the practical and complementary advantages of RFB-NB in challenging classification scenarios.
(This article belongs to the Special Issue Recent Developments in Data Science and Knowledge Discovery)
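The abstract above combines randomized feature selection with bootstrap resampling around an unchanged NB base learner. The following is a minimal sketch of that general idea and not the authors' implementation. The Gaussian likelihood, the square-root default for the feature subset size, the majority-vote aggregation, and all function names (fit_gaussian_nb, predict_gaussian_nb, fit_rfb_nb, predict_rfb_nb) are assumptions made here for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def fit_gaussian_nb(X, y, features):
    """Fit class log-priors and per-feature Gaussian parameters on a feature subset."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for f in features:
            vals = [r[f] for r in rows]
            mean = sum(vals) / len(vals)
            # Small constant keeps the variance strictly positive (assumed smoothing).
            var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-9
            stats.append((f, mean, var))
        model[label] = (math.log(len(rows) / len(X)), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior under the NB assumption."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, stats) in model.items():
        score = log_prior
        for f, mean, var in stats:
            score += -0.5 * math.log(2 * math.pi * var) - (row[f] - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def fit_rfb_nb(X, y, n_models=25, n_features=None, seed=0):
    """Train NB learners, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    d = len(X[0])
    k = n_features or max(1, int(math.sqrt(d)))  # assumed default subset size
    ensemble = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample rows with replacement
        feats = rng.sample(range(d), k)                       # sample features without replacement
        ensemble.append(fit_gaussian_nb([X[i] for i in idx], [y[i] for i in idx], feats))
    return ensemble


def predict_rfb_nb(ensemble, row):
    """Aggregate the base learners by majority vote (an assumed aggregation rule)."""
    votes = Counter(predict_gaussian_nb(m, row) for m in ensemble)
    return votes.most_common(1)[0][0]
```

Calling fit_rfb_nb(X, y) on a list of numeric feature rows and a list of labels returns the ensemble, and predict_rfb_nb(ensemble, row) classifies one row. Dropping the bootstrap step (using all rows for every learner) would give a randomized-feature-only variant in the spirit of RF-NB.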

50 pages, 7212 KiB  
Article
Coordinated Evaluation of Technological Innovation and Financial Development in China: An Engineering Perspective
by Jiong Zhou, Yuanxin Jia, Yixin Yang and Wenbing Zhao
Appl. Syst. Innov. 2025, 8(3), 77; https://doi.org/10.3390/asi8030077 - 30 May 2025
Viewed by 2303
Abstract
Innovation-driven development is the main driving strategy for promoting high-quality economic development. Technological innovation is the core of innovation-driven development. Financial innovation is an important aspect of promoting financial development. As such, the coupling and coordination of technological innovation and financial development in developing countries, such as China, is an important issue. The topic has been extensively studied over the last decade in the context of China, and a dominant method has emerged on how to model the technological innovation subsystem and the financial development subsystem, and how to quantitatively determine the degree of coupling and coordination of the two subsystems. A variety of predictors have been proposed to model each subsystem. The coupling degree and the coordination degree are then calculated and used to analyze the current development status for potential issues. However, we make an effort to validate the calculated degree of coupling and coordination before the results are used for the analysis. Without validation, the outcomes of the analysis not only might not be useful but also could lead to inappropriate governmental policies. That said, it is tremendously challenging to validate the results due to the lack of ground truth. The goal of this study is to work towards objectively determining the reliability of the degree of coupling and coordination from an engineering perspective. Specifically, we accomplish this task by evaluating the regression performance and projection performance. We demonstrate that the use of a carefully crafted set of predictors for each subsystem is the foundation for deriving a reliable coordination degree of the two subsystems.
(This article belongs to the Special Issue Recent Developments in Data Science and Knowledge Discovery)
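The abstract above refers to a coupling degree and a coordination degree computed from two subsystem scores, without stating the formulas. As a rough illustration only, the sketch below uses one widely used two-subsystem form: C = 2*sqrt(U1*U2)/(U1+U2), T = alpha*U1 + beta*U2, and D = sqrt(C*T). Whether the article uses exactly this form is an assumption. The function names, the equal weights, and the requirement that the scores U1 and U2 are already normalized to [0, 1] are also assumptions.

```python
import math


def coupling_degree(u1, u2):
    """Coupling degree C of two subsystem scores, assumed to lie in [0, 1]."""
    if u1 + u2 == 0:
        return 0.0  # both subsystems at zero: define C as 0 to avoid division by zero
    return 2 * math.sqrt(u1 * u2) / (u1 + u2)


def coordination_degree(u1, u2, alpha=0.5, beta=0.5):
    """Coupling coordination degree D = sqrt(C * T), with T a weighted composite score."""
    c = coupling_degree(u1, u2)
    t = alpha * u1 + beta * u2  # weights alpha and beta are assumed, not from the source
    return math.sqrt(c * t)
```

In this form C equals 1 whenever the two scores are equal, regardless of their level, which is why the composite score T is folded in: D is high only when the subsystems are both balanced and well developed.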

13 pages, 44634 KiB  
Article
Predictive and Explainable Machine Learning Models for Endocrine, Nutritional, and Metabolic Mortality in Italy Using Geolocalized Pollution Data
by Donato Romano, Michele Magarelli, Pierfrancesco Novielli, Domenico Diacono, Pierpaolo Di Bitonto, Nicola Amoroso, Alfonso Monaco, Roberto Bellotti and Sabina Tangaro
Appl. Syst. Innov. 2025, 8(2), 48; https://doi.org/10.3390/asi8020048 - 1 Apr 2025
Viewed by 853
Abstract
This study investigated the predictive performance of three regression models, namely Gradient Boosting (GB), Random Forest (RF), and XGBoost (XGB), in forecasting mortality due to endocrine, nutritional, and metabolic diseases across Italian provinces. Utilizing a dataset encompassing air pollution metrics and socio-economic indices, the models were trained and tested to evaluate their accuracy and robustness. Performance was assessed using metrics such as the coefficient of determination (r²), mean absolute error (MAE), and root mean squared error (RMSE), revealing that GB outperformed both RF and XGB, offering superior predictive accuracy and model stability (r² = 0.55, MAE = 0.17, and RMSE = 0.05). To further interpret the results, SHAP (SHapley Additive exPlanations) analysis was applied to the best-performing model to identify the most influential features driving mortality predictions. The analysis highlighted the critical roles of specific pollutants, including benzene, and of socio-economic factors such as life quality and instruction in influencing mortality rates. These findings underscore the interplay between environmental and socio-economic determinants in health outcomes and provide actionable insights for policymakers aiming to reduce health disparities and mitigate risk factors. By combining advanced machine learning techniques with explainability tools, this research demonstrates the potential for data-driven approaches to inform public health strategies and promote targeted interventions in the context of complex environmental and social determinants of health.
(This article belongs to the Special Issue Recent Developments in Data Science and Knowledge Discovery)
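The three metrics named in the abstract above have standard definitions, which the short sketch below spells out in plain Python. It shows only how the quantities are computed, not the authors' evaluation code. The function name regression_metrics and the sample values in the usage note are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n                 # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)      # root mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                            # coefficient of determination
    return r2, mae, rmse
```

For example, regression_metrics([1, 2, 3, 4], [1.5, 2, 2.5, 4]) gives r² = 0.9, MAE = 0.25, and RMSE of about 0.354. The function assumes the true values are not all identical, since r² is undefined when the total sum of squares is zero.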