AI-Based Data Science and Database Systems

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 10 July 2025 | Viewed by 6363

Special Issue Editors


Guest Editor
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Interests: artificial intelligence; data science; data lake; database systems

Guest Editor
College of Computer Science, Nankai University, Tianjin, China
Interests: database; big data; data mining; artificial intelligence

Special Issue Information

Dear Colleagues,

As machine learning (ML), deep learning (DL), and large language models (LLMs) become widely adopted across applications and disciplines, the synergy between the database (DB) and artificial intelligence (AI) communities is becoming increasingly evident. AI technology, with its strong modeling and generalization capabilities, is at the forefront of technological advancement, catalyzing further development in numerous fields. Beyond the contributions of algorithms and models themselves, the quality of training data significantly impacts the performance of AI models. Accurate, consistent, and representative datasets are crucial for enhancing the modeling effectiveness and generalization capability of AI models. The steps involved in data preparation, cleaning, and management, which greatly influence data quality, are closely linked to research within the database community. Additionally, the ML pipeline depends on mechanisms for storing and querying ML artifacts. Conversely, the database field can benefit from AI research. Traditional methods in the database domain, which often rely on constraint- or rule-based approaches, can leverage AI to reduce their heavy dependence on human supervision and to offer new perspectives and solutions for long-standing complex problems.

This Special Issue focuses on exploring the potential at the intersection of the database and AI fields, emphasizing research that combines the strengths of both domains. By harnessing the mutual empowerment of these fields, we aim to advance the progress of both database and AI technologies.

Topics of interest include, but are not limited to, the following:

  • Advanced data cleaning techniques for AI applications;
  • Seamless data integration solutions for AI-driven processes;
  • Comprehensive data discovery methods for AI development;
  • Lifecycle management of datasets in AI pipelines;
  • Automated data preprocessing for AI;
  • AI-driven techniques for database schema design and optimization;
  • Enhanced AI-based functionality within modern DBMS;
  • AI-based data discovery and profiling;
  • Integrated AI-based data cleaning and data integration solutions;
  • AI-powered data analytics and exploration in data lakes.

Dr. Chengliang Chai
Dr. Yu Sun
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data management in AI model lifecycle
  • AI-based functionality inside DBMS
  • AI-based data science
  • AI-based data discovery
  • AI-based data preparation
  • AI-based database systems

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information can be found in MDPI's Special Issue policies.

Published Papers (4 papers)

Research

13 pages, 576 KiB  
Article
A Novel Approach to Incremental Diffusion for Continuous Dataset Updates in Image Retrieval
by Zili Tang, Fan Yang, Jiong Lou and Jie Li
Appl. Sci. 2025, 15(5), 2535; https://doi.org/10.3390/app15052535 - 26 Feb 2025
Viewed by 518
Abstract
Diffusion is well known for its success in improving retrieval performance by exploiting the local structure of data distribution. Some recent works have focused on improving its efficiency by shifting the computing burden offline. However, we find that efficient offline diffusion struggles to handle continuously updating datasets, which directly hinders its application in the real world. Unlike previous methods that apply diffusion to the entire gallery, we introduce an anchor graph to serve as an agent of the complete gallery graph. This enables diffusion to retrieve newly added images at acceptable computational cost. We demonstrate that our proposed method is a good approximation of diffusion, featuring fast online search speed and the ability to handle growing data. Moreover, experiments on benchmark datasets show that the proposed method outperforms the state of the art by a large margin with proper parameter settings.
(This article belongs to the Special Issue AI-Based Data Science and Database Systems)

32 pages, 4697 KiB  
Article
Predicting the Compression Index of Clayey Soils Using a Hybrid Genetic Programming and XGBoost Model
by Abolfazl Baghbani, Katayoon Kiany, Hossam Abuel-Naga and Yi Lu
Appl. Sci. 2025, 15(4), 1926; https://doi.org/10.3390/app15041926 - 13 Feb 2025
Cited by 1 | Viewed by 697
Abstract
The accurate prediction of the compression index (Cc) is crucial for understanding the settlement behavior of clayey soils, which is a key factor in geotechnical design. Traditional empirical models, while widely used, often fail to generalize across diverse soil conditions due to their reliance on simplified assumptions and regional dependencies. This study proposed a novel hybrid method combining Genetic Programming (GP) and XGBoost. A large database (including 385 datasets) of geotechnical properties, including the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), and the water content (w), was used. The hybrid GP-XGBoost model achieved remarkable predictive performance, with R² values of 0.966 and 0.927 and mean squared error (MSE) values of 0.001 and 0.001 for the training and testing datasets, respectively. The mean absolute error (MAE) was also exceptionally low at 0.030 for the training and 0.028 for the testing datasets. Comparative analysis showed that the hybrid model outperformed the standalone GP (R² = 0.934, MSE = 0.003) and XGBoost (R² = 0.939, MSE = 0.002) models, as well as traditional empirical methods such as Terzaghi and Peck (R² = 0.149, MSE = 0.090). Key findings highlighted that the initial void ratio and water content are the most influential predictors of Cc, with feature importance scores of 0.55 and 0.27, respectively. The novelty of the proposed method lies in its ability to combine the interpretability of GP with the computational efficiency of XGBoost, resulting in a robust and adaptable predictive tool. This hybrid approach has the potential to advance geotechnical engineering practices by providing accurate and interpretable models for diverse soil profiles and complex site conditions.
(This article belongs to the Special Issue AI-Based Data Science and Database Systems)

24 pages, 7608 KiB  
Article
Identifying NSFW Groups on Reddit Social Network by Identifying Highly Interconnected Subreddits Through Analysis of Implicit Communication Patterns
by Pushwitha Krishnappa, Lance Lindner, Eduardo Pasiliao and Tathagata Mukherjee
Appl. Sci. 2024, 14(24), 11665; https://doi.org/10.3390/app142411665 - 13 Dec 2024
Viewed by 2785
Abstract
In this paper, we analyze the Reddit social network with the goal of identifying “highly interconnected” subreddits. Intuitively, a subreddit is highly interconnected if the users in the subreddit interact a lot with users from other subreddits in the Reddit ecosystem. To identify the highly interconnected subreddits, we used the communication patterns of the users on the Reddit platform. We defined an “interconnectedness score” that was obtained from user interactions across subreddits. This score was used to identify the highly interconnected subreddits. We also leveraged the interactions among users within the subreddits to identify implicit leader–follower relationships within them. Intuitively, an implicit leader in a subreddit is someone who receives a lot of attention from other users, who are the followers. We inferred the implicit leaders using only the responses they received on their posts from other users in the subreddit. Finally, we studied the role played by these implicit leaders within the interconnected subreddits using the idea of a “leaderness score”. For the analysis, we used data obtained from Reddit in 2022 with a custom-built crawler. We analyzed a total of 125,000 subreddits for this work and identified the group of highly interconnected subreddits using the idea of the interconnectedness score. We manually evaluated the content of the posts on the identified interconnected subreddits in order to understand the nature of these subreddits. Our analysis showed that the highly interconnected subreddits discuss content considered to be “not safe/suitable for work” (NSFW). We also observed that though these subreddits were highly interconnected among themselves, they were sparsely connected with other non-NSFW subreddits. Furthermore, we found that the implicit leaders in these subreddits drove the majority of the conversations in these groups.
These results are socially significant as they can be used to make online social networks safe for the underage population. Thus, our results can be used for enforcing age-based restrictions on access to these NSFW subreddits. Our results also open up the possibility of moderating the content on these subreddits by enforcing content moderation rules on the implicit leaders who drive the conversation in these groups. Finally, though these results are specific to Reddit, the insights obtained from this analysis can be used for analyzing other large-scale online social networks with similar goals to this study.
(This article belongs to the Special Issue AI-Based Data Science and Database Systems)

30 pages, 2336 KiB  
Article
Enhancing DDBMS Performance through RFO-SVM Optimized Data Fragmentation: A Strategic Approach to Machine Learning Enhanced Systems
by Kassem Danach, Abdullah Hussein Khalaf, Abbas Rammal and Hassan Harb
Appl. Sci. 2024, 14(14), 6093; https://doi.org/10.3390/app14146093 - 12 Jul 2024
Viewed by 1575
Abstract
Effective data fragmentation is essential in enhancing the performance of distributed database management systems (DDBMS) by strategically dividing extensive databases into smaller fragments distributed across multiple nodes. This study emphasizes horizontal fragmentation and introduces an advanced machine learning algorithm, Red Fox Optimization-based Support Vector Machine (RFO-SVM), designed for optimizing the data fragmentation process. The input database undergoes meticulous pre-processing to address missing data concerns, followed by analysis through RFO-SVM. This algorithm efficiently classifies features and target labels based on class labels. The RFO algorithm optimizes critical SVM parameters, including the kernel, kernel parameter, and boundary parameter, leveraging the accuracy metric. The resulting classified data serves as fragments for the fragmentation process. To ensure precision in fragmentation, a Genetic Algorithm (GA) allocates these fragments to diverse nodes within the DDBMS, optimizing the total allocation cost as the fitness function. The proposed model, implemented in Python, significantly contributes to the efficient fragmentation and allocation of databases in distributed systems, thereby enhancing overall performance and scalability.
(This article belongs to the Special Issue AI-Based Data Science and Database Systems)