Advanced Research in Data-Centric AI

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "E1: Mathematics and Computer Science".

Deadline for manuscript submissions: closed (31 December 2024) | Viewed by 20915

Special Issue Editors


Dr. Kunpeng Liu
Guest Editor
Department of Computer Science, Portland State University, Portland, OR 97201, USA
Interests: feature engineering; data mining; reinforcement learning

Dr. Pengfei Wang
Guest Editor
Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
Interests: spatiotemporal graph network; point process

Dr. Yifeng Gao
Guest Editor
Department of Computer Science, University of Texas Rio Grande Valley, Brownsville, TX 78520, USA
Interests: time series; spatial-temporal pattern mining

Dr. Yanjie Fu
Guest Editor
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
Interests: data mining; big data analytics

Special Issue Information

Dear Colleagues,

The primary focus of machine learning is often on developing models to fit a particular dataset. However, in real-world scenarios, data can often be untidy, and refining models may not be the most effective way to enhance their performance. An alternative approach is to concentrate on improving the dataset itself rather than considering it as a fixed input. Data-Centric AI (DCAI) is an up-and-coming field that deals with techniques which systematically enhance datasets, often leading to notable improvements in practical machine learning applications.

The purpose of this Special Issue is to encourage the development of a dynamic and collaborative interdisciplinary community focused on addressing real-world data challenges through DCAI. These challenges involve several areas, such as data acquisition and creation, data labeling, data preprocessing and enhancement, data quality assessment, data debt, and data governance. As many of these domains are still emerging, we endeavor to foster an environment that brings together experts to define and shape the DCAI movement and significantly impact the future of AI and ML.

Dr. Kunpeng Liu
Dr. Pengfei Wang
Dr. Yifeng Gao
Dr. Yanjie Fu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • automated data science
  • data pre-processing
  • big data analytics
  • feature engineering
  • reinforcement learning
  • time series
  • graph mining
  • open-source datasets
  • cross-dataset mining
  • spatiotemporal data mining
  • statistical machine learning
  • bioinformatics
  • distribution shift

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information is available in MDPI's Special Issue policies.

Published Papers (10 papers)

Research

14 pages, 1769 KiB  
Article
Next Arrival and Destination Prediction via Spatiotemporal Embedding with Urban Geography and Human Mobility Data
by Pengjiang Li, Zaitian Wang, Xinhao Zhang, Pengfei Wang and Kunpeng Liu
Mathematics 2025, 13(5), 746; https://doi.org/10.3390/math13050746 - 25 Feb 2025
Viewed by 495
Abstract
With the development of transportation networks, vast amounts of trajectory data have accumulated, and understanding human mobility from traffic data could be helpful for smart cities, urban computing, and urban planning. Extracting valuable insights from traffic data, such as taxi trajectories, can significantly improve residents’ daily lives. There are many studies on spatiotemporal data mining. Arrival prediction and regional function detection are important tasks for traffic management and urban planning. However, trajectory data are often incomplete because of personal privacy and hardware limitations, i.e., we usually can only obtain partial trajectory information. In this paper, we develop an embedding method to predict the next arrival using the origin–destination (O-D) pair trajectory information and point of interest (POI) data. Moreover, the embedding information contains region latent features; thus, we also detect the regional function in this paper. Finally, we conduct a comprehensive experimental study on a real-world trajectory dataset. The experimental results demonstrate the benefit of the method for predicting arrivals and show that the embedding vectors can detect the regional function in a city.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)

17 pages, 3763 KiB  
Article
Graph-Based Feature Crossing to Enhance Recommender Systems
by Congyu Cai, Hong Chen, Yunxuan Liu, Daoquan Chen, Xiuze Zhou and Yuanguo Lin
Mathematics 2025, 13(2), 302; https://doi.org/10.3390/math13020302 - 18 Jan 2025
Viewed by 973
Abstract
In recommendation tasks, most existing models that learn users’ preferences from user–item interactions ignore the relationships between items. Additionally, ensuring that the crossed features capture both global graph structures and local context is non-trivial, requiring innovative techniques for multi-scale representation learning. To overcome these difficulties, we develop a novel neural network, CoGraph, which uses a graph to build the relations between items. The item co-occurrence pattern assumes that certain items consistently appear in pairs in users’ viewing or consumption logs. First, to learn relationships between items, a graph whose distance is measured by Normalised Point-Wise Mutual Information (NPMI) is applied to link items for the co-occurrence pattern. Then, to learn as many useful features as possible for higher recommendation quality, a Convolutional Neural Network (CNN) and the Transformer model are used in parallel to learn local and global feature interactions. Finally, a series of comprehensive experiments were conducted on several public data sets to show the performance of our model. The results provide valuable insights into the capability of our model in recommendation tasks and offer a viable pathway for public data operations.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)
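
The abstract above links items through a graph weighted by Normalised Point-Wise Mutual Information (NPMI). As a rough illustration of that one ingredient, the sketch below computes NPMI for item pairs from co-occurrence logs using the standard definition NPMI(a, b) = PMI(a, b) / (-log p(a, b)). The function name `npmi_edges` and the toy session format are assumptions made for illustration; they are not taken from the CoGraph paper, and how the authors turn NPMI into a graph distance is not covered here.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(sessions):
    """Return {(item_a, item_b): NPMI} for every pair that co-occurs at least once.

    `sessions` is a list of item lists (hypothetical viewing or consumption logs).
    """
    n = len(sessions)
    item_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        items = sorted(set(session))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    edges = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        if p_ab == 1.0:
            # Both items appear in every session: NPMI is taken as its upper limit.
            edges[(a, b)] = 1.0
            continue
        pmi = math.log(p_ab / ((item_counts[a] / n) * (item_counts[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)
    return edges
```

NPMI ranges from -1 to 1: pairs that only ever appear together score 1, and statistically independent pairs score 0. Pairs that never co-occur are simply omitted from the result in this sketch.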

13 pages, 813 KiB  
Article
RODA-OOD: Robust Domain Adaptation from Out-of-Distribution Data
by Jaekyun Jeong, Mangyu Lee, Sunguk Yun, Keejun Han and Jungeun Kim
Mathematics 2024, 12(24), 3895; https://doi.org/10.3390/math12243895 - 10 Dec 2024
Viewed by 907
Abstract
Domain adaptation aims to effectively learn from two domains with different distributions, solving labeling problems; however, traditional methods assume that the source and target data are in-distribution data that share the same labels. In practice, Out-Of-Distribution (OOD) data which do not share labels with the existing data may also be collected during the target data collection process. These OOD data introduce noise and confusion, leading to decreased performance during adaptation. To address this issue, we propose RObust Domain Adaptation from Out-Of-Distribution data (RODA-OOD), a novel method based on data-centric AI principles that focuses on improving data quality rather than refining model architecture. RODA-OOD utilizes the characteristics of deep learning models that prioritize learning in-distribution data, which are easier to train on compared to OOD data. By dynamically adjusting the threshold for OOD detection, the proposed method effectively filters out OOD data, allowing the model to focus on relevant target data. RODA-OOD was compared with competitor and original domain adaptation algorithms based on target data accuracy. The results show that RODA-OOD demonstrates the most robust performance against OOD data, achieving a 21.3% increase in accuracy compared to existing domain adaptation methods. Thus, RODA-OOD can provide a solution to the OOD issue in unsupervised domain adaptation.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)

17 pages, 2220 KiB  
Article
Robust Bias Compensation Method for Sparse Normalized Quasi-Newton Least-Mean with Variable Mixing-Norm Adaptive Filtering
by Ying-Ren Chien, Han-En Hsieh and Guobing Qian
Mathematics 2024, 12(9), 1310; https://doi.org/10.3390/math12091310 - 25 Apr 2024
Cited by 1 | Viewed by 1116
Abstract
Input noise causes inescapable bias to the weight vectors of the adaptive filters during the adaptation processes. Moreover, the impulse noise at the output of the unknown systems can prevent bias compensation from converging. This paper presents a robust bias compensation method for a sparse normalized quasi-Newton least-mean (BC-SNQNLM) adaptive filtering algorithm to address these issues. We have mathematically derived the bias-compensation terms for an environment with impulse noise. Inspired by the convex combination of adaptive filters’ step sizes, we propose a novel variable mixing-norm method, BC-SNQNLM-VMN, to accelerate the convergence of our BC-SNQNLM algorithm. Simulation results confirm that the proposed method significantly outperforms comparable methods in terms of normalized mean-squared deviation (NMSD) in the steady state.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)

20 pages, 714 KiB  
Article
TabFedSL: A Self-Supervised Approach to Labeling Tabular Data in Federated Learning Environments
by Ruixiao Wang, Yanxin Hu, Zhiyu Chen, Jianwei Guo and Gang Liu
Mathematics 2024, 12(8), 1158; https://doi.org/10.3390/math12081158 - 12 Apr 2024
Cited by 2 | Viewed by 1441
Abstract
Self-supervised learning has proven effective in addressing data labeling issues. Its success mainly depends on having access to large, high-quality datasets with diverse features. It also relies on utilizing the spatial, temporal, and semantic structures present in the data. However, domains such as finance, healthcare, and insurance primarily utilize tabular data formats. This presents challenges for traditional data augmentation methods aimed at improving data quality. Furthermore, the privacy-sensitive nature of these domains complicates the acquisition of the extensive, high-quality datasets necessary for training effective self-supervised models. To tackle these challenges, our proposal introduces a novel framework that combines self-supervised learning with Federated Learning (FL). This approach aims to solve the problem of data-distributed training while ensuring training quality. Our framework improves upon the conventional self-supervised learning data augmentation paradigm by incorporating data labeling through the segmentation of data into subsets. Our framework adds noise by splitting subsets of data and can match the level of centralized learning in a distributed environment. Moreover, we conduct experiments on various public tabular datasets to evaluate our approach. The experimental results showcase the effectiveness and generalizability of our proposed method in scenarios involving unlabeled data and distributed settings.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)

24 pages, 14284 KiB  
Article
Mask2Former with Improved Query for Semantic Segmentation in Remote-Sensing Images
by Shichen Guo, Qi Yang, Shiming Xiang, Shuwen Wang and Xuezhi Wang
Mathematics 2024, 12(5), 765; https://doi.org/10.3390/math12050765 - 4 Mar 2024
Cited by 8 | Viewed by 4574
Abstract
Semantic segmentation of remote sensing (RS) images is vital in various practical applications, including urban construction planning, natural disaster monitoring, and land resources investigation. However, RS images are captured by airplanes or satellites at high altitudes and long distances, resulting in ground objects of the same category being scattered in various corners of the image. Moreover, objects of different sizes appear simultaneously in RS images. For example, some objects occupy a large area in urban scenes, while others only have small regions. Technically, the above two universal situations pose significant challenges to high-quality segmentation of RS images. Based on these observations, this paper proposes a Mask2Former with an improved query (IQ2Former) for this task. The fundamental motivation behind the IQ2Former is to enhance the capability of the query of Mask2Former by exploiting the characteristics of RS images well. First, we propose the Query Scenario Module (QSM), which aims to learn and group the queries from feature maps, allowing the selection of distinct scenarios such as the urban and rural areas, building clusters, and parking lots. Second, we design the query position module (QPM), which is developed to assign the image position information to each query without increasing the number of parameters, thereby enhancing the model’s sensitivity to small targets in complex scenarios. Finally, we propose the query attention module (QAM), which is constructed to leverage the characteristics of query attention to extract valuable features from the preceding queries. Being positioned between the duplicated transformer decoder layers, QAM ensures the comprehensive utilization of the supervisory information and the exploitation of those fine-grained details. Architecturally, the QSM, QPM, and QAM as well as an end-to-end model are assembled to achieve high-quality semantic segmentation.
In comparison to the classical or state-of-the-art models (FCN, PSPNet, DeepLabV3+, OCRNet, UPerNet, MaskFormer, Mask2Former), IQ2Former has demonstrated exceptional performance across three challenging public remote-sensing image datasets: 83.59 mIoU on the Vaihingen dataset, 87.89 mIoU on the Potsdam dataset, and 56.31 mIoU on the LoveDA dataset. Additionally, overall accuracy, ablation experiments, and segmentation visualizations all support the validity of IQ2Former.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)

20 pages, 854 KiB  
Article
A Community Detection and Graph-Neural-Network-Based Link Prediction Approach for Scientific Literature
by Chunjiang Liu, Yikun Han, Haiyun Xu, Shihan Yang, Kaidi Wang and Yongye Su
Mathematics 2024, 12(3), 369; https://doi.org/10.3390/math12030369 - 24 Jan 2024
Cited by 4 | Viewed by 3729
Abstract
This study presents a novel approach that synergizes community detection algorithms with various Graph Neural Network (GNN) models to bolster link prediction in scientific literature networks. By integrating the Louvain community detection algorithm into our GNN frameworks, we consistently enhanced the performance across all models tested. For example, integrating the Louvain model with the GAT model resulted in an AUC score increase from 0.777 to 0.823, exemplifying the typical improvements observed. Similar gains were noted when the Louvain model was paired with other GNN architectures, confirming the robustness and effectiveness of incorporating community-level insights. This consistent increase in performance, reflected in our extensive experimentation on bipartite graphs of scientific collaborations and citations, highlights the synergistic potential of combining community detection with GNNs to overcome common link prediction challenges such as scalability and resolution limits. Our findings advocate for the integration of community structures as a significant step forward in the predictive accuracy of network science models, offering a comprehensive understanding of scientific collaboration patterns through the lens of advanced machine learning techniques.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)

22 pages, 6195 KiB  
Article
Enhanced Sea Horse Optimization Algorithm for Hyperparameter Optimization of Agricultural Image Recognition
by Zhuoshi Li, Shizheng Qu, Yinghang Xu, Xinwei Hao and Nan Lin
Mathematics 2024, 12(3), 368; https://doi.org/10.3390/math12030368 - 23 Jan 2024
Cited by 2 | Viewed by 1770
Abstract
Deep learning technology has made significant progress in agricultural image recognition tasks, but the parameter adjustment of deep models usually requires a lot of manual intervention, which is time-consuming and inefficient. To solve this challenge, this paper proposes an adaptive parameter tuning strategy that combines the sine–cosine algorithm with Tent chaotic mapping to enhance sea horse optimization, which improves the search ability and convergence stability of the standard sea horse optimization (SHO) algorithm. Through adaptive optimization, this paper determines the best parameter configuration for the ResNet-50 neural network and optimizes the model performance. The improved ESHO algorithm outperforms other algorithms across various performance indicators. The improved model achieves 96.7% accuracy in the corn disease image recognition task, and 96.4% accuracy in the jade fungus image recognition task. These results show that ESHO can not only effectively improve the accuracy of agricultural image recognition, but also reduce the need for manual parameter adjustment.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)
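
Tent chaotic mapping, mentioned in the abstract above, is commonly used to spread an optimizer's initial population more evenly over the search space. A minimal sketch of that idea follows, assuming a skewed tent map with breakpoint `a = 0.7`. The function names, the breakpoint, the seed value `x0`, and the example bounds are illustrative assumptions; the exact variant used in ESHO may differ.

```python
def tent_sequence(x0, length, a=0.7):
    """Generate `length` values in [0, 1] by iterating a skewed tent map."""
    values = []
    x = x0
    for _ in range(length):
        # Rising branch below the breakpoint, falling branch at or above it.
        x = x / a if x < a else (1.0 - x) / (1.0 - a)
        values.append(x)
    return values


def init_population(size, bounds, x0=0.3):
    """Map one tent sequence onto per-dimension (low, high) bounds.

    Returns `size` candidate vectors with one coordinate per entry in `bounds`.
    """
    dims = len(bounds)
    chaos = tent_sequence(x0, size * dims)
    return [
        [low + chaos[i * dims + d] * (high - low) for d, (low, high) in enumerate(bounds)]
        for i in range(size)
    ]
```

For instance, `init_population(5, [(1e-4, 1e-1), (16, 128)])` would yield five candidate pairs for two hypothetical hyperparameters (say, a learning rate and a batch size), each inside its bounds.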

25 pages, 5269 KiB  
Article
Application of Artificial Intelligence Methods for Predicting the Compressive Strength of Green Concretes with Rice Husk Ash
by Miljan Kovačević, Marijana Hadzima-Nyarko, Ivanka Netinger Grubeša, Dorin Radu and Silva Lozančić
Mathematics 2024, 12(1), 66; https://doi.org/10.3390/math12010066 - 24 Dec 2023
Cited by 5 | Viewed by 2020
Abstract
To promote sustainable growth and minimize the greenhouse effect, rice husk fly ash can be used instead of a certain amount of cement. The research models the effects of using rice fly ash as a substitute for regular Portland cement on the compressive strength of concrete. In this study, different machine-learning techniques are investigated and a procedure to determine the optimal model is provided. A database of 909 analyzed samples forms the basis for creating forecast models. The derived models are assessed using the accuracy criteria RMSE, MAE, MAPE, and R. The research shows that artificial intelligence techniques can be used to model the compressive strength of concrete with acceptable accuracy. It is also possible to evaluate the importance of specific input variables and their influence on the strength of such concrete.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)
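
The accuracy criteria named in the abstract above (RMSE, MAE, MAPE, and R) can be computed as in the sketch below. It assumes that R denotes the Pearson correlation coefficient between measured and predicted values and that MAPE is reported in percent. The function name and the sample numbers in the usage note are made up for illustration and are not results from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent), and Pearson's R for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when a true value is zero; this sketch assumes none are.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)

    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R": r}
```

For example, `regression_metrics([10, 20, 40], [12, 18, 44])` gives an RMSE of about 2.83, an MAE of about 2.67, and a MAPE of about 13.3%.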

25 pages, 26532 KiB  
Article
Statistical Image Watermark Algorithm for FAPHFMs Domain Based on BKF–Rayleigh Distribution
by Siyu Yang, Ansheng Deng and Hui Cui
Mathematics 2023, 11(23), 4720; https://doi.org/10.3390/math11234720 - 21 Nov 2023
Cited by 1 | Viewed by 1649
Abstract
In the field of image watermarking, imperceptibility, robustness, and watermarking capacity are key indicators for evaluating the performance of watermarking techniques. However, these three factors are often mutually constrained, posing a challenge in achieving a balance among them. To address this issue, this paper presents a novel image watermark detection algorithm based on local fast and accurate polar harmonic Fourier moments (FAPHFMs) and the BKF–Rayleigh distribution model. Firstly, the original image is divided into non-overlapping blocks, the entropy value of each block is calculated, the high-entropy blocks are selected in descending order, and the local FAPHFM magnitudes are calculated. Secondly, the watermarking signals are embedded into the robust local FAPHFM magnitudes by the multiplication function, and then MMLE based on the RSS method is utilized to estimate the statistical parameters of the BKF–Rayleigh distribution model. Finally, a blind image watermarking detector is designed using BKF–Rayleigh distribution and LO decision criteria. In addition, we derive the closed-form expression of the watermark detector using the BKF–Rayleigh model. The experiments show that the proposed algorithm outperforms the existing methods in terms of performance, maintains robustness well under a large watermarking capacity, and has excellent imperceptibility at the same time. The algorithm maintains a well-balanced relationship between robustness, imperceptibility, and watermarking capacity.
(This article belongs to the Special Issue Advanced Research in Data-Centric AI)
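
The first step described in the abstract above, splitting the image into non-overlapping blocks and ranking them by entropy, can be sketched as follows. This is only an illustration under simple assumptions: the image is a 2-D list of integer pixel values, entropy is the Shannon entropy of each block's pixel histogram, and the function names are hypothetical. The FAPHFM computation and the watermark embedding themselves are not shown.

```python
import math
from collections import Counter


def block_entropy(block):
    """Shannon entropy (in bits) of the pixel-value histogram of one block."""
    pixels = [v for row in block for v in row]
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(pixels).values())


def top_entropy_blocks(image, size, k):
    """Split a 2-D image into non-overlapping size x size blocks and return the
    (row, col) origins of the k highest-entropy blocks, highest first."""
    rows, cols = len(image), len(image[0])
    scored = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            block = [row[c:c + size] for row in image[r:r + size]]
            scored.append((block_entropy(block), (r, c)))
    # Sort by entropy only, in descending order; ties keep scan order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [origin for _, origin in scored[:k]]
```

High-entropy blocks contain more texture, which is a common reason for preferring them as embedding locations: changes there tend to be less visible than in flat regions.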