Review

Data Preprocessing and Feature Engineering for Data Mining: Techniques, Tools, and Best Practices

by Paraskevas Koukaras and Christos Tjortjis *
School of Science and Technology, International Hellenic University, 14th km Thessaloniki-Moudania, 57001 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
AI 2025, 6(10), 257; https://doi.org/10.3390/ai6100257
Submission received: 22 August 2025 / Revised: 27 September 2025 / Accepted: 29 September 2025 / Published: 2 October 2025

Abstract

Data preprocessing and feature engineering play key roles in data mining initiatives, as they significantly affect the accuracy, reproducibility, and interpretability of analytical results. This review analyses state-of-the-art techniques and tools for preparing and transforming input data for mining tasks across diverse application scenarios. It covers basic preprocessing techniques, including data cleaning, normalisation, and encoding, as well as more sophisticated approaches to feature construction, selection, and dimensionality reduction. Both manual and automated methods are considered, with emphasis on their integration into reproducible, large-scale pipelines built on modern libraries. We also discuss methods for assessing how preprocessing affects model precision, stability, and bias–variance trade-offs, as well as for monitoring pipeline integrity when operating environments vary. We focus on emerging issues of scalability, fairness, and interpretability, and on future directions involving adaptive preprocessing and automation guided by ethically sound design philosophies. This work aims to benefit both professionals and researchers by shedding light on best practices, while acknowledging open research questions and opportunities for innovation.
Keywords: data preprocessing; feature engineering; data mining; machine learning; data cleaning; feature selection; dimensionality reduction; pipeline automation; AutoML; PyCaret; explainable preprocessing; AI
