Data Preprocessing and Feature Engineering for Data Mining: Techniques, Tools, and Best Practices
Abstract
1. Introduction
1.1. Role of Preprocessing in DM
1.2. Motivating Examples and Performance Impact
1.3. Objectives of This Review
- 1. Develop a unifying framework: Propose a context-specific taxonomy that integrates preprocessing and feature engineering as systematic phases (cleaning, transformation, construction, selection, reduction), while incorporating decision variables such as dataset dimensions, interpretability, and computational resources.
- 2. Evaluate techniques critically: Provide comparative assessments of imputation, encoding, feature selection, and DR techniques, including associated risks, trade-offs, and documented failure cases.
- 3. Describe practical heuristics: Specify numerical rules of thumb (e.g., sample-size requirements for autoencoders and variance-filter thresholds) and clarify common pipeline design choices (e.g., fitting scalers on training data only) to enhance reproducibility and prevent data leakage.
- 4. Codify best practices: Establish definitive design patterns and procedural guidelines for direct implementation by practitioners, covering leakage management, CV, fairness auditing, and monitoring in realistic deployment scenarios.
1.4. Scope and Structure of the Review
1.5. Novelty and Contribution
- 1. We propose a unifying framework (Section 2) that systematically organises data preprocessing and feature engineering into five pipeline stages (cleaning, transformation, construction, selection, reduction), while explicitly linking them to decision criteria (dataset size, interpretability, domain constraints, computational resources).
- 2. We provide comparative tables and method-level evaluations that highlight trade-offs, risks, and common failure cases across imputation, encoding, selection, and DR.
- 3. We highlight engineering best practices, including serialisation, experiment tracking, and monitoring for drift and pipeline health, so that methodologies transfer reliably from notebooks to production environments.
2. Conceptual Framework: A Context-Aware Taxonomy of Data Preprocessing and Feature Engineering
3. Data Cleaning and Transformation
3.1. Handling Missing Values
3.2. Outlier Detection and Correction
Evaluation Protocol for Outlier Handling
- Testing content and reporting template.
- 1. Data splitting. Use k-fold (or repeated k-fold) CV; detectors and downstream models are fit within folds and evaluated on held-out folds [44].
- 2. Candidates. Compare retain, transform (e.g., log/winsorise), and remove; record detector thresholds and treatment parameters.
- 3. Metrics. Record the cross-validated task metric, calibration (ECE), influence diagnostics, and subgroup parity for each candidate.
- 4. Decision rule. Select the least-complex treatment that improves the task metric by ≥ one standard error without increasing ECE by >0.01 and without worsening subgroup parity.
- 5. Reporting. Provide fold-wise scores, chosen thresholds/parameters, and a one-line rationale.
- Unified decision rule (retain/transform/remove).
- 1. Retain (with flagging/robust loss) if influence diagnostics are low (e.g., leverage/Cook’s D below conventional cut-offs) and removal degrades calibration or subgroup parity (Table 3).
- 2. Transform (e.g., winsorise or log) if the transformed option improves the cross-validated task metric by at least one standard error over retain, with no material deterioration in calibration (e.g., ECE increase ≤ 0.01) or fairness (absolute parity delta non-increasing).
- 3. Remove only if removal improves the cross-validated task metric by at least one standard error over both retain and transform, while not worsening calibration or fairness (as above). Report the criteria and selected option; a minimal fold-wise sketch of this comparison follows the list.
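The fold-wise comparison of treatments can be expressed compactly in scikit-learn. The code below is a minimal sketch, not the exact protocol of this review: the synthetic data, the 1st/99th-percentile winsorisation limits, and the logistic-regression model are illustrative assumptions, and the calibration/parity checks would follow the same fold-wise pattern.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Winsoriser(BaseEstimator, TransformerMixin):
    """Clip each feature at quantiles estimated on the training fold only."""

    def __init__(self, lower=0.01, upper=0.99):  # illustrative limits
        self.lower, self.upper = lower, upper

    def fit(self, X, y=None):
        self.lo_ = np.quantile(X, self.lower, axis=0)
        self.hi_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)


X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[::50] *= 8  # inject a few extreme values into the synthetic data

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "retain": Pipeline([("scale", StandardScaler()),
                        ("clf", LogisticRegression(max_iter=1000))]),
    "winsorise": Pipeline([("clip", Winsoriser()),
                           ("scale", StandardScaler()),
                           ("clf", LogisticRegression(max_iter=1000))]),
}
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    print(f"{name:>10}: AUC = {scores.mean():.3f} ± {scores.std():.3f} (SE = {se:.3f})")
# Decision rule: pick the least-complex treatment whose mean score improves by
# at least one standard error, then verify calibration and subgroup parity.
```

Because the winsoriser is a pipeline step, its quantiles are re-estimated on each training fold, which is exactly the leakage guardrail the protocol requires.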
3.3. Normalisation and Scaling
Why Fit Scalers Only on Training Data
3.4. Encoding Categorical Variables
4. Feature Construction and Selection
4.1. Manual vs. Automated Construction
Decision Framework
4.2. Domain-Driven Feature Creation
4.3. Filter, Wrapper, and Embedded Selection Methods
4.4. Correlation, Mutual Information, and Variance Thresholds
Interrelationship with DR
- 1. When to prefer selection → reduction. If interpretability, regulatory traceability, or stable feature semantics are required, prune redundancy first (filters/embedded in Section 4.3 and Section 4.4) and apply DR thereafter to compress residual collinearity. This retains named features while still curbing variance.
- 2. When to prefer reduction → selection. In ultra-high-dimensional, sparse regimes (e.g., one-hot with many rare levels), apply a lightweight DR step (e.g., PCA or hashing-based sketches) to mitigate dimensionality, and then perform wrapper/embedded selection on the compressed representation.
- 3. Order-of-operations ablation. Evaluate five candidates under the same CV split: (i) baseline, (ii) selection-only, (iii) DR-only, (iv) selection→DR, (v) DR→selection; pick the least-complex option that improves task metrics without harming calibration/stability (a minimal sketch of this ablation appears after this list).
- 4. Guardrails. Fit every statistic within folds (selection criteria, DR components) and apply to hold-outs only; report component counts and selected features per fold. When DR reduces semantic transparency (e.g., neural embeddings), justify with performance/calibration gains and provide post hoc projections where possible.
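As noted in item 3, the ablation can be run under a single CV split. The sketch below assumes a generic tabular classification task; the feature counts (k) and component numbers are illustrative rather than recommended values, and all statistics are fitted inside training folds as required by item 4.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=60, n_informative=10,
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def clf():
    return LogisticRegression(max_iter=2000)

candidates = {
    "baseline":       Pipeline([("sc", StandardScaler()), ("m", clf())]),
    "selection-only": Pipeline([("sc", StandardScaler()),
                                ("sel", SelectKBest(mutual_info_classif, k=20)),
                                ("m", clf())]),
    "DR-only":        Pipeline([("sc", StandardScaler()),
                                ("pca", PCA(n_components=20)), ("m", clf())]),
    "selection->DR":  Pipeline([("sc", StandardScaler()),
                                ("sel", SelectKBest(mutual_info_classif, k=30)),
                                ("pca", PCA(n_components=15)), ("m", clf())]),
    "DR->selection":  Pipeline([("sc", StandardScaler()),
                                ("pca", PCA(n_components=30)),
                                ("sel", SelectKBest(mutual_info_classif, k=15)),
                                ("m", clf())]),
}
# MI scores and PCA components are re-fitted within each training fold,
# so the comparison itself is leakage-safe.
for name, pipe in candidates.items():
    s = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    print(f"{name:>14}: {s.mean():.3f} ± {s.std():.3f}")
```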
5. DR Techniques
5.1. Principal Component Analysis (PCA)
5.2. Autoencoders and Neural Embeddings
5.3. Manifold Learning Methods
6. Preprocessing Pipelines and Automation Tools
6.1. Pipeline Design in Scikit-Learn and PyCaret
6.2. DataPrep, Featuretools, and Automated Feature Engineering (AutoFE)
6.3. Pipeline Serialisation and Reproducibility
7. Evaluation of Preprocessing Impact
7.1. Measuring Improvements in Model Accuracy and Stability
7.2. Bias–Variance Implications
- Protocol (controlling confounders).
7.3. Interaction with Downstream Mining Tasks
8. Open-Source Libraries and Frameworks
8.1. Scikit-Learn, PyCaret, and MLJ
8.2. AutoML Integration Tools
8.3. Jupyter Workflows and Experimentation Environments
9. Best Practices and Design Patterns
9.1. Data Splitting and Leakage Control
9.2. Preprocessing Order
- 1. Data cleaning (error detection, type correction);
- 2. Missing-value imputation;
- 3. Categorical encoding;
- 4. Scaling/normalisation;
- 5. Feature construction;
- 6. Feature selection;
- 7. Model training.
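This ordering maps directly onto a scikit-learn Pipeline/ColumnTransformer. The sketch below is a minimal illustration in which the column names, the selected k, and the gradient-boosting model are hypothetical placeholders; step 1 (cleaning) is assumed to happen upstream, and step 5 (feature construction) would be an additional transformer inserted before selection.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]          # hypothetical numeric columns
categorical = ["region", "segment"]  # hypothetical categorical columns

preprocess = ColumnTransformer([
    # steps 2 and 4: impute, then scale numeric features
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # steps 2 and 3: impute, then encode categorical features
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(mutual_info_classif, k=10)),  # step 6; k must not exceed the encoded feature count
    ("model", HistGradientBoostingClassifier()),          # step 7
])
# All statistics are fitted on the training data only when calling
# pipeline.fit(X_train, y_train); pipeline.predict(X_test) reuses them.
```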
9.3. Evaluation Protocols and Ablation
9.4. Logging and Reproducibility
9.5. Fairness and Transparency
9.6. Automation Trade-Offs
9.7. Monitoring and Drift Response
10. Challenges and Future Directions
10.1. Scalability in Large and Heterogeneous Datasets
10.2. Fairness, Explainability, and Ethical Considerations
10.3. Towards More Adaptive and Intelligent Preprocessing
Technical Implementation Path
11. Conclusions and Future Directions
11.1. Summary of Contributions
11.2. Best-Practice Guidelines
- Leakage control: Fit scalers and encoders on training folds only; apply, but never refit, these transformations on validation/test data.
- Validation protocols: Use stratified or blocked CV as appropriate for the domain; report the mean and variance over folds, together with learning curves.
- Fairness and interpretability: Track subgroup-specific metrics and prefer transparent, interpretable methods in regulated environments.
- Hybrid design: Combine automated feature discovery with expert-crafted features to balance scalability and interpretability.
- Reproducibility and monitoring: Version preprocessing artefacts, retain all fitted transformations, and track feature distributions for drift (a minimal drift-check sketch follows this list).
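As referenced in the final guideline, feature-distribution drift can be tracked with a population stability index (PSI). The sketch below is a minimal, single-feature illustration; the 10-bin layout and the 0.2 alert threshold are common heuristics, not values prescribed by this review.

```python
import numpy as np

def psi(reference, current, bins=10):
    """PSI between a reference (training) sample and a current (live) sample."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)       # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)
live_feature = rng.normal(0.4, 1.2, 5000)          # simulated drift
score = psi(train_feature, live_feature)
print(f"PSI = {score:.3f}; consider retraining/recalibrating if PSI > 0.2")
```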
11.3. Research Roadmap
- 1. Adaptive preprocessing systems: Pipelines whose imputation, scaling, and encoding plans are re-tuned dynamically as data drift.
- 2. Unified benchmarking: Comprehensive, open benchmark datasets for comparing preprocessing methods systematically across domains.
- 3. Fairness-aware preprocessing: Integrating subgroup audits, debiasing techniques, and interpretability modules as first-class pipeline components.
- 4. Automated decision support: Generalising flowchart-based systems into recommender systems that suggest preprocessing actions based on dataset characteristics.
- 5. Cross-domain transferability: Preprocessing components that generalise across tasks and data modalities, minimising dependence on custom-tailored designs.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| AutoFE | Automated Feature Engineering |
| AutoML | Automated Machine Learning |
| CV | Cross-Validation |
| DBSCAN | Density-Based Spatial Clustering of Applications with Noise |
| DFS | Deep Feature Synthesis |
| DM | Data Mining |
| DVC | Data Version Control |
| EM | Expectation–Maximisation |
| IQR | Interquartile Range |
| JSON | JavaScript Object Notation |
| k-NN | k-Nearest Neighbours |
| LOF | Local Outlier Factor |
| LASSO | Least Absolute Shrinkage and Selection Operator |
| MAR | Missing At Random |
| MCAR | Missing Completely At Random |
| MNAR | Missing Not At Random |
| ML | Machine Learning |
| NaN | Not a Number |
| ONNX | Open Neural Network Exchange |
| PCA | Principal Component Analysis |
| RFE | Recursive Feature Elimination |
| RMSE | Root Mean Square Error |
| SVD | Singular Value Decomposition |
| TPOT | Tree-based Pipeline Optimisation Tool |
| t-SNE | t-Distributed Stochastic Neighbour Embedding |
| UMAP | Uniform Manifold Approximation and Projection |
| VAE | Variational Autoencoder |
| YAML | YAML Ain’t Markup Language |
References
- García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer: Cham, Switzerland, 2015; Volume 72, ISBN 978-3-319-10246-7. [Google Scholar] [CrossRef]
- Kuhn, M.; Johnson, K. Data Pre-Processing. In Applied Predictive Modeling; Springer: New York, NY, USA, 2013; pp. 27–59. [Google Scholar] [CrossRef]
- Kotsiantis, S.; Kanellopoulos, D.; Pintelas, P. Data Preprocessing for Supervised Learning. Int. J. Comput. Sci. 2006, 1, 111–117. [Google Scholar]
- Dwivedi, S.K.; Rawat, B. A review paper on data preprocessing: A critical phase in web usage mining process. In Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), Greater Noida, India, 8–10 October 2015; pp. 506–510. [Google Scholar] [CrossRef]
- Caton, S.; Haas, C. Fairness in Machine Learning: A Survey. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar] [CrossRef]
- Cabello-Solorzano, K.; Ortigosa de Araujo, I.; Peña, M.; Correia, L.; Tallón-Ballesteros, A.J. The Impact of Data Normalization on the Accuracy of Machine Learning Algorithms: A Comparative Analysis. In Proceedings of the 18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023), Salamanca, Spain, 5–7 September 2023; García Bringas, P., Pérez García, H., Martínez de Pisón, F.J., Martínez Álvarez, F., Troncoso Lora, A., Herrero, Á., Calvo Rolle, J.L., Quintián, H., Corchado, E., Eds.; Springer: Cham, Switzerland, 2023; pp. 344–353, ISBN 9783031425356. [Google Scholar]
- Jerez, J.M.; Molina, I.; García-Laencina, P.J.; Alba, E.; Ribelles, N.; Martín, M.; Franco, L. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif. Intell. Med. 2010, 50, 105–115. [Google Scholar] [CrossRef]
- Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524. [Google Scholar] [CrossRef]
- Nargesian, F.; Samulowitz, H.; Khurana, U.; Khalil, E.B.; Turaga, D. Learning feature engineering for classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, Melbourne, Australia, 19–25 August 2017; AAAI Press: Washington, DC, USA, 2017; pp. 2529–2535. [Google Scholar] [CrossRef]
- Bilal, M.; Ali, G.; Iqbal, M.W.; Anwar, M.; Malik, M.S.A.; Kadir, R.A. Auto-Prep: Efficient and Automated Data Preprocessing Pipeline. IEEE Access 2022, 10, 107764–107784. [Google Scholar] [CrossRef]
- Aragão, M.V.C.; Afonso, A.G.; Ferraz, R.C.; Ferreira, R.G.; Leite, S.G.; Figueiredo, F.A.P.d.; Mafra, S. A practical evaluation of automl tools for binary, multiclass, and multilabel classification. Sci. Rep. 2025, 15, 17682. [Google Scholar] [CrossRef]
- Eldeeb, H.; Maher, M.; Elshawi, R.; Sakr, S. AutoMLBench: A comprehensive experimental evaluation of automated machine learning frameworks. Expert Syst. Appl. 2024, 243, 122877. [Google Scholar] [CrossRef]
- Gardner, W.; Winkler, D.A.; Alexander, D.L.J.; Ballabio, D.; Muir, B.W.; Pigram, P.J. Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models. J. Vac. Sci. Technol. A 2023, 41, 063204. [Google Scholar] [CrossRef]
- Kapoor, S.; Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 2023, 4, 100804. [Google Scholar] [CrossRef]
- Tawakuli, A.; Havers, B.; Gulisano, V.; Kaiser, D.; Engel, T. Survey:Time-series data preprocessing: A survey and an empirical analysis. J. Eng. Res. 2025, 13, 674–711. [Google Scholar] [CrossRef]
- Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
- Afkanpour, M.; Hosseinzadeh, E.; Tabesh, H. Identify the most appropriate imputation method for handling missing values in clinical structured datasets: A systematic review. BMC Med. Res. Methodol. 2024, 24, 188. [Google Scholar] [CrossRef]
- Sun, Y.; Li, J.; Xu, Y.; Zhang, T.; Wang, X. Deep learning versus conventional methods for missing data imputation: A review and comparative study. Expert Syst. Appl. 2023, 227, 120201. [Google Scholar] [CrossRef]
- Kazijevs, M.; Samad, M.D. Deep imputation of missing values in time series health data: A review with benchmarking. J. Biomed. Inform. 2023, 144, 104440. [Google Scholar] [CrossRef]
- Casella, M.; Milano, N.; Dolce, P.; Marocco, D. Transformers deep learning models for missing data imputation: An application of the ReMasker model on a psychometric scale. Front. Psychol. 2024, 15, 1449272. [Google Scholar] [CrossRef]
- Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
- Mangussi, A.D.; Pereira, R.C.; Lorena, A.C.; Santos, M.S.; Abreu, P.H. Studying the robustness of data imputation methodologies against adversarial attacks. Comput. Secur. 2025, 157, 104574. [Google Scholar] [CrossRef]
- Hodge, J.; Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 2004, 22, 85–126. [Google Scholar] [CrossRef]
- Domingues, R.; Filippone, M.; Michiardi, P.; Zouaoui, J. A comparative evaluation of outlier detection algorithms: Experiments and analyses. Pattern Recognit. 2018, 74, 406–421. [Google Scholar] [CrossRef]
- Ahsan, M.M.; Mahmud, M.A.P.; Saha, P.K.; Gupta, K.D.; Siddique, Z. Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance. Technologies 2021, 9, 52. [Google Scholar] [CrossRef]
- Mahmud Sujon, K.; Binti Hassan, R.; Tusnia Towshi, Z.; Othman, M.A.; Abdus Samad, M.; Choi, K. When to Use Standardization and Normalization: Empirical Evidence From Machine Learning Models and XAI. IEEE Access 2024, 12, 135300–135314. [Google Scholar] [CrossRef]
- Potdar, K.; Pardawala, T.S.; Pai, C.D. A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. Int. J. Comput. Appl. 2017, 175, 7–9. [Google Scholar] [CrossRef]
- Kanter, J.M.; Veeramachaneni, K. Deep feature synthesis: Towards automating data science endeavors. In Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, 19–21 October 2015; pp. 1–10. [Google Scholar] [CrossRef]
- Katz, G.; Shin, E.; Song, D. ExploreKit: Automatic feature generation and selection. In Proceedings of the 16th IEEE International Conference on Data Mining, ICDM 2016, Barcelona, Spain, 12–15 December 2016; pp. 979–984. [Google Scholar]
- Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Baldi, P. Autoencoders, Unsupervised Learning, and Deep Architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Edinburgh, UK, 26 June–1 July 2012; Guyon, I., Dror, G., Lemaire, V., Taylor, G., Silver, D., Eds.; Proceedings of Machine Learning Research. PMLR: Bellevue, WA, USA, 2012; Volume 27, pp. 37–49. Available online: http://proceedings.mlr.press/v27/baldi12a/baldi12a.pdf (accessed on 12 July 2025).
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
- Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. arXiv 2022, arXiv:1908.09635. [Google Scholar] [CrossRef]
- Beretta, L.; Santaniello, A. Nearest neighbor imputation algorithms: A critical evaluation. BMC Med. Inform. Decis. Mak. 2016, 16, 74. [Google Scholar] [CrossRef]
- Malan, L.; Smuts, C.M.; Baumgartner, J.; Ricci, C. Missing data imputation via the expectation-maximization algorithm can improve principal component analysis aimed at deriving biomarker profiles and dietary patterns. Nutr. Res. 2020, 75, 67–76. [Google Scholar] [CrossRef]
- Nazábal, A.; Olmos, P.M.; Ghahramani, Z.; Valera, I. Handling incomplete heterogeneous data using VAEs. Pattern Recognit. 2020, 107, 107501. [Google Scholar] [CrossRef]
- Zimek, A.; Schubert, E.; Kriegel, H. A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. ASA Data Sci. J. 2012, 5, 363–387. [Google Scholar] [CrossRef]
- Souiden, I.; Omri, M.N.; Brahmi, Z. A survey of outlier detection in high dimensional data streams. Comput. Sci. Rev. 2022, 44, 100463. [Google Scholar] [CrossRef]
- Zoppi, T.; Gazzini, S.; Ceccarelli, A. Anomaly-based error and intrusion detection in tabular data: No DNN outperforms tree-based classifiers. Future Gener. Comput. Syst. 2024, 160, 951–965. [Google Scholar] [CrossRef]
- Herrmann, M.; Pfisterer, F.; Scheipl, F. A geometric framework for outlier detection in high-dimensional data. WIREs Data Min. Knowl. Discov. 2023, 13, e1491. [Google Scholar] [CrossRef]
- Aggarwal, C.C. An Introduction to Outlier Analysis. In Outlier Analysis; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–34. [Google Scholar] [CrossRef]
- Ojeda, F.M.; Jansen, M.L.; Thiéry, A.; Blankenberg, S.; Weimar, C.; Schmid, M.; Ziegler, A. Calibrating machine learning approaches for probability estimation: A comprehensive comparison. Stat. Med. 2023, 42, 5451–5478. [Google Scholar] [CrossRef] [PubMed]
- Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995; Available online: https://api.semanticscholar.org/CorpusID:2702042 (accessed on 21 July 2025).
- Divya, D.; Babu, S.S. Methods to detect different types of outliers. In Proceedings of the 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), Ernakulam, India, 16–18 March 2016; pp. 23–28. [Google Scholar] [CrossRef]
- Edwards, C.; Raskutti, B. The Effect of Attribute Scaling on the Performance of Support Vector Machines. In Proceedings of the AI 2004: Advances in Artificial Intelligence, Cairns, Australia, 4–6 December 2004; Webb, G.I., Yu, X., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 500–512. [Google Scholar] [CrossRef]
- Chen, H.; Zhang, H.; Si, S.; Li, Y.; Boning, D.; Hsieh, C.J. Robustness Verification of Tree-based Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Panić, J.; Defeudis, A.; Balestra, G.; Giannini, V.; Rosati, S. Normalization strategies in multi-center radiomics abdominal mri: Systematic review and meta-analyses. IEEE Open J. Eng. Med. Biol. 2023, 4, 67–76. [Google Scholar] [CrossRef]
- Demircioğlu, A. The effect of feature normalization methods in radiomics. Insights Into Imaging 2024, 15, 2. [Google Scholar] [CrossRef] [PubMed]
- Pargent, F.; Pfisterer, F.; Thomas, J.; Bischl, B. Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Comput. Stat. 2022, 37, 2671–2692. [Google Scholar] [CrossRef]
- Guo, C.; Berkhahn, F. Entity Embeddings of Categorical Variables. arXiv 2016, arXiv:1604.06737. [Google Scholar] [CrossRef]
- Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. 2017, 50, 1–45. [Google Scholar] [CrossRef]
- Chauhan, K.; Jani, S.; Thakkar, D.; Dave, R.; Bhatia, J.; Tanwar, S.; Obaidat, M.S. Automated Machine Learning: The New Wave of Machine Learning. In Proceedings of the 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India, 5–7 March 2020; pp. 205–212. [Google Scholar] [CrossRef]
- Horn, F.; Pack, R.; Rieger, M. The autofeat Python Library for Automated Feature Engineering and Selection. arXiv 2020, arXiv:1901.07329. [Google Scholar] [CrossRef]
- Hyndman, R.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2018; Available online: https://otexts.com/fpp2/ (accessed on 2 August 2025).
- Bienefeld, C.; Becker-Dombrowsky, F.M.; Shatri, E.; Kirchner, E. Investigation of Feature Engineering Methods for Domain-Knowledge-Assisted Bearing Fault Diagnosis. Entropy 2023, 25, 1278. [Google Scholar] [CrossRef]
- Jiménez-Cordero, A.; Maldonado, S. Automatic feature scaling and selection for support vector machine classification with functional data. Appl. Intell. 2021, 51, 161–184. [Google Scholar] [CrossRef]
- Brown, G.; Pocock, A.; Zhao, M.; Luján, M. Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection. J. Mach. Learn. Res. 2012, 13, 27–66. [Google Scholar]
- Urbanowicz, R.J.; Meeker, M.; Cava, W.L.; Olson, R.S.; Moore, J.H. Relief-based feature selection: Introduction and review. J. Biomed. Inform. 2018, 85, 189–203. [Google Scholar] [CrossRef]
- Nogueira, S.; Sechidis, K.; Brown, G. On the Stability of Feature Selection Algorithms. J. Mach. Learn. Res. 2018, 18, 1–54. [Google Scholar]
- Jolliffe, I.T.; Cadima, J. Principal Component Analysis: A Review and Recent Developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar] [CrossRef]
- van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Wattenberg, M.; Viégas, F.; Johnson, I. How to Use t-SNE Effectively. Distill 2016, 1, e2. [Google Scholar] [CrossRef]
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426. [Google Scholar] [CrossRef]
- Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.; Kwok, I.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using umap. Nat. Biotechnol. 2018, 37, 38–44. [Google Scholar] [CrossRef] [PubMed]
- Ali, M. PyCaret: An Open Source, Low-Code Machine Learning Library in Python, PyCaret Version 1.0.0. 2020. Available online: https://www.pycaret.org (accessed on 27 July 2025).
- Peng, J.; Wu, W.; Lockhart, B.; Bian, S.; Yan, J.N.; Xu, L.; Chi, Z.; Rzeszotarski, J.M.; Wang, J. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD ’21, New York, NY, USA, 20–25 June 2021; pp. 2271–2280. [Google Scholar] [CrossRef]
- Zaharia, M.A.; Chen, A.; Davidson, A.; Ghodsi, A.; Hong, S.A.; Konwinski, A.; Murching, S.; Nykodym, T.; Ogilvie, P.; Parkhe, M.; et al. Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Eng. Bull. 2018, 41, 39–45. [Google Scholar]
- Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
- de Amorim, L.B.; Cavalcanti, G.D.; Cruz, R.M. The choice of scaling technique matters for classification performance. Appl. Soft Comput. 2023, 133, 109924. [Google Scholar] [CrossRef]
- Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87. [Google Scholar] [CrossRef]
- Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Elsevier: Amsterdam, The Netherlands, 2011; ISBN 978-0-12-381479-1. [Google Scholar]
- Tan, P.N.; Steinbach, M.; Kumar, V. Introduction to Data Mining, 2 ed.; Pearson: London, UK, 2019; ISBN 9780134080284. [Google Scholar]
- Blaom, A.D.; Kiraly, F.; Lienart, T.; Simillides, Y.; Arenas, D.; Vollmer, S.J. MLJ: A Julia package for composable machine learning. J. Open Source Softw. 2020, 5, 2704. [Google Scholar] [CrossRef]
- Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.T.; Blum, M.; Hutter, F. Efficient and robust automated machine learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, NIPS’15, Montreal, QC, Canada, 7–12 December 2015; Volume 2, pp. 2755–2763. [Google Scholar]
- Olson, R.S.; Bartley, N.; Urbanowicz, R.J.; Moore, J.H. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’16, Denver, CO, USA, 20–24 July 2016; pp. 485–492. [Google Scholar] [CrossRef]
- Brugman, S. pandas-profiling: Create HTML Profiling Reports from Pandas DataFrame Objects. 2021. Available online: https://github.com/ydataai/pandas-profiling (accessed on 3 August 2025).
- Bertrand, F. sweetviz: A Pandas-Based Library to Visualise and Compare Datasets. 2023. Available online: https://github.com/fbdesignpro/sweetviz (accessed on 29 July 2025).
- dvc.org. Data Version Control—and Much More–for AI Projects. 2025. Available online: https://dvc.org/ (accessed on 21 July 2025).
- mlflow.org. MLflow—Deliver Production-Ready AI. 2025. Available online: https://mlflow.org/ (accessed on 7 August 2025).
- Barrak, A.; Eghan, E.E.; Adams, B. On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 9–12 March 2021; pp. 422–433. [Google Scholar] [CrossRef]
- Schlegel, M.; Sattler, K.U. Capturing end-to-end provenance for machine learning pipelines. Inf. Syst. 2025, 132, 102495. [Google Scholar] [CrossRef]
- Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
- Raftopoulos, G.; Fazakis, N.; Davrazos, G.; Kotsiantis, S. A Comprehensive Review and Benchmarking of Fairness-Aware Variants of Machine Learning Models. Algorithms 2025, 18, 435. [Google Scholar] [CrossRef]
- Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under Concept Drift: A Review. IEEE Trans. Knowl. Data Eng. 2019, 31, 2346–2363. [Google Scholar] [CrossRef]
- Kodakandla, N. Data drift detection and mitigation: A comprehensive mlops approach for real-time systems. Int. J. Sci. Res. Arch. 2024, 12, 3127–3139. [Google Scholar] [CrossRef]
- Lee, Y.; Lee, Y.; Lee, E.; Lee, T. Explainable Artificial Intelligence-Based Model Drift Detection Applicable to Unsupervised Environments. Comput. Mater. Contin. 2023, 76, 1701–1719. [Google Scholar] [CrossRef]
- Ramírez-Gallego, S.; Krawczyk, B.; García, S.; Wozniak, M.; Herrera, F. A survey on Data Preprocessing for Data Stream Mining: Current status and future directions. Neurocomputing 2017, 239, 39–57. [Google Scholar] [CrossRef]
- Ataei, P.; Staegemann, D. Application of microservices patterns to big data systems. J. Big Data 2023, 10, 56. [Google Scholar] [CrossRef]
- Fragkoulis, M.; Carbone, P.; Kalavri, V.; Katsifodimos, A. A survey on the evolution of stream processing systems. VLDB J. 2023, 33, 507–541. [Google Scholar] [CrossRef]
- Lipton, Z. The Mythos of Model Interpretability. Commun. ACM 2016, 61, 36–43. [Google Scholar] [CrossRef]
- Biswas, S.; Rajan, H. Fair preprocessing: Towards understanding compositional fairness of data transformers in machine learning pipeline. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 981–993. [Google Scholar] [CrossRef]

| Domain | Typical Risks | Required Safeguards |
|---|---|---|
| Healthcare | Rare-event dilution; heterogeneous coding; extreme lab values; subgroup harm | Leakage-safe imputation/encoding; robust scaling; subgroup analyses; transparent logging |
| Finance | Temporal leakage; concept drift; high-cardinality categoricals; regime shifts | Blocked/forward-chaining cross-validation (CV); out-of-fold encoders; drift monitoring; scheduled recalibration |
| Industrial IoT/Time-series | Non-stationarity; bursty missingness; sensor bias; misalignment | Windowed/context-aware imputation; detrending/denoising; per-sensor scaling; segment-wise validation |
| E-commerce (tabular) | High cardinality; seasonality; sparse interactions | Hashing; frequency/target encoders; time-aware CV; interaction features; leakage checks in promotions |
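For the temporal safeguards listed under finance and industrial time-series, forward-chaining validation can be expressed with scikit-learn's TimeSeriesSplit. The sketch below uses illustrative synthetic data and a ridge model; every preprocessing statistic (here, the scaler) is fitted only on past observations within each fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # rows in time order
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=1000)

# Each fold trains strictly on the past and validates on the future,
# so scaler means (or encoder maps) never leak backwards in time.
cv = TimeSeriesSplit(n_splits=5)
model = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
print("fold RMSE:", -scores.round(3))
```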
| Technique | Advantages | Limitations | Data Regime (Rule-of-Thumb) |
|---|---|---|---|
| Listwise Deletion | Simple; no computation | Unbiased only under MCAR; information loss; selection bias if MAR/MNAR [7] | Overall missingness ≲ 5–10% and MCAR plausibly satisfied; otherwise avoid. |
| Columnwise Deletion | Removes attributes with pervasive missingness | Discards potentially informative variables; reduces model capacity [7] | Drop a feature when missingness ≳ 40–60% and low predictive value in screening. |
| Mean/Median/ Mode Imputation | Fast; easy to implement | Shrinks variance; attenuates correlations; distorts distributions [7] | Small-n, low missingness (≲10–20%) with roughly unimodal/ symmetric features. |
| k-NN Imputation | Captures local structure; non-parametric | Sensitive to distance metric and scaling; degrades in high dimension/ sparsity [35] | (often ) with scaled features; moderate dimensionality; . |
| Regression Imputation | Preserves multivariate relations | Underestimates uncertainty; overfitting risk if deterministic [7] | Moderate n with stable relations; prefer multiple imputation when feasible. |
| Expectation–Maximisation (EM)/ Multiple Imputation | Statistically principled; models uncertainty under MAR | Model/init sensitivity; more complex to tune [36] | (often ); MAR plausible; imputations. |
| Autoencoder-based Imputation | Handles nonlinear dependencies; learns latent structure [37] | Data- and tuning-hungry; small-n can underperform simple baselines | Large-sample regimes: typically total (or ≳ per class) with sufficient capacity; otherwise prefer simpler methods. |
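Several of the tabulated strategies have direct scikit-learn counterparts. The sketch below compares median, k-NN, and iterative (regression-based) imputation under leakage-safe CV; the MCAR missingness is injected synthetically and the missingness rate and hyperparameters are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.15] = np.nan   # ~15% of values missing completely at random

imputers = {
    "median": SimpleImputer(strategy="median"),
    "k-NN (k=5)": KNNImputer(n_neighbors=5),
    "iterative (regression-based)": IterativeImputer(random_state=0, max_iter=10),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, imp in imputers.items():
    # imputer, scaler, and model are all re-fitted inside each training fold
    pipe = make_pipeline(imp, StandardScaler(), Ridge())
    s = cross_val_score(pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    print(f"{name:>28}: RMSE = {-s.mean():.2f} ± {s.std():.2f}")
```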
| Aspect | Indicator | Why it Matters |
|---|---|---|
| Predictive performance | Mean ± SD of ROC-AUC/RMSE (k-fold) | Gains must be consistent across folds [44] |
| Calibration | Brier score/ECE | Prevents overconfident models after trimming/winsorising [2] |
| Influence | Max leverage/Cook’s D | Identifies points dominating fit [38] |
| Robustness | Feature-wise IQR ratio before/after | Detects excessive shrinkage from aggressive trimming |
| Fairness (if applicable) | Metric parity deltas across subgroups | Ensures treatment does not induce disparate impact [5] |
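The calibration indicators in the table can be computed in a few lines. The sketch below uses scikit-learn's Brier score and a simple equal-width-binned ECE written for illustration; the 10-bin choice is a convention, not a requirement of this review.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between predicted probability and observed frequency."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"Brier = {brier_score_loss(y_te, prob):.3f}, "
      f"ECE = {expected_calibration_error(y_te, prob):.3f}")
```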
| Encoding Method | Use Case | Pros | Cons |
|---|---|---|---|
| One-Hot Encoding | Low-cardinality nominal variables | Preserves category identity | High dimensionality |
| Ordinal Encoding | Ordinal variables | Compact, simple | Imposes artificial order |
| Target Encoding | Predictive categorical vars | Captures target signal | Leakage risk if misused |
| Frequency Encoding | High-cardinality nominals | Efficient, scalable | May inject bias |
| Entity Embeddings | Large, complex datasets | Learns deep representations | Hard to interpret |
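The first two rows map directly onto scikit-learn encoders, as sketched below on a hypothetical toy frame. For the target-encoding row, recent scikit-learn releases also provide a TargetEncoder that cross-fits its statistics internally, which addresses the leakage risk noted in the table.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue"],     # nominal, low cardinality
    "size": ["small", "large", "medium", "small"],  # ordinal
})
encoder = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
    ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
])
# One-hot preserves category identity; ordinal encoding respects the known order.
print(encoder.fit_transform(df))
```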
| Method | Examples | Model Dep. | Cost (Fits/CV) | Strengths | Limitations | Indicators |
|---|---|---|---|---|---|---|
| Filter | Corr., MI, | Indep. | ≈0 (stat-only) | Fast; interpretable; near-linear in p | Ignores interactions; unstable under collinearity | Stability across folds (e.g., Nogueira index) [60]; regime: very high p, limited compute. |
| Wrapper | RFE; Fwd/ Back sel. | Dep. | ≈ | Captures interactions; task-aligned | Expensive; overfitting risk in small-n | Report S, wall-time, nested CV status; stability across folds; ablation vs. filters/embedded. |
| Embedded | LASSO; Elastic Net; Trees | Dep. | ≈ per model | Integrated with model; balances cost/accuracy | Model-dependent selection; may not transfer | Coeff. sparsity/feature counts; stability across folds; regime: moderate/large n, structured p. |
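One representative per row can be sketched as follows. The dataset is synthetic, and k, the number of features retained, and the L1 regularisation strength are illustrative choices; in practice each selector would be fitted inside CV folds, as discussed in Section 4.4.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # global scaling for illustration only

# Filter: model-independent score, near-zero cost
filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination around a model (many refits)
wrap = RFE(LogisticRegression(max_iter=2000), n_features_to_select=10).fit(X, y)

# Embedded: an L1 penalty performs selection during a single model fit
emb = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("filter keeps:  ", filt.get_support().sum(), "features")
print("wrapper keeps: ", wrap.get_support().sum(), "features")
print("embedded keeps:", (emb.coef_ != 0).sum(), "features")
```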
| Method | Advantages | Limitations |
|---|---|---|
| PCA | Fast; reduces multicollinearity; variance-explained criterion | Linear; components hard to interpret; variance retention does not guarantee task accuracy; fit on train only to avoid leakage |
| t-SNE | Reveals local clusters; strong for visualisation | Distorts global geometry; sensitive to hyperparameters and seeds; non-parametric (no out-of-sample mapping) |
| UMAP | Faster; often better global structure; flexible parameterisation | Stochastic; hyperparameter-sensitive; embedded distances not metric-faithful |
| Autoencoders | Nonlinear compression; scalable; denoising/variational variants | Data- and compute-demanding; tuning-sensitive; latent factors opaque |
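For the PCA row, the leakage caveat can be made concrete: the sketch below fits the scaler and the components on the training split only and chooses the component count from a variance-retention target. The 95% figure is a common convention, not a prescription of this review.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)            # fit on train only
Z_tr, Z_te = scaler.transform(X_tr), scaler.transform(X_te)

# Keep enough components to retain ~95% of the training variance
pca = PCA(n_components=0.95, svd_solver="full").fit(Z_tr)
print(f"{pca.n_components_} components retain "
      f"{pca.explained_variance_ratio_.sum():.2%} of training variance")
X_te_reduced = pca.transform(Z_te)             # apply, never refit, on the test split
```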
| Tool | Category | Key Preprocessing/Pipeline Features | Best for |
|---|---|---|---|
| scikit-learn | Library (Python) | Composable transformers; Pipeline and ColumnTransformer; CV with Grid/RandomisedSearch | Tunable, explicit workflows; benchmarking baselines |
| PyCaret | Low-code library (Python) | Auto setup (scaling, encoding, outliers, transforms); pipeline export; experiment logging | Rapid prototyping and quick model comparison |
| MLJ | Library (Julia) | Type-safe pipelines; unified interfaces; broad data-type support | High-performance Julia-native projects |
| H2O AutoML | AutoML | Auto imputation/encoding/scaling; ensemble search | Hands-off model + preprocess search at scale |
| auto-sklearn | AutoML (Python) | Joint search over preprocessing and model hyperparameters; meta-learning warm starts | Automated pipeline selection in the scikit-learn ecosystem |
| TPOT | AutoML (evolutionary) | Evolves DAG pipelines combining preprocessors and models | Discovering novel operator combinations |
| Jupyter + add-ons | Notebook workflow | Interactive EDA/reporting; parameterised notebooks; tracking and versioning (Papermill, DVC, MLflow) | Exploratory analysis and reproducible pipelines |
| Protocol | Requirement | Rationale |
|---|---|---|
| CV | k-fold or nested CV | Reduces variance of performance estimates [44] |
| Leakage checks | Fit transforms on training folds only | Prevents optimistic bias [2] |
| Ablation studies | Systematic removal/replacement | Identifies contribution of each preprocessing step [30] |
| Stability metrics | Feature overlap across folds | Ensures robustness of selection [60] |
| Task-specific metrics | ROC-AUC, calibration error | Aligns evaluation with application objectives ([84]) |
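The stability row can be approximated with a simple fold-overlap check. The sketch below computes the mean pairwise Jaccard overlap of the feature subsets selected in each fold, a lightweight alternative to the Nogueira index cited in the table [60], on an illustrative synthetic dataset.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           random_state=0)
selected = []
for tr_idx, _ in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
    # selection is fitted on the training portion of each fold only
    sel = SelectKBest(mutual_info_classif, k=10).fit(X[tr_idx], y[tr_idx])
    selected.append(set(np.where(sel.get_support())[0]))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(selected, 2)]
print(f"mean pairwise Jaccard overlap across folds: {np.mean(jaccard):.2f}")
```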
| Area | Best Practice | References |
|---|---|---|
| Imputation | Assess subgroup bias before replacement | [7,18,35] |
| Encoding | Use out-of-fold fitting for target encoding | [27] |
| Transparency | Prefer interpretable encodings when feasible | [31] |
| Bias mitigation | Apply reweighting or resampling | [85] |
| Documentation | Log inclusion/exclusion rationale | [5,34] |
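The imputation and documentation rows both presuppose a subgroup audit. The sketch below computes a per-group recall and its parity delta on synthetic data; the group column, the choice of metric, and the 0.05 flag threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=1000),       # hypothetical protected attribute
    "y_true": rng.integers(0, 2, size=1000),
})
# Simulated predictions that agree with the label 80% of the time
frame["y_pred"] = np.where(rng.random(1000) < 0.8, frame["y_true"], 1 - frame["y_true"])

per_group = {g: recall_score(sub["y_true"], sub["y_pred"])
             for g, sub in frame.groupby("group")}
delta = max(per_group.values()) - min(per_group.values())
print({g: round(v, 3) for g, v in per_group.items()}, f"parity delta = {delta:.3f}")
if delta > 0.05:
    print("flag: review preprocessing choices for disparate impact")
```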
| System | Strengths | Limitations |
|---|---|---|
| auto-sklearn | Joint optimisation of pipeline and model | Complex, less transparent [76] |
| TPOT | Genetic programming for pipeline search | Can produce bloated pipelines [77] |
| PyCaret | Low-code prototyping, fast deployment | Limited flexibility for advanced tuning [67] |
| AutoFE methods | Automated feature construction | Harder to interpret and validate [12,54] |
| Type | Control Logic | Triggers | Strengths | Limitations | Contexts |
|---|---|---|---|---|---|
| Rule-based | Declarative thresholds | Null-rate increase; new cats.; scale drift | Simple; transparent; low overhead | Brittle; manual tuning; limited generalisation | Regulated domains; low drift |
| Drift-triggered | Scheduler + detectors | PSI; K–S; shift | Handles distribution shift; easy ops | Coarse; compute cost; lagged response | Batch systems with clear baselines |
| AutoML/HPO | BO/GP/EAs over pipeline | Periodic retrain; perf. decay | Joint step optimisation; strong perf. | Opaque; higher compute; reproducibility load | Offline retrains; research/benchmarks |
| Meta-learning | Data descriptors → pipeline | New dataset/task; cold-start | Fast warm-start; consistent defaults | Needs meta-dataset; misgeneralisation risk | Portfolios of similar tasks |
| RL controller | Policy learns step choices | Online perf./SLAs | Long-horizon optimisation | Data-hungry; strict safety/rollback | Large-scale platforms w/guardrails |
| Human-in-loop | Policy + approval gates | Any critical change | Safety; accountability; domain context | Slower; staffing required | High-stakes or regulated settings |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
