A Tabular Data Imputation Technique Using Transformer and Convolutional Neural Networks

Charlène Béatrice Bridge-Nduwimana; Salah Eddine  El Harrauss; Aziza El Ouaazizi; Majid Benyakhlef

doi:10.3390/bdcc9120321

,

and

¹

Innovative Technologies and Computer Science Laboratory (LT2I), High School of Technology (EST), Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco

²

Laboratory of Engineering Sciences (LSE), Polydisciplinary Faculty of Taza, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco

^*

Authors to whom correspondence should be addressed.

Big Data Cogn. Comput.2025, 9(12), 321;https://doi.org/10.3390/bdcc9120321

Version Notes

Order Reprints

Abstract

Upstream processes strongly influence downstream analysis in sequential data-processing workflows, particularly in machine learning, where data quality directly affects model performance. Conventional statistical imputations often fail to capture nonlinear dependencies, while deep learning approaches typically lack uncertainty quantification. We introduce a hybrid imputation model that integrates a deep learning autoencoder with Convolutional Neural Network (CNN) layers and a Transformer-based contextual modeling architecture to address systematic variation across heterogeneous data sources. Performing multiple imputations in the autoencoder–transformer latent space and averaging representations provides implicit batch correction that suppresses context-specific remains without explicit batch identifiers. We performed experiments on datasets in which 10% of missing data was artificially introduced by completely random missing data (MCAR) and non-random missing data (MNAR) mechanisms. They demonstrated practical performance, jointly ranking first among the imputation methods evaluated. This imputation technique reduced the root mean square error (RMSE) by 50% compared to denoising autoencoders (DAE) and by 46% compared to iterative imputation (MICE). Performance was comparable for adversarial models (GAIN) and attention-based models (MIDA), and both provided interpretable uncertainty estimates (CV = 0.08–0.15). Validation on datasets from multiple sources confirmed the robustness of the technique: notably, on a forensic dataset from multiple laboratories, our imputation technique achieved a practical improvement over GAIN (0.146 vs. 0.189 RMSE), highlighting its effectiveness in mitigating batch effects.

Keywords:

imputation technique; multiple imputation; autoencoder; transformer; latent average space; deep learning; uncertainty quantification; missing data mechanism

A Tabular Data Imputation Technique Using Transformer and Convolutional Neural Networks

Abstract

Article Metrics

Citations

Article Access Statistics