  • Article
  • Open Access

13 December 2025

A Tabular Data Imputation Technique Using Transformer and Convolutional Neural Networks

1 Innovative Technologies and Computer Science Laboratory (LT2I), High School of Technology (EST), Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
2 Laboratory of Engineering Sciences (LSE), Polydisciplinary Faculty of Taza, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
* Authors to whom correspondence should be addressed.

Abstract

Upstream processes strongly influence downstream analysis in sequential data-processing workflows, particularly in machine learning, where data quality directly affects model performance. Conventional statistical imputations often fail to capture nonlinear dependencies, while deep learning approaches typically lack uncertainty quantification. We introduce a hybrid imputation model that integrates a deep learning autoencoder with Convolutional Neural Network (CNN) layers and a Transformer-based contextual modeling architecture to address systematic variation across heterogeneous data sources. Performing multiple imputations in the autoencoder–transformer latent space and averaging the resulting representations provides implicit batch correction that suppresses context-specific residual effects without requiring explicit batch identifiers. We performed experiments on datasets in which 10% missingness was artificially introduced under missing-completely-at-random (MCAR) and missing-not-at-random (MNAR) mechanisms. The results demonstrated strong practical performance, with our model jointly ranking first among the imputation methods evaluated. The technique reduced the root mean square error (RMSE) by 50% compared to denoising autoencoders (DAE) and by 46% compared to iterative imputation (MICE). Its performance was comparable to that of adversarial (GAIN) and attention-based (MIDA) models, while it additionally provided interpretable uncertainty estimates (CV = 0.08–0.15). Validation on datasets from multiple sources confirmed the robustness of the technique: notably, on a forensic dataset assembled from multiple laboratories, it achieved a practical improvement over GAIN (0.146 vs. 0.189 RMSE), highlighting its effectiveness in mitigating batch effects.
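
As an illustration of the architecture described above, the following is a minimal PyTorch sketch of a CNN-plus-Transformer autoencoder with dropout-based multiple imputation and averaging. The class and function names (HybridImputer, multiple_impute), the feature-wise tokenization, the layer sizes, and the number of latent draws are assumptions made for this sketch, not the authors' published configuration.

    # Hypothetical sketch: hybrid CNN + Transformer autoencoder for tabular imputation.
    import torch
    import torch.nn as nn

    class HybridImputer(nn.Module):
        """CNN encoder over features -> Transformer contextual block -> per-feature decoder."""

        def __init__(self, n_features: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
            super().__init__()
            # Treat each feature as a "token": embed (value, missingness flag) into d_model.
            self.embed = nn.Linear(2, d_model)
            # 1D convolution along the feature axis captures local feature interactions.
            self.conv = nn.Sequential(
                nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            # Transformer encoder models global (contextual) dependencies between features.
            enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
            # Decoder maps each contextual token back to a scalar reconstruction.
            self.decoder = nn.Linear(d_model, 1)

        def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            # x: (batch, n_features) with zeros at missing positions; mask: 1 = observed, 0 = missing.
            tokens = torch.stack([x, mask], dim=-1)             # (B, F, 2)
            h = self.embed(tokens)                              # (B, F, d_model)
            h = self.conv(h.transpose(1, 2)).transpose(1, 2)    # CNN over the feature axis
            h = self.transformer(h)                             # contextual modeling
            return self.decoder(h).squeeze(-1)                  # (B, F) reconstruction

    def multiple_impute(model: HybridImputer, x: torch.Tensor, mask: torch.Tensor, n_draws: int = 10):
        """Average several stochastic reconstructions (dropout kept active) to obtain the
        imputed values and a per-cell spread, mimicking the latent-space averaging and
        uncertainty estimation described in the abstract (assumed mechanism)."""
        model.train()  # keep Transformer dropout stochastic across draws
        with torch.no_grad():
            draws = torch.stack([model(x, mask) for _ in range(n_draws)])  # (n_draws, B, F)
        mean, spread = draws.mean(0), draws.std(0)
        imputed = torch.where(mask.bool(), x, mean)  # keep observed cells, fill missing ones
        return imputed, spread

    # Toy usage: inject a 10% MCAR mask on random data, then impute.
    x_full = torch.randn(32, 16)
    mask = (torch.rand_like(x_full) > 0.10).float()   # 1 = observed, 0 = missing (MCAR)
    x_obs = x_full * mask
    model = HybridImputer(n_features=16)
    imputed, uncertainty = multiple_impute(model, x_obs, mask)

In this sketch the encoder would be trained with a reconstruction loss on observed cells; the dropout-based draws stand in for the multiple latent-space imputations whose average supplies the implicit batch correction.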
