Article

A Tabular Data Imputation Technique Using Transformer and Convolutional Neural Networks

by
Charlène Béatrice Bridge-Nduwimana
1,*,
Salah Eddine El Harrauss
1,*,
Aziza El Ouaazizi
1,2 and
Majid Benyakhlef
2
1
Innovative Technologies and Computer Science Laboratory (LT2I), High School of Technology (EST), Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
2
Laboratory of Engineering Sciences (LSE), Polydisciplinary Faculty of Taza, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
*
Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(12), 321; https://doi.org/10.3390/bdcc9120321
Submission received: 5 November 2025 / Revised: 2 December 2025 / Accepted: 9 December 2025 / Published: 13 December 2025

Abstract

Upstream processes strongly influence downstream analysis in sequential data-processing workflows, particularly in machine learning, where data quality directly affects model performance. Conventional statistical imputations often fail to capture nonlinear dependencies, while deep learning approaches typically lack uncertainty quantification. We introduce a hybrid imputation model that integrates a deep learning autoencoder with Convolutional Neural Network (CNN) layers and a Transformer-based contextual modeling architecture to address systematic variation across heterogeneous data sources. Performing multiple imputations in the autoencoder–transformer latent space and averaging the resulting representations provides implicit batch correction that suppresses context-specific artifacts without explicit batch identifiers. We performed experiments on datasets in which 10% missingness was artificially introduced under missing completely at random (MCAR) and missing not at random (MNAR) mechanisms. The results demonstrated strong practical performance, with our technique jointly ranking first among the imputation methods evaluated. It reduced the root mean square error (RMSE) by 50% compared to denoising autoencoders (DAE) and by 46% compared to iterative imputation (MICE). Performance was comparable to adversarial models (GAIN) and denoising autoencoder-based models (MIDA), while additionally providing interpretable uncertainty estimates (CV = 0.08–0.15). Validation on datasets from multiple sources confirmed the robustness of the technique: notably, on a forensic dataset from multiple laboratories, our imputation technique achieved a practical improvement over GAIN (0.146 vs. 0.189 RMSE), highlighting its effectiveness in mitigating batch effects.

Graphical Abstract

1. Introduction

In many scientific domains, missing data is a significant problem. In the health, finance, and surveillance sectors, which rely on large, complex datasets, missing data can undermine confidence in the analysis and introduce biases that distort results. Many approaches exist for managing these gaps, ranging from simple statistical methods to sophisticated machine learning and deep learning imputation techniques [1,2,3,4]. Each methodology entails considerable trade-offs among accuracy, scalability, and adaptability to various constraints, with its efficacy profoundly affected by data characteristics and the configuration of missing-data mechanisms [5]. Despite recent advances, existing techniques face limitations: traditional statistical methods struggle to model complex nonlinear dependencies, whereas recent deep learning approaches often lack explicit uncertainty quantification, which is essential for reliable decision-making.
A frequently overlooked problem arises from systematic changes that occur when data are acquired under disparate conditions, at different times, or across different technology platforms. These variations, commonly referred to as batch effects in genomics [6] or domain shifts in machine learning, introduce confounding correlations that are not related to the true models. In multi-site biomedical studies, for example, systematic differences between collection stations undermine the accuracy of imputation. Furthermore, conventional imputation methods treat all samples uniformly, thereby amplifying rather than correcting these batch-specific biases.
In this study, a novel imputation technique is introduced: (1) conducting multiple imputations within the autoencoder’s latent space instead of the original feature space, thereby creating average representations that reflect a robust data structure; (2) offering implicit batch correction via the latent space average, which effectively mitigates source-specific anomalies while retaining the genuine signal, all without the need for explicit batch identifiers; and (3) preserving uncertainty quantification through the multiple imputation strategy, allowing for confidence-weighted downstream analysis that is not accessible in deterministic deep learning methods. This innovative approach is highly advantageous because it uses the latent space, which compresses and denoises the representation, where intricate relationships among features and non-linear dependencies are represented more effectively than in the original high-dimensional space. By merging the statistical robustness of multiple imputation with the representational capabilities of deep autoencoders, our technique provides both practical, impressive accuracy and interpretable uncertainty estimates.
Evaluation on standard reference datasets under both the missing completely at random (MCAR) and missing not at random (MNAR) conditions shows that our imputation technique has distinct strengths. Statistical evaluations indicate that this new technique yields significant practical advantages over conventional methods, including a reduction in root-mean-square error (RMSE), while achieving performance on par with advanced deep learning solutions such as generative adversarial imputation networks (GAIN) and multiple imputation using denoising autoencoders (MIDA). An examination of batch-size sensitivity confirms that mini-batch training ensures consistency during the multiple-imputation process in the latent space. The structure of this article is organized as follows: Section 2 discusses current imputation methods and their shortcomings. Section 3 outlines the methodology, detailing the mathematical framework and algorithmic execution. Section 4 presents the experimental findings and the comparative assessment. Section 5 explores the implications and practical uses. Section 6 concludes by summarizing the key contributions and future research directions.

2. Related Works

2.1. Background

Missing data is frequent in numerous data-driven fields, including web analytics, clinical medicine, and materials engineering [7,8]. Factors contributing to this issue include participant non-compliance with survey guidelines [9,10], malfunctioning sensors, privacy limitations, and technical issues during data collection, such as inadequate resolution, image deterioration, or suboptimal hybridization [10]. Datasets with missing values are prone to considerable bias, compromising analytical precision and ultimately impacting knowledge extraction and decision-making processes [5]. This ongoing issue has motivated decades of research on missing-data imputation, resulting in a range of techniques for managing incomplete datasets.
Analyzing data is crucial for both classification and prediction, especially when addressing missing values (MV). Imputation plays a vital role in dealing with MV that arises from random processes (missing at random—MAR or missing completely at random—MCAR) as well as from non-random processes (missing not at random—MNAR). However, many learning algorithms require complete data sets. Indeed, several statistical algorithms, such as singular value decomposition (SVD), principal component analysis (PCA), and artificial neural networks (ANN) [1,11], depend on complete datasets [7] and are very sensitive to MV and their patterns [8,9]. Consequently, imputation, the process of estimating and completing MV, is an essential pre-processing step in data analysis [11].
Data imputation approaches differ depending on their sources of information. Firstly, there are internal imputation techniques, which treat missing values exclusively on the basis of models and redundancies within the existing dataset [12,13], whereas external imputation methods incorporate user-specified external data sets or domain expertise [14,15], which generally requires greater human involvement and more resources. Furthermore, imputation methods can differ in the way they are carried out, from straightforward strategies that discard samples with missing values, which can introduce bias and reduce representativeness [16], to more advanced techniques that maintain the integrity of the data.

2.2. From Simple to Advanced Methods

2.2.1. Traditional Approaches

Early imputation approaches used simple statistical techniques, including mean imputation [5,17], hot deck imputation (HDI) [1,5], cold deck imputation (CDI) [1,5], and k-nearest neighbor imputation (KNNI) [1]. These methods remain computationally efficient and widely used due to their simplicity. Fixed imputation techniques, such as zero imputation [1,18], are simple but suffer from biased results and misleading correlations. Moreover, these traditional methods operate on the assumption that data are missing at random (MAR or MCAR) and are ineffective in missing-data scenarios that are not random (MNAR) [19,20].
Traditional machine learning advanced the field of imputation by introducing more capable methods such as XGBoost imputation (XGBI) [21,22], MissForest imputation (MissFI) [23], multivariate imputation by chained equations (MICE) [5,20,24,25], regression-based imputation [1,26], expectation-maximization imputation, soft imputation (SI) [27], matrix factorization imputation (MFI) [28], principal component analysis imputation (PCAI) [28], and multilayer perceptron imputation (MLPI) [29]. Multiple imputation (MI) [5,24,25] is an approach that generates multiple data sets to account for uncertainty, while hybrid methods combine multiple imputation with machine learning [22,30,31,32] or deep learning [33] to improve accuracy. Although this may introduce additional complexity, these approaches demonstrate a superior ability to capture non-linear relationships in the data. For example, methods such as MICE allow quantification of uncertainty via multiple imputations. However, their effectiveness depends heavily on data quality and requires careful parameter tuning [23,29,34].
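As a concrete reference point, several of the classical imputers discussed above are available in scikit-learn; the following minimal sketch (with an illustrative synthetic dataset, not one used in this paper) contrasts mean, KNN, and MICE-style iterative imputation:

```python
# Illustrative comparison of classical imputers; dataset and hyperparameters
# are examples, not values from this paper.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% MCAR missingness

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "mice": IterativeImputer(max_iter=10, random_state=0),  # chained equations
}
for name, imp in imputers.items():
    X_filled = imp.fit_transform(X)
    assert not np.isnan(X_filled).any()  # every gap is filled
```

IterativeImputer is scikit-learn's MICE-style implementation; by default it cycles regression models over each feature conditioned on the others.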

2.2.2. Deep Learning Imputation Methods

Deep learning approaches use generative models for imputation, incorporating explicit generative models such as autoencoders (AEs) [35] and implicit models such as generative adversarial networks (GANs) [16,36]. Feature-based imputation, such as the GAIN method [16] and the deep learning imputation network (DLIN) [37], offers more sophisticated solutions for complex missing patterns. They excel at modeling complex, high-dimensional data through powerful hierarchical representation learning. However, they are sensitive to missing data distributions, require large training datasets, and require significant computational resources [35,36]. Recent models such as Bidirectional Recurrent Imputation for Time Series (BRITS) [38], which is based on recurrent neural networks, and Self-Attention-based Imputation for Time Series (SAITS) [39], which adapts the Transformer to irregular time series, demonstrate the potential of sequential models for imputation. However, these recent approaches struggle to adapt to large tabular data (BRITS) or mainly target time series rather than general-purpose heterogeneous features (SAITS).
The development of large language models (LLMs) opened up new perspectives for data imputation [40,41]. Through prompt engineering, LLMs can be adapted to specific imputation tasks by improving task contextualization without modifying model parameters. The performance of LLMs in imputation tasks strongly depends on the quality of the prompts, with few-shot learning generally improving over zero-shot approaches through example-based contextualization [42]. Recent innovations include training LLMs to understand tabular data structures [43] and integrating augmented search techniques that dynamically incorporate external knowledge. However, these methods face limitations due to their reliance on stored knowledge or computationally expensive table-level search mechanisms that may neglect contextual requirements [44,45]. To address these challenges, advanced prompt engineering techniques have been developed, involving multi-turn, chain-of-thought, and tree-structured reasoning frameworks [46].

2.2.3. Evolutionary Computation Methods

Evolutionary computation methods have recently gained popularity in imputation tasks, searching for optimal values for missing data using natural-selection-inspired approaches. Examples include multi-objective genetic algorithms (MOGI) [47]; differential evolution involving KNN imputation, clustering, and feature selection (DEKCF) [48]; and particle swarm optimization-based feature selection and imputation (PSOFI), which simultaneously handles data imputation and feature selection to improve both imputation accuracy and classification performance, particularly for mixed data types [48,49]. These methods leverage the stochastic search capabilities of evolutionary algorithms, yet they frequently struggle with diverse feature distributions and high computational requirements. Additionally, combining feature selection with imputation has shown further benefits by decreasing dimensionality and improving model robustness [50].
Fuzzy logic-based imputation methods excel in addressing uncertainty and ambiguity in data by effectively modeling imprecise information [51], thereby enhancing adaptability in real-world situations dealing with noise [31,52]. These approaches utilize fuzzy set theory to manage ambiguous data, employing linguistic variables and fuzzy rules to fill in missing values while maintaining the relationships between data points. Overall, fuzzy-approximate techniques exhibit strong performance and resilience to noise, without the need for user-defined parameters or initial approximations, making them especially suitable for environments characterized by uncertain data.
Ultimately, imputation techniques that are specific to certain domains, such as inverse distance weighting utilizing KNN models in hydrology [53] and Gaussian mixture methods utilizing KNN models in biotechnology [35], tackle challenges unique to their respective fields. In medical applications [54], the balance between intra-series interactions and the selection of modeling techniques, such as recurrent neural networks (RNNs) to handle sequential dependencies or GANs for pattern recognition, must take into account the characteristics of the datasets and the missing values associated with batch effects [18,55], highlighting the importance of contextualized strategies for reliable data analysis. Batch effects represent non-biological systematic variations resulting from differences in data collection, processing, or measurement conditions [6]. In genomics, methods [56,57] provide a posteriori statistical adjustment when batch identifiers are known. However, these approaches work on complete data and assume that batch labels are available, conditions that are rarely met in imputation scenarios where missing data patterns may themselves depend on batches. Other recent work has explored machine learning with batch considerations. Domain adaptation techniques [58] learn representations that are invariant in known source domains through adversarial learning. Batch correction in single-cell RNA sequencing [59] aligns data from multiple experiments using mutual nearest neighbors. However, these methods require explicit batch labels or complete data matrices, preventing their direct application to missing data imputation when the batch structure may be unlabeled or partially observed.

3. Materials and Methods

In this section, we present the materials used to build the new hybrid imputation technique, which is based on an integrated structure combining pre-processing strategies, deep learning architectures, and iterative learning mechanisms. The approach revolves around complementary components that work together to handle complex missing-data patterns in tabular data.

3.1. Datasets Experimental Design

We evaluated our approach on benchmark datasets from the UCI Machine Learning Repository. Table 1 presents the characteristics of each dataset.
Our technique implements a comprehensive strategy for generating and handling different missing mechanisms, following established protocols in the literature [60,61]. We simulate four distinct missing-data patterns by combining two mechanisms (MCAR and MNAR) with two distribution methods (uniform and random). Each dataset was stratified and split into training and test sets with test_size = 0.3 (70% training, 30% testing) using random_state = 42. Synthetic missing data were generated for each of the four mechanisms at a single missing rate $r = 10\%$ using the procedure detailed in Algorithm 1.
Algorithm 1 Synthetic Missingness Pattern Generation
Bdcc 09 00321 i001
For MCAR scenarios, we generate a uniform random matrix $V \sim U(0,1)^{n \times m}$, where $n$ is the number of rows (samples) and $m$ is the number of columns (features). A binary missingness mask $M$ is then created such that $M_{ij} = 1$ if $V_{ij} \le t$, with $t = 0.1$ representing the missingness threshold. In the uniform variant, this applies to all features, yielding $P(M_{ij} = 1) = t$. In the random variant, missingness is restricted to a subset $S \subseteq \{1, 2, \ldots, m\}$ of features with $|S| = m/2$, so that $P(M_{ij} = 1) = t \times \mathbb{I}(j \in S)$.
For MNAR mechanisms, we introduce systematic dependency by sampling two reference features $c_1, c_2$ and computing their medians $m_1 = \mathrm{median}(X_{:,c_1})$ and $m_2 = \mathrm{median}(X_{:,c_2})$. The conditional missingness probability becomes Equation (1):
$P(M_{ij} = 1 \mid X) = t \times \mathbb{I}(X_{i,c_1} \ge m_1 \wedge X_{i,c_2} \ge m_2) \quad (1)$
where the logical conjunction ∧ enforces that both conditions hold simultaneously, producing structured missingness patterns that depend on the data distribution.
In the random MNAR mechanism, missing data are restricted to a random subset $S \subseteq \{1, 2, \ldots, m\}$ of features with $|S| = m/2$, which leads to Equation (2):
$P(M_{ij} = 1 \mid X, S) = t \times \mathbb{I}(X_{i,c_1} \ge m_1 \wedge X_{i,c_2} \ge m_2) \times \mathbb{I}(j \in S) \quad (2)$

3.2. Proposed Approach

The proposed technique consists of a structure that integrates pre-processing and imputation based on deep learning and the Transformer. Figure 1 presents an overview of the structure, progressing from raw data to the final imputed result.

3.2.1. Preprocessing Stage

The pre-processing step uses an architecture implemented with the scikit-learn column transformer to process numerical and categorical features separately. The complete procedure is described in Algorithm 2. We specify here that this concerns only the pre-processing of the data before artificially adding missing values.
For numeric features, a custom imputer built on HistGradientBoostingRegressor is utilized. During training, for each numeric column with missing data, the model learns to predict missing values using all other columns as predictors. To handle missing data in the predictor columns, median imputation is temporarily applied to fill any gaps, ensuring the model receives complete input data. The same median-filling process is repeated when predicting missing entries. After imputation is complete, the data are normalized to the range $[0, 1]$ using MinMaxScaler, as in Equation (3):
$X_{\text{norm}} = \frac{X - \min(X)}{\max(X) - \min(X)} \quad (3)$
Algorithm 2 Preprocessing stage Pipeline construction
Bdcc 09 00321 i002
Categorical features are processed using simple imputation with the most frequent value, followed by encoding with OrdinalEncoder, which assigns a specific code to unknown categories. The processed numeric and categorical features are then combined into the final input matrix for the Transformer model.

3.2.2. Transformer-CNN Hybrid Architecture

Our technique leverages the complementary strengths of both Transformer encoders and autoencoders with convolutional layers to capture global context and fine-grained local features effectively. Our architecture (Figure 2) includes the following components, explained step-by-step:
1.
Positional Encoding Layer: Adds positional information to the input sequence using a sine function:
$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \quad (4)$
where $pos$ denotes the position index within the input sequence, $i$ is the dimension index in the model embedding, and $d_{\text{model}}$ is the dimensionality of the model's internal representation. Only the sine component is used; the cosine component is omitted in our implementation. After positional encoding, dropout with rate $p_{\text{drop}} = 0.1$ is applied to regularize the model.
2.
Convolutional Feature Extraction: Two Conv1D layers with kernel size 1 are used to transform the input features. These layers help the model focus on local patterns in the data:
$H^{(1)} = \mathrm{ReLU}\left(\mathrm{Conv1D}_{d_{\text{input}} \to 32}(X_{\text{pos}})\right) \quad (5)$
$H^{(2)} = \mathrm{ReLU}\left(\mathrm{Conv1D}_{32 \to 32}(H^{(1)})\right) \quad (6)$
where $X_{\text{pos}}$ is the input feature matrix after positional encoding, and $H^{(1)}$ and $H^{(2)}$ are the outputs of the first and second convolutional layers, respectively.
3.
Transformer Encoder Stack: A stack of 16 Transformer encoder layers provides hierarchical abstraction capacity, allowing the structure to capture deep, nonlinear dependencies across features. The sequence is then processed with multi-head self-attention, where $d_{\text{model}} = 32$ and the number of heads $n_{\text{head}} = 2$, to model diverse relationships efficiently without excessive complexity:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (7)$
In the attention mechanism, $Q$, $K$, and $V$ are the query, key, and value matrices. The scalar $d_k$ is the dimension of the key vectors, used to scale the dot product. In addition, the attention mask $M_{i,j}$ prevents the model from attending to future positions, where entries of $-\infty$ mask such positions during training:
$M_{i,j} = \begin{cases} -\infty & \text{if } i < j \\ 0 & \text{otherwise} \end{cases} \quad (8)$
4.
Linear Decoder: Finally, a linear layer projects the encoded representation back to the original feature space:
$\hat{X} = W_{\text{dec}} H_{\text{trans}} + b_{\text{dec}} \quad (9)$
with weights initialized uniformly as $W_{\text{dec}} \sim U(-0.1, 0.1)$ and biases set to zero ($b_{\text{dec}} = 0$), transforming the Transformer output $H_{\text{trans}}$ back to the predicted features $\hat{X}$.
In conclusion, the architecture, depicted in Figure 2 and Algorithm A1, combines Conv1D and Transformer layers to balance local feature extraction with global contextual modeling. Conv1D (kernel_size = 1) functions as a pointwise transformation, converting input features into a latent space of uniform dimensionality ($d_{\text{model}} = 32$) for the Transformer, while capturing only limited local relationships among adjacent features.
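The four components above can be assembled into a compact PyTorch sketch; details not fixed in the text (e.g., the feed-forward width and sequence handling) are assumptions, so this should be read as an illustration of the described architecture rather than the authors' implementation:

```python
# Hedged PyTorch sketch: sine-only positional encoding, two pointwise Conv1D
# layers, a 16-layer Transformer encoder (d_model=32, 2 heads), linear decoder.
import torch
import torch.nn as nn

class HybridImputer(nn.Module):
    def __init__(self, d_input, d_model=32, n_head=2, n_layers=16, p_drop=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.conv1 = nn.Conv1d(d_input, d_model, kernel_size=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_head, dropout=p_drop,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.Linear(d_model, d_input)
        nn.init.uniform_(self.decoder.weight, -0.1, 0.1)  # W_dec ~ U(-0.1, 0.1)
        nn.init.zeros_(self.decoder.bias)                  # b_dec = 0

    @staticmethod
    def positional_encoding(x):
        # sine-only PE: PE(pos, 2i) = sin(pos / 10000^(2i / d))
        pos = torch.arange(x.size(1), dtype=torch.float32).unsqueeze(1)
        i = torch.arange(x.size(2), dtype=torch.float32)
        return torch.sin(pos / torch.pow(10000.0, 2 * i / x.size(2)))

    def forward(self, x):  # x: (batch, seq_len, d_input)
        x = self.dropout(x + self.positional_encoding(x))
        h = torch.relu(self.conv1(x.transpose(1, 2)))  # pointwise Conv1D
        h = torch.relu(self.conv2(h)).transpose(1, 2)
        L = h.size(1)
        # causal mask: -inf where i < j blocks attention to future positions
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.encoder(h, mask=mask)
        return self.decoder(h)  # project back to the feature space
```

A forward pass on a `(batch, seq_len, d_input)` tensor returns a reconstruction of the same shape, from which masked entries can be read off as imputations.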

3.2.3. Implicit Batch Correction Through Latent Averaging

Effectively addressing batch effects related to missing data requires a careful strategy, with the stepwise imputation method (SIM) emerging as particularly effective. SIM fills in missing data in batches that exhibit higher missing data rates, utilizing insights from batches with fewer missing entries, all while considering batch-specific variations. Nevertheless, the implementation of SIM requires meticulous planning to prevent the introduction of additional batch effects during the imputation process. Our suggested technique, a hybrid model that integrates a deep learning autoencoder with convolutional neural network (CNN) layers and Transformer frameworks, is capable of capturing both local and global data patterns, making it highly effective for imputing missing values that are influenced by batch effects. This approach enhances imputation accuracy, minimizes error propagation, acknowledges batch-specific influences, and provides a transparent and flexible framework for multi-batch datasets.
Our technique provides implicit batch correction through latent-space averaging (Equation (11)). Given a dataset $D = \{(x_i, b_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ are samples and $b_i$ denotes (potentially unobserved) batch membership, the data-generating process can be decomposed as follows:
$x_i = \mu + \beta_{b_i} + \epsilon_i \quad (10)$
where $\mu$ denotes the common signal, $\beta_{b_i}$ captures systematic batch-specific variations, and $\epsilon_i$ represents random noise.
In contrast to post hoc methods [56] that depend on explicit labels, the reduction in dimensionality effectively diminishes batch-specific variations $\beta_{b_i}$ while maintaining the shared structure $\mu$. The multiple imputation process can be expressed as
$\bar{z}_i = \frac{1}{m} \sum_{j=1}^{m} f_{\theta}(x_i^{(j)}) \quad (11)$
which further improves robustness by averaging $m$ latent codes, minimizing the variance of batch artifacts: $\mathrm{Var}(\bar{z}_i) = \frac{1}{m}\mathrm{Var}(z_i^{(j)})$. This represents implicit batch correction: the combination of dimensionality reduction and latent averaging mitigates systematic variations without needing explicit batch identifiers or post hoc adjustments. The full imputation procedure, including batch correction, is detailed in Algorithm 3.
Algorithm 3 Batch-Aware Latent-Space Multiple Imputation with Uncertainty Quantification
Bdcc 09 00321 i003
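The core latent-space step of Algorithm 3 can be illustrated with a short NumPy sketch; the encoder `f` and the stochastic fill are simplified stand-ins for the trained autoencoder and its dropout-driven imputations:

```python
# Minimal numpy sketch of latent-space multiple imputation with averaging
# (Equation (11)): draw m stochastic completions, encode each, average the
# latent codes, and keep the spread as an uncertainty estimate.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))        # stand-in encoder weights: f(x) = tanh(x @ W)
f = lambda x: np.tanh(x @ W)

def latent_multiple_impute(x, miss, m=20):
    """Return the averaged latent code z_bar and a per-dimension CV."""
    zs = []
    for _ in range(m):
        x_j = x.copy()
        # stochastic fill stands in for one dropout-perturbed imputation
        x_j[miss] = rng.normal(loc=0.0, scale=0.1, size=miss.sum())
        zs.append(f(x_j))
    zs = np.stack(zs)
    z_bar = zs.mean(axis=0)                        # implicit batch correction
    cv = zs.std(axis=0) / (np.abs(z_bar) + 1e-8)   # uncertainty per dimension
    return z_bar, cv

x = rng.normal(size=8)
miss = np.zeros(8, dtype=bool)
miss[[2, 5]] = True
z_bar, cv = latent_multiple_impute(x, miss)
```

In the full method, the averaged code `z_bar` is decoded back to the feature space, and the coefficient of variation plays the role of the CV-based uncertainty reported in the results.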
Interpretation of the Implicit Batch Correction Mechanism
The proposed strategy for averaging in latent space is effective because two theoretical mechanisms work together to minimize non-informative variation in the encoded representation. While these characteristics offer a logical rationale, they do not serve as formal proof and should be viewed as empirical hypotheses backed by the results obtained.
Property 1: Dimensionality Reduction as Noise and Bias Filtering
The autoencoder’s bottleneck ($d \ll p$) necessitates compression that highlights globally consistent patterns across samples. Since batch effects $\beta_{b_i}$ reflect source-specific deviations rather than common trends, they are expected to contribute little to the reconstruction objective and are therefore somewhat diminished in the latent representation. In general, if $\mathrm{Cov}(\beta_{b_i}, \mu) \approx 0$, the encoder is likely to map $\beta_{b_i}$ onto components with lower variance, which in turn reduces its effect on $z_i$.
Property 2: Latent-Space Averaging as Variance Stabilization
Multiple imputations $z_i^{(j)}$ represent stochastic draws of latent codes affected by both model uncertainty and data-specific noise. By averaging these imputations, $\bar{z}_i$ acts as a variance reducer, so that $\mathrm{Var}(\bar{z}_i) = \frac{1}{m}\mathrm{Var}(z_i^{(j)})$ under independence. Although batch effects are systematic rather than random, empirical findings indicate that averaging in the latent space still helps lessen their influence by emphasizing features that remain stable across imputations, which correspond to shared structure rather than batch-specific artifacts.
Together, these mechanisms suggest—rather than establish—that the underlying process can inherently diminish batch-related distortions through low-dimensional representation learning and the averaging of multiple imputations.
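Property 2 can be checked empirically in a few lines; under independence, averaging m draws shrinks their variance by roughly a factor of m (the simulated i.i.d. draws are a stand-in for independent latent codes):

```python
# Empirical check of the variance-stabilization property: for m independent
# draws, Var(mean of m draws) ~ Var(single draw) / m.
import numpy as np

rng = np.random.default_rng(1)
m, trials = 10, 5000
draws = rng.normal(size=(trials, m))   # m independent "latent codes" per sample
var_single = draws[:, 0].var()         # variance of one draw (~1)
var_avg = draws.mean(axis=1).var()     # variance of the m-draw average (~1/m)
```

For systematic batch effects the draws are not independent, so the reduction observed in practice is weaker than this idealized 1/m rate, as the text notes.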

3.3. Evaluation

In the first place, our imputation technique is compared to three state-of-the-art deep learning-based imputation techniques and one classical statistical method (see Table 2). Generative Adversarial Imputation Networks (GAIN) [16] is a GAN-based approach that uses a generator to impute missing values and a discriminator to distinguish observed from imputed data. The model employs hint mechanisms to guide the imputation process. Multiple Imputation using Denoising Autoencoders (MIDA) [40] generates multiple imputations using stacked denoising autoencoders trained with dropout noise to capture uncertainty. The Denoising Autoencoder (DAE) [62] approach is similar to MIDA but focuses on single imputation with enhanced denoising capabilities through corruption mechanisms. Multivariate Imputation by Chained Equations (MICE) [63,64] is a classical iterative imputation method using chained regression models. MICE draws imputations by iterating over conditional densities, which has the added advantage of being able to model different densities for different variables. Classical approaches (such as MICE and KNN) treat all samples identically. Recent deep learning methods—GAIN, MIDA, and denoising autoencoders—similarly lack mechanisms to account for systematic variations. Model performance is evaluated using several complementary measures that assess both imputation accuracy and effectiveness on downstream tasks.
  • Root Mean Squared Error (RMSE): the primary metric for imputation accuracy on artificially masked values:
    $\mathrm{RMSE} = \sqrt{\frac{1}{|M|} \sum_{(i,j) \in M} (X_{\text{true}}(i,j) - X_{\text{imputed}}(i,j))^2} \quad (12)$
  • Mean Squared Error (MSE): a variance-focused imputation metric:
    $\mathrm{MSE} = \frac{1}{|M|} \sum_{(i,j) \in M} (X_{\text{true}}(i,j) - X_{\text{imputed}}(i,j))^2 \quad (13)$
  • Mean Absolute Error (MAE): a robust metric less sensitive to outliers:
    $\mathrm{MAE} = \frac{1}{|M|} \sum_{(i,j) \in M} |X_{\text{true}}(i,j) - X_{\text{imputed}}(i,j)| \quad (14)$
    where $M = \{(i,j) : M_{ij} = 1\}$ denotes the set of artificially introduced missing positions.
  • Classification Accuracy (ACC): the proportion of correctly classified instances using imputed data as input to a standard classifier, measuring the practical utility of imputed values:
    $\mathrm{ACC} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \quad (15)$
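The masked-position metrics above can be computed with a short NumPy helper, where `M` is the Boolean mask of artificially removed entries:

```python
# Compute RMSE, MSE, and MAE restricted to the artificially masked positions M.
import numpy as np

def masked_metrics(X_true, X_imputed, M):
    """M is a boolean array marking the entries that were artificially removed."""
    diff = (X_true - X_imputed)[M]          # errors at masked positions only
    mse = np.mean(diff ** 2)
    return {"RMSE": np.sqrt(mse), "MSE": mse, "MAE": np.mean(np.abs(diff))}
```

Restricting the error to `M` is what makes these metrics measure imputation quality rather than trivial reconstruction of observed entries.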

4. Experimental Results

GAIN was used with default hyperparameters, and we added preprocessing stages to MIDA (training parameters detailed in Table A2).
Dataset-specific analysis (Table 3) revealed pronounced advantages on multi-source data. On Glass—comprising forensic measurements from multiple laboratories with documented variations in analytical equipment—our imputation technique achieved a 22.8% improvement over GAIN (0.146 vs. 0.189 RMSE, rank #1/5) and 33.9% over MIDA. This validates that latent-space averaging effectively suppresses lab-specific systematic biases. On Credit, aggregated from multiple financial institutions, our imputation technique outperformed traditional approaches, improving on MICE by 38.7%, while achieving performance within 1.9% of GAIN and a moderate improvement over MIDA (+7.9%). The 2–3× larger improvements on Glass (high batch evidence) versus Credit (medium batch evidence) confirm that the benefits of batch correction correlate with the severity of systematic variations (Spearman ρ = 0.74). As illustrated in Figure 3, our imputation technique, depicted in green, achieves the lowest RMSE across most datasets, particularly for Glass, Sonar, and Breast, demonstrating reliable reconstruction accuracy across diverse data structures.
To thoroughly assess the performance of our imputation technique in relation to a deep learning-like architecture, we carried out comprehensive comparisons with MIDA across four different missing data mechanisms (MCAR-Uniform, MCAR-Random, MNAR-Uniform, MNAR-Random) and nine reference datasets. Both techniques were applied with the same experimental parameters (num_epochs = 50) to ensure a fair evaluation. The outcomes are displayed in Table A3. The distribution analysis presented in Figure 4 indicates that while the median RMSE values for our proposed technique are marginally lower in the MCAR and MNAR scenarios, the variance remains similar, demonstrating consistent but not statistically superior performance.
In Figure 5, the mean RMSE of MIDA is compared to that of our imputation technique (“OURS”) under four missingness mechanisms: MCAR-Uniform, MCAR-Random, MNAR-Uniform, and MNAR-Random. The blue bars illustrate MIDA’s performance, while the red bars represent OURS. In all cases except MNAR-Random, OURS exhibits a lower RMSE, with reductions ranging from 6.4% to 14.3%, the largest improvement occurring under MCAR-Uniform conditions. In the MNAR-Random scenario, our technique underperforms MIDA by 6.0%.
Finally, we performed a batch-size sensitivity analysis to examine its effect on imputation performance. Experiments were conducted on eight datasets under four missingness mechanisms (MCAR/MNAR × Uniform/Random at 10% missingness). We evaluated four configurations: online learning (batch_size = 1), small mini-batches (batch_size = 2 and 4), and intermediate mini-batch training (batch_size = 8), each for 50 epochs. The online configuration (batch_size = 1) achieved a win rate of 31.2% (10/32) across all scenarios, with performance varying by dataset size. Smaller datasets often benefited from the more precise gradient updates afforded by smaller batches, whereas larger datasets occasionally preferred the stability of larger batches. Wilcoxon tests did not indicate significant differences between batch sizes (all p > 0.05), suggesting that the choice of batch size trades flexibility against stability without a clear overall winner.
Figure 6 summarizes the mean RMSE trajectories across all datasets and mechanisms for the four batch sizes. Figure 7 presents the RMSE distributions per batch size across missingness mechanisms. The box plots reveal that batch_size = 1 achieves the lowest median RMSE and the smallest variance under both MCAR and MNAR conditions, indicating both superior accuracy and stability. Figure 8 shows dataset-specific RMSE trajectories across mechanisms for the four batch sizes (1, 2, 4, 8). Most datasets, including Iris, Haberman, and Boston, show a lower RMSE for Batch = 1, indicating that online learning performs well in practice. Only larger datasets, such as Credit, occasionally favor larger batch sizes (Batch = 8), suggesting that fine-grained updates benefit smaller, heterogeneous datasets, while mini-batching remains competitive for larger ones.

Summary Statistics and Statistical Testing

The mean RMSEs across all scenarios show a slight advantage for batch_size = 1. Statistical significance tests confirm mainly non-significant differences, with a single aggregated MCAR-Uniform comparison (batch_size = 1 vs. batch_size = 4) showing a marginal difference (Wilcoxon p = 0.07). The aggregated Friedman test results across batch sizes (1, 2, 4, 8) are as follows:
  • MCAR-Uniform: χ² = 8.07, p = 0.04 (significant differences)
  • MCAR-Random: χ² = 0.00, p = 1.00 (no significant difference)
  • MNAR-Uniform: χ² = 4.36, p = 0.22 (no significant difference)
  • MNAR-Random: χ² = 5.96, p = 0.11 (no significant difference)
Aggregated over all scenarios, the Friedman test produced χ² = 3.71, p = 0.29, indicating no significant overall differences in RMSE performance between batch sizes.
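These mechanism-level statistics can be reproduced from the per-dataset RMSE values in Table A5. The sketch below applies SciPy's tie-corrected Friedman test to the MCAR-Uniform RMSE column (one observation per dataset and batch size), assuming this matches the aggregation procedure used above; it recovers the reported χ² = 8.07:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# MCAR-Uniform RMSE per dataset (rows) and batch size 1, 2, 4, 8
# (columns), taken from Table A5.
rmse = np.array([
    [0.214, 0.207, 0.223, 0.204],  # BostonHousing
    [0.131, 0.294, 0.269, 0.257],  # BreastCancer
    [0.325, 0.319, 0.301, 0.300],  # Credit
    [0.146, 0.189, 0.189, 0.146],  # Glass
    [0.238, 0.247, 0.261, 0.213],  # Haberman
    [0.269, 0.211, 0.287, 0.313],  # Iris
    [0.252, 0.280, 0.278, 0.216],  # SlumpTest
    [0.187, 0.201, 0.192, 0.190],  # Sonar
])
# One argument per batch-size condition; ranking is done within datasets.
stat, p = friedmanchisquare(*rmse.T)
print(f"chi2 = {stat:.2f}")  # chi2 = 8.08 (8.07 in the text, rounding aside)
```

The resulting p-value falls just below 0.05, consistent with the "significant differences" verdict for MCAR-Uniform; the other three mechanisms can be checked the same way from their Table A5 columns.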

5. Discussion

5.1. Performance Interpretation

Experiment 1 (see Figure 3 and Table 2 and Table 3): Our imputation technique achieved competitive performance (mean rank = 1.60) with significant advantages over traditional methods. The Friedman test confirmed significant differences between methods (χ²(4) = 9.60, p = 0.0477, where the 4 degrees of freedom equal the number of compared methods minus one), validating post-hoc comparisons. Pairwise Wilcoxon tests revealed large effect sizes compared to DAE (r = 1.00, Wilcoxon p = 0.0312) and MICE (r = 1.00, Wilcoxon p = 0.0312), both of which survive Bonferroni correction (α = 0.0125). These results translate into RMSE reductions of 50.8% and 46.5%, respectively, substantial practical gains that highlight the benefit of performing multiple imputation in latent representations. Compared with leading methodologies, our approach demonstrated comparable performance: GAIN (r = 0.60, Wilcoxon p = 0.156) and MIDA (r = 0.87, Wilcoxon p = 0.062) showed smaller effect sizes without significant differences. This reflects the maturity of deep learning imputation, where different techniques reach similar levels of accuracy. However, our technique offers distinct benefits, namely explicit uncertainty quantification via multiple imputation and systematic batch-effect correction, features lacking in adversarial (GAIN) or purely autoencoder-based (MIDA) systems.
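The pairwise testing procedure can be sketched as follows. The RMSE vectors are illustrative stand-ins (not the paper's per-dataset values), chosen so that our technique wins on all five datasets as in the DAE comparison of Table A1; the rank-biserial correlation is one common definition of the effect size r, assumed here:

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

def paired_wilcoxon(rmse_ours, rmse_base):
    """One-sided Wilcoxon signed-rank test (H1: ours < baseline)
    together with a rank-biserial effect size r."""
    p = wilcoxon(rmse_ours, rmse_base, alternative="less").pvalue
    d = np.asarray(rmse_base, float) - np.asarray(rmse_ours, float)
    ranks = rankdata(np.abs(d))
    # r = (sum of ranks where ours wins - sum where it loses) / total
    r = (ranks[d > 0].sum() - ranks[d < 0].sum()) / ranks.sum()
    return p, r

# Hypothetical per-dataset RMSEs; "ours" is lower on every dataset (5/5).
ours = [0.21, 0.13, 0.33, 0.15, 0.19]
base = [0.45, 0.28, 0.61, 0.31, 0.37]
p, r = paired_wilcoxon(ours, base)
alpha_bonf = 0.05 / 4  # Bonferroni threshold for four pairwise comparisons
print(p, r)  # p = 0.03125, r = 1.0
```

With five paired observations and all differences in one direction, the exact one-sided p-value is 1/2⁵ = 0.03125 and r = 1.00, matching the pattern reported for the DAE and MICE comparisons.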
Experiment 2 (see Figure 4 and Figure 5 and Table A3) compared our imputation technique against MIDA across four missingness mechanisms and eight datasets, revealing a competitive but balanced performance profile. Our approach recorded a win rate of 53.1% (17/32 pairwise RMSE comparisons), confirming overall equivalence with the baseline. In the MCAR-Uniform scenario, our technique produced a 14.3% RMSE reduction relative to MIDA (Wilcoxon p = 0.074, Cohen’s d = 0.53), indicating a marginally significant improvement. The most substantial improvements were observed on datasets characterized by strong nonlinear or heterogeneous feature dependencies, such as Iris. In the MCAR-Random and MNAR-Uniform scenarios, our technique achieved modest, non-significant RMSE reductions (6.4% and 7.9%, respectively; Wilcoxon p > 0.05), while under MNAR-Random it underperformed MIDA (Wilcoxon p = 0.769, +6.0% RMSE). A global Friedman test (p = 0.326) confirmed the lack of overall significance across mechanisms. These findings validate our latent-space multiple imputation approach as competitive with MIDA, while also providing unique advantages in modeling nonlinear dependencies, quantifying uncertainty, and addressing structured missingness patterns.
Experiment 3 (see Figure 6, Figure 7 and Figure 8 and Table 2 and Table A5) revealed that smaller batch sizes can modestly improve performance across missingness mechanisms. In particular, a batch size of 1 achieved a mean RMSE of 0.198 ± 0.065 compared to 0.213 ± 0.063 for a batch size of 2, an RMSE reduction of 7.4%. The greatest gains appeared under the MNAR-Uniform mechanism (up to a 24.4% RMSE reduction), although the differences were not statistically significant (Wilcoxon p = 0.31–0.94). Under the MCAR mechanisms, the differences were smaller and non-significant, suggesting broadly comparable performance across batch sizes. Dataset-specific analysis showed batch_size = 1 dominating on datasets such as Boston, Haberman, and Iris. These trends support the notion that fine-grained online learning is especially beneficial for small-to-medium datasets (approximately 150–700 samples) with complex or non-random missingness patterns.

5.2. Batch Size Impact on Computational Efficiency

Under conditions consistent with MCAR-Uniform across various datasets, the batch size plays a crucial role in determining training efficiency (Table 4). Batch 8 produced a mean execution time of 141 s, yielding time savings of 51.4% compared to batch 4 and 78.4% compared to batch 1, reinforcing that larger batches significantly enhance efficiency by improving memory usage and lessening gradient overhead.
Using smaller batches leads to non-linear increases in computational time: batch 2 roughly doubles the execution time relative to batch 4, whereas online learning (batch 1) incurs a fourfold increase in computational cost. This trend holds across dataset dimensions, with a 4.7× slowdown from batch 8 to batch 1 on the Iris dataset and a 4.2× slowdown on Sonar, underscoring that batch configuration affects runtime more strongly than dataset size.
Anomalies specific to certain datasets indicate complexities that go beyond simple scaling. The Haberman dataset (size = 1220; 4 features) demands 137–602 s for processing, which is significantly longer than is needed for the SlumpTest dataset (size = 1133; 11 features, taking 47–204 s), implying that the sparsity of features influences convergence rates. In contrast, the high-dimensional Sonar dataset (with 61 features) exhibits efficient processing (taking 94 s at batch 8), likely attributable to its well-structured representations.

5.3. Limitations and Future Perspectives

One major limitation concerns the memory constraints encountered when training on larger datasets, although our method has proven capable of processing substantial data volumes, as evidenced by the supplementary study of execution times (see Table A4 and Table A6). The evaluation concentrated on small-to-medium benchmarks (150–700 samples) for computational feasibility and comparability, and therefore does not fully represent high-dimensional industrial scenarios (>10,000 samples) or intricate temporal dependencies. Furthermore, we tested fixed missingness levels (10% MCAR/MNAR), whereas real-world data typically involve a combination of mechanisms or time-varying patterns that were not addressed in this analysis. Future work will extend validation to high-dimensional and large-scale datasets to assess scalability, and will detail how the model responds to hyperparameters (such as the number of attention heads and the latent dimension) and its tendency toward overfitting.
Looking forward, several promising directions emerge. First, hybrid architectures that merge adversarial generation with explicit multiple-imputation techniques could combine GAIN’s predictive accuracy with more robust uncertainty quantification. Second, methods that automatically detect batch effects without prior labels would significantly broaden the applicability of this imputation technique to datasets from diverse or multiple sources. Third, dynamic batch-sizing strategies could balance online and mini-batch learning, making the technique more adaptable to datasets of varying sizes. Additionally, modeling approaches that address temporal or hierarchical missingness would widen usability to longitudinal or structured contexts. Finally, improving training efficiency by lowering computational costs without sacrificing statistical integrity would be crucial for deploying this technique in large-scale or real-time imputation pipelines. These avenues hold the potential to strengthen the reliability and flexibility of the imputation technique, fostering broader adoption and significant applications.

6. Conclusions

This study introduced a hybrid technique for imputation that merges multiple imputation with deep learning-based autoencoders and explicit correction for batch effects. A comprehensive evaluation across diverse datasets, four mechanisms for missing data, and a comparative study against conventional methods and deep learning approaches demonstrated its competitive capabilities along with distinct methodological advantages. The core innovation lies in conducting multiple imputation within learned latent representations, alongside providing implicit batch correction through averaging—a fusion that tackles limitations found in both statistical techniques (MICE, which is inadequate for nonlinear patterns) and deep learning approaches (GAIN/MIDA, which lack uncertainty quantification and proper batch management). Our technique showed significant practical enhancements over standard methods, with effect-size measures indicating considerable decreases in RMSE relative to both DAE and MICE (achieving Bonferroni-corrected significance). Compared with current deep learning models, our approach exhibited practically aligned performance, with statistical support suggesting meaningful benefits under MCAR-Uniform conditions, especially for datasets exhibiting strong nonlinear dependencies. A mechanism-specific analysis revealed nuanced performance trends, suggesting that advantages were dataset-specific rather than universally applicable. Online learning generally outperformed mini-batch training, with the largest gains observed under MNAR mechanisms, underscoring that fine-grained gradient updates enhance adaptability to complex missing-data patterns in moderately sized datasets. These findings confirm that robust statistical frameworks, paired with deep learning, yield competitive accuracy while preserving interpretability and methodological soundness.
The key contribution here is not solely about achieving universal accuracy superiority—the balanced performance rates and converged results among deep learning techniques mirror the evolution of the field—but rather about demonstrating the synergistic melding of statistical rigor with neural network frameworks. Our method uniquely integrates explicit uncertainty quantification via multiple imputation, systematic batch-effect correction absent in purely data-driven models, and competitive predictive accuracy. In multi-laboratory biomedical studies, correcting for batch effects and quantifying uncertainty facilitates trustworthy clinical decision-making among diverse patient groups. In the finance sector, this approach offers regulatory-compliant audit trails and addresses intricate patterns of missing transactions that are not random. The technique positions latent-space multiple imputation as a robust, interpretable, and uncertainty-aware solution, with both predictive performance and methodological clarity at the forefront.

Author Contributions

Conceptualization, C.B.B.-N., S.E.E.H. and A.E.O.; methodology, C.B.B.-N. and S.E.E.H.; investigation, C.B.B.-N., S.E.E.H., A.E.O. and M.B.; writing—original draft preparation, C.B.B.-N. and S.E.E.H.; writing—review and editing, C.B.B.-N., S.E.E.H. and A.E.O.; supervision, A.E.O. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available at https://github.com/BBridgeCN/TCnns-tabImputation.git (accessed on 1 August 2025).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UCI: University of California, Irvine machine learning repository
MCAR: Missing Completely at Random
MNAR: Missing Not at Random
MAR: Missing at Random
MV: Missing Values
DAE: Denoising AutoEncoders
MICE: Multivariate Imputation by Chained Equations
MIDA: Multiple Imputation Denoising Autoencoder
GAIN: Generative Adversarial Imputation Networks
SVD: Singular Value Decomposition
PCA: Principal Component Analysis
PCAI: Principal Component Analysis Imputation
ANN: Artificial Neural Networks
AEs: Autoencoders
GANs: Generative Adversarial Networks
RNNs: Recurrent Neural Networks
HDI: Hot Deck Imputation
CDI: Cold Deck Imputation
KNN: K-Nearest Neighbors
KNNI: K-Nearest Neighbors Imputation
MissFI: MissForest Imputation
SI: Soft Imputation
MFI: Matrix Factorization Imputation
MLPI: Multilayer Perceptron Imputation
DLIN: Deep Ladder Imputation Network
BRITS: Bidirectional Recurrent Imputation for Time Series
SAITS: Self-Attention-based Imputation for Time Series
LLMs: Large Language Models
MOGI: Multi-Objective Genetic Algorithms
DEKCF: Differential Evolution involving KNN Imputation, Clustering, and Feature Selection
PSOFI: Particle Swarm Optimization-based Feature Selection and Imputation
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ACC: Accuracy

Appendix A. Experimental Results and Imputation Algorithm

This appendix documents the experimental infrastructure and extended results supporting our imputation technique. Algorithm A1 presents the complete training and imputation workflow from data preparation through evaluation. Table A1 provides statistical comparisons against baseline methods with win rates and effect sizes. Table A2 details all hyperparameters, model architecture specifications, and training configurations. Table A3 compares MIDA and our technique across MCAR/MNAR mechanisms under uniform and random missingness. Table A5 examines performance across batch sizes 1, 2, 4, and 8 under different missingness mechanisms. The efficiency analysis section examines the effects of batch size on computational cost and imputation quality. Table A4 and Table A6 provide comprehensive metrics across diverse datasets, demonstrating scalability from 7680 to 101,500 cells with consistent accuracy and practical execution times for real-world deployment.
Algorithm A1 Complete Training and Imputation Workflow
[Algorithm A1 is provided as an image in the published version.]
Table A1. Statistical comparison summary.

Comparison | Win Rate | p-Value | Effect Size | Interpretation
vs. DAE | 5/5 | 0.031 ** | r = 1.00 | significant
vs. MICE | 5/5 | 0.031 ** | r = 1.00 | significant
vs. MIDA | 4/5 | 0.062 | r = 0.87 | marginal
vs. GAIN | 3/5 | 0.156 | r = 0.60 | not significant
** Practically significant improvements.
Table A2. Training parameters and configuration.

Training Configuration
  Epochs: 50
  Batch sizes: 1, 2, 4, 8
  Learning rate: 0.01
  Optimizer: SGD (momentum = 0.99, Nesterov = True)
  Loss function: MSE
  Early stopping threshold: 10⁻⁶
Model Architecture
  Dropout rate: 0.1
  Model dimension (d_model): 32
  Attention heads (n_head): 2
  Encoder layers: 16
  Conv1D kernel: 1
  Conv1D channels: 32
Data Configuration
  Train–test split: 70–30%
  Random state: 42
  Missing rate: 10%
Environment
  Device: CPU/CUDA
  Framework: PyTorch (2.9.0)
  Preprocessing: scikit-learn
Table A3. Comparison of MIDA and OURS under MCAR and MNAR mechanisms with uniform and random missingness. Each cell reports RMSE/MSE/MAE/ACC.

Dataset | Uniform: MIDA | Uniform: OURS | Random: MIDA | Random: OURS
MCAR
BostonHousing | 0.232/0.061/0.158/0.825 | 0.214/0.054/0.163/0.846 | 0.210/0.055/0.162/0.822 | 0.219/0.052/0.168/0.828
BreastCancer | 0.170/0.055/0.114/0.851 | 0.131/0.046/0.101/0.865 | 0.139/0.025/0.076/0.913 | 0.287/0.109/0.199/0.816
Credit | 0.353/0.138/0.290/0.613 | 0.325/0.156/0.277/0.697 | 0.421/0.218/0.367/0.587 | 0.303/0.117/0.277/0.753
Glass | 0.221/0.032/0.202/0.821 | 0.146/0.035/0.138/0.863 | 0.183/0.039/0.122/0.875 | 0.165/0.036/0.114/0.805
Haberman | 0.176/0.040/0.137/0.861 | 0.238/0.073/0.202/0.806 | 0.296/0.088/0.235/0.500 | 0.203/0.052/0.172/0.800
Iris | 0.445/0.376/0.397/0.430 | 0.269/0.086/0.207/0.700 | 0.271/0.074/0.196/0.889 | 0.255/0.081/0.211/0.875
SlumpTest | 0.286/0.088/0.249/0.538 | 0.252/0.073/0.234/0.742 | 0.287/0.087/0.255/0.783 | 0.214/0.058/0.200/0.729
Sonar | 0.174/0.049/0.148/0.823 | 0.187/0.041/0.158/0.831 | 0.148/0.026/0.123/0.857 | 0.184/0.039/0.152/0.835
MNAR
BostonHousing | 0.174/0.045/0.149/0.790 | 0.151/0.028/0.137/0.801 | 0.129/0.018/0.100/0.875 | 0.153/0.032/0.139/0.867
BreastCancer | 0.128/0.020/0.087/0.981 | 0.145/0.036/0.126/0.936 | 0.122/0.021/0.101/0.944 | 0.218/0.049/0.180/0.928
Credit | 0.290/0.138/0.253/0.765 | 0.311/0.164/0.292/0.731 | 0.212/0.049/0.188/0.743 | 0.301/0.116/0.290/0.729
Glass | 0.142/0.032/0.126/0.833 | 0.145/0.031/0.132/0.888 | 0.173/0.033/0.168/0.750 | 0.201/0.050/0.175/0.875
Haberman | 0.197/0.041/0.176/0.875 | 0.193/0.051/0.170/0.879 | 0.143/0.026/0.112/0.991 | 0.159/0.027/0.148/0.750
Iris | 0.177/0.048/0.156/0.750 | 0.083/0.008/0.078/0.991 | 0.194/0.040/0.187/0.500 | 0.047/0.002/0.047/0.996
SlumpTest | 0.208/0.058/0.195/0.667 | 0.142/0.031/0.142/0.750 | 0.255/0.073/0.251/0.750 | 0.188/0.052/0.179/0.667
Sonar | 0.138/0.030/0.125/0.876 | 0.169/0.042/0.157/0.852 | 0.121/0.021/0.112/0.875 | 0.163/0.036/0.153/0.873
Table A4. Performance metrics across different batch sizes for three datasets.

Dataset / Metric | Batch 12 | Batch 8 | Batch 4 | Batch 2
Credit-G (1000 × 21 = 21,000)
RMSE | 0.707 | 0.708 | 0.726 | 0.720
MSE | 0.939 | 0.941 | 0.959 | 0.949
MAE | 0.578 | 0.583 | 0.607 | 0.593
Accuracy | 0.622 | 0.622 | 0.609 | 0.620
Time (s) | 786.46 | 658.30 | 685.32 | 1232.96
Airfoil (1503 × 6 = 9018)
RMSE | 0.260 | 0.263 | 0.260 | 0.268
MSE | 0.075 | 0.077 | 0.075 | 0.079
MAE | 0.217 | 0.220 | 0.213 | 0.231
Accuracy | 0.775 | 0.756 | 0.756 | 0.775
Time (s) | 1128.24 | 978.73 | 1117.12 | 1813.31
Building Energy (768 × 10 = 7680)
RMSE | 0.323 | 0.323 | 0.329 | 0.329
MSE | 0.111 | 0.111 | 0.115 | 0.115
MAE | 0.290 | 0.291 | 0.296 | 0.294
Accuracy | 0.570 | 0.577 | 0.526 | 0.545
Time (s) | 570.70 | 471.46 | 507.49 | 845.07
Table A5. OURS performance under MCAR and MNAR mechanisms with uniform and random missingness for batch sizes 1 and 8. Each cell reports RMSE/MSE/MAE/ACC.

Dataset | MCAR-Uniform | MNAR-Uniform | MCAR-Random | MNAR-Random
batch_size = 1
BostonHousing | 0.214/0.054/0.163/0.846 | 0.151/0.028/0.137/0.801 | 0.219/0.052/0.168/0.828 | 0.153/0.032/0.139/0.867
BreastCancer | 0.131/0.046/0.101/0.865 | 0.145/0.036/0.126/0.936 | 0.287/0.109/0.199/0.816 | 0.218/0.049/0.180/0.928
Credit | 0.325/0.156/0.277/0.697 | 0.311/0.164/0.292/0.731 | 0.303/0.117/0.277/0.753 | 0.301/0.116/0.290/0.729
Glass | 0.146/0.035/0.138/0.863 | 0.145/0.031/0.132/0.888 | 0.165/0.036/0.114/0.805 | 0.201/0.050/0.175/0.875
Haberman | 0.238/0.073/0.202/0.806 | 0.193/0.051/0.170/0.879 | 0.203/0.052/0.172/0.800 | 0.159/0.027/0.148/0.750
Iris | 0.269/0.086/0.207/0.700 | 0.083/0.008/0.078/0.991 | 0.255/0.081/0.211/0.875 | 0.047/0.002/0.047/0.996
SlumpTest | 0.252/0.073/0.234/0.742 | 0.142/0.031/0.142/0.750 | 0.214/0.058/0.200/0.729 | 0.188/0.052/0.179/0.667
Sonar | 0.187/0.041/0.158/0.831 | 0.169/0.042/0.157/0.852 | 0.184/0.039/0.152/0.835 | 0.163/0.036/0.153/0.873
batch_size = 2
BostonHousing | 0.207/0.049/0.168/0.825 | 0.195/0.052/0.176/0.781 | 0.264/0.074/0.219/0.773 | 0.194/0.039/0.166/0.905
BreastCancer | 0.294/0.101/0.265/0.759 | 0.164/0.034/0.151/0.980 | 0.295/0.109/0.247/0.823 | 0.247/0.064/0.236/0.973
Credit | 0.319/0.142/0.271/0.667 | 0.273/0.099/0.250/0.722 | 0.165/0.029/0.132/0.903 | 0.330/0.149/0.315/0.600
Glass | 0.189/0.046/0.158/0.791 | 0.109/0.021/0.107/0.909 | 0.167/0.035/0.140/0.975 | 0.120/0.022/0.097/0.833
Haberman | 0.247/0.078/0.226/0.827 | 0.144/0.029/0.128/0.812 | 0.197/0.039/0.151/0.889 | 0.281/0.079/0.238/0.687
Iris | 0.211/0.058/0.195/0.567 | 0.094/0.009/0.091/0.997 | 0.242/0.064/0.222/0.750 | 0.190/0.036/0.181/0.750
SlumpTest | 0.280/0.086/0.250/0.561 | 0.260/0.074/0.248/0.843 | 0.287/0.100/0.263/0.765 | 0.208/0.044/0.203/0.667
Sonar | 0.201/0.051/0.173/0.822 | 0.164/0.038/0.155/0.876 | 0.178/0.048/0.148/0.853 | 0.115/0.017/0.106/0.894
batch_size = 4
BostonHousing | 0.223/0.056/0.183/0.785 | 0.203/0.049/0.166/0.859 | 0.233/0.067/0.168/0.868 | 0.192/0.062/0.183/0.900
BreastCancer | 0.269/0.086/0.238/0.862 | 0.184/0.050/0.164/0.960 | 0.229/0.072/0.202/0.855 | 0.115/0.019/0.112/0.997
Credit | 0.301/0.110/0.261/0.739 | 0.294/0.129/0.273/0.785 | 0.247/0.075/0.220/0.828 | 0.231/0.069/0.187/0.906
Glass | 0.189/0.051/0.149/0.799 | 0.145/0.034/0.138/0.913 | 0.123/0.019/0.095/0.913 | 0.109/0.019/0.096/0.900
Haberman | 0.261/0.090/0.217/0.750 | 0.111/0.017/0.107/0.993 | 0.243/0.064/0.207/0.714 | 0.181/0.047/0.164/0.818
Iris | 0.287/0.111/0.215/0.653 | 0.335/0.207/0.294/0.573 | 0.417/0.249/0.316/0.667 | 0.006/0.001/0.006/0.999
SlumpTest | 0.278/0.085/0.253/0.663 | 0.193/0.046/0.192/0.750 | 0.203/0.043/0.182/0.767 | 0.194/0.047/0.188/0.997
Sonar | 0.192/0.042/0.163/0.795 | 0.201/0.051/0.183/0.783 | 0.192/0.041/0.165/0.772 | 0.164/0.035/0.158/0.906
batch_size = 8
BostonHousing | 0.204/0.059/0.177/0.831 | 0.211/0.052/0.174/0.851 | 0.232/0.059/0.195/0.822 | 0.192/0.062/0.183/0.900
BreastCancer | 0.257/0.081/0.225/0.846 | 0.208/0.051/0.189/0.976 | 0.280/0.089/0.242/0.813 | 0.200/0.040/0.176/0.977
Credit | 0.300/0.123/0.270/0.743 | 0.219/0.073/0.203/0.776 | 0.233/0.081/0.218/0.885 | 0.216/0.059/0.165/0.860
Glass | 0.146/0.030/0.121/0.860 | 0.120/0.022/0.113/0.833 | 0.188/0.047/0.143/0.814 | 0.206/0.059/0.158/0.887
Haberman | 0.213/0.051/0.181/0.779 | 0.262/0.075/0.242/0.750 | 0.225/0.053/0.175/0.818 | 0.254/0.097/0.227/0.875
Iris | 0.313/0.129/0.244/0.619 | 0.267/0.081/0.252/0.750 | 0.192/0.037/0.157/0.878 | 0.059/0.004/0.056/0.999
SlumpTest | 0.216/0.053/0.193/0.652 | 0.165/0.034/0.162/0.569 | 0.208/0.049/0.174/0.589 | 0.175/0.041/0.170/0.573
Sonar | 0.190/0.044/0.161/0.825 | 0.187/0.057/0.167/0.763 | 0.210/0.055/0.182/0.785 | 0.189/0.060/0.182/0.769
Table A6. Imputation performance and computational efficiency across datasets: additional experiment.

Dataset | Samples | Features | Size | RMSE | MSE | MAE | Accuracy | Time (s) | Batch
Credit-G | 1000 | 21 | 21,000 | 0.726 | 0.959 | 0.607 | 0.609 | 685.32 | 4
Airfoil | 1503 | 6 | 9018 | 0.260 | 0.075 | 0.213 | 0.756 | 1117.12 | 4
Building Energy | 768 | 10 | 7680 | 0.329 | 0.115 | 0.296 | 0.526 | 507.49 | 4
Bank Marketing | 2000 | 17 | 34,000 | 0.611 | 1.227 | 0.476 | 0.757 | 1653.82 | 4
Phishing Websites | 1500 | 31 | 46,500 | 0.383 | 0.155 | 0.320 | 0.744 | 699.03 | 4
Optical Digits | 1000 | 65 | 65,000 | 0.232 | 0.075 | 0.199 | 0.800 | 836.44 | 4
Spambase | 1750 | 58 | 101,500 | 0.073 | 0.011 | 0.042 | 0.986 | 1248.70 | 4
Note: Size = Samples × Features. All experiments conducted with batch size = 4.

References

  1. Hameed, W.M.; Ali, N.A. Missing value imputation techniques: A survey. UHD J. Sci. Technol. 2023, 7, 72–81. [Google Scholar] [CrossRef]
  2. Gan, Q.; Gong, L.; Hu, D.; Jiang, Y.; Ding, X. A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset. Sensors 2023, 23, 8678. [Google Scholar] [CrossRef] [PubMed]
  3. Hung, C.-Y.; Jiang, B.C.; Wang, C.-C. Evaluating Machine Learning Classification Using Sorted Missing Percentage Technique Based on Missing Data. Appl. Sci. 2020, 10, 4920. [Google Scholar] [CrossRef]
  4. Eid, M.M.; ElDahshan, K.; Abouali, A.H.; Tharwat, A. Using Optimization Algorithms for Effective Missing-Data Imputation: A Case Study of Tabular Data Derived from Video Surveillance. Algorithms 2025, 18, 119. [Google Scholar] [CrossRef]
  5. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 140. [Google Scholar] [CrossRef]
  6. Leek, J.T.; Scharpf, R.B.; Bravo, H.C.; Simcha, D.; Langmead, B.; Johnson, W.E.; Geman, D.; Baggerly, K.; Irizarry, R.A. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010, 11, 733–739. [Google Scholar] [CrossRef]
  7. Zhang, Y.; Li, M.; Wang, S.; Dai, S.; Luo, L.; Zhu, E.; Xu, H.; Zhu, X.; Yao, C.; Zhou, H. Gaussian mixture model clustering with incomplete data. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–14. [Google Scholar] [CrossRef]
  8. Alruhaymi, A.Z.; Kim, C.J. Study on the missing data mechanisms and imputation methods. Open J. Stat. 2021, 11, 477–492. [Google Scholar] [CrossRef]
  9. Mensah, C.; Klein, J.; Bhulai, S.; Hoogendoorn, M.; Van der Mei, R. Detecting fraudulent bookings of online travel agencies with unsupervised machine learning. In Advances and Trends in Artificial Intelligence. From Theory to Practice, Proceedings of the 32nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2019, Graz, Austria, 9–11 July 2019; Wotawa, F., Pill, I., Koitz-Hristov, R., Friedrich, G., Ali, M., Eds.; Springer: Cham, Switzerland, 2019; pp. 334–346. [Google Scholar] [CrossRef]
  10. Getz, K.; Hubbard, R.A.; Linn, K.A. Performance of multiple imputation using modern machine learning methods in electronic health records data. Epidemiology 2023, 34, 206–215. [Google Scholar] [CrossRef]
  11. Seu, K.; Kang, M.; Lee, H. An intelligent missing data imputation techniques: A review. JOIV Int. J. Inf. Vis. 2022, 6, 278–283. [Google Scholar] [CrossRef]
  12. Mattei, P.A.; Frellsen, J. MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 9–15 June 2019; pp. 4413–4423. Available online: https://proceedings.mlr.press/v97/mattei19a.html (accessed on 11 September 2024).
  13. Choi, E.; Bahadori, M.T.; Schuetz, A.; Stewart, W.F.; Sun, J. Using recurrent neural network models for early detection of heart failure onset. J. Am. Med. Inform. Assoc. 2017, 24, 361–370. [Google Scholar] [CrossRef] [PubMed]
  14. Beaulieu-Jones, B.K.; Moore, J.H. Missing data imputation in the electronic health record using deeply learned autoencoders. Pac. Symp. Biocomput. 2017, 22, 207–218. [Google Scholar] [CrossRef] [PubMed]
  15. Huque, M.H.; Carlin, J.B.; Simpson, J.A.; Lee, K.J. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med. Res. Methodol. 2018, 18, 168. [Google Scholar] [CrossRef] [PubMed]
  16. Yoon, J.; Jordon, J.; Van der Schaar, M. GAIN: Missing Data Imputation Using Generative Adversarial Nets. Proc. Mach. Learn. Res. 2018, 80, 5689–5698. Available online: http://proceedings.mlr.press/v80/yoon18a.html (accessed on 1 November 2024).
  17. McCoy, J.T.; Kroon, S.; Auret, L. Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-Pap. OnLine 2018, 51, 141–146. [Google Scholar] [CrossRef]
  18. Goh, W.W.B.; Hui, H.W.H.; Wong, L. How missing value imputation is confounded with batch effects and what you can do about it. Drug Discov. Today 2023, 28, 103661. [Google Scholar] [CrossRef]
  19. Lee, K.J.; Carlin, J.B.; Simpson, J.A.; Moreno-Betancur, M. Assumptions and analysis planning in studies with missing data in multiple variables: Moving beyond the MCAR/MAR/MNAR classification. Int. J. Epidemiol. 2023, 52, 1268–1275. [Google Scholar] [CrossRef]
  20. Seaman, S.R.; Galati, J.C.; Jackson, D.; Carlin, J.B. What is meant by “missing at random”? Stat. Sci. 2013, 28, 257–268. [Google Scholar] [CrossRef]
  21. Allison, P.D. Missing Data; Sage Publications: Thousand Oaks, CA, USA, 2002. [Google Scholar] [CrossRef]
  22. Aleryani, A.; Wang, W.; De la Iglesia, B. Multiple imputation ensembles (MIE) for dealing with missing data. SN Comput. Sci. 2020, 1, 134. [Google Scholar] [CrossRef]
  23. Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377. [Google Scholar] [CrossRef]
  24. Pereira, R.C.; Abreu, P.H.; Rodrigues, P.P. Partial Multiple Imputation with Variational Autoencoders: Tackling Not at Randomness in Healthcare Data. IEEE J. Biomed. Health Inform. 2022, 26, 4218–4227. [Google Scholar] [CrossRef] [PubMed]
  25. Woods, A.D.; Gerasimova, D.; Van Dusen, B.; Nissen, J.; Bainter, S.; Uzdavines, A.; Davis-Kean, P.E.; Halvorson, M.; King, K.M.; Logan, J.A.R.; et al. Best practices for addressing missing data through multiple imputation. Infant Child Dev. 2023, 33, e2407. [Google Scholar] [CrossRef]
  26. Bridge-Nduwimana, C.B.; El Ouaazizi, A.; Benyakhlef, M. A New Data Imputation Technique for Efficient Used Car Price Forecasting. Int. J. Electr. Comp. Eng. 2025, 15, 2364–2371. [Google Scholar] [CrossRef]
  27. Sterne, J.A.C.; White, I.R.; Carlin, J.B.; Spratt, M.; Royston, P.; Kenward, M.G.; Wood, A.M.; Carpenter, J.R. Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. BMJ 2009, 338, b2393. [Google Scholar] [CrossRef]
Figure 1. Complete data imputation pipeline from the original dataset through preprocessing, missing data generation, and transformer-based imputation.
Figure 2. Transformer–CNN hybrid architecture.
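The hybrid architecture of Figure 2 (a CNN-based autoencoder whose latent representation is contextualized by a Transformer encoder) can be sketched as follows. This is an illustrative reconstruction only, not the authors' implementation: the layer widths, kernel sizes, head counts, and depth are assumptions.

```python
import torch
import torch.nn as nn

class TransformerCNNImputer(nn.Module):
    """Sketch of a CNN encoder -> Transformer context -> CNN decoder imputer.

    All hyperparameters here are illustrative assumptions, not the paper's
    reported settings.
    """
    def __init__(self, n_features, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # 1-D convolution treats each sample's feature vector as a signal
        # of length n_features with a single input channel.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Conv1d(d_model, 1, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (batch, n_features)
        h = self.encoder(x.unsqueeze(1))   # (batch, d_model, n_features)
        # Transformer attends across feature positions in the latent space.
        h = self.context(h.transpose(1, 2)).transpose(1, 2)
        return self.decoder(h).squeeze(1)  # reconstructed (batch, n_features)

model = TransformerCNNImputer(n_features=14)   # e.g., BostonHousing width
out = model(torch.zeros(8, 14))
print(out.shape)  # torch.Size([8, 14])
```

Missing entries would be filled from the reconstruction while observed entries are kept; repeating the pass and averaging the latent representations yields the multiple-imputation behavior described in the abstract.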
Figure 3. RMSE performance across five different datasets. Yellow-edged circles indicate, for each dataset, the lowest RMSE achieved among all methods.
Figure 4. RMSE distribution under MCAR and MNAR mechanisms.
Figure 5. Mean RMSE by missingness mechanism.
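Figures 4 and 5 compare RMSE under the two missingness mechanisms. For reference, MCAR injection at the 10% rate used in the experiments can be sketched as below; the function name and pure-Python representation are illustrative, not the authors' code.

```python
import random

random.seed(0)  # reproducible mask

def inject_mcar(rows, rate=0.10):
    """Replace each cell with None independently with probability `rate` (MCAR)."""
    masked, n_missing = [], 0
    for row in rows:
        new_row = []
        for value in row:
            if random.random() < rate:   # deletion is independent of the data
                new_row.append(None)
                n_missing += 1
            else:
                new_row.append(value)
        masked.append(new_row)
    return masked, n_missing

# A BostonHousing-sized matrix: 506 samples x 14 features
data = [[float(j) for j in range(14)] for _ in range(506)]
masked, n_missing = inject_mcar(data)
print(n_missing / (506 * 14))  # close to 0.10 by construction
```

An MNAR mask differs in that the deletion probability depends on the (unobserved) value itself, e.g. preferentially removing large values, which is why the two mechanisms are reported separately.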
Figure 6. Mean RMSE trajectories by batch size and missingness mechanism.
Figure 7. RMSE distributions by batch size across missingness mechanisms.
Figure 8. Dataset-specific RMSE trajectories across batch sizes.
Table 1. Characteristics of the datasets.
Dataset          Size (Samples, Features)    Original Missingness
BostonHousing    (506, 14)                   0
BreastCancer     (699, 11)                   0
Credit           (400, 12)                   0
Glass            (213, 11)                   0
Haberman         (305, 4)                    0
Iris             (150, 5)                    0
SlumpTest        (103, 11)                   0
Sonar            (208, 61)                   0
Table 2. Comparison of our imputation technique against state-of-the-art methods (RMSE).
Imputation Method    Boston    Glass    Sonar    BreastCancer    Credit
Our                  0.214     0.146    0.187    0.131           0.325
MIDA                 0.232     0.221    0.174    0.170           0.353
GAIN                 0.201     0.189    0.216    0.233           0.319
DAE                  0.520     0.240    0.650    0.142           0.487
MICE                 0.690     0.300    0.223    0.132           0.530
Note: Values in bold mark the lowest RMSE for each dataset.
Table 3. Analysis of comparative performance that highlights the advantages of being batch-aware.
                 Absolute RMSE Results       Relative Improvements (%)
Dataset     Ours     GAIN     MIDA     vs. GAIN     vs. MIDA     Batch Level
Glass       0.146    0.189    0.221    +22.8 **     +33.9 **     High
Breast      0.131    0.233    0.170    +43.8 **     +22.9 **     Low
Boston      0.214    0.201    0.232    −6.5         +7.8         Medium
Credit      0.325    0.319    0.353    −1.9         +7.9         Medium
Sonar       0.187    0.216    0.174    +13.4        −7.5         Low
Multi-source datasets with documented heterogeneous origins (Glass: forensic labs; Credit: institutions). ** Statistically and practically significant improvements.
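The relative-improvement columns of Table 3 follow directly from the absolute RMSE values; the helper name below is illustrative:

```python
# Relative improvement over a baseline: 100 * (baseline - ours) / baseline,
# so positive values mean our method has lower RMSE than the baseline.
def rel_improvement(ours_rmse, baseline_rmse):
    return round(100 * (baseline_rmse - ours_rmse) / baseline_rmse, 1)

print(rel_improvement(0.146, 0.189))  # Glass vs. GAIN  -> 22.8
print(rel_improvement(0.146, 0.221))  # Glass vs. MIDA  -> 33.9
print(rel_improvement(0.214, 0.201))  # Boston vs. GAIN -> -6.5
```

This reproduces, for example, the +22.8% Glass-vs-GAIN entry and the −6.5% Boston-vs-GAIN entry, where GAIN remains slightly ahead.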
Table 4. Computational efficiency analysis across batch sizes under the MCAR-Uniform mechanism.
Dataset          Rows    Features    Size      Batch 8    Batch 4    Batch 2    Batch 1
BostonHousing    506     14          7084      218.99     243.58     583.10     1083.01
BreastCancer     699     11          7689      306.38     333.77     832.38     1466.47
Credit           400     12          4800      171.15     202.62     327.93     809.93
Glass            213     11          2343      95.49      116.73     172.35     396.07
Haberman         305     4           1220      137.17     154.32     240.67     601.72
Iris             150     5           750       60.40      87.59      133.73     284.76
SlumpTest        103     11          1133      47.39      64.43      85.43      204.13
Sonar            208     61          12,688    93.51      106.64     174.66     388.80
Mean                                           141.31     163.71     318.78     654.36
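The Mean row of Table 4 is the simple average of the per-dataset times (assumed here to be runtimes in seconds; the unit is not stated in this excerpt) at each batch size, and can be checked directly:

```python
from statistics import mean

# Per-dataset times from Table 4, keyed by batch size (column order:
# BostonHousing, BreastCancer, Credit, Glass, Haberman, Iris, SlumpTest, Sonar).
runtimes = {
    8: [218.99, 306.38, 171.15, 95.49, 137.17, 60.40, 47.39, 93.51],
    4: [243.58, 333.77, 202.62, 116.73, 154.32, 87.59, 64.43, 106.64],
    2: [583.10, 832.38, 327.93, 172.35, 240.67, 133.73, 85.43, 174.66],
    1: [1083.01, 1466.47, 809.93, 396.07, 601.72, 284.76, 204.13, 388.80],
}
for batch_size, times in runtimes.items():
    print(batch_size, round(mean(times), 2))  # matches the Mean row
```

The roughly 4.6× increase in mean time from batch size 8 (141.31) to batch size 1 (654.36) reflects the per-step overhead of small batches.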
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bridge-Nduwimana, C.B.; El Harrauss, S.E.; El Ouaazizi, A.; Benyakhlef, M. A Tabular Data Imputation Technique Using Transformer and Convolutional Neural Networks. Big Data Cogn. Comput. 2025, 9, 321. https://doi.org/10.3390/bdcc9120321

