Multiple Imputation of a Continuous Outcome with Fully Observed Predictors Using TabPFN
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The submitted manuscript investigates the performance of a transformer-based, pretrained foundation model (TabPFN) as an imputation model under nonlinear relationships. The work is promising and yields interesting simulation results. A few comments follow:
1. 39: It should be noted that FCS typically does not ensure compatible distributions when nonlinear relationships are involved, unless under specific data constellations. This issue does not occur when a joint imputation model is used.
2. 50ff.: The CART and boosting approaches still rely on FCS, correct? If so, the same compatibility issue would apply (see Comment 1).
3. Section 2: This section requires a thorough revision. I was not previously familiar with the TabPFN approach, but was unable to obtain an intuitive understanding of the method from the current description. How can relationships in other datasets be used in TabPFN for imputations in a specific dataset? What is the general idea or mechanism that makes this approach work?
4. Simulation outcomes: The mean is likely the simplest statistic to consider. The author should also examine the standard deviation and regression models (Y ~ U1 and Y ~ U1 + U1^2; applied to all data-generating models) to obtain a more comprehensive comparison.
5. Simulation: It would also be useful to include a larger sample size, such as 1000 or 2000.
6. 206: It is unclear what is meant by “90% Wilson confidence interval”. Please provide a reference and the exact computation formula. Alternatively, is this simply the ordinary confidence interval with t-quantiles? If so, a new label would not be necessary.
7. Simulation: Using only 250 replications might be too few for a reliable assessment of coverage.
8. Results: This is partly a matter of preference, but tables are generally preferable to figures. In particular, the complete case estimator could be excluded, as it can be expected to be biased. Moreover, it is important to directly compare the different estimators. Therefore, ordering the estimators according to the different sample sizes is not ideal; instead, the estimators should appear in neighboring rows for the same sample size. In a table, it would be sufficient to report bias and standard deviation.
9. Figure 2: This figure could be excluded, as the exact size of the estimated standard errors is not directly relevant; they are only used for the confidence interval coverage assessment.
10. Figure 3: Present coverage rates with one digit after the decimal in a table.
11. Section 5: Provide a statistical significance test (e.g., via bootstrap) to assess whether, and which, imputation methods differ from each other.
12. Reference [16]: Include “Research Square” as the preprint location, as is ordinarily done. Also include “arXiv” for reference [20]. Preprints should be cited in the same way as ordinary journal articles.
Author Response
Comment 1:
39: It should be noted that FCS typically does not ensure compatible distributions when nonlinear relationships are involved, unless under specific data constellations. This issue does not occur when a joint imputation model is used.
Response 1:
We thank the reviewer for pointing out this important aspect of fully conditional specification (FCS). We agree that FCS may lead to incompatible conditional models when nonlinear relationships or interactions are present, unless the specified conditional models correspond to a coherent joint distribution. In contrast, joint modeling approaches ensure compatibility by construction. To clarify this point, we revised the description of FCS in the manuscript and added a short discussion of the compatibility issue together with appropriate references.
Change 1:
Original text:
“This challenge is particularly relevant when several variables are partially observed, a setting commonly addressed using fully conditional specification (FCS) algorithms, such as those implemented in the widely used mice (Multivariate Imputation by Chained Equations) R package [6]. FCS MI proceeds by iteratively imputing each incomplete variable using a univariate conditional model given the current values of all other variables. While this framework is flexible and has proven effective in many applied settings, it requires careful specification of functional forms, transformations, and interactions in each conditional model to ensure compatibility with the substantive analysis model. This task becomes increasingly challenging in high-dimensional settings or when relationships between variables exhibit complex dependency structures [7].”
Revised text (48):
“This challenge is particularly relevant when several variables are partially observed, a setting commonly addressed using fully conditional specification (FCS) algorithms, such as those implemented in the widely used mice (Multivariate Imputation by Chained Equations) R package [15]. FCS MI proceeds by iteratively imputing each incomplete variable using a univariate conditional model given the current values of all other variables, thereby implicitly defining a joint distribution. However, unless the conditional models correspond to a coherent joint distribution, incompatibilities may arise, particularly when nonlinear relationships or interactions are present [16]. In many applications, scientific interest focuses on parameters that are more remote from the full joint distribution, such as regression coefficients or prevalence estimates. In such settings, the joint distribution primarily acts as a working model for generating imputations, and empirical studies suggest that FCS performs well in many applied contexts [16, Chapter 4.5]. A key advantage of FCS compared with joint modeling approaches is its flexibility, as it allows the imputation model for each variable to be specified separately. Nevertheless, this flexibility requires careful specification of functional forms, transformations, and interactions in each conditional model to ensure congeniality with the substantive analysis model [17]. This task becomes increasingly challenging in high-dimensional settings or when relationships between variables exhibit complex dependency structures [18].”
Comment 2:
50ff.: The CART and boosting approaches still rely on FCS, correct? If so, the same compatibility issue would apply (see Comment 1).
Response 2:
We thank the reviewer for this observation. Yes, CART-based imputation as implemented in mice is embedded within the fully conditional specification (FCS) framework, and therefore the compatibility considerations mentioned in Comment 1 apply in the same way. This point is now discussed in the revised manuscript in the context of FCS.
Change 2:
See Change 1.
Comment 3:
Section 2: This section requires a thorough revision. I was not previously familiar with the TabPFN approach, but was unable to obtain an intuitive understanding of the method from the current description. How can relationships in other datasets be used in TabPFN for imputations in a specific dataset? What is the general idea or mechanism that makes this approach work?
Response 3:
We thank the reviewer for this comment. We agree that the intuitive motivation of TabPFN can be stated more explicitly, especially for readers not previously familiar with the method. We therefore revised Section 2.1 to clarify that TabPFN does not directly transfer empirical relationships from other datasets to the dataset at hand. Rather, it follows a Bayesian-style philosophy: through pretraining on a large and diverse collection of synthetic datasets generated from many plausible data-generating processes, the model learns a broad prior over tabular prediction problems. When applied to a new dataset, this prior is combined with the observed complete cases in that dataset to produce predictive distributions for the missing values. We now explain more clearly that this acts as an inductive bias, analogous to the structural assumptions imposed by conventional parametric models, and that such bias may be beneficial in small-sample settings where more flexible methods can be unstable.
Change 3:
Original text:
“This distribution of data-generating processes – referred to by the authors as the prior – is designed to favor parsimonious structures while still encompassing a broad spectrum of realistic relationships commonly encountered in applied data analysis.”
Revised text (124):
“This distribution of data-generating processes – referred to by the authors as the prior – reflects a broad but structured set of assumptions about plausible relationships in tabular data. Thus, rather than learning a dataset-specific model from scratch, TabPFN performs prediction under a learned prior that can be especially useful in small-sample settings.”
Original text:
“Through this pretraining, TabPFN learns to approximate predictive distributions across diverse data scenarios without requiring dataset-specific retraining.”
Revised text:
“Through this pretraining, TabPFN learns to approximate Bayesian-style predictive inference across diverse data scenarios without requiring dataset-specific retraining.”
New paragraph (130):
“Intuitively, TabPFN does not reuse specific relationships from external datasets for the dataset at hand. Instead, through pretraining on many synthetic datasets drawn from a broad range of plausible data-generating processes, it learns a general prior over how variables in tabular data may relate to one another. When applied to a new dataset, this prior is combined with the observed complete cases in that dataset to generate predictive distributions for missing values. In this sense, TabPFN imposes an inductive bias, much like a conventional regression model imposes structural assumptions through its functional form. Such bias can be advantageous in small-sample settings, where highly flexible methods may be unstable despite being asymptotically less restrictive.”
Comment 4:
Simulation outcomes: The mean is likely the simplest statistic to consider. The author should also examine the standard deviation and regression models (Y ~ U1 and Y ~ U1 + U1^2; applied to all data-generating models) to obtain a more comprehensive comparison.
Response 4:
We thank the reviewer for this helpful suggestion. We agree that additional estimands such as the standard deviation or regression coefficients could provide further insights into the behavior of the imputation methods. In the present study, we focused on the marginal mean as the target estimand because it represents the primary estimand considered in Little and An (2004), on which our simulation design is based. Furthermore, although straightforward to interpret, valid estimation of the marginal mean requires correct recovery of the marginal distribution of the outcome in the presence of missing data, which is not necessarily required for regression coefficients when missingness occurs only in the outcome variable. For this reason, we consider the marginal mean a suitable estimand to evaluate the ability of imputation methods to properly propagate uncertainty due to missingness. Exploring additional estimands would certainly be an interesting extension but was beyond the scope of the present study.
Change 4:
Section 3.1
Original text:
“We considered a single partially observed outcome Y, with the estimand of interest being the marginal mean µ = E(Y).”
Revised text (200):
“We considered a single partially observed outcome Y, with the estimand of interest being the marginal mean µ = E(Y), a quantity that depends on correctly recovering the marginal distribution of Y under missing data [16, Chapter 2.7].”
Comment 5:
Simulation: It would also be useful to include a larger sample size, such as 1000 or 2000.
Response 5:
We thank the reviewer for this suggestion. While including larger sample sizes such as 1000 or 2000 could indeed provide additional insights, the primary motivation of this work was to evaluate imputation performance in relatively small to moderate sample sizes, which are common in many applied research settings such as randomized controlled trials. In such settings, highly flexible machine learning approaches often struggle due to limited data availability, which is precisely where pretrained models such as TabPFN may offer an advantage.
Extending the simulation study to substantially larger sample sizes would therefore address a somewhat different research question. We agree that this could be an interesting direction for future work and have noted this point as a potential extension.
Comment 6:
206: It is unclear what is meant by “90% Wilson confidence interval”. Please provide a reference and the exact computation formula. Alternatively, is this simply the ordinary confidence interval with t-quantiles? If so, a new label would not be necessary.
Response 6:
We thank the reviewer for pointing this out. We agree that the description of the 90% Wilson confidence interval was not sufficiently clear. In the revised manuscript, we now explicitly state the formula used for the Wilson interval and added an appropriate reference.
Change 6:
Section 3.2:
Original text:
“Uncertainty in the estimated 95% coverage rate was quantified using a 90% Wilson confidence interval.”
Revised text (261):
“Uncertainty in the estimated 95% coverage rate was quantified using a 90% Wilson confidence interval for binomial proportions [44]. More specifically, for a coverage rate $\hat{p}$ based on $N_{\mathrm{sim}}$ simulation replications, the interval is computed as
\[
\frac{\hat{p} + \dfrac{z^2}{2N_{\mathrm{sim}}} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{N_{\mathrm{sim}}} + \dfrac{z^2}{4N_{\mathrm{sim}}^2}}}{1 + \dfrac{z^2}{N_{\mathrm{sim}}}} \qquad (1)
\]
where $z = z_{1-\alpha/2}$ denotes the corresponding quantile of the standard normal distribution.”
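For readers who want to check the interval numerically, the following is a minimal Python sketch of the Wilson formula quoted above. The function name wilson_interval and its arguments are illustrative and are not taken from the manuscript or its code; the example values (an observed coverage of 0.95 from 250 replications) are assumptions chosen to match the simulation setting discussed in this review.

```python
import math
from statistics import NormalDist


def wilson_interval(p_hat, n_sim, level=0.90):
    """Wilson score interval for a binomial proportion.

    p_hat : observed proportion (here: an estimated coverage rate)
    n_sim : number of simulation replications
    level : confidence level of the interval (0.90 in the quoted text)
    """
    # z is the (1 - alpha/2) quantile of the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    centre = p_hat + z**2 / (2 * n_sim)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_sim + z**2 / (4 * n_sim**2))
    denom = 1 + z**2 / n_sim
    return (centre - half_width) / denom, (centre + half_width) / denom


# Illustrative values only: observed coverage 0.95 from 250 replications
lower, upper = wilson_interval(0.95, 250)
print(f"90% Wilson interval: ({lower:.3f}, {upper:.3f})")
```

With these assumed inputs the interval runs from roughly 0.92 to 0.97. Unlike the ordinary Wald interval, it is not symmetric around the observed proportion.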
Comment 7:
Simulation: Using only 250 replications might be too few for a reliable assessment of coverage.
Response 7:
We thank the reviewer for this comment. We agree that a larger number of simulation replications would generally improve the precision of coverage estimates. However, the present simulation study evaluates multiple experimental factors and competing methods, which results in substantial computational cost. We therefore chose 250 replications as a compromise between the number of simulation scenarios and the number of repetitions per scenario. Since the study focuses on descriptive comparison of bias, standard error behavior, and coverage patterns rather than formal hypothesis testing, we consider this number sufficient to illustrate the relative performance of the methods under the considered settings.
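The trade-off discussed in this exchange can be made concrete with the usual binomial Monte Carlo standard error of a coverage estimate, sqrt(p(1 − p)/Nsim). The sketch below is an illustration only and is not part of the manuscript; the function names and the target precision of 0.005 are assumptions.

```python
import math


def coverage_mcse(coverage, n_sim):
    """Monte Carlo standard error of an estimated coverage rate."""
    return math.sqrt(coverage * (1 - coverage) / n_sim)


def n_sim_for_mcse(coverage, target_mcse):
    """Replications needed to reach a target Monte Carlo standard error."""
    return math.ceil(coverage * (1 - coverage) / target_mcse**2)


# Precision of a nominal 95% coverage estimate with 250 replications
print(f"MCSE at Nsim = 250: {coverage_mcse(0.95, 250):.4f}")

# Replications needed for an (assumed) target MCSE of 0.005
print(f"Nsim for MCSE 0.005: {n_sim_for_mcse(0.95, 0.005)}")
```

Under these assumptions, 250 replications give a Monte Carlo standard error of about 1.4 percentage points for a true coverage of 95%, and reducing it to 0.5 percentage points would take about 1,900 replications.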
Comment 8:
Results: This is partly a matter of preference, but tables are generally preferable to figures. In particular, the complete case estimator could be excluded, as it can be expected to be biased. Moreover, it is important to directly compare the different estimators. Therefore, ordering the estimators according to the different sample sizes is not ideal; instead, the estimators should appear in neighboring rows for the same sample size. In a table, it would be sufficient to report bias and standard deviation.
Response 8:
We thank the reviewer for this suggestion. The figures were intentionally designed to combine graphical comparison with numerical information, allowing readers to quickly assess performance patterns across simulation settings while still providing the exact values typically reported in tables. We believe this presentation facilitates interpretation across the multiple experimental conditions considered in the simulation study.
Regarding the complete case estimator, we agree that it is expected to be biased under MAR. However, we retained it as a benchmark because it illustrates the potential consequences of ignoring missing data and provides a useful reference point when evaluating imputation methods. Including such a benchmark is also recommended in guidelines for simulation studies of multiple imputation methods (e.g., Oberman and Vink, 2023).
Comment 9:
Figure 2: This figure could be excluded, as the exact size of the estimated standard errors is not directly relevant; they are only used for the confidence interval coverage assessment.
Response 9:
We thank the reviewer for this suggestion. We agree that coverage is the primary metric for assessing inferential validity in this setting. However, estimated standard errors (or equivalently confidence interval width) are also commonly reported in simulation studies of multiple imputation methods as a measure of statistical efficiency (see, e.g., Oberman and Vink, 2023). Reporting the estimated standard errors therefore provides complementary information that helps interpret differences in coverage and bias across methods. For this reason, we decided to retain Figure 2.
Comment 10:
Figure 3: Present coverage rates with one digit after the decimal in a table.
Response 10:
We thank the reviewer for this suggestion. While tables can be useful for presenting exact numerical values, we chose to present coverage rates graphically to facilitate comparison across the different simulation settings and estimators. The figure allows readers to quickly assess deviations from nominal coverage across methods and scenarios, which would be more difficult to visualize in a table. For this reason, we decided to retain the graphical presentation.
Comment 11:
Section 5: Provide a statistical significance test (e.g., via bootstrap) to assess whether, and which, imputation methods differ from each other.
Response 11:
We thank the reviewer for this thoughtful suggestion. While formal statistical tests could in principle be used to compare the resulting estimates, the purpose of the case study is primarily illustrative, namely to demonstrate how the different imputation approaches behave in a practical applied setting. Since the methods considered represent different estimation procedures rather than stochastic populations, a formal hypothesis test comparing the resulting estimates would be difficult to interpret and would not necessarily provide additional insight into their relative performance.
Instead, we report point estimates and corresponding measures of uncertainty for each method, allowing readers to assess the practical magnitude of the differences between approaches. We therefore decided not to introduce an additional statistical significance test.
Comment 12:
Reference [16]: Include “Research Square” as the preprint location, as is ordinarily done. Also include “arXiv” for reference [20]. Preprints should be cited in the same way as ordinary journal articles.
Response 12:
We thank the reviewer for pointing this out. The references have been revised to include the preprint locations (“Research Square” and “arXiv”) and are now formatted consistently with the citation style used for journal articles.
Change 12:
Added “arXiv” for all preprints currently hosted on arXiv and “Research Square” for the preprint hosted on Research Square.
Reviewer 2 Report
Comments and Suggestions for Authors
See file attached.
Comments for author File:
Comments.pdf
Author Response
Comment 1:
Misleading Title. The title is too general for what is being studied. Please revise to highlight that this is only about imputing univariate outcome variables with complete covariates. Moreover, even multiple imputation is confusing in the present form of your implementation.
Response 1:
We thank the reviewer for this helpful comment. We agree that the scope of the empirical evaluation should be stated more clearly. The simulation study in this paper focuses on imputing a single outcome variable conditional on fully observed covariates. To make this setting explicit, we clarified this point in the abstract and in the discussion section by stating that the predictors are assumed to be fully observed in the simulations.
However, the proposed TabPFN-based imputation procedure is not inherently restricted to imputing an outcome variable and can in principle be applied to other missing variables as well. For this reason, we prefer to retain the more general formulation in the title.
Change 1:
Original text:
“We conduct a simulation study focusing on univariate missingness in a continuous outcome, comparing TabPFN with standard MI methods.”
Revised Text (11):
“We conduct a simulation study focusing on univariate missingness in a continuous outcome with complete predictors, comparing TabPFN with standard MI methods.”
Original Text:
“Our simulation study focused on univariate missingness in a continuous outcome with continuous predictors.”
Revised Text (436):
“Our simulation study focused on univariate missingness in a continuous outcome with continuous, fully observed predictors.”
Comment 2:
Introduction. I liked the motivation through regulatory guidelines and the motivation wrt estimands; and that not only predictive imputation accuracy is of importance in biostatistics. I nevertheless believe that you can improve it wrt motivation (next comment) and literature review. For example, there exist several (even recent) review papers that compare different imputation methods for data in the life sciences, see, e.g., [4, 1, 3, 8], which should be cited as well to show the importance. Moreover, it should be highlighted that a large amount of literature focuses on imputation accuracy and not on uncertainty quantification (also wrt references). However, there exist a few that study uncertainty quantification, inference or even distributional accuracy [7, 6], also in ML approaches to imputation, and some even study a true combination with MI [2].
Response 2:
We thank the reviewer for these helpful suggestions and for pointing us to the additional references. We agree that the introduction can be strengthened by providing a broader overview of the existing literature on imputation methods, particularly in the life sciences. In the revised manuscript, we expanded the introduction accordingly by incorporating additional review and benchmarking studies and by more clearly distinguishing between work focusing on predictive accuracy and work addressing uncertainty quantification and inferential validity.
Change 2:
Original text:
“Quantitative analyses in health research, including clinical trials and observational studies, commonly assume that all variables are fully observed. In practice, however, missing data in covariates or outcomes are ubiquitous and, if not appropriately handled, can lead to biased estimates, loss of efficiency, and invalid statistical inference [1,2]. Regulatory guidance therefore, emphasizes the need for principled approaches to missing data that support valid estimation and uncertainty quantification, particularly for prespecified estimands in confirmatory analyses. Multiple imputation (MI) provides a widely accepted framework for addressing missing data under valid uncertainty quantification by generating multiple plausible completed datasets, analyzing each dataset separately, and combining results using Rubin’s rules to reflect both sampling variability and uncertainty due to missingness [3].”
Revised text (21):
“Quantitative analyses in health research, including clinical trials and observational studies, commonly assume that all variables are fully observed. In practice, however, missing data in covariates or outcomes are ubiquitous and can arise even in carefully designed randomized controlled trials (RCTs), for example when patients withdraw from follow-up and outcome measurements cannot be obtained. If not appropriately handled, missing data can lead to biased estimates, loss of efficiency, and invalid statistical inference [1,2]. Regulatory guidance therefore emphasizes the need for principled approaches to missing data that support valid estimation and uncertainty quantification, particularly for prespecified estimands in confirmatory analyses. More broadly, the literature on missing data methods is extensive, and identifying appropriate imputation strategies is of particular interest in the life sciences, where a large share of comparative studies has been conducted [3]. Illustrative examples include applications to electronic health records and clinical measurements [4], binary classification problems [5], and time series health data [6]. Beyond classical approaches, a growing body of work investigates machine learning and deep learning–based imputation methods in healthcare data [7], as well as methods designed for high-dimensional settings [8]. These studies consistently emphasize that no single method performs uniformly best, with performance depending strongly on the data structure, dimensionality, and the analysis objective. At the same time, much of the existing literature focuses on predictive performance metrics, whereas relatively fewer studies assess the ability of imputation methods to recover the underlying data distribution or support valid statistical inference [9–11]. This distinction is particularly relevant in applications where valid inference, rather than predictive accuracy alone, is the primary objective.”
Comment 3:
Motivation. Though I liked the writing style of the introduction, I missed a concrete motivation of the importance of imputing univariate outcomes when all predictors are available. From my own practical experience, that is a rather uncommon setting. Therefore, I encourage the author to motivate this better by giving practical examples where this was the case and at least mention this more explicitly at the onset.
Response 3:
We thank the reviewer for this helpful suggestion. While missing covariates are often discussed in the missing data literature, missing outcomes with fully observed predictors also arise in several applied settings. Examples include longitudinal studies with incomplete follow-up measurements, clinical trials with loss to follow-up in the outcome variable, and causal inference analyses where baseline covariates and treatment assignment are observed but the outcome is missing for some individuals. More generally, this setting is closely related to the causal inference framework, in which counterfactual outcomes are conceptually treated as missing data. Following the reviewer’s suggestion, we now motivate this setting in the introduction.
Change 3:
Original text:
“In practice, however, missing data in covariates or outcomes are ubiquitous and, if not appropriately handled, can lead to biased estimates, loss of efficiency, and invalid statistical inference [1,2]”
Revised text (22):
“In practice, however, missing data in covariates or outcomes are ubiquitous and can arise even in carefully designed randomized controlled trials (RCTs), for example when patients withdraw from follow-up and outcome measurements cannot be obtained. If not appropriately handled, missing data can lead to biased estimates, loss of efficiency, and invalid statistical inference [1,2].”
Comment 4:
Methods. I like the rather abstract style of the description of the method but expected a description of an implementation as in Rubin’s rules (similar to [2]). At the end, I realized that you don’t use Rubin’s rules at all for TabPFN. I nevertheless would have expected a more detailed discussion on this, especially explaining why not. Moreover, I missed the description of the competitors (especially for the concrete pmm) for your simulation study; a concrete comparison wrt what they can do and what not, and a reasoning why you did not study others. This would also help later when you discuss limitations. Finally, I was wondering whether you would like to mention the concrete package and functions you used for your implementations at the end of this section and also explain why you did not use AutoTabPFN.
Response 4:
We thank the reviewer for these helpful suggestions. First, we would like to clarify that Rubin’s rules are indeed used for TabPFN-based multiple imputation in the same way as for the other MI approaches considered in this study. After generating multiple imputed datasets using TabPFN-based predictive draws, the resulting estimates and standard errors are pooled using Rubin’s rules. We have clarified this point in Section 2.2 to avoid confusion. Second, following the reviewer’s suggestion, we expanded the description of the comparator methods and their implementation. In particular, we added a new subsection describing the competing imputation approaches, their main assumptions, and their implementation using the mice package. Third, we now report the concrete software packages and functions used for the implementations to improve reproducibility. Finally, regarding AutoTabPFN, we agree that this approach may also be of interest. However, the field of foundation models for tabular data is evolving rapidly, and in the present work we focused on the original TabPFN implementation as a computationally feasible baseline for a large simulation study. AutoTabPFN is expected to be substantially more computationally demanding, which would make its inclusion in the present simulation study challenging.
Change 4:
Original text:
“Thus, under the MAR (or MCAR) assumption, when all variables driving the missingness mechanism are observed and included in the imputation model, MIs can be generated by sampling from these discrete predictive distributions without specifying a parametric error distribution. For additional details, including feature pre-processing, acceleration, and post-hoc ensembling, we refer to Hollmann et al. [21].”
Revised text (174):
“Thus, under the MAR (or MCAR) assumption, when all variables driving the missingness mechanism are observed and included in the imputation model, MIs can be generated by sampling from these discrete predictive distributions without specifying a parametric error distribution. After generating m imputed datasets using TabPFN-based predictive draws, the resulting parameter estimates and standard errors are combined using Rubin’s rules in the same manner as in conventional MI procedures. For additional details, including feature pre-processing, acceleration, and post-hoc ensembling, we refer to Hollmann et al. [32].”
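The pooling step referred to in the revised text can be sketched as follows. This is a minimal illustration of Rubin’s rules for a scalar estimand, not the authors’ implementation: the function name pool_rubin and the toy inputs are assumptions, and the sketch omits the degrees-of-freedom calculation that is needed to build confidence intervals.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool m completed-data estimates of a scalar with Rubin's rules.

    estimates  : point estimates, one per imputed dataset
    std_errors : their standard errors, in the same order
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    u_bar = mean([se**2 for se in std_errors])   # within-imputation variance
    b = variance(estimates)                      # between-imputation variance (divisor m - 1)
    total_var = u_bar + (1 + 1 / m) * b          # total variance
    return q_bar, math.sqrt(total_var)


# Toy inputs only (hypothetical estimates of a mean from m = 3 imputed datasets)
pooled_est, pooled_se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(f"pooled estimate = {pooled_est:.3f}, pooled SE = {pooled_se:.3f}")
```

The between-imputation term is what carries the uncertainty due to missingness into the pooled standard error; if all imputed datasets gave the same estimate, the pooled standard error would reduce to the average completed-data standard error.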
New 3.2 Section (228):
“3.2. Comparator Multiple Imputation Methods
Comparator MI strategies were implemented using the mice package (version 3.19.0) [15] with default hyperparameters. Since missingness was limited to a single variable, the number of iterations (maxit) was set to 1, and the number of imputations to m = 20, consistent with common MI practice [16, Chapter 2.8]. To avoid unnecessary filtering of collinear predictors, the eps parameter was fixed at 0.
The comparator MI approaches used here comprised Predictive Mean Matching (PMM) [42], a semi-parametric method that selects donor values from observed cases and is the default for continuous variables in mice, and Classification and Regression Trees (CART) [19], a non-parametric recursive partitioning approach implemented in mice and previously studied for MI in Shah et al. [23] and Doove et al. [24]. CART was selected as a representative tree-based imputation method because previous studies have shown that it can perform competitively in MI settings while preserving interaction structures in the data [24].
For context, two non-imputation baselines were also analysed. The full-data estimate used the complete simulated dataset prior to introducing missingness, providing a benchmark for the best achievable performance. The complete case (CC) analysis included only observations without missing values in the target variable.”
Comment 5:
Simulation study. I usually recommend following the guidelines of [5] for conducting simulation studies and would also encourage the author to do this. This especially relates to a clearer motivation for the chosen (i) competitors, (ii) design parameters (sample sizes, dimensions, missing rates), (iii) data generating processes, and (iv) missing processes. In particular, you merely motivate through your own preprint and the Little and An (2004) study, but there exist many other simulation studies with different approaches. Here, I would expect much more motivation and reasoning. Moreover, you motivate in the Introduction with high-dimensional settings while this is only a minor part of your study (in fact, I usually define high-dimensional as p > n but know that there are other definitions out there). Similarly, you only discuss normally distributed models, which is a key limitation (the same for the 50% missing rate). I would like to see at least a small sensitivity analysis wrt the above points (i)-(iv) (e.g. with different distributions, different missing rates, etc.) to learn more about the applicability of TabPFN.
Response 5:
We thank the reviewer for this helpful suggestion and for pointing us to the guidelines by Morris et al. (2019). In designing the simulation study, we followed guidance specific to simulation studies for multiple imputation methods, particularly the framework proposed by Oberman and Vink (2023), which focuses on evaluating bias, standard error behavior, and coverage properties under controlled missing-data mechanisms.
The chosen design parameters were intended to provide a transparent comparison between methods while keeping the computational burden manageable. In particular, the normally distributed outcome model and error structure were selected because they represent a commonly used baseline setting in simulation studies for MI methods. While we are aware that alternative simulation designs exist, we followed the approach of Little and An (2004), as it allows a broad range of experimental factors to be studied within a coherent framework.
In the present study, the missingness rate was fixed at 50% to represent a relatively challenging scenario, consistent with the design of Little and An (2004). In multiple imputation settings, higher missingness primarily increases the fraction of missing information, which in practice is typically addressed by increasing the number of imputations m (see e.g. van Buuren, 2018). To assess the sensitivity of the results to this aspect, we conducted a secondary simulation study varying the number of imputations m, which did not materially change the conclusions (Appendix Figure A2).
We agree that exploring alternative distributions could provide additional insights into the robustness of the methods. However, extending the simulation study along multiple additional dimensions would substantially increase its complexity. Following the reviewer’s suggestion, we now emphasize in the discussion that the simulation study focuses on a specific set of data-generating assumptions and that future work could extend the evaluation accordingly.
Change 5:
Original text:
“Many real-world datasets include multivariate missingness and mixed data types, which pose additional challenges.”
Revised text (437):
“Many real-world datasets include multivariate missingness and mixed data types, which pose additional challenges. Furthermore, we did not assess alternative distributions for the predictors or the error term that could affect finite-sample performance. However, the relative comparison between methods is expected to remain informative under a broad class of data-generating processes.”
Comment 6:
Numerical Experiments. In the simulation study you need to discuss your findings wrt existing results in the literature. In the data analyses, please also report the running time for the data set. Moreover, I think that the chosen data example doesn’t fit well with your simulation study which is a pity. For example, the simulation study focuses on normally distributed models of moderate sample size (100-400) and moderate to larger dimension (10-100) with an interaction only in one structural form while the analysis focuses on ACE estimation (btw: Why not CATE?) in a sample of size 1,500 with only a few predictors and even some squared ones. Here, the connection is too loose. In fact, I would have envisioned a data example of similar sizes as in your simulation study and at least a discussion to which DGP the data set could fit more (thereby also discussing distributional and structural points). In the current form, it is not satisfactory at all.
Response 6:
We thank the reviewer for these helpful suggestions. First, we would like to clarify that the effective sample size used in the empirical analysis is n = 490, which is comparable to the range considered in the simulation study (n = 100–400). We clarified this point in the manuscript to avoid confusion.
Regarding the use of the average causal effect (ACE), we followed the estimation objective used in the motivating example in the “What if…” Book, which focuses on estimating the overall treatment effect rather than conditional treatment effects (CATE). The goal of the case study was therefore to illustrate the use of the imputation procedure in a standard applied analysis rather than to investigate treatment effect heterogeneity (CATE).
We agree that relating empirical data examples to simulation settings is useful. However, because the case study uses real-world observational data, the underlying data-generating process is unknown, making it difficult to determine which of the simulated mechanisms would most closely correspond to the empirical setting.
Change 6:
Original text:
“The cohort consists of 1'566 adult smokers aged 25–74 who completed both a baseline visit (1971–1975) and a follow-up visit approximately 10 years later.”
Revised text (343):
“The cohort consists of 1'566 adult smokers aged 25–74 who completed both a baseline visit (1971–1975) and a follow-up visit approximately 10 years later. For the present analysis, we restrict attention to individuals with complete baseline covariate information, resulting in a smaller analytic sample described below (n = 490), which is comparable in magnitude to the sample sizes considered in the simulation study.”
Comment 7:
Discussion. Please mention all limitations in detail and align this with your conclusions. Moreover, you write “However, these methods typically require moderate to large sample sizes to estimate such structures reliably, posing challenges in applied settings with limited data availability” while I wouldn’t completely agree in your settings. Of course, n=100 is borderline but for all others (especially the data set), I would have at least seen MICE RF, missranger and mixgboost as competitors.
Response 7:
We thank the reviewer for this suggestion. In designing the simulation study, we aimed to compare the proposed method with commonly used imputation approaches available within the mice framework. To keep the comparison focused, we included one flexible tree-based imputation model, namely CART.
Previous work has shown that CART-based imputation can perform competitively in MI settings and may preserve interaction effects well (better than RF) when estimating regression parameters (Doove et al., 2014). In related simulation experiments, CART-based imputation has also been found to perform favorably compared with random forest imputation in terms of bias of parameter estimates. For these reasons, we selected CART as a representative tree-based method in the present comparison.
We agree that additional machine-learning-based imputation methods could provide further useful benchmarks. Evaluating the proposed approach against a broader range of modern imputation methods would therefore be an interesting direction for future work.
Change 7:
Discussion:
Original text:
“By introducing and benchmarking a TabPFN-based MI algorithm, we show that it achieves performance that is comparable to, and in some scenarios superior to, standard MI methods with minimal model specification effort, potentially mitigating one of the key limitations of highly flexible approaches in low-sample settings.”
Revised text (417):
“By introducing and benchmarking a TabPFN-based MI algorithm, we show that it achieves performance that is comparable to, and in some scenarios superior to, standard MI methods with minimal model specification effort, potentially mitigating one of the key limitations of highly flexible approaches in low-sample settings. However, the comparison was restricted to imputation methods available within the mice framework, and other MI approaches outside this framework were not considered.”
Reviewer 3 Report
Comments and Suggestions for Authors
In this paper, the authors investigate the use of Tabular Prior-Data Fitted Network (TabPFN), a pretrained transformer-based foundation model, for multiple imputation (MI) of missing data. The authors conduct a simulation study comparing TabPFN with standard MI methods such as Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) and demonstrate a real-world application using data from the National Health and Nutrition Examination Survey Epidemiologic Follow-Up Study (NHEFS). The paper is scientifically sound and can be accepted after the authors consider the following comments:
1. Can you provide a formal theoretical justification for why Rubin's rules yield valid inference when applied to TabPFN-based imputation?
2. What conditions ensure that TabPFN's predictive distributions approximate proper posterior predictive distributions, and have you verified these conditions?
3. You mention discretizing continuous outcomes into "a fixed number of classes" (lines 134-141), but this critical implementation detail is not specified. How many classes did you use?
4. In Figure 2, TabPFN produces standard errors smaller than those of the full-data analysis. What explains this underestimation of uncertainty?
Author Response
Comment 1:
Can you provide a formal theoretical justification for why Rubin's rules yield valid inference when applied to TabPFN-based imputation?
Response 1:
We thank the reviewer for this important question. A formal theoretical justification for the validity of Rubin’s rules under TabPFN-based imputation is currently not available to our knowledge. Instead, the inferential adequacy of the proposed approach is evaluated empirically through bias, standard error behavior, and confidence interval coverage in controlled simulation settings. This perspective is already reflected in Section 2.2, where we state that Rubin’s rules are applied under the working assumption that the predictive draws approximate posterior predictive distributions for the missing values.
Comment 2:
What conditions ensure that TabPFN's predictive distributions approximate proper posterior predictive distributions, and have you verified these conditions?
Response 2:
We thank the reviewer for this insightful question. TabPFN does not provide formal guarantees that its predictive distributions correspond to proper posterior predictive distributions for a given dataset. Rather, the model is trained offline on a large collection of synthetic tabular prediction tasks and is designed to approximate Bayesian predictive inference across a broad class of data-generating processes. In the present work, we therefore treat the predictive distributions as an approximation and evaluate their inferential adequacy empirically through simulation. To clarify this point, we added a short remark in Section 2.2 emphasizing that this assumption is pragmatic and that the validity of the approach is assessed empirically rather than theoretically.
Change 2:
Original text:
“Rubin’s rules are applied under the working assumption that TabPFN’s predictive draws approximate proper posterior predictive distributions for the missing values, such that between-imputation variability reflects uncertainty about the missing data.”
Revised Text (183):
“Rubin’s rules are applied under the working assumption that TabPFN’s predictive draws approximate proper posterior predictive distributions for the missing values, such that between-imputation variability reflects uncertainty about the missing data. This assumption is pragmatic: in contrast to conventional MI approaches that rely on parametric or dataset-specific predictive models, TabPFN is trained offline to approximate Bayesian predictive inference across a broad class of tabular prediction tasks, but formal guarantees for a specific dataset are not available.”
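For readers less familiar with the pooling step referred to above, the standard form of Rubin's rules can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `pool_rubin` and its arguments are hypothetical, and the inputs are assumed to be the m completed-data point estimates together with their squared standard errors.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors.

    Hypothetical helper for illustration; returns (pooled estimate,
    total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    t = u_bar + (1 + 1 / m) * b    # total variance
    # Classical degrees of freedom; infinite when the estimates do not vary.
    if b == 0:
        df = float("inf")
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, df
```

The between-imputation term `b` is the quantity whose calibration is at issue here: if the predictive draws are too concentrated, `b` and therefore the total variance `t` would be too small, and the resulting intervals too narrow.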
Comment 3:
You mention discretizing continuous outcomes into "a fixed number of classes" (lines 134-141), but this critical implementation detail is not specified. How many classes did you use?
Response 3:
We thank the reviewer for pointing this out. In our implementation, continuous outcomes were discretized into 5'000 classes, which is the default setting of TabPFNRegressor. We have now added this detail explicitly to the Methods section.
Change 3:
Original text:
“Therefore, continuous outcomes are discretized into a fixed number of classes prior to imputation, effectively reframing the regression problem as a multi-class classification task.”
Revised text (166):
“Therefore, continuous outcomes are, by default, discretized into 5’000 classes prior to imputation, thereby reframing the regression problem as a multi-class classification task (see Müller et al. [39] for details on the discretization procedure).”
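The general idea of reframing regression as classification, and of drawing an imputed value from the resulting class probabilities, can be sketched as below. This is only an illustration under simplifying assumptions: it uses equal-width bins and uniform sampling within a bin, which is not necessarily how TabPFN places its bin borders (see Müller et al. [39]), and the helper names `make_edges`, `to_class`, and `draw_value` are hypothetical.

```python
import random


def make_edges(lo, hi, n_classes):
    """Equal-width bin borders on [lo, hi] (a simplifying assumption)."""
    width = (hi - lo) / n_classes
    return [lo + k * width for k in range(n_classes + 1)]


def to_class(y, edges):
    """Map a continuous value to the index of the bin that contains it."""
    n_classes = len(edges) - 1
    width = (edges[-1] - edges[0]) / n_classes
    k = int((y - edges[0]) / width)
    # Clamp so that the upper border and out-of-range values get a valid class.
    return min(max(k, 0), n_classes - 1)


def draw_value(class_probs, edges, rng):
    """Draw one value: sample a bin by probability, then a point inside it."""
    k = rng.choices(range(len(class_probs)), weights=class_probs, k=1)[0]
    return rng.uniform(edges[k], edges[k + 1])


# Usage: with 5000 classes the bins are narrow relative to the outcome range.
edges = make_edges(0.0, 10.0, 5000)
label = to_class(3.1415, edges)
```

Repeating `draw_value` m times per missing entry, with class probabilities supplied by a predictive model, would give m stochastic imputations rather than a single point prediction.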
Comment 4:
In Figure 2, TabPFN produces standard errors smaller than those of the full-data analysis. What explains this underestimation of uncertainty?
Response 4:
We thank the reviewer for highlighting this observation. The comparison between standard errors obtained after imputation and those from the full-data analysis is not entirely straightforward. Differences may arise from how predictive uncertainty produced by the pretrained TabPFN model propagates through the multiple-imputation procedure. Because the exact mechanism depends on the pretraining and is therefore difficult to assess, we added a short remark in the discussion noting that the slightly reduced coverage observed in some scenarios may reflect an underestimation of uncertainty.
Change 4:
Original text:
“While TabPFN improved performance in many scenarios, it did not achieve nominal properties in all settings, particularly under the most challenging combinations of non-additive mean structure and non-additive missingness. Further methodological development is therefore required to improve robustness and theoretical foundation across a broader range of settings.”
Revised text (449):
“While TabPFN improved performance in many scenarios, it did not achieve nominal properties in all settings, particularly under the most challenging combinations of non-additive mean structure and non-additive missingness. This may reflect underestimation of the uncertainty or standard errors produced by the imputation procedure. One possible explanation is that the predictive distributions generated by the pretrained TabPFN model do not perfectly translate into calibrated between-imputation variability for a given dataset. Further methodological development is therefore required to improve robustness and theoretical foundation across a broader range of settings.”
Reviewer 4 Report
Comments and Suggestions for Authors
The reviewed paper addresses an important, real-world problem of data imputation. However, a few problems should be addressed before its possible publication:
- The paper considers one ML/neural network imputation method, but the introduction lacks a more detailed description of such approaches. Some other algorithms should be briefly compared, with the respective references added. There is also no direct review of benchmarking imputation methods (like [1]). Moreover, some authors point out problems with ML/network approaches for small samples or complex data types [2]. The introduction should be improved with additional discussion of ML/neural network approaches and new references (in addition to the above-mentioned ones).
- Novelty of the paper is not clearly stated, especially in Sect. 2.2. Please add some insight into the differences between your method and previous approaches.
- The numerical analysis in Sect. 3.1 is very interesting. However, it lacks discussion of simulations from probability distributions other than the uniform distribution (e.g., the exponential or a beta distribution) or of the normal distribution with a different standard deviation. At least some notes about these issues should be added, or new simulations should be provided.
- Sect 6 should be reorganized into two sections: the first one devoted explicitly to the discussion of the results of the numerical analyses (i.e., “Discussion of the results”), while the second (“Conclusions”) will be about more general conclusions for the whole paper and possible future research.
Given the issues mentioned, I advise a major revision of the paper.
[1] M. Alabadla, F. Sidi, I. Ishak, H. Ibrahim, L.S. Affendey, Z. Che Ani, M.A. Jabar, U.A. Bukar, N.K. Devaraj, A.S. Muda, A. Tharek, N. Omar, M.I.M. Jaya, Systematic review of using machine learning in imputing missing values, IEEE Access (2022)
[2] Romaniuk M., Grzegorzewski P., Fuzzy data imputation with DIMP and FGAIN, Journal of Computational Science (2026)
Author Response
Comment 1:
The paper considers one ML/neural network imputation method, but the introduction lacks a more detailed description of such approaches. Some other algorithms should be briefly compared, with the respective references added. There is also no direct review of benchmarking imputation methods (like [1]). Moreover, some authors point out problems with ML/network approaches for small samples or complex data types [2]. The introduction should be improved with additional discussion of ML/neural network approaches and new references (in addition to the above-mentioned ones).
Response 1:
We thank the reviewer for these helpful comments and for pointing us to the additional references. We agree that the introduction can be improved by providing a broader discussion of machine learning and neural network–based imputation approaches, as well as relevant benchmarking studies.
In the revised manuscript, we expanded the introduction accordingly by incorporating additional references on ML- and deep learning–based imputation methods, including recent review and benchmarking studies. We also added a brief discussion highlighting that, while these approaches can be highly flexible, their performance may depend on sample size and data characteristics.
Change 1:
Original text:
“Quantitative analyses in health research, including clinical trials and observational studies, commonly assume that all variables are fully observed. In practice, however, missing data in covariates or outcomes are ubiquitous and, if not appropriately handled, can lead to biased estimates, loss of efficiency, and invalid statistical inference [1,2]. Regulatory guidance therefore, emphasizes the need for principled approaches to missing data that support valid estimation and uncertainty quantification, particularly for prespecified estimands in confirmatory analyses. Multiple imputation (MI) provides a widely accepted framework for addressing missing data under valid uncertainty quantification by generating multiple plausible completed datasets, analyzing each dataset separately, and combining results using Rubin’s rules to reflect both sampling variability and uncertainty due to missingness [3].”
Revised text (1):
“Quantitative analyses in health research, including clinical trials and observational studies, commonly assume that all variables are fully observed. In practice, however, missing data in covariates or outcomes are ubiquitous and can arise even in carefully designed randomized controlled trials (RCTs), for example when patients withdraw from follow-up and outcome measurements cannot be obtained. If not appropriately handled, missing data can lead to biased estimates, loss of efficiency, and invalid statistical inference [1,2]. Regulatory guidance therefore emphasizes the need for principled approaches to missing data that support valid estimation and uncertainty quantification, particularly for prespecified estimands in confirmatory analyses. More broadly, the literature on missing data methods is extensive, and identifying appropriate imputation strategies is of particular interest in the life sciences, where a large share of comparative studies has been conducted [3]. Illustrative examples include applications to electronic health records and clinical measurements [4], binary classification problems [5], and time series health data [6]. Beyond classical approaches, a growing body of work investigates machine learning and deep learning–based imputation methods in healthcare data [7], as well as methods designed for high-dimensional settings [8]. These studies consistently emphasize that no single method performs uniformly best, with performance depending strongly on the data structure, dimensionality, and the analysis objective. At the same time, much of the existing literature focuses on predictive performance metrics, whereas relatively fewer studies assess the ability of imputation methods to recover the underlying data distribution or support valid statistical inference [9–11]. This distinction is particularly relevant in applications where valid inference, rather than predictive accuracy alone, is the primary objective.”
Comment 2:
Novelty of the paper is not clearly stated, especially in Sect. 2.2. Please add some insight into the differences between your method and previous approaches.
Response 2:
We thank the reviewer for this helpful comment. While the main contributions are summarized in the introduction, we agree that the novelty of the proposed approach could be stated more clearly in the methods section. Therefore, we added a sentence in Section 2.2 highlighting how the method differs from conventional multiple imputation approaches, namely by leveraging a pretrained foundation model rather than fitting dataset-specific predictive models.
Change 2:
In Section 2.2:
Original text:
“We do not claim formal theoretical validity of Rubin’s rules under TabPFN-based imputation. Instead, inferential adequacy of the proposed approach is evaluated empirically through bias, standard error behavior, and confidence interval coverage in controlled simulation settings. Rubin’s rules are applied under the working assumption that TabPFN’s predictive draws approximate proper posterior predictive distributions for the missing values, such that between-imputation variability reflects uncertainty about the missing data.”
Revised text (181):
“We do not claim formal theoretical validity of Rubin’s rules under TabPFN-based imputation. Instead, inferential adequacy of the proposed approach is evaluated empirically through bias, standard error behavior, and confidence interval coverage in controlled simulation settings. Rubin’s rules are applied under the working assumption that TabPFN’s predictive draws approximate proper posterior predictive distributions for the missing values, such that between-imputation variability reflects uncertainty about the missing data. This assumption is pragmatic: in contrast to conventional MI approaches that rely on parametric or dataset-specific predictive models, TabPFN is trained offline to approximate Bayesian predictive inference across a broad class of tabular prediction tasks, but formal guarantees for a specific dataset are not available.”
Comment 3:
The numerical analysis in Sect. 3.1 is very interesting. However, it lacks discussion of simulations from probability distributions other than the uniform distribution (e.g., the exponential or a beta distribution) or of the normal distribution with a different standard deviation. At least some notes about these issues should be added, or new simulations should be provided.
Response 3:
We thank the reviewer for this helpful suggestion. While extending the simulation study to additional predictor and error distributions could provide further insights, the goal of the present simulations was to compare the methods under a simple and transparent data-generating mechanism that has been used in related work. Following the reviewer’s suggestion, we added a note in the discussion highlighting this limitation and pointing to alternative distributions as a direction for future research.
Change 3:
Original text:
“Many real-world datasets include multivariate missingness and mixed data types, which pose additional challenges.”
Revised text (437):
“Many real-world datasets include multivariate missingness and mixed data types, which pose additional challenges. Furthermore, we did not assess alternative distributions for the predictors or the error term that could affect finite-sample performance. However, the relative comparison between methods is expected to remain informative under a broad class of data-generating processes.”
Comment 4:
Sect 6 should be reorganized into two sections: the first one devoted explicitly to the discussion of the results of the numerical analyses (i.e., “Discussion of the results”), while the second (“Conclusions”) will be about more general conclusions for the whole paper and possible future research.
Response 4:
We thank the reviewer for this helpful suggestion. We agree that the current Discussion section combines two distinct elements: (i) interpretation of the numerical results and (ii) broader conclusions and future directions. To improve clarity, we have reorganized Section 6 into two separate sections. The first section, “Discussion,” now focuses explicitly on the interpretation of the simulation findings, including the relative performance of TabPFN-based MI and competing methods, as well as the main limitations of the numerical analyses. The second section, “Conclusion,” now summarizes the broader implications of the paper and outlines directions for future methodological and applied research.
Change 4:
Original text:
“Section 4 presents the simulation results, Section 5 provides a real-data application, and Section 6 concludes with a discussion of implications and directions for future research.”
Revised text (111):
“Section 4 presents the simulation results, Section 5 provides a real-data application, Section 6 discusses the numerical findings, and Section 7 concludes with broader implications and directions for future research.”
Revised text:
“7. Conclusion (464)
TabPFN-based MI represents a promising addition to the MI toolkit and illustrates the broader potential of foundation-model–based approaches for missing data imputation, particularly in limited-data settings where traditional imputation models struggle and expert knowledge about functional relationships is limited or unavailable. Its ability to combine modeling flexibility with generally robust coverage properties makes it a promising candidate for further application in missing data imputation. At the same time, practical and methodological challenges remain. TabPFN is computationally more demanding than simpler MI methods such as PMM or CART. While access to GPU resources is increasingly common in research environments and may not be prohibitive relative to the overall cost of large studies, the additional computational and software complexity may nevertheless pose practical barriers for routine use in standard statistical workflows. Moreover, TabPFN’s relative novelty means it has not yet been extensively tested across diverse applied contexts. Even though adoption in the Healthcare and Life Sciences section is strongest [34], uptake in regulatory settings, where transparency, reproducibility, and extensive methodological validation are emphasized, may be more gradual [1,30]. Future research should focus on extending the approach to more general missing-data settings, improving computational efficiency to facilitate broader adoption, and systematically benchmarking TabPFN-based MI against both established MI methods and alternative foundation-model approaches. More generally, the area of foundation-model–based imputation is developing rapidly, and TabPFN should be viewed as one promising example within a broader emerging class of methods.”
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I still think that some points should be addressed, but the editor should make a final decision.
Author Response
Comment 1:
I still think that some points should be addressed, but the editor should make a final decision.
Response 1:
We thank the reviewer sincerely for the continued time and care devoted to the evaluation of our manuscript. We have taken the reviewer’s comments seriously and have revised the manuscript to the best of our ability in response to the issues raised during the review process. We hope that the present version satisfactorily addresses the reviewer’s concerns and reflects the improvements made during revision.
Reviewer 2 Report
Comments and Suggestions for Authors
The author has improved the paper considerably and answered most of my comments. However, I still don't see all of them addressed:
- The title remains confusing as only missings in the target with complete predictors is studied. In my opinion this has to become clear from the title. I suggest sth. like "Multiple Imputation of Continuous Target Variables with Fully Observed Predictors Using TabPFN"
- The limitations should be extended. In particular, you should explicitly mention that your simulation design was restricted (mentioning all parameters of importance (e.g. also the missing rate); not only the distribution) and that you only chose a few competitors and left several ones out (e.g. missranger or mice RF). Moreover, I don't agree with the added statement "However, the relative comparison between methods is expected to remain informative under a broad class of data-generating processes." as your simulation designs remained restricted as stressed before. These limitations should be made more explicit in the Discussion and Section 3
- Regarding the data example you replied "However, because the case study uses real-world observational data, the underlying data-generating process is unknown, making it difficult to determine which of the simulated mechanisms would most closely correspond to the empirical setting." I agree with the first statement but not with the second as there exist plenty of possibilities to obtain more realistic simulation settings, e.g. PLASMODE simulations. You should make this more explicit as a limitation and at least mention this as a potential future direction. The same holds for the AutoTabPFN.
Author Response
Comment 1:
The title remains confusing as only missings in the target with complete predictors is studied. In my opinion this has to become clear from the title. I suggest sth. like "Multiple Imputation of Continuous Target Variables with Fully Observed Predictors Using TabPFN"
Response 1:
We thank the reviewer for this helpful comment and apologize that this point was not addressed in the previous revision. Upon reconsideration, we agree that the title should reflect the specific scope of the paper more precisely. Since the present study focuses on multiple imputation of a continuous outcome under univariate missingness with fully observed predictors, we have revised the title accordingly to make this setting clearer from the outset. In doing so, we used the term outcome rather than target variable, as this terminology is more common in our field and aligns better with the applied statistical framing of the manuscript.
Change 1:
Change the title to: “Multiple Imputation of a Continuous Outcome with Fully Observed Predictors Using TabPFN”
Comment 2:
The limitations should be extended. In particular, you should explicitly mention that your simulation design was restricted (mentioning all parameters of importance (e.g. also the missing rate); not only the distribution) and that you only chose a few competitors and left several ones out (e.g. missranger or mice RF). Moreover, I don't agree with the added statement "However, the relative comparison between methods is expected to remain informative under a broad class of data-generating processes." as your simulation designs remained restricted as stressed before. These limitations should be made more explicit in the Discussion and Section 3
Response 2:
We thank the reviewer for this important comment and agree that the limitations of the simulation study should be stated more explicitly. In the revised manuscript, we have therefore expanded the Discussion to clarify that our evaluation was restricted to a specific set of experimental conditions, namely univariate missingness in a continuous outcome with fully observed continuous predictors, a limited set of sample sizes and predictor dimensions, and an overall missingness proportion of approximately 50%. We now also state more clearly that only selected mean and missingness structures were considered and that other potentially important design features, such as different missingness rates, mixed data types, multivariate missingness patterns, and broader distributional settings, were not examined. In addition, we clarify that the comparison was limited to a selected set of benchmark methods, primarily approaches available within the mice framework, and that other relevant competitors, including methods such as missRanger, mixgb, and misl, were not included. Finally, we agree that the previous statement suggesting that the relative comparison would remain informative under a broad class of data-generating processes was too strong given the restricted simulation design. We have therefore removed this sentence and emphasized that the findings should be interpreted within the scope of the simulation scenarios considered.
Change 2:
Removed text:
“However, the comparison was restricted to imputation methods available within the mice framework, and other MI approaches outside this framework were not considered.”
Original text:
“Several limitations warrant consideration. Our simulation study focused on univariate missingness in a continuous outcome with continuous, fully observed predictors. Many real-world datasets include multivariate missingness and mixed data types, which pose additional challenges. Furthermore, we did not assess alternative distributions for the predictors or the error term that could affect finite-sample performance. However, the relative comparison between methods is expected to remain informative under a broad class of data-generating processes.”
Revised text (line 437):
“Several limitations of the simulation study warrant explicit consideration. First, as with any simulation study, our evaluation was restricted to a specific set of experimental conditions. In particular, we considered univariate missingness in a continuous outcome with fully observed continuous predictors, a limited set of sample sizes and predictor dimensions, and an overall missingness proportion of approximately 50%. We further considered only selected mean and missingness structures and did not vary several other potentially important design features, such as the missingness rate, mixed data types, multivariate missingness patterns, or broader distributional settings for predictors and errors. Second, the comparison was limited to a selected set of benchmark methods, primarily methods available within the mice framework. Other relevant competitors, including approaches such as missRanger, mixgb, misl, or random-forest-based imputation variants, were not included and may have provided additional insight into the relative performance of TabPFN-based imputation [24–26,49]. Accordingly, the findings should be interpreted within the scope of the simulation scenarios considered.”
Comment 3:
Regarding the data example you replied "However, because the case study uses real-world observational data, the underlying data-generating process is unknown, making it difficult to determine which of the simulated mechanisms would most closely correspond to the empirical setting." I agree with the first statement but not with the second as there exist plenty of possibilities to obtain more realistic simulation settings, e.g. PLASMODE simulations. You should make this more explicit as a limitation and at least mention this as a potential future direction. The same holds for the AutoTabPFN.
Response 3:
We thank the reviewer for this important comment and agree that our previous reply was not sufficiently precise on this point. While the true data-generating process in the observational case study is unknown, more realistic simulation settings can nevertheless be constructed, for example through plasmode or other semi-synthetic simulation approaches anchored in empirical data. We are grateful to the reviewer for highlighting this point, as such designs could provide additional insight into the performance of TabPFN-based imputation under more practically relevant conditions. We have therefore revised the Discussion to make this limitation more explicit and to note that more realistic simulation-based evaluations represent an important direction for future research.
Furthermore, we agree that related TabPFN-based extensions such as AutoTabPFN may also be relevant in this context, as well as, more broadly, alternative tabular foundation models developed by other research groups. We have therefore expanded the Discussion to note that future work should benchmark TabPFN-based MI not only against established MI methods, but also against related foundation-model approaches, including AutoTabPFN and alternative tabular foundation models such as TabICL.
Change 3:
Original text:
“Because TabPFN is pretrained exclusively on synthetically generated datasets, evaluation based solely on synthetic simulation studies may raise concerns about potential over-optimism if the synthetic pretraining distribution overlaps substantially with the data-generating mechanisms used for evaluation. In line with Murray [48] and Oberman and Vink [41], we advocate for unified evaluations of MI methods combining controlled simulations with results from publicly available benchmark datasets, similar to TabArena in prediction research [33] or RealCause in causal inference from observational data [49], to ensure fair and transparent comparisons.”
Revised text (line 465):
“Because TabPFN is pretrained exclusively on synthetically generated datasets, evaluation based solely on synthetic simulation studies may raise concerns about potential over-optimism if the synthetic pretraining distribution overlaps substantially with the data-generating mechanisms used for evaluation. More realistic simulation strategies, such as plasmode or other semi-synthetic designs anchored in empirical data, may therefore provide further insight into the performance of TabPFN-based imputation under practically relevant conditions [50,51]. In line with Murray [52] and Oberman and Vink [42], we advocate for unified evaluations of MI methods combining controlled simulations with results from publicly available benchmark datasets, similar to TabArena in prediction research [34] or RealCause in causal inference from observational data [53], to ensure fair and transparent comparisons.”
Original text:
“Future research should focus on extending the approach to more general missing-data settings, improving computational efficiency to facilitate broader adoption, and systematically benchmarking TabPFN-based MI against both established MI methods and alternative foundation-model approaches. More generally, the area of foundation-model–based imputation is developing rapidly, and TabPFN should be viewed as one promising example within a broader emerging class of methods.”
Revised text (line 491):
“Future research should focus on extending the approach to more general missing-data settings, improving computational efficiency to facilitate broader adoption, and systematically benchmarking TabPFN-based MI against both established MI methods and alternative foundation-model approaches, including related TabPFN-based extensions such as AutoTabPFN, which enables automatic hyperparameter tuning, and alternative tabular foundation models such as TabICL [54]. More generally, the area of foundation-model–based imputation is developing rapidly, and TabPFN should be viewed as one promising example within a broader emerging class of methods.”
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have significantly improved their paper. The only minor issue, in my opinion, is the need to further point out the serious problem with ML/network approaches, especially in some special cases involving more complex data. As noted in [1], the more “classical” and “direct” imputation algorithms can perform far better than ML methods. Therefore, I advise incorporating some additional sentence about it, together with the above-mentioned reference. Taking this into account, I advise a minor revision of the paper.
[1] Romaniuk M., Grzegorzewski P., Fuzzy data imputation with DIMP and FGAIN, Journal of Computational Science (2026)
Author Response
Comment 1:
The authors have significantly improved their paper. The only minor issue, in my opinion, is the need to further point out the serious problem with ML/network approaches, especially in some special cases involving more complex data.
As noted in [1], the more “classical” and “direct” imputation algorithms can perform far better than ML methods. Therefore, I advise incorporating some additional sentence about it, together with the above-mentioned reference. Taking this into account, I advise a minor revision of the paper.
[1] Romaniuk M., Grzegorzewski P., Fuzzy data imputation with DIMP and FGAIN, Journal of Computational Science (2026)
Response 1:
We thank the reviewer for this helpful suggestion and apologize if this point was not made sufficiently clearly in the previous version. We agree that the manuscript should acknowledge more explicitly that machine-learning and neural-network–based imputation methods are not uniformly superior to more classical or direct approaches across all settings. We have therefore added a sentence to the Introduction clarifying that the relative performance of imputation methods is context dependent, particularly in specialized data settings, and citing the suggested reference by Romaniuk and Grzegorzewski.
Change 1:
Original text:
“Despite these advances, highly flexible models typically require substantial sample sizes to reliably estimate complex dependency structures [28,29]. While large observational datasets may support such approaches, clinical trials and other confirmatory studies are often limited in size, making it difficult to estimate complex imputation models without imposing strong structural assumptions. Nevertheless, appropriately addressing missing data remains essential for maintaining valid statistical inference in these settings, particularly when regulatory decisions depend on accurate estimation and uncertainty quantification [30].”
Revised text (line 75):
“Despite these advances, highly flexible models typically require moderate to large sample sizes to reliably estimate complex dependency structures [28,29]. Hence, ML and neural-network–based imputation methods should not be viewed as universally preferable. Their performance depends strongly on the data structure and application context, and in some specialized settings more direct imputation approaches have been shown to outperform more complex ML-based methods [30]. While large observational datasets may support such approaches, clinical trials and other confirmatory studies are often limited in size, making it difficult to fit highly flexible imputation models reliably without imposing strong structural assumptions. Nevertheless, appropriately addressing missing data remains essential for maintaining valid statistical inference in these settings, particularly when regulatory decisions depend on accurate estimation and uncertainty quantification [31].”
Author Response File:
Author Response.pdf
