1. Introduction
Modern population analytics increasingly require methods that reconcile two distinct inferential goals, prediction and causal estimation, in data generated by complex sampling designs. In multistage probability samples with stratification, clustering, and unequal inclusion probabilities, valid population-level statements hinge on design-consistent estimators, survey-aware performance metrics, and resampling schemes that mimic the sampling process [1,2,3]. At the same time, high-dimensional covariate sets and multicollinearity motivate penalized estimation and stability assessments grounded in pseudo-likelihood arguments and stochastic resampling [4,5]. When the scientific interest is causal (e.g., estimating a structural effect in the presence of confounding or endogeneity), orthogonalized estimators that leverage Neyman-orthogonal moments and cross-fitting provide a principled route to robust inference even when nuisance components are learned by flexible methods [6,7].
Recent advances have extended penalized methods to complex surveys, but most contributions address variable selection and penalty tuning without delivering an integrated pipeline that (i) evaluates prediction with survey-consistent discrimination and calibration, (ii) quantifies uncertainty via design-preserving bootstrap, and (iii) corrects endogeneity when estimating structural parameters. For example, Iparragirre et al. [8] developed design-based cross-validation for LASSO in surveys, while McConville et al. [9] studied LASSO and adaptive LASSO from a model-assisted, finite-population perspective; both strengthen the predictive regularization toolkit for complex designs, yet neither fully integrates endogeneity corrections or weighted predictive metrics. Likewise, Jasiak and Tuvaandorj [10] advanced post-LASSO inference with weights mainly at the theoretical level, without embedding it into an end-to-end predictive–causal workflow.
The proposed framework extends several active lines of methodological research. Existing contributions can be broadly grouped into two complementary strands: predictive modeling under complex survey designs and causal inference under high-dimensional settings.
First, the predictive literature has focused on extending penalized estimation techniques to survey data. Prior work includes the design-based cross-validation and variable-selection developments of Iparragirre et al. [8], the model-assisted LASSO formulations of McConville et al. [9], and the theoretical contributions of Jasiak and Tuvaandorj on penalized likelihood with survey weights [10]. These approaches strengthen variable selection, stability, and predictive performance under complex sampling, but remain primarily oriented toward prediction and regularization, without explicitly addressing endogeneity or causal interpretation.
Second, the causal inference literature has emphasized robustness to confounding and high-dimensional nuisance estimation. Double/debiased machine learning (DML) [7] and recent advances in weighted orthogonal learners for causal identification [11] provide principled tools for estimating low-dimensional causal parameters using flexible machine learning components, with extensions to heterogeneous treatment effects [12]. However, these methods are typically developed under i.i.d. assumptions or without full integration of complex survey design features.
Recent developments in machine learning for medical diagnosis further illustrate the rapid evolution of predictive approaches. For example, reinforcement learning-based neural architectures have been proposed for cardiovascular risk detection in athletes, achieving strong classification performance in imbalanced settings [13]. Likewise, deep convolutional models such as U-Net have been widely applied to medical imaging tasks, including brain tumor detection, demonstrating the effectiveness of data-driven feature extraction [14]. In addition, recent work on cardiovascular disease detection has focused on addressing class imbalance and optimizing discrimination metrics such as the area under the ROC curve (AUC), further improving predictive accuracy [15,16]. These contributions highlight the strength of modern machine learning methods in prediction and classification tasks. However, they are primarily designed to maximize predictive performance and do not explicitly address causal identification, endogeneity, or design-based inference under complex sampling.
Recent advances in statistical modeling have extended beyond cross-sectional settings to increasingly complex data structures, including survival and longitudinal data with clustering, competing risks, and heterogeneous dependence structures. For example, Wang et al. [17] propose a unified framework that simultaneously accounts for cure fractions, competing risks, and within-cluster dependence. Similarly, modern approaches emphasize flexible modeling of heterogeneity and multi-scale stochastic processes, as illustrated in recent developments such as Zhuang et al. [18]. These contributions highlight a broader trend toward unified frameworks capable of integrating dependence, heterogeneity, and complex data-generating mechanisms.
While the present work focuses on cross-sectional binary outcomes, these developments provide an important broader context. The proposed design-aware framework can be viewed as a complementary contribution within this landscape, targeting settings where the outcome is binary but the sampling design introduces analogous challenges related to dependence and heterogeneity.
Despite these advances, a key gap remains: there is no unified framework that simultaneously addresses (i) penalized, survey-weighted prediction, (ii) causal estimation under confounding and potential endogeneity, and (iii) design-consistent uncertainty quantification via resampling. In particular, the interaction between sampling design, regularization, and causal identification has not been systematically addressed. While recent working papers and conference contributions have explored elements of causal machine learning [19] and weighted estimation in complex survey settings [20,21], these developments typically address prediction and causal inference separately or are derived under i.i.d. assumptions.
This paper proposes a unified, design-consistent framework that jointly addresses prediction and causality in complex survey data. The framework combines: (i) survey-weighted pseudo-likelihood estimation for finite-population targets, (ii) $\ell_1$-penalization (LASSO) to stabilize prediction under multicollinearity and limited effective sample size, (iii) design-consistent bootstrap (PSU-within-strata) for uncertainty and stability, and (iv) endogeneity-aware causal estimators, namely two-stage residual inclusion (2SRI) and double/debiased machine learning (DML), implemented with survey weights and cross-fitting. Predictive performance is assessed using weighted AUC and Brier score, while structural effects are estimated through Neyman-orthogonal moments to mitigate regularization bias and confounding. Methodologically, the contribution is to deliver a coherent pipeline that makes explicit when predictive and causal conclusions diverge under complex sampling and how to report each with appropriate, design-aware uncertainty.
This paper makes four primary mathematical contributions within the study of penalized estimation and causal inference under complex survey designs:
1. Design-consistency of penalized survey-weighted estimators. We formalize a LASSO-penalized pseudo-likelihood estimator and characterize its expected design-based behavior under multistage sampling. Proposition 1 characterizes the approximation of the penalized estimator to its finite-population target under appropriate regularity conditions.
2. Design-based bootstrap approximation for uncertainty quantification. We formulate a stratified, PSU-level bootstrap scheme to approximate the sampling variability of penalized and causal estimators under complex sampling, and we evaluate its adequacy through simulation evidence.
3. Operator-theoretic interpretation of penalized estimation. We show that the survey-weighted LASSO estimator can be expressed through a stochastic proximal mapping, providing an optimization-based perspective on penalized pseudo-likelihood estimation, without claiming that this representation directly implies asymptotic properties.
4. Integration of orthogonal moments with survey weights. We extend double/debiased machine learning (DML) to complex surveys by incorporating sampling weights into nuisance estimation, cross-fitting at the PSU level, and orthogonal score construction. This is intended to make structural estimates robust to regularization bias and design-induced variability.
These components together form a coherent mathematical framework that connects penalized prediction, causal estimation, and design-based inference for finite populations.
To illustrate the framework, we consider an application where predictive and causal goals are routinely conflated: the relationship between chronic kidney disease (CKD) and cardiovascular (CV) risk. CKD is epidemiologically linked to elevated CV risk and is often embedded in risk stratification, yet its independent structural contribution is sensitive to confounding and endogeneity; in such settings, naive predictive associations may not carry causal meaning [22,23]. We therefore use two complementary exercises. First, ENS-like simulations with a known causal structure show how naive survey-weighted penalized models can overstate structural effects under confounding, whereas 2SRI/DML can recover the target when identification holds. Second, we apply the pipeline to the Chilean National Health Survey (ENS) 2016–2017 to examine whether routinely collected biomarkers (hs-CRP, lipid fractions, hemoglobin, HbA1c, creatinine) improve the prediction of high CV risk and how endogeneity-aware estimators alter the interpretation of the CKD effect in a finite population.
Section 2 introduces the mathematical and statistical foundations (pseudo-likelihood under complex sampling, penalization, design-based bootstrap, and endogeneity corrections). Section 3 details the modeling pipeline: survey-weighted estimation, LASSO regularization, weighted performance metrics, and orthogonal/control-function estimators. Section 4 reports simulation and empirical results, focusing on the divergence between predictive and causal targets and on design-consistent uncertainty. Section 5 discusses implications, limitations, and avenues for future work. Finally, Section 6 summarizes the main insights and emphasizes the generalizability of the framework to other complex health surveys and application areas.
2. Motivation and Mathematical Background
2.1. Applied Motivation
The mathematical literature has recently emphasized the need for risk prediction models that account for non-independence among biomarkers, high-dimensionality, and survey sampling variability [24]. Penalized estimators such as the least absolute shrinkage and selection operator (LASSO) have demonstrated utility for feature selection under these constraints, particularly when the effective sample size is reduced due to complex survey designs [25].
The physiological mechanisms underlying cardiovascular disease in CKD have been extensively described [26,27]. Inflammatory markers such as hs-CRP have shown predictive value for cardiovascular outcomes [28], although their interpretation in CKD remains challenging due to the presence of uremic inflammation [29]. Oxidative stress is also implicated, with oxidized LDL linked to accelerated atherosclerosis [30]. Biomarkers such as cystatin C have proven especially informative, outperforming creatinine as predictors of cardiovascular and renal outcomes [31,32]. Beyond predictive performance, a central mathematical challenge is potential endogeneity between CKD status and cardiovascular risk, where CKD may influence cardiovascular outcomes, but cardiovascular events may also exacerbate CKD [33,34,35]. In econometric terms, endogeneity arises when regressors correlate with the error term, leading to biased estimates. To address this, we incorporate statistical techniques tailored for nonlinear survey-weighted models:
- Control-function (Two-Stage Residual Inclusion, 2SRI): For logistic regression with endogenous variables, we include residuals from a first-stage survey-weighted model as covariates in the second-stage outcome model. This method is consistent under nonlinearity and extends to complex designs [36,37].
- Double/Debiased Machine Learning (DML): We employ orthogonal estimators via Neyman-orthogonal moment conditions with sample splitting and penalized learners. This approach mitigates both regularization bias and endogeneity, and it can be adapted for survey designs to yield valid inference on the causal CKD parameter [7,38].
These methods preserve the weighted pseudo-likelihood framework and allow bootstrap-based uncertainty quantification, ensuring robust and interpretable inference on CKD effects under complex sampling.
Given the abundance of routine biomarkers collected in ENS 2016–2017 and the lack of CKD-specific predictive models in Chile, this study offers both mathematical and applied contributions.
2.2. A Stochastic Framework for Penalized Estimation Under Complex Sampling
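Let $\{(Y_i, X_i, w_i)\}_{i \in S}$ denote observations drawn from a finite population under a multistage, stratified sampling design. Each sampled unit carries a survey weight $w_i = 1/\pi_i$, where $\pi_i$ is the inclusion probability. Only the sample $S$ is observed; the full population remains unobserved.

The survey-weighted pseudo-likelihood for parameter vector $\beta$ is defined as

$$L_w(\beta) = \prod_{i \in S} \left[ p_i(\beta)^{Y_i} \, \{1 - p_i(\beta)\}^{1 - Y_i} \right]^{w_i}, \tag{1}$$

where $p_i(\beta) = \exp(X_i^\top \beta) / \{1 + \exp(X_i^\top \beta)\}$ denotes the logistic probability kernel. Taking logarithms yields the weighted pseudo-log-likelihood:

$$\ell_w(\beta) = \sum_{i \in S} w_i \left[ Y_i \log p_i(\beta) + (1 - Y_i) \log\{1 - p_i(\beta)\} \right]. \tag{2}$$

Under standard regularity conditions, expression (2) provides a design-consistent approximation to the finite-population log-likelihood.

To incorporate regularization and enable stochastic variable selection, we consider the LASSO-penalized objective

$$Q_\lambda(\beta) = \ell_w(\beta) - \lambda \|\beta\|_1, \qquad \lambda \ge 0, \tag{3}$$

whose maximizer satisfies

$$\hat{\beta}_\lambda = \arg\max_{\beta} Q_\lambda(\beta). \tag{4}$$

This optimization defines a stochastic mapping

$$\mathcal{T}_\lambda : \left(S, \{w_i\}_{i \in S}\right) \mapsto \hat{\beta}_\lambda, \tag{5}$$

whose randomness reflects both the sampling design and the underlying finite population.

To characterize the sampling distribution of $\hat{\beta}_\lambda$, we use a design-consistent bootstrap that resamples clusters within strata according to the ENS 2016–2017 design. This produces bootstrap replicates

$$\left\{ \hat{\beta}_\lambda^{*(b)} \right\}_{b=1}^{B}, \tag{6}$$

whose empirical distribution is intended to approximate that of $\hat{\beta}_\lambda$ under the complex sampling design.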
This stochastic framework offers two key advantages:
1. It provides uncertainty quantification for penalized estimators under complex survey designs, where classical asymptotics may be unreliable.
2. It yields a finite-population interpretation of penalized pseudo-likelihood estimation, supporting inference on population-level cardiovascular risk mechanisms.
Thus, the proposed approach integrates survey theory, penalized estimation, and stochastic resampling into a unified mathematical framework.
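The penalized objective described above can be illustrated with a short numerical sketch. The code below is not the authors' implementation: it assumes a proximal-gradient solver (the source does not say which optimizer is used), normalizes the weights to sum to one so that the penalty level is scale-free, and penalizes every coefficient, including any intercept column. The names `soft_threshold`, `weighted_loglik`, and `fit_weighted_lasso` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def weighted_loglik(beta, X, y, w):
    """Survey-weighted pseudo-log-likelihood of a logistic model."""
    eta = X @ beta
    # y*eta - log(1 + exp(eta)) equals y*log(p) + (1 - y)*log(1 - p)
    return float(np.sum(w * (y * eta - np.logaddexp(0.0, eta))))

def fit_weighted_lasso(X, y, w, lam, n_iter=2000):
    """Proximal-gradient ascent on l_w(beta) - lam * ||beta||_1.

    Assumption: weights are rescaled to sum to one, so `lam` is on the
    scale of a weighted average score rather than a weighted total.
    """
    w = w / w.sum()
    # Step 1/L, where L = ||W^(1/2) X||_2^2 / 4 bounds the logistic curvature
    step = 4.0 / np.linalg.norm(X * np.sqrt(w)[:, None], 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))      # gradient of the weighted log-likelihood
        beta = soft_threshold(beta + step * grad, step * lam)
    return beta
```

With normalized weights, any `lam` at or above the largest absolute entry of `X.T @ (w * (y - 0.5))` returns the all-zero vector, which gives a natural upper end for a tuning grid.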
2.3. Proposition 1: Design-Based Approximation for Penalized Estimation
We summarize here a set of working conditions under which the penalized survey-weighted estimator is expected to behave regularly under multistage complex sampling. The goal is not to provide a complete theorem-proof development, but rather to clarify the assumptions underlying the estimation and resampling strategy.
Let $\hat{\beta}_\lambda$ denote the LASSO-penalized pseudo-likelihood estimator. We consider the following:
- (B1) The sampling design is stratified and multistage, with strictly positive inclusion probabilities and PSU-level resampling within strata.
- (B2) Weighted score contributions satisfy boundedness and moment conditions that ensure stable weighted averages.
- (B3) The finite-population objective admits a well-defined minimizer with locally regular curvature.
- (B4) The penalty sequence and dimensionality satisfy standard regularization conditions ensuring stability.

Under these conditions, the estimator is expected to concentrate around a finite-population target:

$$\hat{\beta}_\lambda - \beta_N = o_p(1), \tag{7}$$

where $\beta_N$ denotes the finite-population target in (B3). This statement is interpreted as a design-based consistency heuristic grounded in penalized M-estimation arguments, rather than as a formally established theorem in this work.
To characterize the sampling variability of $\hat{\beta}_\lambda$, we employ a design-consistent bootstrap that resamples primary sampling units (PSUs) within strata.

For bootstrap replicates $\hat{\beta}_\lambda^{*(b)}$, $b = 1, \dots, B$, consider:

$$\mathbb{P}^{*}\!\left( \hat{\beta}_\lambda^{*} - \hat{\beta}_\lambda \le t \right) \approx \mathbb{P}\!\left( \hat{\beta}_\lambda - \beta_N \le t \right), \tag{8}$$

where $\mathbb{P}^{*}$ denotes the bootstrap distribution conditional on the observed sample, $\beta_N$ is the finite-population target, and the inequality is read componentwise. This approximation reflects the expectation that PSU-level resampling preserves the dependence structure induced by the complex survey design.

While a full formal proof of bootstrap consistency for penalized estimators under complex sampling is beyond the scope of this work, the proposed procedure is grounded in established design-based resampling principles. In particular, resampling PSUs within strata reproduces first-order sampling variability under multistage designs.

Empirical support for this approximation is provided in Section 4, where bootstrap-based uncertainty closely matches the variability observed under repeated sampling in simulation experiments.

We therefore interpret Equation (8) as a design-based approximation supported by established resampling theory and simulation evidence [2,3,39].
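One way to generate a single PSU-within-strata bootstrap replicate is sketched below. The source does not state which rescaling is applied to the weights, so this sketch assumes a Rao-Wu style scheme: within a stratum with $n_h$ PSUs, draw $n_h - 1$ PSUs with replacement and multiply each unit's weight by $n_h/(n_h - 1)$ times the number of times its PSU was drawn. The function name `psu_bootstrap_weights` and the handling of single-PSU strata are illustrative assumptions.

```python
import numpy as np

def psu_bootstrap_weights(strata, psu, w, rng):
    """Return replicate weights for one PSU-within-strata bootstrap draw.

    Assumption: Rao-Wu style rescaling. A weight of zero means the unit's
    PSU was not drawn in this replicate.
    """
    strata, psu, w = np.asarray(strata), np.asarray(psu), np.asarray(w, dtype=float)
    w_star = np.zeros(len(w))
    for h in np.unique(strata):
        in_h = strata == h
        ids = np.unique(psu[in_h])
        n_h = len(ids)
        if n_h < 2:
            # Single-PSU stratum: no within-stratum variability to resample
            w_star[in_h] = w[in_h]
            continue
        draws = rng.choice(ids, size=n_h - 1, replace=True)
        for j in ids:
            m = np.sum(draws == j)               # times PSU j was drawn
            rows = in_h & (psu == j)
            w_star[rows] = w[rows] * m * n_h / (n_h - 1)
    return w_star
```

Refitting an estimator with each replicate's weights in place of the original ones, for $b = 1, \dots, B$, gives the bootstrap distribution described above; units with zero replicate weight simply drop out of that fit.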
Stochastic Stability Paths
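To assess the robustness of penalized estimators under finite-population and design-induced randomness, consider a grid of tuning parameters $\Lambda = \{\lambda_1, \dots, \lambda_L\}$. For each $\lambda \in \Lambda$ and bootstrap replicate $b$, compute

$$\hat{\beta}^{*(b)}(\lambda) = \arg\max_{\beta} \left\{ \ell_w^{*(b)}(\beta) - \lambda \|\beta\|_1 \right\}, \tag{9}$$

where $\ell_w^{*(b)}$ is the weighted pseudo-log-likelihood evaluated on the $b$-th bootstrap sample, and define the selection probability for predictor $j$ as

$$\hat{\Pi}_j(\lambda) = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\!\left\{ \hat{\beta}_j^{*(b)}(\lambda) \neq 0 \right\}. \tag{10}$$

A predictor is considered stochastically stable if

$$\max_{\lambda \in \Lambda} \hat{\Pi}_j(\lambda) \ge \tau \tag{11}$$

for a pre-specified threshold $\tau \in (0, 1)$. These stability paths generalize classical LASSO solution paths by incorporating sampling variability and design-based uncertainty.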
Figure 1 summarizes the workflow from finite population to risk prediction, highlighting how bootstrap replicates feed into stability path estimation.
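The selection probabilities behind the stability paths reduce to simple frequency counts once the bootstrap coefficients are stored. The sketch below assumes they are held in an array of shape (replicates, grid points, predictors), and that a predictor counts as stable when its largest selection frequency over the grid reaches a threshold `tau`; both the array layout and the default `tau=0.8` are illustrative assumptions rather than values taken from the source.

```python
import numpy as np

def selection_probabilities(coef_draws, tol=1e-8):
    """Selection frequency per (lambda, predictor).

    coef_draws: array of shape (B, L, p) holding bootstrap coefficient
    estimates for B replicates, L tuning values, and p predictors.
    """
    return (np.abs(np.asarray(coef_draws)) > tol).mean(axis=0)

def stable_set(pi_hat, tau=0.8):
    """Indices of predictors whose maximum selection frequency is >= tau."""
    return np.flatnonzero(pi_hat.max(axis=0) >= tau)
```

Plotting each column of the returned matrix against the tuning grid gives one stability path per predictor.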
2.4. Operator-Theoretic View
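Define the convex function

$$g_\lambda(\beta) = \lambda \|\beta\|_1. \tag{12}$$

Then $\hat{\beta}_\lambda$ satisfies the stochastic proximal equation:

$$\hat{\beta}_\lambda = \operatorname{prox}_{\eta g_\lambda}\!\left( \hat{\beta}_\lambda + \eta \nabla \ell_w(\hat{\beta}_\lambda) \right), \qquad \eta > 0, \tag{13}$$

where $\operatorname{prox}_{g}(v) = \arg\min_{u} \left\{ \tfrac{1}{2}\|u - v\|_2^2 + g(u) \right\}$ and $\ell_w$ is the weighted pseudo-log-likelihood, demonstrating that the mapping in Equation (5) is a stochastic proximal operator acting on survey data.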
Figure 1. Workflow for design-aware penalized modeling: from finite population to risk prediction via survey-weighted LASSO, bootstrap replicates, and stability paths.
The methodological contribution of this work lies in integrating penalized pseudo-likelihood estimation, complex-survey inference, and design-based bootstrap into a coherent framework for predictive and causal modeling in finite populations. The proximal operator representation provides an optimization-based interpretation of the penalized estimator, clarifying its variational and computational structure. Its role is primarily interpretive, and we do not rely on this formulation to establish the asymptotic properties discussed in this paper.
2.5. Addressing Endogeneity in Survey-Weighted Nonlinear Models
Endogeneity arises when one or more regressors are correlated with the structural error term, violating the exogeneity assumption and producing biased and inconsistent estimators in nonlinear models. In our setting, CKD may be endogenous due to reverse causality (cardiovascular events affecting kidney function), omitted confounders, or measurement error. We develop two complementary strategies tailored to complex survey designs: a control-function (two-stage residual inclusion, 2SRI) approach for nonlinear models and a double/debiased machine learning (DML) approach based on Neyman-orthogonal moments.
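Let $\{(Y_i, D_i, X_i, Z_i, w_i)\}_{i \in S}$ denote observations drawn from a stratified multistage complex survey, where $w_i$ are sampling weights. The binary outcome $Y_i \in \{0, 1\}$ indicates high cardiovascular risk; $X_i$ are exogenous covariates; $D_i$ is a potentially endogenous regressor; and $Z_i$ are instruments satisfying relevance and exclusion restrictions described below. The survey-weighted pseudo-log-likelihood for a logistic outcome is

$$\ell_w(\alpha, \beta) = \sum_{i \in S} w_i \left[ Y_i \eta_i - \log\{1 + \exp(\eta_i)\} \right], \qquad \eta_i = \alpha D_i + X_i^\top \beta. \tag{14}$$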
2.5.1. Control-Function Approach (Two-Stage Residual Inclusion, 2SRI)
The control-function method corrects endogeneity in nonlinear models by augmenting the structural equation with first-stage residuals [36,37]. We adapt 2SRI to complex survey designs via weighted estimation in both stages.
Model the endogenous regressor $D_i$ using instruments $Z_i$ and exogenous covariates $X_i$:

$$D_i = f(Z_i, X_i; \gamma) + v_i, \tag{15}$$

where $f$ can be a linear specification (survey-weighted least squares), a generalized linear model (survey-weighted GLM), or a flexible penalized learner. Estimate $\gamma$ by minimizing a survey-weighted loss (e.g., pseudo-likelihood) and compute residuals

$$\hat{v}_i = D_i - f(Z_i, X_i; \hat{\gamma}). \tag{16}$$
In practice, the choice of the first-stage function is data-driven and depends on the predictive structure of the endogenous regressor. In this study, we consider linear, penalized (LASSO), and flexible machine learning specifications as alternative candidates.
We recommend selecting the first-stage function $f$ based on the following criteria:
- (i) Predictive performance. The specification should provide adequate out-of-sample prediction of the endogenous variable, as assessed by cross-validation or hold-out evaluation.
- (ii) Stability. The fitted values should be stable across resamples or folds, avoiding excessive sensitivity to small changes in the data.
- (iii) Robustness of second-stage estimates. The estimated structural coefficient in the second stage should not vary substantially across reasonable alternative specifications of $f$.
Given that the exclusion restriction is unlikely to hold in the present application, the first-stage model is not interpreted as a structural instrument equation. Instead, it is treated as a predictive nuisance component within a control-function sensitivity analysis. Consequently, model selection is guided by predictive adequacy and robustness considerations rather than by structural identification criteria.
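Augment the outcome model with the first-stage residuals:

$$\operatorname{logit} \Pr(Y_i = 1 \mid D_i, X_i, \hat{v}_i) = \alpha D_i + X_i^\top \beta + \rho \hat{v}_i. \tag{17}$$

Estimate $(\alpha, \beta, \rho)$ by maximizing the survey-weighted pseudo-log-likelihood in (14) with $D_i$ and $\hat{v}_i$ included. Under suitable conditions, $\rho$ captures the endogeneity correction, and $\alpha$ is consistently estimated despite nonlinearity [36,37].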
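The two stages can be sketched compactly. The code below is an illustration under stated assumptions, not the authors' pipeline: it uses a linear first stage fitted by survey-weighted least squares (one of the specifications the text allows), a Newton-Raphson fit for the weighted logistic second stage, and no penalty. The function names and the returned dictionary keys are hypothetical.

```python
import numpy as np

def wls(X, y, w):
    """Survey-weighted least squares coefficients."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

def weighted_logit(X, y, w, n_iter=50):
    """Survey-weighted logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def two_stage_residual_inclusion(y, d, X, Z, w):
    """Stage 1: regress d on (1, X, Z). Stage 2: logit of y on (1, d, X, residual)."""
    ones = np.ones((len(y), 1))
    first = np.hstack([ones, X, Z])
    resid = d - first @ wls(first, d, w)                # first-stage residuals
    second = np.hstack([ones, d[:, None], X, resid[:, None]])
    coef = weighted_logit(second, y, w)
    # coef[1] multiplies the endogenous regressor; coef[-1] multiplies the residual
    return {"alpha": coef[1], "rho": coef[-1], "coef": coef, "resid": resid}
```

A coefficient on the residual that is clearly different from zero signals that the regressor behaves as endogenous under the maintained first-stage model; standard errors should come from a design-based bootstrap that repeats both stages, not from the second-stage fit alone.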
The standard 2SRI framework relies on (i) relevance, (ii) exclusion, and (iii) independence of instruments. In this study, we evaluate the relevance condition empirically. However, the exclusion restriction is unlikely to hold given the clinical role of the candidate variables. Therefore, these conditions are not interpreted as satisfied for causal identification purposes, and the resulting estimates are treated as sensitivity analyses rather than structurally identified effects.
Practical limitations of identification. In the empirical application based on ENS data, the exclusion restriction is unlikely to hold. Variables such as age, sex, hypertension, and diabetes are known to directly affect cardiovascular risk beyond their association with CKD. As a result, these variables cannot be interpreted as valid instruments satisfying the exclusion condition required for causal identification.
Accordingly, the 2SRI specification should not be interpreted as providing structurally identified causal effects. Instead, it is used as a control-function-based sensitivity analysis to assess how the estimated CKD effect responds to alternative modeling assumptions that account for potential endogeneity.
Nuisance estimation under complex sampling introduces additional challenges relative to standard i.i.d. settings. In particular, the effective sample size is reduced by clustering and unequal weights, which may affect the stability of flexible learners used for nuisance estimation. To mitigate this issue, we employ cross-fitting at the PSU level and restrict model complexity to ensure stable estimation.
Nevertheless, we emphasize that the DML estimator remains sensitive to misspecification or instability in nuisance components, and therefore should be interpreted as providing robustness-oriented evidence rather than definitive causal identification.
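When $X_i$ or $Z_i$ are high-dimensional or collinear, we optionally include an $\ell_1$ penalty in stage 1 and/or stage 2:

$$(\hat{\alpha}, \hat{\beta}, \hat{\rho}) = \arg\max_{\alpha, \beta, \rho} \left\{ \ell_w(\alpha, \beta, \rho) - \lambda \|\beta\|_1 \right\}, \tag{18}$$

where $\ell_w(\alpha, \beta, \rho)$ is the weighted pseudo-log-likelihood incorporating $\hat{v}_i$ as in (17).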
To quantify uncertainty under complex sampling, we employ a design-consistent bootstrap that resamples primary sampling units (PSUs) within strata. For each bootstrap replicate $b = 1, \dots, B$:
1. Draw PSUs with replacement within each stratum to form a bootstrap sample and recompute weights $w_i^{*(b)}$.
2. Re-estimate stage 1 to obtain $\hat{\gamma}^{*(b)}$ and residuals $\hat{v}_i^{*(b)}$.
3. Re-estimate stage 2 to obtain $\hat{\alpha}^{*(b)}$.

The empirical distribution of $\{\hat{\alpha}^{*(b)}\}_{b=1}^{B}$ approximates the sampling distribution of the CKD effect under the survey design.
2.5.2. Double/Debiased Machine Learning (DML) with Survey Weights
DML constructs estimators based on Neyman-orthogonal moments that are insensitive to small estimation errors in high-dimensional nuisance functions [7,38]. We adapt DML to complex surveys by employing survey weights and design-preserving sample splitting.
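Let $O_i = (Y_i, D_i, X_i)$ and consider nuisance functions $g(X) = \mathbb{E}[Y \mid X]$ and $m(X) = \mathbb{E}[D \mid X]$ estimated via flexible learners (e.g., LASSO, elastic net, or tree-based methods) using survey weights. Define the orthogonal score

$$\psi(O_i; \theta, \eta) = \{D_i - m(X_i)\} \left\{ Y_i - g(X_i) - \theta \, (D_i - m(X_i)) \right\}, \tag{19}$$

where $\eta = (g, m)$ collects nuisance parameters. The moment condition

$$\mathbb{E}\left[ \psi(O; \theta_0, \eta_0) \right] = 0$$

identifies the causal parameter $\theta_0$ (here, the average structural effect of $D$ on $Y$). Orthogonality implies

$$\partial_\eta \, \mathbb{E}\left[ \psi(O; \theta_0, \eta) \right] \Big|_{\eta = \eta_0} = 0,$$

which mitigates bias from regularization and nuisance estimation.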
The parameter identified by the orthogonal moment condition corresponds to a partially linear projection parameter of the conditional mean function $\mathbb{E}[Y \mid D, X]$. Importantly, when the outcome is binary, this estimand should be interpreted on the probability scale as an average partial effect, rather than as a structural log-odds coefficient from a logistic model. The use of the score in Equation (19) does not rely on Gaussian error assumptions. Instead, it defines a valid orthogonal moment condition for the conditional expectation function, allowing consistent estimation of a low-dimensional parameter even when the outcome is discrete.
The DML framework builds on the orthogonalization approach introduced by [7], which enables root-n consistent estimation of low-dimensional parameters in the presence of high-dimensional nuisance components estimated via machine learning. Extensions of this framework further allow for heterogeneous treatment effect estimation under similar orthogonality principles [12]. However, these results are primarily derived under i.i.d. sampling assumptions. In complex survey settings, additional challenges arise due to clustering, unequal weights, and reduced effective sample size, which may affect both nuisance estimation and asymptotic approximations. Our implementation adapts these ideas by incorporating survey weights and PSU-level cross-fitting to preserve design consistency.
The validity of the DML estimator critically depends on the quality of nuisance function estimation, namely the estimated conditional means of the outcome and of the treatment given covariates, $\hat{g}$ and $\hat{m}$. In this study, these components are estimated using survey-weighted learners combined with cross-fitting at the PSU level, which reduces overfitting and preserves the dependence structure induced by the sampling design.
The resulting estimator should be interpreted as a partially linear projection parameter, representing the best linear approximation to the conditional mean of the outcome with respect to the residualized treatment. Consequently, even under correct specification of the nuisance components, the estimand does not recover a fully nonparametric structural effect, but rather a design-aware average effect under the maintained assumptions.
Importantly, the causal interpretation relies on the joint validity of: (i) conditional mean independence, (ii) sufficient overlap in the weighted sample, and (iii) stable estimation of nuisance functions under the effective sample size induced by the survey design. These conditions may be only approximately satisfied in practice, and results should therefore be interpreted with appropriate caution.
The use of flexible machine learning methods for nuisance estimation follows a broader literature on causal inference with data-adaptive models, where predictive accuracy is leveraged to improve adjustment for confounding [
40]. However, as emphasized in that literature, valid causal interpretation requires that such models adequately capture the relevant confounding structure, which may be challenging in finite samples and complex survey settings.
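Partition the sample into $K$ folds at the PSU level to respect the design, and let $I_k$ denote the units in fold $k$. For each fold $k$:
1. Train $\hat{g}^{(-k)}$ and $\hat{m}^{(-k)}$ on all but fold $k$ using survey weights.
2. Compute residuals on fold $k$:

$$\tilde{Y}_i = Y_i - \hat{g}^{(-k)}(X_i), \qquad \tilde{D}_i = D_i - \hat{m}^{(-k)}(X_i), \qquad i \in I_k.$$

3. Estimate $\hat{\theta}_k$ from the weighted moment:

$$\hat{\theta}_k = \frac{\sum_{i \in I_k} w_i \tilde{D}_i \tilde{Y}_i}{\sum_{i \in I_k} w_i \tilde{D}_i^{2}}.$$

Aggregate $\hat{\theta} = K^{-1} \sum_{k=1}^{K} \hat{\theta}_k$. Under regularity conditions and appropriate weighting, $\hat{\theta}$ is asymptotically normal and admits valid design-based inference [7,38].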
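The cross-fitting steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nuisance learner is a survey-weighted ridge regression standing in for whatever penalized learner is actually used, the nuisance functions are taken to be the conditional means of the outcome and of the treatment given covariates, and the names `psu_folds`, `weighted_ridge`, and `dml_plr` are hypothetical.

```python
import numpy as np

def psu_folds(psu, K, rng):
    """Assign whole PSUs to K folds so that no cluster is split across folds."""
    ids = rng.permutation(np.unique(psu))
    fold_of = {j: k % K for k, j in enumerate(ids)}
    return np.array([fold_of[j] for j in psu])

def weighted_ridge(X, y, w, penalty=1.0):
    """Stand-in nuisance learner: weighted ridge with an unpenalized intercept."""
    A = np.hstack([np.ones((len(y), 1)), X])
    P = penalty * np.eye(A.shape[1])
    P[0, 0] = 0.0
    coef = np.linalg.solve(A.T @ (A * w[:, None]) + P, A.T @ (w * y))
    return lambda Xn: np.hstack([np.ones((len(Xn), 1)), Xn]) @ coef

def dml_plr(y, d, X, w, psu, K=5, seed=0):
    """Cross-fitted partialling-out estimate, averaged over folds."""
    folds = psu_folds(psu, K, np.random.default_rng(seed))
    y_res, d_res = np.empty(len(y)), np.empty(len(y))
    theta_k = []
    for k in range(K):
        tr, te = folds != k, folds == k
        # Nuisance fits use only the other folds; residuals are formed on fold k
        y_res[te] = y[te] - weighted_ridge(X[tr], y[tr], w[tr])(X[te])
        d_res[te] = d[te] - weighted_ridge(X[tr], d[tr], w[tr])(X[te])
        theta_k.append(np.sum(w[te] * d_res[te] * y_res[te])
                       / np.sum(w[te] * d_res[te] ** 2))
    theta = float(np.mean(theta_k))
    psi = w * d_res * (y_res - theta * d_res)   # weighted orthogonal scores
    return theta, psi, d_res
```

Splitting by PSU rather than by individual keeps correlated units from the same cluster out of both the training and evaluation sides of a fold, which is the point of design-preserving sample splitting.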
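Let $\hat{\psi}_i$ denote the orthogonal score evaluated using cross-fitted nuisance estimates. Under complex sampling, observations are not independent, and variance must be computed at the cluster (PSU) level. Define the cluster-level score contributions as

$$\hat{\Psi}_c = \sum_{i \in c} w_i \hat{\psi}_i,$$

where $c = 1, \dots, C$ indexes primary sampling units. A cluster-robust, design-based variance estimator for $\hat{\theta}$ is given by

$$\hat{V}(\hat{\theta}) = \hat{J}^{-2} \, \frac{C}{C - 1} \sum_{c=1}^{C} \left( \hat{\Psi}_c - \bar{\Psi} \right)^{2},$$

where $\bar{\Psi} = C^{-1} \sum_{c=1}^{C} \hat{\Psi}_c$ and $\hat{J} = \sum_{i \in S} w_i \tilde{D}_i^{2}$, with $\tilde{D}_i$ the cross-fitted treatment residual.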
This estimator extends classical sandwich-type variance estimators to the orthogonal moment setting under clustering and unequal weights.
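A cluster-level sandwich variance of this kind can be computed in a few lines. The sketch below assumes the per-unit weighted scores and treatment residuals are already available, centers the PSU totals around their mean, applies a $C/(C-1)$ small-sample factor, and ignores stratification; these choices and the name `cluster_robust_variance` are assumptions for illustration and may differ from the estimator actually used.

```python
import numpy as np

def cluster_robust_variance(psi, d_res, w, psu):
    """Sandwich variance for theta_hat from PSU-aggregated scores.

    psi:   per-unit weighted scores, w_i * d_res_i * (y_res_i - theta * d_res_i)
    d_res: per-unit treatment residuals
    """
    ids, inv = np.unique(psu, return_inverse=True)
    psi_c = np.bincount(inv, weights=psi)        # PSU-level score totals
    C = len(ids)
    J = np.sum(w * d_res ** 2)                   # minus the score derivative in theta
    meat = C / (C - 1) * np.sum((psi_c - psi_c.mean()) ** 2)
    return meat / J ** 2
```

When every unit is its own PSU and all weights equal one, this reduces to a heteroskedasticity-robust variance with the same small-sample factor.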
In practice, our primary inference relies on a design-consistent bootstrap that resamples PSUs within strata and recomputes the full cross-fitting procedure in each replicate. This approach captures variability due to clustering, sample splitting, and nuisance estimation, and is used to construct confidence intervals for the DML target parameter throughout the empirical analysis.
For a logistic outcome, orthogonal scores can be constructed by replacing the outcome residual with the working residual under a weighted logistic link and defining the treatment nuisance function as the survey-weighted conditional mean of the treatment given covariates. Alternatively, use an orthogonalized regression of the logit-transformed probabilities on the residualized treatment with cross-fitting, maintaining the survey weights throughout.
Analogous to 2SRI, apply the design-consistent bootstrap for $b = 1, \dots, B$:
1. Resample PSUs within strata to form bootstrap samples and weights $w_i^{*(b)}$.
2. Refit nuisance learners with weights $w_i^{*(b)}$.
3. Recompute cross-fitted scores and obtain $\hat{\theta}^{*(b)}$.

Percentile or studentized intervals from $\{\hat{\theta}^{*(b)}\}_{b=1}^{B}$ yield uncertainty quantification consistent with the sampling design.
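Once the replicate estimates are collected, a percentile interval is a pair of empirical quantiles. The helper below is a minimal sketch with a hypothetical name; it covers only the percentile variant, since a studentized interval would also need a standard error for each replicate.

```python
import numpy as np

def percentile_interval(draws, level=0.95):
    """Percentile bootstrap interval from replicate estimates."""
    a = (1.0 - level) / 2.0
    lo, hi = np.quantile(np.asarray(draws, dtype=float), [a, 1.0 - a])
    return float(lo), float(hi)
```

The same helper applies unchanged to the 2SRI replicates of the CKD coefficient.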
2SRI is attractive when valid instruments are available and the endogenous regressor admits a well-specified first-stage model. DML is preferable when high-dimensional covariates and flexible nuisance functions are needed, leveraging orthogonality and cross-fitting to control regularization bias. Both strategies are compatible with our survey-weighted pseudo-likelihood framework, $\ell_1$-penalization for variable selection, and design-consistent bootstrap to quantify uncertainty and stability.
3. Methodology
Our methodological framework combines design-based (survey-weighted) estimation, penalized regularization, stochastic resampling, and econometric extensions to address potential endogeneity and causal directionality between chronic kidney disease (CKD) and high cardiovascular (CV) risk. We describe each component below and then outline the simulation and empirical studies used for evaluation.
3.1. Survey-Weighted Logistic Regression
Let $Y_i \in \{0, 1\}$ indicate high cardiovascular risk (10-year predicted risk at or above the study threshold), and let $X_i$ denote a vector of demographic, behavioral, and biochemical predictors for individual $i$. Under the ENS multistage sampling design, each observation carries a sampling weight $w_i$ and is associated with a stratum and a PSU identifier.
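We consider the logistic specification

$$\operatorname{logit} \Pr(Y_i = 1 \mid X_i) = X_i^\top \beta, \tag{20}$$

and estimate $\beta$ by maximizing the survey-weighted pseudo-log-likelihood

$$\ell_w(\beta) = \sum_{i \in S} w_i \left[ Y_i \log p_i(\beta) + (1 - Y_i) \log\{1 - p_i(\beta)\} \right], \tag{21}$$

where $p_i(\beta) = \Pr(Y_i = 1 \mid X_i)$ under (20). Maximizing (21) yields design-consistent estimation under standard regularity conditions for complex surveys, provided that the sampling structure is correctly specified in the analysis.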
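For reference, the first-order condition behind this estimator can be written out. The display below is a standard derivation rather than an equation taken from the source; it assumes that $p_i(\beta)$ is the logistic probability for unit $i$ with covariates $X_i$, that $w_i$ is the sampling weight, and that $S$ is the set of sampled units.

```latex
\frac{\partial \ell_w(\beta)}{\partial \beta}
  = \sum_{i \in S} w_i \,\bigl\{ Y_i - p_i(\beta) \bigr\}\, X_i = 0,
\qquad
p_i(\beta) = \frac{\exp(X_i^\top \beta)}{1 + \exp(X_i^\top \beta)} .
```

Each sampled unit contributes its residual scaled by its weight, so the weighted sum acts as a Horvitz-Thompson type estimate of the score that would be obtained from a census of the finite population. This is the sense in which the maximizer targets a finite-population parameter.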
3.2. Penalized Estimation via LASSO
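Given the substantial collinearity among biomarkers and cardiometabolic indicators, together with the limited effective sample size induced by complex sampling, we use the Least Absolute Shrinkage and Selection Operator (LASSO) [41] to regularize estimation and perform variable selection.
LASSO augments the pseudo-log-likelihood with an $\ell_1$ penalty,
\[ P_\lambda(\beta) = \lambda \sum_{j=1}^{p} |\beta_j|, \qquad \lambda \ge 0, \]
and defines the estimator
\[ (\hat\beta_0, \hat\beta)^{\mathrm{LASSO}} = \arg\max_{\beta_0, \beta} \left\{ \ell_w(\beta_0, \beta) - P_\lambda(\beta) \right\}, \]
where $\ell_w$ is the survey-weighted pseudo-log-likelihood of Section 3.1 and the intercept $\beta_0$ is left unpenalized.
The $\ell_1$ penalty shrinks small coefficients exactly to zero, producing sparse and interpretable models that mitigate multicollinearity and reduce overfitting. This is especially relevant in CKD-related biomarker analysis, where lipid fractions, inflammatory markers, and glycemic indices tend to be highly correlated [42].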
In the empirical application, LASSO is used as a screening step. Final effect sizes and inference are obtained from unpenalized survey-weighted logistic regressions (svyglm) refitted on the covariate sets selected by LASSO.
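To make the screening step concrete, the following is a minimal sketch of a weighted $\ell_1$-penalized logistic fit by proximal gradient descent. It is not the glmnet algorithm used in the paper (glmnet uses coordinate descent); the function names, toy data, step size, and iteration count are assumptions chosen for illustration. The objective is the negative weighted pseudo-log-likelihood divided by the sum of weights, plus the penalty, with an unpenalized intercept.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def weighted_lasso_logistic(X, y, w, lam, step=0.5, iters=500):
    """Proximal-gradient sketch for a weighted l1-penalized logistic fit.

    Minimizes -(1/W) * sum_i w_i * loglik_i + lam * sum_j |beta_j|, with
    W = sum_i w_i and an unpenalized intercept. Step size and iteration
    count are illustrative choices, not tuned values.
    """
    p = len(X[0])
    w_total = sum(w)
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for xi, yi, wi in zip(X, y, w):
            eta = b0 + sum(bj * xj for bj, xj in zip(beta, xi))
            prob = 1.0 / (1.0 + math.exp(-eta))
            r = wi * (prob - yi) / w_total  # weighted score contribution
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= step * g0  # plain gradient step for the intercept
        beta = [soft_threshold(bj - step * gj, step * lam)
                for bj, gj in zip(beta, g)]
    return b0, beta


# Hypothetical toy data: column 0 is informative, column 1 is weak noise.
X = [[-1.0, 0.3], [-0.5, -0.2], [0.0, 0.1], [0.5, -0.3],
     [1.0, 0.2], [-0.8, -0.1], [0.8, 0.0], [0.2, 0.25]]
y = [0, 0, 1, 1, 1, 0, 1, 0]
w = [1.0, 2.0, 1.0, 1.5, 1.0, 2.0, 1.0, 1.5]

_, beta_sparse = weighted_lasso_logistic(X, y, w, lam=1.0)   # heavy penalty
_, beta_dense = weighted_lasso_logistic(X, y, w, lam=0.01)   # light penalty
```

With the heavy penalty every slope is set exactly to zero, while the light penalty lets the informative predictor enter. In the workflow described above, the selected set would then be passed to an unpenalized survey-weighted refit.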
3.3. Addressing Endogeneity and Causal Directionality
A key methodological challenge is the potential endogeneity of CKD status, which may correlate with unobserved factors or exhibit reverse causality with cardiovascular outcomes. In econometric terms, endogeneity arises when regressors correlate with the error term, leading to biased estimates. To address this concern in survey-weighted nonlinear settings, we consider two complementary strategies.
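We first estimate a survey-weighted first-stage model for the endogenous regressor $D_i$ (CKD) using instruments $Z_i$ and covariates $X_i$:
\[ \mathbb{E}[D_i \mid Z_i, X_i] = g\left(\pi_0 + Z_i^\top \pi + X_i^\top \gamma\right), \]
where $g$ is the first-stage inverse link. Residuals $\hat v_i = D_i - \hat{\mathbb{E}}[D_i \mid Z_i, X_i]$ from the first stage are then included in the second-stage logistic regression:
\[ \operatorname{logit} \Pr(Y_i = 1 \mid D_i, X_i, \hat v_i) = \beta_0 + \alpha D_i + X_i^\top \beta + \rho\, \hat v_i . \]
This method is consistent under nonlinearity and is widely used in applied health econometrics [36, 37]. In our setting, it provides a transparent way to assess the sensitivity of CKD effects to endogeneity corrections.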
To mitigate regularization bias and endogeneity in high-dimensional settings, we also consider Neyman-orthogonal moment conditions combined with sample splitting and cross-fitting:
\[ \sum_{i \in s} w_i \left\{ Y_i - \hat\ell(X_i) - \theta \left[ D_i - \hat m(X_i) \right] \right\} \left[ D_i - \hat m(X_i) \right] = 0, \]
where $\hat\ell(X_i)$ and $\hat m(X_i)$, the out-of-fold estimates of $\mathbb{E}[Y_i \mid X_i]$ and $\mathbb{E}[D_i \mid X_i]$, are estimated using survey-weighted penalized learners. Orthogonality provides robustness to nuisance estimation error, and cross-fitting yields valid inference under weak conditions [7, 38]. This framework supports a causal interpretation of the CKD effect under appropriate identification assumptions, where the estimand corresponds to a probability-scale average partial effect derived from a partially linear projection, rather than a structural parameter of a fully specified nonlinear model. This specification follows the standard partially linear DML framework, where the target parameter is defined via orthogonal moments and does not require correct specification of the full outcome distribution.
Both approaches preserve the survey-weighted pseudo-likelihood logic and are combined with design-consistent uncertainty quantification.
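For the partially linear case, the orthogonal moment has a closed-form solution once out-of-fold nuisance predictions are available. The sketch below assumes hypothetical inputs: `ell_hat` (predicted outcome given covariates), `m_hat` (predicted CKD probability given covariates), and survey weights `w`. The names and toy numbers are illustrative and are not taken from the paper.

```python
def plr_theta(y, d, ell_hat, m_hat, w):
    """Weighted partially linear DML estimate from out-of-fold predictions.

    Solves sum_i w_i * (u_i - theta * v_i) * v_i = 0, where
    u_i = y_i - ell_hat_i (residualized outcome) and
    v_i = d_i - m_hat_i (residualized treatment).
    """
    num = 0.0
    den = 0.0
    for yi, di, li, mi, wi in zip(y, d, ell_hat, m_hat, w):
        u = yi - li
        v = di - mi
        num += wi * v * u
        den += wi * v * v
    return num / den


def plr_score_sum(theta, y, d, ell_hat, m_hat, w):
    """Weighted sum of the orthogonal score evaluated at theta."""
    return sum(wi * ((yi - li) - theta * (di - mi)) * (di - mi)
               for yi, di, li, mi, wi in zip(y, d, ell_hat, m_hat, w))


# Hypothetical toy inputs (the nuisance predictions would come from cross-fitting).
y = [1, 0, 0, 1, 1, 0]
d = [1, 0, 0, 1, 0, 1]
ell_hat = [0.7, 0.2, 0.3, 0.6, 0.4, 0.5]
m_hat = [0.6, 0.2, 0.3, 0.5, 0.1, 0.4]
w = [1.0, 2.0, 1.5, 1.0, 2.5, 1.0]

theta_hat = plr_theta(y, d, ell_hat, m_hat, w)
```

Because the estimate is a ratio of weighted sums of residual products, it is a probability-scale slope rather than an odds ratio, which matches the interpretation given in the text.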
The DML estimator relies on several key identification conditions adapted to the complex-survey setting: (i) sufficient overlap between treatment groups after weighting; (ii) relevance of covariates for predicting both the outcome and the potentially endogenous regressor; (iii) conditional exogeneity or valid instruments when applicable; and (iv) sampling ignorability under the survey design. Practical limitations arise when treatment prevalence is low—as in the case of CKD—leading to reduced effective sample size and potentially unstable nuisance estimation. To mitigate these challenges, cross-fitting is performed at the PSU level to respect the sampling structure while reducing overfitting and regularization bias in the estimation of nuisance components. From an identification standpoint, the absence of credible exclusion-based instruments is a fundamental limitation of large-scale health surveys. This constraint reinforces the need for approaches such as DML, which rely on conditional independence assumptions rather than instrumental variables, and highlights the role of 2SRI as a diagnostic rather than a primary identification strategy in this context.
The causal interpretation of the CKD effect in this study relies on distinct identification strategies that must be interpreted carefully. First, the double/debiased machine learning (DML) estimator is defined as the primary causal estimand under a conditional mean independence assumption, namely:
\[ \mathbb{E}[\varepsilon_i \mid D_i, X_i] = 0, \qquad Y_i = \theta D_i + g(X_i) + \varepsilon_i, \]
which requires that, conditional on observed covariates $X_i$ and survey weights, treatment assignment (CKD status, $D_i$) is as good as random. Under this assumption, the DML estimator recovers the average treatment effect in a partially linear projection framework.
Importantly, the DML estimand should be interpreted as the best linear approximation to the conditional mean function, rather than as a fully nonparametric structural parameter.
Second, although we implement a Two-Stage Residual Inclusion (2SRI) approach, we do not claim the availability of valid exclusion-based instruments in the ENS dataset. Variables such as age, sex, hypertension, and diabetes are likely to violate the exclusion restriction, as they may directly affect cardiovascular risk beyond their association with CKD.
Accordingly, 2SRI is used strictly as a diagnostic sensitivity analysis to assess the stability of CKD effect estimates under alternative endogeneity corrections, rather than as a source of causal identification.
This distinction ensures that causal claims are based exclusively on the assumptions underlying DML, while 2SRI results are interpreted as robustness checks.
3.4. Stochastic Resampling and Variability Assessment
To quantify the sampling distribution of $\hat\beta$ and of endogeneity-corrected estimators, we use a design-consistent bootstrap that resamples primary sampling units within strata while preserving the complex survey structure. For bootstrap replicates $b = 1, \dots, B$, we compute $\hat\beta^{(b)}$ and, when applicable, the replicate CKD effect estimate for 2SRI or for DML. These replicates support inference on the coefficient variability, estimator stability, and predictive performance.
Design-Consistent Bootstrap and PSU-Level Cross-Fitting
Uncertainty was quantified using a design-consistent bootstrap that resamples primary sampling units (PSUs) with replacement within strata, preserving the multistage ENS sampling design. For each bootstrap replicate, PSUs were resampled within strata, and survey weights were recomputed accordingly. All analytical steps (LASSO screening, refitted survey-weighted models, two-stage residual inclusion (2SRI), and double/debiased machine learning (DML) nuisance estimation) were re-estimated within each bootstrap replicate.
For DML, cross-fitting was performed using folds defined at the PSU level, such that all observations within a given PSU were assigned to the same fold. This design-aware sample splitting prevents information leakage across clusters and preserves validity under complex sampling.
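A PSU-level fold assignment of this kind can be sketched as follows. The function name, the random round-robin allocation, and the toy identifiers are assumptions for illustration; the paper does not specify how PSUs were allocated to folds.

```python
import random


def psu_level_folds(strata, psus, k, seed=0):
    """Assign every observation to a fold so that a PSU is never split.

    PSUs are identified by (stratum, psu) pairs, shuffled, and dealt to the
    k folds in round-robin order (an illustrative allocation rule).
    """
    rng = random.Random(seed)
    clusters = sorted(set(zip(strata, psus)))
    rng.shuffle(clusters)
    fold_of_cluster = {c: i % k for i, c in enumerate(clusters)}
    return [fold_of_cluster[(h, c)] for h, c in zip(strata, psus)]


# Hypothetical design: 3 strata, 2 PSUs per stratum, 2 units per PSU.
strata = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
psus = [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2]

folds = psu_level_folds(strata, psus, k=3)
```

Nuisance models for fold f would then be trained on observations with a different fold label and used to predict for fold f, so that no PSU contributes to both the training and the prediction of its own units.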
3.5. Predictive Performance
3.5.1. Weighted ROC Curves
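Discriminatory ability is assessed using the survey-weighted area under the ROC curve (AUC). A design-based estimator is
\[ \widehat{\mathrm{AUC}}_w = \frac{\sum_{i:\, Y_i = 1} \sum_{j:\, Y_j = 0} w_i w_j \, \mathbb{1}(\hat p_i > \hat p_j)}{\sum_{i:\, Y_i = 1} \sum_{j:\, Y_j = 0} w_i w_j}, \]
where $\hat p_i$ denotes fitted probabilities and $w_i$ the sampling weights. This estimator approximates the probability that a randomly drawn case (weighted by the survey design) receives a higher predicted risk than a randomly drawn control.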
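The pairwise definition translates directly into code. The sketch below is a plain double loop over case-control pairs, with each pair weighted by the product of the two sampling weights; giving tied predictions half credit is an assumption of this example, and the function name and toy data are hypothetical.

```python
def weighted_auc(y, p, w):
    """Survey-weighted AUC as a weighted proportion of concordant pairs.

    Each (case, control) pair contributes w_case * w_control. Ties in the
    predicted probability receive half credit (an illustrative convention).
    """
    concordant = 0.0
    total = 0.0
    for yi, pi, wi in zip(y, p, w):
        if yi != 1:
            continue
        for yj, pj, wj in zip(y, p, w):
            if yj != 0:
                continue
            pair_weight = wi * wj
            total += pair_weight
            if pi > pj:
                concordant += pair_weight
            elif pi == pj:
                concordant += 0.5 * pair_weight
    return concordant / total


# Hypothetical toy data: two cases, two controls.
y = [1, 1, 0, 0]
p = [0.9, 0.4, 0.5, 0.1]

auc_unweighted = weighted_auc(y, p, [1.0, 1.0, 1.0, 1.0])  # 3 of 4 pairs concordant
auc_weighted = weighted_auc(y, p, [1.0, 3.0, 1.0, 1.0])    # up-weights the weaker case
```

The double loop is quadratic in the sample size; for large surveys a sort-based computation would be preferable, but the pairwise form keeps the link to the definition visible.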
3.5.2. Calibration and Reclassification
Calibration is evaluated using survey-weighted Brier scores and calibration curves based on weighted risk deciles. Incremental predictive value is assessed using survey-adapted reclassification metrics, including the net reclassification improvement (NRI) and the integrated discrimination improvement (IDI), computed under the complex sampling design.
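A weighted Brier score and a weighted-quantile calibration table can be sketched as below. The grouping rule (cutting the cumulative weight of observations sorted by predicted risk into equal shares) is one reasonable reading of "weighted risk deciles" and is an assumption, as are the function names and toy data.

```python
def weighted_brier(y, p, w):
    """Survey-weighted mean squared difference between prediction and outcome."""
    return sum(wi * (pi - yi) ** 2 for yi, pi, wi in zip(y, p, w)) / sum(w)


def weighted_calibration(y, p, w, groups=10):
    """Mean predicted risk, observed rate, and weight sum per weighted risk group.

    Observations are sorted by predicted risk and assigned to groups according
    to the cumulative weight accumulated before them (illustrative rule).
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    total = sum(w)
    sums = [[0.0, 0.0, 0.0] for _ in range(groups)]  # weight, weight*p, weight*y
    cumulative = 0.0
    for i in order:
        g = min(groups - 1, int(groups * cumulative / total))
        sums[g][0] += w[i]
        sums[g][1] += w[i] * p[i]
        sums[g][2] += w[i] * y[i]
        cumulative += w[i]
    return [(sp / sw, sy / sw, sw) for sw, sp, sy in sums if sw > 0]


# Hypothetical toy data.
brier = weighted_brier([1, 0], [0.8, 0.3], [1.0, 3.0])
table = weighted_calibration([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.8],
                             [1.0, 1.0, 1.0, 1.0], groups=2)
```

Each row of the table gives one point of a calibration plot (mean predicted risk against observed rate), and the third entry can drive the point size, as in the decile plots reported later.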
3.6. Software Implementation
All analyses were conducted in R (version 4.5.1). Core functionality for complex survey inference and penalized regression relied on survey and glmnet. Several components required custom development to ensure design-consistent inference and integration of causal estimators:
Survey-weighted metrics: functions for design-based AUC, Brier score, and decile-based calibration diagnostics preserving PSU-level structure.
Endogeneity correction: implementation of two-stage residual inclusion (2SRI) for nonlinear models under survey weights.
Double/debiased machine learning: cross-fitting routines for orthogonal score estimation of the target causal parameter under complex sampling.
Design-based bootstrap: PSU-level resampling within strata for uncertainty quantification of predictive and causal parameters.
High-dimensional penalization: integration of survey weights into LASSO fitting, $\lambda$-selection, and stability assessment across penalty grids.
Calibration and sensitivity: weighted calibration plots and raking-based sensitivity analyses under shifts in CKD prevalence.
All routines, including simulation scripts for population generation, complex sampling, endogeneity mechanisms, bootstrap procedures, and sensitivity analyses, are provided in the companion scripts to support full reproducibility.
3.7. Simulation and Empirical Study
We present two complementary analyses: (i) a simulation study emulating ENS-like design features and CKD prevalence, and (ii) an empirical application using ENS 2016–2017 data. Both analyses evaluate predictive performance, calibration, and robustness to endogeneity corrections using control-function (2SRI) and double/debiased machine learning (DML) approaches.
3.8. Simulation Study
We generated a synthetic finite population of $N$ units to emulate key structural features of national health surveys, including stratification ($H$ strata), primary sampling units (PSUs) within strata, and survey weights approximating inverse inclusion probabilities with mild calibration noise. Each unit was assigned demographic and clinical covariates (age, sex, smoking, diabetes, treated hypertension, systolic blood pressure) and a high-dimensional biomarker panel ($p$ biomarkers) with an AR(1) correlation structure (correlation parameter $\rho$).
The binary outcome (high CV risk) was generated via a logistic model combining classical risk factors, biomarker effects, and an endogeneity component correlated with CKD status. CKD was generated as a function of covariates, biomarkers, and instruments (two proxies with strong relevance), inducing correlation between CKD and the outcome error term. This setup enables evaluation of naive bias and of the performance of endogeneity-corrected estimators under complex sampling.
From the finite population, we drew a stratified multistage sample by selecting PSUs within strata and then individuals within PSUs, yielding survey weights, strata, and cluster identifiers. All estimators incorporate these design elements.
We fit survey-weighted penalized logistic regressions using LASSO for variable selection, maximizing the pseudo-log-likelihood with an $\ell_1$ penalty. The tuning parameter was selected via weighted cross-validation. Endogeneity was addressed using 2SRI and DML with cross-fitting ($K$ folds).
Design-based uncertainty was quantified via a PSU-level bootstrap within strata ($B$ replicates), providing confidence intervals for AUC, Brier score, and the DML effect estimate. Sensitivity analyses were conducted by raking weights to target CKD prevalence levels of 8%, 12%, and 20%.
Predictive performance was evaluated using survey-weighted AUC and Brier score. Causal inference focused on the DML effect estimate, reported on a probability scale. Bootstrap percentiles were used to construct confidence intervals.
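The prevalence-targeting step can be illustrated with a small sketch. With a single binary margin, raking reduces to one post-stratification adjustment, which is what the code below does; raking over several margins would iterate such adjustments until convergence. The function name and toy weights are hypothetical.

```python
def rake_to_prevalence(weights, ckd, target):
    """Rescale weights so the weighted CKD prevalence equals `target`.

    The total weight is preserved. With one binary margin this is a single
    post-stratification step (the special case of raking used here).
    """
    total = sum(weights)
    weight_ckd = sum(w for w, c in zip(weights, ckd) if c == 1)
    weight_non = total - weight_ckd
    factor_ckd = target * total / weight_ckd
    factor_non = (1.0 - target) * total / weight_non
    return [w * (factor_ckd if c == 1 else factor_non)
            for w, c in zip(weights, ckd)]


def weighted_prevalence(weights, ckd):
    return sum(w for w, c in zip(weights, ckd) if c == 1) / sum(weights)


# Hypothetical toy sample with a weighted CKD prevalence of 10%.
weights = [10.0, 20.0, 30.0, 40.0]
ckd = [1, 0, 0, 0]

raked = {t: rake_to_prevalence(weights, ckd, t) for t in (0.08, 0.12, 0.20)}
```

Each set of raked weights would then replace the original weights when recomputing the predictive metrics and the causal estimate.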
3.9. Empirical Study
We use ENS 2016–2017, a nationally representative health survey implemented using multistage, stratified cluster sampling. All analyses incorporate sampling weights, strata, and PSUs to ensure design-consistent inference.
CKD is defined based on kidney function criteria using estimated glomerular filtration rate (eGFR) and albuminuria, following KDIGO-related standards [43]. Cardiovascular risk is estimated using a Framingham-type score widely applied in Chilean epidemiology [22].
Candidate predictors include demographic characteristics, behavioral indicators, cardiometabolic conditions, and laboratory measurements commonly linked to cardiovascular risk, including age, sex, hypertension, diabetes, body mass index, metabolic syndrome, glucose-related markers, lipid profile, kidney function indicators, inflammatory markers (e.g., hs-CRP), and regional indicators [27, 44].
To reduce overfitting and address multicollinearity, we apply LASSO as a screening step. Covariates with excessive missingness (greater than 20%) were excluded prior to LASSO estimation. The penalty parameter $\lambda$ was selected via five-fold cross-validation using deviance, retaining both the $\lambda_{\min}$ and $\lambda_{1se}$ solutions. Final inference is based on unpenalized survey-weighted logistic regressions refit on selected covariate sets.
To evaluate sensitivity to potential endogeneity between CKD and cardiovascular risk, we consider 2SRI and DML extensions within the survey-weighted framework. These analyses complement the primary predictive and associative results by assessing the robustness of CKD effect estimates under alternative identification strategies.
We compare:
1. Base model (Framingham-like): classical risk factors (age, sex, SBP, smoking, diabetes, and treated hypertension).
2. CKD-augmented model: adds CKD-related biomarkers and indicators (e.g., hs-CRP, hemoglobin, HbA1c, eGFR/uACR, where available, and a CKD indicator).
For each specification, we report discrimination (AUC), calibration diagnostics, and reclassification metrics where applicable, and we evaluate how CKD effect estimates change under endogeneity corrections.
Variance estimation accounts for the complex multistage ENS design using Taylor linearization. Strata containing a single PSU after data cleaning are handled using the standard “lonely PSU” adjustment in the survey package (survey.lonely.psu = "adjust"), applying a conservative correction without excluding strata.
To ensure full reproducibility, all data and code are provided in a public repository [45].
Survey-weighted logistic models (naive and LASSO-refit) and the two-stage residual inclusion (2SRI) approach report effects on the odds ratio (OR) scale and are interpreted as associational or diagnostic. In contrast, double/debiased machine learning (DML) is defined as the primary causal estimator and reports the average treatment effect of CKD on the probability of high cardiovascular risk (probability difference, pp). Uncertainty is quantified using a design-consistent PSU-level bootstrap throughout.
Table 1 summarizes these distinctions.
4. Results
4.1. Results of Simulation Study
Under a generalizable survey-like simulation with stratification, PSUs within strata, high-dimensional biomarkers with AR(1) correlation, and endogeneity between CKD and the outcome error (two relevant instruments), we evaluated predictive and causal performance using survey-weighted LASSO (naive), a DML-constrained outcome model (svyglm with offset), and 2SRI residual inclusion. Design-based uncertainty was quantified via PSU-level bootstrap, and sensitivity to CKD prevalence was assessed via raking to 8%, 12%, and 20%.
Table 2 reports survey-weighted AUC and Brier score for the three specifications. The naive LASSO achieves moderate discrimination (AUC) and reasonable calibration (Brier score), with narrow design-based intervals. The DML-offset svyglm maintains causal consistency for the CKD component and slightly improves discrimination and calibration. The endogeneity-corrected 2SRI shows the highest discrimination (AUC) and the lowest Brier score (≈0.210).
Table 3 summarizes the DML-estimated causal effect of CKD on high cardiovascular risk. The parameter is reported in percentage points on a residualized probability scale (not odds ratios). Bootstrap percentiles provide design-based uncertainty.
Table 4 shows the stability of predictive metrics and the DML parameter when reweighting the sample to target CKD prevalence levels of 8%, 12%, and 20%. Changes are minimal, supporting generalizability under aggregate prevalence shifts.
Figure 2 compares ROC curves for the naive LASSO, DML-offset svyglm, and 2SRI. Consistent with Table 2, 2SRI attains the highest discrimination, while the DML-offset maintains competitive performance with causal consistency for the CKD component.
Figure 3 displays weighted decile calibration plots for naive LASSO and 2SRI. The 2SRI specification yields slightly better global calibration (lower Brier), with points closer to the identity line. Point sizes reflect the sum of survey weights per decile.
Figure 4 summarizes the bootstrap distribution of the DML effect estimate under PSU resampling within strata. The distribution is approximately unimodal, with the percentile interval not including zero, supporting a positive CKD effect on the residualized probability of high cardiovascular risk.
Across high-dimensional, endogeneity-aware specifications under complex sampling, the naive LASSO provides a practical baseline for risk stratification, while DML-offset preserves causal consistency for the CKD component with competitive predictive metrics. The instability and wide confidence intervals observed in the 2SRI estimates should be interpreted in light of both the weak effective sample size and the lack of valid exclusion restrictions. Consequently, these results are not taken as causal estimates, but rather as indicative of the sensitivity of the CKD effect to alternative modeling assumptions that attempt to account for endogeneity. Design-based bootstrap confirms estimator stability, and raking analyses indicate minimal sensitivity to CKD prevalence shifts, supporting generalizability of the framework.
4.2. Empirical Results: ENS 2016–2017
Using data from the Chilean National Health Survey (ENS) 2016–2017 and explicitly accounting for its complex multistage sampling design, we estimated survey-weighted prevalence figures and regression models for chronic kidney disease (CKD) and high cardiovascular (CV) risk in the adult population. CKD was defined as an estimated glomerular filtration rate below 60 mL/min/1.73 m², while high cardiovascular risk was defined as the highest risk category (category 3) of the Chilean-adapted Framingham 10-year cardiovascular risk score.
The analytic sample comprised adults, including approximately individuals with CKD events. The survey-weighted prevalence of CKD was 3.1% (95% CI: 2.4–3.8), and the prevalence of high cardiovascular risk was 23.9% (95% CI: 21.5–26.3).
All variance estimates fully accounted for the complex survey design, including stratification, clustering, unequal sampling weights, and strata containing a single primary sampling unit, which were handled using conservative variance adjustments.
Table 5 presents survey-weighted baseline characteristics stratified by CKD status. Individuals with CKD were substantially older than those without CKD, with a mean age of 74.8 years compared to 42.3 years in the non-CKD group. The proportion of women was similar across groups (50.9% vs. 51.6%), and the mean body mass index (BMI) did not differ significantly despite the large age gap (28.9 vs. 28.5 kg/m²).
In contrast, cardiometabolic conditions were markedly more prevalent among individuals with CKD. Hypertension affected 86.7% of participants with CKD compared to 25.5% among those without CKD, while diabetes prevalence was nearly three times higher in the CKD group (31.1% vs. 11.2%). Most notably, the prevalence of high cardiovascular risk exceeded 90% among individuals with CKD, compared to 21.5% in the non-CKD population.
Taken together, these findings reveal a pronounced clustering of cardiovascular risk factors and predicted cardiovascular risk among individuals with CKD in the Chilean adult population, providing a strong empirical motivation for the multivariable and regularized modeling strategies developed in the subsequent analyses.
4.2.1. Naive Survey-Weighted Model
As an initial benchmark, we estimated a survey-weighted logistic regression model treating CKD status as exogenous. The model adjusted for age, sex, hypertension, diabetes, and body mass index, while fully accounting for the complex survey design.
In this naive specification, CKD was strongly associated with high cardiovascular risk. Individuals with CKD exhibited substantially higher odds of being classified as having high cardiovascular risk compared to those without CKD (OR = 5.66; 95% CI: 2.71–11.82). Age and diabetes emerged as the strongest predictors, while hypertension also showed a positive and statistically significant association. Sex and body mass index were not independently associated with high cardiovascular risk in this model.
While these results indicate a robust association between CKD and predicted cardiovascular risk, the naive specification does not address potential overfitting or multicollinearity among cardiometabolic covariates. These limitations motivate the use of regularized variable selection techniques in the subsequent analysis.
4.2.2. Regularized Variable Selection via LASSO
Under the conservative $\lambda_{1se}$ criterion, LASSO selected a compact and clinically interpretable set of predictors, including chronic kidney disease (CKD), age, hypertension, and diabetes. This subset captures the core cardiometabolic pathways linking renal dysfunction and cardiovascular risk.
Using the more permissive $\lambda_{\min}$ criterion, additional variables entered the model, including body mass index and selected metabolic and laboratory indicators, as well as regional fixed effects. While this specification improved in-sample fit, it did so at the cost of increased model complexity.
Table 6 reports the survey-weighted logistic regression results re-estimated using the variables selected by each LASSO rule. Across both specifications, CKD remained a strong and statistically significant predictor of high cardiovascular risk. In the $\lambda_{1se}$ model, individuals with CKD exhibited more than fivefold higher odds of being classified as having high cardiovascular risk compared to those without CKD (OR = 5.73; 95% CI: 2.80–11.73). Nearly identical effect sizes were observed under the $\lambda_{\min}$ specification (OR = 5.62; 95% CI: 2.72–11.60), indicating substantial robustness of the CKD association to alternative model selection criteria.
Age and diabetes consistently emerged as the strongest predictors in both models, while hypertension retained a moderate but statistically significant association. Importantly, the sign and magnitude of the CKD coefficient were stable across specifications, and all shared covariates exhibited identical coefficient signs, underscoring the structural robustness of the estimated relationships.
Model performance metrics were similar across the two specifications. The McFadden pseudo-$R^2$ was 0.445 for the $\lambda_{1se}$ model and 0.449 for the $\lambda_{\min}$ model, suggesting that the additional covariates selected under $\lambda_{\min}$ provided only marginal improvements in explanatory power.
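For readers unfamiliar with the metric, McFadden's pseudo-$R^2$ compares the log-likelihood of the fitted model with that of an intercept-only model. The expression below assumes that weighted pseudo-log-likelihoods are used in both terms; the text does not state which variant was computed, so the notation ($\ell_w$, $\hat\beta$, $\hat\beta_0^{\text{null}}$) is illustrative.

```latex
R^2_{\text{McF}} \;=\; 1 \;-\; \frac{\ell_w(\hat\beta)}{\ell_w(\hat\beta_0^{\text{null}})},
\qquad
\ell_w(\beta) \;=\; \sum_{i \in s} w_i \bigl\{ y_i \log p_i(\beta) + (1 - y_i) \log[1 - p_i(\beta)] \bigr\}.
```

Because both log-likelihoods are negative and the fitted model cannot fit worse than the null model it nests, the ratio lies between 0 and 1, so a difference of 0.004 between two nested specifications corresponds to a very small relative gain in fit.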
Overall, these results support the use of the parsimonious $\lambda_{1se}$ specification as the primary empirical model, while the $\lambda_{\min}$ specification serves as a sensitivity analysis confirming the stability of the CKD effect.
In the empirical application, LASSO is used exclusively as a data-driven variable screening step. All reported odds ratios, confidence intervals, and causal estimates are obtained from unpenalized survey-weighted models refitted using the covariate sets selected under $\lambda_{1se}$ (main specification) and $\lambda_{\min}$ (sensitivity analysis).
Both naive and LASSO-refitted survey-weighted models yield similar odds ratios, indicating a strong association between CKD and high cardiovascular risk. These models are interpreted as predictive or associational and do not admit a causal interpretation. See Table 7.
4.2.3. Endogeneity Assessment via Two-Stage Residual Inclusion
To explore potential endogeneity between chronic kidney disease (CKD) status and cardiovascular risk, we implemented a two-stage residual inclusion (2SRI) approach adapted to the complex ENS sampling design. In this framework, CKD status was treated as potentially endogenous.
In the first stage, CKD was modeled as a function of age, sex, hypertension, and diabetes, which act as clinically motivated predictors of kidney dysfunction and are strongly related to CKD prevalence. The first-stage model was estimated using survey-weighted logistic regression without penalization. First-stage relevance was assessed using design-based Wald tests that account for stratification, clustering, and survey weights. Coefficient estimates, standard errors, effective sample size, and joint relevance diagnostics are reported in Table 8.
The resulting first-stage residuals were then included as a control function in the second-stage outcome model. In this specification, the residual term was not statistically significant, providing no strong evidence of residual endogeneity after conditioning on observed covariates.
The estimated effect of CKD increased markedly in magnitude (OR = 69.1), but with a very wide confidence interval (95% CI: 1.06–4502.7), reflecting numerical instability due to limited overlap and the small effective sample size of individuals with CKD. These results indicate that conventional control-function approaches may be unreliable in this setting and motivate the use of orthogonalized machine learning estimators as the primary causal strategy.
The extreme odds ratio observed in the 2SRI specification reflects sparse-data issues and limited overlap between CKD and non-CKD groups. The number of CKD events per covariate was below conventional thresholds, and quasi-separation was detected in the second-stage model. Mean variance inflation factors were below 5, indicating limited multicollinearity. Penalized likelihood corrections (e.g., Firth) were not applied, as the 2SRI results are reported solely for diagnostic comparison.
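The degree of instability can be read off the reported intervals by moving to the log-odds scale. The sketch below assumes the intervals are Wald intervals that are symmetric on the log-odds scale and built with a normal critical value; survey software may instead use a t critical value with design-based degrees of freedom, in which case the implied standard errors would be somewhat smaller. The function names are hypothetical; the numeric inputs are the intervals reported in the text.

```python
import math


def implied_log_or_se(lower, upper, z=1.959964):
    """Standard error on the log-odds scale implied by a symmetric Wald CI."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def interval_midpoint_or(lower, upper):
    """Odds ratio at the midpoint of the interval on the log-odds scale."""
    return math.exp(0.5 * (math.log(lower) + math.log(upper)))


# Intervals reported in the text: 2SRI (OR = 69.1) and naive model (OR = 5.66).
se_2sri = implied_log_or_se(1.06, 4502.7)
se_naive = implied_log_or_se(2.71, 11.82)
mid_2sri = interval_midpoint_or(1.06, 4502.7)
mid_naive = interval_midpoint_or(2.71, 11.82)
ratio = se_2sri / se_naive
```

Under these assumptions the 2SRI coefficient has an implied standard error of roughly 2.1 on the log-odds scale, against roughly 0.4 for the naive model, and both interval midpoints reproduce the reported point estimates, which is consistent with the interpretation of the 2SRI result as numerically unstable rather than as evidence of a larger effect.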
4.2.4. Double/Debiased Machine Learning Estimation
Finally, we estimated the effect of CKD on high cardiovascular risk using a double/debiased machine learning (DML) approach based on the partially linear regression (PLR) framework with cross-fitting. This specification avoids reliance on inverse propensity weighting, which can be unstable in the presence of rare treatments and complex survey designs.
Nuisance components for the outcome and treatment models were estimated via survey-weighted logistic regressions, and the causal effect of CKD was obtained from a second-stage regression on orthogonalized residuals. This strategy yields a stable and approximately unbiased estimate of the average treatment effect under weak regularity conditions.
The DML estimate suggests that CKD is not associated with a statistically significant increase in the probability of being classified as having high cardiovascular risk once demographic characteristics and cardiometabolic comorbidities are accounted for (SE = 0.161). While naive and control-function specifications indicate a strong association between CKD and cardiovascular risk, the DML results indicate that this relationship is largely explained by confounding factors, particularly age and diabetes.
Table 9 summarizes the empirical modeling results for the association between CKD and high cardiovascular risk.
Figure 5 provides a visual comparison of the CKD effect across modeling strategies, while Figure 6 illustrates how predicted cardiovascular risk distributions differ between naive and DML-adjusted specifications.
Given the low prevalence of CKD (approximately 3%) and the resulting limited effective sample size, the minimum detectable effect for the DML estimator is on the order of 0.25–0.30 probability points. Smaller causal effects cannot be reliably distinguished from zero under the observed design, and estimates should be interpreted with this power limitation in mind.
The attenuation and loss of statistical significance observed in the DML estimate, relative to naive models, is consistent with both confounding adjustment and the increased variance associated with nuisance estimation under limited effective sample size. This pattern reinforces the interpretation of DML results as conservative, design-aware estimates rather than precise structural effects.
From an identification perspective, the relatively small number of CKD cases implies a limited effective sample size for estimating treatment effects, particularly under complex sampling with clustering and unequal weights. This constraint is further amplified when using cross-fitting and flexible nuisance estimation in DML, which partitions the data and may reduce effective information per fold.
5. Discussion
This study develops and illustrates a unified, design-aware framework that integrates penalized prediction, causal inference, and design-consistent uncertainty quantification under complex survey sampling. Consistent with this objective, the principal contribution of this work is methodological rather than substantive.
5.1. Methodological Reflections
The application to chronic kidney disease (CKD) and cardiovascular (CV) risk serves as a concrete example that exposes distinctions that matter mathematically and operationally but are often blurred in applied research.
First, the empirical results highlight a structural separation between predictive association and causal contribution. Survey-weighted descriptive analyses and penalized predictive models reproduce well-established empirical patterns in the CKD literature [22, 23]. However, once confounding and endogeneity are addressed using control-function or orthogonalized estimators, the independent causal effect of CKD on high predicted CV risk is greatly attenuated.
Second, the methodological architecture emphasizes the importance of design-consistent prediction metrics. In complex surveys, naive AUC and calibration measures can be misleading. Recent methodological developments provide weighted estimators of ROC curves and AUC that honor sampling weights and clustering [46, 47]. Integrating these tools prevents overestimation of discriminative ability and ensures that predictive evaluation aligns with the finite population represented by the survey.
Third, the use of design-based bootstrap for both predictive and causal estimators is a core part of the framework. Classical resampling methods for complex surveys, including Rao–Wu and Rust–Rao replicate-weight approaches, provide valid variance estimation under stratification and clustering [2, 3]. Modern software implementations further facilitate reproducible variance estimation and stability assessment [48].
Fourth, addressing endogeneity under complex sampling requires careful methodological integration. Control-function logic in nonlinear models [6] and orthogonalization via double/debiased machine learning (DML) provide complementary strategies. Recent advances in weighted orthogonal learners [11, 49] reinforce the value of orthogonality when nuisance estimation must incorporate survey weights, high-dimensionality, or limited effective sample sizes. In this illustrative example, the attenuation of the CKD effect is therefore interpreted not as a biomedical conclusion but as an instance of how orthogonal estimators isolate structural signal from design-induced or confounding-driven associations.
Fifth, the example illustrates a fundamental principle: under complex sampling, prediction and causal inference need not coincide. Predictive estimators may exploit associations that are not causally meaningful, while causal estimators prioritize robustness even at the cost of reduced discrimination. The proposed framework makes these trade-offs explicit and provides a reproducible workflow for applied research in any domain where complex sampling, penalization, and causal inference intersect.
While DML provides a powerful framework for bias reduction via orthogonalization, its performance in complex survey settings depends on the interaction between sampling design and nuisance estimation. In particular, clustering, weighting, and limited effective sample size may constrain the flexibility and stability of machine learning estimators.
In this context, DML should be interpreted as a principled compromise between bias control and variance inflation, rather than as a fully nonparametric causal estimator. This perspective aligns with recent discussions emphasizing the projection-based interpretation of orthogonal estimators in finite samples.
From a broader perspective, our approach connects recent advances in orthogonal machine learning [7, 12] with the design-based inference tradition in survey statistics. While the former emphasizes robustness to high-dimensional nuisance estimation, the latter focuses on valid population-level inference under complex sampling. Bridging these two perspectives remains an open methodological challenge.
The framework developed in this paper is closely related to recent advances in survival analysis and longitudinal modeling under complex dependence structures. In particular, modern approaches increasingly integrate clustering, competing risks, and heterogeneity within unified estimation frameworks [17, 18].
Although our focus is on cross-sectional binary outcomes, the underlying principles—namely, design-aware estimation, orthogonalization, and resampling-based inference—extend conceptually to time-to-event settings. In such contexts, the target parameter becomes a hazard function or cumulative incidence rather than a conditional mean, but similar challenges arise in accounting for dependence and ensuring robust inference. Extending the present framework to survival and longitudinal data represents a natural direction for future research.
From an applied-mathematics standpoint, the framework contributes two decision-relevant capabilities. (i) It yields design-consistent predictive evaluation (weighted ROC/AUC, calibration, and bootstrap uncertainty) so that model performance reflects the target finite population, a prerequisite for downstream policy rules based on risk thresholds [46, 47]. (ii) It provides endogeneity-aware causal effect estimates via orthogonal moments, ensuring that structural parameters used to justify escalation/escalation-avoidance policies are not artifacts of confounding. Together, these components support evidence-to-decision workflows that combine discrimination and calibration with explicit trade-offs, e.g., via decision-curve analysis and net benefit, which quantify whether model-guided actions improve outcomes relative to treat-all or treat-none strategies [50, 51].
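The two evaluation quantities named above can be sketched in a few lines. The code below computes a survey-weighted AUC, in which each case-control pair is counted with the product of the two sampling weights, and a survey-weighted net benefit at a given risk threshold, using the standard decision-curve definition (weighted true positives minus weighted false positives scaled by the threshold odds). The function names `weighted_auc` and `weighted_net_benefit` are hypothetical, and the pairwise AUC loop is quadratic in sample size, so it is meant as an illustration rather than an efficient implementation.

```python
def weighted_auc(y, score, w):
    """Weighted probability that a case outranks a control.

    Each case-control pair contributes the product of the two sampling
    weights; tied scores count one half.
    """
    cases = [(s, wi) for yi, s, wi in zip(y, score, w) if yi == 1]
    controls = [(s, wi) for yi, s, wi in zip(y, score, w) if yi == 0]
    concordant = 0.0
    for s1, w1 in cases:
        for s0, w0 in controls:
            if s1 > s0:
                concordant += w1 * w0
            elif s1 == s0:
                concordant += 0.5 * w1 * w0
    return concordant / (sum(wi for _, wi in cases) * sum(wi for _, wi in controls))


def weighted_net_benefit(y, score, w, threshold):
    """Weighted net benefit of treating everyone with score >= threshold."""
    total = sum(w)
    tp = sum(wi for yi, s, wi in zip(y, score, w) if s >= threshold and yi == 1)
    fp = sum(wi for yi, s, wi in zip(y, score, w) if s >= threshold and yi == 0)
    return tp / total - fp / total * threshold / (1 - threshold)


# Toy data: two cases, two controls, unequal weights.
y = [1, 1, 0, 0]
score = [0.9, 0.4, 0.6, 0.1]
w = [1, 3, 2, 2]
auc = weighted_auc(y, score, w)                    # 10 / 16 = 0.625
nb = weighted_net_benefit(y, score, w, 0.2)        # 0.5 - 0.25 * 0.25 = 0.4375
```

With equal weights the same toy data give an AUC of 0.75, which illustrates how ignoring the design can change the apparent discrimination of a model.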
5.2. Substantive Insights
While the primary contribution of this study is methodological, the empirical application provides context for interpreting CKD within cardiovascular risk stratification. Survey-weighted predictive models reproduce patterns consistent with existing clinical literature and guidelines, which continue to treat CKD and kidney measures as important components of CV risk assessment [52, 53].
However, once confounding and endogeneity are addressed, the independent contribution of CKD is substantially attenuated. This pattern is consistent with modern risk engines, such as QRISK3 [54] and SCORE2 CKD extensions [55], where kidney metrics improve predictive performance, but binary CKD indicators do not necessarily retain a strong stand-alone structural effect.
These findings should be interpreted with caution. The data and identification strategy do not support definitive causal claims, and the absence of credible exclusion restrictions further limits structural interpretation. Rather than suggesting changes to clinical guidelines, the results highlight the importance of distinguishing between predictive markers and causal drivers in population-level risk modeling.
5.3. Limitations
Although the framework is broadly applicable, several limitations must be recognized. First, causal interpretation of orthogonalized estimators still relies on strong identification assumptions—particularly approximate unconfoundedness or valid instruments—which cannot be empirically verified in observational surveys. The absence of credible exclusion-based instruments in large-scale health surveys remains a fundamental limitation for structural identification, reinforcing the importance of orthogonal, design-aware estimators under conditional exogeneity assumptions. Second, effective sample size can be severely limited for rare conditions like CKD, amplifying the variance of both predictive and causal estimators. Third, while the simulation study emulates key features of complex surveys, real-world surveys may present additional complications such as differential measurement error, multi-phase sampling, or uncontrolled response patterns. Fourth, the empirical application uses a unidirectional causal structure for conceptual clarity, though bidirectional cardio–renal interactions are clinically plausible. These limitations indicate that the empirical results should be viewed strictly as an illustrative case study validating the methodological pipeline rather than as substantive epidemiological findings.
5.4. Future Work
Several extensions follow naturally from the present framework. One avenue concerns the development of survey-specific orthogonal learners that directly incorporate replicate weights or calibration constraints. Another concerns sharper finite-sample guarantees for penalized and debiased estimators under multistage sampling, where asymptotic approximations may be less reliable. A third direction is the integration of this framework into decision-support contexts, where prediction and causality must be balanced optimally under sampling constraints, e.g., combining design-consistent model evaluation with decision curves and cost–utility targets [50, 51]. Finally, extending the theoretical analysis of design-consistent bootstrap methods for orthogonal estimators represents a promising path toward more rigorous foundations for causal inference in complex survey environments.
6. Conclusions
This study demonstrates the value of integrating survey-weighted inference, simulation analysis, and endogeneity-aware machine learning to separate predictive performance from causal interpretation in population health data. Using a nationally representative complex survey, we show that chronic kidney disease (CKD) is strongly associated with high predicted cardiovascular risk, yet its estimated independent contribution is substantially attenuated once age and major cardiometabolic comorbidities are accounted for.
From a methodological standpoint, three lessons emerge. First, under complex sampling, predictive and causal targets need not coincide: penalized, survey-weighted models can retain strong discrimination and calibration for risk stratification, while orthogonalized estimators (e.g., DML) appropriately shrink structural effects in the presence of confounding and endogeneity. Second, survey-consistent performance assessment—weighted ROC/AUC, calibration, and a design-preserving bootstrap—matters for population-level claims and prevents overstating model utility. Third, control-function and orthogonal methods provide complementary routes to causal identification; when assumptions hold, they recover causal parameters without sacrificing design awareness.
The simulation study confirms that naive predictive models may overstate causal effects under realistic confounding structures, whereas DML and control-function approaches can recover causal parameters when identification assumptions hold. Together, the empirical and simulation results underscore the importance of aligning modeling strategies with inferential objectives in complex surveys and of making explicit the trade-offs between prediction and causality.
Practically, the framework offers a scalable, design-consistent blueprint for evaluating extensions to cardiovascular risk tools (e.g., adding kidney measures) and for supporting decision-making with metrics that reflect the finite population and quantify uncertainty. Future work will (i) extend the analysis to additional survey waves and settings to assess transportability; (ii) develop survey-adapted orthogonal learners that directly incorporate replicate weights and calibration constraints; and (iii) further integrate decision-analytic tools (e.g., net benefit and cost–utility analysis) with design-aware evaluation to optimize guideline adaptations and clinical decision support in CKD-specific cardiovascular risk management.