Next Article in Journal
Existence and Maximal Regularity of Solutions for a Class of Third-Order Differential Equations with an Unbounded Coefficient
Next Article in Special Issue
The Touchard Process for Count Data with Dependent Increments
Previous Article in Journal
Quantifying Terrain Effects on Turbine Wake Recovery with Field Data and Simulation of a Real Wind Farm
Previous Article in Special Issue
The Baker Type-I Model: Theory, Comprehensive Inference, and Empirical Evidence from Complex Reliability and Biomedical Data
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches

1
Centro de Micro-Bio Innovación, Escuela de Nutrición y Dietética, Facultad de Farmacia, Universidad de Valparaíso, Valparaíso 2340000, Chile
2
Escuela de Química y Farmacia, Facultad de Farmacia, Universidad de Valparaíso, Valparaíso 2340000, Chile
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(9), 1554; https://doi.org/10.3390/math14091554
Submission received: 7 April 2026 / Revised: 28 April 2026 / Accepted: 1 May 2026 / Published: 4 May 2026
(This article belongs to the Special Issue Applied Probability and Statistics: Theory, Methods, and Applications)

Abstract

We develop a design-aware framework that combines penalized prediction and causal inference for finite populations observed through complex survey designs. The framework integrates survey-weighted pseudo-likelihoods, 1 -penalized estimation, Neyman-orthogonal moment functions, and a bootstrap procedure that resamples primary sampling units within strata. Methodologically, the contribution is an explicit pipeline that supports design-based inference while separating predictive associations from structurally adjusted effects in high-dimensional, clustered data. We illustrate the framework using data from the Chilean National Health Survey (ENS) 2016–2017 to study the relationship between chronic kidney disease (CKD) and high cardiovascular (CV) risk. In the ENS adult population, the survey-weighted prevalence of CKD was 3.1% (95% CI: 2.4–3.8), and the prevalence of high CV risk was 23.9% (95% CI: 21.5–26.3). High CV risk was markedly more frequent among individuals with CKD than among those without CKD (90.9% versus 21.5%). Predictive and associational analyses combined survey-weighted penalized logistic regression (LASSO) with refitted unpenalized models. In conventional survey-weighted logistic regressions, CKD showed a strong association with high CV risk (odds ratio = 5.66; 95% CI: 2.71–11.82; p < 0.001 ), and effect sizes remained stable after LASSO-based variable selection. To assess causal relevance under confounding and potential endogeneity, we implemented two endogeneity-aware estimators: two-stage residual inclusion (2SRI) and double/debiased machine learning (DML). The DML estimator, defined as the primary causal estimand, reports an orthogonalized estimate of the average treatment effect of CKD on the probability of high CV risk. After adjustment for age and major cardiometabolic comorbidities, the DML estimate was attenuated and statistically non-significant (average treatment effect = 0.094 ; 95% CI: [ 0.409 , 0.220 ] ). The 2SRI approach yielded unstable estimates with wide confidence intervals, consistent with the limited effective sample size of CKD cases ( n C K D 190 in a sample with n ≈ 6233) and weak identification conditions under low-prevalence settings. Simulation experiments under ENS-like complex sampling suggest that naive predictive associations may overestimate the structural contribution of CKD under confounding, whereas orthogonalized estimators yield more conservative estimates when identification holds. The causal interpretation relies on a conditional mean independence assumption given observed covariates and survey design, while control-function specifications are treated as diagnostic sensitivity analyses due to the absence of credible exclusion-based instruments. Overall, the results demonstrate a fundamental divergence between predictive relevance and causal importance in finite-population settings, underscoring the need for design-aware and endogeneity-robust methods in statistical modeling.

1. Introduction

Modern population analytics increasingly require methods that reconcile two distinct inferential goals—prediction and causal estimation—in data generated by complex sampling designs. In multistage probability samples with stratification, clustering, and unequal inclusion probabilities, valid population-level statements hinge on design-consistent estimators, survey-aware performance metrics, and resampling schemes that mimic the sampling process [1,2,3]. At the same time, high-dimensional covariate sets and multicollinearity motivate penalized estimation and stability assessments grounded in pseudo-likelihood arguments and stochastic resampling [4,5]. When the scientific interest is causal—e.g., estimating a structural effect in the presence of confounding or endogeneity—orthogonalized estimators that leverage Neyman-orthogonal moments and cross-fitting provide a principled route to robust inference even when nuisance components are learned by flexible methods [6,7].
Recent advances have extended penalized methods to complex surveys, but most contributions address variable selection and penalty tuning without delivering an integrated pipeline that (i) evaluates prediction with survey-consistent discrimination and calibration, (ii) quantifies uncertainty via design-preserving bootstrap, and (iii) corrects endogeneity when estimating structural parameters. For example, Iparragirre et al. [8] developed design-based cross-validation for LASSO in surveys, while McConville et al. [9] studied LASSO and adaptive LASSO from a model-assisted, finite-population perspective; both strengthen the predictive regularization toolkit for complex designs, yet neither fully integrates endogeneity corrections or weighted predictive metrics. Likewise, Jasiak and Tuvaandorj [10] advanced post-LASSO inference with weights mainly at the theoretical level, without embedding it into an end-to-end predictive–causal workflow.
  • Positioning relative to the existing literature.
The proposed framework extends several active lines of methodological research. Existing contributions can be broadly grouped into two complementary strands: predictive modeling under complex survey designs and causal inference under high-dimensional settings.
First, the predictive literature has focused on extending penalized estimation techniques to survey data. Prior work includes the design-based cross-validation and variable-selection developments of Iparragirre et al. [8], the model-assisted LASSO formulations of McConville et al. [9], and the theoretical contributions of Jasiak and Tuvaandorj on penalized likelihood with survey weights [10]. These approaches strengthen variable selection, stability, and predictive performance under complex sampling, but remain primarily oriented toward prediction and regularization, without explicitly addressing endogeneity or causal interpretation.
Second, the causal inference literature has emphasized robustness to confounding and high-dimensional nuisance estimation. Double/debiased machine learning (DML) [7] and recent advances in weighted orthogonal learners for causal identification [11] provide principled tools for estimating low-dimensional causal parameters using flexible machine learning components. These orthogonalization-based approaches originate in the double/debiased machine learning literature [7], with extensions to heterogeneous treatment effects [12]. However, these methods are typically developed under i.i.d. assumptions or without full integration of complex survey design features.
Recent developments in machine learning for medical diagnosis further illustrate the rapid evolution of predictive approaches. For example, reinforcement learning-based neural architectures have been proposed for cardiovascular risk detection in athletes, achieving strong classification performance in imbalanced settings [13]. Likewise, deep convolutional models such as U-Net have been widely applied to medical imaging tasks, including brain tumor detection, demonstrating the effectiveness of data-driven feature extraction [14]. In addition, recent work on cardiovascular disease detection has focused on addressing class imbalance and optimizing discrimination metrics such as the area under the ROC curve (AUC), further improving predictive accuracy [15,16]. These contributions highlight the strength of modern machine learning methods in prediction and classification tasks. However, they are primarily designed to maximize predictive performance and do not explicitly address causal identification, endogeneity, or design-based inference under complex sampling.
Recent advances in statistical modeling have extended beyond cross-sectional settings to increasingly complex data structures, including survival and longitudinal data with clustering, competing risks, and heterogeneous dependence structures. For example, Wang et al. [17] propose a unified framework that simultaneously accounts for cure fractions, competing risks, and within-cluster dependence.
Similarly, modern approaches emphasize flexible modeling of heterogeneity and multi-scale stochastic processes, as illustrated in recent developments such as Zhuang et al. [18]. These contributions highlight a broader trend toward unified frameworks capable of integrating dependence, heterogeneity, and complex data-generating mechanisms.
While the present work focuses on cross-sectional binary outcomes, these developments provide an important broader context. The proposed design-aware framework can be viewed as a complementary contribution within this landscape, targeting settings where the outcome is binary but the sampling design introduces analogous challenges related to dependence and heterogeneity.
Despite these advances, a key gap remains: there is no unified framework that simultaneously addresses (i) penalized, survey-weighted prediction, (ii) causal estimation under confounding and potential endogeneity, and (iii) design-consistent uncertainty quantification via resampling. In particular, the interaction between sampling design, regularization, and causal identification has not been systematically addressed. While recent working papers and conference contributions have explored elements of causal machine learning [19] and weighted estimation in complex survey settings [20,21], these developments typically address prediction and causal inference separately or are derived under i.i.d. assumptions.
  • Objective and contribution.
This paper proposes a unified, design-consistent framework that jointly addresses prediction and causality in complex survey data. The framework combines: (i) survey-weighted pseudo-likelihood estimation for finite-population targets, (ii) 1 -penalization (LASSO) to stabilize prediction under multicollinearity and limited effective sample size, (iii) design-consistent bootstrap (PSU-within-strata) for uncertainty and stability, and (iv) endogeneity-aware causal estimators—two-stage residual inclusion (2SRI) and double/debiased machine learning (DML)—implemented with survey weights and cross-fitting. Predictive performance is assessed using weighted AUC and Brier score, while structural effects are estimated through Neyman-orthogonal moments to mitigate regularization bias and confounding. Methodologically, the contribution is to deliver a coherent pipeline that makes explicit when predictive and causal conclusions diverge under complex sampling and how to report each with appropriate, design-aware uncertainty.
This paper makes four primary mathematical contributions within the study of penalized estimation and causal inference under complex survey designs:
1.
Design-consistency of penalized survey-weighted estimators. We formalize a LASSO-penalized pseudo-likelihood estimator and characterize its expected design-based behavior under multistage sampling. Proposition 1 characterizes the approximation of the penalized estimator to its finite-population target under appropriate regularity conditions.
2.
Design-based bootstrap approximation for uncertainty quantification. We formulate a stratified, PSU-level bootstrap scheme to approximate the sampling variability of penalized and causal estimators under complex sampling, and we evaluate its adequacy through simulation evidence.
3.
Operator-theoretic interpretation of penalized estimation. We show that the survey-weighted LASSO estimator can be expressed through a stochastic proximal mapping, providing an optimization-based perspective on penalized pseudo-likelihood estimation, without claiming that this representation directly implies asymptotic properties.
4.
Integration of orthogonal moments with survey weights. We extend double/debiased machine learning (DML) to complex surveys by incorporating sampling weights into nuisance estimation, cross-fitting at the PSU level, and orthogonal score construction. This ensures robustness of structural estimates to regularization bias and design-induced variability.
These components together form a coherent mathematical framework that connects penalized prediction, causal estimation, and design-based inference for finite populations.
  • Scope and illustration.
To illustrate the framework, we consider an application where predictive and causal goals are routinely conflated: the relationship between chronic kidney disease (CKD) and cardiovascular (CV) risk. CKD is epidemiologically linked to elevated CV risk and is often embedded in risk stratification, yet its independent structural contribution is sensitive to confounding and endogeneity; in such settings, naive predictive associations may not carry causal meaning [22,23]. We therefore use two complementary exercises. First, ENS-like simulations with a known causal structure show how naive survey-weighted penalized models can overstate structural effects under confounding, whereas 2SRI/DML can recover the target when identification holds. Second, we apply the pipeline to the Chilean National Health Survey (ENS) 2016–2017 to examine whether routinely collected biomarkers (hs-CRP, lipid fractions, hemoglobin, HbA1c, creatinine) improve the prediction of high CV risk and how endogeneity-aware estimators alter the interpretation of the CKD effect in a finite population.
  • Organization.
Section 2 introduces the mathematical and statistical foundations (pseudo-likelihood under complex sampling, penalization, design-based bootstrap, and endogeneity corrections). Section 3 details the modeling pipeline—survey-weighted estimation, LASSO regularization, weighted performance metrics, and orthogonal/control-function estimators. Section 4 reports simulation and empirical results, focusing on the divergence between predictive and causal targets and on design-consistent uncertainty. Section 5 discusses implications, limitations, and avenues for future work. Finally, Section 6 summarizes the main insights and emphasizes the generalizability of the framework to other complex health surveys and application areas.

2. Motivation and Mathematical Background

2.1. Applied Motivation

The mathematical literature has recently emphasized the need for risk prediction models that account for non-independence among biomarkers, high-dimensionality, and survey sampling variability [24]. Penalized estimators such as least absolute shrinkage and selection operator (LASSO) have demonstrated utility for feature selection under these constraints, particularly when the effective sample size is reduced due to complex survey designs [25].
The physiological mechanisms underlying cardiovascular disease in CKD have been extensively described [26,27]. Inflammatory markers such as hs-CRP have shown predictive value for cardiovascular outcomes [28], although their interpretation in CKD remains challenging due to the presence of uremic inflammation [29]. Oxidative stress is also implicated, with oxidized LDL linked to accelerated atherosclerosis [30]. Biomarkers such as cystatin C have proven especially informative, outperforming creatinine as predictors of cardiovascular and renal outcomes [31,32]. Beyond predictive performance, a central mathematical challenge is potential endogeneity between CKD status and cardiovascular risk—where CKD may influence cardiovascular outcomes, but cardiovascular events may also exacerbate CKD [33,34,35]. In econometric terms, endogeneity arises when regressors correlate with the error term, leading to biased estimates. To address this, we incorporate statistical techniques tailored for nonlinear survey-weighted models:
  • Control-function (Two-Stage Residual Inclusion, 2SRI): For logistic regression with endogenous variables, we include residuals from a first-stage survey-weighted model as covariates in the second-stage outcome model. This method is consistent under nonlinearity and extends to complex designs [36,37].
  • Double/Debiased Machine Learning (DML): We employ orthogonal estimators via Neyman-orthogonal moment conditions with sample splitting and penalized learners. This approach mitigates both regularization bias and endogeneity, and it can be adapted for survey designs to yield valid inference on the causal CKD parameter [7,38].
These methods preserve the weighted pseudo-likelihood framework and allow bootstrap-based uncertainty quantification, ensuring robust and interpretable inference on CKD effects under complex sampling.
Given the abundance of routine biomarkers collected in ENS 2016–2017 and the lack of CKD-specific predictive models in Chile, this study offers both mathematical and applied contributions.

2.2. A Stochastic Framework for Penalized Estimation Under Complex Sampling

Let { ( Y i , X i , w i ) } i = 1 n denote observations drawn from a finite population U = { 1 , , N } under a multistage, stratified sampling design. Each sampled unit carries a survey weight w i = 1 / π i , where π i is the inclusion probability. Only the sample is observed; the full population remains unobserved.
The survey-weighted pseudo-likelihood for parameter vector β is defined as
L ( β ) = i = 1 n p β ( Y i X i ) w i ,
where p β denotes the logistic probability kernel. Taking logarithms yields the weighted pseudo-log-likelihood:
( β ) = i = 1 n w i Y i log p β ( X i ) + ( 1 Y i ) log ( 1 p β ( X i ) ) .
Under standard regularity conditions, expression (2) provides a design-consistent approximation to the finite-population log-likelihood.
To incorporate regularization and enable stochastic variable selection, we consider the LASSO-penalized objective
Q λ ( β ) = ( β ) λ β 1 , λ 0 ,
whose maximizer satisfies
β ^ λ = arg max β R p Q λ ( β ) .
This optimization defines a stochastic mapping
β ^ λ = T λ { ( Y i , X i , w i ) } i = 1 n ,
whose randomness reflects both the sampling design and the underlying finite population.
To characterize the sampling distribution of β ^ λ , we use a design-consistent bootstrap that resamples clusters and strata according to the ENS 2016–2017 design. This produces bootstrap replicates
β ^ λ ( b ) , b = 1 , , B ,
whose empirical distribution is intended to approximate that of β ^ λ under the complex sampling design.
This stochastic framework offers two key advantages:
1.
It provides uncertainty quantification for penalized estimators under complex survey designs, where classical asymptotics may be unreliable.
2.
It yields a finite-population interpretation of penalized pseudo-likelihood estimation, supporting inference on population-level cardiovascular risk mechanisms.
Thus, the proposed approach integrates survey theory, penalized estimation, and stochastic resampling into a unified mathematical framework.

2.3. Proposition 1: Design-Based Approximation for Penalized Estimation

We summarize here a set of working conditions under which the penalized survey-weighted estimator is expected to behave regularly under multistage complex sampling. The goal is not to provide a complete theorem-proof development, but rather to clarify the assumptions underlying the estimation and resampling strategy.
  • Working conditions.
Let β ^ λ denote the LASSO-penalized pseudo-likelihood estimator. We consider the following:
(B1)
The sampling design is stratified and multistage, with strictly positive inclusion probabilities and PSU-level resampling within strata.
(B2)
Weighted score contributions satisfy boundedness and moment conditions that ensure stable weighted averages.
(B3)
The finite-population objective admits a well-defined minimizer β * with locally regular curvature.
(B4)
The penalty sequence and dimensionality satisfy standard regularization conditions ensuring stability.
Under these conditions, the estimator is expected to concentrate around a finite-population target:
β ^ λ P β * ,
This statement is interpreted as a design-based consistency heuristic grounded in penalized M-estimation arguments, rather than as a formally established theorem in this work.
  • Bootstrap approximation under complex sampling (heuristic statement).
To characterize the sampling variability of β ^ λ , we employ a design-consistent bootstrap that resamples primary sampling units (PSUs) within strata.
For bootstrap replicates β ^ λ ( b ) , b = 1 , , B , consider:
sup t R p | Pr * n ( β ^ λ ( b ) β ^ λ ) t Pr n ( β ^ λ β * ) t | 0 ,
where Pr * denotes the bootstrap distribution. This approximation reflects the expectation that PSU-level resampling preserves the dependence structure induced by the complex survey design.
  • Justification.
While a full formal proof of bootstrap consistency for penalized estimators under complex sampling is beyond the scope of this work, the proposed procedure is grounded in established design-based resampling principles. In particular, resampling PSUs within strata reproduces first-order sampling variability under multistage designs.
Empirical support for this approximation is provided in Section 4, where bootstrap-based uncertainty closely matches the variability observed under repeated sampling in simulation experiments.
We therefore interpret Equation (8) as a design-based approximation supported by both theoretical and simulation evidence [2,3,39].

Stochastic Stability Paths

To assess the robustness of penalized estimators under finite-population and design-induced randomness, consider a grid of tuning parameters Λ = { λ 1 , , λ K } . For each λ k and bootstrap replicate b, compute
β ^ ( b ) ( λ k ) ,
and define the selection probability for predictor j as
π ^ j ( λ k ) = 1 B b = 1 B 1 β ^ j ( b ) ( λ k ) 0 .
A predictor is considered stochastically stable if
π ^ j ( λ k ) 1 across a wide range of λ k .
These stability paths generalize classical LASSO solution paths by incorporating sampling variability and design-based uncertainty. Figure 1 summarizes the workflow from finite population to risk prediction, highlighting how bootstrap replicates feed into stability path estimation.

2.4. Operator-Theoretic View

Define the convex function
Φ ( β ) = ( β ) + λ β 1 .
Then β ^ λ satisfies the stochastic proximal equation:
β ^ λ = prox λ · 1 β ^ λ + ( β ^ λ ) ,
demonstrating that T λ (Equation 5) is a stochastic proximal operator acting on survey data.
Figure 1. Workflow for design-aware penalized modeling: from finite population to risk prediction via survey-weighted LASSO, bootstrap replicates, and stability paths.
Figure 1. Workflow for design-aware penalized modeling: from finite population to risk prediction via survey-weighted LASSO, bootstrap replicates, and stability paths.
Mathematics 14 01554 g001
The methodological contribution of this work lies in integrating penalized pseudo-likelihood estimation, complex-survey inference, and design-based bootstrap into a coherent framework for predictive and causal modeling in finite populations. The proximal operator representation provides an optimization-based interpretation of the penalized estimator, clarifying its variational and computational structure. Its role is primarily interpretive, and we do not rely on this formulation to establish the asymptotic properties discussed in this paper.

2.5. Addressing Endogeneity in Survey-Weighted Nonlinear Models

Endogeneity arises when one or more regressors are correlated with the structural error term, violating the exogeneity assumption and producing biased and inconsistent estimators in nonlinear models. In our setting, CKD may be endogenous due to reverse causality (cardiovascular events affecting kidney function), omitted confounders, or measurement error. We develop two complementary strategies tailored to complex survey designs: a control-function (two-stage residual inclusion, 2SRI) approach for nonlinear models and a double/debiased machine learning (DML) approach based on Neyman-orthogonal moments.
  • Notation and setup.
Let { ( Y i , X i , D i , Z i , w i ) } i = 1 n denote observations drawn from a stratified multistage complex survey, where w i = 1 / π i are sampling weights. The binary outcome Y i { 0 , 1 } indicates high cardiovascular risk; X i R p are exogenous covariates; D i CKD i is a potentially endogenous regressor; and Z i R q are instruments satisfying relevance and exclusion restrictions described below. The survey-weighted pseudo-log-likelihood for a logistic outcome is
( β , γ ) = i = 1 n w i Y i log p i + ( 1 Y i ) log ( 1 p i ) , p i = exp X i β + γ D i 1 + exp X i β + γ D i .

2.5.1. Control-Function Approach (Two-Stage Residual Inclusion, 2SRI)

The control-function method corrects endogeneity in nonlinear models by augmenting the structural equation with first-stage residuals [36,37]. We adapt 2SRI to complex survey designs via weighted estimation in both stages.
  • Stage 1 (endogenous regressor model).
Model the endogenous regressor D i using instruments Z i and exogenous covariates X i :
D i = h ( Z i , X i ; θ ) + ε i ,
where h ( · ) can be a linear specification (survey-weighted least squares), a generalized linear model (survey-weighted GLM), or a flexible penalized learner. Estimate θ ^ by minimizing a survey-weighted loss (e.g., pseudo-likelihood) and compute residuals
ε ^ i = D i h ( Z i , X i ; θ ^ ) .
  • Practical selection of the first-stage specification.
In practice, the choice of the first-stage function h ( Z i , X i ) is data-driven and depends on the predictive structure of the endogenous regressor. In this study, we consider linear, penalized (LASSO), and flexible machine learning specifications as alternative candidates.
We recommend selecting h ( · ) based on the following criteria:
(i)
Predictive performance. The specification should provide adequate out-of-sample prediction of the endogenous variable, as assessed by cross-validation or hold-out evaluation.
(ii)
Stability. The fitted values h ^ ( Z i , X i ) should be stable across resamples or folds, avoiding excessive sensitivity to small changes in the data.
(iii)
Robustness of second-stage estimates. The estimated structural coefficient in the second stage should not vary substantially across reasonable alternative specifications of h ( · ) .
Given that the exclusion restriction is unlikely to hold in the present application, the first-stage model is not interpreted as a structural instrument equation. Instead, it is treated as a predictive nuisance component within a control-function sensitivity analysis. Consequently, model selection is guided by predictive adequacy and robustness considerations rather than by structural identification criteria.
  • Stage 2 (residual inclusion in the logistic outcome).
Augment the outcome model with the first-stage residuals:
p i = Pr ( Y i = 1 X i , D i , ε ^ i ) = exp X i β + γ D i + δ ε ^ i 1 + exp X i β + γ D i + δ ε ^ i .
Estimate ( β , γ , δ ) by maximizing the survey-weighted pseudo-log-likelihood in (14) with D i and ε ^ i included. Under suitable conditions, δ 0 captures the endogeneity correction, and γ is consistently estimated despite nonlinearity [36,37].
  • Identification assumptions.
The standard 2SRI framework relies on (i) relevance, (ii) exclusion, and (iii) independence of instruments. In this study, we evaluate the relevance condition empirically. However, the exclusion restriction is unlikely to hold given the clinical role of the candidate variables. Therefore, these conditions are not interpreted as satisfied for causal identification purposes, and the resulting estimates are treated as sensitivity analyses rather than structurally identified effects.
Practical limitations of identification. In the empirical application based on ENS data, the exclusion restriction is unlikely to hold. Variables such as age, sex, hypertension, and diabetes are known to directly affect cardiovascular risk beyond their association with CKD. As a result, these variables cannot be interpreted as valid instruments satisfying the exclusion condition required for causal identification.
Accordingly, the 2SRI specification should not be interpreted as providing structurally identified causal effects. Instead, it is used as a control-function-based sensitivity analysis to assess how the estimated CKD effect responds to alternative modeling assumptions that account for potential endogeneity.
Nuisance estimation under complex sampling introduces additional challenges relative to standard i.i.d. settings. In particular, the effective sample size is reduced by clustering and unequal weights, which may affect the stability of flexible learners used for nuisance estimation. To mitigate this issue, we employ cross-fitting at the PSU level and restrict model complexity to ensure stable estimation.
Nevertheless, we emphasize that the DML estimator remains sensitive to misspecification or instability in nuisance components, and therefore should be interpreted as providing robustness-oriented evidence rather than definitive causal identification.
  • Regularization and multicollinearity.
When X i or ( X i , Z i ) are high-dimensional or collinear, we optionally include an 1 penalty in stage 1 and/or stage 2:
Q λ ( β , γ , δ ) = ( β , γ , δ ) λ β 1 + | γ | + | δ | ,
where ( · ) is the weighted pseudo-log-likelihood incorporating ε ^ i as in (17).
  • Design-consistent bootstrap for 2SRI.
To quantify uncertainty under complex sampling, we employ a design-consistent bootstrap that resamples primary sampling units (PSUs) within strata. For each bootstrap replicate b = 1 , , B :
1.
Draw PSUs within each stratum to form a bootstrap sample and recompute weights w i ( b ) .
2.
Re-estimate stage 1 to obtain θ ^ ( b ) and residuals ε ^ i ( b ) .
3.
Re-estimate stage 2 to obtain ( β ^ ( b ) , γ ^ ( b ) , δ ^ ( b ) ) .
The empirical distribution of { γ ^ ( b ) } b = 1 B approximates the sampling distribution of the CKD effect under the survey design.

2.5.2. Double/Debiased Machine Learning (DML) with Survey Weights

DML constructs estimators based on Neyman-orthogonal moments that are insensitive to small estimation errors in high-dimensional nuisance functions [7,38]. We adapt DML to complex surveys by employing survey weights and design-preserving sample splitting.
  • Partially linear score.
Let W i = ( Y i , D i , X i ) and consider nuisance functions
m ( X i ) E [ Y i X i ] , g ( X i ) E [ D i X i ] ,
estimated via penalized learners (e.g., LASSO, elastic net, or tree-based methods) using survey weights. Define the orthogonal score
ψ ( W i ; θ , η ) = Y i m ( X i ) D i g ( X i ) ,
where η collects nuisance parameters. The moment condition E [ ψ ( W i ; θ , η ) ] = 0 identifies the causal parameter θ (here, the average structural effect of D i on Y i ). Orthogonality implies
η E [ ψ ( W i ; θ , η ) ] | η = η 0 = 0 ,
which mitigates bias from regularization and nuisance estimation.
  • Interpretation of the estimand.
The parameter identified by the orthogonal moment condition corresponds to a partially linear projection parameter of the conditional mean function E [ Y i D i , X i ] . Importantly, when the outcome is binary, this estimand should be interpreted on the probability scale as an average partial effect, rather than as a structural log-odds coefficient from a logistic model. The use of the score in Equation 19 does not rely on Gaussian error assumptions. Instead, it defines a valid orthogonal moment condition for the conditional expectation function, allowing consistent estimation of a low-dimensional parameter even when the outcome is discrete.
The DML framework builds on the orthogonalization approach introduced by [7], which enables root-n consistent estimation of low-dimensional parameters in the presence of high-dimensional nuisance components estimated via machine learning. Extensions of this framework further allow for heterogeneous treatment effect estimation under similar orthogonality principles [12]. However, these results are primarily derived under i.i.d. sampling assumptions. In complex survey settings, additional challenges arise due to clustering, unequal weights, and reduced effective sample size, which may affect both nuisance estimation and asymptotic approximations. Our implementation adapts these ideas by incorporating survey weights and PSU-level cross-fitting to preserve design consistency.
  • Role of nuisance functions and interpretation.
The validity of the DML estimator critically depends on the quality of nuisance function estimation, namely m ( X i ) = E [ Y i X i ] and g ( X i ) = E [ D i X i ] . In this study, these components are estimated using survey-weighted learners combined with cross-fitting at the PSU level, which reduces overfitting and preserves the dependence structure induced by the sampling design.
The resulting estimator should be interpreted as a partially linear projection parameter, representing the best linear approximation to the conditional mean of the outcome with respect to the residualized treatment. Consequently, even under correct specification of the nuisance components, the estimand does not recover a fully nonparametric structural effect, but rather a design-aware average effect under the maintained assumptions.
Importantly, the causal interpretation relies on the joint validity of: (i) conditional mean independence, (ii) sufficient overlap in the weighted sample, and (iii) stable estimation of nuisance functions under the effective sample size induced by the survey design. These conditions may be only approximately satisfied in practice, and results should therefore be interpreted with appropriate caution.
The use of flexible machine learning methods for nuisance estimation follows a broader literature on causal inference with data-adaptive models, where predictive accuracy is leveraged to improve adjustment for confounding [40]. However, as emphasized in that literature, valid causal interpretation requires that such models adequately capture the relevant confounding structure, which may be challenging in finite samples and complex survey settings.
  • Survey-weighted cross-fitting.
Partition the sample into K folds at the PSU level to respect the design. For each fold k:
1.
Train m ^ ( k ) ( X ) and g ^ ( k ) ( X ) on all but fold k using survey weights.
2.
Compute residuals on fold k:
Y ˜ i = Y i m ^ ( k ) ( X i ) , D ˜ i = D i g ^ ( k ) ( X i ) .
3.
Estimate θ ^ ( k ) from the weighted moment:
i fold k w i Y ˜ i D ˜ i = 0 ( or via a weighted regression of Y ˜ i on D ˜ i ) .
Aggregate θ ^ = 1 K k = 1 K θ ^ ( k ) . Under regularity conditions and appropriate weighting, θ ^ is normal and admits valid design-based inference under regularity conditions [7,38].
  • Variance estimation under clustered complex sampling.
Let ψ i = ψ ( W i ; θ ^ , η ^ ) denote the orthogonal score evaluated using cross-fitted nuisance estimates. Under complex sampling, observations are not independent, and variance must be computed at the cluster (PSU) level. Define the cluster-level score contributions as
Ψ h = i PSU h w i ψ i ,
where h = 1 , , H indexes primary sampling units. A cluster-robust, design-based variance estimator for θ ^ is given by
Var ^ ( θ ^ ) = 1 n 2 h = 1 H Ψ h Ψ ¯ 2 ,
where Ψ ¯ = 1 H h = 1 H Ψ h .
This estimator extends classical sandwich-type variance estimators to the orthogonal moment setting under clustering and unequal weights.
In practice, our primary inference relies on a design-consistent bootstrap that resamples PSUs within strata and recomputes the full cross-fitting procedure in each replicate. This approach captures variability due to clustering, sample splitting, and nuisance estimation, and is used to construct confidence intervals for θ ^ throughout the empirical analysis.
  • Extension to logistic models.
For a logistic outcome, orthogonal scores can be constructed by replacing Y i with the working residual under a weighted logistic link and defining m ( X i ) as the survey-weighted conditional mean. Alternatively, use an orthogonalized regression of the logit-transformed probabilities on residualized D i with cross-fitting, maintaining the survey weights throughout.
  • Bootstrap inference under complex sampling.
Analogous to 2SRI, apply the design-consistent bootstrap:
1.
Resample PSUs within strata to form bootstrap samples and weights w i ( b ) .
2.
Refit nuisance learners m ^ ( b ) ( · ) , g ^ ( b ) ( · ) with weights w i ( b ) .
3.
Recompute cross-fitted scores and obtain θ ^ ( b ) .
Percentile or studentized intervals from { θ ^ ( b ) } b = 1 B yield uncertainty quantification consistent with the sampling design.
  • Choosing between 2SRI and DML.
2SRI is attractive when valid instruments are available, and the endogenous regressor admits a well-specified first-stage model. DML is preferable when high-dimensional covariates and flexible nuisance functions are needed, leveraging orthogonality and cross-fitting to control regularization bias. Both strategies are compatible with our survey-weighted pseudo-likelihood framework, 1 -penalization for variable selection, and design-consistent bootstrap to quantify uncertainty and stability.

3. Methodology

Our methodological framework combines design-based (survey-weighted) estimation, penalized regularization, stochastic resampling, and econometric extensions to address potential endogeneity and causal directionality between chronic kidney disease (CKD) and high cardiovascular (CV) risk. We describe each component below and then outline the simulation and empirical studies used for evaluation.

3.1. Survey-Weighted Logistic Regression

Let Y i { 0 , 1 } indicate high cardiovascular risk (10-year predicted risk 20 % ), and let X i R p denote a vector of demographic, behavioral, and biochemical predictors for individual i. Under the ENS multistage sampling design, each observation carries a sampling weight w i and is associated with a stratum and a PSU identifier.
We consider the logistic specification
p i = Pr ( Y i = 1 X i , β ) = exp ( X i β ) 1 + exp ( X i β ) ,
and estimate β by maximizing the survey-weighted pseudo-log-likelihood
( β ) = i = 1 n w i Y i log p i + ( 1 Y i ) log ( 1 p i ) .
Maximizing (21) yields design-consistent estimation under standard regularity conditions for complex surveys, provided that the sampling structure is correctly specified in the analysis.

3.2. Penalized Estimation via LASSO

Given the substantial collinearity among biomarkers and cardiometabolic indicators, together with the limited effective sample size induced by complex sampling, we use the Least Absolute Shrinkage and Selection Operator (LASSO) [41] to regularize estimation and perform variable selection.
LASSO augments the pseudo-log-likelihood with an 1 penalty,
Q λ ( β ) = ( β ) λ β 1 , λ 0 ,
and defines the estimator
β ^ λ = arg max β R p Q λ ( β ) .
The 1 penalty shrinks small coefficients exactly to zero, producing sparse and interpretable models that mitigate multicollinearity and reduce overfitting. This is especially relevant in CKD-related biomarker analysis, where lipid fractions, inflammatory markers, and glycemic indices tend to be highly correlated [42].
In the empirical application, LASSO is used as a screening step. Final effect sizes and inference are obtained from unpenalized survey-weighted logistic regressions (svyglm) refitted on the covariate sets selected by LASSO.

3.3. Addressing Endogeneity and Causal Directionality

A key methodological challenge is the potential endogeneity of CKD status, which may correlate with unobserved factors or exhibit reverse causality with cardiovascular outcomes. In econometric terms, endogeneity arises when regressors correlate with the error term, leading to biased estimates. To address this concern in survey-weighted nonlinear settings, we consider two complementary strategies.
  • Control-function (two-stage residual inclusion, 2SRI)
We first estimate a survey-weighted first-stage model for the endogenous regressor D i (CKD) using instruments Z i and covariates X i :
D i = h ( Z i , X i ; θ ) + ε i .
Residuals ε ^ i from the first stage are then included in the second-stage logistic regression:
Pr ( Y i = 1 X i , D i , ε ^ i ) = exp ( X i β + γ D i + δ ε ^ i ) 1 + exp ( X i β + γ D i + δ ε ^ i ) .
This method is consistent under nonlinearity and is widely used in applied health econometrics [36,37]. In our setting, it provides a transparent way to assess the sensitivity of CKD effects to endogeneity corrections.
  • Double/debiased machine learning (DML).
To mitigate regularization bias and endogeneity in high-dimensional settings, we also consider Neyman-orthogonal moment conditions combined with sample splitting and cross-fitting:
ψ ( W i ; θ , η ) = Y i m ( X i ) D i g ( X i ) ,
where m ( · ) and g ( · ) are estimated using survey-weighted penalized learners. Orthogonality provides robustness to nuisance estimation error, and cross-fitting yields valid inference under weak conditions [7,38]. This framework supports a causal interpretation of the CKD effect under appropriate identification assumptions, where the estimand corresponds to a probability-scale average partial effect derived from a partially linear projection, rather than a structural parameter of a fully specified nonlinear model. This specification follows the standard partially linear DML framework, where the target parameter is defined via orthogonal moments and does not require correct specification of the full outcome distribution.
Both approaches preserve the survey-weighted pseudo-likelihood logic and are combined with design-consistent uncertainty quantification.
  • Identification considerations.
The DML estimator relies on several key identification conditions adapted to the complex-survey setting: (i) sufficient overlap between treatment groups after weighting; (ii) relevance of covariates for predicting both the outcome and the potentially endogenous regressor; (iii) conditional exogeneity or valid instruments when applicable; and (iv) sampling ignorability under the survey design. Practical limitations arise when treatment prevalence is low—as in the case of CKD—leading to reduced effective sample size and potentially unstable nuisance estimation. To mitigate these challenges, cross-fitting is performed at the PSU level to respect the sampling structure while reducing overfitting and regularization bias in the estimation of nuisance components. From an identification standpoint, the absence of credible exclusion-based instruments is a fundamental limitation of large-scale health surveys. This constraint reinforces the need for approaches such as DML, which rely on conditional independence assumptions rather than instrumental variables, and highlights the role of 2SRI as a diagnostic rather than a primary identification strategy in this context.
The causal interpretation of the CKD effect in this study relies on distinct identification strategies that must be interpreted carefully. First, the double/debiased machine learning (DML) estimator is defined as the primary causal estimand under a conditional mean independence assumption, namely:
E [ Y i ( d ) X i ] = E [ Y i D i = d , X i ] ,
which requires that, conditional on observed covariates X i and survey weights, treatment assignment (CKD status) is as good as random. Under this assumption, the DML estimator recovers the average treatment effect in a partially linear projection framework.
Importantly, the DML estimand should be interpreted as the best linear approximation to the conditional mean function, rather than as a fully nonparametric structural parameter.
Second, although we implement a Two-Stage Residual Inclusion (2SRI) approach, we do not claim the availability of valid exclusion-based instruments in the ENS dataset. Variables such as age, sex, hypertension, and diabetes are likely to violate the exclusion restriction, as they may directly affect cardiovascular risk beyond their association with CKD.
Accordingly, 2SRI is used strictly as a diagnostic sensitivity analysis to assess the stability of CKD effect estimates under alternative endogeneity corrections, rather than as a source of causal identification.
This distinction ensures that causal claims are based exclusively on the assumptions underlying DML, while 2SRI results are interpreted as robustness checks.

3.4. Stochastic Resampling and Variability Assessment

To quantify the sampling distribution of β ^ λ and of endogeneity-corrected estimators, we use a design-consistent bootstrap that resamples primary sampling units within strata while preserving the complex survey structure. For bootstrap replicates b = 1 , , B , we compute
β ^ λ ( b ) ,
and, when applicable, ( γ ^ ( b ) , δ ^ ( b ) ) for 2SRI or θ ^ ( b ) for DML. These replicates support inference on the coefficient variability, estimator stability, and predictive performance.

Design-Consistent Bootstrap and PSU-Level Cross-Fitting

Uncertainty was quantified using a design-consistent bootstrap that resamples primary sampling units (PSUs) with replacement within strata, preserving the multistage ENS sampling design. For each bootstrap replicate b = 1 , , B (with B = 150 ), PSUs were resampled within strata, and survey weights were recomputed accordingly. All analytical steps—including LASSO screening, refitted survey-weighted models, two-stage residual inclusion (2SRI), and double/debiased machine learning (DML) nuisance estimation—were re-estimated within each bootstrap replicate.
For DML, cross-fitting was performed using K = 5 folds defined at the PSU level, such that all observations within a given PSU were assigned to the same fold. This design-aware sample splitting prevents information leakage across clusters and preserves validity under complex sampling.

3.5. Predictive Performance

3.5.1. Weighted ROC Curves

Discriminatory ability is assessed using the survey-weighted area under the ROC curve (AUC). A design-based estimator is
AUC ^ = i = 1 n j = 1 n w i w j 1 ( Y i = 1 , Y j = 0 ) 1 ( p ^ i > p ^ j ) ,
where p ^ i = p i ( β ^ λ ) denotes fitted probabilities. This estimator approximates the probability that a randomly drawn case (weighted by the survey design) receives a higher predicted risk than a randomly drawn control.

3.5.2. Calibration and Reclassification

Calibration is evaluated using survey-weighted Brier scores and calibration curves based on weighted risk deciles. Incremental predictive value is assessed using survey-adapted reclassification metrics, including the net reclassification improvement (NRI) and the integrated discrimination improvement (IDI), computed under the complex sampling design.

3.6. Software Implementation

All analyses were conducted in R (version 4.5.1). Core functionality for complex survey inference and penalized regression relied on survey and glmnet. Several components required custom development to ensure design-consistent inference and integration of causal estimators:
  • Survey-weighted metrics: functions for design-based AUC, Brier score, and decile-based calibration diagnostics preserving PSU-level structure.
  • Endogeneity correction: implementation of two-stage residual inclusion (2SRI) for nonlinear models under survey weights.
  • Double/debiased machine learning: cross-fitting routines for orthogonal score estimation of θ DML under complex sampling.
  • Design-based bootstrap: PSU-level resampling within strata for uncertainty quantification of predictive and causal parameters.
  • High-dimensional penalization: integration of survey weights into LASSO fitting, λ -selection, and stability assessment across penalty grids.
  • Calibration and sensitivity: weighted calibration plots and raking-based sensitivity analyses under shifts in CKD prevalence.
All routines, including simulation scripts for population generation, complex sampling, endogeneity mechanisms, bootstrap procedures, and sensitivity analyses, are provided in the companion scripts to support full reproducibility.

3.7. Simulation and Empirical Study

We present two complementary analyses: (i) a simulation study emulating ENS-like design features and CKD prevalence, and (ii) an empirical application using ENS 2016–2017 data. Both analyses evaluate predictive performance, calibration, and robustness to endogeneity corrections using control-function (2SRI) and double/debiased machine learning (DML) approaches.

3.8. Simulation Study

  • Synthetic finite population and complex-survey design.
We generated a synthetic finite population of N = 120 , 000 units to emulate key structural features of national health surveys, including stratification ( H = 6 strata), primary sampling units (PSUs) within strata, and survey weights approximating inverse inclusion probabilities with mild calibration noise. Each unit was assigned demographic and clinical covariates (age, sex, smoking, diabetes, treated hypertension, systolic blood pressure) and a high-dimensional biomarker panel ( p = 50 ) with AR(1) correlation structure ( ρ = 0.5 ).
  • Outcome and endogeneity mechanism.
The binary outcome Y i (high CV risk) was generated via a logistic model combining classical risk factors, biomarker effects, and an endogeneity component correlated with CKD status. CKD was generated as a function of covariates, biomarkers, and instruments (two proxies with strong relevance), inducing correlation between CKD and the outcome error term. This setup enables evaluation of naive bias and performance of endogeneity- corrected estimators under complex sampling.
  • Sampling and weighting.
From the finite population, we drew a stratified multistage sample by selecting PSUs within strata and then individuals within PSUs, yielding survey weights, strata, and cluster identifiers. All estimators incorporate these design elements.
  • Modeling framework.
We fit survey-weighted penalized logistic regressions using LASSO for variable selection, maximizing the pseudo-log-likelihood with an 1 penalty. Tuning was selected via weighted cross-validation. Endogeneity was addressed using 2SRI and DML with cross-fitting ( K = 5 folds).
  • Uncertainty and generalization.
Design-based uncertainty was quantified via a PSU-level bootstrap within strata ( B = 150 replicates), providing confidence intervals for AUC, Brier score, and θ ^ DML . Sensitivity analyses were conducted by raking weights to target CKD prevalence levels of 8%, 12%, and 20%.
  • Performance metrics.
Predictive performance was evaluated using survey-weighted AUC and Brier score. Causal inference focused on θ ^ DML , reported on a probability scale. Bootstrap percentiles were used to construct confidence intervals.

3.9. Empirical Study

We use ENS 2016–2017, a nationally representative health survey implemented using multistage, stratified cluster sampling. All analyses incorporate sampling weights, strata, and PSUs to ensure design-consistent inference.
  • CKD and cardiovascular risk definitions.
CKD is defined based on kidney function criteria using estimated glomerular filtration rate (eGFR) and albuminuria, following KDIGO-related standards [43]. Cardiovascular risk is estimated using a Framingham-type score widely applied in Chilean epidemiology [22].
  • Biochemical predictors and candidate covariates.
Candidate predictors include demographic characteristics, behavioral indicators, cardiometabolic conditions, and laboratory measurements commonly linked to cardiovascular risk, including age, sex, hypertension, diabetes, body mass index, metabolic syndrome, glucose-related markers, lipid profile, kidney function indicators, inflammatory markers (e.g., hs-CRP), and regional indicators [27,44].
To reduce overfitting and address multicollinearity, we apply LASSO as a screening step. Covariates with excessive missingness (greater than 20%) were excluded prior to LASSO estimation. The penalty λ was selected via five-fold cross-validation using deviance, retaining both λ min and λ 1 se solutions. Final inference is based on unpenalized survey-weighted logistic regressions refit on selected covariate sets.
  • Endogeneity assessment in ENS data.
To evaluate sensitivity to potential endogeneity between CKD and cardiovascular risk, we consider 2SRI and DML extensions within the survey-weighted framework. These analyses complement the primary predictive and associative results by assessing the robustness of CKD effect estimates under alternative identification strategies.
  • Modeling strategy.
We compare:
1.
Base model (Framingham-like): classical risk factors (age, sex, SBP, smoking, diabetes, and treated hypertension).
2.
CKD-augmented model: adds CKD-related biomarkers and indicators (e.g., hs-CRP, hemoglobin, HbA1c, eGFR/uACR, where available, and a CKD indicator).
For each specification, we report discrimination (AUC), calibration diagnostics, and reclassification metrics where applicable, and we evaluate how CKD effect estimates change under endogeneity corrections.
  • Variance estimation and lonely-PSU adjustment.
Variance estimation accounts for the complex multistage ENS design using Taylor linearization. Strata containing a single PSU after data cleaning are handled using the standard “lonely PSU” adjustment in the survey package (survey.lonely.psu = "adjust"), applying a conservative correction without excluding strata.
  • Reproducibility.
To ensure full reproducibility, all data and code are provided in a public repository [45].
  • Estimation scales and reporting strategy.
Survey-weighted logistic models (naive and LASSO-refit) and the two-stage residual inclusion (2SRI) approach report effects on the odds ratio (OR) scale and are interpreted as associational or diagnostic. In contrast, double/debiased machine learning (DML) is defined as the primary causal estimator and reports the average treatment effect of CKD on the probability of high cardiovascular risk (probability difference, pp). Uncertainty is quantified using a design-consistent PSU-level bootstrap throughout. Table 1 summarizes these distinctions.

4. Results

4.1. Results of Simulation Study

  • Overview.
Under a generalizable survey-like simulation with stratification ( H = 6 ), PSUs within strata, high-dimensional biomarkers ( p = 50 ) with AR(1) correlation ( ρ = 0.5 ), and endogeneity between CKD and the outcome error (two relevant instruments), we evaluated predictive and causal performance using survey-weighted LASSO (naive), a DML-constrained outcome model (svyglm with offset), and 2SRI residual inclusion. Design-based uncertainty was quantified via PSU-level bootstrap ( B = 150 ), and sensitivity to CKD prevalence was assessed via raking to 8%, 12%, and 20%.
  • Predictive performance (design-based).
Table 2 reports survey-weighted AUC and Brier for the three specifications. The naive LASSO achieves moderate discrimination (AUC 0.682 ) and reasonable calibration (Brier 0.217 ), with narrow design-based intervals. The DML-offset svyglm maintains causal consistency for the CKD component and slightly improves discrimination and calibration. The endogeneity-corrected 2SRI shows the highest discrimination (AUC 0.705 ) and lowest Brier (≈0.210).
  • Causal effect (DML) and uncertainty.
Table 3 summarizes the DML-estimated causal effect of CKD on high cardiovascular risk. The parameter is reported in percentage points on a residualized probability scale (not odds ratios). Bootstrap percentiles provide design-based uncertainty.
  • Prevalence sensitivity (raking).
Table 4 shows the stability of predictive metrics and the DML parameter when reweighting the sample to target CKD prevalence levels of 8%, 12%, and 20%. Changes are minimal, supporting generalizability under aggregate prevalence shifts.
  • Survey-weighted ROC curves.
Figure 2 compares ROC curves for the naive LASSO, DML-offset svyglm, and 2SRI. Consistent with Table 2, 2SRI attains the highest discrimination, while the DML-offset maintains competitive performance with causal consistency for the CKD component.
  • Survey-weighted calibration.
Figure 3 displays weighted decile calibration plots for naive LASSO and 2SRI. The 2SRI specification yields slightly better global calibration (lower Brier), with points closer to the identity line. Point sizes reflect the sum of survey weights per decile.
  • Bootstrap distribution of the DML parameter.
Figure 4 summarizes the bootstrap distribution of θ ^ DML under PSU resampling within strata. The distribution is approximately unimodal, with the percentile interval not including zero, supporting a positive CKD effect on residualized probability of high cardiovascular risk.
  • Interpretation.
Across high-dimensional, endogeneity-aware specifications under complex sampling, the naive LASSO provides a practical baseline for risk stratification, while DML-offset preserves causal consistency for the CKD component with competitive predictive metrics. The instability and wide confidence intervals observed in the 2SRI estimates should be interpreted in light of both the weak effective sample size and the lack of valid exclusion restrictions. Consequently, these results are not taken as causal estimates, but rather as indicative of the sensitivity of the CKD effect to alternative modeling assumptions that attempt to account for endogeneity. Design-based bootstrap confirms estimator stability, and raking analyses indicate minimal sensitivity to CKD prevalence shifts, supporting generalizability of the framework.

4.2. Empirical Results: ENS 2016–2017

Using data from the Chilean National Health Survey (ENS) 2016–2017 and explicitly accounting for its complex multistage sampling design, we estimated survey-weighted prevalence figures and regression models for chronic kidney disease (CKD) and high cardiovascular (CV) risk in the adult population. CKD was defined as an estimated glomerular filtration rate below 60 mL/min/1.73 m2, while high cardiovascular risk was defined as the highest risk category (category 3) of the Chilean-adapted Framingham 10-year cardiovascular risk score.
The analytic sample comprised n = 6233 adults, including approximately n CKD 190 individuals with CKD events. The survey-weighted prevalence of CKD was 3.1% (95% CI: 2.4–3.8), and the prevalence of high cardiovascular risk was 23.9% (95% CI: 21.5–26.3).
All variance estimates fully accounted for the complex survey design, including stratification, clustering, unequal sampling weights, and strata containing a single primary sampling unit, which were handled using conservative variance adjustments.
Table 5 presents survey-weighted baseline characteristics stratified by CKD status. Individuals with CKD were substantially older than those without CKD, with a mean age of 74.8 years compared to 42.3 years in the non-CKD group ( p < 0.001 ). The proportion of women was similar across groups (50.9% vs. 51.6%, p = 0.92 ), and the mean body mass index (BMI) did not differ significantly despite the large age gap (28.9 vs. 28.5 kg/m2, p = 0.58 ).
In contrast, cardiometabolic conditions were markedly more prevalent among individuals with CKD. Hypertension affected 86.7% of participants with CKD compared to 25.5% among those without CKD ( p < 0.001 ), while diabetes prevalence was nearly three times higher in the CKD group (31.1% vs. 11.2%, p < 0.001 ). Most notably, the prevalence of high cardiovascular risk exceeded 90% among individuals with CKD, compared to 21.5% in the non-CKD population ( p < 0.001 ).
Taken together, these findings reveal a pronounced clustering of cardiovascular risk factors and predicted cardiovascular risk among individuals with CKD in the Chilean adult population, providing a strong empirical motivation for the multivariable and regularized modeling strategies developed in the subsequent analyses.

4.2.1. Naive Survey-Weighted Model

As an initial benchmark, we estimated a survey-weighted logistic regression model treating CKD status as exogenous. The model adjusted for age, sex, hypertension, diabetes, and body mass index, while fully accounting for the complex survey design.
In this naive specification, CKD was strongly associated with high cardiovascular risk. Individuals with CKD exhibited substantially higher odds of being classified as having high cardiovascular risk compared to those without CKD (OR = 5.66; 95% CI: 2.71–11.82; p < 0.001 ). Age and diabetes emerged as the strongest predictors, while hypertension also showed a positive and statistically significant association. Sex and body mass index were not independently associated with high cardiovascular risk in this model.
While these results indicate a robust association between CKD and predicted cardiovascular risk, the naive specification does not address potential overfitting or multicollinearity among cardiometabolic covariates. These limitations motivate the use of regularized variable selection techniques in the subsequent analysis.

4.2.2. Regularized Variable Selection via LASSO

Under the conservative λ 1 se criterion, LASSO selected a compact and clinically interpretable set of predictors, including chronic kidney disease (CKD), age, hypertension, and diabetes. This subset captures the core cardiometabolic pathways linking renal dysfunction and cardiovascular risk.
Using the more permissive λ min criterion, additional variables entered the model, including body mass index and selected metabolic and laboratory indicators, as well as regional fixed effects. While this specification improved in-sample fit, it did so at the cost of increased model complexity.
Table 6 reports the survey-weighted logistic regression results re-estimated using the variables selected by each LASSO rule. Across both specifications, CKD remained a strong and statistically significant predictor of high cardiovascular risk. In the λ 1 se model, individuals with CKD exhibited more than fivefold higher odds of being classified as having high cardiovascular risk compared to those without CKD (OR = 5.73; 95% CI: 2.80–11.73). Nearly identical effect sizes were observed under the λ min specification (OR = 5.62; 95% CI: 2.72–11.60), indicating substantial robustness of the CKD association to alternative model selection criteria.
Age and diabetes consistently emerged as the strongest predictors in both models, while hypertension retained a moderate but statistically significant association. Importantly, the sign and magnitude of the CKD coefficient were stable across specifications, and all shared covariates exhibited identical coefficient signs, underscoring the structural robustness of the estimated relationships.
Model performance metrics were similar across the two specifications. The McFadden pseudo- R 2 was 0.445 for the λ 1 se model and 0.449 for the λ min model, suggesting that the additional covariates selected under λ min provided only marginal improvements in explanatory power.
Overall, these results support the use of the parsimonious λ 1 se specification as the primary empirical model, while the λ min specification serves as a sensitivity analysis confirming the stability of the CKD effect.
In the empirical application, LASSO is used exclusively as a data-driven variable screening step. All reported odds ratios, confidence intervals, and causal estimates are obtained from unpenalized survey-weighted models refitted using the covariate sets selected under λ 1 se (main specification) and λ min (sensitivity analysis).
Both naive and LASSO-refitted survey-weighted models yield similar odds ratios, indicating a strong association between CKD and high cardiovascular risk. These models are interpreted as predictive or associational and do not admit a causal interpretation. See Table 7.

4.2.3. Endogeneity Assessment via Two-Stage Residual Inclusion

To explore potential endogeneity between chronic kidney disease (CKD) status and cardiovascular risk, we implemented a two-stage residual inclusion (2SRI) approach adapted to the complex ENS sampling design. In this framework, CKD status was treated as potentially endogenous.
In the first stage, CKD was modeled as a function of age, sex, hypertension, and diabetes, which act as clinically motivated predictors of kidney dysfunction and are strongly related to CKD prevalence. The first-stage model was estimated using survey-weighted logistic regression without penalization. First-stage relevance was assessed using design-based Wald tests that account for stratification, clustering, and survey weights. Coefficient estimates, standard errors, effective sample size, and joint relevance diagnostics are reported in Table 8.
The resulting first-stage residuals were then included as a control function in the second-stage outcome model. In this specification, the residual term was not statistically significant ( p = 0.14 ), providing no strong evidence of residual endogeneity after conditioning on observed covariates.
The estimated effect of CKD increased markedly in magnitude (OR = 69.1), but with a very wide confidence interval (95% CI: 1.06–4502.7), reflecting numerical instability due to limited overlap and the small effective sample size of individuals with CKD. These results indicate that conventional control-function approaches may be unreliable in this setting and motivate the use of orthogonalized machine learning estimators as the primary causal strategy.
  • Diagnostics for extreme odds ratios.
The extreme odds ratio observed in the 2SRI specification reflects sparse-data issues and limited overlap between CKD and non-CKD groups. The number of CKD events per covariate was below conventional thresholds, and quasi-separation was detected in the second-stage model. Mean variance inflation factors were below 5, indicating limited multicollinearity. Penalized likelihood corrections (e.g., Firth) were not applied, as the 2SRI results are reported solely for diagnostic comparison.

4.2.4. Double/Debiased Machine Learning Estimation

Finally, we estimated the effect of CKD on high cardiovascular risk using a double/debiased machine learning (DML) approach based on the partially linear regression (PLR) framework with cross-fitting. This specification avoids reliance on inverse propensity weighting, which can be unstable in the presence of rare treatments and complex survey designs.
Nuisance components for the outcome and treatment models were estimated via survey-weighted logistic regressions, and the causal effect of CKD was obtained from a second-stage regression on orthogonalized residuals. This strategy yields a stable and approximately unbiased estimate of the average treatment effect under weak regularity conditions.
The DML estimate suggests that CKD is not associated with a statistically significant increase in the probability of being classified as having high cardiovascular risk once demographic characteristics and cardiometabolic comorbidities are accounted for ( θ ^ = 0.094 , SE = 0.161; 95% CI: [ 0.409 , 0.220 ] ). While naive and control-function specifications indicate a strong association between CKD and cardiovascular risk, the DML results indicate that this relationship is largely explained by confounding factors, particularly age and diabetes.
Table 9 summarizes the empirical modeling results for the association between CKD and high cardiovascular risk.
Figure 5 provides a visual comparison of the CKD effect across modeling strategies, while Figure 6 illustrates how predicted cardiovascular risk distributions differ between naive and DML-adjusted specifications.
Given the low prevalence of CKD (approximately 3%) and the resulting limited effective sample size, the minimum detectable effect for the DML estimator is on the order of 0.25–0.30 probability points. Smaller causal effects cannot be reliably distinguished from zero under the observed design, and estimates should be interpreted with this power limitation in mind.
The attenuation and loss of statistical significance observed in the DML estimate, relative to naive models, is consistent with both confounding adjustment and the increased variance associated with nuisance estimation under limited effective sample size. This pattern reinforces the interpretation of DML results as conservative, design-aware estimates rather than precise structural effects.
From an identification perspective, the relatively small number of CKD cases ( n C K D 190 ) implies a limited effective sample size for estimating treatment effects, particularly under complex sampling with clustering and unequal weights. This constraint is further amplified when using cross-fitting and flexible nuisance estimation in DML, which partitions the data and may reduce effective information per fold.

5. Discussion

This study develops and illustrates a unified, design-aware framework that integrates penalized prediction, causal inference, and design-consistent uncertainty quantification under complex survey sampling. Consistent with this objective, the principal contribution of this work is methodological rather than substantive.

5.1. Methodological Reflections

The application to chronic kidney disease (CKD) and cardiovascular (CV) risk serves as a concrete example that exposes distinctions that matter mathematically and operationally but are often blurred in applied research.
First, the empirical results highlight a structural separation between predictive association and causal contribution. Survey-weighted descriptive analyses and penalized predictive models reproduce well-established empirical patterns in the CKD literature [22,23]. However, once confounding and endogeneity are addressed using control-function or orthogonalized estimators, the independent causal effect of CKD on high predicted CV risk is greatly attenuated.
Second, the methodological architecture emphasizes the importance of design-consistent prediction metrics. In complex surveys, naive AUC and calibration measures can be misleading. Recent methodological developments provide weighted estimators of ROC curves and AUC that honor sampling weights and clustering [46,47]. Integrating these tools prevents overestimation of discriminative ability and ensures that predictive evaluation aligns with the finite population represented by the survey.
Third, the use of design-based bootstrap for both predictive and causal estimators is a core part of the framework. Classical resampling methods for complex surveys—including Rao–Wu and Rust–Rao replicate-weight approaches—provide valid variance estimation under stratification and clustering [2,3]. Modern software implementations further facilitate reproducible variance estimation and stability assessment [48].
Fourth, addressing endogeneity under complex sampling requires careful methodological integration. Control-function logic in nonlinear models [6] and orthogonalization via double/debiased machine learning (DML) provide complementary strategies. Recent advances in weighted orthogonal learners [11,49] reinforce the value of orthogonality when nuisance estimation must incorporate survey weights, high-dimensionality, or limited effective sample sizes. In this illustrative example, the attenuation of the CKD effect is therefore interpreted not as a biomedical conclusion but as an instance of how orthogonal estimators isolate structural signal from design-induced or confounding-driven associations.
Fifth, the example illustrates a fundamental principle: under complex sampling, prediction and causal inference need not coincide. Predictive estimators may exploit associations that are not causally meaningful, while causal estimators prioritize robustness even at the cost of reduced discrimination. The proposed framework makes these trade-offs explicit and provides a reproducible workflow for applied research in any domain where complex sampling, penalization, and causal inference intersect.
  • Methodological reflections on DML under complex sampling.
While DML provides a powerful framework for bias reduction via orthogonalization, its performance in complex survey settings depends on the interaction between sampling design and nuisance estimation. In particular, clustering, weighting, and limited effective sample size may constrain the flexibility and stability of machine learning estimators.
In this context, DML should be interpreted as a principled compromise between bias control and variance inflation, rather than as a fully nonparametric causal estimator. This perspective aligns with recent discussions emphasizing the projection-based interpretation of orthogonal estimators in finite samples.
From a broader perspective, our approach connects recent advances in orthogonal machine learning [7,12] with the design-based inference tradition in survey statistics. While the former emphasizes robustness to high-dimensional nuisance estimation, the latter focuses on valid population-level inference under complex sampling. Bridging these two perspectives remains an open methodological challenge.
The framework developed in this paper is closely related to recent advances in survival analysis and longitudinal modeling under complex dependence structures. In particular, modern approaches increasingly integrate clustering, competing risks, and heterogeneity within unified estimation frameworks [17,18].
Although our focus is on cross-sectional binary outcomes, the underlying principles—namely, design-aware estimation, orthogonalization, and resampling-based inference—extend conceptually to time-to-event settings. In such contexts, the target parameter becomes a hazard function or cumulative incidence rather than a conditional mean, but similar challenges arise in accounting for dependence and ensuring robust inference. Extending the present framework to survival and longitudinal data represents a natural direction for future research.
  • Methodological and decision-support implications.
From an applied-mathematics standpoint, the framework contributes two decision-relevant capabilities. (i) It yields design-consistent predictive evaluation—weighted ROC/AUC, calibration, and bootstrap uncertainty—so that model performance reflects the target finite population, a prerequisite for downstream policy rules based on risk thresholds [46,47]. (ii) It provides endogeneity-aware causal effect estimates via orthogonal moments, ensuring that structural parameters used to justify escalation/escalation-avoidance policies are not artifacts of confounding. Together, these components support evidence-to-decision workflows that combine discrimination and calibration with explicit trade-offs, e.g., via decision-curve analysis and net benefit, which quantify whether model-guided actions improve outcomes relative to treat-all or treat-none strategies [50,51].

5.2. Substantive Insights

While the primary contribution of this study is methodological, the empirical application provides context for interpreting CKD within cardiovascular risk stratification. Survey-weighted predictive models reproduce patterns consistent with existing clinical literature and guidelines, which continue to treat CKD and kidney measures as important components of CV risk assessment [52,53].
However, once confounding and endogeneity are addressed, the independent contribution of CKD is substantially attenuated. This pattern is consistent with modern risk engines—such as QRISK3 [54] and SCORE2 CKD extensions [55]—where kidney metrics improve predictive performance, but binary CKD indicators do not necessarily retain a strong stand-alone structural effect.
These findings should be interpreted with caution. The data and identification strategy do not support definitive causal claims, and the absence of credible exclusion restrictions further limits structural interpretation. Rather than suggesting changes to clinical guidelines, the results highlight the importance of distinguishing between predictive markers and causal drivers in population-level risk modeling.

5.3. Limitations

Although the framework is broadly applicable, several limitations must be recognized. First, causal interpretation of orthogonalized estimators still relies on strong identification assumptions—particularly approximate unconfoundedness or valid instruments—which cannot be empirically verified in observational surveys. The absence of credible exclusion-based instruments in large-scale health surveys remains a fundamental limitation for structural identification, reinforcing the importance of orthogonal, design-aware estimators under conditional exogeneity assumptions. Second, effective sample size can be severely limited for rare conditions like CKD, amplifying the variance of both predictive and causal estimators. Third, while the simulation study emulates key features of complex surveys, real-world surveys may present additional complications such as differential measurement error, multi-phase sampling, or uncontrolled response patterns. Fourth, the empirical application uses a unidirectional causal structure for conceptual clarity, though bidirectional cardio–renal interactions are clinically plausible. These limitations indicate that the empirical results should be viewed strictly as an illustrative case study validating the methodological pipeline rather than as substantive epidemiological findings.

5.4. Future Work

Several extensions follow naturally from the present framework. One avenue concerns the development of survey-specific orthogonal learners that directly incorporate replicate weights or calibration constraints. Another concerns sharper finite-sample guarantees for penalized and debiased estimators under multistage sampling, where asymptotic approximations may be less reliable. A third direction is the integration of this framework into decision-support contexts, where prediction and causality must be balanced optimally under sampling constraints—e.g., combining design-consistent model evaluation with decision curves and cost–utility targets [50,51]. Finally, extending the theoretical analysis of design-consistent bootstrap methods for orthogonal estimators represents a promising path toward more rigorous foundations for causal inference in complex survey environments.

6. Conclusions

This study demonstrates the value of integrating survey-weighted inference, simulation analysis, and endogeneity-aware machine learning to separate predictive performance from causal interpretation in population health data. Using a nationally representative complex survey, we show that chronic kidney disease (CKD) is strongly associated with high predicted cardiovascular risk, yet its independent causal contribution is substantially attenuated once age and major cardiometabolic comorbidities are accounted for.
From a methodological standpoint, three lessons emerge. First, under complex sampling, predictive and causal targets need not coincide: penalized, survey-weighted models can retain strong discrimination and calibration for risk stratification, while orthogonalized estimators (e.g., DML) appropriately shrink structural effects in the presence of confounding and endogeneity. Second, survey-consistent performance assessment—weighted ROC/AUC, calibration, and a design-preserving bootstrap—matters for population-level claims and prevents overstating model utility. Third, control-function and orthogonal methods provide complementary routes to causal identification; when assumptions hold, they recover causal parameters without sacrificing design awareness.
The simulation study confirms that naive predictive models may overstate causal effects under realistic confounding structures, whereas DML and control-function approaches can recover causal parameters when identification assumptions hold. Together, the empirical and simulation results underscore the importance of aligning modeling strategies with inferential objectives in complex surveys and of making explicit the trade-offs between prediction and causality.
Practically, the framework offers a scalable, design-consistent blueprint for evaluating extensions to cardiovascular risk tools (e.g., adding kidney measures) and for supporting decision-making with metrics that reflect the finite population and quantify uncertainty. Future work will (i) extend the analysis to additional survey waves and settings to assess transportability; (ii) develop survey-adapted orthogonal learners that directly incorporate replicate weights and calibration constraints; and (iii) further integrate decision-analytic tools (e.g., net benefit and cost–utility analysis) with design-aware evaluation to optimize guideline adaptations and clinical decision support in CKD-specific cardiovascular risk management.

Author Contributions

Conceptualization, F.R. and A.T.; methodology, A.T. and F.R.; software, A.T.; validation, F.R. and H.E.; formal analysis, A.T.; investigation, A.T.; resources, H.E.; data curation, A.T.; writing—original draft preparation, A.T.; writing—review and editing, F.R. and H.E.; visualization, A.T.; supervision, F.R.; project administration, F.R.; funding acquisition, F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Master’s Program in Clinical Analysis, School of Chemistry and Pharmacy, Faculty of Pharmacy, Universidad de Valparaíso. The APC was funded by the same program.

Data Availability Statement

The data presented in this study are openly available in [Zenodo] at [https://doi.org/10.5281/zenodo.19358307] and in [Mendeley ENS 2016–2017] at [https://data.mendeley.com/datasets/3myh3nzcy3/1? (accessed on 7 April 2026)].

Acknowledgments

The authors acknowledge the support of the academic staff of the Master’s Program in Clinical Analysis for their guidance throughout the project.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Lumley, T. Complex Surveys: A Guide to Analysis Using R; John Wiley & Sons: Hoboken, NJ, USA, 2010. [Google Scholar]
  2. Rao, J.N.K.; Wu, C.F.J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231–241. [Google Scholar] [CrossRef]
  3. Rust, K.; Rao, J. Variance estimation for complex surveys using replication techniques. Stat. Methods Med. Res. 1996, 5, 283–310. [Google Scholar] [CrossRef] [PubMed]
  4. Roy, J.; Shou, H.; Xie, D.; Hsu, J.Y.; Yang, W.; Anderson, A.H.; Landis, J.R.; Jepson, C.; He, J.; Liu, K.D.; et al. Statistical methods for cohort studies of CKD: Prediction modeling. Clin. J. Am. Soc. Nephrol. 2017, 12, 1010–1017. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, J. Weighted estimation for multivariate shared frailty models for complex surveys. Lifetime Data Anal. 2019, 25, 469–479. [Google Scholar] [CrossRef] [PubMed]
  6. Wooldridge, J.M. Control Function Methods in Applied Econometrics. J. Hum. Resour. 2015, 50, 420–445. [Google Scholar] [CrossRef]
  7. Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/Debiased Machine Learning for Treatment and Structural Parameters. Econom. J. 2018, 21, C1–C68. [Google Scholar] [CrossRef]
  8. Iparragirre, J.; Lumley, T.; Barrio, I.; Arostegui, I. Variable selection with LASSO regression for complex survey data. Stat 2023, 12, e578. [Google Scholar] [CrossRef]
  9. McConville, K.S.; Breidt, F.J.; Lee, T.; Moisen, G.G. Model-assisted survey regression estimation with the lasso. J. Surv. Stat. Methodol. 2017, 5, 131–158. [Google Scholar] [CrossRef]
  10. Jasiak, J.; Tuvaandorj, P. Penalized Likelihood Inference with Survey Data. arXiv 2023, arXiv:2304.07855. [Google Scholar] [CrossRef]
  11. Morzywołek, P.; Decruyenaere, J.; Vansteelandt, S. On Weighted Orthogonal Learners for Heterogeneous Treatment Effects. arXiv 2024, arXiv:2303.12687. [Google Scholar]
  12. Semenova, V.; Chernozhukov, V. Debiased Machine Learning of Conditional Average Treatment Effects and Other Causal Functions. Econom. J. 2021, 24, 264–289. [Google Scholar] [CrossRef]
  13. Bostani, A.; Mirzaeibonehkhater, M.; Najafi, H.; Mehrtash, M.; Alizadehsani, R.; Tan, R.S.; Acharya, U.R. MLP-RL-CRD: Diagnosis of Cardiovascular Risk in Athletes Using a Reinforcement Learning-Based Multilayer Perceptron. Physiol. Meas. 2023, 44, 125012. [Google Scholar] [CrossRef]
  14. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  15. Alizadehsani, R.; Habibi, J.; Hosseini, M.J.; Mashayekhi, H.; Boghrati, R.; Ghandeharioun, A.; Bahadorian, B.; Sani, Z.A. Machine Learning-Based Coronary Artery Disease Diagnosis: A Comprehensive Review. Comput. Biol. Med. 2019, 111, 103346. [Google Scholar] [CrossRef]
  16. Liu, X.Y.; Zhou, Z.H. The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study. IEEE Trans. Knowl. Data Eng. 2021, 34, 1452–1470. Available online: https://www.lamda.nju.edu.cn/publication/icdm06b.pdf (accessed on 7 April 2026).
  17. Wang, Y.; Li, J.; Zhou, Q.; Wang, W. A unified framework for complex survival data: Accounting for clustering, cure fractions, and competing risks. BMC Med. Res. Methodol. 2026. [Google Scholar] [CrossRef]
  18. Zhuang, L.; Ma, Y.; Fang, G.; Xu, A. Modeling two-scale degradation with heterogeneity: A unified random-effects inverse Gaussian framework. IISE Trans. 2026, 1–16. [Google Scholar] [CrossRef]
  19. Kaddour, J.; Lynch, A.; Liu, Q.; Kusner, M.J.; Silva, R. Causal Machine Learning: A Survey and Open Problems. Found. Trends Optim. 2025, 9, 1–247. [Google Scholar] [CrossRef]
  20. Shahu, A.; Chung, J.; Tarraf, W.; Ramos, A.R.; González, H.M.; Redline, S.; Cai, J. Method comparison and estimation of causal effects of insomnia on health outcomes in a survey sampled population. Sci. Rep. 2023, 13, 9831. [Google Scholar] [CrossRef]
  21. Hu, L.; Yu, J.; Yang, C.; Chen, M.; Tang, Z.; Liao, R.; Jike, C.; Wang, J.; Wang, R.; Liao, Q.; et al. Inverse probability weighting for causal inference in hierarchical data. BMC Med. Res. Methodol. 2025, 25, 185. [Google Scholar] [CrossRef] [PubMed]
  22. Weiner, D.E.; Tighiouart, H.; Elsayed, E.F.; Griffith, J.L.; Salem, D.N.; Levey, A.S.; Sarnak, M.J. The Framingham Predictive Instrument in Chronic Kidney Disease. J. Am. Coll. Cardiol. 2007, 50, 217–224. [Google Scholar] [CrossRef]
  23. Li, J.; Lindholm, B. Cardiovascular Risk Prediction in Chronic Kidney Disease. Am. J. Nephrol. 2022, 53, 730–739. [Google Scholar] [CrossRef]
  24. Taylor, J.M.; Ankerst, D.P.; Andridge, R.R. Validation of biomarker-based risk prediction models. Clin. Cancer Res. 2008, 14, 5977–5983. [Google Scholar] [CrossRef]
  25. Vasquez, M.M.; Hu, C.; Roe, D.J.; Chen, Z.; Halonen, M.; Guerra, S. Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: Simulation and application. BMC Med. Res. Methodol. 2016, 16, 154. [Google Scholar] [CrossRef]
  26. Chronic Kidney Disease Prognosis Consortium. Association of estimated glomerular filtration rate and albuminuria with all-cause and cardiovascular mortality: A collaborative meta-analysis of general population cohorts. Lancet 2010, 375, 2073–2081. [Google Scholar] [CrossRef]
  27. Stenvinkel, P. Inflammation in end-stage renal disease: The hidden enemy (Review Article). Nephrology 2006, 11, 36–41. [Google Scholar] [CrossRef]
  28. Stenvinkel, P.; Heimbürger, O.; Paultre, F.; Diczfalusy, U.; Wang, T.; Berglund, L.; Jogestrand, T. Strong Association Between Malnutrition, Inflammation, and Atherosclerosis in Chronic Renal Failure. KIdney Int. 1999, 55, 1899–1911. [Google Scholar] [CrossRef]
  29. Morena, M.; Cristol, J.P.; Senécal, L.; Leray-Moragues, H.; Krieter, D.; Canaud, B. Oxidative Stress in Hemodialysis Patients: Is NADPH Oxidase Complex the Culprit? Kidney Int. Suppl. 2002, 61, S109–S114. [Google Scholar] [CrossRef] [PubMed]
  30. Vicca, S.; Massy, Z.A.; Hennequin, C.; Rihane, D.; Nguyen-Khoa, T.; Drüeke, T.B.; Lacour, B. New insights into the effects of the protein moiety of oxidized LDL (oxLDL). Kidney Int. 2003, 63, S125–S127. [Google Scholar] [CrossRef] [PubMed]
  31. Shlipak, M.G.; Katz, R.; Sarnak, M.J.; Fried, L.F.; Newman, A.B.; Stehman-Breen, C.; Seliger, S.L.; Kestenbaum, B.; Psaty, B.; Tracy, R.P.; et al. Cystatin C and prognosis for cardiovascular outcomes in Elderly Persons without Chronic Kidney Disease. Ann. Intern. Med. 2006, 145, 237–246. [Google Scholar] [CrossRef] [PubMed]
  32. Peralta, C.A.; Katz, R.; Sarnak, M.J.; Ix, J.; Fried, L.F.; De Boer, I.; Palmas, W.; Siscovick, D.; Levey, A.S.; Shlipak, M.G. Cystatin C Identifies Chronic Kidney Disease Patients at Higher Risk for Complications. J. Am. Soc. Nephrol. 2011, 22, 147–155. [Google Scholar] [CrossRef]
  33. Herrington, W.; Staplin, N.; Judge, P.K.; Mafham, M.; Emberson, J.; Haynes, R.; Wheeler, D.C.; Walker, R.; Tomson, C.; Agodoa, L.; et al. Evidence for Reverse Causality in the Association Between Blood Pressure and Cardiovascular Risk in Patients with Chronic Kidney Disease. Hypertension 2017, 69, 314–322. [Google Scholar] [CrossRef]
  34. Yang, W.; Wu, X.; Zhao, M.; Hu, J.; Lin, C.; Mei, Z.; Chen, J.; Zhou, X.-J.; Nie, S.; Nie, J.; et al. Kidney Function and Cardiovascular Disease: Evidence from Observational Studies and Mendelian Randomization Analyses. Phenomics 2024, 4, 250–253. [Google Scholar] [CrossRef]
  35. Hu, C.; Li, Y.; Qian, Y.; Wu, Z.; Hu, B.; Peng, Z. Kidney Function and Cardiovascular Diseases: A Large-scale Observational and Mendelian Randomization Study. Front. Immunol. 2023, 14, 1190938. [Google Scholar] [CrossRef] [PubMed]
  36. Terza, J.V.; Basu, K.; Rathouz, P.J. Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. J. Health Econ. 2008, 27, 531–543. [Google Scholar] [CrossRef] [PubMed]
  37. Terza, J.V. Two-Stage Residual Inclusion Estimation: A Practitioners Guide to Stata Implementation. Stata J. 2017, 17, 916–938. [Google Scholar] [CrossRef]
  38. Emmenegger, C.; Bühlmann, P. Regularizing Double Machine Learning in Partially Linear Endogenous Models. Electron. J. Stat. 2021, 15, 6461–6543. [Google Scholar] [CrossRef]
  39. Beaumont, J.F.; Patak, Z. On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling. Int. Stat. Rev. 2012, 80, 127–148. [Google Scholar] [CrossRef]
  40. McCaffrey, D.F.; Ridgeway, G.; Morral, A.R. Propensity Score Estimation with Boosted Regression for Evaluating Causal Effects in Observational Studies. Psychol. Methods 2004, 9, 403–425. [Google Scholar] [CrossRef]
  41. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  42. Deo, R.; Dubin, R.F.; Ren, Y.; Murthy, A.C.; Wang, J.; Zheng, H.; Zheng, Z.; Feldman, H.; Shou, H.; Coresh, J.; et al. Proteomic cardiovascular risk assessment in chronic kidney disease. Eur. Heart J. 2023, 44, 2095–2110. [Google Scholar] [CrossRef]
  43. Levey, A.; Coresh, J. Chronic kidney disease. Lancet 2012, 379, 165–180. [Google Scholar] [CrossRef]
  44. Bansal, N.; Zelnick, L.; Go, A.; Anderson, A.; Christenson, R.; Deo, R.; Defilippi, C.; Lash, J.; He, J.; Ky, B.; et al. Cardiac Biomarkers and Risk of Incident Heart Failure in Chronic Kidney Disease. J. Am. Heart Assoc. 2019, 8, 21. [Google Scholar] [CrossRef]
  45. Rojas, F. Reproducible Code and Data Repository: Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches. Zenodo 2026. Available online: https://zenodo.org/records/19358307 (accessed on 7 April 2026).
  46. Izrael, D.; Battaglia, A.A.; Hoaglin, D.C.; Battaglia, M.P. Use of the ROC Curve and the Bootstrap in Comparing Weighted Logistic Regression Models. In Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference (SUGI 27); SAS Institute Inc.: Cary, NC, USA, 2002; p. 248-27. [Google Scholar]
  47. Iparragirre, A.; Barrio, I.; Arostegui, I. svyROC: Estimation of the ROC Curve and the AUC for Complex Survey Data. CRAN Package Manual, Version 1.0.0, 2025. Available online: https://cran.r-project.org/web/packages/svyROC/svyROC.pdf (accessed on 7 April 2026).
  48. Feehan, D.M. Surveybootstrap: Bootstrap with Survey Data, 2024. R Package (dev) v0.1.0.90000. Available online: https://cran.r-project.org/web/packages/surveybootstrap/index.html (accessed on 7 April 2026).
  49. Chernozhukov, V.; Hansen, C.; Kallus, N.; Spindler, M.; Syrgkanis, V. Applied Causal Inference Powered by ML and AI. arXiv 2024, arXiv:2403.02467. [Google Scholar] [CrossRef]
  50. Vickers, A.J.; Elkin, E.B. Decision curve analysis: A novel method for evaluating prediction models. Med. Decis. Mak. 2006, 26, 565–574. [Google Scholar] [CrossRef] [PubMed]
  51. Steyerberg, E.W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, 2nd ed.; Springer: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  52. Stevens, P.E.; Ahmed, S.B.; Carrero, J.J.; Foster, B.; Francis, A.; Hall, R.K.; Herrington, W.G.; Hill, G.; Inker, L.A.; Kazancıoğlu, R.; et al. KDIGO 2024 clinical practice guideline for the evaluation and management of chronic kidney disease. Kidney Int. 2024, 105, S117–S314. [Google Scholar] [CrossRef]
  53. Ndumele, C.E.; Neeland, I.J.; Tuttle, K.R.; Chow, S.L.; Mathew, R.O.; Khan, S.S.; Coresh, J.; Baker-Smith, C.M.; Carnethon, M.R.; Després, J.-P.; et al. A Synopsis of the Evidence for the Science and Clinical Management of Cardiovascular–Kidney–Metabolic (CKM) Syndrome. Circulation 2023, 148, 1636–1664. [Google Scholar] [CrossRef]
  54. Hippisley-Cox, J.; Coupland, C.; Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease. BMJ 2017, 357, j2099. [Google Scholar] [CrossRef]
  55. Matsushita, K.; Kaptoge, S.; Hageman, S.H.J.; Visseren, F.L.J.; Pennells, L.; Coresh, J. Including measures of chronic kidney disease to improve cardiovascular risk prediction by SCORE2 and SCORE2-OP. Eur. Heart J. 2022, 43, ehac544.2256. [Google Scholar] [CrossRef]
Figure 2. Survey-weighted ROC curves for the full-like scenario. DML-offset svyglm (green), and 2SRI residual inclusion (orange). The dashed diagonal denotes the no-discrimination reference line.
Figure 2. Survey-weighted ROC curves for the full-like scenario. DML-offset svyglm (green), and 2SRI residual inclusion (orange). The dashed diagonal denotes the no-discrimination reference line.
Mathematics 14 01554 g002
Figure 3. Survey-weighted calibration by deciles. Points near the identity line indicate good calibration. The 2SRI model (orange) shows slightly better global calibration than naive LASSO (cyan). Point sizes reflect population mass by decile.The dashed diagonal line represents the line of no discrimination (AUC = 0.5), corresponding to a random classifier.
Figure 3. Survey-weighted calibration by deciles. Points near the identity line indicate good calibration. The 2SRI model (orange) shows slightly better global calibration than naive LASSO (cyan). Point sizes reflect population mass by decile.The dashed diagonal line represents the line of no discrimination (AUC = 0.5), corresponding to a random classifier.
Mathematics 14 01554 g003
Figure 4. Bootstrap distribution of θ ^ DML under complex sampling. The solid vertical line marks the bootstrap mean; dashed lines mark the 2.5% and 97.5% percentiles. Values are expressed in percentage points on a residualized probability scale.
Figure 4. Bootstrap distribution of θ ^ DML under complex sampling. The solid vertical line marks the bootstrap mean; dashed lines mark the 2.5% and 97.5% percentiles. Values are expressed in percentage points on a residualized probability scale.
Mathematics 14 01554 g004
Figure 5. Empirical estimates of the effect of CKD on high cardiovascular risk using alternative modeling strategies. The naive and 2SRI specifications report odds ratios (OR) from survey-weighted logistic regressions, while the DML estimate reports the average treatment effect on the probability scale. The figure highlights the instability of conventional endogeneity-correction approaches in this setting and the attenuation of the CKD effect under orthogonalized machine learning.
Figure 5. Empirical estimates of the effect of CKD on high cardiovascular risk using alternative modeling strategies. The naive and 2SRI specifications report odds ratios (OR) from survey-weighted logistic regressions, while the DML estimate reports the average treatment effect on the probability scale. The figure highlights the instability of conventional endogeneity-correction approaches in this setting and the attenuation of the CKD effect under orthogonalized machine learning.
Mathematics 14 01554 g005
Figure 6. Survey-weighted predicted probabilities of high cardiovascular risk by CKD status. Panel A displays predictions from the naive model, showing a pronounced separation between individuals with and without CKD. Panel B shows predictions after orthogonalization using the DML framework, where the separation is substantially attenuated, indicating that the naive association is largely driven by confounding factors.
Figure 6. Survey-weighted predicted probabilities of high cardiovascular risk by CKD status. Panel A displays predictions from the naive model, showing a pronounced separation between individuals with and without CKD. Panel B shows predictions after orthogonalization using the DML framework, where the separation is substantially attenuated, indicating that the naive association is largely driven by confounding factors.
Mathematics 14 01554 g006
Table 1. Estimators, effect scales, and inferential roles.
Table 1. Estimators, effect scales, and inferential roles.
EstimatorScaleInferential RoleVariance Estimation
Naive svyglmOdds ratio (OR)AssociationalTaylor/PSU bootstrap
LASSO → refit svyglmORAssociationalPSU bootstrap
2SRIORDiagnostic (endogeneity)PSU bootstrap
DML (primary)Probability difference (pp)CausalPSU bootstrap
Table 2. Design-based predictive performance (full-like scenario, N = 120 , 000 ; p = 50 ; ρ = 0.5 ; endogeneity = 0.25 ; 2 instruments; CKD target prevalence = 12 % ).
Table 2. Design-based predictive performance (full-like scenario, N = 120 , 000 ; p = 50 ; ρ = 0.5 ; endogeneity = 0.25 ; 2 instruments; CKD target prevalence = 12 % ).
ModelAUC (Weighted)Brier (Weighted)
Naive (survey LASSO)0.68230.2166
DML-offset (survey svyglm)0.68750.2156
2SRI residual inclusion0.70520.2102
95% CI (bootstrap) for Naive
AUC
[0.6750, 0.6949]
95% CI (bootstrap) for Naive
Brier
[0.2130, 0.2192]
Table 3. Causal CKD effect via double/debiased machine learning (DML) under complex sampling (reported in percentage points on a residualized probability scale).
Table 3. Causal CKD effect via double/debiased machine learning (DML) under complex sampling (reported in percentage points on a residualized probability scale).
StatisticValue
θ ^ DML (pp)14.206
95% CI (percentile bootstrap)[12.408,15.809]
Table 4. Sensitivity under raking to target CKD prevalence (design-based metrics and DML parameter).
Table 4. Sensitivity under raking to target CKD prevalence (design-based metrics and DML parameter).
Target CKD PrevalenceAUC (Naive)Brier (Naive) θ ^ DML (pp)
8%0.68200.216714.202
12%0.68160.216914.212
20%0.68230.216614.215
Table 5. Baseline characteristics of the ENS 2016–2017 adult population by CKD status.
Table 5. Baseline characteristics of the ENS 2016–2017 adult population by CKD status.
No CKDCKDp-Value
(eGFR60)(eGFR < 60)
Prevalence (%)96.93.1
Age, years (mean ± SE)42.3 ± 0.474.8 ± 1.5<0.001
Female (%)51.650.90.92
Hypertension (%)25.586.7<0.001
Diabetes (%)11.231.1<0.001
BMI, kg/m2 (mean ± SE)28.5 ± 0.128.9 ± 0.80.58
High CV risk (%)21.590.9<0.001
Notes: All estimates are survey-weighted and account for the complex multistage sampling design of the ENS 2016–2017. CKD is defined as an estimated glomerular filtration rate below 60 mL/min/1.73 m2. High cardiovascular risk corresponds to the highest category of the Chilean-adapted Framingham 10-year risk score. p-values are derived from design-based t-tests for continuous variables and Rao–Scott adjusted χ 2 tests for categorical variables.
Table 6. Survey-weighted logistic regression models selected via LASSO.
Table 6. Survey-weighted logistic regression models selected via LASSO.
λ 1 se λ min
CKD (eGFR < 60)5.73 ***5.62 ***
(2.80–11.73)(2.72–11.60)
Age (years)1.06 ***1.07 ***
Hypertension1.66 *1.54 *
Diabetes35.86 ***34.02 ***
BMI1.03
McFadden pseudo- R 2 0.4450.449
Notes: Odds ratios are reported with 95% confidence intervals in parentheses. All models are estimated using survey-weighted logistic regression, accounting for the complex ENS 2016–2017 sampling design. Variables are selected via cross-validated LASSO. *** p < 0.001 , * p < 0.05 .
Table 7. Associational estimates for CKD and high cardiovascular risk.
Table 7. Associational estimates for CKD and high cardiovascular risk.
EstimatorScaleEstimate95% CI n eff Interpretation
Naive svyglmOR5.66[2.71, 11.82]6233Associational
LASSO → refitOR5.73[2.80, 11.73]6233Associational
CI obtained via PSU-level bootstrap (B = 150). OR = odds ratio.
Table 8. First-stage diagnostics for two-stage residual inclusion (2SRI).
Table 8. First-stage diagnostics for two-stage residual inclusion (2SRI).
PredictorCoefficientStandard ErrorWald zp-Value
Intercept 9.944 1.110 8.96 <0.001
Age (years) 0.118 0.014 8.47 <0.001
Female (vs. male) 0.088 0.245 0.36 0.721
Hypertension 1.155 0.309 3.74 <0.001
Diabetes 0.339 0.259 1.31 0.191
Model diagnostics
Effective sample size n eff = 1042
Joint relevance (Wald test) p = 1.6 × 10 29
Notes: The first stage models chronic kidney disease (CKD) as a function of age, sex, hypertension, and diabetes using survey-weighted logistic regression. CKD is defined as an estimated glomerular filtration rate of <60 mL/min/1.73 m2 using the ENS precomputed Schwartz-based indicator (fg_CKDschwartz_diminuido_60, coded as 2 = yes). Coefficients are reported on the log-odds scale. Standard errors and Wald statistics account for the complex ENS 2016–2017 sampling design, including stratification, clustering, and survey weights. The first stage is reported for diagnostic purposes in the 2SRI specification.
Table 9. Empirical modeling results: association between CKD and high cardiovascular risk.
Table 9. Empirical modeling results: association between CKD and high cardiovascular risk.
Naive2SRIDML (PLR)
CKD effectOR = 3.70OR = 69.1 θ ^ = 0.094
95% CI[1.00,13.74][1.06,4502.7][ 0.409 , 0.220 ]
Statistical significanceBorderlineYes (unstable)No
InterpretationAssociationalDiagnosticCausal (primary)
Endogeneity
correction
NoPartialYes
Survey design
accounted
YesYesYes
Estimator stabilityHighLowHigh
Notes: Naive and 2SRI models report odds ratios (OR) from survey-weighted logistic regressions. The DML estimate reports the average treatment effect of CKD on the probability of high cardiovascular risk (in probability units), obtained using a partially linear regression framework with cross-fitting. DML is the primary estimator of interest; naive and 2SRI results are reported for comparison.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Rojas, F.; Tapia, A.; Espinoza, H. Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches. Mathematics 2026, 14, 1554. https://doi.org/10.3390/math14091554

AMA Style

Rojas F, Tapia A, Espinoza H. Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches. Mathematics. 2026; 14(9):1554. https://doi.org/10.3390/math14091554

Chicago/Turabian Style

Rojas, Fernando, Axa Tapia, and Hilda Espinoza. 2026. "Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches" Mathematics 14, no. 9: 1554. https://doi.org/10.3390/math14091554

APA Style

Rojas, F., Tapia, A., & Espinoza, H. (2026). Design-Aware Predictive and Causal Modeling of Cardiovascular Risk in Chronic Kidney Disease Using Penalized and Double Machine Learning Approaches. Mathematics, 14(9), 1554. https://doi.org/10.3390/math14091554

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop