
Variable Selection and Allocation in Joint Models via Gradient Boosting Techniques

Chair of Spatial Data Science and Statistical Learning, Georg-August-Universität Göttingen, 37073 Göttingen, Germany
Department of Medical Biometrics, Informatics and Epidemiology, University Hospital Bonn, 53127 Bonn, Germany
Author to whom correspondence should be addressed.
Mathematics 2023, 11(2), 411;
Received: 28 November 2022 / Revised: 5 January 2023 / Accepted: 9 January 2023 / Published: 12 January 2023
(This article belongs to the Special Issue Recent Advances in Computational Statistics)


Modeling longitudinal data (e.g., biomarkers) and the risk for events separately leads to a loss of information and to bias when the underlying processes are related to each other. Hence, the popularity of joint models for longitudinal and time-to-event data has grown rapidly in the last few decades. However, deciding which part of a joint model each covariate should be assigned to remains a practical challenge, as this decision usually has to be made based on background knowledge. In this work, we combined recent developments from the field of gradient boosting for distributional regression in order to construct an allocation routine allowing researchers to automatically assign covariates to the single sub-predictors of a joint model. The procedure provides several well-known advantages of model-based statistical learning tools, as well as a fast-performing allocation mechanism for joint models, which is illustrated via empirical results from a simulation study and a biomedical application.

1. Introduction

Joint models for longitudinal and time-to-event data, first introduced in [1], are a powerful tool for analyzing data where event times are recorded alongside a longitudinal outcome. If the research interest lies in the association between these two outcomes, joint modeling avoids potential bias arising from separate analyses by combining two sub-models in one single modeling framework. A thorough introduction to the concept of joint models can be found in [2], and various well-established R packages are available covering frequentist [3,4] and Bayesian [5] approaches.
Like many regression models, joint models suffer from the usual drawbacks: proper tools for variable selection are not immediately available and computation becomes increasingly infeasible in higher dimensions. In addition, joint models raise the question of which sub-model a variable should be assigned to, i.e., should a variable $x$ have a direct impact on the survival outcome $T$, or should the potential influence be modeled indirectly by an impact of $x$ on the longitudinal outcome $y$, which then might affect $T$? The number of possible choices grows exponentially with the number of covariates, and the choice usually has to be made by researchers based on background knowledge. Boosting techniques from the field of statistical learning, however, are well-known for addressing these exact issues. Originally emerging from the machine learning community as an approach to classification problems [6,7], boosting algorithms have been adapted to regression models [8] and, by now, cover a wide range of statistical models. For an introduction and overview of model-based boosting, we recommend [9,10].
The formulation of boosting routines for joint models is, to date, still a little-developed field. The foundations were laid by [11], where generalized additive models for location, scale and shape (GAMLSS) were fitted using boosting techniques. Because these models have one predictor for each distributional parameter, their structure is similar to that of joint models, and thus the boosting concept for GAMLSS could be adapted to joint models by [12]. Furthermore, in [13], joint models were estimated using likelihood-based boosting techniques, and [14,15] compare boosting routines for joint models with various other estimation approaches.
In the last few years, several additional developments have been made in order to enable variable selection for joint models, usually by applying different shrinkage techniques. In [16], an adaptive LASSO estimator was constructed that estimates $L_1$-penalized likelihoods in a two-stage fashion. This approach was later extended to multivariate longitudinal outcomes in [17] and to time-varying coefficients in [18]. In [19,20], Bayesian shrinkage estimators were applied to achieve either variable or model selection for various classes of joint models and, recently, ref. [21] applied Monte Carlo methods to enable variable selection for joint models with an interval-censored survival outcome. However, all of these approaches are only capable of selecting and estimating effects within predefined predictor functions. To the best of our knowledge, no methods exist that allocate single features to the given sub-models in a data-driven way.
The aim of the present work is to combine recent developments from the field of model-based gradient boosting in order to develop a new routine, JMalct, that is able to allocate the single candidate variables to the specific sub-models. Therefore, the initial boosting approach by [12] was equipped with a non-cyclical updating scheme proposed by [22] and adaptive step-lengths as investigated in [23]. These two preliminary works are of high importance and their combination is the foundation of our proposed allocation procedure. Furthermore, the JMalct algorithm makes use of a recent random effects correction [24] providing an unbiased estimation of the random effects using gradient boosting and tuning based on probing [25] for faster computation and improved selection properties.
The remainder of this article is structured as follows. In Section 2, the underlying joint model as well as the JMalct boosting algorithm are formulated. Section 3 then applies the proposed method to simulated data with varying numbers of candidate variables. A real-world application is presented in Section 4 and the final section gives a brief summary and outlook.

2. Methods

This section first formulates the considered joint model as well as the basics of the underlying JMboost approach. Afterwards, the new JMalct routine and a thorough discussion of its computational details are provided.

2.1. Model Specification

A joint model consists of two sub-models modeling the longitudinal and time-to-event outcome, respectively. The longitudinal sub-model is specified as a linear mixed model
$$y_{ij} = \eta_{\text{long}}(t_{ij}, x_{\text{long},i}) + \varepsilon_{ij} = \beta_0 + \beta_t t_{ij} + \beta_{\text{long}}^T x_{\text{long},i} + \gamma_{0i} + \gamma_{ti} t_{ij} + \varepsilon_{ij}, \tag{1}$$
with individuals $i = 1, \dots, n$ and corresponding measurements $j = 1, \dots, n_i$. Here, $x_{\text{long},i} \in \mathbb{R}^{p_{\text{long}}}$ denotes a set of longitudinal time-independent covariates and $t_{ij}$ the specific measurement times. The random effects and error components are assumed to be normally distributed, i.e., $(\gamma_{0i}, \gamma_{ti}) \sim N_2(0, Q)$ and $\varepsilon_{ij} \sim N(0, \sigma^2)$.
In the survival sub-model, the individual hazard is modeled by
$$\lambda_i(t) = \lambda_0(t) \exp\left( \eta_{\text{surv}}(x_{\text{surv},i}) + \alpha \, \eta_{\text{long}}(t, x_{\text{long},i}) \right) \tag{2}$$
with the survival predictor $\eta_{\text{surv}}(x_{\text{surv},i}) = \beta_{\text{surv}}^T x_{\text{surv},i}$ including baseline covariates $x_{\text{surv},i} \in \mathbb{R}^{p_{\text{surv}}}$ and the longitudinal predictor $\eta_{\text{long}}$ reappearing in the survival sub-model, this time scaled by the association parameter $\alpha$. The baseline hazard $\lambda_0(t) := \lambda_0 > 0$ is chosen to be constant, as conventional gradient boosting methods tend to struggle with a proper estimation of time-varying baseline hazard functions [26].
Given the sub-models (1) and (2) and assuming independence between the random components, the joint log-likelihood is
$$\ell(\eta_{\text{long}}, \eta_{\text{surv}}, \alpha, \lambda_0, \sigma^2 \mid y, T, \delta) = \sum_{i=1}^{n} \left\{ \sum_{j=1}^{n_i} \log \phi\left( y_{ij} \mid \eta_{\text{long}}(t_{ij}, x_{\text{long},i}), \sigma^2 \right) + \delta_i \log \lambda_i(T_i \mid \eta_{\text{long}}, \eta_{\text{surv}}, \alpha, \lambda_0) - \int_0^{T_i} \lambda_i(t \mid \eta_{\text{long}}, \eta_{\text{surv}}, \alpha, \lambda_0) \, dt \right\}, \tag{3}$$
where, in the longitudinal part, $\phi(\cdot \mid m, v)$ denotes the density of a normal distribution with mean $m$ and variance $v$. In this context, we considered the complete data log-likelihood as it is used solely for allocation purposes. The random effects will be estimated in a less time-consuming way based on a fixed penalization integrated in the random effects base-learner discussed in Section 2.4.
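To make the structure of the log-likelihood (3) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a constant baseline hazard and treats the individual intercept `a` and slope `b` of the longitudinal predictor (fixed plus random effects, so that $\eta_{\text{long}}(t) = a + b t$) as given, in which case the integral has a closed form. All function and key names are hypothetical.

```python
import math

def cumulative_hazard(T, lam0, eta_surv, alpha, a, b):
    """Integral of lam0 * exp(eta_surv + alpha * (a + b * t)) over [0, T]."""
    base = lam0 * math.exp(eta_surv + alpha * a)
    rate = alpha * b
    if abs(rate) < 1e-12:  # hazard is constant in t
        return base * T
    return base * math.expm1(rate * T) / rate

def joint_loglik(subjects, lam0, alpha, sigma2):
    """Complete-data joint log-likelihood summed over individuals.

    Each subject is a dict with the (hypothetical) keys
      y, t     : longitudinal measurements and their times
      a, b     : individual intercept and slope, eta_long(t) = a + b * t
      eta_surv : value of the survival predictor
      T, delta : observed time and event indicator
    """
    total = 0.0
    for s in subjects:
        # longitudinal part: normal log-density of every measurement
        for y_ij, t_ij in zip(s["y"], s["t"]):
            resid = y_ij - (s["a"] + s["b"] * t_ij)
            total += -0.5 * math.log(2 * math.pi * sigma2) - resid ** 2 / (2 * sigma2)
        # survival part: delta * log-hazard at T minus the cumulative hazard
        log_haz = math.log(lam0) + s["eta_surv"] + alpha * (s["a"] + s["b"] * s["T"])
        total += s["delta"] * log_haz
        total -= cumulative_hazard(s["T"], lam0, s["eta_surv"], alpha, s["a"], s["b"])
    return total
```

The closed form is only valid for the linear time trajectory and constant baseline hazard assumed here; other specifications would require numerical integration.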

2.2. The JMboost Concept

In [12], joint models were estimated for the first time using a boosting algorithm, although they addressed a slightly different model to the one described above. The original concept in this publication was based on an alternating technique that used two loops: one outer loop circling through the two sub-predictors and two inner loops that circle through the single base-learners. In a very simple manner, the boosting algorithm can hence be summarized as follows:
  • Initialize $\eta_{\text{long}}$, $\eta_{\text{surv}}$, $\alpha$, $\lambda_0$ and $\sigma^2$;
  • While $m \le \max(m_{\text{stop},l}, m_{\text{stop},s})$:
    If $m \le m_{\text{stop},l}$: perform one boosting cycle to update $\eta_{\text{long}}$;
    If $m \le m_{\text{stop},s}$: perform one boosting cycle to update $\eta_{\text{surv}}$;
    If $m \le m_{\text{stop},l}$: update $\sigma^2$;
    If $m \le m_{\text{stop},s}$: update $\lambda_0$ and $\alpha$.
Both sub-predictors have their own stopping iterations, $m_{\text{stop},l}$ and $m_{\text{stop},s}$, which need to be optimized via a grid search. The latter is computationally quite burdensome, particularly for high numbers of candidate variables.

2.3. The JMalct Boosting Algorithm

The central JMalct algorithm is depicted in Algorithm 1.
Algorithm 1: JMalct
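  • Initialize predictors $\hat{\eta}_{\text{long}}^{[0]}$ and $\hat{\eta}_{\text{surv}}^{[0]}$. Specify base-learners $h_{\text{long},1}, \dots, h_{\text{long},p}$ and $h_{\text{surv},1}, \dots, h_{\text{surv},p}$, as well as $h_{\gamma}$. Initialize association $\hat{\alpha}^{[0]}$ and baseline hazard $\hat{\lambda}_0^{[0]}$. Choose iteration limit $m_{\text{stop}}$ and learning rate $c$, and define the sets $S_{\text{long}}^{[0]} = S_{\text{surv}}^{[0]} := \{1, \dots, p\}$.
  • for $m = 1$ to $m_{\text{stop}}$ do
  • step1: Allocation step

Compute the gradients
$$u_{\text{long}}^{[m]} = \left( u_{\text{long},ij}^{[m]} \right)_{i \le n, \, j \le n_i} = \left( y_{ij} - \hat{\eta}_{\text{long},ij}^{[m-1]} \right)_{i \le n, \, j \le n_i},$$
$$u_{\text{surv}}^{[m]} = \left( u_{\text{surv},i}^{[m]} \right)_{i \le n} = \left( \delta_i - \int_0^{T_i} \hat{\lambda}_i^{[m-1]}(t, \cdot) \, dt \right)_{i \le n}.$$

Fit both gradients separately to the base-learners
$$u_{\text{long}}^{[m]} \xrightarrow{\text{base-learner}} \hat{h}_{\text{long},r}^{[m]}, \quad r \in S_{\text{long}}^{[m-1]},$$
$$u_{\text{surv}}^{[m]} \xrightarrow{\text{base-learner}} \hat{h}_{\text{surv},r}^{[m]}, \quad r \in S_{\text{surv}}^{[m-1]},$$
and select the best performing component for each predictor:
$$r_{\text{long}}^{*} = \underset{r \le p}{\arg\min} \sum_{i,j} \left( u_{\text{long},ij}^{[m]} - \hat{h}_{\text{long},r,ij}^{[m]} \right)^2, \qquad r_{\text{surv}}^{*} = \underset{r \le p}{\arg\min} \sum_{i} \left( u_{\text{surv},i}^{[m]} - \hat{h}_{\text{surv},r,i}^{[m]} \right)^2.$$

Compute the optimal step lengths $\nu_{\text{long}}^{r^*}$, $\nu_{\text{surv}}^{r^*}$ with corresponding likelihood values $\ell_{\text{long}}^{r^*}$, $\ell_{\text{surv}}^{r^*}$ and only update the component resulting in the best joint likelihood improvement:
$$\hat{\eta}_{\text{long}}^{[m]} = \hat{\eta}_{\text{long}}^{[m-1]} + c \, \nu_{\text{long}}^{r^*} \hat{h}_{\text{long},r^*}^{[m]}, \quad \text{if } \ell_{\text{long}}^{r^*} > \ell_{\text{surv}}^{r^*},$$
$$\hat{\eta}_{\text{surv}}^{[m]} = \hat{\eta}_{\text{surv}}^{[m-1]} + c \, \nu_{\text{surv}}^{r^*} \hat{h}_{\text{surv},r^*}^{[m]}, \quad \text{if } \ell_{\text{long}}^{r^*} < \ell_{\text{surv}}^{r^*}.$$

Update the active sets
$$S_{\text{long}}^{[m]} = S_{\text{long}}^{[m-1]} \setminus \{ r_{\text{surv}}^{*} \}, \quad \text{if } \ell_{\text{surv}}^{r^*} > \ell_{\text{long}}^{r^*},$$
$$S_{\text{surv}}^{[m]} = S_{\text{surv}}^{[m-1]} \setminus \{ r_{\text{long}}^{*} \}, \quad \text{if } \ell_{\text{long}}^{r^*} > \ell_{\text{surv}}^{r^*}.$$

  • step2: Update remaining parameters

Perform an additional longitudinal boosting update regarding the random structure:
$$u_{\text{long}}^{[m]} \xrightarrow{\text{base-learner}} \hat{h}_{\gamma}^{[m]}, \qquad \hat{\eta}_{\text{long}}^{[m]} = \hat{\eta}_{\text{long}}^{[m]} + c \, \hat{h}_{\gamma}^{[m]}.$$

Obtain updates for the association by maximizing the joint likelihood:
$$\hat{\alpha}^{[m]} = \underset{\alpha \in \mathbb{R}}{\arg\max} \; \ell(\alpha, \cdot).$$

  • end for
  • Stop the algorithm early based on probing, i.e., when a phantom variable would get selected.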

2.4. Computational Details of the JMalct Algorithm

In the new JMalct algorithm, we only have one cycle. This cycle consists of three steps: in the first one, the base-learner with the best fitting gradient for the longitudinal predictor $\eta_{\text{long}}$ is chosen and the corresponding step length $\nu_{\text{long}}$ is calculated. In the second step, the base-learner with the best fitting gradient for the time-to-event submodel $\eta_{\text{surv}}$ is chosen and the corresponding step length $\nu_{\text{surv}}$ is calculated. These first two steps will be referred to as the G-steps (gradient-steps) in the following. In the third step, referred to as the L-step (likelihood step), the likelihood is calculated for both the best longitudinal base-learner, weighted with the step length $\nu_{\text{long}}$, and the best survival base-learner, weighted with the step length $\nu_{\text{surv}}$. The base-learner performing better in the L-step is then chosen to be updated. The algorithm is summarized in the following overview and depicted in Figure 1. A detailed description is provided below.
  • while $m \le m_{\text{stop}}$:
  • G-step 1
    Fit all base-learners to the longitudinal gradient with regard to $\hat{\eta}_{\text{long}}^{[m]}$;
    Find the best-performer, $\beta_{\text{long}}^{*}$, and corresponding step-length $\nu_{\text{long}}$.
  • G-step 2
    Fit all base-learners to the gradient with regard to $\hat{\eta}_{\text{surv}}^{[m]}$;
    Find the best-performer, $\beta_{\text{surv}}^{*}$, and corresponding step-length $\nu_{\text{surv}}$.
  • L-step
    Fit likelihood for $\eta_{\text{long}}^{*}$ and $\eta_{\text{surv}}^{*}$ with updates from G1 and G2;
    Select the best-performer and update corresponding sub-predictor;
    Remove the selected candidate variable from options to choose for the other predictor (if not performed already).
  • Step 4
    Update $\hat{\alpha}^{[m]}$, $\hat{\sigma}^{2[m]}$ based on the current fit.
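The L-step and the removal of the allocated variable from the competing predictor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, the likelihood values are taken as already computed, and ties are resolved in favor of the survival predictor, which the description above leaves unspecified.

```python
def l_step(r_long, ell_long, r_surv, ell_surv, active_long, active_surv):
    """Decide which sub-predictor is updated in the current iteration.

    r_long, r_surv     : indices of the best base-learner per sub-predictor
    ell_long, ell_surv : joint log-likelihood after the respective candidate update
    active_long/_surv  : sets of variables still available to each sub-predictor
    Returns the chosen sub-predictor and variable; the active sets are
    modified in place.
    """
    if ell_long > ell_surv:
        # variable goes to the longitudinal predictor, so it is no longer
        # a candidate for the survival predictor
        active_surv.discard(r_long)
        return "long", r_long
    # otherwise (including ties, by assumption) the survival predictor is updated
    active_long.discard(r_surv)
    return "surv", r_surv

active_long = {0, 1, 2}
active_surv = {0, 1, 2}
choice = l_step(1, -10.0, 2, -12.0, active_long, active_surv)
```

In this toy call, variable 1 is allocated to the longitudinal predictor and is removed from the survival candidates, while the longitudinal candidate set is left untouched.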
The baseline covariates that enter the allocation process are not assigned to a sub-model in the beginning and therefore have to be considered in two forms. $X_{\text{long}} \in \operatorname{Mat}_{\mathbb{R}}(N, p)$, where $N = \sum_i n_i$, denotes the set of candidate variables arranged as longitudinal covariates, i.e., the rows belonging to the same individual $i$ contain the same cluster-constant measurement $n_i$ times. On the other hand, $X_{\text{surv}} \in \operatorname{Mat}_{\mathbb{R}}(n, p)$ contains the exact same variables as $X_{\text{long}}$ but reduced to just one row per individual in order to fit the corresponding base-learner to the survival gradient. The measurements of one specific covariate $r$ are denoted by $x_{\text{long},r}$ and $x_{\text{surv},r}$, which correspond to the $r$th column of the respective matrix.
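As a small illustration of the two data layouts, the following Python sketch builds the long-format matrix from the one-row-per-individual matrix by repeating each row $n_i$ times. The function name and the toy data are hypothetical.

```python
def expand_to_long(X_surv, n_i):
    """Repeat row i of X_surv n_i[i] times to obtain the (N x p) matrix X_long."""
    if len(X_surv) != len(n_i):
        raise ValueError("need one measurement count per individual")
    X_long = []
    for row, reps in zip(X_surv, n_i):
        # every longitudinal measurement of an individual carries the same
        # cluster-constant baseline covariates
        X_long.extend(list(row) for _ in range(reps))
    return X_long

X_surv = [[0.1, 1.0], [0.2, 0.0], [0.3, 1.0]]  # n = 3 individuals, p = 2
n_i = [2, 1, 3]                                # measurements per individual
X_long = expand_to_long(X_surv, n_i)           # N = 6 rows
```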
Starting values. The regression coefficients underlying the allocation mechanism are necessarily set to zero, i.e., $\beta_{\text{long}}^{[0]} = \beta_{\text{surv}}^{[0]} = 0$. The remaining longitudinal parameters are extracted by an initial linear mixed model fit
$$y = \beta_0 + \beta_t \cdot t + \gamma_0 + \gamma_t \cdot t$$
containing only the intercept as well as time and random effects. For the remaining survival parameters, we chose $\alpha^{[0]} = 0$ and $\hat{\lambda}_0^{[0]} = \sum_i \delta_i / \sum_i T_i$.
Computing the gradients. The gradients $u_{\text{long}}$ and $u_{\text{surv}}$ are a crucial component of the JMalct algorithm. For the longitudinal part, we considered the quadratic loss $\rho(y, \eta) = \frac{1}{2}(y - \eta)^2$ and calculated
$$u_{\text{long}} = -\frac{\partial \rho}{\partial \eta_{\text{long}}}(y, \eta_{\text{long}}) = y - \eta_{\text{long}}$$
as the regular residuals of the longitudinal sub-model, following [9]. The survival gradient was obtained by differentiating the likelihood (3) with respect to $\eta_{\text{surv}}$, i.e., by taking the negative log-likelihood as the loss $\rho$, yielding
$$u_{\text{surv}} = -\frac{\partial \rho}{\partial \eta_{\text{surv}}}(\eta_{\text{surv}}, \cdot) = \left( \delta_i - \int_0^{T_i} \hat{\lambda}_i^{[m-1]}(t, \cdot) \, dt \right)_{i \le n},$$
as the longitudinal part vanishes. This is a nice analogy to the longitudinal gradient, as $u_{\text{surv}}$ represents the martingale residuals of the survival sub-model.
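The starting value for the baseline hazard and the survival gradient can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it assumes the constant baseline hazard and a linear individual trajectory $a_i + b_i t$, for which the integral has a closed form; all names are hypothetical.

```python
import math

def initial_baseline_hazard(T, delta):
    """Starting value: number of events divided by the total observed time."""
    return sum(delta) / sum(T)

def martingale_residuals(T, delta, lam0, eta_surv, alpha, a, b):
    """Survival gradient: delta_i minus the cumulative hazard up to T_i.

    a[i] and b[i] are the individual intercept and slope of the longitudinal
    predictor (assumed linear in time), eta_surv[i] the survival predictor.
    """
    residuals = []
    for T_i, d_i, e_i, a_i, b_i in zip(T, delta, eta_surv, a, b):
        base = lam0 * math.exp(e_i + alpha * a_i)
        rate = alpha * b_i
        if abs(rate) < 1e-12:  # hazard constant over time
            cum_haz = base * T_i
        else:
            cum_haz = base * math.expm1(rate * T_i) / rate
        residuals.append(d_i - cum_haz)
    return residuals

T = [1.0, 2.0, 3.0]
delta = [1, 0, 1]
lam0 = initial_baseline_hazard(T, delta)
# at the starting values (alpha = 0, eta_surv = 0) the residuals sum to zero
u_surv = martingale_residuals(T, delta, lam0, [0.0] * 3, 0.0, [0.0] * 3, [1.0] * 3)
```

With the starting values from above, the residuals sum to zero by construction, which is a convenient sanity check for an implementation.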
Fitting the longitudinal base-learners. The possible fixed effects estimates were obtained by fitting the pre-specified base-learners to the longitudinal and survival gradient. In the longitudinal case, the fixed effects base-learners $h_{\text{long},1}, \dots, h_{\text{long},p}$ were equipped with an additional effect estimate for the time coefficient $\beta_t$, as this variable shall not be subject to the selection and allocation mechanism. Fitting the base-learners is achieved by
$$\hat{h}_{\text{long},r} = S_{\text{long},r} \, u_{\text{long}}, \quad r = 1, \dots, p,$$
with the projection matrix
$$S_{\text{long},r} = \tilde{x}_{\text{long},r} \left( \tilde{x}_{\text{long},r}^T \tilde{x}_{\text{long},r} \right)^{-1} \tilde{x}_{\text{long},r}^T, \quad r = 1, \dots, p,$$
where $\tilde{x}_{\text{long},r} = (1, t, x_{\text{long},r})$ and $t$ denotes the collection of longitudinal measurement times. If the base-learner actually gets selected, the estimates $\hat{\beta}_0$ for the intercept and $\hat{\beta}_t$ for the time effect receive the corresponding updates computed in the fitting process.
Fitting the survival base-learners. Similar to the longitudinal part, the survival base-learner was fitted by applying the corresponding projection matrix to the survival gradient, i.e.,
$$\hat{h}_{\text{surv},r} = S_{\text{surv},r} \, u_{\text{surv}}, \quad r = 1, \dots, p,$$
where the survival gradient $u_{\text{surv}}$ represents the martingale residuals of the time-to-event model. The projection matrix takes the form
$$S_{\text{surv},r} = \tilde{x}_{\text{surv},r} \left( \tilde{x}_{\text{surv},r}^T \tilde{x}_{\text{surv},r} \right)^{-1} \tilde{x}_{\text{surv},r}^T, \quad r = 1, \dots, p,$$
with $\tilde{x}_{\text{surv},r} = (1, x_{\text{surv},r})$. This means that, if the base-learner actually gets selected, the estimate $\hat{\lambda}_0$ for the constant baseline hazard receives the corresponding update computed in the fitting process.
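For a design of the form $(1, x)$, applying the projection matrix is equivalent to a simple least-squares regression of the gradient on the covariate. The Python sketch below fits such a base-learner for every active covariate and selects the one with the smallest residual sum of squares. It is a minimal illustration with hypothetical names; the longitudinal case would additionally include the time column.

```python
def fit_linear_base_learner(x, u):
    """Least-squares fit of the gradient u on the design (1, x).

    Returns intercept, slope, fitted values and residual sum of squares.
    """
    n = len(x)
    x_bar = sum(x) / n
    u_bar = sum(u) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("covariate is constant")
    slope = sum((xi - x_bar) * (ui - u_bar) for xi, ui in zip(x, u)) / sxx
    intercept = u_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    rss = sum((ui - fi) ** 2 for ui, fi in zip(u, fitted))
    return intercept, slope, fitted, rss

def select_best_component(columns, u, active):
    """Fit one base-learner per active covariate and return the best index and fit."""
    fits = {r: fit_linear_base_learner(columns[r], u) for r in active}
    best = min(fits, key=lambda r: fits[r][3])  # smallest residual sum of squares
    return best, fits[best]

columns = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]  # two candidate covariates
u = [1.0, 3.0, 5.0, 7.0]                                # toy gradient
best, (intercept, slope, fitted, rss) = select_best_component(columns, u, {0, 1})
```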
Adaptive step lengths. As the two distinct sub-models affect different parts of the joint likelihood, it may not be sufficient to stick to a fixed learning rate, e.g., $\nu_{\text{long}} = \nu_{\text{surv}} = 0.1$. To ensure that the comparison of potential likelihood improvements is fair, for each selected component, the optimal step length was computed using a basic line search by finding
$$\nu_{\text{long}} = \underset{\nu \in \mathbb{R}^{+}}{\arg\max} \; \ell\left( \hat{\eta}_{\text{long}} + \nu \, \hat{h}_{\text{long},r_{\text{long}}^{*}}, \cdot \right), \qquad \nu_{\text{surv}} = \underset{\nu \in \mathbb{R}^{+}}{\arg\max} \; \ell\left( \hat{\eta}_{\text{surv}} + \nu \, \hat{h}_{\text{surv},r_{\text{surv}}^{*}}, \cdot \right),$$
following [23]. The corresponding maximal likelihood values are denoted by $\ell_{\text{long}}^{*}$ and $\ell_{\text{surv}}^{*}$, which were used to determine the overall best-performing sub-model of each iteration. Once this was determined, the learning rate for the actual update was again scaled by a constant $c < 1$ (here, $c = 0.1$) in order to ensure small updates with weak base-learners.
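A basic line search of this kind can be sketched in Python with a golden-section search. The choice of golden-section search, the search interval and the toy objective are assumptions made for illustration; the text above only states that a basic line search was used.

```python
import math

def golden_section_max(f, lower, upper, tol=1e-8):
    """Maximize a unimodal function f on [lower, upper] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:   # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:         # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    nu = (a + b) / 2
    return nu, f(nu)

# Toy objective standing in for the joint log-likelihood as a function of nu:
# a negative quadratic loss between a gradient u and a fitted base-learner h.
u = [1.0, 2.0, 3.0]
h = [0.5, 1.0, 1.5]
loglik = lambda nu: -sum((ui - nu * hi) ** 2 for ui, hi in zip(u, h))

nu_opt, ell_opt = golden_section_max(loglik, 0.0, 10.0)
c = 0.1                # constant scaling of the learning rate
step = c * nu_opt      # step length actually used for the update
```

In the allocation step, the value `ell_opt` of each sub-predictor would be compared, and only the winning component would be updated with its scaled step length.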
Fitting the random effects base-learner. In general, the random effects base-learner is similar to the formulation found in the appendix of [27]. One major difference is that it is fixed for all iterations and not updated based on the current covariance structure. It is defined through its projection matrix
$$S_{\gamma} = Z C \left( Z^T Z + \lambda_{\text{df}} \right)^{-1} Z^T,$$
where $\lambda_{\text{df}}$ was chosen so that $\operatorname{tr}(S_{\gamma}) = \text{df}$ holds, which fixes the degrees of freedom for the random effects update. In the simulation study, we used $\text{df} = 10$ and determined the corresponding $\lambda_{\text{df}}$ with the internal function mboost:::df2lambda().
The matrix $Z$ denotes the conventional random effects design matrix for intercepts and slopes, i.e.,
$$Z = \operatorname{diag}(Z_1, \dots, Z_n), \qquad Z_i = \begin{pmatrix} 1 & t_{i1} \\ \vdots & \vdots \\ 1 & t_{i n_i} \end{pmatrix}, \quad i = 1, \dots, n,$$
and $C$ is a correction matrix introduced in [28] correcting the random effects update for the candidate variables $x_{\text{long},1}, \dots, x_{\text{long},p}$, which are baseline covariates and thus cluster-constant. A derivation of the correction matrix $C$ can also be found in Appendix A.
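The idea of choosing the penalty so that the trace of the projection matrix equals a target df can be illustrated with a simplified Python sketch. It is not a reimplementation of mboost:::df2lambda(): it assumes a plain ridge penalty $\lambda I$ and ignores the correction matrix $C$, so that the trace reduces to $\sum_k d_k / (d_k + \lambda)$ with $d_k$ the eigenvalues of $Z^T Z$. Because $Z$ is block-diagonal, these are the eigenvalues of the $2 \times 2$ blocks $Z_i^T Z_i$. All names are hypothetical.

```python
import math

def block_eigenvalues(times):
    """Eigenvalues of Z_i^T Z_i for a block Z_i with rows (1, t_ij)."""
    a = float(len(times))           # sum of 1 * 1
    b = sum(times)                  # sum of 1 * t
    c = sum(t * t for t in times)   # sum of t * t
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return max(mid - rad, 0.0), mid + rad

def trace_hat(eigs, lam):
    """Trace of Z (Z^T Z + lam * I)^{-1} Z^T expressed through eigenvalues."""
    return sum(d / (d + lam) for d in eigs if d > 0)

def lambda_for_df(times_per_subject, df, tol=1e-10):
    """Find lam such that the trace equals df (bisection; trace decreases in lam)."""
    eigs = [d for times in times_per_subject for d in block_eigenvalues(times)]
    if not 0 < df < trace_hat(eigs, 0.0):
        raise ValueError("df must lie strictly between 0 and the unpenalized trace")
    lo, hi = 0.0, 1.0
    while trace_hat(eigs, hi) > df:  # expand until the trace drops below df
        hi *= 2
    while hi - lo > tol * max(1.0, hi):
        mid = (lo + hi) / 2
        if trace_hat(eigs, mid) > df:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

times = [[0.0, 1.0, 2.0, 3.0, 4.0]] * 20   # 20 individuals, 5 measurements each
lam_df = lambda_for_df(times, df=10.0)
```

With the correction matrix included, the trace no longer has this simple eigenvalue form, and the search would have to evaluate the trace of the full matrix instead.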
Tuning the hyperparameter $m$ based on probing. Both the step length and the number of iterations can be considered hyperparameters of the boosting algorithm. Since the step length is usually set as constant or, as in this work, determined by an adaptive line search, the overall number of iterations $m$ constitutes the main tuning parameter of the algorithm. While this hyperparameter is usually tuned in a computationally more expensive way by considering the out-of-bag loss, we determined the optimal number of iterations $m^{*}$ with the help of probing. Probing for gradient boosting was introduced by [25]. The pragmatic idea avoids more time-consuming procedures such as cross validation or bootstrapping, which rely on re-fitting the model. For each covariate $x_r$, another variable $\bar{x}_r$ was added to the set of candidate variables, where $\bar{x}_r$ is a random permutation of the observations contained in $x_r$. These additional variables are non-informative by construction and are called probes or shadow variables. Instead of finding the best-performing number of iterations based on a computationally burdensome cross validation, the boosting routine was simply stopped as soon as one of the shadow variables $\bar{x}_r$, i.e., a known-to-be non-informative variable, would get selected. The focus is hence shifted from tuning the algorithm purely based on prediction accuracy (with regard to the test risk) towards a reasonable variable selection.
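The probing mechanism can be sketched in Python as follows. The sketch assumes that covariates are stored column-wise and that the selection step is available as a callback; both the layout and the names are hypothetical and only serve to illustrate the stopping rule.

```python
import random

def add_probes(columns, seed=0):
    """Append one randomly permuted copy (probe) of every covariate column."""
    rng = random.Random(seed)
    probes = []
    for col in columns:
        shadow = list(col)
        rng.shuffle(shadow)   # permutation destroys any link to the outcome
        probes.append(shadow)
    return columns + probes

def is_probe(index, p):
    """Columns p, ..., 2p - 1 are the probes of columns 0, ..., p - 1."""
    return index >= p

def boost_with_probing(select_component, p, m_stop):
    """Run at most m_stop iterations and stop as soon as a probe would be selected.

    select_component(m) is a hypothetical callback returning the column index
    chosen in iteration m.
    """
    selected = []
    for m in range(1, m_stop + 1):
        r = select_component(m)
        if is_probe(r, p):
            break             # the probe itself is not added to the model
        selected.append(r)
    return selected

columns = [[1, 2, 3, 4], [5, 6, 7, 8]]
extended = add_probes(columns)
# toy selection sequence: a probe (index 3) would be chosen in iteration 4
path = boost_with_probing(lambda m: [0, 1, 0, 3, 1][m - 1], p=2, m_stop=5)
```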
Computational complexity and asymptotic behavior. Due to the artificial construction of the algorithm and the comparatively complex model class, theoretical analysis regarding complexity and asymptotic behavior is quite a challenging task. The model-based boosting related literature is still little-developed with respect to theoretical investigations, but thorough analyses in simpler cases were carried out in [29] for the quadratic loss, where exponentially fast bias reduction could be proven, as well as for more general settings in [30,31]. Consistency properties for very-high-dimensional linear models were obtained in [32] and, regarding JMalct, we refer to the following section, where further insights with respect to the algorithm’s complexity are given based on numerical evaluations. In addition, we experienced no convergence issues in simulations and applications.

3. Simulation Study

The JMalct algorithm was evaluated by conducting a simulation study where data according to the assumed generating process specified in Section 2.1 were simulated and models were subsequently fitted using JMalct and, if sensible, JM [3] as a benchmark and well-established approach. In addition, we considered the combination JMalct+JM, where JMalct was used solely for allocating the variables, which were then refitted by JM according to the allocation obtained from JMalct. After briefly highlighting the single scenarios, the simulation section evaluates allocation properties and the accuracy of estimates, as well as the quality of the prediction and the computational burden.

3.1. Setup

We simulated data according to the model specification in Section 2.1 with $n = 500$ and $n_i = 5$ using inversion sampling. The pre-specified true parameter values are
$$\beta_0 = 1, \quad \beta_t = 1.5, \quad \beta_{\text{long}}^T = (1, 2, 1, 2), \quad \beta_{\text{surv}}^T = (0.3, 0.5, 0.3, 0.5), \quad \alpha = 0.1$$
with variance components
$$\sigma = 0.1, \qquad Q = \begin{pmatrix} \tau_0^2 & 0 \\ 0 & \tau_t^2 \end{pmatrix}, \qquad \tau_0 = 2, \quad \tau_t = 0.3.$$
The entries of the covariate vectors $x_{\text{long},i}$ and $x_{\text{surv},i}$ were drawn independently from the uniform distribution $U([-0.1, 0.1])$. In addition to the informative covariates with effects $\beta_{\text{long}}$ and $\beta_{\text{surv}}$, the total set of covariates was expanded with a varying number $p_{\text{non-inf}}$ of non-informative noise variables. The baseline hazard was chosen as $\lambda_0(t) \equiv 1$ and, given the censoring mechanism described in Algorithm A1 in Appendix B, the chosen parameter values result in an average censoring rate of 50%. All of the parameters were specified in a way to obtain reasonably distributed event times $T$.
Overall, we considered four scenarios with varying numbers of additional noise variables $p_{\text{non-inf}}$, yielding overall dimensions $p \in \{10, 25, 50, 100\}$. In each scenario, 100 independent data sets were generated and models were fitted using the various routines. The results were then summarized over all 100 independent simulation runs.
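For a constant baseline hazard and a linear individual trajectory $a_i + b_i t$, inversion sampling of the event times has a closed form, which the following Python sketch illustrates. It solves $\Lambda_i(T) = -\log u$ for a uniform draw $u$; censoring (Algorithm A1) is deliberately omitted, and the function name and example values are hypothetical.

```python
import math
import random

def draw_event_time(u, lam0, eta_surv, alpha, a, b):
    """Solve Lambda(T) = -log(u) for T, with hazard lam0 * exp(eta_surv + alpha * (a + b * t)).

    Returns math.inf if the cumulative hazard never reaches -log(u),
    which can happen when alpha * b is negative.
    """
    target = -math.log(u)
    base = lam0 * math.exp(eta_surv + alpha * a)
    rate = alpha * b
    if abs(rate) < 1e-12:   # constant hazard: exponential event time
        return target / base
    arg = 1.0 + rate * target / base
    if arg <= 0:            # cumulative hazard is bounded below the target
        return math.inf
    return math.log(arg) / rate

rng = random.Random(1)
u = 1.0 - rng.random()      # uniform on (0, 1], avoids log(0)
T_event = draw_event_time(u, lam0=1.0, eta_surv=0.0, alpha=0.1, a=1.0, b=1.5)
```

Here `a` and `b` would collect the fixed and random intercept and slope of the longitudinal predictor for one individual.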

3.2. Selection and Allocation

In order to assess allocation, we considered the criteria of correctly allocated (CA) and incorrectly allocated (IA) variables per predictor, as well as the share of false positives (FPs). Precisely, CA$_{\text{long}}$ is the share of longitudinal variables that are correctly assigned to the longitudinal predictor, and IA$_{\text{long}}$ is the share of survival variables that are falsely assigned to the longitudinal predictor; CA$_{\text{surv}}$ and IA$_{\text{surv}}$ are defined analogously. FPs, on the other hand, denote the share of wrongly selected noise variables regardless of which predictor they are assigned to.
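These criteria can be computed from index sets, as in the following Python sketch. It assumes that the truly longitudinal and truly survival variables are disjoint and that the false positive share is taken relative to the number of noise variables; the denominator for FPs is an assumption, and all names are hypothetical.

```python
def allocation_metrics(true_long, true_surv, sel_long, sel_surv, n_noise):
    """Shares of correctly/incorrectly allocated variables and of false positives."""
    true_long, true_surv = set(true_long), set(true_surv)
    sel_long, sel_surv = set(sel_long), set(sel_surv)
    informative = true_long | true_surv
    # noise variables selected into either predictor
    false_pos = (sel_long | sel_surv) - informative
    return {
        "CA_long": len(sel_long & true_long) / len(true_long),
        "IA_long": len(sel_long & true_surv) / len(true_surv),
        "CA_surv": len(sel_surv & true_surv) / len(true_surv),
        "IA_surv": len(sel_surv & true_long) / len(true_long),
        "FP": len(false_pos) / n_noise,
    }

metrics = allocation_metrics(
    true_long={0, 1, 2, 3},
    true_surv={4, 5, 6, 7},
    sel_long={0, 1, 2, 3, 5},   # one survival variable wrongly allocated
    sel_surv={4, 6, 9},         # variable 9 is a noise variable
    n_noise=2,
)
```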
Table 1 depicts allocation and selection properties obtained for the different simulation scenarios. While, for the longitudinal predictor, variables get allocated perfectly, the survival part shows less ideal but still satisfactory results. There are various possible explanations for this behavior. On the one hand, the simulated signal is less strong for the survival effects due to the chosen parameter values, which, in general, increases the chance of false negatives. On the other hand, the longitudinal part of the likelihood carries more information, as there are more longitudinal measurements available than event times, which increases the risk of incorrect allocations. Finally, survival variables being incorrectly allocated to the longitudinal predictor is inherently more probable than vice versa as the longitudinal predictor also appears in the survival sub-model and the model therefore still accounts for the variables’ impact on the time-to-event outcome. While the allocation properties are roughly constant with a varying number of dimensions, the false positives rate clearly diminishes with more and more noise variables.

3.3. Estimation Accuracy

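The accuracy of coefficient estimation is shown in Table 2, separated for each sub-model. We considered the mean squared error (mse) computed as
$$\text{mse}_{\text{long}} = \left\| \theta_{\text{long}} - \hat{\theta}_{\text{long}} \right\|^2, \qquad \text{mse}_{\text{surv}} = \left\| \theta_{\text{surv}} - \hat{\theta}_{\text{surv}} \right\|^2,$$
where $\theta_{\text{long}} = (\beta_0, \beta_t, \beta_{\text{long}}^T)^T$ and $\theta_{\text{surv}} = (\lambda_0, \alpha, \beta_{\text{surv}}^T)^T$. The lower half of the table discards all entries of the estimates $\hat{\beta}_{\text{long}}$ and $\hat{\beta}_{\text{surv}}$ referring to non-informative variables and thus only measures the accuracy of the effects that are known to be informative.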
It is evident that the accuracy of JM is heavily influenced by the number of noise variables, whereas the routines relying on the allocation and selection mechanism by JMalct stay fairly robust. As usual for regularization techniques, JMalct’s estimates for informative effects are slightly biased due to the early stopping of the algorithm. The combination JMalct+JM, however, stays unaffected by the number of noise variables and is, at least for the longitudinal predictor, the most accurate. The main hindrance of this approach is that the estimation accuracy of survival effects is slightly more influenced by false negatives occurring in the selection mechanism, which is why the combination lags behind its two competitors regarding precision for the survival sub-model.

3.4. Predictive Performance

Boosting is a tool primarily designed for prediction, and thus the predictive performance of JMalct and how it compares to established routines are of interest. Since our underlying joint model focuses on the time-to-event outcome as the main endpoint, we evaluated the prediction accuracy regarding the predicted and actual event time based on additional test data with $n_{\text{test}} = 1000$ individuals and $n_i = 5$. We considered the loss
$$L(T, \hat{T}) = \left| \log T - \log \hat{T} \right|, \qquad \hat{T} = E[T],$$
as the absolute deviation between the predicted and actual event time, $\hat{T}$ and $T$, respectively, on a log-scale [33].
Figure 2 depicts the values of L over the varying numbers of additional noise variables. The prediction is comparable among the three routines in low-dimensional settings. However, as expected, it worsens for JM when the dimensions increase. Both JMalct and the combination JMalct+JM rely on the selection conducted by JMalct and hence produce sparse models, which is why their quality of prediction stays fairly equal even in higher dimensions.

3.5. Computational Effort

Table 3 shows the elapsed computation time measured in seconds, where each simulation run was carried out on a 2 × 2.66 GHz 6-core Intel Xeon CPU (64 GB RAM). Most obviously, JM becomes tremendously more burdensome as the dimensions increase. The constant or even slightly decreasing computation times for JMalct over various dimensions might be surprising at first, as the computation time of component-wise procedures such as gradient boosting tends to increase at least linearly with additional covariates. However, as the overall stopping criterion is based on probing, the algorithm tends to stop earlier in high-dimensional settings, since more non-informative probes are available, which increases the probability that one of them gets selected early in the process. Due to the sparsity obtained by JMalct, the combination JMalct+JM also profits from the allocation and selection mechanism regarding the computational effort, as JM then runs considerably faster.

3.6. Complexity

While a formulation of explicit complexity results for the JMalct routine is quite technical in general, as stated in Section 2.4, simulations can give insight into how the algorithm scales with varying numbers of observations and covariates. Therefore, we considered the same setup as above with different values for $n$ and $p$ and ran the JMalct routine 100 times independently for $m_{\text{stop}} = 100$ iterations without early stopping. Figure 3 depicts the averaged computation times for increasing values of $n$ and $p$.
The left panel depicts the square root of the computation times with $p = 3$ fixed, and thus reveals run times growing quadratically with the number of observations. On the other hand, with $n = 100$ fixed, the run times clearly exhibit a linear relationship with the total number of covariates. This is to be expected, as further candidate variables simply add to the inner loops of univariate base-learner fits and, thus, the algorithm is capable of fitting data sets with almost arbitrarily high dimensions.

4. 1994 AIDS Study

The 1994 AIDS data [34] were originally collected in order to compare two antiretroviral drugs based on a collective of HIV-positive patients. They include 1405 longitudinal observations of 467 individuals, of whom 188 died during the course of the study. The main longitudinal outcome is each patient’s repeatedly measured CD4 cell count. CD4 cells decline in HIV-positive patients and are a well-known proxy for disease progression, and are therefore of high interest. Apart from the CD4 cell count as the longitudinal outcome, death as the time-to-event outcome and time $t$ itself, four additional baseline variables were observed: drug (treatment group), gender, AZT (indicator of whether a previous AZT therapy failed) and AIDS (indicator of whether AIDS is diagnosed). The structure of the data is depicted in Table 4.
Figure 4 depicts the coefficient paths computed by the JMalct algorithm and the corresponding allocation process. The variable AIDS is selected into the longitudinal sub-model right away and frequently updated. This is not surprising, as the diagnosis of AIDS is by definition partly linked to the CD4 cell count. The remaining variables drug and gender were also allocated to the longitudinal sub-model, albeit with smaller effects, whereas AZT was selected into the survival predictor, indicating an increased risk of death for patients with failed AZT therapy.

5. Discussion and Outlook

Finding adequate data-driven allocation mechanisms for joint models is a very important task, as modeling possibilities increase exponentially with a growing number of covariates. Until today, decisions about the specific model choice have to be made based on background knowledge or by conducting a preliminary analysis, and both of these approaches can be seen as rather unsatisfactory.
The JMalct algorithm combines recent findings from the field of gradient boosting to construct a fast-performing allocation and selection mechanism for a joint model focusing on time-to-event data as the primary outcome. A simulation study revealed that the selection and allocation mechanism yields promising results while preserving the well-known advantages from gradient boosting. Therefore, it is advised to use the JMalct algorithm in its current form in advance of the actual analysis in order to determine an allocation of covariates, which is then fitted using convenient frameworks such as JM.
Possible ways of improving the accuracy of estimates and allocation properties regarding the survival sub-model could be based on additional weighting rules. As the longitudinal part contributes substantially more to the likelihood due to the higher number of observations, weighting the two sub-models solely by different step lengths may not be sufficient. Promising ideas are initial weightings of the sub-models using various maximum likelihood estimations or focusing on a relative likelihood improvement in the selection step.
Another aspect concerns variable selection and tuning the algorithm via probing. Although probing leads to fast runtimes and good selection properties, the procedure comes with disadvantages. Especially in higher dimensions, the probability that one shadow variable is informative simply by chance increases, leading to very early stopping. An alternative could rely on stability selection [35,36], which has been shown to be helpful in other cases [37].
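To make the probing idea concrete, the following is a minimal sketch using plain component-wise L2 boosting on a single linear predictor. It is not the JMalct algorithm itself: the function name `boost_with_probing`, the step length, the toy data and the choice of permuted copies as shadow variables are all assumptions made for illustration. The loop stops as soon as a shadow variable would be selected, which also shows why many shadow variables can trigger very early stopping.

```python
import numpy as np

def boost_with_probing(X, y, step=0.1, max_iter=500, seed=0):
    """Component-wise L2 boosting that stops at the first selected shadow variable.

    Illustrative sketch only; names and defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Shadow variables: permuted copies of each column. They keep the marginal
    # distribution of the covariate but carry no information about y.
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    Z = np.column_stack([X, shadows])
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # standardize all 2p columns
    coef = np.zeros(2 * p)
    resid = y - y.mean()
    n_updates = 0
    for _ in range(max_iter):
        # Least-squares coefficient of each single column on the residuals
        fits = Z.T @ resid / n
        # With standardized columns, the largest |fit| gives the largest
        # reduction in the residual sum of squares
        best = int(np.argmax(np.abs(fits)))
        if best >= p:
            break  # a shadow variable won the selection step: stop here
        coef[best] += step * fits[best]
        resid = resid - step * fits[best] * Z[:, best]
        n_updates += 1
    return coef[:p], n_updates

# Toy data (hypothetical): two informative covariates and eight noise covariates
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)
coef, n_updates = boost_with_probing(X, y)
selected = np.flatnonzero(coef)
```

In this sketch the stopping iteration is determined by the data alone, so no cross-validation over the number of iterations is needed. The price is that the stopping point depends on the random permutation used to create the shadow variables.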
Furthermore, the difference in the proportion of falsely selected variables between the longitudinal and survival sub-predictor could be an inherent joint modeling problem and should also be the subject of future analysis. Further research is also warranted on theoretical insights, as it remains unclear whether the existing findings on the consistency of boosting algorithms [32,38] also hold for the adapted versions for boosting joint models.
In conclusion, the JMalct algorithm represents a promising statistical inference scheme for joint models that also provides a starting point for a much wider framework of boosting joint models, covering a great range of potential models and types of predictor effects.

Author Contributions

Conceptualization, C.G. and E.B.; methodology, C.G., A.M. and E.B.; software, C.G. and E.B.; formal analysis, C.G.; investigation, C.G.; writing—original draft preparation, C.G.; writing—review and editing, E.B. and A.M.; project administration, E.B.; funding acquisition, E.B. All authors have read and agreed to the published version of the manuscript.


Funding

The work on this article was supported by the DFG (Number 426493614) and the Volkswagen Foundation (Freigeist Fellowship).

Data Availability Statement

All code and data required to reproduce the findings of this article are available.


Acknowledgments

We further acknowledge the support by the Open Access Publication Funds of the University of Göttingen.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Correction Matrix C

Due to the separated updating process for the random effects, it may be necessary to adjust the estimates for possible correlations with cluster-constant covariates using the correction matrix $C$. The following derivation is a special case of the more general version proposed in [28]. For the correction of the random intercepts $\tilde{\gamma}_0 = (\gamma_{01}, \ldots, \gamma_{0n})^T$ and random slopes $\tilde{\gamma}_t = (\gamma_{t1}, \ldots, \gamma_{tn})^T$ with the baseline covariates $X_{\text{surv}}$ defined in Section 2.4, consider the residual generating matrix
$$C_A = I_n - X_{\text{surv}} \left( X_{\text{surv}}^T X_{\text{surv}} \right)^{-1} X_{\text{surv}}^T$$
and subsequently $C_B = \operatorname{diag}(C_A, C_A)$, so that the product $C_B \tilde{\gamma}$ with $\tilde{\gamma} = (\tilde{\gamma}_0^T, \tilde{\gamma}_t^T)^T$ corrects the random intercepts $\tilde{\gamma}_0$ and slopes $\tilde{\gamma}_t$ for any covariates contained in the corresponding matrix $X_{\text{surv}}$ by removing the orthogonal projections of the given random effect estimates onto the subspace generated by the covariates $X_{\text{surv}}$. This ensures that the coefficient estimate for the random effects is uncorrelated with any observed covariate. The final correction matrix $C$ is obtained by
$$C = P^{-1} C_B P,$$
where $P$ is a permutation matrix mapping $\gamma = (\gamma_{01}, \gamma_{t1}, \ldots, \gamma_{0n}, \gamma_{tn})^T$ to
$$P \gamma = \tilde{\gamma}$$
and thus accounts for the usual ordering of the random effects in mixed-model frameworks.
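As an illustration of this construction, the following NumPy sketch builds the residual generating matrix, the block-diagonal matrix and the permutation matrix, and then checks numerically that the corrected random effects are orthogonal to the covariates. The function name `correction_matrix`, the toy dimensions and the randomly drawn covariates are assumptions for illustration, and the covariate matrix is assumed to have one row per individual and full column rank.

```python
import numpy as np

def correction_matrix(X_surv):
    """Return C = P^{-1} C_B P and the permutation matrix P.

    X_surv: (n, p) array of baseline covariates, one row per individual
    (assumed to have full column rank). Names are hypothetical.
    """
    n = X_surv.shape[0]
    # Residual generating matrix C_A = I_n - X (X^T X)^{-1} X^T
    C_A = np.eye(n) - X_surv @ np.linalg.solve(X_surv.T @ X_surv, X_surv.T)
    # C_B = diag(C_A, C_A) acts on the stacked vector (intercepts, slopes)
    C_B = np.kron(np.eye(2), C_A)
    # P maps the interleaved ordering (g_01, g_t1, ..., g_0n, g_tn)
    # to the stacked ordering (g_01, ..., g_0n, g_t1, ..., g_tn)
    order = np.concatenate([np.arange(0, 2 * n, 2), np.arange(1, 2 * n, 2)])
    P = np.eye(2 * n)[order]
    # For a permutation matrix the inverse equals the transpose
    return P.T @ C_B @ P, P

# Toy example with hypothetical values
rng = np.random.default_rng(1)
n = 6
X_surv = rng.normal(size=(n, 2))
gamma = rng.normal(size=2 * n)       # interleaved intercepts and slopes
C, P = correction_matrix(X_surv)
gamma_tilde = P @ (C @ gamma)        # corrected effects in stacked ordering
intercepts, slopes = gamma_tilde[:n], gamma_tilde[n:]
# Both products are numerically zero: the corrected effects are orthogonal
# to every column of X_surv
ortho_intercepts = X_surv.T @ intercepts
ortho_slopes = X_surv.T @ slopes
```

Because the residual generating matrix is a projection, applying the correction twice gives the same result as applying it once.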

Appendix B. Simulation Algorithm

The following algorithm is used to generate data in Section 3.
Algorithm A1: simJM
  • Choose integers $n$, $n_i$ and parameter values $\beta_0$, $\beta_t$, $\beta_{\text{long}}$, $\beta_{\text{surv}}$ and $\alpha$ with variance components $\sigma$ and $Q$. Specify a baseline hazard $\lambda_0(t)$.
  • Generate $n \cdot n_i$ longitudinal measurement times mimicking yearly appointments in the following way:
    Sample $d_{ij} \sim U(\{1, \ldots, 365\})$ and set $\tilde{t}_{ij} := (j - 1) \cdot 365 + d_{ij}$ for $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$.
    For each $i$, shift observation times to $\tilde{t}_{i1} = 0$.
    Standardize time points to the unit interval by $t_{ij} := \tilde{t}_{ij} / (n_i \cdot 365)$.
  • Generate covariate vectors $x_{\text{long},i}$, $x_{\text{surv},i}$ for $i = 1, \ldots, n$ corresponding to the lengths of $\beta_{\text{long}}$ and $\beta_{\text{surv}}$.
  • Calculate the longitudinal response
$$y_{ij} = \underbrace{\beta_0 + \beta_t t_{ij} + \beta_{\text{long}}^T x_{\text{long},i} + \gamma_{0i} + \gamma_{ti} t_{ij}}_{\eta_{\text{long}}(t_{ij},\, x_{\text{long},i})} + \varepsilon_{ij}$$

with $\varepsilon_{ij} \sim N(0, \sigma^2)$ and $(\gamma_{0i}, \gamma_{ti}) \sim N_2(0, Q)$. Define hazard functions
$$\lambda_i(t) = \lambda_0(t) \exp\left( \beta_{\text{surv}}^T x_{\text{surv},i} + \alpha\, \eta_{\text{long}}(t, x_{\text{long},i}) \right)$$

as described in Section 2.1.
  • Draw event times by generating random numbers $u_i \sim U([0, 1])$ and setting
$$T_i^* := F_i^{-1}(u_i), \qquad F_i(t) = 1 - \exp\left( -\int_0^t \lambda_i(s)\, ds \right),$$

according to inversion sampling.
  • Censor by setting $T_i := \min(T_i^*, t_{i n_i})$ to obtain censored data with censoring indicator $\delta_i := 1(T_i^* \le t_{i n_i})$ and obtain the observed survival outcome $(T, \delta) = (T_i, \delta_i)_{i = 1, \ldots, n}$.
  • Delete all longitudinal observations corresponding to times $t_{ij} > T_i$ for every $i$.
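The steps of the simulation algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `sim_jm`, all parameter values and the standard normal covariates are hypothetical, the baseline hazard is assumed to be constant so that the inverse of $F_i$ has a closed form, and "shifting" is read as subtracting each individual's first measurement time.

```python
import numpy as np

def sim_jm(n, n_i, beta_0, beta_t, beta_long, beta_surv, alpha,
           sigma, Q, lambda_0=1.0, seed=0):
    """Simulate joint-model data; lambda_0 is a constant baseline hazard (assumption)."""
    rng = np.random.default_rng(seed)

    # Yearly appointments: one uniformly drawn day within each year
    d = rng.integers(1, 366, size=(n, n_i))
    t_raw = np.arange(n_i) * 365 + d
    t_raw = t_raw - t_raw[:, [0]]        # shift so the first visit is at time 0
    t = t_raw / (n_i * 365)              # standardize to the unit interval

    # Baseline covariates (standard normal draws are an assumption)
    x_long = rng.normal(size=(n, len(beta_long)))
    x_surv = rng.normal(size=(n, len(beta_surv)))

    # Random intercepts and slopes, then the longitudinal response
    gamma = rng.multivariate_normal(np.zeros(2), Q, size=n)
    a = beta_0 + x_long @ beta_long + gamma[:, 0]   # intercept of eta_long
    b = beta_t + gamma[:, 1]                        # time slope of eta_long
    y = a[:, None] + b[:, None] * t + rng.normal(scale=sigma, size=(n, n_i))

    # Inversion sampling. With constant lambda_0 the hazard is k * exp(rate * t),
    # so the cumulative hazard is k * (exp(rate * t) - 1) / rate.
    k = lambda_0 * np.exp(x_surv @ beta_surv + alpha * a)
    rate = alpha * b
    e = -np.log1p(-rng.uniform(size=n))  # equals -log(1 - u_i)
    arg = 1.0 + rate * e / k
    with np.errstate(divide="ignore", invalid="ignore"):
        # arg <= 0 means the cumulative hazard never reaches e: no event occurs
        T_star = np.where(rate == 0, e / k,
                          np.where(arg > 0, np.log(arg) / rate, np.inf))

    # Censor at the last scheduled measurement time
    t_last = t[:, -1]
    T = np.minimum(T_star, t_last)
    delta = (T_star <= t_last).astype(int)

    # Drop longitudinal observations taken after the observed time
    keep = t <= T[:, None]
    ids = np.repeat(np.arange(n), n_i).reshape(n, n_i)
    long_data = {"id": ids[keep], "t": t[keep], "y": y[keep]}
    surv_data = {"T": T, "delta": delta, "x_long": x_long, "x_surv": x_surv}
    return long_data, surv_data

# Hypothetical parameter values, chosen only to make the sketch runnable
long_data, surv_data = sim_jm(
    n=100, n_i=5, beta_0=1.0, beta_t=0.5,
    beta_long=np.array([1.0, -0.5]), beta_surv=np.array([0.5]),
    alpha=0.3, sigma=0.5, Q=np.array([[0.5, 0.0], [0.0, 0.2]]),
)
```

For a non-constant baseline hazard the inverse of $F_i$ is generally not available in closed form, and the inversion step would have to be replaced by numerical integration and root finding.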


References

  1. Wulfsohn, M.S.; Tsiatis, A.A. A Joint Model for Survival and Longitudinal Data Measured with Error. Biometrics 1997, 53, 330.
  2. Rizopoulos, D. Joint Models for Longitudinal and Time-to-Event Data: With Applications in R; Chapman & Hall/CRC Biostatistics Series; CRC Press: Boca Raton, FL, USA, 2012; Volume 6.
  3. Rizopoulos, D. JM: An R Package for the Joint Modelling of Longitudinal and Time-to-Event Data. J. Stat. Softw. 2010, 35, 1–33.
  4. Philipson, P.; Sousa, I.; Diggle, P.J.; Williamson, P.; Kolamunnage-Dona, R.; Henderson, R.; Hickey, G.L. JoineR: Joint Modelling of Repeated Measurements and Time-to-Event Data; R Package Version 1.2.6.; Springer: Berlin, Germany, 2018.
  5. Rizopoulos, D. The R Package JMbayes for Fitting Joint Models for Longitudinal and Time-to-Event Data Using MCMC. J. Stat. Softw. 2016, 72, 1–45.
  6. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning Theory, Bari, Italy, June 28–1 July 1996; Morgan Kaufmann: San Francisco, CA, USA, 1996; pp. 148–156.
  7. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  8. Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting (with discussion). Ann. Stat. 2000, 28, 337–407.
  9. Bühlmann, P.; Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. Stat. Sci. 2007, 22, 477–505.
  10. Mayr, A.; Binder, H.; Gefeller, O.; Schmid, M. The Evolution of Boosting Algorithms - From Machine Learning to Statistical Modelling. Methods Inf. Med. 2014, 53, 419–427.
  11. Mayr, A.; Fenske, N.; Hofner, B.; Kneib, T.; Schmid, M. Generalized additive models for location, scale and shape for high dimensional data-a flexible approach based on boosting. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2012, 61, 403–427.
  12. Waldmann, E.; Taylor-Robinson, D.; Klein, N.; Kneib, T.; Pressler, T.; Schmid, M.; Mayr, A. Boosting joint models for longitudinal and time-to-event data. Biom. J. 2017, 59, 1104–1121.
  13. Griesbach, C.; Groll, A.; Bergherr, E. Joint Modelling Approaches to Survival Analysis via Likelihood-Based Boosting Techniques. Comput. Math. Methods Med. 2021, 2021, 4384035.
  14. Tutz, G.; Binder, H. Generalized Additive Models with Implicit Variable Selection by Likelihood-Based Boosting. Biometrics 2006, 62, 961–971.
  15. Rappl, A.; Mayr, A.; Waldmann, E. More than one way: Exploring the capabilities of different estimation approaches to joint models for longitudinal and time-to-event outcomes. Int. J. Biostat. 2021, 18, 127–149.
  16. He, Z.; Tu, W.; Wang, S.; Fu, H.; Yu, Z. Simultaneous Variable Selection for Joint Models of Longitudinal and Survival Outcomes. Biometrics 2015, 71, 178–187.
  17. Chen, Y.; Wang, Y. Variable selection for joint models of multivariate longitudinal measurements and event time data. Stat. Med. 2017, 36, 3820–3829.
  18. Xie, Y.; He, Z.; Tu, W.; Yu, Z. Variable selection for joint models with time-varying coefficients. Stat. Methods Med. Res. 2019, 29, 309–322.
  19. Tang, A.M.; Zhao, X.; Tang, N.S. Bayesian variable selection and estimation in semiparametric joint models of multivariate longitudinal and survival data. Biom. J. 2017, 59, 57–78.
  20. Andrinopoulou, E.R.; Rizopoulos, D. Bayesian shrinkage approach for a joint model of longitudinal and survival outcomes assuming different association structures. Stat. Med. 2016, 35, 4813–4823.
  21. Yi, F.; Tang, N.; Sun, J. Simultaneous variable selection and estimation for joint models of longitudinal and failure time data with interval censoring. Biometrics 2022, 78, 151–164.
  22. Thomas, J.; Mayr, A.; Bischl, B.; Schmid, M.; Smith, A.; Hofner, B. Gradient boosting for distributional regression: Faster tuning and improved variable selection via noncyclical updates. Stat. Comput. 2017, 28, 673–687.
  23. Zhang, B.; Hepp, T.; Greven, S.; Bergherr, E. Adaptive Step-Length Selection in Gradient Boosting for Generalized Additive Models for Location, Scale and Shape. Comput. Stat. 2022, 37, 2295–2332.
  24. Griesbach, C.; Säfken, B.; Waldmann, E. Gradient boosting for linear mixed models. Int. J. Biostat. 2021, 17, 317–329.
  25. Hepp, T.; Thomas, J.; Mayr, A.; Bischl, B. Probing for Sparse and Fast Variable Selection with Model-Based Boosting. Comput. Math. Methods Med. 2017, 2017, 1421409.
  26. Hofner, B. Variable Selection and Model Choice in Survival Models with Time-Varying Effects. Diploma Thesis, Ludwig-Maximilians-Universität München, Munich, Germany, 2008.
  27. Kneib, T.; Hothorn, T.; Tutz, G. Variable Selection and Model Choice in Geoadditive Regression Models. Biometrics 2009, 65, 626–634.
  28. Griesbach, C.; Groll, A.; Bergherr, E. Addressing cluster-constant covariates in mixed effects models via likelihood-based boosting techniques. PLoS ONE 2021, 16, e0254178.
  29. Bühlmann, P.; Yu, B. Boosting With the L2 Loss. J. Am. Stat. Assoc. 2003, 98, 324–339.
  30. Bissantz, N.; Hohage, T.; Munk, A.; Ruymgaart, F. Convergence Rates of General Regularization Methods for Statistical Inverse Problems and Applications. SIAM J. Numer. Anal. 2007, 45, 2610–2636.
  31. Yao, Y.; Rosasco, L.; Caponnetto, A. On Early Stopping in Gradient Descent Learning. Constr. Approx. 2007, 26, 289–315.
  32. Bühlmann, P. Boosting for High-dimensional Linear Models. Ann. Stat. 2006, 34, 559–583.
  33. Korn, E.; Simon, R. Measures of explained variation for survival data. Stat. Med. 1990, 9, 487–503.
  34. Abrams, D.I.; Goldman, A.I.; Launer, C.; Korvick, J.A.; Neaton, J.D.; Crane, L.R.; Grodesky, M.; Wakefield, S.; Muth, K.; Kornegay, S.; et al. A Comparative Trial of Didanosine or Zalcitabine after Treatment with Zidovudine in Patients with Human Immunodeficiency Virus Infection. N. Engl. J. Med. 1994, 330, 657–662.
  35. Meinshausen, N.; Bühlmann, P. Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2010, 72, 417–473.
  36. Shah, R.D.; Samworth, R.J. Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2012, 75, 55–80.
  37. Mayr, A.; Hofner, B.; Schmid, M. Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC Bioinform. 2016, 17, 288.
  38. Zhang, T.; Yu, B. Boosting with early stopping: Convergence and consistency. Ann. Stat. 2005, 33, 1538–1579.
Figure 1. Schematic overview of the JMalct procedure.
Figure 2. Comparison of the prediction error (L) of the survival part for the varying numbers of non-informative noise variables.
Figure 3. Average JMalct run times for varying numbers of clusters n and covariates p. Dashed gray lines depict the corresponding linear model fit. Left panel shows square root times to highlight the quadratic relationship.
Figure 4. Coefficient progression in both sub-models for AIDS data. The variable AZT was assigned to $\eta_{\text{surv}}$, and the rest to $\eta_{\text{long}}$.
Table 1. Share of correctly allocated (CA) and incorrectly allocated (IA) variables for each predictor, as well as the false positive rate. Values are averaged over 100 independent simulation runs of each scenario.
Table 2. Mean squared error for longitudinal ($\text{mse}_{\text{long}}$) and survival ($\text{mse}_{\text{surv}}$) coefficients averaged over 100 independent simulation runs. Regular parameter estimates are indicated by $\theta$, whereas $\theta_{\text{n.inf}}$ denotes the second half, where non-informative effects are neglected.

| | P | JMalct $\text{mse}_{\text{long}}$ | JMalct $\text{mse}_{\text{surv}}$ | JM $\text{mse}_{\text{long}}$ | JM $\text{mse}_{\text{surv}}$ | JMalct+JM $\text{mse}_{\text{long}}$ | JMalct+JM $\text{mse}_{\text{surv}}$ |
|---|---|---|---|---|---|---|---|
| $\theta$ | 10 | 0.497 | 0.343 | 0.713 | 0.489 | 0.342 | 0.453 |
| $\theta_{\text{n.inf}}$ | 10 | 0.486 | 0.323 | 0.302 | 0.254 | 0.288 | 0.423 |
Table 3. Average computation times of the three approaches measured in seconds.
P JMalct JM JMalct+JM
Table 4. Structure of the data with primary outcomes for the joint analysis in the three columns on the left.
| y | T | δ | t | Drug | Gender | AZT | prevOI | id |
|---|---|---|---|---|---|---|---|---|
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Griesbach, C.; Mayr, A.; Bergherr, E. Variable Selection and Allocation in Joint Models via Gradient Boosting Techniques. Mathematics 2023, 11, 411.
