1. Introduction
Antiretroviral therapy (ART) is the cornerstone of HIV care, and timely initiation is critical for improving clinical outcomes and reducing HIV transmission. While current guidelines endorse immediate ART initiation upon diagnosis, in practice, the time between HIV diagnosis and treatment initiation varies considerably. These delays often arise due to multiple barriers, including anticipated side effects (i.e., perceived collateral effects of ART), stigma, mental health or substance use challenges, and structural issues such as inconsistent access to care. As a result, viral load trajectories observed in clinical data reflect substantial heterogeneity in ART uptake timing.
While HIV case surveillance in most U.S. states collects viral load data through mandatory electronic reporting, information on ART initiation is typically not recorded. Public health departments often infer treatment initiation indirectly using viral suppression data or through analyses of clinical cohorts. Motivated by Braunstein et al. [1], who developed a rule-based method using viral load surveillance data to estimate ART initiation timing in New York City, this paper proposes a statistical modeling approach to infer ART initiation time more formally and flexibly.
Random change point models, which allow for individual-specific changes in longitudinal outcomes, are widely used in medical research [2]. Typically, these models assume a normal distribution for the change point and use linear mixed-effects models for the segments before and after the change point [3,4,5,6]. However, the Gaussian assumption may not always be ideal. In our case, many individuals likely initiated ART at the time of diagnosis (time zero), making a zero-inflated distribution more suitable. Additionally, longitudinal data may be censored, meaning measurements below a certain threshold cannot be accurately quantified. For viral load data, the detection limit generally lies between 50 copies/mL and 400 copies/mL. A linear model fitted to the observed data may therefore be unsuitable for the unobserved data. Alternatively, if a mechanistic or scientific model is available for the longitudinal process, it can provide better predictions for the unobserved data and improve change point estimation. Such a mechanistic model often takes a nonlinear form.
Accounting for a random change point within a likelihood framework poses a major challenge due to the lack of closed-form expressions [7,8,9]. This challenge is compounded by the non-Gaussian distribution of the change point, nonlinear models, and data censoring. In this paper, we propose a zero-inflated exponential (ZIE) distribution-based random change point model with segmented nonlinear mixed-effects (NLME) sub-models to analyze left-censored data. To facilitate full likelihood-based inference, we extend the Stochastic Expectation-Maximization (StEM) algorithm, initially introduced by [10]. Our extension of the StEM algorithm involves a Gibbs sampler coupled with Metropolis–Hastings sampling for the mixed-type random-effects structure of the random change point model.
The remainder of this paper is structured as follows. Section 2 introduces the ZIE-based nonlinear random change point model. Section 3 presents the general model and details of the extended StEM algorithm. Section 4 analyzes an HIV cohort dataset, and Section 5 evaluates the proposed method through simulations. Finally, Section 6 concludes the article with a discussion.
2. ZIE-Based Nonlinear Random Change Point Model
In longitudinal data analysis, understanding how a specific outcome changes over time is crucial. Random effects change point models offer a valuable framework for examining changes in time trajectories by incorporating individual change points. These changes are typically induced by external events, causing deviations from the original data pattern.
Our objective is to determine when individuals initiate HIV treatment using their HIV viral load measurements over time. Without ART, the viral load fluctuates significantly after HIV infection until it stabilizes at a set point. If untreated, the viral load eventually increases, leading to AIDS [11]. Starting ART, however, results in substantial decreases in HIV viral load. To simplify our modeling approach, we assume the HIV diagnosis occurs after the viral set point and focus on modeling the changes in viral load dynamics post-ART initiation.
Traditional random effects change point models often assume that the longitudinal outcome can be described by a segmented linear mixed-effects model. However, as mentioned in the introduction, linear assumptions may not be suitable for many real-world applications. While linear models might adequately fit the observed data, they may not be appropriate for data subject to censoring, which is common with HIV viral load measurements, especially after ART initiation.
Extensive research has been conducted to understand the dynamics of viral load following ART initiation and to evaluate the effectiveness of these drugs in treating HIV. Building upon biological and clinical knowledge, ref. [12] proposed a virological model to approximate the patterns observed in viral load data. This model is represented by the equation
$$V(t) = P_1 e^{-\lambda_1 t} + P_2 e^{-\lambda_2 t},$$
where $V(t)$ represents the total virus at time $t$, and $P_1$ and $P_2$ are baseline values. The parameters $\lambda_1$ and $\lambda_2$ correspond to the viral decay rates and can be interpreted as the turnover rates of productively infected cells and of long-lived or latently infected cells in an ideal therapy setting. Refs. [13,14] provide detailed discussions of this model.
For our problem, we can consider the following random change point model to describe the viral load $y_{ij}$ for individual $i$ at visit $j$, with time $t_{ij}$ after HIV diagnosis:
$$y_{ij} = g_1(t_{ij}, \tau_i, \boldsymbol{\beta}_i)\,\mathbb{1}(t_{ij} < \tau_i) + g_2(t_{ij}, \tau_i, \boldsymbol{\beta}_i)\,\mathbb{1}(t_{ij} \geq \tau_i) + e_{ij}. \qquad (1)$$
Here, $y_{ij}$ is the log10-transformed viral load, $\tau_i$ is the (single) change point which induces the change of the viral load trajectory, $\mathbb{1}(\cdot)$ is the indicator function, $\boldsymbol{\beta}_i = (\beta_{1i}, \beta_{2i}, \beta_{3i}, \beta_{4i})^\top$ collects the subject-specific coefficients, and $e_{ij}$ is the error term. We use the log10 transformation in line with standard practice in HIV clinical research and surveillance, where viral load thresholds and treatment response metrics (e.g., a 1-log reduction) are conventionally interpreted on the log10 scale. While natural logarithms are common in general modeling, the base-10 scale facilitates direct clinical relevance and comparability with prior studies.

The functions $g_1$ and $g_2$ correspond to
$$g_1(t_{ij}, \tau_i, \boldsymbol{\beta}_i) = \log_{10}\!\big(e^{\beta_{2i}} + e^{\beta_{4i}}\big) + \beta_{1i}(t_{ij} - \tau_i) \quad \text{and} \quad g_2(t_{ij}, \tau_i, \boldsymbol{\beta}_i) = \log_{10}\!\big(e^{\beta_{2i} - \beta_{3i}(t_{ij} - \tau_i)} + e^{\beta_{4i}}\big),$$
respectively. The quantity $\beta_{1i}$ represents the subject-specific regression coefficient that captures the viral load slope before the change point, while $\beta_{2i}$, $\beta_{3i}$, and $\beta_{4i}$ represent subject-specific mixed effects governing the viral trajectory after the change point. We define these as
$$\beta_{ki} = \beta_k + b_{ki}, \qquad k = 1, 2, 3, 4.$$
Here, $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ represent the population parameters (fixed effects), while $b_{1i}$, $b_{2i}$, $b_{3i}$, and $b_{4i}$ denote the random effects, typically assumed to follow a normal distribution with a mean of zero. It is worth noting that the pre-change point segment and the post-change point segment meet at the intercept $\log_{10}(e^{\beta_{2i}} + e^{\beta_{4i}})$ when $t_{ij} = \tau_i$.
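To make the segmented structure concrete, the following R sketch evaluates the piecewise mean trajectory implied by (1) under the parameterization written above; the function name and parameter values are illustrative rather than taken from the paper.

```r
# Piecewise mean trajectory for one subject under model (1): a linear rise
# before the change point and a biexponential-type decay afterwards, with
# the two segments meeting at t = tau. All values below are illustrative.
mean_traj <- function(t, tau, b1, b2, b3, b4) {
  pre  <- log10(exp(b2) + exp(b4)) + b1 * (t - tau)         # pre-ART segment
  post <- log10(exp(b2 - b3 * pmax(t - tau, 0)) + exp(b4))  # post-ART segment
  ifelse(t < tau, pre, post)
}

# Example: ART initiated 0.8 years after diagnosis
curve(mean_traj(x, tau = 0.8, b1 = 0.4, b2 = 9, b3 = 6, b4 = 3),
      from = 0, to = 3, xlab = "Years since diagnosis",
      ylab = "log10 viral load")
```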
The choice of distribution for the random change points is a modeling assumption that depends on the specific investigation. For our application, a significant proportion of individuals presumably received ART treatment at diagnosis, i.e., were “test-and-treated”. The rest would initiate their HIV treatment after the diagnosis date. The zero-inflated exponential (ZIE) distribution combines a point mass at zero with an exponential distribution for the positive values [15]. It assumes that with probability $p$, the only possible observation is 0, and with probability $1 - p$, an exponential random variable is observed. For our change point model, we have
$$f(\tau_i; p, \mu) = p\,\mathbb{1}(\tau_i = 0) + (1 - p)\,\frac{1}{\mu}\,e^{-\tau_i/\mu}\,\mathbb{1}(\tau_i > 0),$$
where $\mu$ represents the expectation of the exponential distribution.
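For concreteness, a minimal R sketch of the ZIE density and random number generator is given below; the function names are ours, not part of the proposed method.

```r
# Zero-inflated exponential (ZIE): point mass at 0 with probability p,
# Exponential with mean mu otherwise. Function names are illustrative.
dzie <- function(x, p, mu) {
  ifelse(x == 0, p, (1 - p) * dexp(x, rate = 1 / mu))  # density/mass
}

rzie <- function(n, p, mu) {
  zero <- rbinom(n, size = 1, prob = p)         # 1 = initiate at diagnosis
  ifelse(zero == 1, 0, rexp(n, rate = 1 / mu))  # else exponential delay
}

set.seed(1)
tau <- rzie(1000, p = 0.31, mu = 0.5)  # values in the spirit of Section 4
mean(tau == 0)                         # approximately 0.31
```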
3. Estimation Procedure Based on StEM
3.1. The Models and Notations
In this section, we describe the models and methods in a general form to illustrate their applicability to other settings. Let $y_{ij}$, with $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$, represent the longitudinal measurements for subject $i$, taken at time $t_{ij}$. We consider a general ZIE-based nonlinear random change point model:
$$y_{ij} = g_1(t_{ij}, \tau_i, \mathbf{a}_i)\,\mathbb{1}(t_{ij} < \tau_i) + g_2(t_{ij}, \tau_i, \mathbf{b}_i)\,\mathbb{1}(t_{ij} \geq \tau_i) + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2). \qquad (2)$$
Here, $g_1$ and $g_2$ are known nonlinear functions, and $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are vectors of population parameters. The random effects $\mathbf{a}_i$ and $\mathbf{b}_i$ follow normal distributions with means $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, and covariance matrices $A$ and $B$, respectively. The random change point $\tau_i$ is assumed to follow a zero-inflated exponential distribution with parameters $p$ and $\mu$. The within-individual variance is denoted as $\sigma^2$. The indicator functions $\mathbb{1}(\cdot)$ are defined as before. We assume that $\mathbf{a}_i$ and $\mathbf{b}_i$ are independent and that both are independent of $\tau_i$, which is introduced externally.
To estimate and infer the model (2) for left-censored data, we employ a likelihood-based estimation procedure using the observed data $\{(q_{ij}, c_{ij}),\ i = 1, \ldots, n,\ j = 1, \ldots, n_i\}$, where $c_{ij}$ is the censoring indicator such that $y_{ij}$ is observed if $c_{ij} = 0$ and is left-censored if $c_{ij} = 1$. That is, $q_{ij} = y_{ij}$ if $c_{ij} = 0$ and $q_{ij} = d$ (with $y_{ij} \leq d$) if $c_{ij} = 1$, where $d$ is the detection limit. Extension to the “doubly-censored” case, in which the response may be either left-censored or right-censored, is straightforward.
Let $\boldsymbol{\theta} = (\boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2, A, B, p, \mu)$ denote the collection of all unknown parameters, and let $f(\cdot)$ be a generic density function, with $f(X \mid Y)$ denoting the conditional density of $X$ given $Y$. The observed data likelihood is given by the following:
$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \iiint \prod_{j=1}^{n_i} \big[f(y_{ij} \mid \mathbf{a}_i, \mathbf{b}_i, \tau_i; \boldsymbol{\theta})\big]^{1 - c_{ij}} \big[F(d \mid \mathbf{a}_i, \mathbf{b}_i, \tau_i; \boldsymbol{\theta})\big]^{c_{ij}}\, f(\mathbf{a}_i; \boldsymbol{\theta})\, f(\mathbf{b}_i; \boldsymbol{\theta})\, f(\tau_i; \boldsymbol{\theta})\, d\mathbf{a}_i\, d\mathbf{b}_i\, d\tau_i, \qquad (3)$$
where $F(d \mid \mathbf{a}_i, \mathbf{b}_i, \tau_i; \boldsymbol{\theta})$ is the corresponding conditional cumulative distribution function evaluated at the detection limit $d$.
Directly maximizing the likelihood (3) is challenging due to the presence of mixed-type distributions, nonlinear models, and nested integrals. Numerical methods, e.g., Gauss–Hermite quadrature [16], can be prohibitively intensive for this computation. We therefore resort to EM algorithm-based methods. By treating $\mathbf{y}_{\mathrm{cen},i}$, the censored component of $\mathbf{y}_i$, and the unobserved random effects $\mathbf{a}_i$, $\mathbf{b}_i$, and $\tau_i$ as “missing data”, we have the “complete data” $(\mathbf{y}_i, \mathbf{a}_i, \mathbf{b}_i, \tau_i)$, $i = 1, \ldots, n$. The complete-data log-likelihood function for individual $i$ is expressed as
$$\ell_i(\boldsymbol{\theta}) = \sum_{j=1}^{n_i} \log f(y_{ij} \mid \mathbf{a}_i, \mathbf{b}_i, \tau_i; \boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2) + \log f(\mathbf{a}_i; \boldsymbol{\alpha}, A) + \log f(\mathbf{b}_i; \boldsymbol{\beta}, B) + \log f(\tau_i; p, \mu).$$
3.2. The Estimation Procedure
The EM algorithm, introduced by [17], is a classical approach to estimating the parameters of models with non-observed or incomplete data. Let us briefly review the principle. Denote $\mathbf{z}$ as the vector of non-observed data, $(\mathbf{y}_{\mathrm{obs}}, \mathbf{z})$ the complete data, and $\ell(\boldsymbol{\theta}; \mathbf{y}_{\mathrm{obs}}, \mathbf{z})$ the log-likelihood of the complete data; the EM algorithm maximizes the $Q$ function in two steps. At the $k$-th iteration, the E-step is the evaluation of $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(k)}) = E\big[\ell(\boldsymbol{\theta}; \mathbf{y}_{\mathrm{obs}}, \mathbf{z}) \mid \mathbf{y}_{\mathrm{obs}}, \boldsymbol{\theta}^{(k)}\big]$, whereas the M-step updates $\boldsymbol{\theta}^{(k+1)}$ by maximizing $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(k)})$.
For cases where the E-step has no analytic form, ref. [18] introduced the MCEM algorithm, which approximates the conditional expectations in the E-step via many simulations within each iteration and is hence quite computationally intensive; the choice of the replicate size is the central issue in guaranteeing convergence. Ref. [10] introduced a stochastic version of the EM algorithm, namely the StEM, which replaces the E-step with a single imputation of the complete data and then averages the last batch of $M$ estimates in the iterative Markov chain sequence to obtain the point estimate of the parameters. The imputed data $\mathbf{z}^{(k)}$ at the $k$-th iteration are a random draw from the conditional distribution of the missing data given the observed data and the estimated parameter values at the $(k-1)$-th iteration, $f(\mathbf{z} \mid \mathbf{y}_{\mathrm{obs}}; \boldsymbol{\theta}^{(k-1)})$. As $\boldsymbol{\theta}^{(k)}$ depends only on $\boldsymbol{\theta}^{(k-1)}$, the sequence $\{\boldsymbol{\theta}^{(k)}\}$ is a Markov chain. Assuming that $\boldsymbol{\theta}^{(k)}$ takes values in a compact space and that the kernel of the Markov chain is positive and continuous with respect to Lebesgue measure, the Markov chain is ergodic, which ensures the existence of a unique stationary distribution [19,20].
In extending the StEM algorithm to the ZIE-based nonlinear random change point model, the imputation step is a crucial part of the process. At the $k$-th iteration, we aim to draw the missing data $\mathbf{z}^{(k)} = (\mathbf{y}_{\mathrm{cen}}^{(k)}, \mathbf{a}^{(k)}, \mathbf{b}^{(k)}, \boldsymbol{\tau}^{(k)})$, but direct sampling from the joint conditional distribution is often intractable. To address this, we employ a Metropolis-within-Gibbs sampler, wherein each component is updated conditionally. For variables with tractable full conditional distributions (e.g., censored outcomes), we use standard Gibbs updates. For components lacking closed-form conditionals, we embed Metropolis–Hastings steps within the Gibbs framework to sample from the appropriate target distributions. Unlike the EM or MCEM algorithms, this procedure does not require monotonic increases in the likelihood, but instead ensures that the Markov chain explores the parameter space in a way consistent with the joint posterior distribution [21,22].
As an example, after initializing $\boldsymbol{\theta}^{(0)}$ and $\mathbf{z}^{(0)}$, we update $\mathbf{y}_{\mathrm{cen},i}$ and $(\mathbf{a}_i, \mathbf{b}_i, \tau_i)$ for each subject $i$ at the $k$-th iteration as follows:
- Step 1:
simulate $\mathbf{y}_{\mathrm{cen},i}^{(k)}$ from $f(\mathbf{y}_{\mathrm{cen},i} \mid \mathbf{y}_{\mathrm{obs},i}, \mathbf{a}_i^{(k-1)}, \mathbf{b}_i^{(k-1)}, \tau_i^{(k-1)}; \boldsymbol{\theta}^{(k-1)})$, a multivariate truncated normal distribution with $d$ the upper bound of the truncation, mean given by the model prediction $g_1(t_{ij}, \tau_i^{(k-1)}, \mathbf{a}_i^{(k-1)})\mathbb{1}(t_{ij} < \tau_i^{(k-1)}) + g_2(t_{ij}, \tau_i^{(k-1)}, \mathbf{b}_i^{(k-1)})\mathbb{1}(t_{ij} \geq \tau_i^{(k-1)})$ for each censored component, and variance $\sigma^{2(k-1)} I_{m_i}$, where $I_{m_i}$ is an $m_i$ by $m_i$ identity matrix and $m_i$ is the number of censored measurements for subject $i$; draw a proposal $(\tilde{\mathbf{a}}_i, \tilde{\mathbf{b}}_i, \tilde{\tau}_i)$ from the current population distributions $N(\boldsymbol{\alpha}^{(k-1)}, A^{(k-1)})$, $N(\boldsymbol{\beta}^{(k-1)}, B^{(k-1)})$, and $\mathrm{ZIE}(p^{(k-1)}, \mu^{(k-1)})$; and, independently, sample $u$ from the uniform (0, 1) distribution;
- Step 2:
calculate the Metropolis–Hastings acceptance probability
$$\rho_i = \min\left\{1,\ \frac{\prod_{j=1}^{n_i} f\big(y_{ij}^{(k)} \mid \tilde{\mathbf{a}}_i, \tilde{\mathbf{b}}_i, \tilde{\tau}_i; \boldsymbol{\theta}^{(k-1)}\big)}{\prod_{j=1}^{n_i} f\big(y_{ij}^{(k)} \mid \mathbf{a}_i^{(k-1)}, \mathbf{b}_i^{(k-1)}, \tau_i^{(k-1)}; \boldsymbol{\theta}^{(k-1)}\big)}\right\},$$
which reduces to a likelihood ratio because the proposal is drawn from the current population distributions;
- Step 3:
if $u \leq \rho_i$, we update $(\mathbf{a}_i^{(k)}, \mathbf{b}_i^{(k)}, \tau_i^{(k)})$ by $(\tilde{\mathbf{a}}_i, \tilde{\mathbf{b}}_i, \tilde{\tau}_i)$; otherwise, $(\mathbf{a}_i^{(k)}, \mathbf{b}_i^{(k)}, \tau_i^{(k)}) = (\mathbf{a}_i^{(k-1)}, \mathbf{b}_i^{(k-1)}, \tau_i^{(k-1)})$.
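A minimal R sketch of the two sampling ingredients follows, assuming the truncated-normal draw for the left-censored components and the likelihood-ratio acceptance described above; `mu_hat` stands for the model mean evaluated at the current parameters and random effects, and all names are ours.

```r
library(truncnorm)

# Step 1 (censored components): draw from a normal distribution truncated
# above at the detection limit d, given the current model mean and sigma.
impute_censored <- function(y, cens, mu_hat, sigma, d) {
  y[cens] <- rtruncnorm(sum(cens), a = -Inf, b = d,
                        mean = mu_hat[cens], sd = sigma)
  y
}

# Steps 2-3: with the population distribution as the proposal, the MH
# acceptance probability reduces to a likelihood ratio.
mh_accept <- function(loglik_new, loglik_old) {
  log(runif(1)) <= min(0, loglik_new - loglik_old)
}
```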
The maximization step of the StEM algorithm involves maximizing the complete-data log-likelihood $\sum_{i=1}^{n} \ell_i(\boldsymbol{\theta})$ to update the parameters $\boldsymbol{\theta}$ given the current imputed missing data. Since $\mathbf{z}^{(k)}$ are regarded as data, the complete-data log-likelihood no longer involves integrals, which substantially simplifies the maximization. Solving the corresponding score functions yields the following estimates:
$$\hat{\boldsymbol{\alpha}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{a}_i, \qquad \hat{A} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{a}_i - \hat{\boldsymbol{\alpha}})(\mathbf{a}_i - \hat{\boldsymbol{\alpha}})^{\top}, \qquad \hat{\boldsymbol{\beta}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{b}_i, \qquad \hat{B} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{b}_i - \hat{\boldsymbol{\beta}})(\mathbf{b}_i - \hat{\boldsymbol{\beta}})^{\top},$$
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{n_i} \big[y_{ij} - g_1(t_{ij}, \tau_i, \mathbf{a}_i)\mathbb{1}(t_{ij} < \tau_i) - g_2(t_{ij}, \tau_i, \mathbf{b}_i)\mathbb{1}(t_{ij} \geq \tau_i)\big]^2, \qquad N = \sum_{i=1}^{n} n_i,$$
where all imputed quantities are taken at the current iteration.
The likelihood contribution from the ZIE distribution, represented by its joint probability density and mass function, can be written as
$$f(\tau_i; p, \mu) = p^{\mathbb{1}(\tau_i = 0)} \Big[(1 - p)\,\frac{1}{\mu}\, e^{-\tau_i/\mu}\Big]^{\mathbb{1}(\tau_i > 0)}.$$
Denoting $n_0 = \sum_{i=1}^{n} \mathbb{1}(\tau_i = 0)$, we have $\ell(p, \mu) = n_0 \log p + (n - n_0)\log(1 - p) - (n - n_0)\log \mu - \frac{1}{\mu}\sum_{i: \tau_i > 0} \tau_i$. Solving the score functions $\partial \ell / \partial p = 0$ and $\partial \ell / \partial \mu = 0$ yields
$$\hat{p} = \frac{n_0}{n}, \qquad \hat{\mu} = \frac{\sum_{i: \tau_i > 0} \tau_i}{n - n_0}.$$
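These closed-form updates translate directly into code; the sketch below assumes `tau` holds the imputed change points and `a_mat` the imputed subject-level coefficients (an n x q matrix), with names of our choosing.

```r
# Closed-form M-step updates given the currently imputed missing data.
update_zie <- function(tau) {
  n0 <- sum(tau == 0)
  n  <- length(tau)
  list(p  = n0 / n,                        # point-mass probability
       mu = sum(tau[tau > 0]) / (n - n0))  # mean of the positive delays
}

update_normal <- function(a_mat) {
  alpha <- colMeans(a_mat)                               # fixed-effect update
  A <- crossprod(sweep(a_mat, 2, alpha)) / nrow(a_mat)   # covariance update
  list(alpha = alpha, A = A)
}
```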
In general, increasing the number of random effects does not substantially increase the complexity of the maximization step. However, the imputation step will be more complicated, as this increases the dimensions of missing data that need to be imputed.
As with the likelihood defined in (3), the Fisher information matrix of the ZIE-based nonlinear random change point model has no closed-form expression. To obtain the variance-covariance matrix of the MLE $\hat{\boldsymbol{\theta}}$, we consider the following approximate formula in [23]. Denote the score function of the complete-data likelihood for individual $i$ by $\mathbf{s}_i(\hat{\boldsymbol{\theta}}) = \partial \ell_i(\hat{\boldsymbol{\theta}}) / \partial \boldsymbol{\theta}$. Then, an approximate formula for the variance-covariance matrix of $\hat{\boldsymbol{\theta}}$ is
$$\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\theta}}) = \Big[\sum_{i=1}^{n} E\big\{\mathbf{s}_i(\hat{\boldsymbol{\theta}})\, \mathbf{s}_i(\hat{\boldsymbol{\theta}})^{\top} \,\big|\, \mathbf{y}_{\mathrm{obs},i}\big\}\Big]^{-1},$$
where the expectation can be approximated by the conditional mean over the Monte Carlo samples.
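A sketch of this approximation in R, assuming the per-subject complete-data scores have already been evaluated at the MLE for each post-convergence Monte Carlo draw (the input structure is hypothetical):

```r
# score_draws: a list over Monte Carlo imputations; each element is an
# n x P matrix whose rows are per-subject complete-data scores at the MLE.
cov_mle <- function(score_draws) {
  # crossprod(S) = sum over subjects of s_i s_i^T for one draw;
  # averaging over draws approximates the conditional expectation.
  info <- Reduce(`+`, lapply(score_draws, crossprod)) / length(score_draws)
  solve(info)  # approximate variance-covariance matrix of the MLE
}
```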
3.3. Convergence Diagnosis
Determining convergence is a critical aspect of the StEM algorithm, yet it remains an open question in the literature. The most commonly used approach to convergence diagnostics involves visual examination of trace plots [24,25,26]. Recently, ref. [27] proposed a method based on the Geweke statistic. We adopt this approach in our implementation for a more rigorous assessment of convergence. Specifically, for each run, after initializing the Markov chain with the specified initial values, we determine stationarity using a batch procedure based on the Geweke statistic [28]. A Geweke statistic is computed at each increment of $w$ iterations using a moving window with batch size $M$. Specifically, the procedure consists of the following steps:
- 1.
Initialization. Set $B = 0$ and run the StEM algorithm to obtain the initial series of estimates $\{\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(M)}\}$.
- 2.
Check stationarity. For each entry $p$ in $\boldsymbol{\theta}$, compute the Geweke statistic $G_p$ from the Markov chain $\{\theta_p^{(Bw + 1)}, \ldots, \theta_p^{(Bw + M)}\}$. The Geweke statistic is defined as the standardized mean difference between the first and last portions of the chain, whose window fractions can be fine-tuned for a specific application. We consider stationarity to be reached when all $|G_p|$ are sufficiently small, i.e., $\max_{1 \leq p \leq P} |G_p| \leq \epsilon$, where $P$ is the total number of parameters and $\epsilon$ is another tuning parameter.
- 3.
Update. If stationarity is not reached, perform $w$ additional runs of the chain, increase the number $B$ by 1, and repeat step 2.
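For illustration, the batch check can be implemented with the coda package as sketched below; the window fractions and cutoff are tuning choices, and the helper name is ours.

```r
library(coda)

# chain_mat: iterations x P matrix of StEM parameter estimates.
# Returns TRUE when all Geweke z-scores computed on the last batch of
# size M fall within the cutoff, signalling stationarity.
check_stationary <- function(chain_mat, M, frac1 = 0.1, frac2 = 0.5,
                             cutoff = 1.96) {
  z <- geweke.diag(mcmc(tail(chain_mat, M)), frac1 = frac1, frac2 = frac2)$z
  all(abs(z) <= cutoff)
}
```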
4. Data Analysis
The HIV clinical cohort database (HCCD), maintained by the Einstein-Rockefeller-CUNY Center for AIDS Research, contains de-identified data on people living with HIV and receiving care at hospitals and clinics affiliated with Montefiore Medical Center, the largest provider of HIV care in the Bronx, New York City. Patients in the HCCD are demographically representative, with respect to age, sex, race/ethnicity, and HIV transmission risk, of the overall population of people living with HIV in the Bronx as described by public health surveillance data [29].
For this study, we include all patients living with HIV in the HCCD diagnosed between 2005 and 2015 with last follow-up by 31 December 2017. Additional inclusion criteria were age ≥ 13 years and at least two HIV-1 viral loads recorded during the period. The final analytic dataset contains 2475 persons with a median of 5 viral load measurements per person (interquartile range, 3 to 11). Notably, approximately 60% of the viral load measurements were below the detection limits.
In addition to the primary model (2), we also fit a model where the post-ART segment is assumed to follow a two-compartment model [12], which is commonly used for viral load after treatment. We therefore have the following two random change point models:
$$\text{M1:} \quad y_{ij} = g_1(t_{ij}, \tau_i, \mathbf{a}_i)\,\mathbb{1}(t_{ij} < \tau_i) + \log_{10}\!\big(e^{\beta_{2i} - \beta_{3i}(t_{ij} - \tau_i)} + e^{\beta_{4i}}\big)\,\mathbb{1}(t_{ij} \geq \tau_i) + e_{ij},$$
$$\text{M2:} \quad y_{ij} = g_1(t_{ij}, \tau_i, \mathbf{a}_i)\,\mathbb{1}(t_{ij} < \tau_i) + \log_{10}\!\big(e^{\beta_{2i} - \beta_{3i}(t_{ij} - \tau_i)} + e^{\beta_{4i} - \beta_{5i}(t_{ij} - \tau_i)}\big)\,\mathbb{1}(t_{ij} \geq \tau_i) + e_{ij},$$
where M1 allows for an individual-specific baseline value $\beta_{4i}$ for the second phase of viral decay, while M2 also captures the second-phase viral decay rate through the random effects $\beta_{5i} = \beta_5 + b_{5i}$. Our preliminary analysis indicates that it is sufficient to model the pre-change point segment with a linear mixed-effects model and the post-change point segment with a diagonal variance-covariance random-effects structure. As a result, we take $g_1$ to be linear in time and the covariance matrices of the post-change point random effects to be diagonal.
We implemented the StEM algorithm in R [30], with fixed choices of the tuning parameters $w$, $M$, and the stationarity threshold described in Section 3.3; for M2, we also restrict the ordering of the two decay rates ($\beta_{5i} < \beta_{3i}$) to ensure that the model is identifiable. This configuration typically allows for convergence within 3000 iterations in most model-fitting runs. Owing to the stochastic nature of the StEM algorithm, the choice of starting values is quite flexible, with initial values set randomly within the plausible range.

To simulate the multivariate truncated normal distribution for the left-censored viral loads, we utilized the R package truncnorm [31]. For the ZIE-distributed change points, we set the value to zero with the current estimated probability $p^{(k)}$ and, with probability $1 - p^{(k)}$, set it to a random draw from an exponential distribution with the current estimated mean $\mu^{(k)}$.
Figure 1 and Figure 2 display the trace plots of the Markov chains for each parameter under models M1 and M2, respectively. Table 1 summarizes the estimation results for the fixed effects and dispersion parameters of both models. The estimates are comparable between M1 and M2, except for the second-phase baseline $\beta_4$, which is necessarily smaller in M1 due to the single baseline parameter in the one-compartment model M1. In contrast, this baseline value is estimated to be larger in M2, where it, along with the second-phase viral decay rate $\beta_5$, captures the dynamics of the viral trajectory during the second phase. Both M1 and M2 estimate a similar positive slope of viral load before ART initiation, specifically 0.43 and 0.42, respectively. Additionally, the models estimate similar proportions of individuals who started ART treatment at time zero (the time of diagnosis), with M1 estimating 31% and M2 estimating 32%.
We also assess the performance of our model by comparing the observed viral load values with the predicted values for each individual in the HCCD. Individual random effects, including the change point, are predicted using the conditional mean, which is obtained by averaging the parameters from extra iterations of the imputation step after convergence.
To further contextualize our model-based estimates, we also implemented an approach, denoted $S_{VL}$ hereafter, adapted from the rule-based algorithm developed by [1], which uses viral load surveillance data to infer ART initiation timing. Specifically, $S_{VL}$ detects ART initiation when there is a decline of more than one log10 unit in viral load between two consecutive measurements occurring within a defined time window (e.g., three months). Additionally, ART initiation is inferred when a subject transitions from being detectable to being undetectable (i.e., a left-censored measurement). While the original method was primarily used to identify ART initiation in a subset of individuals affected by the treatment, $S_{VL}$ generalizes this logic to the entire sample. Furthermore, instead of imputing ART initiation at the midpoint between the two relevant viral load measurements, $S_{VL}$ assigns the initiation time to the earlier measurement of the pair. This adjustment is biologically motivated, reflecting the expectation that viral load suppression begins shortly after ART initiation.
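For transparency, a minimal R sketch of the $S_{VL}$ rule as described above is given below; the three-month window is the example value mentioned earlier, and the function and argument names are ours.

```r
# Rule-based ART initiation detector: return the earlier time of the first
# consecutive pair showing a > 1 log10 decline within `window` years, or
# a detectable-to-undetectable (left-censored) transition.
s_vl <- function(time, logvl, censored, window = 0.25) {
  for (j in seq_len(length(time) - 1)) {
    drop1log <- (logvl[j] - logvl[j + 1] > 1) &&
                (time[j + 1] - time[j] <= window)
    suppress <- !censored[j] && censored[j + 1]
    if (drop1log || suppress) return(time[j])  # assign earlier measurement
  }
  NA_real_  # no ART initiation detected
}
```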
Figure 3 showcases the findings for nine selected individuals, chosen to represent typical patterns. For each individual, we present the predicted trajectories from M1 and M2 in addition to the observed viral loads. The predicted change point and the corresponding viral load at the time of ART initiation are indicated on the plot with different symbols for the two models. The ART initiation predicted by $S_{VL}$ is also highlighted in the plot. It is worth noting that the trajectories produced by the two models exhibit slightly different trends, with some trajectories displaying censoring (IDs 4 to 9) and others without censoring (IDs 1, 2, and 3).
Our model fits the fully observed data quite well and predicts a reasonable ART initiation time. For example, the change point is estimated right before the decline of the viral load for ID 1, while for ID 3, the change point is estimated beyond the last observed viral load, where the observed viral loads are ever-increasing, indicating no ART initiation. For ID 2, the ART initiation time occurs at about the time when the viral load starts to decline.
In contrast to ID 1, ID 4 has only one fully observed viral load, as the other is left-censored at around 4.8 months (0.4 years). Based on this information, M1 predicts that the viral trajectory dipped under the detection limit around 1.2 months (0.1 years), while M2 predicts the time of viral suppression at a time around 2.4 months (0.2 years). Nevertheless, both models predict ART initiation at around the time when the observed viral load is recorded, which is reasonable due to the occurrence of viral suppression.
For IDs 5 and 6, there are two fully observed viral loads before viral suppression. The values and the positions of the two viral loads influenced the shape of the entire trajectory. For example, compared to the slope of ID 6, the slope of the two viral loads for ID 5 is relatively flat; therefore, both M1 and M2 predict a flat pre-change point line.
We present the example for IDs 7, 8, and 9, where more fully observed viral loads are available. Here again, our model fits those data points quite well, and the predicted ART initiation times are reasonable. Importantly, ID 7 represents cases where ART initiation time is estimated to be zero, the time of HIV diagnosis. In such cases, the random change point model effectively degenerates to an NLME model. Similarly, ID 3 represents a case in which the change point model degenerates to an LME model, where the change occurs beyond the last observed viral load.
Visual inspection of Figure 3 reveals that the ART initiation times estimated by the $S_{VL}$ method generally align with the model-based predictions (M1 and M2) when viral load measurements show a sharp and well-timed decline (e.g., IDs 4, 5, and 7). However, notable discrepancies occur in cases with sparse measurements or censoring. For instance, for IDs 6 and 8, the $S_{VL}$ method identifies ART initiation earlier than both M1 and M2, likely due to its reliance on observed declines rather than inferred trajectories. These differences underscore the advantage of the model-based approach in accommodating censoring, nonlinear post-ART dynamics, and between-subject variability when estimating ART initiation timing.
5. Simulation Studies
The performance of the StEM algorithm was evaluated through simulation. Inspired by the real data analysis, we design two simulation studies to assess and compare estimation accuracy under the two specifications for the post-ART segment, M1 and M2.
We model irregular viral load recording times since HIV diagnosis using a progressive state-transition model, assuming a first-order Markov process; that is, the time to the next viral load record depends only on the previous recording time. To generate stochastic measurement times, we use parameters obtained from fitting the model to the actual viral load test dates in the HCCD. Specifically, we assume that the gap between successive viral load measurement times follows an exponential distribution with rate parameter $\theta$: given the previous recording time $u$, the next recording time is $t = u + \epsilon$, where $\epsilon \sim \mathrm{Exp}(\theta)$. For this simulation, we set $\theta$ to the value estimated from the real data. Recording is terminated when the time exceeds 3 years, to mimic the real data. For left-censoring, in addition to simulating the 60% censoring rate observed in the real data, we also run the model without any censoring to assess the StEM algorithm under such an ideal scenario.
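A sketch of this visit-time generator in R follows; the rate value below is a placeholder for the one estimated from the HCCD.

```r
# Generate irregular viral load recording times: exponential gaps
# (first-order Markov), truncated at 3 years of follow-up.
sim_visits <- function(theta, max_t = 3) {
  times <- numeric(0)
  t <- rexp(1, rate = theta)
  while (t <= max_t) {
    times <- c(times, t)
    t <- t + rexp(1, rate = theta)  # memoryless gap to the next record
  }
  times
}

set.seed(2)
sim_visits(theta = 2)  # illustrative rate; about six visits over 3 years
```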
To efficiently manage computational resources, each simulated dataset included 1000 individuals. To thoroughly evaluate the accuracy and bias of the estimated parameter values, we conducted 200 simulation runs for each scenario. Across these simulation replicates, we calculated both the mean squared error (MSE) and the bias (Bias) by comparing the estimated parameter values with the true values:
$$\mathrm{MSE} = \frac{1}{S}\sum_{s=1}^{S} (\hat{\theta}_s - \theta)^2, \qquad \mathrm{Bias} = \frac{1}{S}\sum_{s=1}^{S} (\hat{\theta}_s - \theta).$$
Here, $S$ is the number of replicates, and $\hat{\theta}_s$ is the estimate from simulation $s$.
Table 2 and Table 3 display the simulation results for studies 1 and 2, respectively. We see no major performance difference between M1 and M2. For each model, the fixed effects are generally estimated better than the dispersion parameters. The largest MSE occurred in the estimation of the second-phase baseline $\beta_4$ under M1 when 60% of the viral loads are left-censored. For M2, $\beta_4$ is estimated with the largest bias, while $\beta_5$, the second-phase decay rate, is estimated with the biggest MSE. Such sub-optimal performance is likely due to the sparsity of the observation frequency, where insufficient data are available to support the best estimation.
6. Conclusions and Discussion
In this paper, we extend the StEM algorithm by incorporating a combination of Gibbs sampling and Metropolis–Hastings steps to address the complexity of modeling individual-specific change points in longitudinal data. Specifically, we model change points using a zero-inflated exponential (ZIE) distribution, allowing us to capture both immediate and delayed antiretroviral therapy (ART) initiation. We also generalize standard random change point models by incorporating nonlinear mixed-effects (NLME) specifications before and after the change point.
Although our algorithm uses MCMC-based imputation steps reminiscent of Bayesian methods, it is embedded within a maximum likelihood framework. This hybrid structure avoids full posterior sampling—particularly of high-dimensional variance components—resulting in improved computational efficiency and estimation stability. It also preserves desirable asymptotic properties of maximum likelihood estimators while allowing for flexible inference via stochastic approximation. In addition, although our primary emphasis has been on point estimation, the framework supports approximate inference via a score-based variance-covariance estimator, enabling the construction of confidence intervals and facilitating hypothesis testing for model parameters.
Our method is evaluated through simulation studies and real data analysis. Compared to empirical rule-based approaches, which often suffer from instability due to measurement error or sparse sampling, our model-based framework leverages the full structure of the data to yield more reliable estimates. Application to clinical HIV cohort data demonstrates the utility of the method, revealing that a substantial proportion of individuals initiate ART immediately upon diagnosis—a biologically meaningful finding in light of “test-and-treat” policies.
Beyond statistical modeling, the clinical context of ART initiation is critical. While immediate initiation is now the recommended standard, delays remain common due to factors such as patient hesitancy, anticipated side effects, co-occurring mental health or substance use issues, stigma, or gaps in follow-up care. By accommodating both immediate and delayed ART initiation, the zero-inflated change point model enhances public health relevance and interpretability.
On the computational side, our use of Metropolis-within-Gibbs sampling and Geweke diagnostics enables convergence within 3000 iterations, even for complex, high-dimensional missing-data structures. While our framework is flexible enough to incorporate other zero-inflated distributions (e.g., gamma, Weibull, or log-normal), we limited this work to the exponential case to preserve computational tractability. Future work will explore more efficient variants of the algorithm, such as independent-sample approaches [32], to accommodate these extensions.
Several methodological developments offer promising directions. One is the integration of machine learning (ML) techniques—for example, to model post-ART viral dynamics or improve scalability. While ML algorithms may enhance flexibility, they often lack the interpretability and inferential tools required for surveillance-oriented estimation tasks. Hybrid approaches combining ML with structured statistical models could offer the best of both worlds and merit future study.
Another important extension involves incorporating subject-level covariates (e.g., age, gender, or transmission risk) into the fixed-effects components of the model. This would allow parameters such as $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, $p$, and $\mu$ to vary across individuals. However, this generalization would require replacing the closed-form M-step with iterative procedures such as Newton–Raphson, significantly increasing computational demands, particularly in the presence of censoring and complex random effects. We are actively pursuing this extension using parallel computing strategies to support large-scale applications.
It will also be important to assess the robustness of our method under misspecified change point distributions. While we focused on the ZIE distribution here, the algorithm can, in principle, accommodate zero-inflated log-Gaussian, gamma, or Weibull alternatives, as mentioned above. Incorporating these would, however, increase the complexity of Metropolis–Hastings updates and reduce sampling efficiency. Simulation-based robustness assessments under alternative distributions are planned as the next step.