3.1. Simulation
In the simulation study, we considered the following scenarios to estimate the optimal screening age, lead time, and probability of overdiagnosis:
Four values of the probability of incidence: p = 0.05, 0.10, 0.15, 0.20
Three screening sensitivities: 0.80, 0.90, 0.95
Four mean sojourn times (MST): 1.5, 2.5, 5, 10 years
Three current ages: 55, 60, 65 years
The transition density is a log-normal probability density function multiplied by 0.3 (a sub-PDF):
\[ w(t) = \frac{0.3}{\sqrt{2\pi}\,\sigma t}\exp\left\{-\frac{(\log t-\mu)^2}{2\sigma^2}\right\}, \quad t > 0, \]
where t represents the duration time in the disease-free state. The constant 0.3 in the numerator of the transition density function was selected empirically to scale the log-normal distribution so that the cumulative probability aligns with the age-specific incidence observed in the NLST population for heavy smokers. It reflects the expected lifetime probability of disease progression in high-risk individuals and acts as the upper limit of the probability of being diagnosed with lung and bronchus cancer. See Wu, Erwin, and Rosner for a justification [14]. The parameters $\mu$ and $\sigma^2$ were chosen such that the mode of the transition density is at approximately 70 years of age.
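As a quick check on the stated mode: multiplying a density by a constant does not move its peak, so the sub-PDF peaks where the underlying log-normal does. The notation below is assumed for illustration, with $w(t)$ the transition density and $\mu$, $\sigma^2$ the log-scale location and variance of the log-normal:

\[ \int_0^\infty w(t)\,dt = 0.3, \qquad t_{\text{mode}} = \exp(\mu-\sigma^2) \approx 70 \;\Longleftrightarrow\; \mu-\sigma^2 \approx \log 70 \approx 4.25. \]

Any parameter pair satisfying the last relation places the mode near age 70; the constraint alone does not pin down a unique pair.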
The sojourn time follows a Weibull distribution, as described by Rahman and Wu [2]:
\[ q(x) = \alpha\lambda x^{\alpha-1}\exp\left(-\lambda x^{\alpha}\right), \quad x > 0, \]
where x represents the sojourn time in the pre-clinical state, and $\alpha$ and $\lambda$ represent the shape and scale parameters of the Weibull distribution, respectively. These values control the skewness and spread of the sojourn time distribution, which in turn affect the lead time and overdiagnosis outcomes, so they determine how tumor progression rates influence the optimal screening strategy. For the simulation study, parameter values were selected to achieve the desired mean sojourn times (MST). The chosen values of ($\alpha$, $\lambda$) are (3.47, 0.18), (1.56, 0.202), (2, 0.031), and (1.6, 0.021), with corresponding MSTs of 1.5, 2.5, 5, and 10 years, covering the range from fast-growing to slow-growing tumors.
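The link between the listed parameter pairs and the target MSTs can be checked numerically. The sketch below assumes the Weibull survival function has the form exp(-lam * x**alpha), an assumption that reproduces the stated MSTs; under it the mean is lam**(-1/alpha) * Gamma(1 + 1/alpha). The function name `weibull_mst` is illustrative only.

```python
import math

def weibull_mst(alpha: float, lam: float) -> float:
    """Mean sojourn time, assuming survival function exp(-lam * x**alpha)."""
    return lam ** (-1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha)

# (shape, scale) pairs and the target MSTs (years) quoted in the text
settings = [
    ((3.47, 0.18), 1.5),
    ((1.56, 0.202), 2.5),
    ((2.0, 0.031), 5.0),
    ((1.6, 0.021), 10.0),
]

msts = [weibull_mst(alpha, lam) for (alpha, lam), _ in settings]
for ((alpha, lam), target), mst in zip(settings, msts):
    print(f"alpha={alpha}, lambda={lam}: MST = {mst:.2f} (target {target})")
```

Each computed mean lands within a few hundredths of a year of its target, so the pairs are rounded choices rather than exact solutions.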
Table 1 shows the optimal initial screening age, obtained using the method in Section 2 and binary search, for different values of p, screening sensitivity, MST, and current age. For example, at a fixed sensitivity, with an MST of 2.5 years and a current age of 55, the optimal screening age ranges from 55.16 for p = 0.05 to 55.82 for p = 0.20. That is, to keep the probability of no clinical incidence before the first exam at 95%, the first screening should occur at age 55.16 (approximately two months after turning 55); to keep it at 80%, the first screening should be carried out at age 55.82 (about ten months after turning 55).
The results also show that as screening sensitivity increases from 0.80 to 0.95, the optimal initial screening age increases slightly when other factors remain unchanged. The optimal initial screening age also increases with higher incidence probability p and longer MST. It is further influenced by the current age: the interval between the current age and the first screening decreases as the current age increases, other factors being equal.
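The binary-search step can be sketched as follows. This is a minimal illustration, not the computation behind Table 1: the risk function of Section 2 is replaced by a hypothetical stand-in (`toy_risk`, a constant hazard chosen purely for illustration), and the only property used is that the probability of clinical incidence before the first exam is continuous and increasing in the screening age.

```python
import math
from typing import Callable

def first_screening_age(risk: Callable[[float], float], current_age: float,
                        p: float, max_age: float = 100.0,
                        tol: float = 1e-4) -> float:
    """Bisection for the age t in (current_age, max_age) with risk(t) = p.

    Assumes risk is continuous and increasing in t, with risk(current_age) = 0.
    """
    lo, hi = current_age, max_age
    if risk(hi) < p:
        return hi  # threshold is never reached before max_age
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if risk(mid) < p:
            lo = mid  # risk still below p: screen later
        else:
            hi = mid  # risk at or above p: screen earlier
    return 0.5 * (lo + hi)

# Hypothetical stand-in for the incidence probability (NOT the paper's model):
# constant hazard of 0.3 per year after the current age.
def toy_risk(t: float, current_age: float = 55.0, hazard: float = 0.3) -> float:
    return 1.0 - math.exp(-hazard * (t - current_age))

ages = {p: first_screening_age(toy_risk, 55.0, p) for p in (0.05, 0.10, 0.15, 0.20)}
for p, age in ages.items():
    print(f"p = {p:.2f}: first screening at age {age:.2f}")
```

Because the risk is monotone in the screening age, each halving of the interval keeps the solution bracketed, and a tolerance of 1e-4 years is far finer than the two-decimal ages reported in the tables.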
The primary objective is to find the optimal screening time. Once it is found, we can investigate the lead time distribution and the probability of overdiagnosis at that future screening time.
Table 2 presents the estimated mean, median, mode, and standard deviation of the lead time at the optimal first screening age for individuals with a current age of 55, 60, or 65 years.
From the result, the lead time distribution is not directly influenced by , p, and . However, both the lead time distribution and the probability of overdiagnosis depend on factors like , , and . The findings across the tables exhibit similar patterns, suggesting consistency in the results:
When the MST increases, the mean, median, and mode of the lead time also increase, indicating a longer expected time from screening to symptom onset.
The lead time distribution shows minimal dependence on the incidence probability p and the sensitivity when the optimal scheduling time is used.
As the current age increases, the mean, median, and mode of the lead time decrease, while the standard deviation remains relatively unchanged. This suggests that with increasing age, the expected time from screening to symptom onset becomes shorter, indicating potentially more rapid disease progression, but the variability in lead time remains consistent.
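The four summaries reported for the lead time can all be read off a density tabulated on a grid. The sketch below shows one way to do this with the trapezoidal rule; the density used is a hypothetical gamma curve chosen only so that the answers are known, not the lead time density of the model, and the function name `summarize_density` is illustrative.

```python
import math

def summarize_density(grid, dens):
    """Mean, median, mode, and SD of a density tabulated on an increasing grid."""
    steps = [grid[i + 1] - grid[i] for i in range(len(grid) - 1)]

    def integrate(values):
        # Trapezoidal rule over the tabulated values.
        return sum(0.5 * (values[i] + values[i + 1]) * h
                   for i, h in enumerate(steps))

    total = integrate(dens)  # renormalise in case the grid truncates the tail
    mean = integrate([z * d for z, d in zip(grid, dens)]) / total
    second = integrate([z * z * d for z, d in zip(grid, dens)]) / total
    sd = math.sqrt(second - mean ** 2)

    # Median: first grid point where the cumulative mass reaches one half.
    cum, median = 0.0, grid[-1]
    for i, h in enumerate(steps):
        cum += 0.5 * (dens[i] + dens[i + 1]) * h
        if cum >= 0.5 * total:
            median = grid[i + 1]
            break

    # Mode: grid point with the highest tabulated density.
    mode = grid[max(range(len(grid)), key=lambda i: dens[i])]
    return {"mean": mean, "median": median, "mode": mode, "sd": sd}

# Hypothetical density (gamma, shape 2, scale 2), for illustration only.
grid = [0.01 * i for i in range(6001)]
dens = [0.25 * z * math.exp(-z / 2.0) for z in grid]
stats = summarize_density(grid, dens)
print({name: round(value, 2) for name, value in stats.items()})
```

For a right-skewed density such as this one the mode lies below the median, which lies below the mean, so reporting all three gives a sense of the skewness as well as the location.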
Figure 1 shows the lead time PDF curves under different factors: incidence probability p, sensitivity, MST, and current age. The figure consists of four panels, each depicting the estimated lead time density when the optimal first screening age is used. In each panel, three factors are fixed and the fourth varies.
The results demonstrate that, when the optimal first screening age is used, the lead time distribution changes very little with the incidence probability p and the sensitivity. However, it varies notably with the current age and the MST. Specifically, as the current age increases, the mean, median, and mode of the lead time decrease slightly. Conversely, as the MST increases, the central location of the lead time distribution shifts towards higher values.
Table 3 displays the estimated probability of overdiagnosis (in percent) when the optimal initial scheduling age is used. Specifically, if an individual undergoes the first screening exam at the age given in Table 1 and is subsequently diagnosed with cancer, the probability of overdiagnosis is given by the corresponding value in Table 3.
The probability of overdiagnosis increases with higher MST, higher incidence probability p, and older current age, but shows minimal variation with sensitivity. When the MST is at most 1.5 years, the probability of overdiagnosis is typically below 4%, which can be considered negligible. Overall, the probability of overdiagnosis is small, with the largest observed value, for an MST of 10 years, still below 15%.
3.2. Application
We applied the method to the NLST chest X-ray screening data. Rahman and Wu (2021) used a likelihood function and parametric link functions to estimate the
[
2]. The sensitivity was estimated by the epidemiologic method:
Although denoted as , sensitivity is assumed to be a constant value in this model to simplify the analysis and reflect an average test performance across ages. This is consistent with common practice in screening modeling where age-specific estimates are not available or stable. Additionally, K refers to the number of risk strata or groups (e.g., by age or smoking status) in the aggregated NLST chest X-ray data.
The parametric link functions of
,
, and
were the same as presented in Equations (
13)–(
15). The unknown parameters
were estimated using the Markov Chain Monte Carlo (MCMC) method with a Gibbs sampler and likelihood function [
2]. Initially, 200,000 samples were generated. After discarding the first 30,000 samples as burn-in and applying thinning every 200 iterations, we obtained a posterior sample of 500 from each chain. Running three initially overdispersed chains resulted in a total of 1500 Bayesian posterior samples
for each gender.
For each gender (male and female), we applied Equation (1) to each of the 1500 posterior samples, for each current age considered and for p = 0.05, 0.10, 0.15, and 0.20, to determine the optimal scheduling time. Once scheduling times were obtained for the 1500 MCMC samples, we also estimated the lead time distribution, the probability of overdiagnosis, and the probability of true-early detection.
Table 4 summarizes the mean, standard error, and 95% highest posterior density (HPD) interval of the future screening age (in years) using the NLST X-ray data for male and female heavy smokers. The results show that the optimal first screening times are very close for both genders under the same current age and incidence probability p. However, male heavy smokers tend to have slightly longer optimal screening times than their female counterparts.
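The text does not state how the HPD intervals were computed. A common empirical approach, sketched here under the assumption of a unimodal posterior, takes the shortest interval that contains 95% of the sorted posterior draws. The draws below are synthetic and serve only to exercise the hypothetical function `hpd_interval`.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the sorted samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws the interval must cover
    # Slide a window of k consecutive order statistics; keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Synthetic, right-skewed "posterior draws" (exponential quantiles), 1500 in total.
draws = [-math.log(1.0 - (i + 0.5) / 1500) for i in range(1500)]
lower, upper = hpd_interval(draws, 0.95)
covered = sum(lower <= x <= upper for x in draws)
print(f"95% HPD: ({lower:.4f}, {upper:.4f}), covering {covered} of {len(draws)} draws")
```

For a skewed posterior the HPD interval is shorter than the equal-tailed interval because it shifts toward the region of highest density; for a symmetric posterior the two nearly coincide.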
After determining the optimal first screening time, the posterior distribution of the lead time was obtained as the average distribution across the pairs
, where
taking the following form:
Table 5 presents the mean, median, mode, and standard deviation of the lead time, calculated from this averaged posterior distribution. In general, male heavy smokers exhibit slightly longer mean lead times than their female counterparts under similar conditions.
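The averaging step described above can be written compactly. The notation is assumed here for illustration: $\theta_j$ denotes the j-th posterior draw, $t_j$ the optimal first screening time computed from it, $f_L(z \mid \theta_j, t_j)$ the lead time distribution for that pair, and $n = 1500$ the number of posterior draws per gender.

\[ \hat{f}_L(z) = \frac{1}{n}\sum_{j=1}^{n} f_L\left(z \mid \theta_j, t_j\right), \qquad n = 1500. \]

Averaging the distributions over the draws, rather than plugging posterior mean parameters into a single distribution, carries the parameter uncertainty through to the reported mean, median, mode, and standard deviation.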
Figure 2 displays the estimated lead time density curves using the NLST X-ray data for different current ages and incidence probabilities. The lead time curves barely change with the incidence probability p when the optimal scheduling time is used. However, the density curves vary with the current age: older current ages give higher peaks in the density curve and slightly smaller mode values.
Each posterior sample, together with its corresponding optimal screening time, is used to estimate the probability of overdiagnosis. The posterior mean, standard error, and 95% highest posterior density (HPD) interval of this probability are listed in Table 6. The probability of true-early detection is 1 minus the probability of overdiagnosis.
The probability of overdiagnosis at the first screening for heavy smokers, based on the NLST X-ray data, is very low: below 3% in every scenario considered. It increases slightly with age and with the incidence probability p, and is slightly higher for male than for female heavy smokers. Overdiagnosis is therefore not a significant concern at the first chest X-ray screening exam for heavy smokers.