## 3. Results

In each simulation setting, power for the N-of-1, parallel RCT, and crossover designs was computed for maximum sample sizes of $n = 10, 20, 30, 40, 50, 100, 150, 200$ with a fixed treatment effect $\tau = 0.25$. After finding the value of $n$ that produced power above 0.8 for the N-of-1 design, we fixed this sample size and varied $\tau = 0, 0.05, 0.10, \dots, 1$ to determine how each design performed as the true treatment effect increased. Here $\tau = 0$ represents a case where the treatment provides no true improvement over placebo, so that rejection of the null hypothesis constitutes a type I error.
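As a hedged illustration of this procedure, the sketch below estimates empirical power for a simplified N-of-1 design by Monte Carlo simulation. It replaces the study's mixed-model analysis with a t-test on within-cycle paired differences, and the variance components (`sd_patient`, `sd_error`), cycle count, and critical value are illustrative assumptions, not the study's actual settings.

```python
import math
import random
import statistics

def simulate_nof1_trial(n_patients, tau, cycles=3,
                        sd_patient=0.5, sd_error=0.5, rng=None):
    """Simulate one N-of-1 trial and return True if the null is rejected.

    Each patient contributes `cycles` placebo/treatment outcome pairs;
    the within-patient paired differences are tested with a t-test,
    a simplification of the mixed-model analysis used in the study.
    """
    rng = rng or random
    diffs = []
    for _ in range(n_patients):
        b_i = rng.gauss(0.0, sd_patient)  # random patient effect; cancels in the differences
        for _ in range(cycles):
            placebo = b_i + rng.gauss(0.0, sd_error)
            treated = b_i + tau + rng.gauss(0.0, sd_error)
            diffs.append(treated - placebo)
    mean_d = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return abs(mean_d / se) > 1.96  # approximate two-sided 5% critical value

def empirical_power(n_patients, tau, n_sims=500, seed=0):
    """Fraction of simulated trials that reject the null hypothesis."""
    rng = random.Random(seed)
    rejections = sum(simulate_nof1_trial(n_patients, tau, rng=rng)
                     for _ in range(n_sims))
    return rejections / n_sims
```

Under these illustrative variance settings, $n = 30$ yields power comfortably above 80% at $\tau = 0.25$, and at $\tau = 0$ the rejection rate approximates the nominal type I error.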

First, we examined how N-of-1, parallel RCT, and crossover trials performed when all patients come from the population of interest, with true treatment effect $\tau = 0.25$, no washout or carryover effects, and a placebo vs. treatment comparison over $c = 3$ cycles.

Figure 1 displays the power of the 3 designs as a function of trial sample size for the 4 different variance structure scenarios.

Across the 4 scenarios, the N-of-1 sample sizes needed to achieve 80% power with a main effect of $\tau = 0.25$ were 30, 30, 30, and 100, respectively. For scenarios 1 and 2, the parallel RCT achieved 80% power with $n = 150$ patients and the crossover design with $n = 100$ patients; at both of these sample sizes, the power of the N-of-1 design was 100%. In scenario 1 at $n = 30$, where the N-of-1 design had a power of 92%, the parallel RCT and crossover designs had powers of 32% and 50%, respectively. In scenario 2, the power of the N-of-1 design was again 92% at $n = 30$, compared to 26% and 52% for the parallel RCT and crossover designs, respectively. In scenario 3, the parallel RCT did not achieve power above 80% for any sample size considered, and the crossover design required 100 patients to obtain a power of 94%, compared to a parallel RCT power of 70% at $n = 50$. In scenario 4, where the random patient effect variance and error variance were largest, neither the parallel nor the crossover design achieved power above 80%, with empirical power values of 34% and 70%, respectively, at a sample size of $n = 200$. The power of the N-of-1 design at this sample size was 99%.

Figure 2 displays the power of a representative sample without washout effects for a fixed sample size in each scenario and a varying treatment effect $\tau$. The sample size used in each scenario corresponded to the minimum sample size needed to achieve at least 80% power for the N-of-1 design. For an effect size of $\tau = 0$, the empirical type I error probabilities for the N-of-1 design were 0.05, 0.05, 0.05, and 0.05 across the 4 scenarios, respectively; for the parallel RCT they were 0.05, 0.06, 0.04, and 0.05; and for the crossover design they were 0.07, 0.07, 0.06, and 0.05. The N-of-1 design thus best matched the empirical type I error probability to the desired nominal level. By design, $\tau = 0.25$ produced power above 80% for the N-of-1 design at the sample sizes considered. To achieve a power of at least 80% across the 4 variance scenarios, the parallel RCT required $\tau = 0.60, 0.50, 0.80, 0.65$ and the crossover design required $\tau = 0.40, 0.40, 0.40, 0.45$.

Next, we examined the operating characteristics of each design in the presence of carryover effects. We increased a patient's outcome by 0.05, 0.1, or 0.15 if they had just received the new therapy; these represented small, medium, and large carryover effects relative to the true treatment effect of 0.25. By increasing outcomes in this manner, we do not model cycle length or washout periods, removing the time-washout relationship from this simulation study. Patient outcomes could therefore be increased for placebo or therapy outcomes within a cycle, or during the next cycle in N-of-1 designs. For example, if a patient received treatment in the first period of a cycle, their next outcome within the cycle, under placebo, is increased by the carryover effect. Similarly, if treatment is given at the end of a cycle, the first outcome of the next cycle, whether under treatment or placebo, is increased by the carryover effect. This did not affect parallel RCT operating characteristics, as each patient receives only placebo or treatment and contributes a single observation. For the crossover design, carryover effects occurred only if a patient received the new therapy before the placebo, whereas in N-of-1 studies carryover effects could occur both within and between cycles.
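The carryover mechanism described above can be sketched as follows: any outcome whose immediately preceding period, within or across cycles, was the new therapy gets bumped by the carryover amount. The flat 0/1 sequence encoding and function name are our own illustrative choices, not the paper's implementation.

```python
def apply_carryover(sequence, outcomes, carryover=0.10):
    """Add a carryover bump to any outcome whose previous period was
    the new therapy (1 = therapy, 0 = placebo), within or across cycles.

    `sequence` and `outcomes` are flat per-period lists for one patient,
    e.g. 3 cycles of (therapy, placebo) -> sequence [1, 0, 1, 0, 1, 0].
    """
    adjusted = list(outcomes)
    for t in range(1, len(sequence)):
        if sequence[t - 1] == 1:  # previous period was the new therapy
            adjusted[t] += carryover
    return adjusted
```

Note that the very first period of a trial can never receive a bump, and a parallel RCT (one observation per patient) is unaffected, matching the behavior described in the text.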

Figure 3 displays the power for sample sizes of $n = 10, 20, 30, 40, 50, 100, 150, 200$ and a fixed treatment effect $\tau = 0.25$ for the three carryover effect sizes considered.

For a moderate carryover effect of 0.1, the sample sizes required for the N-of-1 design to achieve at least 80% power increased to $n = 40, 40, 40, 150$, respectively. The power of the crossover design at these sample sizes was 45%, 45%, 43%, and 43%, which was still higher than that of the parallel RCT, even though the parallel RCT's operating characteristics were unchanged by the carryover effect. The N-of-1 design required sample sizes of $n = 30, 30, 30, 150$ to achieve 80% power with a carryover effect of 0.05, and $n = 50, 50, 100, 200$ for a carryover effect of 0.15; in the latter case, the power at $n = 200$ was exactly 80%. At the sample sizes required for 80% N-of-1 power, the power of the crossover design was 45%, 45%, 42%, and 48% for the small carryover effect and 45%, 46%, 63%, and 44% for the large carryover effect, across the four scenarios, respectively. When a true treatment effect was present, the N-of-1 designs had higher power to detect a difference than the traditional parallel RCT and crossover designs for any carryover effect size, provided this effect was not larger than the treatment effect. For the large carryover effect of 0.15, the power curves for the parallel RCT and crossover designs crossed, indicating that with a large enough sample size the parallel RCT outperforms the crossover design. This is likely because the crossover design incurs a carryover effect only when placebo is given after treatment, which reduces the chance of detecting a true treatment effect.

Next, we examined the operating characteristics for varying $\tau$ at each sample size required to achieve 80% power in the N-of-1 design for the small, moderate, and large carryover effect sizes of 0.05, 0.10, and 0.15. This also allowed us to examine design performance as the treatment effect changes relative to the carryover effect. First, we note that the empirical probability of a type I error was inflated for both the N-of-1 and crossover designs compared to the parallel RCT. For a moderate carryover effect of 0.1, the N-of-1 design had empirical type I error probabilities of 15%, 15%, 14%, and 14%, and the crossover design had probabilities of 11%, 12%, 11%, and 9%. For a small carryover effect of 0.05, the N-of-1 and crossover designs had better controlled type I error probabilities of 8%, 7%, 6%, 7% and 7%, 7%, 6%, 6%, respectively. For a large carryover effect of 0.15, which is over half the true treatment effect size, the type I error probabilities across the 4 scenarios were 25%, 26%, 27%, 23% for the N-of-1 design and 17%, 17%, 16%, 15% for the crossover design. These inflations are entirely due to the carryover effect and can be seen in the upward-curving left tail of the 4 plots in Figure 4. For the crossover design, if a patient receives placebo after treatment, the estimated treatment effect will be negative when $\tau = 0$. Likewise, in the N-of-1 design without washout, the placebo outcome within a cycle is biased upward if the new therapy is given first, and the therapy outcome is biased upward if treatment is given at the end of one cycle and again at the beginning of the next. Additionally, we fit an N-of-1 model that adjusted for carryover effects by including an additional binary fixed effect indicating whether the patient received treatment in the previous treatment period. The type I error and corresponding power for different values of $\tau$ were nearly identical when controlling for carryover effects in this way. These results indicate that special care should be taken to control for carryover effects, particularly to avoid type I error inflation. Still, when $\tau > 0$, the power is higher for the N-of-1 design at any effect size, with the crossover design achieving at least 80% power at $\tau = 0.40$ in each scenario. In practice, data from days on which a carryover effect could be present are typically excluded from the analysis.
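The carryover-adjusted model described above can be sketched in our own notation (the paper's exact parameterization may differ): the outcome $y_{it}$ for patient $i$ in period $t$ gains a binary fixed effect for prior-period treatment,

$$y_{it} = \mu + \tau\, x_{it} + \gamma\, x_{i,t-1} + b_i + \varepsilon_{it},$$

where $x_{it}$ indicates whether patient $i$ received the new therapy in period $t$, $\gamma$ captures the carryover effect, $b_i$ is the random patient effect, and $\varepsilon_{it}$ is the error term. With $\gamma$ absorbing the carryover bump, the estimate of $\tau$ is no longer contaminated by residual treatment effects.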

Finally, we examined the problems that N-of-1 trials can encounter when the sample is not representative of the population of interest (i.e., under selection bias). We have already demonstrated that if samples are drawn from the population of interest, N-of-1 trials attain the desired power with fewer patients than parallel RCTs or crossover designs, and maintain type I error control when carryover effects are absent.

Consider an N-of-1 trial of a new therapy versus placebo targeting improvements in psychological health among adults aged 25–50 years. It is plausible that the enrolled sample contains individuals with disproportionately high or low baseline risk of poor psychological health relative to our target population. If we sample from such a sub-population, our statistical conclusions about the population of interest may be incorrect.

To test performance under selection bias, we performed the following simulation experiment using the 4 error variance structures described above. With probability $p$, we sample patients for our N-of-1 trial from the population of interest, which has no treatment advantage ($\tau = 0$). With probability $1 - p$, we sample patients from a sub-population that has a treatment effect of $\tau = 0.25$. The probability that we falsely conclude that a treatment effect is present in the population of interest (i.e., make a type I error) is plotted in Figure 5 for varying $p$ and an N-of-1 sample size of 30.
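The mixture sampling in this experiment can be sketched as below; the function name and signature are illustrative, not from the paper. Each patient's true effect is drawn once at enrollment and then fixed for all of that patient's observations.

```python
import random

def sample_patient_taus(n_patients, p, tau_sub=0.25, rng=None):
    """Draw each patient's true treatment effect: with probability p the
    patient comes from the population of interest (tau = 0), otherwise
    from the sub-population with effect tau_sub."""
    rng = rng or random
    return [0.0 if rng.random() < p else tau_sub
            for _ in range(n_patients)]
```

At $p = 1$ every patient comes from the population of interest, so rejections at the nominal 5% level are genuine type I errors; as $p$ falls, the sub-population's real effect leaks into the estimate.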

When $p = 1$, the type I error is 0.05, as desired, for each of the 3 designs in each scenario. As $p \to 0$, however, the empirical type I error probability increases, particularly for the N-of-1 design. When $p = 0.7$, meaning that on average 9 patients are incorrectly sampled from the sub-population, the type I error probabilities for the parallel RCT, crossover, and N-of-1 designs are (7%, 6%, 6%, 6%), (11%, 11%, 9%, 7%), and (19%, 19%, 17%, 9%), across the 4 scenarios, respectively.

When $p = 0.5$, meaning that on average 15 patients are incorrectly sampled from the sub-population, the corresponding type I error probabilities are (9%, 10%, 7%, 6%), (18%, 19%, 17%, 9%), and (37%, 39%, 40%, 13%). This indicates that special care must be taken to ensure that the sample represents the population of interest for the crossover design, and especially for the N-of-1 design: a greater number of observations from the same patient compounds the error caused by non-representative sampling.

Next, we examined power in a similar manner, when the population of interest truly has a treatment effect of $\tau = 0.25$ and patients from some sub-population have a true treatment effect of $\tau = 0$ (i.e., the treatment does not work for the sub-population). Figure 6 shows the power for varying $p$ and a fixed sample size of 30.

We see that for all 4 scenarios, the power increases dramatically as $p$ increases; patients must represent the population of interest for conclusions from an N-of-1 trial to generalize. If $p = 0.8$, meaning that about six patients are sampled from the non-representative sub-population, the power of the N-of-1 design is (81%, 81%, 80%, 30%) across the 4 scenarios, compared to (18%, 17%, 12%, 8%) and (37%, 36%, 34%, 14%) for the parallel RCT and crossover designs, respectively. Relative to the parallel RCT and crossover designs, non-representative sampling from a sub-population not of interest thus affects the N-of-1 design's type I error probability much more than its power.

In practice, we will never know whether patients enrolled in a trial are representative of our treatment population. While N-of-1 designs exacerbate this problem compared to parallel RCT and crossover designs by leveraging multiple observations per patient under two different treatments, they also allow better estimation of patient-level random effects, which could be used in future methodological work to better determine which patients are representative of the target population. To illustrate this, we simulated 1000 trials from scenario 1 with $n = 100$ and $p = 0.5$, i.e., about 50% of enrolled patients have no treatment effect, whereas the other 50% have a treatment effect of 0.25. In each simulation, we computed the average random effect for representative and non-representative patients using both the N-of-1 and crossover designs. Parallel RCTs cannot estimate patient-level random effects because they have only one observation per patient.
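A crude, hedged stand-in for this random-effect comparison is shown below: each patient's effect is estimated as their mean outcome minus the grand mean. A real analysis would use mixed-model BLUPs, but this simplification captures why repeated observations per patient (which parallel RCTs lack) make group-level shifts detectable.

```python
import statistics

def patient_effect_estimates(outcomes_by_patient):
    """Crude stand-in for mixed-model random-effect estimates (BLUPs):
    each patient's mean outcome minus the grand mean over all outcomes.

    `outcomes_by_patient` is a list of per-patient outcome lists.
    """
    grand_mean = statistics.fmean(
        y for ys in outcomes_by_patient for y in ys)
    return [statistics.fmean(ys) - grand_mean
            for ys in outcomes_by_patient]
```

Comparing the averages of these estimates between suspected sub-groups, as in Figure 7, reveals a density shift when one group systematically differs from the other.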

Figure 7 displays the density of the differences in average random effects between the non-representative and representative patients for both the crossover and N-of-1 designs. We see that the N-of-1 design correctly identified that the individual random effects of the non-representative group are higher than for our population of interest, as indicated by the shift in densities. These results were similar, but less striking, for $n = 30$, and more apparent with more cycles in the N-of-1 design. While these are individual random effects, not individual treatment effects, these results suggest that future methodological advances may be able to cluster patient treatment effects to determine the existence of subgroups within patient cohorts.