Sample Size Calculations in Simple Linear Regression: A New Approach

The problem tackled is the determination of sample size for a given level and power in the context of a simple linear regression model. The standard approach deals with planned experiments in which the predictor X is observed n times and the corresponding observations on the response variable Y are to be drawn. The statistic used is built on the least squares estimator of the slope parameter, and its conditional distribution given the data on the predictor X is utilized for sample size calculations. This is problematic: the conditional distribution presumes that the sample size n is already set and that the data on X are fixed. In unplanned experiments, in which both X and Y are to be sampled simultaneously, we do not yet have data on the predictor X. This conundrum has been discussed in several papers and books with no solution proposed. We overcome the problem by determining the exact unconditional distribution of the test statistic in the unplanned case. We provide tables of critical values for given levels of significance following the exact distribution. In addition, we show that the distribution of the test statistic depends only on the effect size, which is defined precisely in the paper.


Introduction
Multiple regression is one of the core methodologies in statistics. Power computation and sample size determination have become an integral part of many research proposals submitted for funding. Funding agencies such as UKRI (UK Research and Innovation) and NIH (National Institutes of Health) have been demanding sample size calculations in all prospective proposals. Regression has a long history dating back to Galton [1]. Horton and Switzer [2] reported that 51% of research articles published in the New England Journal of Medicine during May 2004 used multiple regression as one of their methods. The corresponding figure for power analysis was 39%.
In this paper, we focus on power computation in the context of simple linear regression. The current approach in power computations lacks justification. We will point out difficulties in this setting [3].
Simple linear regression is ubiquitous in pediatric clinical diagnostics. The model sets standards for normal growth in children on several metrics [4]. As an illustration, a pediatrician wants to check whether the lung function of a 13-year-old patient is normal. Data is to be collected on healthy subjects in the age range 12-14 years with response, Y = FEV (Forced Expiratory Volume) and predictor, X = Height, which is an example of an unplanned experiment.
In order to trust the model, we need to decide on the sample size, which, in turn, depends on the level of significance, the power, and the effect size.
First, we investigate the setting under the simple linear regression paradigm. The model has two entities: X, the predictor, and Y, the response variable. It is stated as $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$ for some $\beta_0$, $\beta_1$, and $\sigma^2 > 0$. The null hypothesis of interest is $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. What should the required sample size n be for a given level of significance α, power 1 − β, and alternative value A of $\beta_1$? Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a potential sample for the testing problem. Let $\hat\beta_1$ be the least squares estimator of $\beta_1$, i.e., $\hat\beta_1 = S_{XY}/S_{XX}$, where $S_{XY} = \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)$ and $S_{XX} = \sum_{i=1}^{n} (X_i - \bar X)^2$. Let RSS be the residual sum of squares, i.e., $RSS = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$ with $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$. For testing the null hypothesis $H_0$, the following test statistic is used: $T = \hat\beta_1 \sqrt{S_{XX}} / \sqrt{RSS/(n-2)}$. Under the null hypothesis, conditioned on the X-data, T has a t-distribution with n − 2 degrees of freedom. Under the alternative value $\beta_1 = A$, T has a non-central t-distribution with n − 2 degrees of freedom and non-centrality parameter $\lambda = |A| \sqrt{S_{XX}}/\sigma$. We reject the null hypothesis if and only if $|T| > t_{n-2,\,1-\alpha/2}$, where $t_{n-2,\,1-\alpha/2}$ is the point such that the area under Student's t-curve with n − 2 degrees of freedom to its left is 1 − α/2.
The power formula is given by $\mathrm{Power} = \Pr(|T_{n-2,\lambda}| > t_{n-2,\,1-\alpha/2})$, where $T_{n-2,\lambda}$ denotes a random variable having the non-central t-distribution with n − 2 degrees of freedom and non-centrality parameter λ. We can set the power equal to 1 − β and solve for n. This works as long as we know what $\lambda = |A| \sqrt{S_{XX}}/\sigma$ is, which requires knowledge of the alternative value of $\beta_1$, $\sigma^2$, and $S_{XX}$. In unplanned experiments, we will not know what $S_{XX}$ is prior to data collection. Equivalently, one would have to spell out what λ is. This is a tall order. Adcock [5] recognized these problems. Some software and textbooks assume that $(1/n) \sum_{i=1}^{n} (X_i - \bar X)^2$ is known; for example, the software PASS [6] and nQuery [7] proceed this way. To overcome these difficulties, we derive the exact unconditional distribution of a variant of T. This requires knowledge of the distribution of X. Let $\sigma_X^2$ be the variance of X.
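To make the dependence on $S_{XX}$ concrete, the following minimal sketch estimates the conditional power of the two-sided t-test by Monte Carlo. It draws from the non-central t-distribution through its standard representation $(Z + \lambda)/\sqrt{V/(n-2)}$, with $Z \sim N(0,1)$ and $V \sim \chi^2_{n-2}$ independent. The function names, the numerical t-quantile routine, and the example inputs are illustrative assumptions; they are not taken from PASS, nQuery, or the R code in the Supplementary Materials.

```python
import math
import random


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF of Student's t via Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p, df):
    """Upper-half quantile (0.5 < p < 1) by bisection; assumes it lies below 200."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def conditional_power(n, slope, sigma, sxx, alpha=0.05, reps=20000, seed=1):
    """Monte Carlo estimate of Pr(|T| > t_{n-2, 1-alpha/2}) for a given S_XX."""
    rng = random.Random(seed)
    df = n - 2
    lam = abs(slope) * math.sqrt(sxx) / sigma  # non-centrality parameter
    crit = t_quantile(1 - alpha / 2, df)
    hits = 0
    for _ in range(reps):
        z = rng.gauss(0.0, 1.0)
        v = rng.gammavariate(df / 2, 2.0)  # chi-squared with df degrees of freedom
        hits += abs((z + lam) / math.sqrt(v / df)) > crit
    return hits / reps


if __name__ == "__main__":
    # Same n, slope and sigma; only the (unknown in advance) S_XX changes.
    for sxx in (10.0, 20.0, 40.0):
        print(sxx, conditional_power(n=20, slope=0.5, sigma=1.0, sxx=sxx))
```

The loop at the end varies only `sxx`, which is the quantity that is unavailable before data collection in an unplanned experiment.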
We modify the test statistic to $T = \hat\beta_1 \hat\sigma_X/\hat\sigma$, where $\hat\sigma_X$ and $\hat\sigma$ estimate $\sigma_X$ and σ, respectively.
We obtain the unconditional distribution of T under $\beta_1 = 0$ as well as under $\beta_1 = A \neq 0$. We assume $X \sim N(\mu_X, \sigma_X^2)$, with both parameters unknown. Under this assumption, the distribution of T is derived.
In due course, we will show that the distribution of T when $\beta_1 = A \neq 0$ depends only on $\delta = |A| \sigma_X/\sigma$, which we can deem the effect size.
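A short calculation, offered here as a heuristic and not as the derivation used later in the paper, indicates why δ is a natural effect size. Under the assumption $X \sim N(\mu_X, \sigma_X^2)$, the ratio $S_{XX}/\sigma_X^2$ has a $\chi^2_{n-1}$ distribution, so the conditional non-centrality parameter is itself random before the data are collected. The symbol W is introduced only for this illustration.

```latex
\lambda \;=\; \frac{|A|\sqrt{S_{XX}}}{\sigma}
        \;=\; \frac{|A|\,\sigma_X}{\sigma}\,\sqrt{\frac{S_{XX}}{\sigma_X^{2}}}
        \;=\; \delta\,\sqrt{W},
\qquad W \sim \chi^{2}_{n-1},
\qquad E\!\left(\lambda^{2}\right) \;=\; \delta^{2}\,(n-1).
```

Only δ and n enter the distribution of λ, which is consistent with the claim that the unconditional distribution depends on the model parameters through δ alone.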
The five-parameter model now is: $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$ and $X \sim N(\mu_X, \sigma_X^2)$. Note that the vector (X, Y) has a bivariate normal distribution.
The paper is organized as follows. In Section 2, we provide a literature review. In Section 3, we outline the main results. We derive the unconditional distribution of T under the null hypothesis in Section 3.1. In Section 3.2, we calculate critical values using the main results. In Section 3.3, we lay out the sample sizes required for a given level, power, and effect size $\delta = |A| \sigma_X/\sigma$. In Section 4, we summarize the results and draw conclusions. The computational details, along with the R code [8], are presented in the Supplementary Materials.

Literature Review
Ryan [3] has pointed out difficulties in power calculations in the environment of simple linear regression. The problem is how we handle the predictor X. Adcock [5] has looked at some possible scenarios. One scenario is that the investigator knows the (deterministic) $X_i$-values for every sample size n. In such a case, the test statistic is eminently usable for power calculations. Its (conditional) null and non-null distributions have been worked out explicitly. The conditional approach is also followed by Dupont et al. [9], Draper et al. [10], Hsieh et al. [11], Maxwell [12], and Thigpen [13].
As an alternative to the test statistic (2), we can build a test based on the sample correlation coefficient $\hat\rho$ [14], under the joint normality of X and Y. The null and non-null distributions of the underlying test statistic based on $\hat\rho$ have been worked out explicitly. In our consulting work, many researchers prefer to use the test based on $\hat\beta_1$. It is a choice between causality and association [3,14-21]. The hypotheses $H_0: \beta_1 = 0$ and $H_0: \rho = 0$ are equivalent under bivariate normality, but the test statistics are different. It is easy to determine the sample size under the correlation context [14]. However, this sample size cannot be offered for testing the hypothesis on the slope, because the power of the slope test at that sample size is lower. In other words, test hopping is not permissible: they are two different tests with distinct power functions.

Outline of Results
We will now derive the unconditional distribution of $\hat\beta_1$, which will be instrumental in sample size calculations. We use the test statistic $T = \hat\beta_1 \hat\sigma_X/\hat\sigma$.
Under the null hypothesis $\beta_1 = 0$, we show that the distribution of T can be expressed through four random variables $W_1 \sim \chi^2_1$, $W_2 \sim \chi^2_{n-1}$, $W_3 \sim \chi^2_{n-2}$, and $W_4 \sim \chi^2_{n-1}$, with the $W_i$ being mutually independent. It follows that $T \sim \{(n-2)/(n-1)\}\, U_1 U_2/U_3$ with $U_1$, $U_2$, $U_3$ independently distributed, $U_1 \sim t_{n-1}$, $U_2 \sim \chi_{n-1}$, and $U_3 \sim \chi_{n-2}$, where $\chi_{n-1}$ is the χ distribution with n − 1 degrees of freedom. We use this result to obtain the critical values of the test based on T for given levels. For power and sample size computations, we need the distribution of T for any given value of $\beta_1$. In principle, this distribution depends on the alternative value of $\beta_1$, $\sigma_X^2$, and $\sigma^2$. It turns out that it depends on them only through $\delta = \beta_1 \sigma_X/\sigma$, which we can deem the effect size. The specification of δ facilitates the computation of power. Despite all these deliberations, no explicit formula for power surfaces. Still, knowing the distribution of $T^2$ once δ is spelled out eases the pain a little.

Distributional Results
In this section, we will derive the distribution of T of (1) unconditionally. The following series of steps will give the desired result.
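RSS and $S_{XX}$ are independent.
More generally, we obtain the distribution of $T = (\hat\beta_1 - \beta_1)\hat\sigma_X/\hat\sigma$ for a given value of $\beta_1$.
The joint density function of $\hat\beta_1$ and $S_{XX}$ is the product of the conditional normal density of $\hat\beta_1$ given $S_{XX}$, namely that of $N(\beta_1, \sigma^2/S_{XX})$, and the density of $S_{XX}$, where $S_{XX}/\sigma_X^2 \sim \chi^2_{n-1}$. Integrating out $S_{XX}$, the (unconditional) marginal density of $\hat\beta_1$ is given by: $f(\hat\beta_1) = \frac{\Gamma(n/2)}{\sqrt{\pi}\,\Gamma((n-1)/2)} \cdot \frac{\sigma_X}{\sigma} \cdot \left[1 + \sigma_X^2 (\hat\beta_1 - \beta_1)^2/\sigma^2\right]^{-n/2}$, $-\infty < \hat\beta_1 < \infty$. Some properties of this density are clear to observe. For example, the distribution is symmetric around the true value $\beta_1$. If n = 2, the distribution is Cauchy. In addition, $E(\hat\beta_1) = \beta_1$ for n > 2. Further, if n > 3, unconditionally, $\mathrm{Var}(\hat\beta_1) = \sigma^2/\{(n-3)\sigma_X^2\}$. In the conditional set-up, $\mathrm{Var}(\hat\beta_1 \mid X_1, X_2, \ldots, X_n) = \sigma^2/S_{XX}$.
It follows that $\sqrt{n-1}\,(\hat\beta_1 - \beta_1)\,\sigma_X/\sigma \sim t_{n-1}$.
An alternative form of the distribution can be written in terms of Beta II, the beta distribution of the second kind [22].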
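The unconditional behaviour of the slope estimator can be checked numerically. The text states that $\sqrt{n-1}\,(\hat\beta_1 - \beta_1)\,\sigma_X/\sigma$ has a $t_{n-1}$ distribution; since a $t_\nu$ variable has variance ν/(ν − 2), this implies $\mathrm{Var}(\hat\beta_1) = \sigma^2/\{(n-3)\sigma_X^2\}$ for n > 3. The sketch below samples (X, Y) jointly and compares the empirical variance of $\hat\beta_1$ with that value. The function names and the parameter values are illustrative assumptions.

```python
import random
import statistics


def least_squares_slope(xs, ys):
    """Return the least squares slope S_XY / S_XX."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx


def simulate_slopes(n, beta0, beta1, mu_x, sigma_x, sigma, reps=20000, seed=7):
    """Draw X and Y together (unplanned experiment) and collect the slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        xs = [rng.gauss(mu_x, sigma_x) for _ in range(n)]
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(least_squares_slope(xs, ys))
    return slopes


def theoretical_variance(n, sigma_x, sigma):
    """Variance implied by the t_{n-1} result; requires n > 3."""
    return sigma ** 2 / ((n - 3) * sigma_x ** 2)


if __name__ == "__main__":
    draws = simulate_slopes(n=12, beta0=1.0, beta1=0.5,
                            mu_x=0.0, sigma_x=2.0, sigma=1.0)
    print("mean of slopes:     ", statistics.mean(draws))
    print("empirical variance: ", statistics.variance(draws))
    print("theoretical value:  ", theoretical_variance(12, 2.0, 1.0))
```

Changing `beta0` or `mu_x` should leave the spread of the simulated slopes essentially unchanged, in line with the symmetry of the density around $\beta_1$.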

Critical Values
We obtain the critical values of the test based on the test statistic $T = \hat\beta_1 \hat\sigma_X/\hat\sigma$ for three levels of significance. We denote the critical value by $C_{n,\alpha}$. The critical value $C_{n,\alpha}$ satisfies the equation $\Pr(|T| > C_{n,\alpha} \mid \beta_1 = 0) = \alpha$, where the null distribution of T is expressed through $W_1 \sim \chi^2_1$, $W_2 \sim \chi^2_{n-1}$, $W_3 \sim \chi^2_{n-2}$, and $W_4 \sim \chi^2_{n-1}$, with the $W_i$ being independent. There are two options for solving it. One is to use the pdf of $\{(n-2)/(n-1)\}\,T^2$. Following Jambunathan [22], one can write the pdf of the product UV of the random variables U and V with $U \sim$ Beta II(1/2, (n − 2)/2), $V \sim$ Beta II((n − 1)/2, (n − 1)/2), and U and V independent. The pdf is in the form of a double integral, and its evaluation would require a quadrature formula with the attendant errors of approximation. The second option is to determine the distribution of $T^2$ by extensively sampling the components that make up $T^2$ via Monte Carlo. We have pursued the second option. The critical values are tabulated in File S1.
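The Monte Carlo option can be sketched for the product UV described above. Each Beta II variable is drawn as a ratio of two independent gamma variables, which is a standard representation of the beta distribution of the second kind. The function names, the replication count, and the seed are illustrative assumptions, and the paper's own computations are in R. The sketch returns upper quantiles of UV only; converting them to $C_{n,\alpha}$ requires the scaling between $T^2$ and UV given in the text.

```python
import math
import random


def beta_ii(rng, a, b):
    """Draw from Beta II(a, b) as a ratio of independent gamma variables."""
    return rng.gammavariate(a, 1.0) / rng.gammavariate(b, 1.0)


def sample_product(n, reps=50000, seed=11):
    """Sample U * V with U ~ Beta II(1/2, (n-2)/2), V ~ Beta II((n-1)/2, (n-1)/2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        u = beta_ii(rng, 0.5, (n - 2) / 2)
        v = beta_ii(rng, (n - 1) / 2, (n - 1) / 2)
        draws.append(u * v)
    return draws


def upper_quantile(draws, alpha):
    """Empirical point with (approximately) probability alpha above it."""
    ordered = sorted(draws)
    index = min(len(ordered) - 1, math.ceil((1 - alpha) * len(ordered)) - 1)
    return ordered[index]


if __name__ == "__main__":
    product_draws = sample_product(n=20)
    for level in (0.10, 0.05, 0.01):
        print(level, upper_quantile(product_draws, level))
```

Repeating the run with several seeds and averaging the quantiles gives a rough idea of the Monte Carlo error at each level.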
One can also obtain the critical value $C_{n,\alpha}$ via the asymptotic distribution of T. One benefit of our derivation of the exact distribution is that, if n is large and the null hypothesis is true, T is approximately normally distributed with mean zero. There are several ways to establish the asymptotic normality of T. The exact unconditional distribution of $\sqrt{n-1}\,(\hat\beta_1 - \beta_1)\,\sigma_X/\sigma$ is $t_{n-1}$, which is asymptotically N(0, 1).
Then we use the fact that $\hat\sigma_X$ is consistent for $\sigma_X$ and that $\hat\sigma$ is consistent for σ. Since we know the variance of T exactly, we use this variance in the description of the asymptotic distribution of T. We can now calculate critical values from the asymptotic distribution, in addition to those coming from the exact distribution.
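As a rough illustration of the asymptotic route, the sketch below uses only the limiting N(0, 1) law of $\sqrt{n-1}\,T$ under the null hypothesis, which follows from the $t_{n-1}$ result and the consistency of the two estimators. It is a first-order version: it does not use the exact variance of T referred to above, so its values need not coincide with the normal critical values reported in File S1. The function name is an illustrative assumption.

```python
import math
from statistics import NormalDist


def first_order_normal_critical_value(n, alpha):
    """Approximate two-sided critical value z_{1-alpha/2} / sqrt(n - 1) for |T|."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / math.sqrt(n - 1)


if __name__ == "__main__":
    for n in (25, 50, 100, 400):
        print(n, [round(first_order_normal_critical_value(n, a), 4)
                  for a in (0.10, 0.05, 0.01)])
```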
In File S1, we report the average critical values C n,α along with the critical values stemming from the asymptotic theory. A description of these asymptotic critical values is provided below.
Comments on File S1: the Normal Critical Value column reports the critical values from the normal approximation. The exact critical values are reported as follows.

4. Critical value 10% = critical value coming from the exact distribution of T when α = 0.10.
5. Critical value 5% = critical value coming from the exact distribution of T when α = 0.05.
6. Critical value 1% = critical value coming from the exact distribution of T when α = 0.01.

Sample Size and Power
For a given level α, sample size n, and alternative value $\beta_1 = A$, power is given by $\mathrm{Power}(A) = \Pr(|\hat\beta_1 \hat\sigma_X/\hat\sigma| > C_{n,\alpha} \mid \beta_1 = A)$.
We simulate the regression model for power computations. Simulations are greatly simplified when we exploit the key property of the alternative distribution, namely, that it depends only on n and δ. Simulations are reported in the Supplementary Materials. Sample sizes are tabulated in Tables 1-3. The third column in each table spells out the requisite sample size.

4. The fourth column is the fruit of our effort to validate the sample size. At the ascertained sample size, data are generated under the specifications, the power is calculated, and the power is averaged over a thousand repetitions.
5. The fifth column records the standard deviation of the thousand powers calculated.
6. We are satisfied that the sample sizes laid out hold true.
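The simulation strategy described in this section can be sketched as follows. Because the alternative distribution depends only on n and δ, the sketch fixes $\sigma_X = \sigma = 1$, $\mu_X = \beta_0 = 0$, and $\beta_1 = \delta$. The estimators $\hat\sigma_X^2 = S_{XX}/(n-1)$ and $\hat\sigma^2 = RSS/(n-2)$ are assumptions made for this illustration, as are the function names, the replication counts, and the search range. The critical value is also obtained by simulating the model under the null hypothesis, which is a simplification of the component-sampling approach used for File S1.

```python
import math
import random


def t_statistic(xs, ys):
    """Compute T = b1 * sigma_x_hat / sigma_hat for one sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    sigma_x_hat = math.sqrt(s_xx / (n - 1))  # assumed estimator of sigma_X
    sigma_hat = math.sqrt(rss / (n - 2))     # assumed estimator of sigma
    return b1 * sigma_x_hat / sigma_hat


def simulate_abs_t(n, delta, reps, rng):
    """Draw |T| with beta_1 = delta and sigma_X = sigma = 1 (effect size delta)."""
    draws = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [delta * x + rng.gauss(0.0, 1.0) for x in xs]
        draws.append(abs(t_statistic(xs, ys)))
    return draws


def critical_value(n, alpha, reps, rng):
    """Empirical (1 - alpha) quantile of |T| simulated under beta_1 = 0."""
    ordered = sorted(simulate_abs_t(n, 0.0, reps, rng))
    return ordered[min(reps - 1, math.ceil((1 - alpha) * reps) - 1)]


def power(n, delta, alpha, reps, rng):
    """Share of simulated |T| values that exceed the simulated critical value."""
    cutoff = critical_value(n, alpha, reps, rng)
    return sum(t > cutoff for t in simulate_abs_t(n, delta, reps, rng)) / reps


def required_sample_size(delta, alpha=0.05, target=0.80, reps=2000, seed=3, n_max=200):
    """Smallest n in [4, n_max] whose simulated power reaches the target, else None."""
    rng = random.Random(seed)
    for n in range(4, n_max + 1):
        if power(n, delta, alpha, reps, rng) >= target:
            return n
    return None


if __name__ == "__main__":
    print(required_sample_size(delta=1.0))
```

A validation step in the spirit of the fourth and fifth columns would call `power` repeatedly at the selected n with different random streams and report the mean and standard deviation of the resulting values. With small `reps`, the search can stop one or two units early or late because of Monte Carlo noise.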

Discussion
A simple linear regression is a five-parameter model spelling out causality between two quantitative variables Y and X, typified by: $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$ and $X \sim N(\mu_X, \sigma_X^2)$, for some parameters $\beta_0$, $\beta_1$, $\mu_X$, $\sigma^2 > 0$, and $\sigma_X^2 > 0$. The goal is to sample (X, Y) for testing $H_0: \beta_1 = 0$ versus the alternative $H_1: \beta_1 \neq 0$. For determining the sample size, we need the level of significance α, the power 1 − β, and the effect size $\delta = A \sigma_X/\sigma$, where A is the given alternative value of $\beta_1$. The test statistic T used here is the one based on the least squares estimator $\hat\beta_1$ of $\beta_1$.
The regression model, as originally formulated, is a conditional model, i.e., $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$. In practice, in a planned experiment, the experimenter selects values $x_1, x_2, \ldots, x_n$ of X and observes one or more Ys from the conditional distribution of $Y \mid x_i$ for each i. Thus, the sample size n has already been chosen. The statistic $\hat\beta_1 \sqrt{S_{XX}}/\sqrt{RSS/(n-2)}$ is used for testing $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. The conditional distribution of the test statistic given the data on X is Student's t with n − 2 degrees of freedom under $H_0$, and it is non-central Student's t with n − 2 degrees of freedom and non-centrality parameter $|A| \sqrt{S_{XX}}/\sigma$ under $H_1: \beta_1 = A$. The alternative distribution can be used to calculate the power of the test at $\beta_1 = A$, and nothing more. The entities n and $S_{XX}$ are already in place, and σ has to be spelled out. The value of A is provided by the experimenter as the one of clinical significance. From the consulting experience of one of the authors, the experimenter usually comes up with a value for σ from a pilot study.
In some statistical circles [6,7], the non-null distribution is used to calculate the sample size for the desired power, with $S_{XX}$ remaining the same. This is controversial and is discussed in [3,5,13].
We are dealing with unplanned experiments in which both X and Y are sampled together. Unplanned experiments are very common in clinical studies [4]. The effect size, in this context, is the product of the alternative value of $\beta_1$ and the ratio $\sigma_X/\sigma$ of the two standard deviations of the model.
The current practice demands α, 1 − β, A, σ, and $S_{XX}$, the last of which we do not have. Specification of $S_{XX}$ is avoided by determining the unconditional distribution of $T = \hat\beta_1 \hat\sigma_X/\hat\sigma$. Exploiting the unconditional distribution of T, we calculated the critical values and the required sample sizes. The unconditional distribution under the alternative depends on the effect size $\delta = \beta_1 \sigma_X/\sigma$, as well as on n and α. By contrast, popular software such as PASS [6] and nQuery [7] uses the conditional distribution of the test statistic T* given the data on X for calculating the sample size.
An additional feature of our paper is that we provide a comprehensive table of critical values and sample sizes, unlike commercial software.
The main result, that the non-null distribution of the test statistic T depends only on the effect size δ, has an echo in other inference problems. For example, when testing $\mu_1 = \mu_2$ under the assumptions of normality and a common variance $\sigma^2$, the non-null distribution of the two-sample t-statistic depends only on the effect size $\lambda = (\mu_1 - \mu_2)/\sigma$. This result, in spirit, is like ours. We have archived our findings for comments and insights [23].
We trust that the tables provided will help researchers to calculate the sample size for simple linear regression in unplanned experiments, avoiding the controversies that have been problematic until now. We will continue to study how the sample sizes compare between the test based on the slope parameter of the model and the test based on the correlation coefficient.