Sample Size Estimation for Detection of Splicing Events in Transcriptome Sequencing Data

Merging data from multiple samples is required to detect lowly expressed transcripts or splicing events that might be present only in a subset of samples. However, the exact number of replicates required for the detection of such rare events often remains unclear, but can be approached through probability theory. Here, we describe a probabilistic model relating the number of observed events in a batch of samples to observation probabilities. Therein, samples appear as a heterogeneous collection of events, which are observed with some probability. The model is evaluated in a batch of 54 transcriptomes of human dermal fibroblast samples. The majority of putative splice-sites (alignment gap-sites) are detected in (almost) all samples or only sporadically, resulting in a U-shaped pattern for observation probabilities. The probabilistic model systematically underestimates event numbers due to a bias resulting from finite sampling. However, using an additional assumption, the probabilistic model can predict observed event numbers within a <10% deviation from the median. Single samples contain a considerable number of uniquely observed putative splicing events (mean 7122 in alignments from TopHat and 86,215 in alignments from STAR). We conclude that the probabilistic model provides an adequate description for observation of gap-sites in transcriptome data. Thus, the calculation of required sample sizes can be done by application of a simple binomial model to sporadically observed random events. Due to the large number of uniquely observed putative splice-sites and the known stochastic noise in the splicing machinery, it appears advisable to include observation of rare splicing events into analysis objectives. Therefore, it is beneficial to take scores for the validation of gap-sites into account.

Isoform identification in complex genomes currently suffers from insufficiencies [4]: the detection of alternative splicing is associated with low sensitivity, especially when transcript abundance or read coverage is low [5,6]. Therefore, a reasonable alternative strategy is to focus on gapped alignments, an approach we elaborated recently [7].
Genomic alignments of reads obtained from whole transcriptome sequencing contain gapped alignments due to the removal of introns during pre-mRNA splicing. To increase the detection sensitivity of gapped reads resulting from low abundant transcripts, alternative splicing, or sporadically used splice-sites, data from multiple samples needs to be merged. Gap-sites are alignment gap locations possibly shared by multiple alignments. They represent putative splice-sites and need to be validated because they are reported by aligners with a high false discovery rate (FDR) [7].

Observation of Splicing Events
For the detection of splicing events, we recently developed three R packages allowing accumulative extraction of gap-site information from transcriptome sequencing data [8], calculation of two scores for gap-site validation (Gap Quality Score, gqs and Weighted Gap Information Score, wgis), and annotation of gap-sites [7]. For each gap-site, detected in a batch of samples, the total number of covering alignments in all samples (nAligns, alignment depth) and the number of samples in which a gap-site was identified (nProbes, multiplicity) are reported.
The distribution of wgis values implies a division of gap-sites into four gap-quality levels (gql0 = not validated to gql3 = high confidence level). Thus, gap-sites are a heterogeneous population with varying alignment coverage and multiplicity as well as varying resemblance to confirmed splicing events.
Gap quality level 0 is assigned to a gap-site when the value of wgis is 0. The value wgis = 0 indicates that either the number of matching nucleotides in the merged samples is too low (qsm < 16) or one of the MaxEnt scores (score5 or score3) is below threshold. Thus, evidence from alignments is sparse or splice-site strength is too low for validation. A detailed description can be found in our recently published manuscript [7].

Sample Size Estimation
The starting point of considerations on sample size estimation is the expectation that the number of observed gap-sites should increase with the number of samples in a batch. The number of observed gap-sites thus was examined by repeatedly drawing sample batches of varying size from the fibroblast transcriptomes.

Number of Gap-Sites Observed in Small Samples
In a simulation experiment, 100 random batches consisting of 2, 4, 8 and 12 samples were extracted (from the 54 fibroblast samples) and analysed for the presence of gap-sites as expected from alignments by STAR. The probability density estimates shown in Figure 1 indicate that the number of identified gap-sites increases with sample size.
A more detailed analysis, however, shows that the increase in observation numbers is not equally distributed between gap-sites of different gap-quality levels. The number of not validated (gql0) gap-sites increases nearly linearly with a rate of 128,000 new gap-sites per sample, while the number of gql3 sites increases only at a much lower rate (Table 1).
Thus, gap-sites are a heterogeneous population with varying contributions to the growing total number of observations in larger samples. An alternative interpretation of these varying contributions as a consequence of varying observation probabilities leads to the hypothesis that a calculation of expected values from observation probabilities provides a prediction of total event numbers. In the following, we elaborate and evaluate this model.

The Probabilistic Model for Prediction of Event Numbers
The probabilistic model for observation of events (identification of gap-sites in a sample or in a batch of samples) is based on two assumptions: the observations of gap-sites in single samples are (i) random events and (ii) independent from each other. Independence means that the observation probabilities for gap-sites are not influenced by observations in other samples or by observations of other gap-sites.
Using these assumptions, the observation probability in single samples can be related to the observation probability in a batch of 54 samples using basic probabilistic considerations. In essence, the considerations are based on the fact that not observing a gap-site in a batch of samples is equivalent to not observing it in any sample of the batch.

Definition of Two Probabilities
In the probabilistic model, two types of probabilities are considered: the observation probabilities ($p_j$) and the observation prior (Π).
The observation probabilities represent the gap-site multiplicity (the number of samples in which a gap-site is identified) in the model.
As a real batch consists of a finite number of samples ($n$), the observation probabilities are numbers in $\{1/n, 2/n, \ldots, 1\}$.
The observation prior represents the relative abundance of gap-site multiplicities in real samples. The distribution of absolute numbers of gap-site multiplicities is shown in Figure 2. The U-shaped distribution indicates the fact that 73.7% (in TopHat) and 90.3% (in STAR) of gap-site multiplicities are <5 or >50 and thus are located at the extremes. Also, there is considerable variation between different samples (SD/mean > 34% for nProbes < 10), which is also demonstrated in Appendix C.
Normalising both axes in Figure 2 creates the observation prior (Π).
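The normalisation of both axes can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `observation_prior` and the toy counts are hypothetical, and the count vector is assumed to hold the number of gap-sites per multiplicity 1, ..., n.

```python
from fractions import Fraction

def observation_prior(z):
    """Normalise multiplicity counts z = [z_1, ..., z_n] on both axes.

    Returns (p, pi): p[j-1] = j/n is the observation probability assigned
    to multiplicity j, and pi[j-1] = z_j / sum(z) is its relative abundance.
    """
    n = len(z)
    total = sum(z)
    if total == 0:
        raise ValueError("at least one event is required")
    p = [Fraction(j, n) for j in range(1, n + 1)]   # x-axis: multiplicity / n
    pi = [Fraction(z_j, total) for z_j in z]        # y-axis: count / total
    return p, pi

# Toy counts for a hypothetical batch of n = 4 samples (U-shaped on purpose).
z_toy = [50, 10, 10, 30]
p_toy, pi_toy = observation_prior(z_toy)
```

Exact fractions are used here only to make the normalisation easy to inspect; floating point values would serve equally well for real count data.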

Calculation of Expected Values
The two probabilities are connected to each other in a two step model, in order to model the observation of a single gap-site in a single sample: when a gap-site is to be observed, first an observation probability (p) is drawn from the observation prior Π. Then, the observation is drawn from a binomial distribution with probability p.
The expected number of gap-site observations in a batch of size ν can then be calculated by an integration. As real sample batches consist of a finite number of samples, both probabilities are discrete and thus expectations are calculated using sums. In a batch of size ν, the expected number of observed gap-sites ($|S_\nu|$) is calculated from the probabilistic model as
$$|S_\nu| = \sum_{j=1}^{n} z_j \left[1 - \left(1 - \frac{j}{n}\right)^{\nu}\right] = |S_n| - \sum_{j=1}^{n} z_j \left(1 - \frac{j}{n}\right)^{\nu} \qquad (1)$$
where $|S_n|$ is the number of gap-sites in the full batch (of size $n$) and $z_j$ is the total number of gap-sites with multiplicity $j$ therein. The detailed definitions and calculations are shown in Appendix B.
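The discrete expectation can be sketched in a few lines. The sketch assumes that a gap-site with multiplicity j in the full batch of n samples is assigned the observation probability j/n, so that each such gap-site contributes 1 − (1 − j/n)^ν to the expected count. The function name `expected_observed` and the toy counts are hypothetical.

```python
def expected_observed(z, nu):
    """Expected number of gap-sites observed in a sub-batch of size nu.

    z[j-1] is the number of gap-sites with multiplicity j in the full batch
    of n = len(z) samples; j / n is used as the observation probability.
    """
    n = len(z)
    if not 1 <= nu <= n:
        raise ValueError("nu must lie in 1..n")
    # A gap-site stays unobserved with probability (1 - j/n) ** nu.
    return sum(z_j * (1.0 - (1.0 - j / n) ** nu)
               for j, z_j in enumerate(z, start=1))

# Toy counts for a hypothetical batch of n = 4 samples (100 gap-sites in total).
z_toy = [50, 10, 10, 30]
predicted = [expected_observed(z_toy, nu) for nu in range(1, 5)]
```

For the toy counts, the prediction for the full batch (ν = 4) stays below the 100 gap-sites actually contained in it, because gap-sites with low multiplicity are assigned a non-negligible probability of remaining unobserved.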

Evaluation of the Probabilistic Model
The predicted and observed numbers of gap-sites were evaluated using a simulation study. From 54 sequenced fibroblast transcriptomes, 200 random sub-batches of random size (drawn from a uniform distribution on {2, . . . , 53}) were extracted and completely analysed for number of gap-sites.
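The sub-batch sampling scheme can be sketched with synthetic data. The sketch assumes each sample is represented as a set of gap-site identifiers; the function `simulate_sub_batches` and the toy samples are hypothetical stand-ins, not the rbamtools or spliceSites workflow used in the study.

```python
import random

def simulate_sub_batches(samples, n_draws, rng):
    """Draw random sub-batches and count the distinct gap-sites in each.

    samples: list of sets, one set of gap-site identifiers per sample.
    Returns a list of (batch_size, n_distinct_gap_sites) tuples.
    """
    n = len(samples)
    results = []
    for _ in range(n_draws):
        size = rng.randint(2, n - 1)        # uniform on {2, ..., n - 1}
        batch = rng.sample(samples, size)   # samples drawn without replacement
        observed = set().union(*batch)      # a gap-site counts once per batch
        results.append((size, len(observed)))
    return results

# Synthetic stand-in: 6 samples sharing 100 gap-sites, plus 10 unique ones each.
rng = random.Random(1)
shared = set(range(100))
toy_samples = [shared | {f"u{i}_{k}" for k in range(10)} for i in range(6)]
toy_results = simulate_sub_batches(toy_samples, n_draws=20, rng=rng)
```

In this toy setting, every added sample contributes exactly 10 new gap-sites, so the count grows linearly with batch size; real data mix many multiplicities and therefore produce a curved relation.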
The data from all single samples were added to the simulated data. The numbers of observed events are shown in Figure 3. The mean number of gap-sites is modelled by a Loess regression (solid line).

Limitations Arising from Finite Samples
The numbers of observed events predicted by Equation (1) are considerably lower than the observed numbers, mainly due to a too low terminal slope that can be seen in the predictions from the raw model (dotted line) in Figure 3. Considerations shown in the model evaluation (Appendix B.2.3) clarify that a too small terminal slope is caused by an underestimated number of rare events (for example unique gap-sites). In the following, this effect is related to estimation from finite samples.
The observation probabilities for gap-sites, displayed in Figure 2, show sharp maxima at both ends of the x-axis (observation probabilities near 0 and 1). Due to steep ascents, estimation accuracy relies on data from close proximity to the extremes. However, due to finite sample sizes, data in the proximity of the extremes are limited.
The resulting impact is quantified by examination of events (gap-sites) with multiplicity 1 (unique events). When a batch of finite size $n$ is analysed, the lowest observable multiplicity is 1, to which an observation probability of $1/n$ is assigned. The observation probability of unique events approaches 0 with increasing $n$ and thus the probability of being unobserved should be near 1. However, according to Equation (1), the probability of being unobserved is $(1 - 1/n)^n$ for unique events and thus the theoretical limit (approached with <1% error for $n = 54$) is
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.368.$$
Consider the number of unique events ($m_u$) in a batch of $n$ samples. As a consequence, the predicted number of unique events from Equation (1) is
$$m_u \left(1 - e^{-1}\right) \approx 0.632\, m_u$$
in a batch of size $n$ (instead of $m_u$). Thus, the number of unique events in the full sized batch is underestimated by 36.8%, an inaccuracy that cannot be avoided by increasing sample size.
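The size of this effect can be checked numerically. The following sketch assumes that a unique event is assigned the observation probability 1/n and uses the number of unique gap-sites reported for STAR alignments in this study; the variable names are illustrative only.

```python
import math

n = 54                                   # batch size used in this study
p_unobserved = (1 - 1 / n) ** n          # modelled chance that a unique event is missed
limit = math.exp(-1)                     # limit of (1 - 1/n) ** n for n -> infinity
rel_error = abs(p_unobserved - limit) / limit

m_u = 4_655_597                          # unique gap-sites in STAR alignments (from the text)
predicted_unique = m_u * (1 - p_unobserved)
shortfall = 1 - predicted_unique / m_u   # fraction of unique events the model loses
```

The shortfall equals `p_unobserved` by construction, which makes explicit that the bias depends only on n and barely changes once n is moderately large.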

Correction for Estimation Inaccuracies
As the lack of information at the extremes does not inevitably distort available data, we explored whether the model predictions recover when the informational gap is closed by adding artificial estimations.
Thus, virtual events with multiplicity <1 are added to the data. This does not change the relations between observed gap-site multiplicities and lies outside the range accessible by real samples. The observation probabilities are recalculated after adding $7.75 \times 10^{6}$ gap-sites with multiplicity 0.24 to alignments from STAR and adding $4.5 \times 10^{5}$ gap-sites with multiplicity 0.4 to alignments from TopHat. These numbers were determined by manually optimising the total number of predicted events in the full sized batch (n = 54).
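The text does not spell out how the recalculation was implemented, so the following is only one possible reading, stated as an assumption: virtual events with fractional multiplicity μ receive the observation probability μ/n and contribute to the expected count in the same way as real events. The function name and the toy numbers are hypothetical.

```python
def expected_observed_completed(z, nu, n_virtual, mult_virtual):
    """Expected event count with virtual low-multiplicity events added.

    z[j-1]: number of real gap-sites with multiplicity j (n = len(z)).
    n_virtual: number of virtual events (assumed, tuned by hand in the text).
    mult_virtual: fractional multiplicity of the virtual events (< 1).
    """
    n = len(z)
    real = sum(z_j * (1.0 - (1.0 - j / n) ** nu)
               for j, z_j in enumerate(z, start=1))
    # Assumption: virtual events are observed with probability mult_virtual / n.
    virtual = n_virtual * (1.0 - (1.0 - mult_virtual / n) ** nu)
    return real + virtual

z_toy = [50, 10, 10, 30]                 # toy counts, n = 4
raw = expected_observed_completed(z_toy, 4, n_virtual=0, mult_virtual=0.5)
completed = expected_observed_completed(z_toy, 4, n_virtual=100, mult_virtual=0.5)
```

Under this reading, the virtual events raise the predicted totals without altering the contribution of any observed multiplicity.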

Predictions by the Completed Probabilistic Model
The event numbers predicted by the completed model and the observed event numbers are shown in Figure 3, where the predictions of the completed model are shown as a dashed dark line. The median difference between the corrected model and the mean values calculated by Loess regression is 8.16% in alignments from STAR and 1.93% in alignments from TopHat. The probabilistic model thus provides a sufficient explanation for the observed gap-site numbers.

Basal Rate for Observation of Gap-Sites
The number of gap-sites modelled by the Loess regression in Figure 3 shows a nearly linear increase in batches of large size (>40 samples). This constant terminal slope defines a basal rate for the observation of gap-sites (gbr), meaning that with every added sample, the total number of gap-sites increases by a constant value.
As, in an empirical sample, all unique events are part of the gbr, the gbr must be greater than the mean content of unique events in each sample. The gbr is 8056 gap-sites per sample in alignments from TopHat and 92,764 gap-sites per sample in alignments from STAR.
In alignments from TopHat, in total 384,576 unique gap-sites are identified with mean 7122 (SD 2905) per sample. In alignments from STAR, in total 4,655,597 unique gap-sites are identified with mean 86,215 (SD 34,464) per sample (Details are shown in Appendix C).
Thus, in the analysed samples, approximately 90% (88.4% in TopHat alignments and 92.9% in STAR alignments) of the gbr is represented by unique gap-sites.
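As a check on these figures, the per-sample means and the share of unique gap-sites in the gbr follow directly from the reported totals (54 samples):

$$\frac{384{,}576}{54} \approx 7122, \qquad \frac{4{,}655{,}597}{54} \approx 86{,}215,$$

$$\frac{7122}{8056} \approx 0.884 \ \text{(TopHat)}, \qquad \frac{86{,}215}{92{,}764} \approx 0.929 \ \text{(STAR)}.$$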

Sample Size Estimation
Applying the independence assumption of the probabilistic model to the calculation of sample sizes required for experimental observation of splicing events (for example non-canonical splicing) allows utilisation of a simple binomial model. First, the lowest observation probability ($p_o$) for splice-sites of interest must be estimated. Together with the required power ($p_w$), the number of required samples can be calculated from the binomial model using the formula
$$n = \frac{\log(1 - p_w)}{\log(1 - p_o)} \qquad (3)$$
(details of derivation are shown in Appendix D). The required sample sizes for a selection of observation probabilities are shown in Table 2. In order to provide a rule of thumb, recommended sample sizes for detection power of >80% are >10 for rare gap-sites ($p_o < 0.15$), 3-10 for occasionally observed gap-sites ($0.15 < p_o < 0.5$), and 1-3 for regularly observed gap-sites ($p_o > 0.5$).
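The sample size calculation can be sketched as a small function. It assumes the relation power = 1 − (1 − p_o)^n from the binomial model and rounds up to the next whole sample; the function name `required_samples` is hypothetical.

```python
import math

def required_samples(p_o, p_w=0.8):
    """Smallest number of samples n with 1 - (1 - p_o) ** n >= p_w.

    p_o: lowest observation probability of the gap-sites of interest.
    p_w: required detection power (default 80%).
    """
    if not (0.0 < p_o < 1.0 and 0.0 < p_w < 1.0):
        raise ValueError("p_o and p_w must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - p_w) / math.log(1.0 - p_o))

# Boundaries of the rule of thumb at 80% power.
n_rare = required_samples(0.15)
n_regular = required_samples(0.5)
```

With these inputs, `n_rare` evaluates to 10 and `n_regular` to 3, in line with the boundaries of the rule of thumb given above.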
For detection of non-canonical splicing or alternative splicing, required sample sizes have been proposed in the range of ≈$1$-$4 \times 10^{8}$ reads (per sample or condition) [6]. According to the described model (and assuming a power of 80% and $180 \times 10^{6}$ reads per sample), this would suffice for detection of gap-sites with observation probabilities down to $1/3$. As observation probabilities in single samples depend on alignment depth, calculated sample sizes need to be adjusted to alignment numbers (see Appendix A).

Discussion
The goal for the current investigation was to answer the question how many gap-sites are observed in batches of different sizes and to solve the problem of sample size estimation. The observation that gap-sites are a heterogeneous population differing by observation multiplicity and by validation status (gql) led to the construction of a (simple) probabilistic model. Using predictions from the model and a simulation study on observed data, the accuracy of the model is further explored.
Besides inaccuracies resulting from finite sampling (corrected by adding artificial estimates), the predicted and observed number of gap-sites are in good accordance (median deviation < 2%) with TopHat alignments and in acceptable accordance (median deviation < 10%) with STAR alignments.
Although the results do not provide a direct proof of the probabilistic model, we discuss the consequences arising from the model assumptions, namely (i) that gap-sites are (in range of detectable variation) observed independently from each other and (ii) the observation of gap-sites is a random event.

Independency of Gap-Site Observations
The independency implies that the likelihood of observation of single gap-sites is not influenced by previous observations or by the presence or absence of other gap-sites. As a regulated co-occurrence of gap-sites in a subset of samples would increase the variance of observed numbers, independency can only be deduced down to co-regulated gap-site clusters of size $10^{4}$ or more, which provides only a weak upper boundary. The situation for gap-sites is thus analogous to the practice in gene expression data, where the assumption of independent regulation also has been applied [9].

Observations of Gap-Sites Are Random Events
The view that gap-site observation is affected by random effects is consistent with the process of mRNA sequencing and with procedures in the alignment algorithms. We additionally propose that random effects are also inherent in the splicing machinery.

Stochastic Noise in the Splicing Machinery
The high degree of evolutionary conservation [10] and the ubiquity of splicing [11,12] underline the functional relevance of the splicing process. Alternative splicing facilitates the generation of multiple transcript (mRNA) isoforms from single genes and thereby the production of ≈100,000 transcripts from ≈20,000 genes [13,14] in humans. The diversified transcript pool potentially expands protein functionalities, which may be advantageous for individuals (documented for a large variety of genes [15,16]) as well as for the species (by increasing the rate of evolutionary change [11,17-19]). The fact that for almost all genes only a single translated (protein) isoform could be detected [20-22], and the existence of a subsequent (quality based) filter (for example NMD) [23-25], emphasise a functional role of the diversified transcript pool as a driver of evolutionary change. Evolutionary demands for variation may imply that splicing noise rather than splicing accuracy is under selection pressure. This could in turn explain why the splicing code is degenerated [26,27] and why complex splice regulatory mechanisms are necessary. Thus, the described stochastic noise in the splicing machinery [28-30] may not be accidental. This randomness would be in accordance with the probabilistic model described here. Analysis thus needs to separate three sources of stochastic variation: mRNA sequencing, alignment of sequencing data to the genome, and the splicing process itself. For this differentiation, accounting for splice-site strength will be helpful, which is included in our recently described wgis score [7].

Consequences of Basal Rates for Observation of Gap-Sites
The simulation data (Figure 3) indicate that for new gap-site observations, a constant basal rate exists even for larger batch sizes (n > 40). This basal rate largely consists of unique gap-sites.
The alignments from STAR contain a very high number (4,655,597) of unique gap-sites. Additionally, the corrections introduced into the probabilistic model (although artificial) may be indicative of a much higher number of potential gap-sites required for the explanation of this data. Numbers of splicing events in the range of $5$-$10 \times 10^{6}$ are very high and potentially cannot be explained by noise in the splicing machinery alone. Also, 92.1% of unique gap-sites in STAR alignments are gql0-sites, meaning they are not validated by wgis and thus either there is only weak support from alignments or they are weak splice-sites. Therefore, the contribution of artificial sources to observation of gql0-sites in unique gap-sites reported by STAR may not be negligible.

Influence of Read-Length and Alignment Depth
We assume that aligned reads are randomly distributed on the transcriptome. As a direct consequence, observation probabilities for splice-junctions are influenced only by alignment depth and not by read length.
Thus, variations in alignment depth primarily influence measures correlating with number of matching nucleotides. Also, likelihood of observation in a sample as well as likelihood of validation (by gqs or wgis) will perceivably change only when alignment depth crosses thresholds.
We consider, for instance, a reduction of alignment depth by 50% in alignments from STAR. In order to reach a 50% validation rate for gap-sites, 240 alignments are required when gqs is used and 19 alignments when wgis is used [7]. Thus, the validation status essentially will remain unchanged for gap-sites supported by 1000 alignments. Gap-sites supported by 300 alignments will presumably no longer be gqs-validated (but still wgis-validated) and gap-sites supported by one alignment are likely to become unobserved. The example shows that experimental conditions have heterogeneous effects on observation probabilities.
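The three cases of this example can be traced with a short sketch. As a deliberate simplification, it treats the alignment numbers required for a 50% validation rate (240 for gqs, 19 for wgis) as hard cut-offs, which the scores themselves are not; the function name and the returned keys are hypothetical.

```python
GQS_DEPTH_50 = 240    # alignments for a 50% gqs validation rate (from the text)
WGIS_DEPTH_50 = 19    # alignments for a 50% wgis validation rate (from the text)

def status_after_scaling(n_aligns, factor=0.5):
    """Rough status of a gap-site after scaling its alignment depth.

    Simplifying assumption: the 50% validation points act as hard thresholds.
    """
    depth = n_aligns * factor
    return {
        "observed": depth >= 1,                 # at least one alignment remains
        "gqs_likely": depth >= GQS_DEPTH_50,
        "wgis_likely": depth >= WGIS_DEPTH_50,
    }

# The three cases from the text: 1000, 300 and 1 supporting alignments.
examples = {n: status_after_scaling(n) for n in (1000, 300, 1)}
```

The sketch only illustrates why a uniform reduction in depth affects gap-sites unevenly: the outcome depends on which side of a threshold the reduced depth falls.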
These considerations show that spreading sequencing power over more samples may be a sensible approach as long as observation probabilities in single samples are not impaired. Thereby, a more complete picture of the stochastic dispersion may be generated.

Influence of Sequencing Technology
Significant advances have been made since the invention of second generation sequencing (SGS) [31,32]. Third generation sequencing (TGS) platforms, for example single molecule real-time sequencing technology (SMRT) [33] and nanopore sequencing [34], offer read lengths up to 20,000 base pairs but are currently associated with 60-100 higher costs per (Giga-) base than SGS on the Illumina platform [32]. As splice-site observation probability is not improved by longer reads and considerable numbers of samples are required for the detection of occasionally observed splice-sites, TGS is unlikely to replace SGS here. Also, the splice-site detection mechanism in rbamtools currently does not consider whether different gap-sites belong to the same transcript or arise from the same read. Although observation probability may not be altered by short reads (for example read length = 20), the likelihood of gap-site validation is severely impaired as the minimal number of matching nucleotides (on both sides of the alignment gap) is limited (to 10 in the example).

Influence of Different Species and Tissues
We suppose that the U-shaped pattern of gap-site multiplicities present in our fibroblast sample can be found in most tissues and species. Thus, the considerations of the shown probabilistic model should apply.
Gap-sites with high observation probability (the right hand side of the U-shape) are caused by genes ubiquitously expressed in a tissue. Gap-sites with low observation probability (the left hand side of the U-shape) are caused by splicing events that are only occasionally present in a tissue and also by errors in sequencing and alignment. A shift in the relation between gap-sites with high and low observation probability might be introduced by different tissues (for example via differing numbers of constitutively expressed genes). In order to obtain the exact distribution of observation probabilities, gap-site multiplicities on a batch of sufficient size will have to be analysed.

Comparison with Other Analysis Strategies
In a recent report, re-extraction of reads from two human samples that were sequenced with ultra-high coverage (≈$10^{9}$ reads) was described [35]. In order to recover 80% of alternative splice events from the main sample, $100$-$150 \times 10^{6}$ reads were required and more than $400 \times 10^{6}$ reads were required in order to recover 80% of differential alternative splicing events. According to the estimations presented here, these read numbers would suffice only for detection of regularly observed gap-sites ($p_o > 0.5$).
Also, as transcriptome sequencing is commonly performed using $50$-$100 \times 10^{6}$ reads [6], the read numbers resulting from the sample size estimation presented here are much higher.

Transcriptome Sequencing Data
All transcriptome data shown in this study originate from an investigation where the effects of age, gender, and UV exposure were studied in 54 samples of dermal fibroblasts obtained from healthy human donors. A main result of the study is that no consistent differential expression of genes is observed [36]. The gene expression in the 54 samples is thus deemed to be homogeneous. Collection and processing of dermal samples from donors was approved by the Ethical Committee of the Medical Faculty of the University of Düsseldorf (# 3361) (11 April 2011). Raw Fastq files are available under ArrayExpress accession E-MTAB-4652 (ENA study ERP015294). This batch of samples is used as the training batch from which statistical distributions are derived.
The transcriptomes were sequenced on an Illumina HiSeq 2000 sequencer. Sequences were aligned against the human genomic sequence (GRCh38) using STAR (version 2.4.1d modified) [37] and TopHat (version 2.0.14) [38,39] aligners. For alignment, the soft-masked version of the toplevel sequence regions was downloaded from ENSEMBL (version 76). As both aligners neglect effects of soft-masking, alignments on soft-masked and unmasked sequences yield equal results. The transcriptome sequencing data of the 54 fibroblast samples contain on average $179.0 \times 10^{6}$ alignments from STAR and $162.5 \times 10^{6}$ alignments from TopHat (for more details see Appendix A).
The number of gap-sites present in single samples as well as in merged samples has been calculated using the framework provided by R packages rbamtools (version 2.16.0, available on CRAN), refGenome (version 1.7.3, available on CRAN), and spliceSites (version 1.23.3, available on Bioconductor).

Conclusions
The observation of sporadic splicing events, which, for example, may be due to splicing inaccuracies, can be a worthwhile analysis objective. Their observation can be described using a simple binomial model. High read numbers are required for their detection.

Acknowledgments:
This study was supported in part by the Deutsche Forschungsgemeinschaft (DFG, SCHW 1508/3-1 to Holger Schwender and DFG, SCHA 909/4-1 to Heiner Schaal) and the German Ministry of Research and Education (Network Gerontosys, Stromal Aging WP3, part C, to Heiner Schaal). We thank Anastasia Ritchie for editing the manuscript.
Author Contributions: Wolfgang Kaisers developed the software, constructed the model, performed the simulations and wrote the paper; Holger Schwender reviewed and edited the paper; Heiner Schaal conceived and designed the experiments.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Number of Alignments in Fibroblast Transcriptome Samples
In single samples, alignments from the STAR aligner contained on average $179.0 \times 10^{6}$ alignments (SD $62.1 \times 10^{6}$). Alignments from TopHat in single samples contained on average $162.5 \times 10^{6}$ alignments (SD $58.0 \times 10^{6}$). The detailed distribution is shown in Figure A1.

[Figure A1. Number of alignments in single samples (STAR). Axis: number of alignments (× 10^6).]

Consider a batch of samples, $B_n = \{s_i \mid i \in 1, \ldots, n\}$ of size $n$ ($|B_n| = n$), indexed by $i$. Define a sample $s_i$ as a set of observed events $s_i := \{g_j : g_j \in s_i\} \subset G$, where $g_j \in s_i$ reflects the fact that event $j$ has been observed in sample $i$. The indicator function reflects the observation of an event in a sample:
$$\mathbf{1}_i(g_j) := \begin{cases} 1 & \text{when } g_j \in s_i \\ 0 & \text{when } g_j \notin s_i. \end{cases}$$
Define that an event $g_j$ is observed in a batch $B_n$ exactly when $g_j$ is observed in at least one sample $s_i$:
$$S_n := \bigcup_{i=1}^{n} s_i. \qquad \text{(A1)}$$
By designation, the observation of $g_j$ in $B_n$ is equivalent to $g_j \in S_n$. The cardinality of $S_n$ ($|S_n|$) is the number of observed events in $B_n$. The expected number of observed events will be calculated using the opposite formulation of Equation (A1), obtained using De Morgan's law
$$S_n^c = \bigcap_{i=1}^{n} s_i^c,$$
where $s_i^c$ is the complement set of $s_i$. Thus, the calculations essentially determine the probability of events being unobserved.
An actual experiment produces a finite set of putative observation events $O = \{[g_j \in s_i] : 1 \le i \le n;\ 1 \le j \le m\}$, from which a subset occurs ($[g_j \in s_i]$) and the complement does not ($[g_j \notin s_i]$).

Appendix B.1.2. Observation of Events in Merged Samples
The observation of events in merged samples is modelled by a two-step procedure and is based on two separate stochastically distributed entities:
• First, observation probabilities ($p_j$) are drawn from an observation prior Π.
• Second, the actual observations of events $[g_j \in s_i]$ are drawn independently and identically distributed (iid) from a Bernoulli distribution $B(1, p_j)$.
Thus, observation probabilities are distributed according to the observation prior Π. For a set of events $G$, a vector of observation probabilities $p = (p_1, \ldots, p_m) \in [0, 1]^m$ is obtained from the observation prior Π. The probability for observation of an event is given by
$$P[g_j \in s_i] = E_P[\mathbf{1}_i(g_j)] = p_j,$$
where $P$ is the probability measure defined by $B(1, p_j)$ on $O$, and $E_P$ is the expectation with respect to $P$. The number of observations in each sample is $|s_i| = \sum_{j=1}^{m} \mathbf{1}_i(g_j)$ and the expectation for $|s_i|$ is given by
$$E_P|s_i| = \sum_{j=1}^{m} p_j.$$
Counting of observations in merged samples: Observations $g_j \in s_i$ in a sample batch $B_n$ are counted using an indicator function
$$\mathbf{1}_{S_n}(g_j) := \begin{cases} 1 & \text{when } g_j \in \bigcup_{i=1}^{n} s_i \\ 0 & \text{when } g_j \notin \bigcup_{i=1}^{n} s_i. \end{cases}$$
The second case ($g_j \notin \bigcup_{i=1}^{n} s_i$) is equivalent to $g_j \in \bigcap_{i=1}^{n} s_i^c$. Thus, $\mathbf{1}_{S_n}(g_j)$ can be rewritten as
$$\mathbf{1}_{S_n}(g_j) = 1 - \prod_{i=1}^{n} \left(1 - \mathbf{1}_i(g_j)\right).$$
Using the independence of event observations in different samples, it follows that
$$P[g_j \notin S_n] = \prod_{i=1}^{n} (1 - p_j) = (1 - p_j)^n$$
is the probability for not observing $g_j$ in $B_n$. Thus, the expectation for observation of an event $g_j$ in a batch $B_n$ is given by
$$E_P[\mathbf{1}_{S_n}(g_j)] = 1 - (1 - p_j)^n. \qquad \text{(A6)}$$
As the observation probabilities $p_j$ themselves are random variables, the expectation for event observation requires an additional integration with respect to the observation prior Π:
$$E_\Pi E_P[\mathbf{1}_{S_n}(g_j)] = \int_0^1 \left[1 - (1 - p)^n\right] d\Pi(p).$$
The number of observed events in a merged sample is a random variable, taking values in $\{0, \ldots, m\}$. As the observation probabilities $P[g_j \in s_i]$ are equally distributed for all events, the expectation of the total number of events can simply be calculated from
$$E|S_n| = m \int_0^1 \left[1 - (1 - p)^n\right] d\Pi(p).$$

Appendix B.1.3. Estimation of Observation Prior from Real Samples
The application of the model requires estimation of observation priors from real samples. The prior probabilities will be estimated from the multiplicity of observed events (nProbes: the number of samples in which a gap-site is identified) and thus are based on count data.
Discretised integration: Assume that Π is a discrete measure. Let $p_\Pi = (p_1, \ldots, p_k)$ be a set of distinct observation probabilities, which shall occur with prior probabilities $\pi = (\pi_1, \ldots, \pi_k)$. For $k \in \{1, \ldots, n\}$, let $z_k$ be the number of events observed with multiplicity $k$, to which the observation probability $p_k = k/n$ is assigned. The vector $Z = (z_1, \ldots, z_n)$ then is a vector of count values, from which estimates for the prior probabilities $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_n)$ are calculated using the definition
$$\hat{\pi}_k := \frac{z_k}{\sum_{l=1}^{n} z_l}.$$
Estimation of number of observed events: Let $\nu \in \{1, \ldots, n\}$. We want to estimate the number of observed events in a subset of samples ($S_\nu \subset B_n$). Using the discretised prior, the model-derived estimate for the total number of observed events in $S_\nu$ is given by
$$E|S_\nu| = m \sum_{k=1}^{n} \hat{\pi}_k \left[1 - (1 - p_k)^{\nu}\right].$$
By replacing the total number of observable events ($m$) by the total number of observed events ($|S_n|$), the final estimate for the number of observed events in a batch of size $\nu$,
$$|\hat{S}_\nu| = |S_n| \sum_{k=1}^{n} \hat{\pi}_k \left[1 - (1 - p_k)^{\nu}\right],$$
is obtained.

Appendix B.2. Evaluation of Model Predictions for Special Cases
The following section evaluates model-based predictions for selected special cases. Although, in transcriptome data, these special cases are not observed in isolation but are part of a heterogeneous mixture of events, they provide insight into relationships between observation probabilities and sample composition.

Appendix D. Derivation of the Formula for Sample Size Estimation
A derivation for the provided formula for sample size estimation (Equation (3)) is given here. Denote the lowest observation probability for an event of interest as $p_g$. Equation (A6) then provides the probability of being observed in a batch of size $n$: $p_b = 1 - (1 - p_g)^n$. Algebraic rearrangement then allows direct calculation of $n$, when the desired power (for example 80%) is introduced as value for $p_b$.
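The rearrangement can be written out step by step, using the symbols $p_g$ and $p_b$ defined above:

$$p_b = 1 - (1 - p_g)^n \;\Longleftrightarrow\; (1 - p_g)^n = 1 - p_b \;\Longleftrightarrow\; n \log(1 - p_g) = \log(1 - p_b) \;\Longleftrightarrow\; n = \frac{\log(1 - p_b)}{\log(1 - p_g)}.$$

As a worked example, $p_g = 0.15$ and $p_b = 0.8$ give $n = \log(0.2) / \log(0.85) \approx 9.9$, which is rounded up to 10 samples because $n$ must be a whole number.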