Proposed Framework for Comparison of Continuous Probabilistic Genotyping Systems amongst Different Laboratories

Abstract: Continuous probabilistic genotyping (PG) systems are becoming the default method for calculating likelihood ratios (LRs) for competing propositions about DNA mixtures. Calculation of the LR relies on numerical methods and simultaneous probabilistic simulations of multiple variables rather than on analytical solutions alone. Some also require modelling of individual laboratory processes that give rise to electropherogram artefacts and peak height variance. For these reasons, it has been argued that any LR produced by continuous PG is unique and cannot be compared with another. We challenge this assumption and demonstrate that there is a set of conditions defining specific DNA mixtures which can produce an aspirational LR and thereby provide a measure of reproducibility for DNA profiling systems incorporating PG. Such DNA mixtures could serve as the basis for inter-laboratory comparisons, even when different STR amplification kits are employed. We propose a procedure for an inter-laboratory comparison consistent with these conditions.


Introduction
As forensic short tandem repeat (STR) genotyping assays have become more sensitive, DNA samples that may once have been classified as single source (assessed as being derived from a single DNA donor) may instead be classified as having multiple contributors as low-level alleles are now detected. The presence of multiple contributors has significant implications for propositions involving DNA transfer, persistence, prevalence and recovery (TPPR) [1]. Estimating the weight of evidence of these mixtures with the combined probability of inclusion/exclusion (CPI/E) has proved limiting, mostly because of problems with the treatment of allele drop in and drop out [2][3][4]. As a result, in many jurisdictions, probabilistic genotyping (PG) has become the default process for generating likelihood ratios (LRs) for forensic analysis of DNA mixtures [5]. Continuous PG algorithms model the probability distributions of observed peak heights in STR electropherograms (epgs) under different scenarios. These can then be used to generate likelihoods for propositions which can in turn be combined into LRs. There are a number of continuous PG algorithms available including DNA·VIEW ® [6], TrueAllele ® Casework [7,8] (Cybergenetics, Pittsburgh, PA, USA), STRmix™ [9] (Institute of Environmental Science and Research, Forensic Science South Australia, Adelaide, SA, Australia), EuroForMix [10] and DNAxs [11]. The latter two PG systems are an extended version of the model proposed by Cowell et al. [12] which is open source while the other three require commercial licences [13][14][15]. Until relatively recently [16], there has been little evidence that continuous PG is reproducible amongst different laboratories, and little attempt has been made to define credible intervals for the LRs produced.
Swaminathan et al. [17] collated the LRs for 30 one-person samples, 82 two-person mixtures and 90 three-person mixtures generated by four variations of their CEESIt continuous PG algorithm [18]. The four variations included different permutations of models for "mixture ratio" (also known as the "mixture proportion" [19] and as a "mass parameter" [9,20]), peak height distribution and forward stutter designed to mimic the diversity of available continuous PG algorithms. LRs were calculated five times for each mixture to assess intra-model variance resulting from the Markov chain Monte Carlo (MCMC) simulation procedure. In all four models, intra-model variability increased with an increase in the number of contributors and with a decrease in the contributors' template mass. The LRs were binned into ranges corresponding with verbal expressions for the weight of evidence according to the Association of Forensic Science Providers [21] ranging from "weak" for an LR between 1 and 10 to "extremely strong" for an LR > 10^6. For 9% of intra-model comparisons, LRs did not fall in the same bin for the same mixture, and for 1.5%, LRs were more than one bin apart. For 16% of inter-model comparisons (where two or more of the four models yielded LRs in the same bin for all five runs), LRs from one model fell in a different bin from one or more other models, and 11% were more than one bin apart.
Bright et al. [22] originally proposed and demonstrated a series of tests for validating PG systems using single source, simulated major/minor (3:1) mixtures and simulated balanced (1:1) mixtures. The LRs generated by PG were compared with those expected under theoretical modelling in Excel. Input electropherograms had peak heights adjusted so that there was: Replicate analyses were employed to test for reproducibility. The results of their tests showed good agreement between expected results, continuous PG and semicontinuous PG for single source and balanced profiles, although for the latter, continuous PG yielded higher LRs than semicontinuous PG, as expected. This is because of the extra peak height information considered by continuous PG. For major/minor profiles, agreement between continuous and semicontinuous PG only occurred when the major contributor was manually extracted. This is because continuous PG is able to take advantage of the peak height information in an unbalanced mixture while semicontinuous PG does not (putting aside manual interpretation by an analyst of stutter peaks, for example). All electropherograms were simulated from single source profiles derived from the same capillary electrophoresis instrument, and only one continuous PG algorithm (STRmix) was employed.
There have been other attempts to compare the reproducibility of outputs amongst different PG systems, but most of these (e.g., [23][24][25][26][27]) have involved submitting the same epgs from the same STR amplification kits to different PG algorithms. Benschop et al. [28] describe a validation of one PG system (DNAxs) in five laboratories using STR genotype data (alleles and peak heights) generated within each laboratory from different STR assays. Each laboratory shared its genotyping results with the others, and LRs were mostly within an order of magnitude for the same genotype data. However, the same DNA samples were not processed in each individual laboratory so that the LRs were all generated from the same epgs. Alladio et al. [29] showed that it was possible to compare the reproducibility of LRs from different PG systems and different STR assays. The LRs generated from DNA·VIEW, STRmix and EuroForMix were reproducible for high DNA template amounts over a wide range of mixtures with different numbers of contributors and mixture ratios. Once again, the LRs were all generated from the same epgs. Different STR assays produced LRs that differed by many orders of magnitude, as expected. This is because different STR assays employ different STR loci and different numbers of loci. While this might seem like an impediment to inter-laboratory comparisons, we demonstrate that it can be overcome.
Inter-laboratory comparisons are a standard feature of forensic DNA analysis methods [16,26,30-32]. They indicate the reproducibility of a particular method amongst different laboratories and the variance of quantitative results. It is a reasonable expectation that they be undertaken. They serve to calibrate amongst laboratories, which helps to ensure equality of justice outcomes amongst jurisdictions. The US President's Council of Advisors on Science and Technology (PCAST) "believes that test-blind proficiency testing of forensic examiners should be vigorously pursued, with the expectation that it should be in wide use, at least in large laboratories" [33]. The US National Institute of Standards and Technology (NIST) states: "Inter-laboratory tests are the means by which multiple laboratories compare results and demonstrate that the methods used in one's own laboratory are reproducible in another laboratory. These tests are essential to demonstrate consistency in results from multiple laboratories" (quoted from [34]).
McNevin et al. [35] have previously suggested a method for assessing reproducibility and defining credible intervals for LRs derived from the same DNA extracts (not electropherograms) and calculated by STRmix in particular and continuous PG in general. This was met with some scepticism by Buckleton et al. [36] who contend, firstly, that there are "multiple reasonable answers in the case of evidence from one extract" [36,37] and, secondly, that it is sufficient to calibrate the LRs generated by PG from multiple laboratories using the method of Ramos and Gonzalez-Rodriguez [38]. In summary, this last method uses the LRs and a prior odds ratio from known numbers of contributors and non-contributors submitted by multiple laboratories to calculate a posterior odds ratio. The posterior ratio is compared with the relative frequencies of contributors. The number of non-contributors with LR above a certain threshold should reflect the number expected given the numbers of contributors and non-contributors [39]. This is a reasonable test of the bulk or macro properties of the LR from multiple laboratories; however, it does not provide any indication of the variance in LRs amongst laboratories for the same sample or whether an individual laboratory is producing reasonable LRs. For example, in a multi-laboratory comparison, a laboratory that consistently produces large LRs might be balanced by a laboratory that consistently produces small LRs without perturbing the bulk or macro properties of all the LRs produced. It also requires a large number of contributors and non-contributors for many mixtures.
We argue that there is a true test of each laboratory's ability to produce reasonable LRs, consistent with McNevin et al. [35] and regardless of the instrumentation and STR assays used to produce epgs. Here we provide a formal proof that such a test exists, and we define the conditions under which such a test could be performed.

The Likelihood Ratio Produced by Probabilistic Genotyping
We start with the general formulation of the LR for a DNA mixture as a ratio of two conditional probabilities:

$$\mathrm{LR} = \frac{P(E \mid H_1)}{P(E \mid H_2)} \tag{1}$$

We will loosely follow the notation of Taylor, Bright and Buckleton [9,20] in their descriptions of PG systems while acknowledging that other notations exist (e.g., [12,19]). The evidence, E, is an electropherogram (epg) from a crime trace (G_C) exhibiting a mixture of known reference profiles (G_R) and unknown profiles (G_U). There is also a person or persons of interest (POI or POIs). In general, one proposition, H_1, is that a particular reference genotype (or genotypes) from a POI or POIs (G_P) is a contributor to the DNA mixture, while the alternate proposition, H_2, is that the contributors are two or more known (G_R) or unknown (G_U) genotypes not including the POI(s). The propositions can take various forms, but H_2 will always differ from H_1 in that the genotype of at least one POI (G_P) is replaced with an unknown genotype (G_U). Cowell et al. [12] show that, under the assumption of Hardy-Weinberg equilibrium (HWE), the LR for a mixture for which G_P in H_1 is replaced by an unknown profile (G_U0) in H_2 can never be greater than the LR for a single source profile for the POI responsible for G_P. This places an upper limit on the LR under these circumstances.
The epg reveals M genotype sets S of possible explanatory genotype combinations from N contributors that could give rise to the DNA mixture at any locus. The likelihood ratio becomes:

$$\mathrm{LR} = \frac{\sum_{m=1}^{M} p(G_C \mid S_m)\, P(S_m \mid H_1)}{\sum_{m=1}^{M} p(G_C \mid S_m)\, P(S_m \mid H_2)}$$

where S_m is the mth possible explanatory genotype combination for N contributors and p(G_C|S_m) is a conditional probability density (distinguishing it from a point probability, P). As an example, consider an epg at a locus where there are four alleles (A, B, C, D) detected above an analytical threshold. Each candidate genotype set, S_m, is assigned a normalised weight, w_m. The normalised weights vary from 0 to 1 and account for the possibilities of allele drop in and allele drop out. For continuous PG, they also account for the possibilities of stutter, peak height stochasticity, peak height degradation and peak height variations as a result of allele overlap (shared alleles). Semicontinuous PG does not consider peak height information, although stutter must be differentiated from true alleles by the analyst. The weights for continuous PG are modelled using what Taylor et al. [9,20] refer to as "mass parameters" including a template DNA amount for each contributor, a degradation level for each contributor, an assay-specific locus amplification efficiency for each locus and a replicate amplification efficiency for each replicate. The last two parameters account for inter-locus and inter-replicate variabilities, respectively. The likelihood ratio then becomes:

$$\mathrm{LR} = \frac{\sum_{m=1}^{M} w_m\, P(S_m \mid H_1)}{\sum_{m=1}^{M} w_m\, P(S_m \mid H_2)} \tag{5}$$
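The weighted-sum form of the LR described above can be illustrated with a minimal Python sketch. The function name `likelihood_ratio` and all numerical values are hypothetical and chosen only for illustration; they do not come from any PG system or casework profile.

```python
from typing import Sequence


def likelihood_ratio(weights: Sequence[float],
                     p_given_h1: Sequence[float],
                     p_given_h2: Sequence[float]) -> float:
    """Ratio of weighted sums over genotype sets S_m.

    weights     -- normalised weights w_m, one per genotype set
    p_given_h1  -- P(S_m | H_1) for each genotype set
    p_given_h2  -- P(S_m | H_2) for each genotype set
    """
    if not (len(weights) == len(p_given_h1) == len(p_given_h2)):
        raise ValueError("need one weight and two probabilities per genotype set")
    numerator = sum(w * p for w, p in zip(weights, p_given_h1))
    denominator = sum(w * p for w, p in zip(weights, p_given_h2))
    if denominator == 0:
        raise ZeroDivisionError("no genotype set is possible under H_2")
    return numerator / denominator


# Hypothetical values for three genotype sets (illustration only).
weights = [0.90, 0.08, 0.02]       # w_m
p_given_h1 = [1.0, 0.0, 0.0]       # only the first set contains the POI
p_given_h2 = [0.04, 0.01, 0.03]    # genotype probabilities for an unknown donor

lr = likelihood_ratio(weights, p_given_h1, p_given_h2)
print(f"LR = {lr:.2f}")
```

Because the weights appear in both numerator and denominator, only their relative sizes matter for the LR in this sketch.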

A Reproducible Subset of Likelihood Ratios from Probabilistic Genotyping
The values for w_m will vary from laboratory to laboratory. This is because each laboratory must model epg artefacts and peak height variance for the particular conditions in their laboratory, and these models inform the various w_m. At first glance, and this is certainly the view of Buckleton et al. [36], this suggests that LRs reported by different laboratories cannot be compared. While it is true that not all LRs can be compared, we can define specific conditions for which a subset of LRs can be compared. These conditions exist when the values of w_m are the same for different laboratories.
The weight or likelihood, w_m, for any genotype set, S_m, will vary from almost impossibility (w_m → 0) to almost certainty (w_m → 1). We distinguish between genotype sets with at least one allele not belonging to any of the contributors or without all contributor alleles present (S_i) and those with all alleles belonging to contributors and no others (S_j):

$$\mathrm{LR} = \frac{\sum_{i} w_i\, P(S_i \mid H_1) + \sum_{j} w_j\, P(S_j \mid H_1)}{\sum_{i} w_i\, P(S_i \mid H_2) + \sum_{j} w_j\, P(S_j \mid H_2)} \tag{6}$$

For our four-allele example, let us assume that the contributors have genotypes BC and CD (A is an artefact). Genotype sets S_i include any genotypes with allele A (AA, AB, AC, AD) or without at least one of B, C and D, while genotype sets S_j include all of B, C and D but not A (or any undetected alleles). We wish to restrict w_i so that each laboratory finds w_i → 0. Under these conditions, for any PG system:

$$\mathrm{LR} = \frac{\sum_{j} w_j\, P(S_j \mid H_1)}{\sum_{j} w_j\, P(S_j \mid H_2)} \tag{7}$$

We then extract the unique genotype set S* that corresponds with the contributors to the mixture:

$$\mathrm{LR} = \frac{w^{*} P(S^{*} \mid H_1) + \sum_{k} w_k\, P(S_k \mid H_1)}{w^{*} P(S^{*} \mid H_2) + \sum_{k} w_k\, P(S_k \mid H_2)} \tag{8}$$

where w* is the weight assigned to S* and the new subscript k is used because S* has been separated from the other S_j. For our four-allele example, S* is {BC, CD} which is now distinguished from {BC, BD}, {BD, CD}, {BB, CD}, {BD, CC}, {BC, DD}, etc. When H_1 corresponds with the contributors only (H_1 true) then P(S_k|H_1) = 0, P(S*|H_1) = 1 and:

$$\mathrm{LR} = \frac{w^{*}}{w^{*} P(S^{*} \mid H_2) + \sum_{k} w_k\, P(S_k \mid H_2)} \tag{9}$$

Note that there is an upper limit for LR which occurs if all w_k → 0, namely LR → 1/P(S*|H_2). This is essentially the same result obtained by Cowell et al. [12] for continuous PG but generalised for multiple contributors. When H_1 corresponds with non-contributors (H_1 false) where at least one allele of a non-contributor is not shared with a true contributor then P(S_k|H_1) → 0, P(S*|H_1) → 0 and LR → 0.
For semicontinuous PG, we have no way to reduce uncertainty amongst S_k and S* (because peak height information is not considered). All remaining genotype sets are equally likely. Hence, w* = w_k = w. In this case, Equation (8) becomes:

$$\mathrm{LR} = \frac{P(S^{*} \mid H_1) + \sum_{k} P(S_k \mid H_1)}{P(S^{*} \mid H_2) + \sum_{k} P(S_k \mid H_2)} \tag{10}$$

When H_1 corresponds with the contributors only (H_1 true):

$$\mathrm{LR} = \frac{1}{P(S^{*} \mid H_2) + \sum_{k} P(S_k \mid H_2)} \tag{11}$$

This is the minimum performance expected of continuous PG. We therefore have an upper and lower bound for the LR from continuous PG if:

a. w_i → 0 (i.e., uncertainty is minimised between genotype sets with all alleles belonging to contributors and no others and those with at least one allele not belonging to contributors or without all contributor alleles present); and
b. H_1 is true (i.e., H_1 corresponds with the contributors only).

The range of expected values is given by:

$$\frac{1}{P(S^{*} \mid H_2) + \sum_{k} P(S_k \mid H_2)} \le \mathrm{LR} \le \frac{1}{P(S^{*} \mid H_2)} \tag{12}$$

The lower bound is the LR for the same mixture derived from semicontinuous PG, and the upper, aspirational bound is the LR that would be possible if uncertainty could be eliminated amongst the true contributor genotype set, S*, and all others. To move from the lower bound to the upper bound requires increasing w* beyond the average weight used for semicontinuous PG. Indeed, this is the goal of continuous PG, and the relative ability to increase w* over all other weights is a performance measure for continuous PG systems.
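As a worked illustration of the lower (semicontinuous) and upper (aspirational) bounds, the following Python sketch applies them to a single locus of the four-allele example. It assumes, hypothetically, that the POI has genotype BC, that the second contributor (CD) is known under both propositions, that genotype probabilities follow Hardy-Weinberg proportions, and that every alternative genotype set shares one weight w_k. The allele frequencies are invented for illustration only.

```python
# Hypothetical allele frequencies (illustration only, not from any database).
freq = {"B": 0.10, "C": 0.20, "D": 0.15}


def hwe_genotype_prob(a1: str, a2: str, freq: dict) -> float:
    """Hardy-Weinberg genotype probability: p^2 for homozygotes, 2pq otherwise."""
    if a1 == a2:
        return freq[a1] ** 2
    return 2 * freq[a1] * freq[a2]


# With CD known, the unknown donor under H_2 must carry B and may carry
# only alleles from {B, C, D} (sets containing the artefact A have w_i -> 0).
true_genotype = ("B", "C")                # S*
alternatives = [("B", "B"), ("B", "D")]   # the S_k

p_star = hwe_genotype_prob(*true_genotype, freq)                       # P(S*|H_2)
p_alt = sum(hwe_genotype_prob(a, b, freq) for a, b in alternatives)    # sum of P(S_k|H_2)

upper_bound = 1 / p_star              # all w_k -> 0
lower_bound = 1 / (p_star + p_alt)    # w* = w_k (semicontinuous case)


def lr_h1_true(w_star: float, w_k: float) -> float:
    """LR when H_1 is true, with a single shared weight w_k for all alternatives."""
    return w_star / (w_star * p_star + w_k * p_alt)


print(f"lower bound = {lower_bound:.1f}, upper bound = {upper_bound:.1f}")
print(f"LR with w* = 0.8, w_k = 0.1: {lr_h1_true(0.8, 0.1):.1f}")
```

With these assumed frequencies the single-locus range runs from about 12.5 to about 25, and any choice of weights with w* at least as large as w_k lands inside it.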

Conditions for Achieving Reproducible LRs from Probabilistic Genotyping
The conditional probabilities, P(S*|H_1) and P(S*|H_2), are match probabilities defined by true contributor reference profiles, population genetic models, population allele frequencies and two alternative propositions. As long as any two laboratories have reference profiles for the same contributors, consider the same propositions and use the same models (e.g., Hardy-Weinberg proportions, NRC II recommendation 4.1, NRC II recommendation 4.2), the same population allele frequencies and the same θ for the same loci, they should obtain the same values for P(S*|H_1) and P(S*|H_2). This defines our first conditions for an inter-laboratory comparison of LRs:

1. The same standard mixtures should be examined.
2. The same propositions should be considered.
3. The same loci should be employed.
4. The same population allele frequencies should be employed.

Satisfying Equation (7) requires reducing the probabilities of genotype sets with at least one allele not belonging to any of the contributors or those without all contributor alleles present such that w_i → 0. This will occur when there is little uncertainty between:
• No allele and allele drop out;
• A (low peak height) contributor allele and allele drop in;
• A (low peak height) contributor allele and a stutter peak;
• A single allele and shared ("stacked") alleles, either of which may or may not include allele drop in and stutter peaks.
We consider each of these in turn. The greater the amount of contributor DNA in the mixture, the higher contributor allele peaks are likely to be. The higher the allele peaks, the less likely that drop out will occur. Similarly, high allele peaks are unlikely to be confused with (low peak height) allele drop in. Stochastic variation in peak heights will also be minimised with increasing peak height. Heterozygote peak height imbalance has been shown to decrease as average peak height (APH) increases [40,41]. Continuous PG algorithms model allele peak height and stutter peak height to reflect observations that variance decreases with peak height. EuroForMix and TrueAllele use gamma [10] and normal distributions [7], respectively. STRmix models allele peak and stutter peak height variation according to a log normal distribution [9,20,42]:

$$\log_{10}\left(\frac{O_a}{E_a}\right) \sim N\left(0, \frac{c^2}{E_a}\right) \tag{13}$$

$$\log_{10}\left(\frac{O_{a-1}}{E_{a-1}}\right) \sim N\left(0, \frac{k^2}{O_a}\right) \tag{14}$$

where O and E refer to observed and expected peak heights for alleles (a) and stutter (a − 1), and c^2 and k^2 are locus-specific random variables which are in turn modelled by gamma distributions. For both allele and stutter peaks, the variance is inversely related to peak height (E_a and O_a, respectively) such that stochastic variation will be reduced with increasing peak height.
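The inverse relationship between variance and peak height can be made concrete with a short Python sketch. It assumes that the variance of the log10 ratio of observed to expected height equals a constant divided by the expected height; the constant `C2` and the peak heights are hypothetical values chosen only to show the trend, not fitted STRmix parameters.

```python
import math


def log_ratio_sd(variance_constant: float, expected_height: float) -> float:
    """Standard deviation of log10(O/E) when its variance is constant / height."""
    return math.sqrt(variance_constant / expected_height)


C2 = 10.0                                  # hypothetical variance constant
expected_heights = [100.0, 1000.0, 10000.0]  # hypothetical peak heights (RFU)

sds = [log_ratio_sd(C2, e) for e in expected_heights]

for e, sd in zip(expected_heights, sds):
    # 10**sd is the multiplicative spread of O around E at one standard deviation
    print(f"E = {e:>7.0f} RFU  sd(log10 O/E) = {sd:.3f}  fold spread = {10 ** sd:.2f}")
```

Under this assumption a hundredfold increase in expected height shrinks the standard deviation of the log ratio tenfold, which is why high-template profiles carry much less stochastic uncertainty.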
Too much DNA, however, will result in overloading of the epg with split peaks, pull ups and other artefacts, after which true allele peaks can be confused with these artefacts. This provides our next condition for an inter-laboratory comparison of LRs:

6. The DNA template from true donors should be maximised to a point within the linear range and below saturation of the epg.

The optimal amount of DNA defined by condition 6 may be difficult to assess. One way to achieve it is to amplify a dilution series of DNA such that there is a range of DNA template input amounts ranging from below the optimum to above the optimum. This is a general approach when assessing PG systems (e.g., [41,43,44]) and has been previously used to compare amongst them [29]. The LR will approach a maximum for H_1 true as DNA template amount increases and as w_i → 0. This is demonstrated by Bauer et al. [44] in their Figure 1 (originally in [8]) and defines our next condition:

7. Each laboratory is presented with aliquots of the same dilution series of DNA solutions which then undergo analyses to produce epgs for each solution according to each laboratory's standard practice (according to which the PG system was validated in that laboratory).

Stutter artefact peak heights will scale with true allele peak heights approximately according to a stutter ratio. Hence, conditions 6 and 7 are insufficient on their own to reduce uncertainty between stutter peaks and smaller true allele peaks. Similarly, they will not reduce uncertainty between single alleles and stacked alleles. If the contributors to a DNA mixture are present in equal proportion, however, this uncertainty is minimised, and different labs and different PG systems will tend to find the same w_j. Cheng et al. [45] have recently demonstrated that peak heights are additive and proportional to the donor contributions in a DNA mixture epg. This means that if an allele is shared by two donors, then it should have double the height expected from an allele belonging to a single donor if both donors' DNA templates are not degraded and are present in equal proportion. If it is shared by three donors, it should have triple the height expected from an allele belonging to a single donor if all three donors' DNA is present in equal proportion, and so on.
Stutter peak heights are typically 15% or less of the parent allele peak height, depending on the length of the longest uninterrupted repeat chain [46]. Uncertainty between a stutter peak and a true allele will occur if one contributor is present in the mixture with this order of magnitude relative to another donor (15% or less). When all contributors to a mixture are present in equal proportion, then the size of each donor's allele peak should be approximately 100% relative to all other donors' peaks (albeit with stochastic variance and taking account of degradation) and thus less likely to be confused with a stutter peak. Our next condition for an inter-laboratory comparison of LRs is:

8. All known donors are present in equal proportion by DNA template amount.

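The reasoning behind condition 8 can be sketched in Python using the additivity of peak heights. The sketch assumes no degradation and no stochastic variation, treats each allele copy from a donor as contributing that donor's full per-copy height, and uses a crude check that compares the smallest expected allele peak with 15% of the tallest peak. The per-copy heights in RFU are hypothetical.

```python
from collections import Counter


def expected_heights(genotypes, rfu_per_copy):
    """Sum each donor's per-copy height over the alleles in that donor's genotype."""
    heights = Counter()
    for genotype, rfu in zip(genotypes, rfu_per_copy):
        for allele in genotype:
            heights[allele] += rfu
    return dict(heights)


def smallest_allele_exceeds_stutter(heights, stutter_ratio=0.15):
    """Crude check: is the smallest true allele taller than the largest expected stutter?"""
    return min(heights.values()) > stutter_ratio * max(heights.values())


donors = [("B", "C"), ("C", "D")]  # genotypes from the four-allele example

balanced = expected_heights(donors, [1000.0, 1000.0])    # 1:1 mixture
unbalanced = expected_heights(donors, [1850.0, 150.0])   # minor donor near stutter level

print(balanced, smallest_allele_exceeds_stutter(balanced))
print(unbalanced, smallest_allele_exceeds_stutter(unbalanced))
```

In the balanced case the shared allele C is expected at twice the height of B and D, and every true allele sits well above the stutter level; in the unbalanced case the minor donor's unshared allele D falls below the expected stutter of the tallest peak.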
We now have the eight conditions for an inter-laboratory comparison originally suggested by McNevin et al. [35]. Such a comparison should produce the results described by them in their Figure 1 and by Bauer et al. [44] in their Figure 1 where the maximised value of the LR corresponding with the plateau in both cases is given by Equation (7) for all propositions and Equation (9) for H_1 true (Figure 1). We would go so far as to say that Equation (12) defines the "expected" LR range under our eight conditions, in the same way that the reciprocal of the random match probability is the expected LR for a high quality single source profile. We acknowledge that there is debate here, including a special issue in Science & Justice devoted entirely to measuring (or not) the reproducibility of LRs [47], but the upper bound for the LR defined in Equation (9) is certainly aspirational.
We add two final conditions that should be employed for any inter-laboratory comparison consistent with best scientific practice. These are:

9. The trial should be blinded. Laboratories presented with a dilution series of DNA solutions to be analysed should not know which is which.
10. The trial should be facilitated by an entity not associated with the PG systems under comparison.
An LR > 1 from semicontinuous PG will nearly always be less than the LR from continuous PG for the same mixture for H_1 true, except at low DNA template amounts when stochastic effects dominate. This is because more information (peak heights) is being used by continuous PG resulting in LRs further from 1. Exceptions may occur when a sample has an unlikely peak height that greatly deviates from the expected height, possibly due to extreme stochastic variation or a primer sequence polymorphism (null allele). This can lead to very low weights for S* and thus a lower LR than for semicontinuous PG. Such exceptions notwithstanding, Equation (12) defines a theoretical range for the LR from continuous PG where the lower bound is the LR from semicontinuous PG and the upper bound represents no uncertainty amongst the true contributor genotype set, S*, and all others. The greater the number of equal-proportion contributors, the lower the LR and the lower the theoretical range defined by Equation (12) will be. This is because there are greater numbers of allele permutations that could explain contributor genotype sets, S_j, and hence the weight, w_j, assigned to each one is lower.
We now address the questions of peak height imbalance and degradation (the typical "ski slope" of DNA profiles). STRmix (and, indirectly, other continuous PG systems) models these phenomena using the so-called mass parameters and then assigns w_m according to how far the observed peaks deviate from the modelled peaks. Allele decay is modelled as a function of molecular weight where longer alleles will have lower peak heights than shorter alleles. Different manufacturers of forensic STR assays will have different amplicon sizes for each of the loci and so the relative decay amongst loci will vary. At any particular locus, there will also be allele-specific variation leading to heterozygote imbalance, for example. For STRmix, this is modelled by Equations (13) and (14). If we consider two non-shared alleles in a genotype set, S_j, the further they are from equality (balanced), the lower the weight assigned to a heterozygous genotype in S_j, all other weights being equal.
Peak height variance and degradation have been posited by Buckleton et al. [36] as another reason LRs cannot be compared amongst laboratories. However, at any particular locus for any particular kit, we are restricting w_m such that each laboratory finds w_i → 0 and the w_j are the same for all laboratories. A true heterozygous genotype may have two unbalanced and unshared alleles, but the heterozygous genotype will still have a much higher probability, w_j, than other possible genotypes under our eight conditions, all other weights being equal.

An Inter-Laboratory Comparison
Our proposed conditions and trial will not provide a comparison point for every possible LR generated by continuous PG. This is because LRs produced by continuous PG are subject to variance. However, we have specified conditions that minimise this variance. Even less variance is possible if we specify conditions that minimise uncertainty between one POI and all other contributors (i.e., w_i → 0, w_k → 0, w* → 1), but this is the trivial case where one contributor is present at much higher proportion than all others, approaching the case of a single source profile.
It may be argued that our set of conditions 1 to 8 is restrictive and does not test the reproducibility of PG systems when w_i is not close to 0. However, by including a dilution series, we can see how the variance in LR increases from its minimum (at high average peak height, APH) as APH decreases. Swaminathan et al. [17] found that this variance increased with a decrease in the contributors' template mass for all four of their representative continuous PG model variations. Conditions 1 to 8 therefore provide for a minimum performance measure. Our condition 8 is a strenuous test because, as Buckleton et al. [43] point out: "testing two low-level contributors with similar APHs (a 1:1 mixture) presents more of a challenge to the software than does a 1:20 mixture, as the genotype of the higher contributor has less uncertainty and helps to inform the genotype of the lower contributor". This would equally be the case at high APHs.
An inter-laboratory comparison employing our conditions will provide the following information:
• The position of the plateaued, maximum LR from any laboratory within the theoretical range defined by Equation (12). This is a measure of performance, if not accuracy.
• The range of plateaued, maximum LRs reported by laboratories. This is an indication of the credible interval for LRs reported under the best possible conditions designed to minimise variance in LRs. This credible interval would suggest a minimum as we would expect the variance amongst laboratories to increase the further they are from conditions 1 to 8.
• Outlier laboratories. This would provide guidance on which laboratories (if any) might need to re-validate their PG system.
• Outlier PG systems. This would provide guidance on which PG systems (if any) do not model allele peak height variance adequately according to the procedures in a particular laboratory.
• The minimum template amounts at which fortuitous LRs are encountered for any laboratory (LR > 1 for a non-contributor, LR < 1 for a contributor). As DNA template amounts decrease in the dilution series, LRs for contributors and non-contributors will approach 1 but may actually overshoot.
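A trial facilitator might summarise the first and third of these outputs along the following lines. This Python sketch is only one possible approach: the laboratory names, plateau LRs and range bounds are hypothetical, and positioning each LR on a log10 scale between the bounds is an assumption made for illustration rather than a prescribed method.

```python
import math


def classify(lr: float, lower: float, upper: float) -> str:
    """Say whether a plateau LR sits below, within or above the theoretical range."""
    if lr < lower:
        return "below range"
    if lr > upper:
        return "above range"
    return "within range"


def position_in_range(lr: float, lower: float, upper: float) -> float:
    """Fractional position on the log10 scale: 0 at the lower bound, 1 at the upper."""
    span = math.log10(upper) - math.log10(lower)
    return (math.log10(lr) - math.log10(lower)) / span


# Hypothetical theoretical range and plateau LRs for one mixture (illustration only).
lower, upper = 1.0e8, 1.0e10
plateau_lrs = {"lab_1": 2.0e9, "lab_2": 8.0e9, "lab_3": 5.0e7}

report = {lab: classify(lr, lower, upper) for lab, lr in plateau_lrs.items()}

for lab, lr in plateau_lrs.items():
    print(f"{lab}: {report[lab]}, position = {position_in_range(lr, lower, upper):.2f}")
```

A laboratory reported as "below range" in such a summary would be a candidate for the outlier follow-up described above.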
We now define the procedure for an inter-laboratory comparison consistent with our conditions 1 to 10:

1. Identify participating laboratories. They are required not to communicate with each other concerning the trial.
2. Identify reported loci in common amongst participating laboratories. Longer loci, where Equation (7) might not be expected to hold, could also be excluded (with agreement). These excluded loci should not be used either to estimate parameters such as mixture proportions or to calculate LRs. In practice, any laboratory could nominate a locus to be excluded. A comparison between PG systems could, theoretically, be made with as few as one locus but, of course, more loci will increase the stringency of any trial.
3. Identify a trial facilitator not associated with any of the PG systems to be used. This could be a university, a centre of excellence or a national forensic regulator, for example.
4. The trial facilitator collects samples from reference cell lines or consenting volunteers and performs DNA extraction and quantitation for each sample.
5. The DNA concentration for each sample is normalised according to the quantitation results and assessed as being of a suitable (high) quantity and quality.
6. A single source STR profile for each donor is generated according to best practice. These are the contributor reference profiles. Non-contributor reference profiles can also be generated.
7. Equal volume and equal concentration aliquots of high abundance DNA are combined from various donors to create mixtures of 2, 3, 4, ... and N contributors in equal proportion by DNA amount.
8. Aliquots of the various dilution series (one dilution series per mixture) are distributed to the participating laboratories, labelled randomly such that the laboratory does not know the concentration of DNA in any sample. For one, two, three, four and five contributors each at seven different dilutions, for example, a total of 35 samples would be supplied.
10. Each participating laboratory produces an STR epg for each aliquot according to the standard procedures for that laboratory.
11. The participating laboratories are also supplied with the following:
• Allele frequencies from a defined population.
12. The following propositions are also provided to each of the participating laboratories:
• H_1: The donor of reference profile X is a contributor to the mixture which also consists of N other known but unrelated contributors (where all N+1 reference profiles are supplied);
• H_2: The donor of reference profile X is not a contributor to the mixture which consists of an unrelated, random member of the (defined) population and N other known but unrelated contributors.
These can be applied to both contributor and non-contributor reference profiles.
13. Each laboratory is asked to provide an LR according to Equation (1). The laboratories are instructed to use the allele frequencies provided from the defined population without any population substructure corrections and using a consistent population genetic model (e.g., Hardy-Weinberg proportions).
14. The LRs are collated and compared by the trial facilitator.
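The blinded labelling of the dilution series could be organised as in the following Python sketch. The function name `blinded_labels`, the label format and the seed are hypothetical; the sketch only shows one way for a facilitator to assign random sample codes while retaining a private key.

```python
import random


def blinded_labels(contributor_counts, n_dilutions, seed=None):
    """Map random sample codes to (number of contributors, dilution step).

    The returned key stays with the trial facilitator; laboratories
    would see only the codes.
    """
    rng = random.Random(seed)
    samples = [(n, d) for n in contributor_counts
               for d in range(1, n_dilutions + 1)]
    codes = [f"S{idx:03d}" for idx in range(1, len(samples) + 1)]
    rng.shuffle(codes)  # break any link between code order and sample identity
    return dict(zip(codes, samples))


# One to five contributors, each at seven dilutions, as in the example above.
key = blinded_labels([1, 2, 3, 4, 5], n_dilutions=7, seed=2021)

print(len(key), "samples to distribute")
for code in sorted(key)[:3]:
    print(code, "->", key[code])
```

Fixing the seed makes the key reproducible for the facilitator's records; in a real trial the seed itself would also need to be kept from the participants.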

Conclusions
We propose a procedure to allow comparison amongst PG systems and laboratories. The LR defined by Equation (7) and the LR range defined by Equation (12) and enabled by our conditions 1 to 8 will not depend on either the PG system or the laboratory if each PG system calculates LR according to Equation (5) and calculates w_m according to maximum likelihood and if each laboratory has calibrated their PG system appropriately. Kelly et al. [40] state that LR "variance is more profile specific than laboratory specific" if c^2 and k^2 in Equations (13) and (14), respectively, are adequately modelled.
Proposing that a PG system can be calibrated according to the procedures and instruments of a particular laboratory raises the question: Can that calibration be tested? We believe it can and that there is no reason to avoid inter-laboratory comparison of PG systems, even when different STR amplification kits are employed. Differences in STR assay, PCR thermal cycling, capillary electrophoresis or profile analysis settings will all be manifested in peak height variances which are modelled by PG systems. How well they are modelled will be determined by where any generated LR sits in the range defined by Equation (12) and, indeed, whether it falls in this range at all. We hope that our proposed study builds upon existing validation of continuous PG and provides another step towards establishing a standardised, best practice approach for DNA mixture analysis.