A Nonparametric Bayesian Approach to the Rare Type Match Problem

The “rare type match problem” is the situation in which, in a criminal case, the suspect’s DNA profile, matching the DNA profile of the crime stain, is not present in the reference database. Ideally, the evaluation of this observed match in the light of the two competing hypotheses (the crime stain was left by the suspect or by another person) should be based on the calculation of the likelihood ratio and depends on the unknown population proportions of the DNA profiles. We propose a Bayesian nonparametric method that uses a two-parameter Poisson Dirichlet distribution as a prior over the ranked population proportions and discards the information about the names of the different DNA profiles. The model is validated using data from European Y-STR DNA profiles, and the calculation of the likelihood ratio becomes quite simple thanks to an Empirical Bayes approach for which we provide a motivation.


Introduction
The largely accepted method for evaluating how much some available data D (typically forensic evidence) helps discriminate between two hypotheses of interest (the prosecution hypothesis H_p and the defense hypothesis H_d) is the calculation of the likelihood ratio (LR), a statistic that expresses the relative plausibility of the data under these hypotheses, defined as

LR = Pr(D | H_p) / Pr(D | H_d). (1)

Widely considered the most appropriate framework to report a measure of the 'probative value' of the evidence regarding the two hypotheses [1][2][3][4], it indicates the extent to which the observed data support one hypothesis over the other. The likelihood ratio is meant to be multiplied by the prior odds in order to obtain the posterior odds. The latter is the quantity of interest for a judge, but the prior odds do not fall within the statistician's competence. Even if a judge does not explicitly do Bayesian updating, the likelihood ratio is still considered to be the correct way for the expert to communicate their evaluation of the weight of the evidence to the court. We refer the reader to Taroni et al. [5] for extensive arguments for the use of likelihood ratios in forensic statistics. Forensic literature presents many approaches to calculate the LR, mostly divided into Bayesian and frequentist methods (see Cereda [6,7] for a careful differentiation between these two approaches).
This paper proposes a first application of a Bayesian nonparametric method to assess the likelihood ratio in the rare type match case, the challenging situation in which there is a match between some characteristic of the recovered material and of the control material, but this characteristic has not been observed before in previously collected samples (i.e., in the database of reference). This constitutes a particularly challenging situation for the assessment of the likelihood ratio, as discussed in the next section.

The Rare Type Match Problem
The evaluation of a match between the profile of a particular piece of evidence and a suspect's profile should depend on the relative frequencies of that profile in the population of potential perpetrators. Indeed, it is intuitive that the rarer the matching profile, the more the suspect is in trouble. A big problem arises when the observed frequency of the profile in a sample (database) from the population of interest is 0 (there is already a problem when the number is small, since it means that we don't know the population frequency very well). This problem could have been named "the new type match problem", but we decided to use the name "rare type match problem", motivated by the fact that a profile that has zero occurrences is likely to be rare, even though it is challenging to quantify how rare it is. The rare type match problem is particularly important for new kinds of forensic evidence, such as results from DIP-STR markers (see, for instance, Cereda et al. [9]) for which the available database size is still limited. The problem also occurs when more established types of evidence, such as Y-chromosome (or mitochondrial) DNA profiles are used, as explained in Section 2.2: they have been our main motivation for the present study.
The rare type match problem has been addressed in well-known non-forensic statistics domains, and many solutions have been proposed. The empirical frequency estimator, also called the naive estimator, which uses the frequency of the characteristic in the database, puts unit probability mass on the set of already observed characteristics, and it is thus unprepared for the observation of a new type. A solution could be the add-constant estimators (in particular the well-known add-one estimator, due to Laplace [10], and the add-half estimator of Krichevsky and Trofimov [11]), which add a constant to the count of each type, including the unseen ones. However, these methods require knowledge of the number of possible unseen types, and they perform badly when this number is probably large (and anyway unknown) compared to the sample size (see Gale and Church [12] for additional discussion). Alternatively, Good [13], based on an intuition of A. M. Turing, proposed the Good-Turing estimator for the total unobserved probability mass, based on the proportion of singleton observations in the sample. An application of this estimator to the frequentist LR assessment in the rare type match case is proposed in Cereda [7].
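The Good-Turing idea admits a one-line implementation: the estimated total probability mass of all unseen types is the proportion of singletons in the sample. A minimal sketch (the toy database and function name are ours, purely illustrative):

```python
from collections import Counter

def good_turing_unseen_mass(sample):
    """Good-Turing estimate of the total probability mass of all unseen
    types: the proportion of singleton observations in the sample."""
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)  # types seen exactly once
    return n1 / len(sample)

# Toy database of categorical "profiles": 4 singleton types in 10 observations.
db = ["a", "a", "a", "b", "b", "b", "c", "d", "e", "f"]
print(good_turing_unseen_mass(db))  # 0.4
```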
More recently, Orlitsky et al. [14] have introduced the high-profile estimator, which extends the tail of the naive estimator to the region of unobserved types. Anevski et al. [15] improved this estimator and provided a consistency proof for their modified estimator (the original authors had only provided heuristic reasoning that turned out to be rather difficult to make rigorous).
Bayesian nonparametric estimators for the probability of observing a new type have been proposed by Tiwari and Tripathi [16] using Dirichlet processes, by Lijoi et al. [17] and De Blasi et al. [18] using a general Gibbs prior, and by Favaro et al. [19] with specific focus on the two-parameter Poisson Dirichlet prior, for which Arbel et al. [20] provide large-sample asymptotics and credible bands. In particular, Favaro et al. [21] show the link between the Bayesian nonparametric approach and the Good-Turing estimator. However, the LR assessment requires not only the probability of observing a new species, but also the probability of observing this same species twice (according to the defense, the crime stain profile and the suspect profile are two identical independent observations): to our knowledge, the present paper is the first to address the problem of assessment of the LR in the rare type match case using a Bayesian nonparametric model (i.e., with an infinite-dimensional parameter). As a prior for p, we will use the two-parameter Poisson Dirichlet distribution, which is proving useful in many discrete domains, in particular language modeling [22]. In addition, it predicts a power-law behavior that describes an incredible variety of phenomena [23], including the observed distribution of Y-STR haplotypes.

Evaluation of Matching Probabilities of Y-STR Data
Y-STR profiles are typically used to detect male DNA in male-female DNA mixtures and are made of a number (usually varying from 7 to 23) of integers, which we treat as categorical observations, corresponding to STR polymorphisms belonging to the non-recombining part of the Y-chromosome.
There is no biological reason to assume independence among Y-STR loci, and even though the lack of recombination is in principle balanced by recurrent and backward mutations, the existence of such a dependency is studied and confirmed by Caliebe et al. [24]. As far as Y-STR population frequencies are concerned, the dependency between loci implies that no factorization (of the kind used for the autosomal markers) can be used to calculate these frequencies, and that the available databases are too small with respect to the large space of possible profiles (hence, a database will likely contain a high proportion of singletons). Indeed, the rare type match case is very frequent when using Y-STR data, and the use of simplistic methods such as the profile count is too conservative for practical use (it is bounded from below by the inverse of the database size) [24]. In Andersen et al. [25] and Andersen et al. [26], approximations of the joint distribution with second- and third-order dependencies between loci are explored. However, as admitted by the authors, there is a limitation due to the inadequacy of the sizes of the available databases, which makes it necessary to use simulations that in turn are oversimplifications of real data.
Moreover, as highlighted already in 1994 by Balding and Nichols [27], match probabilities cannot be identified with population frequencies, since a match can also be due to a certain degree of relatedness between the two donors of the stain. This is particularly true for Y-STR data, since Y-STR profiles are inherited almost identically from father to son. More recently, Andersen and Balding [28] investigated the influence of relatedness on matches, concluding that 95% of matching profiles are separated by a relatively small number (50-100) of meioses; hence, according to their study, the degree of relatedness is a very influential factor. They thus propose a method to describe the distribution of the number of males with a matching Y-STR profile, extending the approach to mixtures in Andersen and Balding [29]. One limitation of this study is that it is based on extensive simulations, which have to be performed anew in each new application, on assumptions about the genetic evolutionary model, and on parameters which are essentially unknown.
A huge number of methods have been developed to assess the evidential value of Y-STR data. Among those developed precisely for the rare type match case are Egeland and Salas [30], Brenner [8], Cereda [7], and Cereda [6]. These particular (just listed) methods do not take into account the genetic information contained in the allelic numbers forming a Y-STR DNA profile. They do not use the fact that, due to relatedness, the observation of a particular Y-STR profile increases the probability of observing the same Y-STR profile again, or Y-STR profiles that differ only by a few alleles. We refer the reader to Roewer [31], Buckleton et al. [32], Willuweit et al. [33], and Wilson et al. [34] for models that use population genetics for coancestry. These models are not designed to be used for the rare type match case, though the Discrete Laplace method presented in Andersen et al. [35] can be successfully applied to that purpose, as shown in Cereda [7].
After a careful study of the available methods for assessing likelihood ratios (or matching probabilities) for Y-STR matches, one can see that they are of different natures (some of them do their best to exploit the genetic structure, others don't) and based on different assumptions. In our opinion, none of them is fully satisfactory and at the same time useful for the rare type match and for general cases.
In this paper, we study what can be done if we reduce the data, taking into account only the equalities and inequalities among profiles rather than considering the specific Y-STR observed characteristics. We know part of the scientific community will not agree with our approach, preferring an approach such as the one of Andersen and Balding [28], but we believe that our method can also be useful. Indeed, we think it can be very useful for an analyst, in a particular case, to obtain results using several different methods relying on different assumptions: this is actually "sensitivity analysis". It provides further information which can be used by the court, even if it only shows that the evidential value of the match is almost unquantifiable. Moreover, even though Y-STR data have been the main motivation for this study, the model is actually applicable to different kinds of data (in principle, to all forensic data that show power-law behavior). When applied to data without genetic structure (such as tire marks or glass fragments), these kinds of criticisms should fade away if, of course, one can also find empirical or theoretical support for power-law behavior.
The Y-STR marker system will thus be employed here as an extreme but in practice common and important example in which the problem of assessing the evidential value of rare type match can arise. We believe that the analyst should perform several analyses using different models and different assumptions, and compare the performance of the different methods, in order to try to learn from the differences (or lack of differences) between the conclusions which would follow from each method individually.
A big issue of working with Y-STR data is the unavailability of reliable databases that are representative of the actual population. The YHRD database is in fact a collection of databases coming from police forces or laboratories. The individual databases are not actually random samples from well-defined populations, since different institutions and organizations use different selection criteria. For instance, is a prison population representative of the population outside? Moreover, the sizes of the databases from different countries are not proportional to the sizes of the populations. We are well aware of this limitation.

Notation and Data
Throughout the paper, the following notation is chosen: random variables and their values are denoted, respectively, with uppercase and lowercase characters: x is a realization of X. Random vectors and their values are denoted, respectively, by uppercase and lowercase bold characters: p is a realization of the random vector P. Probability is denoted with Pr(·), while the density of a continuous random variable X is denoted alternatively by p_X(x) or by p(x) when the subscript is clear from the context. For a discrete random variable Y, the density notation p_Y(y) and the discrete one Pr(Y = y) will be used interchangeably. Moreover, we will use shorthand notation like p(y | x) to stand for the probability density of Y with respect to the conditional distribution of Y given X = x.
Notice that, in Formula (1), D was regarded as the event corresponding to the observation of the available data. However, later in the paper, D will be regarded as a random variable generically representing the data. The particular data at hand will correspond to the value d. In that case, the following notation will thus be preferred:

LR = p(d | h_p) / p(d | h_d).

Lastly, notice that "DNA types" is used throughout the paper as a general term to indicate Y-STR profiles.
The data used in the present study were obtained from the Y Chromosome Haplotype Reference Database (YHRD) [36,37]. Here, only seven of the markers included in the PowerPlex Y23 system (PPY23, Promega Corporation, Madison, WI, USA) were investigated: DYS19, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393. The dependence between these seven "core markers" is studied in Caliebe et al. [24], which concludes that "each of these seven markers contribute indispensable information about each other markers from the same set".

Model Assumptions
Our mathematical model is based on the following two assumptions:

Assumption 1. There are infinitely many different DNA types in nature.
This assumption, already used by e.g., Kimura [38] in the "infinite alleles model", allows the use of Bayesian nonparametric methods and is very useful for instance in "species sampling problems" when the total number of possible different species in nature cannot be specified. This assumption is sensible also in case of Y-STR DNA profiles since the state space of possible different haplotypes is so large that it can be considered infinite.

Assumption 2. The names of the different DNA types do not contain usable information.
Actually, the specific sequence of numbers that forms a DNA profile carries information: if two profiles show few differences, this means that they are separated by only a few mutation steps; hence, the profiles share a relatively recent common ancestor. However, this information can be very difficult to use, and it might be wiser not to try to use it in the LR assessment. This is the reason why we will treat DNA types as "colors", and only consider their partition into different categories. Stated otherwise, we do not use any topological structure on the space of the DNA types.
Notice that this assumption makes the model a priori suitable for any characteristic which has many different possible types showing power-law behavior, as we will see; thus, the approach described in this paper might, in principle, be considered after replacing "DNA types" with any other category.

Prior
In Bayesian statistics, parameters of interest are modeled through random variables. The (prior) distribution over a parameter should represent the prior uncertainty about its value.
The assessment of the LR for the rare type match involves two unknown parameters of interest: one is h ∈ {h_p, h_d}, representing the unknown true hypothesis; the other is p, the vector of the unknown population frequencies of all DNA profiles in the population of potential perpetrators. The dichotomous random variable H is used to model the parameter h, and the posterior distribution of this random variable, given the data, is the ultimate aim of the forensic inquiry. Similarly, a random variable P is used to model the uncertainty over p. Because of Assumption 1, p is an infinite-dimensional parameter, hence the need for Bayesian nonparametric methods [39,40]. In particular, because of Assumption 2, data can be reduced to partitions, as explained in Section 3.5, and it will turn out that the distribution of these partitions does not depend on the order of the p_i. Hence, we can define the parameter p as having values in ∇∞ = {(p_1, p_2, . . .) | p_1 ≥ p_2 ≥ . . . , ∑_i p_i = 1, p_i > 0}, the ordered infinite-dimensional simplex. The uncertainty about its value will be expressed by the two-parameter Poisson Dirichlet prior [41][42][43][44].
The GEM distribution (short for Griffiths-Engen-McCloskey distribution) is well known in the literature as the "stick-breaking prior", since it measures the random sizes in which a stick is broken iteratively.

Definition 1 (GEM distribution). Given α and θ satisfying the condition

0 ≤ α < 1, θ > −α, (2)

let V_1, V_2, . . . be independent random variables with V_i ∼ Beta(1 − α, θ + iα), and define W_1 = V_1 and W_i = V_i ∏_{j<i} (1 − V_j) for i ≥ 2. The random vector W = (W_1, W_2, . . .) is said to be GEM(α, θ) distributed.

Definition 2 (Two-parameter Poisson Dirichlet distribution). Given α and θ satisfying condition (2), and a vector W = (W_1, W_2, . . .) ∼ GEM(α, θ), the random vector P = (P_1, P_2, . . .) obtained by ranking W, such that P_i ≥ P_{i+1}, is said to be Poisson Dirichlet distributed, PD(α, θ). Parameter α is called the discount parameter, while θ is the concentration parameter.
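The two-parameter Poisson Dirichlet prior can be sampled by stick-breaking: in the standard construction (see, e.g., Pitman [44]), one draws independent V_i ∼ Beta(1 − α, θ + iα), sets W_1 = V_1 and W_i = V_i ∏_{j<i}(1 − V_j), and ranks the resulting weights. A minimal Python sketch (function name ours; the infinite stick is truncated once the leftover mass is negligible):

```python
import random

def sample_pd_weights(alpha, theta, tol=1e-10, rng=random):
    """Approximate draw from PD(alpha, theta): stick-breaking GEM(alpha, theta)
    with V_i ~ Beta(1 - alpha, theta + i*alpha), truncated when the remaining
    stick length drops below `tol`, then ranked in decreasing order."""
    assert 0 <= alpha < 1 and theta > -alpha  # condition (2)
    weights, remaining, i = [], 1.0, 1
    while remaining > tol:
        v = rng.betavariate(1 - alpha, theta + i * alpha)
        weights.append(remaining * v)
        remaining *= 1 - v
        i += 1
    return sorted(weights, reverse=True)  # ranking the GEM weights gives PD

p = sample_pd_weights(0.5, 10.0)
print(len(p), round(sum(p), 6))  # many weights, total mass ~ 1.0
```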
For our model, we will not allow α = 0; hence, we will assume 0 < α < 1, in order to have a prior that shows a power-law behavior like the one observed in the YHRD database (see Section 5.1). We think that it could be interesting to look for models from population genetics and evolution which seem similar in some respects and which predict power-law behavior.
The main reason that prompted us to choose the two-parameter Poisson Dirichlet distribution among the possible Bayesian nonparametric priors is model fitting (see Section 5.1). However, there is another very nice feature of this model: it is the only one with the following very convenient sufficientness property [45]: the probability of observing a new species depends only on the number of already observed species and on the sample size, and the probability of observing an already seen species depends only on its frequency in the sample and on the sample size.
Lastly, we point out that, in practice, we cannot assume that we know the parameters α and θ: we will resolve this by using a hyperprior.

Bayesian Network Representation of the Model
The typical data to evaluate in case of a match are D = (E, B), where E = (E_s, E_c), with E_s the suspect's DNA type, E_c the crime stain's DNA type (matching the suspect's type), and B a reference database of size n, which is treated here as a random sample of DNA types from the population of possible perpetrators.
The hypotheses of interest for the case are: h p = The crime stain originated from the suspect, h d = The crime stain originated from someone else.
In agreement with Assumption 2, the model will ignore information about the names of the DNA types: the data D = (E, B) will thus be reduced accordingly. The Bayesian network of Figure 1 encapsulates the conditional dependencies of the random variables (H, A, Θ, P, X_1, . . . , X_{n+2}, D), whose joint distribution is defined below in terms of a collection of conditional distributions (one for each node). H is a dichotomous random variable that represents the hypotheses of interest and can take values h ∈ {h_p, h_d}, according to the prosecution or the defense, respectively. A uniform prior on the hypotheses, Pr(H = h_p) = Pr(H = h_d) = 1/2, is chosen for mathematical convenience, since it will not affect the likelihood ratio (the variable H being in the conditioning part). (A, Θ) is the random vector that represents the hyperparameters α and θ, satisfying condition (2). The joint prior density of these two parameters will be generically denoted as p(α, θ): for obvious reasons, this will be called the 'hyperprior' throughout the text.
The random vector P with values in ∇∞ represents the ranked population frequencies of Y-STR profiles. P = (p_1, p_2, . . .) means that p_1 is the frequency of the most common DNA type in the population, p_2 is the frequency of the second most common DNA type, and so on. As a prior for P, we use the two-parameter Poisson Dirichlet distribution:

P | A = α, Θ = θ ∼ PD(α, θ).

The database is assumed to be a random sample from the population. Integer-valued random variables X_1, . . . , X_n are here used to represent the (unknown) ranks in the population of the frequencies of the DNA types in the database. For instance, X_3 = 5 means that the third individual in the database has the fifth most common DNA type in the population. Given p, they are an i.i.d. sample from p:

Pr(X_i = j | P = p) = p_j, i = 1, . . . , n. (3)

X_{n+1} represents the rank, in the population ordering, of the suspect's DNA type. It is again an independent draw from p: Pr(X_{n+1} = j | P = p) = p_j. X_{n+2} represents the rank, in the population ordering, of the crime stain's DNA type. According to the prosecution, given X_{n+1} = x_{n+1}, this random variable is deterministic (it is equal to x_{n+1} with probability 1). According to the defence, it is another sample from p, independent of the previous ones:

Pr(X_{n+2} = j | P = p, X_{n+1}, h_d) = p_j. (4)

In order to observe X_1, . . . , X_{n+2}, one would need, by definition, to know the rank, in terms of population proportions, of the frequency of each DNA type in the database. This is not known; hence, X_1, . . . , X_n are not observed. Section 3.5 recalls some notions about random partitions, useful before defining node D, the "reduced" data that we want to evaluate through the likelihood ratio.

Random Partitions and Database Partitions
A partition of a set S is an unordered collection of nonempty and disjoint subsets of S, the union of which forms S. Particularly interesting for our model are partitions of the set S = [n] = {1, . . . , n}, denoted as π_[n]. The set of all partitions of [n] will be denoted as P_[n]. Random partitions of [n] will be denoted as Π_[n], n ∈ N. In addition, a partition of n is a finite nonincreasing sequence of positive integers that sum up to n. Partitions of n will be denoted as π_n, while random partitions as Π_n.
Given a sequence of integer-valued random variables X_1, . . . , X_n, let Π_[n](X_1, X_2, . . . , X_n) be the random partition defined by the equivalence classes of their indices using the random equivalence relation i ∼ j if and only if X_i = X_j. This construction allows one to build a "reduction map" from the set of values of X_1, . . . , X_n to the set of the partitions of [n], as in the following example (n = 10):

N^10 → P_[10], (X_1, X_2, . . . , X_10) ↦ Π_[10](X_1, X_2, . . . , X_10). (5)

Similarly, and in agreement with Assumption 2, in our model we can consider the reduction of data which ignores information about the names of the DNA types: this is achieved, for instance, by retaining from the database only the equivalence classes of the indices of the individuals, according to the equivalence relation "has the same DNA type". Stated otherwise, the database is reduced to the partition π^Db_[n], obtained using these equivalence classes. However, the database only supplies part of the data. There are also two new DNA profiles that are equal to one another (and different from the already observed ones in the rare type match case). Considering the suspect's profile, we obtain the partition π^Db+_[n+1], where the first n integers are partitioned as in π^Db_[n], and n + 1 constitutes a class by itself. Considering the crime stain profile too, we obtain the partition π^Db++_[n+2], where the first n integers are partitioned as in π^Db_[n], and n + 1 and n + 2 belong to the same (new) class. The corresponding random partitions are denoted by Π^Db_[n], Π^Db+_[n+1], and Π^Db++_[n+2], respectively. Since prosecution and defense agree on the distribution of X_1, . . . , X_{n+1}, but not on the distribution of X_{n+2} | X_1, . . . , X_{n+1}, they also agree on the distribution of Π^Db+_[n+1] but disagree on the distribution of Π^Db++_[n+2] (see (4)).
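The reduction map is easy to mirror in code: from a list of (opaque) types, one keeps only the equivalence classes of indices. A small illustrative sketch (function name ours):

```python
def partition_of_indices(profiles):
    """Reduce a database of typed observations to the partition of their
    indices under the relation "has the same type" (Assumption 2)."""
    blocks = {}
    for i, profile in enumerate(profiles, start=1):
        blocks.setdefault(profile, []).append(i)
    # Order blocks by their least element, as in the text.
    return sorted(blocks.values(), key=min)

# Toy database: the labels are opaque "colors"; only equalities matter.
db = ["x", "y", "x", "z", "y"]
print(partition_of_indices(db))  # [[1, 3], [2, 5], [4]]
```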
The crucial points of the model are the following: 1. The random partitions defined through the random variables X_1, . . . , X_{n+2} and those defined through the database are the same: Π_[n+1](X_1, . . . , X_{n+1}) = Π^Db+_[n+1] and Π_[n+2](X_1, . . . , X_{n+2}) = Π^Db++_[n+2]. 2. While X_1, . . . , X_{n+2} are not observable, these partitions are observable.
Notice that, given X_1, . . . , X_{n+2}, D is deterministic. An important result is that, according to Proposition 4 in Pitman [46], it is possible to derive directly the distribution of D | α, θ, H. In particular, it holds that, if P | α, θ ∼ PD(α, θ), then, for all n ∈ N, the random partition Π_[n] = Π_[n](X_1, . . . , X_n) has the following distribution:

P^{α,θ}_n(π_[n]) = [∏_{i=1}^{k−1} (θ + iα)] / (θ + 1)_{n−1} · ∏_{i=1}^{k} (1 − α)_{n_i − 1}, (8)

where k is the number of blocks of π_[n], n_i is the size of the ith block (the blocks are here ordered according to their least element), and (x)_a = x(x + 1) · · · (x + a − 1) denotes the rising factorial, with (x)_a = 1 if a = 0. This formula, also known as the Pitman sampling formula, is further studied in Pitman [47] and shows that P^{α,θ}_n(π_[n]) does not depend on the values of X_1, . . . , X_n, but only on the sizes and the number of classes in the partition. It follows that we can get rid of the intermediate layer of nodes X_1, . . . , X_{n+2}. Moreover, it holds that Pr(D | α, θ, h_p) = P^{α,θ}_{n+1}(π^Db+_[n+1]), while Pr(D | α, θ, h_d) = P^{α,θ}_{n+2}(π^Db++_[n+2]). The model of Figure 1 can thus be simplified to the one in Figure 2.
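The Pitman sampling formula can be evaluated directly with rising factorials. A small Python sketch (function names ours); as a sanity check, the probabilities of the two partitions of [2] sum to one:

```python
import math

def rising(x, a):
    """Rising factorial (x)_a = x (x + 1) ... (x + a - 1), with (x)_0 = 1."""
    out = 1.0
    for j in range(a):
        out *= x + j
    return out

def pitman_eppf(alpha, theta, block_sizes):
    """Probability, under PD(alpha, theta), of a partition of [n] whose
    blocks have the given sizes (the Pitman sampling formula)."""
    n, k = sum(block_sizes), len(block_sizes)
    num = 1.0
    for i in range(1, k):
        num *= theta + i * alpha
    num *= math.prod(rising(1 - alpha, b - 1) for b in block_sizes)
    return num / rising(theta + 1, n - 1)

# The two partitions of [2]: {{1, 2}} and {{1}, {2}}.
print(pitman_eppf(0.5, 1.0, [2]) + pitman_eppf(0.5, 1.0, [1, 1]))  # 1.0
```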

Chinese Restaurant Representation
There is an alternative characterization of this model, called the "Chinese restaurant process", due to Aldous [48] for the one-parameter case, and studied in detail for the two-parameter version in Pitman [44]. It is defined as follows: consider a restaurant with infinitely many tables, each one infinitely large. Let Y_1, Y_2, . . . be integer-valued random variables that represent the seating plan: tables are ranked in order of creation, and Y_i = j means that the ith customer sits at the jth table to be created. The process is described by the following transition probabilities:

Pr(Y_{n+1} = j | Y_1, . . . , Y_n) = (n_j − α)/(n + θ), j = 1, . . . , k,
Pr(Y_{n+1} = k + 1 | Y_1, . . . , Y_n) = (θ + kα)/(n + θ), (9)

where k is the number of tables occupied by the first n customers, and n_j is the number of customers that at that time have been seated at table j. The process depends on two parameters α and θ satisfying the same condition (2). From (9), one can easily see the sufficientness property mentioned in Section 3.3. Y_1, . . . , Y_n are not i.i.d., nor exchangeable, but it holds that Π_[n](Y_1, . . . , Y_n) is distributed as Π_[n](X_1, . . . , X_n), with X_1, . . . , X_n defined as in (3), and they are both distributed according to the Pitman sampling formula (8) [44].
Stated otherwise, we can obtain the same partition π^Db_[n] through the seating plan of n customers or by partitioning the X_1, . . . , X_n of the database. Similarly, π^Db+_[n+1] is obtained when a new customer has chosen an unoccupied table (remember we are in the rare type match case), and π^Db++_[n+2] is obtained when the (n + 2)nd customer goes to the table already chosen by the (n + 1)st customer (suspect and crime stain have the same DNA type). In particular, thanks to (9), we can write:

Pr(Π_[n+2] = π^Db++_[n+2] | Π_[n+1] = π^Db+_[n+1], α, θ, h_p) = 1, (10)
Pr(Π_[n+2] = π^Db++_[n+2] | Π_[n+1] = π^Db+_[n+1], α, θ, h_d) = (1 − α)/(n + 1 + θ), (11)

since, under the defense hypothesis, the (n + 2)nd customer goes to the same table as the (n + 1)st (who was sitting alone).
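The transition rule (9) is straightforward to simulate; ranking the table occupancies of one run gives the kind of approximate PD sample used later in Section 5. A minimal sketch (function names ours):

```python
import random

def crp_next_probs(counts, alpha, theta):
    """Transition probabilities (9) for the next customer, given the table
    occupancies `counts`; the last entry is the new-table probability."""
    n, k = sum(counts), len(counts)
    probs = [(c - alpha) / (n + theta) for c in counts]
    probs.append((theta + k * alpha) / (n + theta))
    return probs

def crp_simulate(n, alpha, theta, rng=random):
    """Seating plan of n customers; returns the list of table occupancies."""
    counts = []
    for _ in range(n):
        j = rng.choices(range(len(counts) + 1),
                        weights=crp_next_probs(counts, alpha, theta))[0]
        if j == len(counts):
            counts.append(1)   # the customer opens a new table
        else:
            counts[j] += 1
    return counts

# E.g., with 4 customers seated as [3, 1], the 5th customer joins the
# singleton table with probability (1 - alpha)/(4 + theta).
```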

A Useful Lemma
The following lemma can be applied to four general random variables Z, X, Y, and H whose conditional dependencies are described by the Bayesian network of Figure 3. The importance of this result is due to the possibility of using it for assessing the likelihood ratio in a very common forensic situation: the prosecution and the defense disagree on the distribution of the entirety of the data (Y) but agree on the distribution of a part of it (X), and these distributions depend on parameters (Z).

Lemma 1.
Given four random variables Z, H, X, and Y, whose conditional dependencies are represented by the Bayesian network of Figure 3, the likelihood function for h, given X = x and Y = y, satisfies

lik(h | x, y) ∝ E(p(y | x, Z, h) | X = x).

Proof. The Bayesian network representation of the model in Figure 3 allows factoring the joint probability density of Z, H, X, and Y as p(z, h, x, y) = p(z) p(x | z) p(h) p(y | x, z, h).
By Bayes' formula, p(z) p(x | z) = p(x) p(z | x). This rewriting corresponds to reversing the direction of the arrow between Z and X.

The random variable X is now a root node. This means that, when we probabilistically condition on X = x, the graphical model changes in a simple way: we can delete the node X and just insert the value x as a parameter in the conditional probability tables of the variables Z and Y, which formerly had an arrow from node X. Conditional on X = x, the joint density of Z, Y, and H is thus equal to

p(z, y, h | x) = p(h) p(z | x) p(y | x, z, h).

The joint density of H and Y given X is obtained by integrating out the variable Z. It can be expressed as a conditional expectation, since p(z | x) is the density of Z given X = x. We find:

p(h, y | x) = p(h) E(p(y | x, Z, h) | X = x).

Recall that this is the joint density of two of our variables, H and Y, after conditioning on the value X = x. Let us now also condition on Y = y. It follows that the density of H given X = x and Y = y is proportional (as a function of h, for fixed x and y) to the same expression, p(h) E(p(y | x, Z, h) | X = x).
This is a product of the prior for h with some function of x and y. Since posterior odds equal prior odds times likelihood ratio, it follows that the likelihood function for h, given X = x and Y = y, satisfies lik(h | x, y) ∝ E(p(y | x, Z, h) | X = x).

Corollary 1. Given four random variables Z, H, X, and Y, whose conditional dependencies are represented by the network of Figure 3, the likelihood ratio for H = h_1 against H = h_2, given X = x and Y = y, satisfies

LR = E(p(y | x, Z, h_1) | X = x) / E(p(y | x, Z, h_2) | X = x).

The Likelihood Ratio
From now on, we will omit the superscripts Db, Db+, and Db++ for ease of notation. Using the hypotheses and the reduction of data D defined in Section 3, the likelihood ratio will be defined as

LR = Pr(Π_[n+1] = π_[n+1], Π_[n+2] = π_[n+2] | h_p) / Pr(Π_[n+1] = π_[n+1], Π_[n+2] = π_[n+2] | h_d) = Pr(Π_[n+2] = π_[n+2] | h_p) / Pr(Π_[n+2] = π_[n+2] | h_d).

The last equality holds due to the fact that Π_[n+1] is a deterministic function of Π_[n+2].
Corollary 1 can be applied to our model, since defense and prosecution agree on the distribution of Π_[n+1], but not on the distribution of Π_[n+2], and the data depend on the parameters α and θ. Thus, if (A, Θ) plays the role of Z, X = Π_[n+1], and Y = Π_[n+2], by using (10) and (11), we obtain:

LR = 1 / E((1 − A)/(n + 1 + Θ) | Π_[n+1] = π_[n+1]).
The expected value is taken with respect to the posterior distribution of A and Θ, given Π_[n+1] = π_[n+1]. The solution we propose in this paper is to deal with the uncertainty about α and θ by using maximum likelihood estimators and plugging those estimates into the formula. Notice that this is equivalent to a hybrid approach, in which the parameters are estimated in a frequentist way and their values are plugged into the Bayesian LR. In the future, we plan to use MCMC methods to calculate the exact posterior distribution, given assumed priors on the hyperparameters.
By defining the random variable Φ = (1 − A)/(n + 1 + Θ), we can write the LR as

LR = 1 / E(Φ | Π_[n+1] = π_[n+1]).
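Once point estimates of α and θ are plugged in, as proposed above, Φ becomes a number and the LR reduces to simple arithmetic. An illustrative sketch (function name ours; the numbers are the estimates reported in Section 5.1, and the printed value is just the evaluated formula, not a case-specific result):

```python
def lr_plugin(n, alpha, theta):
    """Plug-in likelihood ratio for the rare type match:
    LR = 1 / Phi, with Phi = (1 - alpha)/(n + 1 + theta) evaluated at
    point estimates of alpha and theta."""
    return (n + 1 + theta) / (1 - alpha)

# With the MLEs of Section 5.1 (alpha = 0.51, theta = 216) and n = 18,925:
print(round(lr_plugin(18925, 0.51, 216)))  # 39065
```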

Analysis on the YHRD Database
In this section, we present the study we made on a database of 18,925 Y-STR 23-loci profiles from 129 different locations in 51 countries in Europe [37]. Our analyses are performed by considering only 7 Y-STR loci (DYS19, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393), but similar results have been observed with the use of 10 loci.

Model Fitting
In Figure 4, the ranked frequencies of the 18,925 Y-STR profiles of the YHRD database are compared to the relative frequencies of samples of size n obtained from several realizations of PD(α_MLE, θ_MLE), where α_MLE and θ_MLE are the maximum likelihood estimates obtained using the entire database and the likelihood defined by (8). Their values are α_MLE = 0.51 and θ_MLE = 216.
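For a PD(α, θ) prior, the probability of an observed partition with block sizes (n_1, ..., n_K) is given by Pitman's sampling formula; assuming this is the likelihood (8), the MLE step can be sketched as a crude grid search (function names are illustrative, not from the paper):

```python
import math

def pd_log_likelihood(block_sizes, alpha, theta):
    """Log probability of a partition with the given block sizes under
    PD(alpha, theta) (Pitman's sampling formula):
      prod_{i=1}^{K-1}(theta + i*alpha) / (theta + 1)_{n-1}
        * prod_j (1 - alpha)_{n_j - 1},
    where (x)_m denotes the rising factorial."""
    n, K = sum(block_sizes), len(block_sizes)
    ll = sum(math.log(theta + i * alpha) for i in range(1, K))
    ll -= sum(math.log(theta + 1 + j) for j in range(n - 1))
    ll += sum(math.log(1 - alpha + m) for nj in block_sizes for m in range(nj - 1))
    return ll

def mle_grid(block_sizes, alphas, thetas):
    """Crude grid-search MLE for (alpha, theta); a real analysis would use
    a numerical optimizer instead."""
    return max(((a, t) for a in alphas for t in thetas),
               key=lambda at: pd_log_likelihood(block_sizes, *at))
```

As a sanity check, the two possible partitions of n = 2 (one pair, or two singletons) receive probabilities (1 − α)/(1 + θ) and (θ + α)/(1 + θ) under this formula, which sum to one.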
To do so, we ran the Chinese Restaurant seating plan several times (up to n = 18,925 customers): each run is used to approximate a new realization p from PD(α_MLE, θ_MLE). As explained in Section 3.6, the partition of the customers into tables is the same as the partition obtained from an i.i.d. sample of size n from p. We can see that, for the most common haplotypes (left part of the plot), there is some discrepancy. However, we are interested in rare haplotypes, which typically have a frequency belonging to the right part of the plot. In that region, the two-parameter Poisson Dirichlet follows the distribution of the data quite well. The dotted line in Figure 4 shows the asymptotic behavior of the two-parameter Poisson Dirichlet distribution. Indeed, if P ∼ PD(α, θ), then P_i/(Z i^(−1/α)) → 1 a.s. as i → +∞, for a random variable Z such that Z^(−α) = Γ(1 − α)/S_α. This power-law behavior describes an incredible variety of phenomena [23]. The thick line in Figure 4 also seems to have a power-law behavior, and, to be honest, we were hoping to observe the same asymptotic slope as the prior. This is not what we observe but, as Figure 5 shows, for such a big value of θ we would need a much bigger database (at least n = 10^6) to see the correct slope.
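The seating plan itself is straightforward to simulate; a minimal sketch, assuming the standard PD(α, θ) prediction rule (names are illustrative):

```python
import random

def crp_partition(n, alpha, theta, seed=0):
    """Chinese Restaurant seating plan for PD(alpha, theta): with K tables
    occupied and i customers seated, customer i+1 joins a table with n_k
    occupants with probability (n_k - alpha)/(i + theta) and opens a new
    table with probability (theta + K*alpha)/(i + theta)."""
    rng = random.Random(seed)
    tables = []  # occupancy counts
    for i in range(n):  # i customers already seated
        k = len(tables)
        # unnormalized weights: existing tables first, then a new table
        weights = [nk - alpha for nk in tables] + [theta + k * alpha]
        j = rng.choices(range(k + 1), weights=weights)[0]
        if j == k:
            tables.append(1)
        else:
            tables[j] += 1
    return sorted(tables, reverse=True)
```

Dividing the ranked table occupancies by n gives the ranked relative frequencies of an i.i.d. sample of size n from one realization of PD(α, θ), which is exactly what Figure 4 compares against the database.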
A Gaussian distribution centered at the MLE parameters, with covariance matrix the inverse of the Fisher information, is also displayed (dashed lines). This is not done to show an asymptotic property, but to show the symmetry of the log-likelihood, which validates the approximation

E(Φ | Π[n+1] = π[n+1]) ≈ φ(α_MLE, θ_MLE) = (n_1 − α_MLE)/(n + 1 + θ_MLE). (13)

Hence, one could safely make this approximation if one believed that this symmetry would also hold in the real data situation at hand. Notice that this is equivalent to a hybrid approach, commonly called "Empirical Bayes", in which the parameters are estimated through the MLE (frequentist) and their values are plugged into the Bayesian LR. We would like to reiterate that we are not using maximum likelihood estimates of the parameters because we consider the likelihood ratio from a frequentist point of view. Our aim is to calculate a Bayesian likelihood ratio, and we have observed empirically that, using the maximum likelihood estimates of the parameters, we can approximate this value well.
Hence, in the case of a rare type match, using the YHRD database as the reference database, we obtain log10 LR = 4.59, which corresponds to saying that it is approximately 40,000 times more likely to observe the reduced data under the prosecution hypothesis than under the defense hypothesis.

Figure 6. Relative log-likelihood for φ = (n_1 − α)/(n + 1 + θ) and θ, compared to a Gaussian distribution displayed with 95% and 99% confidence intervals.
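Under the Empirical Bayes approximation, the plug-in likelihood ratio is a one-line computation. A minimal sketch, assuming n_1 = 1 (the matching type is seen exactly once in the augmented database, as in the rare type match):

```python
import math

def plugin_log10_lr(n, n1, alpha, theta):
    """Empirical Bayes plug-in likelihood ratio: LR = 1/phi with
    phi = (n1 - alpha)/(n + 1 + theta)."""
    phi = (n1 - alpha) / (n + 1 + theta)
    return math.log10(1.0 / phi)

# YHRD example: n = 18,925 profiles, MLEs alpha = 0.51 and theta = 216,
# and n1 = 1 (assumption: the matching type is a singleton in db+)
print(round(plugin_log10_lr(18925, 1, 0.51, 216), 2))  # → 4.59
```

With these values the plug-in formula reproduces the log10 LR = 4.59 reported above.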

True LR
It is also interesting to study the likelihood ratio values obtained with our method according to Formula (4), and to compare them with the 'true' values, meaning the LR values obtained when the vector p is known, over simulated rare type match cases. This corresponds to the desirable (even though completely imaginary) situation of knowing the ranked list of the frequencies of all the DNA types in the population of interest. The model can be represented by the Bayesian network of Figure 7. The likelihood ratio in this case can again be obtained using Corollary 1, where now X_1, ..., X_{n+1} play the role of Z. Indeed, now that p is known, the unobservable part of the model consists of the ranks of the types in the database.
Notice that, in the rare type case, X_{n+1} is observed only once among X_1, ..., X_{n+1}. Hence, we call it a singleton, and its distribution given p and π[n+1] is the same as the distribution of every other singleton. Let s_1 denote the number of singletons, and S the set of indices of the singleton observations in the augmented database. Notice also that the knowledge of p and π[n+1] is not enough to determine X_1, ..., X_{n+1}.

Let us denote by X*_1, ..., X*_K the K different values taken by X_1, ..., X_{n+1}, ordered decreasingly according to the frequency of their values. Stated otherwise, if n_i is the frequency of x*_i among x_1, ..., x_{n+1}, then n_1 ≥ n_2 ≥ ... ≥ n_K. In case X*_i and X*_j have the same frequency (n_i = n_j), they are ordered increasingly according to their values. For instance, if X_1 = 2, X_2 = 4, X_3 = 2, X_4 = 4, X_5 = 3, X_6 = 3, X_7 = 10, X_8 = 13, X_9 = 5, X_10 = 4, X_11 = 9, then X*_1 = 4, X*_2 = 2, X*_3 = 3, X*_4 = 5, X*_5 = 9, X*_6 = 10, X*_7 = 13. By definition, it holds that n_1 + n_2 + ... + n_K = n + 1, so (n_1, n_2, ..., n_K) is a partition of n + 1, which will be denoted as π_{n+1}. In the example, π_{n+1} = (3, 2, 2, 1, 1, 1, 1). Since the distribution of ∑_{j: n_j = 1} p_{X*_j} only depends on π_{n+1}, the latter can replace π[n+1].

A more compact representation of π_{n+1} can be obtained by using two vectors a and r, where the a_j are the distinct numbers occurring in the partition, increasingly ordered, and each r_j is the number of repetitions of a_j. If J is the common length of these two vectors, it holds that n + 1 = ∑_{j=1}^{J} a_j r_j. In the example above, π_{n+1} can be represented by (a, r) with a = (1, 2, 3), r = (4, 2, 1), and J = 3.
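The reduction of a sample to the partition π_{n+1} and to its compact representation (a, r) can be sketched directly (function names are illustrative; the sample is the example from the text):

```python
from collections import Counter

def partition_of(sample):
    """Multiset of type frequencies, decreasingly ordered: the partition
    pi_{n+1} of the sample size."""
    return sorted(Counter(sample).values(), reverse=True)

def compact(partition):
    """Compact representation (a, r): distinct block sizes a (increasingly
    ordered) and their repetition counts r."""
    counts = Counter(partition)
    a = sorted(counts)
    r = [counts[x] for x in a]
    return a, r

sample = [2, 4, 2, 4, 3, 3, 10, 13, 5, 4, 9]   # the running example
print(partition_of(sample))           # → [3, 2, 2, 1, 1, 1, 1]
print(compact(partition_of(sample)))  # → ([1, 2, 3], [4, 2, 1])
```

Note that sum(a_j * r_j) recovers n + 1 = 11, as required.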
There is an unknown map χ, treated here as a latent variable, which assigns to each rank of the DNA types, ordered according to their frequency in nature, one of the numbers {0, 1, ..., J}: χ(i) = j > 0 means that the type ranked i in nature is observed a_j times in the sample, and χ(i) = 0 means that it is not observed. Given π_{n+1} = (a, r), χ must satisfy the following set of J conditions:

#{i : χ(i) = j} = r_j, for j = 1, ..., J. (15)

In addition, it should not be allowed that a profile observed k_N times in the population is observed k_n > k_N times in the sample. Hence, we have to add a further condition:

a_{χ(i)} ≤ N p_i, for all i with χ(i) > 0, (16)

where N is the size of the entire population. The map χ can be represented by a vector χ = (χ_1, χ_2, ...) such that χ_i = χ(i). In the example, we have χ = (0, 2, 2, 3, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, ...).
Notice that, given π_{n+1} = (a, r), the knowledge of χ implies the knowledge of X*_1, ..., X*_K: indeed, it is enough to consider the positions of the ranked positive values of χ, solving ties by the positions themselves (if χ_i = χ_j, then the order is given by i and j). For instance, in the example, if we sort the positive values of χ decreasingly and collect their positions, we get (4, 2, 3, 5, 9, 10, 13): the reader can check that we got back X*_1, ..., X*_7. This means that, to obtain the distribution of X*_1, ..., X*_K | π_{n+1}, p, which appears in (14), it is enough to obtain the distribution of χ | a, r, p. Moreover, since we are only interested in the mean of the sum of the singleton frequencies in samples of size n + 1 from the distribution of X*_1, ..., X*_K | a, r, p, we can simply simulate samples from the distribution of χ | a, r, p and sum the p_i such that χ_i = 1.
It holds that

p(a, r | χ, p) ∝ ∏_{i: χ_i > 0} p_i^{a_{χ_i}}, (17)

where the proportionality factor is a combinatorial coefficient depending only on a and r, not on χ, so that it cancels in ratios of these likelihoods.

Details of the Metropolis-Hastings Algorithm
Notice that in the model we assumed p to be infinitely long, but for simulations we will use a finite p of length m. This is equivalent to assuming that only m elements of the infinite p are positive, and that the remaining infinite tail is made of zeros.
To simulate samples from the distribution of χ | a, r, p, we use a Metropolis-Hastings algorithm on the space of vectors χ satisfying the J + m conditions (15) and (16). The state space of the Metropolis-Hastings Markov chain is thus made of all vectors of length m whose elements belong to {0, 1, ..., J} and which satisfy conditions (15) and (16). If we start with an initial point χ^0 satisfying (15) and, at each move t of the Metropolis-Hastings, we swap two different values χ_i and χ_j inside the vector, condition (15) remains satisfied, while condition (16) must be checked at every iteration. The Metropolis factor is the ratio of the two likelihoods p(a, r | χ^{t+1}, p) and p(a, r | χ^t, p), where χ^t and χ^{t+1} differ only in that χ_i and χ_j are exchanged. Hence, using (17), the Metropolis factor for every move is

R = (p_i / p_j)^(a_{χ_j} − a_{χ_i}),

with the convention a_0 = 0. Every exchange move is then accepted with probability min(1, R). The algorithm is iterated N = 10^5 times, with a thinning step of 10^3 and a burn-in period of 20,000 iterations. For every accepted χ, we calculate the sum of all p_i such that χ_i = 1, and we use the average of these sums to approximate the denominator of (14). The algorithm is based on a similar one proposed in Anevski et al. [15]. This method allows us to approximate the 'true' LR when the vector p is known. This is almost never the case, but we can put ourselves in a fictitious world where we know p (such as the frequencies in the YHRD database or, as in the following section, the frequencies from a smaller population) and compare the true values of LR|p with those obtained by applying our Bayesian nonparametric model when p is unknown.
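A minimal sketch of the swap-move Metropolis-Hastings just described, under the assumption that the likelihood ratio of a swap follows (17); all names are illustrative, and the greedy initialization is only one possible way to build an admissible starting state:

```python
import random

def mh_chi(p, a, r, N_pop, n_iter=10_000, burn_in=2_000, thin=10, seed=0):
    """Swap-move Metropolis-Hastings over admissible maps chi.
    chi[i] = j > 0 means the type with population rank i is observed
    a[j-1] times in the sample; chi[i] = 0 means it is unobserved.
    Returns the average, over kept states, of the total frequency of the
    types labeled 1 (the singletons when a[0] == 1)."""
    rng = random.Random(seed)
    m, J = len(p), len(a)

    def size(j):           # a_{chi_i}, with the convention a_0 = 0
        return a[j - 1] if j > 0 else 0

    def admissible(i, j):  # condition (16): a_{chi_i} <= N * p_i
        return size(j) <= N_pop * p[i]

    # greedy initial state satisfying (15): place the r_j copies of each
    # label j on the most frequent types, largest block sizes first
    order = sorted(range(m), key=lambda i: -p[i])
    chi, pos = [0] * m, 0
    for j in range(J, 0, -1):
        for _ in range(r[j - 1]):
            while not admissible(order[pos], j):
                pos += 1
            chi[order[pos]] = j
            pos += 1

    sums = []
    for t in range(n_iter):
        i, j = rng.sample(range(m), 2)   # propose swapping chi_i and chi_j
        if chi[i] != chi[j] and admissible(i, chi[j]) and admissible(j, chi[i]):
            # Metropolis factor R = (p_i/p_j)^(a_{chi_j} - a_{chi_i})
            R = (p[i] / p[j]) ** (size(chi[j]) - size(chi[i]))
            if rng.random() < min(1.0, R):
                chi[i], chi[j] = chi[j], chi[i]
        if t >= burn_in and (t - burn_in) % thin == 0:
            sums.append(sum(p[i] for i in range(m) if chi[i] == 1))
    return sum(sums) / len(sums)
```

The returned average approximates the expected total frequency of the singleton types, i.e., the quantity in the denominator of (14), when p is known.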

Frequentist-Bayesian Analysis of the Error
A real Bayesian statistician chooses the prior and hyperprior according to her beliefs. Depending on the choice of the hyperprior over α and θ, she may or may not believe in the approximation (13), but she does not really speak of "error". However, hardline Bayesian statisticians are a rare species, and most of the time the Bayesian procedure consists of choosing priors (and hyperpriors) which are a compromise between personal beliefs and mathematical convenience. It is thus interesting to investigate the performance of such priors. This can be done by comparing the Bayesian likelihood ratio with the likelihood ratio that one would obtain if the vector p were known, for the same reduction of data. This is what we call "error": in other words, for the moment, we are considering the Bayesian nonparametric method proposed in this paper as a way to estimate (notice the frequentist terminology) the true LR|p. If we denote by p_x the population proportion of the matching profile, another interesting comparison is between the Bayesian likelihood ratio and the frequentist likelihood ratio 1/p_x (here denoted LR_f) that one would obtain knowing p, but without reducing the data to a partition. This is a sort of benchmark comparison and tells us how much we lose by using the Bayesian nonparametric methodology and by reducing the data.
In total, there are three quantities of interest (log10 LR, log10 LR|p, and log10 LR_f), and two differences of interest, which will be denoted as
• Diff_1 = log10 LR − log10 LR|p (loss due to the choice of the Poisson Dirichlet model and approximation (13)),
• Diff_2 = log10 LR − log10 LR_f (overall loss).
In order to analyze these five quantities, we can study their distribution over different rare type match cases. However, there is an obstacle. The Metropolis-Hastings algorithm described in Section 5.3 is too slow to be used with the entire European database of Purps et al. [37] of size n = 18,925.
We therefore restrict ourselves to smaller databases, for which the Metropolis-Hastings algorithm can be used to simulate LR|p. The model fitting is still good enough, as shown in Figure 8 (as a side note, notice that the asymptotic behavior is reached faster for this smaller value of θ_MLE = 22).
However, it is important to stress that the Gaussian shape, and consequently the approximation (13), is not empirically supported for small databases of size n = 100.
In Table 1 and Figure 9a, we compare the distributions of log10 LR|p, log10 LR, and log10 LR_f obtained from 96 samples of size 100 from the Dutch population. Each sample represents a different rare type match case with a specific reference database of size n = 100.
The distribution of the benchmark likelihood ratio log10 LR_f shows more variation than the distribution of the Bayesian likelihood ratio, while log10 LR|p appears to be the most concentrated around its mean.

Figure 9. (a) Comparison between the distributions of log10 LR, log10 LR|p, and log10 LR_f; (b) the errors log10 LR − log10 LR|p and log10 LR − log10 LR_f.
In Table 2 and Figure 9b, we consider the distribution of the two differences, Diff_1 and Diff_2. Diff_1 is the smaller and the more concentrated of the two: it ranges between −0.146 and 0.381 and has a small standard deviation. This means that the nonparametric Bayesian likelihood ratio obtained as in (13) can be regarded as a good approximation of the frequentist likelihood ratio for the same reduction of data (log10 LR|p), even though we have not empirically validated the approximation for small databases of size 100. This difference is due to three things: the approximation (13), the MLE estimation of the hyperparameters, and the choice of a prior distribution (the two-parameter Poisson Dirichlet) which is quite realistic, as shown in Figure 8, but does not fit the actual population perfectly. Notice that the difference increases when the Bayesian nonparametric likelihood ratio is compared to the benchmark likelihood ratio (Diff_2): here, the reduction of data comes into play too. However, this difference stays within one order of magnitude and most of the time lies between −0.676 and 0.115; thus, it remains small.

Conclusions
This paper discusses the first application of a Bayesian nonparametric method to likelihood ratio assessment in forensic science, in particular to the challenging situation of the rare type match. Compared to traditional Bayesian methods, such as those described in Cereda [6], it presents many advantages. First of all, the prior chosen for the parameter p seems to be quite realistic for the population whose frequencies we want to model. Moreover, though the theoretical background on which it rests may seem very technical and difficult, the method is extremely simple in practice, thanks to the use of an Empirical Bayes approximation. More could be done in the future, in particular regarding approximation (13): the posterior expectation in the denominator could, for instance, be treated using MCMC or ABC algorithms. We could also try to improve the efficiency of the Metropolis-Hastings algorithm defined in Section 5.3, so that it can be used with bigger and better-characterized populations. The big remaining problem is how to use these methods when relevant populations are poorly defined and accessible databases are of doubtful relevance; we do not solve that problem here.
It is not clear whether other methods perform better; this question is still very much open and controversial. We suggest that the analyst perform several quite different analyses and think carefully about what the differences between the conclusions tell her. With this aim, we plan to compare this Bayesian nonparametric method to other existing methods for the rare type match problem, investigating calibration and validation through the use of ECE plots [49].
Author Contributions: The original version of this paper was written while G.C. was a PhD student at Leiden University under the supervision of R.D.G. The lead author, G.C., contributed the original conception of this paper, performed most of the literature research and computation, and drafted the original manuscript. Intensive discussions and much rewriting and revision by the two authors together, during which both contributed further new ideas, finally led to this version. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Swiss National Science Foundation through Grant No. P2LAP2 178195.