Article

A Nonparametric Bayesian Approach to the Rare Type Match Problem

Mathematical Institute, Leiden University, Postbus 9512, 2300 RA Leiden, The Netherlands
* Author to whom correspondence should be addressed.
Entropy 2020, 22(4), 439; https://doi.org/10.3390/e22040439
Submission received: 27 January 2020 / Revised: 2 April 2020 / Accepted: 9 April 2020 / Published: 13 April 2020

Abstract

The “rare type match problem” is the situation in which, in a criminal case, the suspect’s DNA profile, matching the DNA profile of the crime stain, is not in the database of reference. Ideally, the evaluation of this observed match in the light of the two competing hypotheses (the crime stain was left by the suspect or by another person) should be based on the calculation of the likelihood ratio, which depends on the population proportions of the DNA profiles, and these are unknown. We propose a Bayesian nonparametric method that uses a two-parameter Poisson Dirichlet distribution as a prior over the ranked population proportions and discards the information about the names of the different DNA profiles. The model is validated using data from European Y-STR DNA profiles, and the calculation of the likelihood ratio becomes quite simple thanks to an Empirical Bayes approach, for which we provide a motivation.

1. Introduction

The largely accepted method for evaluating how much some available data D (typically forensic evidence) helps discriminate between two hypotheses of interest (the prosecution hypothesis H p and the defense hypothesis H d ) is the calculation of the likelihood ratio (LR), a statistic that expresses the relative plausibility of the data under these hypotheses, defined as
LR = Pr(D | H_p) / Pr(D | H_d). (1)
Widely considered the most appropriate framework to report a measure of the ‘probative value’ of the evidence regarding the two hypotheses [1,2,3,4], it indicates the extent to which observed data support one hypothesis over the other. The likelihood ratio is supposed to be multiplied by the prior odds, in order to obtain the posterior odds. The latter is the quantity of interest for a judge, but the prior odds do not fall within the statistician’s competence. Even if a judge does not explicitly do Bayesian updating, the likelihood ratio is still considered to be the correct way for the expert to communicate their evaluation of the weight of the evidence to the court. We refer the reader to Taroni et al. [5] for extensive arguments for the use of likelihood ratios in forensic statistics. Forensic literature presents many approaches to calculate the LR, mostly divided into Bayesian and frequentist methods (see Cereda [6,7] for a careful differentiation between these two approaches).
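The updating step just described is a one-line computation. As a minimal numerical sketch (the function name and the numbers are ours, purely illustrative, not taken from any real case):

```python
def posterior_odds(lr: float, prior_odds: float) -> float:
    """Bayes' rule in odds form: posterior odds = likelihood ratio * prior odds."""
    return lr * prior_odds

# Hypothetical numbers: an LR of 200 combined with prior odds of 1:1000.
odds = posterior_odds(lr=200.0, prior_odds=1 / 1000)
prob = odds / (1 + odds)  # convert posterior odds to a posterior probability
```

With these hypothetical numbers, prior odds of 1:1000 and an LR of 200 yield posterior odds of 0.2, i.e., a posterior probability of 1/6; the statistician supplies only the LR, while the prior odds remain the court's.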
This paper proposes a first application of a Bayesian nonparametric method to assess the likelihood ratio in the rare type match case, the challenging situation in which there is a match between some characteristic of the recovered material and of the control material, but this characteristic has not been observed before in previously collected samples (i.e., in the database of reference). This constitutes a problem because the value of the likelihood ratio depends on the unknown proportion of the matching characteristic in a reference population, and the uncertainty over this proportion, in standard practice for simpler situations, is dealt with using the relative frequency of the characteristic in the available database. In particular, we will focus on Y-STR data, for which the rare type match problem keeps turning up [7]. The problem is so substantial that it has been called “the fundamental problem of forensic mathematics” [8].
The use of our “Bayesian nonparametric” method involves the mathematical assumption that there are infinitely many different Y-STR profiles. Of course, we do not believe this literally to be true. We do suppose that there are so many profiles that we cannot say anything sensible about their exact number, except that it is very large. Hence, we pretend they are infinitely many, so that we can use the chosen Bayesian nonparametric method. The nomenclature is misleading, since it gives the impression that there is no parameter: a nonparametric model is not a model without parameters, but a model with at least one infinite-dimensional parameter. Some might call this model “semiparametric”.
The parameter of the model is the infinite-dimensional vector p , containing the (unknown) sorted population proportions of all possible Y-STR profiles. As a prior over p , we choose the two-parameter Poisson Dirichlet distribution, and we model the uncertainty over its hyperparameters α and θ through the use of a hyperprior. The information contained in the actual repeat numbers, the list of which forms the name of each Y-STR profile, is discarded, thereby reducing the full data to the smaller reduced data D.
If compared to traditional Bayesian methods such as those discussed in Cereda [6], this method has the advantage of having a prior for the parameter p that is more realistic for the population we intend to model. Moreover, despite its technical theoretical background, we empirically derived an approximation that makes the method intuitive and simple to apply in practice: indeed, simulation experiments show that an (“empirical Bayes”) hybrid approach that plugs in maximum likelihood estimators for the hyperparameters is justified, at least when using populations having features like those we observe in combined European data. The last point in favor of the choice of the two-parameter Poisson Dirichlet prior over p is that it has the following sufficiency property: the probability of observing a new Y-STR profile depends only on the number of already observed Y-STR profiles and on the sample size, while the probability of observing a Y-STR profile that is already in the database depends only on its frequency in the database and on the sample size.
The paper is structured as follows: Section 2 discusses the state of the art regarding the rare type match problem and the evaluation of Y-STR matches. Section 3 presents our model, with the assumptions and the prior distribution chosen for the parameter p along with some theory on random partitions and the Chinese restaurant representation, useful to provide a prediction rule and a convenient and compact representation of the reduced data D. In addition, a lemma that facilitates computing the likelihood ratio in a very simple way is presented and proved. In Section 4, the likelihood ratio is derived. Section 5 illustrates the application of this model to a database sampled from an artificial population. We will discuss data-driven choices for the hyperparameters, and the derivation of the likelihood ratio values obtained both with and without reducing the data to partitions, in the ideal situation in which the vector p is known. In addition, the distribution of the likelihood ratios for different rare type match cases is studied, along with the analysis of two different errors.

2. State of the Art

Y-STR data have been our main motivation for studying the rare type match problem. Our model will reduce the relationship among profiles to a binary “match” or “no match” equivalence relation. However, there is considerable debate in the scientific community regarding whether it is acceptable to throw away the genetic structure of this kind of data. In this section, we discuss the state of the art regarding the rare type match problem as a general issue, as well as the state of the art regarding methods for assessing the evidential value of matching Y-STR profiles. Moreover, it should be realized that the rare type match problem is an interesting problem outside the Y-STR profile setting as well, and our model can perhaps also be applied to other kinds of data.

2.1. The Rare Type Match Problem

The evaluation of a match between the profile of a particular piece of evidence and a suspect’s profile should depend on the relative frequencies of that profile in the population of potential perpetrators. Indeed, it is intuitive that the rarer the matching profile, the more the suspect is in trouble. A big problem arises when the observed frequency of the profile in a sample (database) from the population of interest is 0 (there is already a problem when the number is small, since it means that we don’t know the population frequency very well). This problem could have been named “the new type match problem”, but we decided to use the name “rare type match problem”, motivated by the fact that a profile that has zero occurrences is likely to be rare, even though it is challenging to quantify how rare it is. The rare type match problem is particularly important for new kinds of forensic evidence, such as results from DIP-STR markers (see, for instance, Cereda et al. [9]) for which the available database size is still limited. The problem also occurs when more established types of evidence, such as Y-chromosome (or mitochondrial) DNA profiles are used, as explained in Section 2.2: they have been our main motivation for the present study.
The rare type match problem has been addressed in well-known non-forensic statistics domains, and many solutions have been proposed. The empirical frequency estimator, also called the naive estimator, which uses the frequency of the characteristic in the database, puts unit probability mass on the set of already observed characteristics, and it is thus unprepared for the observation of a new type. A solution could be the add-constant estimators (in particular the well-known add-one estimator, due to Laplace [10], and the add-half estimator of Krichevsky and Trofimov [11]), which add a constant to the count of each type, including the unseen ones. However, these methods require knowledge of the number of possible unseen types, and they perform badly when this number is large (and anyway unknown) compared to the sample size (see Gale and Church [12] for further discussion). Alternatively, Good [13], based on an intuition of A.M. Turing, proposed the Good–Turing estimator for the total unobserved probability mass, based on the proportion of singleton observations in the sample. An application of this estimator to the frequentist LR assessment in the rare type match case is proposed in Cereda [7].
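The estimators just mentioned are simple to state in code. The following sketch contrasts them (function names are ours; `num_types` stands for the total number of possible types, which the add-constant estimators require and which is typically unknown in the rare type match setting):

```python
from collections import Counter

def naive_estimate(counts, t):
    """Relative frequency in the database; 0 for a never-observed type."""
    n = sum(counts.values())
    return counts.get(t, 0) / n

def add_constant_estimate(counts, t, c, num_types):
    """Add-constant estimator (c = 1: Laplace; c = 1/2: Krichevsky-Trofimov).

    Note that it needs num_types, the total number of possible types,
    which is exactly what is typically unknown."""
    n = sum(counts.values())
    return (counts.get(t, 0) + c) / (n + c * num_types)

def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total unobserved probability mass:
    the proportion of singleton observations in the sample."""
    n = sum(counts.values())
    n1 = sum(1 for v in counts.values() if v == 1)
    return n1 / n

db = Counter(["a", "a", "b", "c", "d"])  # toy database: 5 observations, 3 singletons
```

On this toy database the naive estimator assigns probability 0 to any new type, whereas the Good–Turing estimator reserves mass 3/5 for unseen types, illustrating how different the two answers can be when singletons dominate.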
More recently, Orlitsky et al. [14] introduced the high-profile estimator, which extends the tail of the naive estimator to the region of unobserved types. Anevski et al. [15] improved this estimator and provided a consistency proof for their modified version (the original authors only provided heuristic reasoning that turned out to be rather difficult to make rigorous).
Bayesian nonparametric estimators for the probability of observing a new type have been proposed by Tiwari and Tripathi [16] using Dirichlet processes, by Lijoi et al. [17] and De Blasi et al. [18] using a general Gibbs prior, and by Favaro et al. [19] with specific focus on the two-parameter Poisson Dirichlet prior, for which Arbel et al. [20] provide large sample asymptotics and credible bands. In particular, Favaro et al. [21] show the link between the Bayesian nonparametric approach and the Good–Turing estimator. However, the LR assessment requires not only the probability of observing a new species, but also the probability of observing this same species twice (according to the defense, the crime stain profile and the suspect profile are two identical independent observations): to our knowledge, the present paper is the first to address the problem of assessing the LR in the rare type match case using a Bayesian nonparametric model (i.e., with an infinite-dimensional parameter). As a prior for p , we will use the two-parameter Poisson Dirichlet distribution, which has proved useful in many discrete domains, in particular language modeling [22]. In addition, it predicts a power-law behavior that describes an incredible variety of phenomena [23], including the observed distribution of Y-STR haplotypes.

2.2. Evaluation of Matching Probabilities of Y-STR Data

Y-STR profiles are typically used to detect male DNA in male–female DNA mixtures and consist of a number (usually varying from 7 to 23) of integers, which we treat as categorical observations, corresponding to STR polymorphisms belonging to the non-recombining part of the Y-chromosome. There is no biological reason to assume independence among Y-STR loci, and even though the lack of recombination is in principle balanced by recurrent and backward mutations, the existence of such a dependency is studied and confirmed by Caliebe et al. [24]. As far as Y-STR population frequencies are concerned, the dependency between loci implies that no factorization (of the kind used for the autosomal markers) can be used to calculate these frequencies, and that the available databases are too small with respect to the large space of possible profiles (hence, a database will likely contain a high proportion of singletons). Indeed, the rare type match case is very frequent when using Y-STR data, and the use of simplistic methods such as the profile count is too conservative for practical use (it is bounded from below by the inverse of the database size) [24]. In Andersen et al. [25] and Andersen et al. [26], approximations of the joint distribution with second and third order dependencies between loci are explored. However, as admitted by the authors, there is a limitation due to the inadequacy of the sizes of available databases, which makes it necessary to use simulations that in turn are oversimplifications of real data.
Moreover, as highlighted already in 1994 by Balding and Nichols [27], match probabilities cannot be identified with population frequencies, since a match can also be due to a certain degree of relatedness between the two donors of the stain. This is particularly true for Y-STR data, since Y-STR profiles are inherited almost identically from father to son. More recently, Andersen and Balding [28] investigated the influence of relatedness on matches, concluding that 95% of matching profiles are separated by a relatively small number (50–100) of meioses; according to their study, the degree of relatedness is thus a very influential factor. They propose a method to describe the distribution of the number of males with a matching Y-STR profile, extending the approach to mixtures in Andersen and Balding [29]. One limitation of this study is that it is based on extensive simulations which have to be performed anew in each new application, on assumptions about the genetic evolutionary model, and on parameters which are essentially unknown.
A huge number of methods have been developed to assess evidential value for Y-STR data. Among those developed precisely for the rare type match case are Egeland and Salas [30], Brenner [8], Cereda [7], and Cereda [6]. These methods do not take into account the genetic information contained in the allelic numbers forming a Y-STR DNA profile. They do not use the fact that, due to relatedness, the observation of a particular Y-STR profile increases the probability of observing the same Y-STR profile again, or Y-STR profiles that differ only by a few alleles. We refer the reader to Roewer [31], Buckleton et al. [32], Willuweit et al. [33], and Wilson et al. [34] for models that use population genetics for coancestry. These models are not designed to be used for the rare type match case, though the Discrete Laplace method presented in Andersen et al. [35] can be successfully applied to that purpose, as shown in Cereda [7].
After a careful study of the available methods for assessing likelihood ratios (or matching probabilities) for Y-STR matches, one can see that they are of different natures (some of them do their best to exploit the genetic structure, others don’t) and based on different assumptions. In our opinion, none of them is fully satisfactory and at the same time useful for the rare type match and for general cases.
In this paper, we study what can be done if we reduce the data, taking into account only the equalities and inequalities among profiles rather than considering the specific Y-STR observed characteristics. We know part of the scientific community will not agree with our approach, preferring an approach such as the one of Andersen and Balding [28], but we believe that our method can also be useful. Indeed, we think it can be very useful for an analyst, in a particular case, to obtain results using several different methods relying on different assumptions: this is actually “sensitivity analysis”. It provides further information which can be used by the court—even if it only shows that the evidential value of the match is almost unquantifiable. Moreover, even though Y-STR data have been the main motivation for this study, this model is actually applicable to different kinds of data (in principle for all forensic data that show power law behavior). When applied to data without genetic structures (such as tire marks or glass fragments), these kinds of criticisms should fade away if, of course, one can also find empirical or theoretical support for power law behavior.
The Y-STR marker system will thus be employed here as an extreme but in practice common and important example in which the problem of assessing the evidential value of rare type match can arise. We believe that the analyst should perform several analyses using different models and different assumptions, and compare the performance of the different methods, in order to try to learn from the differences (or lack of differences) between the conclusions which would follow from each method individually.
A big issue in working with Y-STR data is the unavailability of reliable databases that are representative of an actual population. The YHRD database is in fact a collection of databases coming from police forces or laboratories. The individual databases are not actually random samples from well defined populations, since different institutions and organizations use different selection criteria. For instance, is a prison population representative of the population outside? Moreover, the sizes of the databases from different countries are not proportional to the sizes of the populations. We are well aware of this limitation.

3. The Model

3.1. Notation and Data

Throughout the paper, the following notation is chosen: random variables and their values are denoted, respectively, with uppercase and lowercase characters: x is a realization of X. Random vectors and their values are denoted, respectively, by uppercase and lowercase bold characters: p is a realization of the random vector P . Probability is denoted with Pr ( · ) , while the density of a continuous random variable X is denoted alternatively by p_X ( x ) or by p ( x ) when the subscript is clear from the context. For a discrete random variable Y, the density notation p_Y ( y ) and the discrete one Pr ( Y = y ) will be used interchangeably. Moreover, we will use shorthand notation like p ( y | x ) to stand for the probability density of Y with respect to the conditional distribution of Y given X = x .
Notice that, in Formula (1), D was regarded as the event corresponding to the observation of the available data. However, later in the paper, D will be regarded as a random variable generically representing the data. The particular data at hand will correspond to the value d. In that case, the following notation will thus be preferred:
LR = Pr(D = d | H = h_p) / Pr(D = d | H = h_d), or p(d | h_p) / p(d | h_d).
Lastly, notice that “DNA types” are used throughout the paper as a general term to indicate Y-STR profiles.
The data used in the present study were obtained from the Y Chromosome Haplotype Reference Database (YHRD) [36,37]. Here, only seven of the markers included in the PowerPlex Y23 system (PPY23, Promega Corporation, Madison, WI, USA) were investigated: DYS19, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393. The dependence between these seven “core markers” is studied in Caliebe et al. [24], which concludes that “each of these seven markers contribute indispensable information about each other markers from the same set”.

3.2. Model Assumptions

Our mathematical model is based on the two following mathematical assumptions:
Assumption 1.
There are infinitely many different DNA types in nature.
This assumption, already used by, e.g., Kimura [38] in the “infinite alleles model”, allows the use of Bayesian nonparametric methods and is very useful, for instance, in “species sampling problems”, where the total number of possible different species in nature cannot be specified. This assumption is sensible also in the case of Y-STR DNA profiles, since the state space of possible different haplotypes is so large that it can be considered infinite.
Assumption 2.
The names of the different DNA types do not contain usable information.
Actually, the specific sequence of numbers that forms a DNA profile carries information: if two profiles show few differences, this means that they are separated by few mutation drifts, hence the profiles share a relatively recent common ancestor. However, this information can be very difficult to use and it might be wiser not to try to use it in the LR assessment. This is the reason why we will treat DNA types as “colors”, and only consider their partition into different categories. Stated otherwise, we do not use any topological structure on the space of the DNA types.
Notice that this assumption makes the model a priori suitable for any characteristic which has many different possible types showing power law behavior, as we will see, thus the approach described in this paper might be considered, in principle, after replacing “DNA types” with any other category.

3.3. Prior

In Bayesian statistics, parameters of interest are modeled through random variables. The (prior) distribution over a parameter should represent the prior uncertainty about its value.
The assessment of the LR for the rare type match involves two unknown parameters of interest: one is h ∈ { h_p , h_d } , representing the unknown true hypothesis; the other is p , the vector of the unknown population frequencies of all DNA profiles in the population of potential perpetrators. The dichotomous random variable H is used to model parameter h, and the posterior distribution of this random variable, given the data, is the ultimate aim of the forensic inquiry. Similarly, a random variable P is used to model the uncertainty over p . Because of Assumption 1, p is an infinite-dimensional parameter, hence the need for Bayesian nonparametric methods [39,40]. In particular, because of Assumption 2, data can be reduced to partitions, as explained in Section 3.5, and it will turn out that the distribution of these partitions does not depend on the order of the p_i . Hence, we can define the parameter p as taking values in the ordered infinite-dimensional simplex { ( p_1 , p_2 , … ) : p_1 ≥ p_2 ≥ ⋯ , ∑_i p_i = 1 , p_i > 0 } . The uncertainty about its value will be expressed by the two-parameter Poisson Dirichlet prior [41,42,43,44].
The two-parameter Poisson-Dirichlet distribution can be defined through the following stick-breaking representation [40]:
Definition 1
(two-parameter GEM distribution). Given α and θ satisfying the following conditions,
0 ≤ α < 1 , and θ > −α . (2)
The vector W = ( W_1 , W_2 , … ) is said to be distributed according to the GEM( α , θ ) if
W_i = V_i ∏_{j=1}^{i−1} ( 1 − V_j ) , for every i ≥ 1,
where V 1 , V 2 ,…are independent random variables distributed according to
V_i ∼ Beta ( 1 − α , θ + iα ) .
It holds that W_i > 0 for every i, and ∑_i W_i = 1 .
The GEM distribution (short for Griffiths–Engen–McCloskey distribution) is well known in the literature as the “stick-breaking prior”, since it describes the random sizes of the pieces into which a stick is iteratively broken.
Definition 2
(Two-parameter Poisson Dirichlet distribution). Given α and θ satisfying condition (2), and a vector W = ( W_1 , W_2 , … ) ∼ GEM ( α , θ ) , the random vector P = ( P_1 , P_2 , … ) obtained by ranking W , so that P_i ≥ P_{i+1} for every i, is said to be Poisson Dirichlet distributed, PD ( α , θ ) . Parameter α is called the discount parameter, while θ is the concentration parameter.
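Definitions 1 and 2 translate directly into a sampling sketch: draw the Beta variables, break the stick, and rank the resulting weights. The truncation level m and the function name are our own choices; a true GEM draw is an infinite sequence, so this is only an illustration:

```python
import random

def gem_sample(alpha, theta, m, seed=0):
    """First m stick-breaking weights of a GEM(alpha, theta) draw.

    V_i ~ Beta(1 - alpha, theta + i*alpha),  W_i = V_i * prod_{j<i} (1 - V_j).
    (A finite truncation of the infinite sequence, for illustration.)
    """
    assert 0 <= alpha < 1 and theta > -alpha  # condition (2)
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for i in range(1, m + 1):
        v = rng.betavariate(1 - alpha, theta + i * alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

# Ranking the truncated weights in decreasing order gives an approximate
# draw from the two-parameter Poisson Dirichlet distribution PD(alpha, theta).
w = gem_sample(alpha=0.5, theta=3.0, m=1000)
p = sorted(w, reverse=True)
```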
For our model, we will not allow α = 0 ; hence, we will assume 0 < α < 1 , in order to have a prior that shows a power-law behavior like the one observed in the YHRD database (see Section 5.1). We think that it could be interesting to look for models from population genetics and evolution which seem similar in some respects and which predict power-law behavior.
The main reason that prompted us to choose the two-parameter Poisson Dirichlet distribution among the possible Bayesian nonparametric priors is model fit (see Section 5.1). However, this model also has a very convenient feature: it is the only one with the following sufficiency property [45]: the probability of observing a new species depends only on the number of already observed species and on the sample size, while the probability of observing an already seen species depends only on its frequency in the sample and on the sample size.
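Written as the standard prediction rule of the two-parameter model (the Chinese restaurant representation recalled in Section 3), the sufficiency property can be sketched as follows; the function name is ours:

```python
def predictive_probs(counts, alpha, theta):
    """Prediction rule of the two-parameter Poisson Dirichlet model.

    counts = [n_1, ..., n_k]: frequencies of the k types observed in a
    sample of size n = n_1 + ... + n_k. Then
      Pr(next observation is a new type) = (theta + k*alpha) / (theta + n)
      Pr(next observation is type j)     = (n_j - alpha)     / (theta + n)
    """
    n, k = sum(counts), len(counts)
    p_new = (theta + k * alpha) / (theta + n)
    p_seen = [(n_j - alpha) / (theta + n) for n_j in counts]
    return p_new, p_seen

p_new, p_seen = predictive_probs([3, 2, 1, 1, 1], alpha=0.5, theta=3.0)
```

Note that `p_new` depends on the counts only through k and n, and each entry of `p_seen` only through its own frequency and n, which is exactly the sufficiency property; the probabilities sum to one.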
Lastly, we point out that, in practice, we cannot assume that we know the parameters α and θ : we will resolve this by using a hyperprior.

3.4. Bayesian Network Representation of the Model

The typical data to evaluate in case of a match is D = ( E , B ) , where E = ( E s , E c ) , and
  • E s = suspect’s DNA type,
  • E c = crime stain’s DNA type (matching the suspect’s type),
  • B = a reference database of size n, which is treated here as a random sample of DNA types from the population of possible perpetrators.
The hypotheses of interest for the case are:
  • h p = The crime stain originated from the suspect,
  • h d = The crime stain originated from someone else.
In agreement with Assumption 2, the model will ignore information about the names of the DNA types: the data D = ( E , B ) will thus be reduced accordingly (the reduced data are defined precisely in Section 3.5). The Bayesian network of Figure 1 encapsulates the conditional dependencies of the random variables ( H , A , Θ , P , X_1 , … , X_{n+2} , D ) , whose joint distribution is defined below in terms of a collection of conditional distributions (one for each node).
H is a dichotomous random variable that represents the hypotheses of interest and can take values h ∈ { h_p , h_d } , according to the prosecution or the defense, respectively. A uniform prior on the hypotheses is chosen for mathematical convenience, since it will not affect the likelihood ratio (the variable H being in the conditioning part):
Pr ( H = h ) ∝ 1 for h ∈ { h_p , h_d } .
( A , Θ ) is the random vector that represents the hyperparameters α and θ , satisfying condition (2). The joint prior density of these two parameters will be generically denoted as p ( α , θ ) :
( A , Θ ) ∼ p ( α , θ ) .
For obvious reasons, this will be called the ‘hyperprior’ throughout the text.
The random vector P , with values in the ordered infinite-dimensional simplex, represents the ranked population frequencies of Y-STR profiles. P = ( p_1 , p_2 , … ) means that p_1 is the frequency of the most common DNA type in the population, p_2 is the frequency of the second most common DNA type, and so on. As a prior for P , we use the two-parameter Poisson Dirichlet distribution:
P | A = α , Θ = θ ∼ PD ( α , θ ) .
The database is assumed to be a random sample from the population. Integer-valued random variables X 1 , …, X n are here used to represent the (unknown) ranks in the population of the frequencies of the DNA types in the database. For instance, X 3 = 5 means that the third individual in the database has the fifth most common DNA type in the population. Given p , they are an i.i.d. sample from p :
X_1 , X_2 , … , X_n | P = p ∼ i.i.d. p .
X n + 1 represents the rank in the population ordering of the suspect’s DNA type. It is again an independent draw from p :
X_{n+1} | P = p ∼ p .
X_{n+2} represents the rank, in the population ordering, of the crime stain’s DNA type. According to the prosecution, given X_{n+1} = x_{n+1} , this random variable is deterministic (it is equal to x_{n+1} with probability 1). According to the defense, it is another sample from p , independent of the previous ones:
X_{n+2} | P = p , X_{n+1} = x_{n+1} , H = h ∼ { δ_{x_{n+1}} if h = h_p ; p if h = h_d } . (4)
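As a small illustration of the conditional distributions above, one can simulate the ranks under either hypothesis, using a finite normalized vector p as a stand-in for the infinite parameter (function name, seed, and truncation are our own choices, purely illustrative):

```python
import random

def sample_case(p, n, hypothesis, seed=1):
    """Sample the ranks X_1, ..., X_{n+2} given a finite, normalized
    vector p (a truncation of the infinite vector, for illustration).

    Under h_p the crime-stain rank X_{n+2} is a deterministic copy of
    the suspect's X_{n+1}; under h_d it is an independent draw from p.
    """
    rng = random.Random(seed)
    ranks = list(range(1, len(p) + 1))
    xs = rng.choices(ranks, weights=p, k=n + 1)  # database plus suspect
    if hypothesis == "h_p":
        xs.append(xs[-1])  # copy the suspect's rank with probability 1
    else:
        xs.append(rng.choices(ranks, weights=p, k=1)[0])
    return xs
```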
In order to observe X_1 , …, X_{n+2} , one would need, by definition, to know the rank, in terms of population proportions, of the frequency of each DNA type in the database. This is not known; hence, X_1 , … , X_{n+2} are not observed.
Section 3.5 recalls some notions about random partitions, useful before defining node D, the “reduced” data that we want to evaluate through the likelihood ratio.

3.5. Random Partitions and Database Partitions

A partition of a set S is an unordered collection of nonempty and disjoint subsets of S, the union of which forms S. Particularly interesting for our model are partitions of the set S = [ n ] = { 1 , … , n } , denoted as π [ n ] . The set of all partitions of [ n ] will be denoted as P [ n ] . Random partitions of [ n ] will be denoted as Π [ n ] , for n ∈ ℕ. In addition, a partition of n is a finite nonincreasing sequence of positive integers that sum up to n. Partitions of n will be denoted as π n , and random partitions as Π n .
Given a sequence of integer valued random variables X 1 , , X n , let Π [ n ] ( X 1 , X 2 , , X n ) be the random partition defined by the equivalence classes of their indices using the random equivalence relation i j if and only if X i = X j . This construction allows one to build a “reduction map” from the set of values of X 1 , , X n to the set of the partitions of [ n ] as in the following example ( n = 10 ):
ℕ^10 → P [ 10 ]
( X_1 , … , X_10 ) ↦ Π [ 10 ] ( X_1 , X_2 , … , X_10 )
( 2 , 4 , 2 , 4 , 3 , 3 , 10 , 13 , 5 , 4 ) ↦ { { 1 , 3 } , { 2 , 4 , 10 } , { 5 , 6 } , { 7 } , { 8 } , { 9 } }
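The reduction map of this example is straightforward to implement; a minimal sketch (the function name is ours):

```python
def partition_of_indices(xs):
    """Map a sequence to the partition of its (1-based) indices induced
    by the equivalence relation: i and j are equivalent iff the i-th and
    j-th entries are equal."""
    classes = {}
    for i, x in enumerate(xs, start=1):
        classes.setdefault(x, []).append(i)
    return [set(block) for block in classes.values()]

blocks = partition_of_indices([2, 4, 2, 4, 3, 3, 10, 13, 5, 4])
# recovers {{1,3}, {2,4,10}, {5,6}, {7}, {8}, {9}} as in the example
```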
Similarly, and in agreement with Assumption 2, in our model, we can consider the reduction of data which ignores information about the names of the DNA types: this is achieved, for instance, by retaining from the database only the equivalence classes of the indices of the individuals, according to the equivalence relation “has the same DNA type”. Stated otherwise, the database is reduced to the partition π [ n ] Db , obtained using these equivalence classes. However, the database only supplies part of the data. There are also two new DNA profiles that are equal to one another (and different from the already observed ones in the rare type match case). Considering the suspect’s profile, we obtain the partition π [ n + 1 ] Db + , where the first n integers are partitioned as in π [ n ] Db , and n + 1 constitutes a class by itself. Considering the crime stain profile too, we obtain the partition π [ n + 2 ] Db + + where the first n integers are partitioned as in π [ n ] Db , and n + 1 and n + 2 belong to the same (new) class. Random variables Π [ n ] Db , Π [ n + 1 ] Db + , and Π [ n + 2 ] Db + + are used to model π [ n ] Db , π [ n + 1 ] Db + , and π [ n + 2 ] Db + + , respectively.
Since prosecution and defense agree on the distribution of X 1 , , X n + 1 , but not on the distribution of X n + 2 | X 1 , , X n + 1 , they also agree on the distribution of Π [ n + 1 ] Db + but disagree on the distribution of Π [ n + 2 ] Db + + (see (4)).
The crucial points of the model are the following:
  • The random partitions defined through the random variables X 1 , …, X n + 2 and through the database are the same:
    Π [ n ] Db = Π [ n ] ( X 1 , , X n ) , Π [ n + 1 ] Db + = Π [ n + 1 ] ( X 1 , , X n + 1 ) , Π [ n + 2 ] Db + + = Π [ n + 2 ] ( X 1 , , X n + 2 ) .
  • Although X 1 , …, X n + 2 are not observable, the random partitions Π [ n ] Db , Π [ n + 1 ] Db + , and Π [ n + 2 ] Db + + are observable.
To clarify, consider the following example of a database with k = 6 different DNA types, from n = 10 individuals:
B = ( b 1 , b 2 , b 1 , b 2 , b 3 , b 3 , b 4 , b 5 , b 6 , b 2 ) ,
where b i is the name of the ith DNA type according to the order chosen for the database. This database can be reduced to the partition of [ 10 ] :
π [ 10 ] Db = { { 1 , 3 } , { 2 , 4 , 10 } , { 5 , 6 } , { 7 } , { 8 } , { 9 } } .
Then, the part of the reduced data whose distribution is agreed on by prosecution and defense is
π [ 11 ] Db + = { { 1 , 3 } , { 2 , 4 , 10 } , { 5 , 6 } , { 7 } , { 8 } , { 9 } , { 11 } } ,
while the entire reduced data D can be represented as
π [ 12 ] Db + + = { { 1 , 3 } , { 2 , 4 , 10 } , { 5 , 6 } , { 7 } , { 8 } , { 9 } , { 11 , 12 } } .
Now, suppose that we knew the rank in the population of each of the DNA types in the database: for instance, that b 1 is the second most frequent type, b 2 the fourth most frequent type, and so on. Stated otherwise, we are now assuming that we observe the variables X 1 , …, X n + 2 : for instance, X 1 = 2 , X 2 = 4 , X 3 = 2 , X 4 = 4 , X 5 = 3 , X 6 = 3 , X 7 = 10 , X 8 = 13 , X 9 = 5 , X 10 = 4 , X 11 = 9 , X 12 = 9 (as in (5)). It is easy to check that Π [ 10 ] ( X 1 , , X 10 ) = π [ 10 ] Db , Π [ 11 ] ( X 1 , , X 11 ) = π [ 11 ] Db + , and Π [ 12 ] ( X 1 , , X 12 ) = π [ 12 ] Db + + .
Coming back to our model, the data are defined as D = π [ n + 2 ] Db + + , obtained by partitioning the database enlarged with the two new observations (or, equivalently, by partitioning X 1 , , X n + 2 ). Node D of Figure 1 is defined accordingly.
Notice that, given X 1 , , X n + 2 , D is deterministic. An important result is that, according to Proposition 4 in Pitman [46], the distribution of D given α , θ , H can be derived directly. In particular, it holds that, if
$$P \sim \mathrm{PD}(\alpha, \theta),$$
and
$$X_1, X_2, \ldots \mid P = \mathbf{p} \ \overset{\text{i.i.d.}}{\sim} \ \mathbf{p},$$
then, for all n N , the random partition Π [ n ] = Π [ n ] ( X 1 , , X n ) has the following distribution:
$$P_n^{\alpha,\theta}(\pi_{[n]}) := \Pr\big(\Pi_{[n]} = \pi_{[n]} \,\big|\, \alpha, \theta\big) = \frac{[\theta+\alpha]_{k-1;\alpha}}{[\theta+1]_{n-1;1}} \prod_{i=1}^{k} [1-\alpha]_{n_i-1;1},$$
where $n_i$ is the size of the $i$th block of $\pi_{[n]}$ (the blocks are here ordered according to their least element) and, for $x, b \in \mathbb{R}$ and $a \in \mathbb{N}$, $[x]_{a;b} := \prod_{i=0}^{a-1}(x+ib)$ if $a \in \mathbb{N} \setminus \{0\}$, and $[x]_{a;b} := 1$ if $a = 0$. This formula, also known as the Pitman sampling formula, is further studied in Pitman [47]; it shows that $P_n^{\alpha,\theta}(\pi_{[n]})$ does not depend on the values of $X_1, \ldots, X_n$, but only on the sizes and the number of blocks of the partition. It follows that we can get rid of the intermediate layer of nodes $X_1$, …, $X_{n+2}$. Moreover, it holds that $\Pr(D \mid \alpha, \theta, h_p) = P_{n+1}^{\alpha,\theta}(\pi_{[n+1]}^{\text{Db+}})$, while $\Pr(D \mid \alpha, \theta, h_d) = P_{n+2}^{\alpha,\theta}(\pi_{[n+2]}^{\text{Db++}})$.
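To make the Pitman sampling formula concrete, here is a small Python sketch (our own illustration, not part of the paper) that evaluates it from the block sizes of a partition; for n = 2 the probabilities of the two possible partitions sum to one, as they should.

```python
def rising(x, a, b=1.0):
    """Generalized rising factorial [x]_{a;b} = prod_{i=0}^{a-1} (x + i*b),
    with the empty product (a = 0) equal to 1."""
    out = 1.0
    for i in range(a):
        out *= x + i * b
    return out

def pitman_prob(block_sizes, alpha, theta):
    """Pitman sampling formula (8): probability of a partition of [n]
    whose blocks have the given sizes, under PD(alpha, theta)."""
    n, k = sum(block_sizes), len(block_sizes)
    prob = rising(theta + alpha, k - 1, alpha) / rising(theta + 1, n - 1, 1.0)
    for n_i in block_sizes:
        prob *= rising(1 - alpha, n_i - 1, 1.0)
    return prob

# Sanity check for n = 2: {{1,2}} and {{1},{2}} exhaust all partitions.
print(pitman_prob([2], 0.5, 1.0) + pitman_prob([1, 1], 0.5, 1.0))  # 1.0
```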
The model of Figure 1 can thus be simplified to the one in Figure 2.

3.6. Chinese Restaurant Representation

There is an alternative characterization of this model, called the “Chinese restaurant process”, due to Aldous [48] for the one-parameter case and studied in detail for the two-parameter version in Pitman [44]. It is defined as follows: consider a restaurant with infinitely many tables, each one infinitely large. Let Y 1 , Y 2 , be integer-valued random variables that represent the seating plan: tables are indexed in order of appearance, and Y i = j means that the ith customer sits at the jth table to be created. The process is described by the following transition probabilities:
Y 1 = 1 ,
$$\Pr(Y_{n+1} = i \mid Y_1, \ldots, Y_n) = \begin{cases} \dfrac{\theta + k\alpha}{n + \theta} & \text{if } i = k + 1, \\[6pt] \dfrac{n_i - \alpha}{n + \theta} & \text{if } 1 \le i \le k, \end{cases}$$
where k is the number of tables occupied by the first n customers, and n i is the number of customers seated at table i at that time. The process depends on two parameters α and θ satisfying the same conditions (2). From (9), one can easily see the sufficientness property mentioned in Section 3.3.
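As an illustration (our own, not part of the original analysis), the transition rule (9) is straightforward to simulate; the Python sketch below draws a seating plan and returns the table sizes, which form a random partition distributed according to (8). The function name and signature are our own choices.

```python
import random

def crp_seating(n, alpha, theta, seed=0):
    """Simulate a seating plan Y_1, ..., Y_n of the two-parameter
    Chinese restaurant process, following transition rule (9):
    with m customers already seated at k tables, a new table is opened
    with probability (theta + k*alpha)/(m + theta), and table i receives
    the next customer with probability (n_i - alpha)/(m + theta)."""
    rng = random.Random(seed)
    counts = []   # counts[i] = number of customers at table i+1
    seats = []    # seats[t] = table chosen by customer t+1
    for _ in range(n):
        k = len(counts)
        # unnormalized weights; random.choices normalizes them
        weights = [c - alpha for c in counts] + [theta + k * alpha]
        j = rng.choices(range(k + 1), weights=weights)[0]
        if j == k:
            counts.append(1)   # the customer opens a new table
        else:
            counts[j] += 1
        seats.append(j + 1)
    return seats, counts
```

The first customer always opens the first table (assuming θ > 0), matching Y 1 = 1.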
Y 1 , , Y n are neither i.i.d. nor exchangeable, but it holds that Π [ n ] ( Y 1 , , Y n ) is distributed as Π [ n ] ( X 1 , , X n ) , with X 1 , , X n defined as in (3); both are distributed according to the Pitman sampling formula (8) [44].
Stated otherwise, we can obtain the same partition π [ n ] Db through the seating plan of n customers or partitioning the X 1 , , X n of the database. Similarly, π [ n + 1 ] Db + is obtained when a new customer has chosen an unoccupied table (remember we are in the rare type match case), and π [ n + 2 ] Db + + is obtained when the ( n + 2 ) nd customer goes to the table already chosen by the ( n + 1 ) st customer (suspect and crime stain have the same DNA type). In particular, thanks to (9), we can write:
$$p\big(\pi_{[n+2]}^{\text{Db++}} \,\big|\, h_p, \pi_{[n+1]}^{\text{Db+}}, \alpha, \theta\big) = 1,$$
$$p\big(\pi_{[n+2]}^{\text{Db++}} \,\big|\, h_d, \pi_{[n+1]}^{\text{Db+}}, \alpha, \theta\big) = \frac{1-\alpha}{n+1+\theta},$$
since the ( n + 2 ) nd customer goes to the same table as the ( n + 1 ) st (who was sitting alone).

3.7. A Useful Lemma

The following lemma applies to four general random variables Z, X, Y, and H whose conditional dependencies are described by the Bayesian network of Figure 3. This result is important because it can be used to assess the likelihood ratio in a very common forensic situation: the prosecution and the defense disagree on the distribution of the entirety of the data (Y) but agree on the distribution of a part of it (X), and these distributions depend on parameters (Z).
Lemma 1.
Given four random variables Z, H, X, and Y, whose conditional dependencies are represented by the Bayesian network of Figure 3, the likelihood function for h, given X = x and Y = y satisfies
$$\mathrm{lik}(h \mid x, y) \propto \mathbb{E}\big(p(y \mid x, Z, h) \,\big|\, X = x\big).$$
Proof. The Bayesian network representation of the model in Figure 3 allows for factoring the joint probability density of Z, H, X, and Y as
$$p(z, h, x, y) = p(z)\, p(x \mid z)\, p(h)\, p(y \mid x, z, h).$$
By Bayes’ formula, $p(z)\,p(x \mid z) = p(x)\,p(z \mid x)$. This rewriting corresponds to reversing the direction of the arrow between Z and X:
[Diagram: the Bayesian network of Figure 3 with the arrow between Z and X reversed, so that X becomes a root node.]
The random variable X is now a root node. This means that, when we probabilistically condition on X = x , the graphical model changes in a simple way: we can delete the node X, and just insert the value x as a parameter in the conditional probability tables of the variables Z and Y which formerly had an arrow from node X. The next graph represents this model:
[Diagram: the same network after conditioning on X = x: node X is deleted, and x enters the conditional probability tables of Z and Y.]
This tells us that, conditional on X = x , the joint density of Z, Y, and H is equal to
$$p(z, h, y \mid x) = p(z \mid x)\, p(h)\, p(y \mid x, z, h).$$
The joint density of H and Y given X is obtained by integrating out the variable Z. It can be expressed as a conditional expectation value since p ( z x ) is the density of Z given X = x . We find:
$$p(h, y \mid x) = p(h)\, \mathbb{E}\big(p(y \mid x, Z, h) \,\big|\, X = x\big).$$
Recall that this is the joint density of two of our variables, H and Y, after conditioning on the value X = x. Let us now also condition on Y = y. It follows that the density of H given X = x and Y = y is proportional (as a function of h, for fixed x and y) to the same expression, $p(h)\, \mathbb{E}(p(y \mid x, Z, h) \mid X = x)$.
This is a product of the prior for h with some function of x and y. Since posterior odds equals prior odds times likelihood ratio, it follows that the likelihood function for h, given X = x and Y = y satisfies
$$\mathrm{lik}(h \mid x, y) \propto \mathbb{E}\big(p(y \mid x, Z, h) \,\big|\, X = x\big).$$
Corollary 1.
Given four random variables Z, H, X, and Y, whose conditional dependencies are represented by the network of Figure 3, the likelihood ratio for H = h 1 against H = h 2 given X = x and Y = y satisfies
$$\mathrm{LR} = \frac{\mathbb{E}\big(p(y \mid x, Z, h_1) \,\big|\, X = x\big)}{\mathbb{E}\big(p(y \mid x, Z, h_2) \,\big|\, X = x\big)}.$$

4. The Likelihood Ratio

From now on, we will omit the superscripts Db, Db+, and Db++ for ease of notation.
Using the hypotheses and the reduction of data D defined in Section 3, the likelihood ratio will be defined as
$$\mathrm{LR} = \frac{p(\pi_{[n+2]} \mid h_p)}{p(\pi_{[n+2]} \mid h_d)} = \frac{p(\pi_{[n+1]}, \pi_{[n+2]} \mid h_p)}{p(\pi_{[n+1]}, \pi_{[n+2]} \mid h_d)}.$$
The last equality holds due to the fact that Π [ n + 1 ] is a deterministic function of Π [ n + 2 ] .
Corollary 1 can be applied to our model since the defense and the prosecution agree on the distribution of π [ n + 1 ] but not on that of π [ n + 2 ] , and the data depend on the parameters α and θ . Thus, letting ( A , Θ ) play the role of Z, X = Π [ n + 1 ] , and Y = Π [ n + 2 ] , and using (10) and (11), we obtain:
$$\mathrm{LR} = \frac{\mathbb{E}\big(p(\pi_{[n+2]} \mid \pi_{[n+1]}, A, \Theta, h_p) \,\big|\, \Pi_{[n+1]} = \pi_{[n+1]}\big)}{\mathbb{E}\big(p(\pi_{[n+2]} \mid \pi_{[n+1]}, A, \Theta, h_d) \,\big|\, \Pi_{[n+1]} = \pi_{[n+1]}\big)} = \frac{1}{\mathbb{E}\left(\dfrac{1-A}{n+1+\Theta} \,\middle|\, \Pi_{[n+1]} = \pi_{[n+1]}\right)}.$$
The expected value is taken with respect to the posterior distribution of A , Θ Π [ n + 1 ] = π [ n + 1 ] . The solution we propose in this paper is to deal with the uncertainty about α and θ by using maximum likelihood estimators and plugging them into the formula. Notice that this is equivalent to a hybrid approach, in which the parameters are estimated in a frequentist way and their values are plugged into the Bayesian LR. In the future, we plan to use MCMC methods to compute the posterior distribution as exactly as possible, given assumed priors on the hyperparameters.
By defining the random variable $\Phi = \dfrac{1-A}{n+1+\Theta}$, we can write the LR as
$$\mathrm{LR} = \frac{1}{\mathbb{E}(\Phi \mid \Pi_{[n+1]} = \pi_{[n+1]})}.$$

5. Analysis on the YHRD Database

In this section, we present the study we performed on a database of 18,925 Y-STR 23-locus profiles from 129 different locations in 51 countries in Europe [37]. Our analyses consider only 7 Y-STR loci (DYS19, DYS389 I, DYS389 II, DYS390, DYS391, DYS392, DYS393), but similar results have been observed with 10 loci.

5.1. Model Fitting

In Figure 4, the ranked frequencies of the 18,925 Y-STR profiles of the YHRD database are compared to the relative frequencies of samples of size n obtained from several realizations of PD( α MLE , θ MLE ), where α MLE and θ MLE are the maximum likelihood estimates obtained using the entire database and the likelihood defined by (8). Their values are α MLE = 0.51 and θ MLE = 216 .
To do so, we run the Chinese restaurant seating plan several times (up to n = 18,925 customers): each run is used to approximate a new realization p from PD( α MLE , θ MLE ). As explained in Section 3.6, the partition of the customers into tables is the same as the partition obtained from an i.i.d. sample of size n from p . We can see that, for the most common haplotypes (left part of the plot), there is some discrepancy. However, we are interested in rare haplotypes, which typically have a frequency belonging to the right part of the plot. In that region, the two-parameter Poisson Dirichlet follows the distribution of the data quite well. The dotted line in Figure 4 shows the asymptotic behavior of the two-parameter Poisson Dirichlet distribution. Indeed, if P ∼ PD ( α , θ ) , then
$$\frac{P_i}{Z\, i^{-1/\alpha}} \to 1 \quad \text{a.s. as } i \to +\infty,$$
for a random variable Z such that $Z^{-\alpha} = \Gamma(1-\alpha)/S_\alpha$. This power-law behavior describes an incredible variety of phenomena [23].
The thick line in Figure 4 also seems to show power-law behavior and, admittedly, we were hoping to obtain the same asymptotic slope as the prior. This is not what we observe but, as Figure 5 shows, for such a large value of θ we would need a much bigger database (at least n = 10 6 ) to see the correct slope.
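The maximum likelihood fit reported above can be reproduced in a few lines; the sketch below (our own Python illustration; we use a deliberately crude grid search where a proper numerical optimizer would be used in practice) evaluates the log of the Pitman sampling formula (8) and maximizes it over a grid of candidate values.

```python
import math

def log_rising(x, a, b=1.0):
    """Logarithm of the generalized rising factorial [x]_{a;b}."""
    return sum(math.log(x + i * b) for i in range(a))

def log_lik(block_sizes, alpha, theta):
    """Log of the Pitman sampling formula (8) for a partition whose
    blocks have the given sizes."""
    n, k = sum(block_sizes), len(block_sizes)
    return (log_rising(theta + alpha, k - 1, alpha)
            - log_rising(theta + 1, n - 1)
            + sum(log_rising(1 - alpha, n_i - 1) for n_i in block_sizes))

def mle_grid(block_sizes, alphas, thetas):
    """Crude grid-search approximation of the MLE of (alpha, theta)."""
    return max(((a, t) for a in alphas for t in thetas),
               key=lambda at: log_lik(block_sizes, at[0], at[1]))
```

For instance, `mle_grid(counts, [i/100 for i in range(1, 100)], range(1, 500))` would scan a coarse grid for the haplotype counts of a database.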

5.2. Log-Likelihood

It is also interesting to investigate the shape of the log-likelihood function for α and θ given π [ n + 1 ] . It is defined as
$$\ell_{n+1}(\alpha, \theta) := \log p(\pi_{[n+1]} \mid \alpha, \theta).$$
In Figure 6, the log-likelihood reparametrized using $\phi = \frac{1-\alpha}{n+1+\theta}$ instead of α is displayed. A Gaussian density centered at the MLE parameters, with covariance matrix equal to the inverse of the Fisher information, is also displayed (dashed lines). This is done not to show an asymptotic property, but to show the symmetry of the log-likelihood, which justifies approximating $\mathbb{E}(\Phi \mid \Pi_{[n+1]} = \pi_{[n+1]})$ by the marginal mode $\phi_{\mathrm{MLE}}$, at least when we choose a hyperprior $p(\phi, \theta)$ that is flat around $(\phi_{\mathrm{MLE}}, \theta_{\mathrm{MLE}})$: indeed, it holds that $p(\phi, \theta \mid \pi_{[n+1]}) \propto e^{\ell_{n+1}(\phi,\theta)} \times p(\phi, \theta)$.
Hence, one could safely make this approximation if one believed that this symmetry would also be true in the real data situation at hand:
$$\mathrm{LR} \approx \frac{n + 1 + \theta_{\mathrm{MLE}}}{1 - \alpha_{\mathrm{MLE}}}.$$
Notice that this is equivalent to a hybrid approach, commonly called “Empirical Bayes”, in which the parameters are estimated through the MLE (frequentist) and their values are plugged into the Bayesian LR. We would like to reiterate that we are not using maximum likelihood estimates of the parameters because we consider the likelihood ratio from a frequentist point of view. Our aim is to calculate a Bayesian likelihood ratio, and we have observed empirically that, using the maximum likelihood estimates of the parameters, we can approximate this value.
Hence, in a rare type match case, using the YHRD database as the reference database, we obtain log 10 LR = 4.59, which corresponds to saying that the reduced data are approximately 40,000 times more likely under the prosecution hypothesis than under the defense hypothesis.
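As a quick arithmetic check (our own, in Python) that the plug-in formula (13) reproduces the reported value with the YHRD estimates from Section 5.1:

```python
import math

# Empirical Bayes plug-in LR of (13): LR ≈ (n + 1 + theta) / (1 - alpha)
n, alpha_mle, theta_mle = 18925, 0.51, 216
LR = (n + 1 + theta_mle) / (1 - alpha_mle)
print(round(math.log10(LR), 2))  # 4.59
```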

5.3. True LR

It is also interesting to study the likelihood ratio values obtained with our method according to Formula (4), and to compare them with the ‘true’ ones, meaning the LR values obtained when the vector p is known, over simulated rare type match cases. This corresponds to the desirable (though completely imaginary) situation of knowing the ranked list of the frequencies of all the DNA types in the population of interest. The model can be represented by the Bayesian network of Figure 7.
The likelihood ratio in this case can be obtained using again Corollary 1, where now X 1 , …, X n + 1 play the role of Z. Indeed, now that p is known, the unobservable part of the model consists of the ranks of the types in the database.
$$\mathrm{LR}_{\mathbf{p}} = \frac{p(\pi_{[n+2]}, \pi_{[n+1]} \mid h_p, \mathbf{p})}{p(\pi_{[n+2]}, \pi_{[n+1]} \mid h_d, \mathbf{p})} = \frac{\mathbb{E}\big(p(\pi_{[n+2]} \mid \pi_{[n+1]}, X_1, \ldots, X_{n+1}, h_p, \mathbf{p}) \,\big|\, \Pi_{[n+1]} = \pi_{[n+1]}, \mathbf{p}\big)}{\mathbb{E}\big(p(\pi_{[n+2]} \mid \pi_{[n+1]}, X_1, \ldots, X_{n+1}, h_d, \mathbf{p}) \,\big|\, \Pi_{[n+1]} = \pi_{[n+1]}, \mathbf{p}\big)} = \frac{1}{\mathbb{E}\big(p_{X_{n+1}} \,\big|\, \Pi_{[n+1]} = \pi_{[n+1]}, \mathbf{p}\big)}.$$
Notice that, in the rare type case, X n + 1 is observed only once among X 1 , …, X n + 1 . Hence, we call it a singleton, and its distribution given p , π [ n + 1 ] is the same as that of every other singleton. Let s 1 denote the number of singletons, and S the set of indices of the singleton observations in the augmented database. It holds that
$$s_1\, \mathbb{E}\big(p_{X_{n+1}} \,\big|\, \pi_{[n+1]}, \mathbf{p}\big) = \mathbb{E}\Big(\sum_{i \in S} p_{X_i} \,\Big|\, \pi_{[n+1]}, \mathbf{p}\Big).$$
Notice also that knowing p and π [ n + 1 ] is not enough to recover X 1 , , X n + 1 .
Let us denote as X 1 * , …, X K * the K different values taken by X 1 , …, X n + 1 , ordered decreasingly according to the frequency of their values. Stated otherwise, if n i is the frequency of x i * among x 1 , , x n + 1 , then n 1 n 2 n K . Moreover, in case X i * and X j * have the same frequency ( n i = n j ), then they are ordered increasingly according to their values. For instance, if X 1 = 2 , X 2 = 4 , X 3 = 2 , X 4 = 4 , X 5 = 3 , X 6 = 3 , X 7 = 10 , X 8 = 13 , X 9 = 5 , X 10 = 4 , X 11 = 9 , then X 1 * = 4 , X 2 * = 2 , X 3 * = 3 , X 4 * = 5 , X 5 * = 9 , X 6 * = 10 , X 7 * = 13 .
By definition, it holds that
$$\mathbb{E}\Big(\sum_{i \in S} p_{X_i} \,\Big|\, \pi_{[n+1]}, \mathbf{p}\Big) = \mathbb{E}\Big(\sum_{j : n_j = 1} p_{X_j^*} \,\Big|\, \pi_{[n+1]}, \mathbf{p}\Big).$$
Notice that ( n 1 , n 2 , , n K ) is a partition of n + 1 , which will be denoted as π n + 1 . In the example, π n + 1 = ( 3 , 2 , 2 , 1 , 1 , 1 , 1 ) . Since the distribution of j : n j = 1 p X j * only depends on π n + 1 , the latter can replace π [ n + 1 ] . Thus, it holds that
$$\mathrm{LR}_{\mathbf{p}} = \frac{s_1}{\mathbb{E}\big(\sum_{j : n_j = 1} p_{X_j^*} \,\big|\, \pi_{n+1}, \mathbf{p}\big)}.$$
A more compact representation for π n + 1 can be obtained by using two vectors a and r where a j are the distinct numbers occurring in the partition, increasingly ordered, and each r j is the number of repetitions of a j . J is the length of these two vectors, and it holds that n + 1 = j = 1 J a j r j . In the example above, we have that π n + 1 can be represented by ( a , r ) with a = ( 1 , 2 , 3 ) and r = ( 4 , 2 , 1 ) , J = 3 .
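The compact (a, r) representation is a one-liner to compute; the following Python sketch (our own naming) reproduces the example above.

```python
from collections import Counter

def compact(partition):
    """Represent an integer partition (a multiset of part sizes) by the
    pair (a, r): distinct part sizes in increasing order and their
    multiplicities."""
    mult = Counter(partition)
    a = sorted(mult)
    r = [mult[x] for x in a]
    return a, r

# The example partition of n + 1 = 11 from the text:
print(compact([3, 2, 2, 1, 1, 1, 1]))  # ([1, 2, 3], [4, 2, 1])
```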
There is an unknown map χ , treated here as a latent variable, which assigns to each rank of the DNA types, ordered according to their frequency in nature, one of the numbers { 1 , 2 , , J } , corresponding to the position in a of its frequency in the sample, or 0 if the type is not observed. Stated otherwise,
$$\chi: \{1, 2, \ldots\} \to \{0, 1, 2, \ldots, J\}$$
$$\chi(i) = \begin{cases} 0 & \text{if the $i$th most common species in nature is not observed in the sample,} \\ j & \text{if the $i$th most common species in nature is one of the $r_j$ observed $a_j$ times in the sample.} \end{cases}$$
Given π n + 1 = ( a , r ) , χ must satisfy the following set of J conditions:
$$\sum_{i=1}^{\infty} \mathbb{1}\{\chi(i) = j\} = r_j, \qquad \forall\, j \in \{1, \ldots, J\}.$$
In addition, it should not be possible that a profile observed k N times in the population is observed k n > k N times in the sample. Hence, we have to add a further condition:
$$N p_i \geq a_{\chi(i)}, \qquad \forall\, i,$$
where N is the size of the entire population.
The map χ can be represented by a vector χ = ( χ 1 , χ 2 , ) such that χ i = χ ( i ) . In the example, we have that
χ = ( 0 , 2 , 2 , 3 , 1 , 0 , 0 , 0 , 1 , 1 , 0 , 0 , 1 , 0 , 0 , ) .
Notice that, given π n + 1 = ( a , r ) , the knowledge of χ implies the knowledge of X 1 * , …, X K * : indeed, it is enough to consider the positions of the ranked positive values of χ , and to solve ties by considering the positions themselves (if χ i = χ j , then the order is given by i and j). For instance, in the example, if we sort the positive values of χ and we collect their positions, we get ( 4 , 2 , 3 , 5 , 9 , 10 , 13 ) : the reader can notice that we got back to X 1 * , , X 7 * .
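The recovery of X 1 * , …, X K * from χ just described amounts to a sort with a tie-breaking rule; a small Python sketch (our own helper) applied to the example:

```python
def ranks_from_chi(chi):
    """Recover X*_1, ..., X*_K from the latent vector chi: take the
    (1-based) positions i with chi_i > 0, sorted by decreasing chi_i,
    breaking ties by increasing position."""
    positive = [(val, pos) for pos, val in enumerate(chi, start=1) if val > 0]
    return [pos for val, pos in sorted(positive, key=lambda vp: (-vp[0], vp[1]))]

chi = [0, 2, 2, 3, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
print(ranks_from_chi(chi))  # [4, 2, 3, 5, 9, 10, 13]
```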
This means that, to obtain the distribution of X 1 * , , X K * | π n + 1 , p , which appears in (14), it is enough to obtain the distribution of χ | a , r , p , and since we are only interested in the mean of the sum of singletons in samples of size n + 1 from the distribution of X 1 * , , X K * | a , r , p , we can just simulate samples from the distribution of χ | a , r , p and sum the p i such that χ i = 1 .
It holds that
$$p(a, r \mid \chi, \mathbf{p}) \propto \prod_{1 \le i \le m} p_i^{\,a_{\chi_i}},$$
where the proportionality factor is $\dfrac{(n+1)!}{\prod_{1 \le j \le J} (a_j!)^{r_j}}$.

Details of the Metropolis–Hastings Algorithm

Notice that, for the model, we assumed p to be infinitely long, but for simulations we use a finite p ¯ of length m. This is equivalent to assuming that only m elements of the infinite p are positive, and that the remaining infinite tail consists of zeros.
To simulate samples from the distribution of χ | a , r , p , we use a Metropolis–Hastings algorithm on the space of the vectors χ satisfying the J + m conditions (15) and (16). The state space of the Metropolis–Hastings Markov chain thus consists of all vectors of length m whose elements belong to { 0 , 1 , , J } and satisfy conditions (15) and (16). If we start with an initial point χ 0 which satisfies (15) and, at each move t of the Metropolis–Hastings algorithm, we swap two different values χ i and χ j inside the vector, condition (15) remains satisfied, while conditions (16) must be checked at every iteration. The Metropolis factor is the ratio of the two likelihoods p ( a , r χ t + 1 , p ) and p ( a , r χ t , p ) , where χ t and χ t + 1 differ only in that χ i and χ j are exchanged. Hence, using (17), the Metropolis factor for every move is
$$R = \frac{p_i^{\,a_{\chi_j}}\; p_j^{\,a_{\chi_i}}}{p_i^{\,a_{\chi_i}}\; p_j^{\,a_{\chi_j}}}.$$
Every exchange move is then accepted with probability min ( 1 , R ) . The algorithm is iterated N = 10 5 times, with a thinning step of 10 3 and a burn-in period of 20,000 iterations. Since it holds that
$$\mathbb{E}\Big(\sum_{j : n_j = 1} p_{X_j^*} \,\Big|\, \pi_{n+1}^{\text{Db+}}, \mathbf{p}\Big) = \mathbb{E}\Big(\sum_{i : \chi_i = 1} p_i \,\Big|\, a, r, \mathbf{p}\Big),$$
for every accepted χ , we calculate the sum of all p i such that χ i = 1 , and we use the average to approximate the denominator of (14). The algorithm is based on a similar one proposed in Anevski et al. [15].
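A minimal Python sketch of this swap-move Metropolis–Hastings scheme (our own simplification: the count conditions (15) are preserved automatically by swaps given a valid starting point, while the population-count conditions (16) are omitted here for brevity and would need to be checked in a full implementation):

```python
import random

def mh_singleton_mass(p, chi0, a, n_iter=10_000, seed=1):
    """Estimate E(sum of p_i with chi_i = 1 | a, r, p), the denominator
    of (14), by a swap-move Metropolis-Hastings chain on chi.
    `a` carries a dummy 0 in position 0, so that a[chi_i] is the number
    of times type i is observed in the sample (0 if unobserved).
    chi0 must satisfy the count conditions (15); swaps preserve them."""
    rng = random.Random(seed)
    chi = list(chi0)
    total = 0.0
    for _ in range(n_iter):
        i, j = rng.randrange(len(p)), rng.randrange(len(p))
        if chi[i] != chi[j]:
            # Metropolis factor: likelihood of the swapped state over
            # the current one, per the likelihood (17)
            R = (p[i] ** a[chi[j]] * p[j] ** a[chi[i]]) / \
                (p[i] ** a[chi[i]] * p[j] ** a[chi[j]])
            if rng.random() < min(1.0, R):
                chi[i], chi[j] = chi[j], chi[i]
        total += sum(pi for pi, c in zip(p, chi) if c == 1)
    return total / n_iter
```

For instance, with p = (0.4, 0.3, 0.2, 0.1), two singletons and one doubleton (a = (0, 1, 2) with the dummy leading zero), a valid starting point is chi0 = (2, 1, 1, 0).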
This method allows us to approximate the ‘true’ LR when the vector p is known. This is almost never the case, but we can put ourselves in a fictitious world where we know p (such as the frequencies in the YHRD database or, as in the following section, the frequencies from a smaller population) and compare the true values LR p with those obtained by applying our Bayesian nonparametric model when p is unknown.

5.4. Frequentist-Bayesian Analysis of the Error

A real Bayesian statistician chooses the prior and hyperprior according to her beliefs. Depending on the choice of the hyperprior over α and θ , she may or may not believe in the approximation (13), but she does not really talk of “error”. However, hardline Bayesian statisticians are a rare species, and most of the time the Bayesian procedure consists of choosing priors (and hyperpriors) which are a compromise between personal beliefs and mathematical convenience. It is thus interesting to investigate the performance of such priors. This can be done by comparing the Bayesian likelihood ratio with the likelihood ratio that one would obtain if the vector p were known, for the same reduction of data. This is what we call “error”: in other words, at the moment, we are considering the Bayesian nonparametric method proposed in this paper as a way to estimate (notice the frequentist terminology) the true LR p . If we denote by p x the population proportion of the matching profile, another interesting comparison is between the Bayesian likelihood ratio and the frequentist likelihood ratio 1 / p x (here denoted LR f ) that one would obtain knowing p but not reducing the data to a partition. This is a sort of benchmark comparison and tells us how much we lose by using the Bayesian nonparametric methodology and by reducing the data.
In total, there are three quantities of interest ( log 10 LR , log 10 LR p , and log 10 LR f ), and two differences of interest, which will be denoted as
  • Diff 1 = log 10 LR log 10 LR p (loss due to choice of the Poisson Dirichlet model and approximation (13)),
  • Diff 2 = log 10 LR log 10 LR f (overall loss).
In order to analyze these five quantities, we can study their distribution over different rare type match cases. However, there is an obstacle. The Metropolis–Hastings algorithm described in Section 5.3 is too slow to be used with the entire European database of Purps et al. [37] of size n = 18,925.
In order to make the computational effort feasible, we consider the haplotype frequencies for the sole Dutch population (of size n = 2037 ), and we pretend that they are the frequencies from the entire population of possible perpetrators. This population is summarized by the following a , and r :
a = ( 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 14 , 15 , 16 , 17 , 19 , 20 , 23 , 24 , 29 , 35 , 41 , 46 , 94 , 152 , 168 , 174 )
r = ( 356 , 80 , 31 , 20 , 13 , 11 , 5 , 6 , 3 , 5 , 4 , 3 , 2 , 3 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 2 , 1 , 1 , 1 ) ,
and the maximum likelihood estimators for α and θ are α MLE = 0.62 , θ MLE = 22 .
In this way, we can use the Metropolis–Hastings algorithm to simulate LR p . The model fitting is still good enough, as shown in Figure 8 (as a side note, notice that the asymptotic behavior is reached faster for this smaller value of θ MLE = 22 ).
However, it is important to stress that the Gaussian shape, and consequently the approximation (13), is not empirically supported for small databases of size n = 100 .
In Table 1 and Figure 9a, we compare the distributions of log 10 LR p , log 10 LR , and log 10 LR f obtained from 96 samples of size 100 from the Dutch population. Each sample represents a different rare type match case with a specific reference database of size n = 100 .
The distribution of the benchmark likelihood ratio log 10 LR f shows more variation than that of the Bayesian likelihood ratio, while log 10 LR p appears to be the most concentrated around its mean.
In Table 2 and Figure 9b, we consider the distributions of the two differences, Diff 1 and Diff 2 . Diff 1 is the smallest and the most concentrated: it ranges between −0.146 and 0.381 and has a small standard deviation. This means that the nonparametric Bayesian likelihood ratio obtained as in (13) can be regarded as a good approximation of the true likelihood ratio for the same reduction of data ( log 10 LR p ), even though we have not empirically validated the approximation for small databases of size 100. This difference is due to three things: the approximation (13), the MLE estimation of the hyperparameters, and the choice of a prior distribution (the two-parameter Poisson Dirichlet) which is quite realistic, as shown in Figure 8, but does not fit the actual population perfectly.
Notice that the difference increases if the Bayesian nonparametric likelihood ratio is compared to the benchmark likelihood ratio (Diff 2 ). Here, the reduction of data comes into play too. However, the difference stays within one order of magnitude and most of the time lies between −0.676 and 0.115; thus, it is small.

6. Conclusions

This paper discusses the first application of a Bayesian nonparametric method to likelihood ratio assessment in forensic science, in particular to the challenging situation of the rare type match. Compared to traditional Bayesian methods such as those described in Cereda [6], it presents many advantages. First of all, the prior chosen for the parameter p seems quite realistic for the population whose frequencies we want to model. Moreover, though the theoretical background on which it rests may seem very technical and difficult, the method is extremely simple in practice, thanks to the use of an empirical Bayes approximation. More could be done in the future, in particular regarding approximation (13): the posterior expectation in its denominator could, for instance, be computed using MCMC or ABC algorithms. We could also try to improve the efficiency of the Metropolis–Hastings algorithm defined in Section 5.3 so that it can be used with bigger and better populations. The big remaining problem is how to use these methods when relevant populations are poorly defined and accessible databases are of doubtful relevance; we do not solve that problem here.
It is not clear whether other methods are better; this question is still very open and controversial. We suggest that the analyst perform several very different analyses and think carefully about what the differences between their conclusions tell her. With this aim, we plan to compare this Bayesian nonparametric method to other existing methods for the rare type match problem, investigating calibration and validation through the use of ECE plots [49].

Author Contributions

The original version of this paper was written while G.C. was a PhD student at Leiden University under the supervision of R.D.G. The lead author, G.C., contributed the original conception of this paper, performed most of the literature research and computation, and drafted the original manuscript. Intensive discussions and much rewriting and revision by the two authors together, during which both of them contributed further new ideas, finally led to this version. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Swiss National Science Foundation through Grant No. P2LAP2_178195.

Acknowledgments

We are indebted to Jim Pitman and Alexander Gnedin for their help in understanding their important theoretical results, to Mikkel Meyer Andersen for providing a cleaned version of the database of Purps et al. [37] and to Stephanie Van der Pas, Pierpaolo De Blasi, Giacomo Aletti, and Fabio Corradi for their useful opinions and comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Robertson, B.; Vignaux, G.A. Interpreting Evidence: Evaluating Forensic Science in the Courtroom; John Wiley & Sons: Chichester, UK, 1995.
  2. Evett, I.; Weir, B. Interpreting DNA Evidence: Statistical Genetics for Forensic Scientists; Sinauer Associates: Sunderland, UK, 1998.
  3. Aitken, C.; Taroni, F. Statistics and the Evaluation of Evidence for Forensic Scientists; John Wiley & Sons: Chichester, UK, 2004.
  4. Balding, D. Weight-of-Evidence for Forensic DNA Profiles; John Wiley & Sons: Chichester, UK, 2005.
  5. Taroni, F.; Aitken, C.; Garbolino, P.; Biedermann, A. Bayesian Networks and Probabilistic Inference in Forensic Science; John Wiley & Sons: Chichester, UK, 2006.
  6. Cereda, G. Bayesian approach to LR in case of rare type match. Stat. Neerl. 2017, 71, 141–164.
  7. Cereda, G. Impact of model choice on LR assessment in case of rare haplotype match (frequentist approach). Scand. J. Stat. 2017, 44, 230–248.
  8. Brenner, C.H. Fundamental problem of forensic mathematics—The evidential value of a rare haplotype. Forensic Sci. Int. Genet. 2010, 4, 281–291.
  9. Cereda, G.; Biedermann, A.; Hall, D.; Taroni, F. An investigation of the potential of DIP-STR markers for DNA mixture analyses. Forensic Sci. Int. Genet. 2014, 11, 229–240.
  10. Laplace, P. Essai Philosophique sur les Probabilités; Mme. Ve Courcier: Paris, France, 1814.
  11. Krichevsky, R.; Trofimov, V. The performance of universal coding. IEEE Trans. Inf. Theory 1981, 27, 199–207.
  12. Gale, W.A.; Church, K.W. What’s wrong with adding one? In Corpus-Based Research into Language; Rodopi: Amsterdam, The Netherlands, 1994.
  13. Good, I. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264.
  14. Orlitsky, A.; Santhanam, N.P.; Viswanathan, K.; Zhang, J. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI ’04), Banff, AB, Canada, 7–11 July 2004; pp. 426–435.
  15. Anevski, D.; Gill, R.D.; Zohren, S. Estimating a probability mass function with unknown labels. Ann. Stat. 2017, 45, 2708–2735.
  16. Tiwari, R.C.; Tripathi, R.C. Nonparametric Bayes estimation of the probability of discovering a new species. Commun. Stat. Theory Methods 1989, A18, 877–895.
  17. Lijoi, A.; Mena, R.H.; Pruenster, I. Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 2007, 94, 769–786.
  18. De Blasi, P.; Favaro, S.; Lijoi, A.; Mena, R.H.; Pruenster, I.; Ruggiero, M. Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 212–229.
  19. Favaro, S.; Lijoi, A.; Mena, R.H.; Pruenster, I. Bayesian nonparametric inference for species variety with a two-parameter Poisson–Dirichlet process prior. J. R. Stat. Soc. Ser. B (Methodol.) 2009, 71, 993–1008.
  20. Arbel, J.; Favaro, S.; Nipoti, B.; Teh, Y.W. Bayesian nonparametric inference for discovery probabilities: Credible intervals and large sample asymptotics. Stat. Sin. 2017, 27, 839–859.
  21. Favaro, S.; Nipoti, B.; Teh, Y.W. Rediscovery of Good–Turing estimators via Bayesian nonparametrics. Biometrics 2016, 72, 136–145.
  22. Teh, Y.W.; Jordan, M.I.; Beal, M.J.; Blei, D.M. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 2006, 101, 1566–1581.
  23. Newman, M. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 2005, 46, 323–351.
  24. Caliebe, A.; Jochens, A.; Willuweit, S.; Roewer, L.; Krawczak, M. No shortcut solutions to the problem of Y-STR match probability calculation. Forensic Sci. Int. Genet. 2015, 15, 69–75.
  25. Andersen, M.M.; Curran, J.M.; de Zoete, J.; Taylor, D.; Buckleton, J. Modelling the dependence structure of Y-STR haplotypes using graphical models. Forensic Sci. Int. Genet. 2018, 37, 29–36.
  26. Andersen, M.M.; Caliebe, A.; Kirkeby, K.; Knudsen, M.; Vihrsa, N.; Curran, J.M. Estimation of Y haplotype frequencies with lower order dependencies. Forensic Sci. Int. Genet. 2019, 46, 102214.
  27. Balding, D.J.; Nichols, R.A. DNA profile match probability calculation: How to allow for population stratification, relatedness, database selection and single bands. Forensic Sci. Int. 1994, 64, 125–140.
  28. Andersen, M.M.; Balding, D.J. How convincing is a matching Y-chromosome profile? PLoS Genet. 2017, 13, e1007028.
  29. Andersen, M.M.; Balding, D.J. Y-profile evidence: Close paternal relatives and mixtures. Forensic Sci. Int. Genet. 2019, 38, 48–53.
  30. Egeland, T.; Salas, A. Estimating haplotype frequency and coverage of databases. PLoS ONE 2008, 3, e3988.
  31. Roewer, L. Y chromosome STR typing in crime casework. Forensic Sci. Med. Pathol. 2009, 5, 77–84.
  32. Buckleton, J.; Krawczak, M.; Weir, B. The interpretation of lineage markers in forensic DNA testing. Forensic Sci. Int. Genet. 2011, 5, 78–83.
  33. Willuweit, S.; Caliebe, A.; Andersen, M.M.; Roewer, L. Y-STR frequency surveying method: A critical reappraisal. Forensic Sci. Int. Genet. 2011, 5, 84–90.
  34. Wilson, I.J.; Weale, M.E.; Balding, D.J. Inferences from DNA data: Population histories, evolutionary processes and forensic match probabilities. J. R. Stat. Soc. Ser. (Stat. Soc.) 2003, 166, 155–188. [Google Scholar] [CrossRef]
  35. Andersen, M.M.; Eriksen, P.S.; Morling, N. The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies. J. Theor. Biol. 2013, 329, 39–51. [Google Scholar] [CrossRef] [Green Version]
  36. Willuweit, S.; Roewer, L. Y chromosome haplotype reference database (YHRD): Update. Forensic Sci. Int. Genet. 2007, 1, 83–87. [Google Scholar] [CrossRef]
  37. Purps, J.; Siegert, S.; Willuweit, S.; Nagy, M.; Alves, C.; Salazar, R.; Angustia, S.M.T.; Santos, L.H.; Anslinger, K.; Bayer, B.; et al. A global analysis of Y-chromosomal haplotype diversity for 23 STR loci. Forensic Sci. Int. Genet. 2014, 12, 12–23. [Google Scholar] [CrossRef]
38. Kimura, M. The number of alleles that can be maintained in a finite population. Genetics 1964, 49, 725–738. [Google Scholar]
  39. Hjort, N.; Holmes, C.; Müller, P.; Walker, S. Bayesian Nonparametrics; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  40. Ghosal, S.; Van der Vaart, A. Fundamentals of Nonparametric Bayesian Inference; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
  41. Pitman, J.; Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 1997, 25, 855–900. [Google Scholar] [CrossRef]
  42. Feng, S. The Poisson-Dirichlet Distribution and Related Topics: Models and Asymptotic Behaviors; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  43. Buntine, W.; Hutter, M. A Bayesian view of the Poisson-Dirichlet process. arXiv 2012, arXiv:1007.0296. [Google Scholar]
  44. Pitman, J. Combinatorial Stochastic Processes; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  45. Zabell, S.L. The Continuum of Inductive Methods Revisited; Cambridge Studies in Probability, Induction and Decision Theory; Cambridge University Press: Cambridge, UK, 2005; pp. 243–274. [Google Scholar]
  46. Pitman, J. The Two-Parameter Generalization of Ewens’ Random Partition Structure; Technical Report 345; Department of Statistics U.C.: Berkeley, CA, USA, 1992.
  47. Pitman, J. Exchangeable and partially exchangeable random partitions. Probab. Theory Relat. Fields 1995, 102, 145–158. [Google Scholar] [CrossRef]
  48. Aldous, D.J. Exchangeability and Related Topics; Springer: New York, NY, USA, 1985. [Google Scholar]
  49. Ramos, D.; Gonzales-Rodriguez, J.; Zadora, G.; Aitken, C. Information-Theoretical Assessment of the Performance of Likelihood Ratio Computation Methods. J. Forensic Sci. 2013, 58, 1503–1517. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Bayesian network showing the conditional dependencies of the relevant random variables in our model.
Figure 2. Simplified version of the Bayesian network in Figure 1.
Figure 3. Conditional dependencies of the random variables of Lemma 1.
Figure 4. Log-scale ranked frequencies from the database (thick line) compared to the relative frequencies of samples of size n = 18,925 obtained from several realizations of PD(α_MLE, θ_MLE) (thin lines). Asymptotic power-law behavior is also displayed (dotted line).
Figure 5. Log-scale ranked frequencies from the two-parameter Poisson-Dirichlet distribution with α = 0.51, θ = 216, approximated through Chinese restaurant seating plans, each with its own number of customers, corresponding to the different thicknesses of the lines.
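The seating plans behind Figure 5 can be simulated with the standard sequential seating rule of the two-parameter Chinese restaurant process. Below is a minimal sketch under that assumption; the function name and the choice of sample size are ours, not the authors':

```python
import random

def sample_crp_counts(n, alpha, theta, seed=0):
    """Simulate a Chinese restaurant seating plan for CRP(alpha, theta)
    with n customers (requires 0 <= alpha < 1 and theta > -alpha).
    The ranked table sizes divided by n approximate one realization of
    the two-parameter Poisson-Dirichlet distribution PD(alpha, theta)."""
    rng = random.Random(seed)
    counts = []  # current number of customers at each table
    for i in range(n):  # i customers are already seated
        k = len(counts)
        # customer i+1 opens a new table with probability (theta + k*alpha)/(i + theta)
        if i == 0 or rng.random() * (i + theta) < theta + k * alpha:
            counts.append(1)
        else:
            # otherwise joins existing table j with probability (counts[j] - alpha)/(i + theta)
            r = rng.random() * (i - k * alpha)
            for j, c in enumerate(counts):
                r -= c - alpha
                if r < 0:
                    counts[j] += 1
                    break
            else:  # guard against floating-point rounding
                counts[-1] += 1
    return sorted(counts, reverse=True)

# Ranked relative frequencies, as plotted on a log-log scale in Figure 5
freqs = [c / 10000 for c in sample_crp_counts(10000, 0.51, 216.0)]
```

For PD(α, θ) the ranked frequencies decay asymptotically like a power law with exponent 1/α, which is the dotted reference line shown in Figures 4 and 5.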
Figure 6. Relative log-likelihood for ϕ = n_1 α / (n + 1 + θ) and θ, compared to a Gaussian distribution, displayed with 95% and 99% confidence intervals.
Figure 7. Bayesian network for the case in which p is known.
Figure 8. Log-scale ranked frequencies from the Dutch database (thick line) compared to the relative frequencies of samples of size n = 2037 obtained from several realizations of PD(α_MLE, θ_MLE) (thin lines). Asymptotic power-law behavior is also displayed (dotted line).
Figure 9. (a) Comparison between the distributions of log10(LR), log10(LR|p), and log10(LR_f); (b) the errors log10(LR) − log10(LR|p) and log10(LR) − log10(LR_f).
Table 1. Summaries of the distributions of log10(LR), log10(LR|p), and log10(LR_f).
              Min     1st Qu.  Median  Mean    3rd Qu.  Max     sd
log10(LR)     2.417   2.512    2.572   2.590   2.658    2.879   0.102
log10(LR|p)   2.392   2.507    2.542   2.529   2.560    2.602   0.045
log10(LR_f)   1.646   2.464    2.832   2.803   3.309    3.309   0.463
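The rows of Tables 1 and 2 are six-number summaries plus the standard deviation. As a sketch, such summaries can be recomputed from a sample of simulated log10(LR) values as follows (the quartile convention of the authors' software may differ slightly from the one used here):

```python
import statistics

def summarize(xs):
    """Min, quartiles, median, mean, max, and sd of a sample,
    in the column order used in Tables 1 and 2."""
    q1, med, q3 = statistics.quantiles(xs, n=4)  # exclusive method by default
    return {
        "Min": min(xs), "1st Qu.": q1, "Median": med,
        "Mean": statistics.fmean(xs), "3rd Qu.": q3,
        "Max": max(xs), "sd": statistics.stdev(xs),
    }

# Example with a toy sample in place of simulated log10(LR) values
row = summarize([2.4, 2.5, 2.55, 2.6, 2.7])
```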
Table 2. Summaries of the distributions of Diff_1, Diff_2, and Diff_3.
         Min      1st Qu.   Median   Mean     3rd Qu.  Max     sd
Diff_1   −0.146   −0.036    0.044    0.060    0.144    0.381   0.126
Diff_2   −0.891   −0.676    −0.221   −0.213   0.115    0.940   0.472
Cereda, G.; Gill, R.D. A Nonparametric Bayesian Approach to the Rare Type Match Problem. Entropy 2020, 22, 439. https://doi.org/10.3390/e22040439