Spatial Fingerprinting: Horizontal Fusion of Multi-Dimensional Bio-Tracers as Solution to Global Food Provenance Problems

Building the capacity to efficiently determine the provenance of food products represents a crucial step towards the sustainability of the global food system. Despite species-specific empirical examples of multi-tracer approaches to provenance, the precise benefit and efficacy of multi-tracers remain poorly understood. Here we show why, and when, data fusion of bio-tracers is an extremely powerful technique for geographical provenance discrimination. Specifically, we show using extensive simulations how, and under what conditions, geographical relationships between bio-tracers (e.g., spatial covariance) can act like a spatial fingerprint, in many naturally occurring applications likely allowing rapid identification with limited data. To highlight the theory, we outline several statistical methodologies, including artificial intelligence, and apply these methodologies as a proof of concept to a limited data set of 90 individuals of highly mobile Sockeye salmon that originate from 3 different areas. Using 17 measured bio-tracers, we demonstrate that increasing the number of combined bio-tracers results in stronger discriminatory power. We argue such applications likely even work for such highly mobile and critical fisheries as tuna.


Introduction
Today's global food system is a collection of highly inter-connected trade networks that span a myriad of organizations and geographies [1][2][3][4]. While such a system represents an important and perhaps necessary mechanism for meeting demands for nutritious and affordable food [3,5], it also represents a complex web of activity that carries with it a number of inherent challenges such as sustainability and transparency [2,6]. Most food items travel thousands of kilometers [7], changing form and ownership several times before reaching a consumer's plate [4]. Without proper labelling, such as Country of Origin Labelling (COOL) regulation, consumers cannot accurately identify where their food originated and thus cannot make informed decisions about the products they are buying [8]. Unfortunately, tracing food commodities back to their respective origin is a formidable task, which can only be tackled by a robust traceability system integrated along the entire food supply chain [9].
The introduction of new technology (e.g., Wireless Sensor Networks, Radio Frequency IDentification, blockchain) and ad hoc recommendations (ISO 9000, Codex Alimentarius, etc.) represent indispensable tools to monitor and secure different food chain stages [9][10][11]. However, one common limitation is stopping us from realizing robust provenance-based value chains: the ability to verify traceability information [12]. Consequently, a vigorous research effort has been geared towards the development of methods to authenticate the type and origin of food commodities, such as sensory analysis and chromatographic techniques [12][13][14]. One promising avenue is the use of bio-tracers, i.e., biological features (e.g., DNA, trace elements, metabolomic compounds, stable isotopes, etc.) that vary with (and thus reflect) the environment an individual is living in, to create fingerprints that can recognize different food products. For instance, DNA barcodes have been used for over a decade to uncover fraudulent labelling practices in the seafood industry [15][16][17][18][19]. Similarly, stable isotopes have shown a lot of promise for authenticating the origins of various food products including olive oils, cheese, honey, meat and fish [20][21][22].
In food product authentication, classes of bio-tracers are often employed independently of each other, and "vertical" bio-tracer strategies (i.e., using different markers within a given class of bio-tracers) still prevail to adjust the granularity of information being sought. For example, small DNA sequence fragments of the mitochondrial cytochrome c oxidase I gene (COX-1) are enough to identify a fish fillet to species [19], but the genome coverage required is larger when trying to discriminate sub-populations where genetic variability is much smaller [17]. Similarly, increasing the number of stable isotopes used has repeatedly been shown to be a powerful approach to determine the provenance (i.e., the origin) of numerous food products, even at fine spatial scales [21][23][24][25][26]. Interestingly, despite the evidence of important gains in discriminatory power brought by vertical data fusion, the general reasons behind such success are rarely discussed. Furthermore, it remains unclear whether this gain extends beyond one class of bio-tracers, i.e., the potential of "horizontal" strategies for food authentication remains to be established [12,27].
In what follows we discuss the efficiency of combining information from different bio-tracers (vertically and horizontally) for food authentication with a specific focus on provenance. Our goal here is threefold. First, we explain how to use multiple bio-tracers to create spatial fingerprints. Second, we show that increasing the number of bio-tracers for authentication increases authentication performance. Third, we provide a relatively simple explanation for why data fusion is always a winning strategy and comment on the potential of horizontal strategies. We support our arguments (which are mainly mathematical, see Appendix A) by comparing how well a set of bio-tracers performs when trying to assign Sockeye Salmon (Oncorhynchus nerka) to three geographically distinct fisheries: British Columbia, Canada; Kamchatka Peninsula, Russia; and Alaska, United States (Figure 1). The set of bio-tracers included three isotopes (δ¹⁵N, δ¹³C and δ³⁴S) and 14 fatty acids for a total of 17 bio-tracers spanning two different classes. The Sockeye fishery itself presents an interesting model because sustainability practices vary somewhat geographically. The entire Alaskan fishery is certified by the Marine Stewardship Council (MSC), the most rigorous and widely recognized eco-certification available. The Canadian fishery was also recognized by MSC as sustainable, until 2019 when the Canadian Pacific Sustainable Fisheries Society (CPSFS) decided to self-suspend its MSC certification for many salmon species, including Sockeye [28]. While some fisheries in Russia received certification from the MSC, many remote fisheries in Eastern Russia are under threat due to extractive industries, loss of habitat and large-scale poaching [29]. Much of this is thought to be driven by linkages to organized crime in east Asian markets [30,31].
Therefore, building the capacity to distinguish high-level geographic origins of Sockeye is of particular relevance to the sustainability of the Sockeye fishery and to food provenance interests in general.

Figure 1.
Combining bio-tracers to improve the determination of samples' provenance. (a), Sockeye salmon (Oncorhynchus nerka) samples of this study originate from three potential origins, namely Alaska, United States (yellow); British Columbia, Canada (cyan) and Kamchatka Peninsula, Russia (magenta). (b), We examine the efficiency of horizontal strategies that combine several classes of bio-tracers as opposed to vertical strategies that focus on one specific class. (c), While using a single bio-tracer to discriminate the true origin of a sample (distributions on top and right of the chart; dotted lines depict the bio-tracer values of a sample) may prove difficult, combining bio-tracers (colored areas) enhances the performance of the inference process. (d), This is also shown with confusion matrices obtained using a classifier that uses only the first bio-tracer (top), only the second one (middle) or the combination of the two (bottom).

Data
Muscle tissue trimmings from 90 Sockeye salmon individuals from three different regions (30 individuals per region: British Columbia, Canada; Kamchatka Peninsula, Russia; and Alaska, United States) were donated by Albion Farms & Fisheries Ltd. (now Intercity Packers Ltd.), Richmond, BC, Canada. All samples were derived from fillet trimmings to simulate a likely Quality Assurance/Quality Control scenario. Each muscle trimming was processed to obtain 2 muscle tissue samples for analyzing 17 bio-tracers of two classes: 3 stable isotopes (δ¹⁵N, δ¹³C and δ³⁴S) and 14 fatty acids (C16:0, C16:1, C18:0, C18:1, C18:2n-6, C18:3n-3, C18:4n-3, C20:1, C20:4n-3, C20:5n-3, C22:1, C22:5n-3, C22:6n-3 and C24:1). One muscle sample from each fish was delivered frozen to the Lipid Analytical Services at the University of Guelph for fatty acid analysis using a combination of the Bligh and Dyer and the Morrison and Smith methods [32,33]. Individual FA weights (µg/g) were converted to a % FA composition and fatty acids with >1% presence were retained as bio-tracers. The second muscle samples were dried at 70 °C for 2 days and ground into a fine powder in preparation for stable isotope analysis. Tissue samples were sent to the University of Windsor GLIER Chemical Tracers Lab (Windsor, ON, Canada) for isotopic analysis of δ¹⁵N, δ¹³C and δ³⁴S. Importantly, all variables were centered and scaled before any statistical inference.

Statistical Models
We created spatial fingerprints of increasing complexity by combining up to 17 bio-tracers for our three regions of interest (see Figure 1) and then evaluating their performance (on a different set of samples) in correctly determining the origin of a sample (see the following section). Among the large diversity of supervised-learning methods available, we chose three to reflect current and emerging practices in food authentication: a Naive Bayesian Classifier (NBC), Linear Discriminant Analysis (LDA) and a Multi-Layer Perceptron (MLP). For all three methods, we assessed the probability of correctly assigning a sample to its true origin (referred to as *performance*) for every region (this corresponds to the diagonal of the confusion matrix) as well as the probability of assigning a sample to its true origin, irrespective of its true provenance (overall performance). Assuming that we have no prior expectation for the origin of a given sample, the overall performance corresponds to the mean of the diagonal of the confusion matrix.
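The two performance measures defined above can be illustrated with a short Python sketch. It is not the authors' code: the function names and the toy predictions are hypothetical, and the confusion matrix is assumed to be normalized by row (true region), so that its diagonal holds the per-region probabilities of correct assignment.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: entry [i][j] is the share of samples
    whose true class is classes[i] that were assigned to classes[j]."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def per_region_performance(matrix):
    """Diagonal of the confusion matrix: performance for each region."""
    return [matrix[i][i] for i in range(len(matrix))]


def overall_performance(matrix):
    """Mean of the diagonal, i.e., no prior expectation about the origin."""
    diagonal = per_region_performance(matrix)
    return sum(diagonal) / len(diagonal)


# Toy example (made-up predictions, 4 test samples per region).
regions = ["Alaska", "British Columbia", "Kamchatka"]
truth = ["Alaska"] * 4 + ["British Columbia"] * 4 + ["Kamchatka"] * 4
predicted = [
    "Alaska", "Alaska", "British Columbia", "Kamchatka",
    "British Columbia", "British Columbia", "British Columbia", "Alaska",
    "Kamchatka", "Kamchatka", "Kamchatka", "Kamchatka",
]
matrix = confusion_matrix(truth, predicted, regions)
by_region = per_region_performance(matrix)
overall = overall_performance(matrix)
```

In this toy case the per-region performances are 0.5, 0.75 and 1.0, and the overall performance is their mean, 0.75.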

Simulations Design
For every simulation, we randomly selected 20 samples per region (60 samples total) as the training set and used the remaining samples (10 per region) to evaluate the performance of combinations of bio-tracers (thus, the samples used to evaluate performance are different from the ones used by the algorithm to create its own knowledge of the data). All simulations were replicated for all three selected classification approaches. We also evaluated the impact of the respective sizes of the two data sets and show that gains in performance beyond 20 samples in the training set were marginal for all three methods (see Figure A8).
We evaluated performance for an increasing number of bio-tracers (from single bio-tracers up to the combination of all 17 bio-tracers available). For every number p of bio-tracers, we used 500 combinations of p bio-tracers. When there were fewer than 500 existing combinations, we used all of them. For every combination, we randomly chose 200 pairs of training and test sets, leading to up to 100,000 simulations for a given number of bio-tracers. We also assessed the overall performance of the three approaches on the data set transformed by a Principal Component Analysis (PCA). PCA is a statistical tool commonly used to reduce dimension [38]; here PCA was used to transform our data set and obtain uncorrelated variables ordered according to the percentage of variance of the entire data set they capture. To evaluate the robustness to noise, we added an increasing amount of white noise to the training set, i.e., for every simulation, we drew 60 values from a centered normal distribution of increasing standard deviation (from 0.0001 to 10). For every simulation, we used 500 combinations of bio-tracers and 200 pairs of training and test sets (randomly chosen).
Finally, for all bio-tracers and all combinations of 2 and 3 bio-tracers, we computed the inter-region variance as well as the distance between region centroids (coordinates of region centroids are the means of coordinates of all samples in a given region). We also computed the region data overlap. To do so, for the three regions studied, we computed the convex hull for all pairs and triplets of bio-tracers. Note that, in order to discard potential outliers, we only used 27 data points per region (90%); the points included were the closest to their respective region centroid. We then computed the area (or volume) of all intersections between the three convex hulls, summed them and then divided the quantity thereby obtained by the total area (or volume) of the three convex hulls. Last, for all of these sets of bio-tracers, we evaluated the performance of the bio-tracers using 1000 pairs of training and test sets (randomly chosen).
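The sampling scheme described above (20 training and 10 test samples per region) can be sketched as follows. This is an illustrative Python sketch rather than the original code; `stratified_split` is a hypothetical helper, and the labels simply encode 30 individuals per region.

```python
import random


def stratified_split(labels, n_train_per_region, rng):
    """Return (train_indices, test_indices), drawing the same number of
    training samples at random from every region; the rest are test samples."""
    by_region = {}
    for i, label in enumerate(labels):
        by_region.setdefault(label, []).append(i)
    train, test = [], []
    for label in sorted(by_region):
        indices = by_region[label][:]
        rng.shuffle(indices)
        train.extend(indices[:n_train_per_region])
        test.extend(indices[n_train_per_region:])
    return train, test


# 30 individuals per region, as in the data set described in the text.
labels = ["Alaska"] * 30 + ["British Columbia"] * 30 + ["Kamchatka"] * 30
rng = random.Random(1)  # fixed seed so that the sketch is reproducible
train_idx, test_idx = stratified_split(labels, 20, rng)
```

Calling the function repeatedly with the same generator yields the successive random pairs of training and test sets used across replicates.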
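The rule "500 combinations of p bio-tracers, or all of them when fewer exist" can be sketched in Python as below. The helper name `sample_combinations` is hypothetical, and drawing the 500 combinations uniformly at random without duplicates is an assumption; the text does not state how they were selected.

```python
import itertools
import math
import random


def sample_combinations(n_tracers, p, max_combos, rng):
    """Return all combinations of p tracers out of n_tracers if there are at
    most max_combos of them; otherwise draw max_combos distinct ones at random."""
    if math.comb(n_tracers, p) <= max_combos:
        return list(itertools.combinations(range(n_tracers), p))
    chosen = set()
    while len(chosen) < max_combos:
        chosen.add(tuple(sorted(rng.sample(range(n_tracers), p))))
    return sorted(chosen)


rng = random.Random(42)
combos_p2 = sample_combinations(17, 2, 500, rng)  # only 136 pairs exist
combos_p8 = sample_combinations(17, 8, 500, rng)  # 24,310 exist, 500 drawn

# Upper bound on simulations for one value of p: combinations x split pairs.
max_simulations = 500 * 200
```

With 17 bio-tracers the cap only matters for intermediate values of p, which is why the text speaks of "up to" 100,000 simulations.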
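Two of the geometric summaries above can be sketched in a few lines of Python. The sketch makes two assumptions that the text does not spell out: the "inter-region variance" is taken as the between-region sum of squares divided by the total sum of squares, and the centroid distance is averaged over all pairs of regions. Function names and the toy coordinates are hypothetical; the convex hull overlap is not covered here.

```python
import itertools
import math


def centroid(points):
    """Mean coordinates of a list of points (tuples of equal length)."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def mean_centroid_distance(groups):
    """Mean Euclidean distance between all pairs of region centroids."""
    centroids = [centroid(points) for points in groups.values()]
    distances = [math.dist(a, b) for a, b in itertools.combinations(centroids, 2)]
    return sum(distances) / len(distances)


def between_region_variance_share(groups):
    """Share of the total sum of squares explained by region centroids
    (assumed definition of the inter-region variance)."""
    all_points = [p for points in groups.values() for p in points]
    grand = centroid(all_points)
    total = sum(math.dist(p, grand) ** 2 for p in all_points)
    between = sum(
        len(points) * math.dist(centroid(points), grand) ** 2
        for points in groups.values()
    )
    return between / total


# Toy data: two bio-tracers, two samples per region (made-up values).
toy = {
    "Alaska": [(0.0, 0.0), (2.0, 0.0)],
    "British Columbia": [(10.0, 0.0), (12.0, 0.0)],
    "Kamchatka": [(5.0, 8.0), (7.0, 8.0)],
}
distance = mean_centroid_distance(toy)
share = between_region_variance_share(toy)
```

In this toy example the regions are well separated, so almost all of the variance lies between regions.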

Mathematical Proof
In Appendix A, using Bayes's rule, we demonstrated that increasing the number of bio-tracers combined almost surely increases the discriminatory power (performance) of a Naive Bayesian Classifier (NBC).

Numerical Implementation
For LDA, we used the R implementation lda() available in the package "MASS" [39]. We implemented our own naive Bayesian classifier using R version 3.6.3 [40] and used the function density() for kernel density estimates.
Finally, we used the Julia library Flux.jl [41] for the multi-layer perceptron (two dense layers and cross-entropy loss function). As this approach is data demanding, we used a simple data augmentation procedure: data in the training set were repeated and noise of various levels (random values drawn from a centered normal distribution of standard deviation σ) was added to them. After evaluating performance under various augmentation scenarios (see Figure A6), we opted for 1000 repetitions of the data set and a noise level of σ = 0.01.
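The augmentation procedure can be sketched as follows. The original implementation is in Julia and is not reproduced here; this Python sketch only illustrates the principle, with a hypothetical `augment` helper and made-up feature values. Whether the unperturbed originals were kept alongside the noisy copies is not stated in the text; the sketch returns noisy copies only.

```python
import random


def augment(training_rows, repetitions, sigma, rng):
    """Repeat every (features, label) row `repetitions` times, adding
    independent centered Gaussian noise of standard deviation sigma
    to each (already scaled) feature value."""
    augmented = []
    for _ in range(repetitions):
        for features, label in training_rows:
            noisy = [x + rng.gauss(0.0, sigma) for x in features]
            augmented.append((noisy, label))
    return augmented


rng = random.Random(0)
# Two made-up training rows with three scaled bio-tracer values each.
training = [([0.1, -1.2, 0.5], "Alaska"), ([1.0, 0.3, -0.7], "Kamchatka")]
augmented = augment(training, repetitions=1000, sigma=0.01, rng=rng)
```

Because the variables are centered and scaled, σ = 0.01 corresponds to a perturbation of about one hundredth of a standard deviation of each bio-tracer.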

The More Bio-Tracers the Better
For the three regions considered, increasing the number of bio-tracers always increased the probability of correctly assigning a sample to its true origin (Figure 2). The three statistical approaches considered show qualitatively similar behavior, with MLP having the best performance (Figure 2c). All three methods consistently exceed 90% of correct assignment when 12 or more bio-tracers are combined for Canadian and Russian samples. The three approaches also perform significantly less well for Alaskan samples, which are geographically closer to the two other regions (Figure 1). Interestingly, the same order applies in the data space: the distance between Russia and Canada (based on the Euclidean distance between group centroids) is the longest (4.819 vs. 4.145 for Canada-USA and 2.536 for Russia-USA).
The overall performance (i.e., the probability of correctly determining the provenance of a sample irrespective of its true origin), based on a single sample, increases from 1 to 17 bio-tracers from 0.444 to 0.898 for LDA, from 0.465 to 0.817 for NBC and from 0.482 to 0.915 for MLP (Figure 2d-f). Moreover, performance is strongly improved when testing multiple individuals (Figure 2d-f). It is worth noting that even in such cases, employing more bio-tracers still provides more accurate predictions (Figure 2d-f). Note that these results align fully with our analytical derivations (see Appendix A and Figure A3). Furthermore, increasing the number of bio-tracers is very robust to noise addition, and this holds true for all three methods (Figure 3). For instance, the overall performance of LDA with 5 bio-tracers and a very low level of noise added (10⁻⁴) is 0.710, but combining 15 bio-tracers with the addition of noise at a level as high as 1 (a fairly strong noise addition) still yields better discriminatory power (0.758). Therefore, even if the measurements are known to be less accurate for some bio-tracers, they are likely worth combining with others, assuming that the error is consistent among samples.

Figure 2.
(a-c), The probability of assigning one sample to its true origin increases as the number of bio-tracers employed increases for the three regions considered, namely Alaska (yellow), British Columbia (cyan) and Kamchatka Peninsula (magenta). (d-f), The overall performance (i.e., the probability of correctly assigning any sample to its true origin) can also be improved by combining samples, assuming the samples combined originate from the same region (e.g., individuals of the same lot). Points are colored according to the number of samples combined. These results are qualitatively similar for the three statistical approaches considered, which are the Naive Bayesian Classifier (NBC; (a,d)), Linear Discriminant Analysis (LDA; (b,e)) and a Multi-Layer Perceptron (MLP; (c,f)). In all panels, points represent performances averaged over up to 100,000 replicates (see Methods for further details).

Figure 3.
Combining bio-tracers is robust to noise addition. The probability of correctly determining the provenance of samples is evaluated for an increasing noise addition to the training data set. The lighter the gray, the greater the number of bio-tracers combined. Note that prior to analysis, all bio-tracer values were scaled, thus a noise level of 1 represents a strong noise addition. The three panels correspond to the three statistical approaches used: NBC (a), LDA (b), MLP (c).
Using the first axes provided by a Principal Component Analysis (PCA) applied to the data set (see Methods) is a strategy that performs relatively well: across all three approaches, using up to the first 6 principal component axes is consistently better than the median of all the bio-tracer combinations we tested (Figure 4). Furthermore, as expected, the results obtained are similar when most or all bio-tracers are being used, except for NBC, for which the PCA slightly reduces the overall performance. Most importantly, for all methods, the axis order provided by a PCA (the first axis being the one that captures the most variance) does not necessarily reflect the discriminatory power of the axes. For instance, all three statistical methods show that the 5th principal component axis provides a larger gain in performance than the 4th one (Figure 4). In general, combining only a few of the first principal component axes to authenticate food products, as frequently done [42], may be a sub-optimal approach as it can discard axes that carry less variance but more discriminatory power.

An Examination of the Performances
Individually, the 17 bio-tracers have contrasting authentication performances (Figure 5), and this holds true for both classes of bio-tracers: δ¹⁵N and oleic acid (C18:1) alone perform well (0.547 and 0.548, respectively) whereas δ¹³C and linoleic acid (C18:2n-6) perform poorly (0.343 and 0.333). It is worth noting that the top 3 bio-tracers, based on individual performances, include 2 fatty acids (oleic acid and docosapentaenoic acid, i.e., C22:5n-3) and one stable isotope (δ¹⁵N; see Figure 5) and thus cover the two classes of bio-tracers. Note that even though we only show this for LDA (Figure 5), this holds true for NBC (see Figure A8 in Appendix B) and MLP (see Figure A11).
Interestingly, the overall performance of a pair of bio-tracers systematically exceeds that of the better performing of the two bio-tracers included in the pair (see Figure 6a for the results for LDA, Figure A9a for NBC and Figure A12a for MLP). Similarly, the performance of a triplet of bio-tracers is better than that of the best performing pair of bio-tracers that can be drawn from the triplet (see Figures 6b, A9b and A12b). Furthermore, the overall performance of a set of bio-tracers positively correlates with the performances of its subsets. Therefore, using the best performing bio-tracers frequently yields a stronger discriminatory power (Figures 6c,d, A9c,d and A12c,d). This explains why the top 3 bio-tracers, and thus the two classes of bio-tracers, are systematically included in the best pairs and triplets (see Tables A1 and A2). As expected, the percentage of inter-region variance captured by a bio-tracer is strongly and positively correlated with its overall performance (Figures 5a and 7a,b). Even in 2 and 3 dimensions, simple non-linear least-squares regressions efficiently capture the variance of these relationships (R² = 72.0% and 48.1% for LDA, respectively; see Figures 7a,b, and A10a,b and A13a,b for NBC and MLP, respectively). In one dimension, the mean Euclidean distance between region centroids efficiently summarizes one key geometrical property of the data space: the further apart the data points of different regions are, the stronger the discriminatory power (Figure 5b). This result could be seen as a simple case of a more general one: the less overlap among regional hypervolumes (i.e., hypervolumes generated by the data points of the different regions), the stronger the discriminatory power of a set of bio-tracers (Figures 7c, A10c and A13c). Notably, increasing the number of dimensions is often an efficient way to reduce overlap among the data points of different regions (see Figure 7c, Appendix A, Figures A10c and A13c).

Discussion
Working in high dimensions for reliable authentication is already part of our day-to-day life. For instance, face recognition algorithms use a high number of "abstract features" to recognize faces [43,44]. Similarly, multi-messenger astronomy is experimenting with the fusion of electromagnetic radiation, gravitational waves, neutrinos and cosmic rays to observe and understand the universe through a new lens [45]. Here we acknowledge similar potential for food authentication and clarify why data fusion can enhance the discriminatory power of traceability tools, and thus play a major role in food authentication in the foreseeable future, as other authors have predicted [12]. Our simulations suggest that multi-tracer approaches are increasingly strengthened by spatial tracer covariance and, importantly, allow rapid provenance detection even in the face of noise relative to low dimensional approaches. This is especially relevant for horizontal data fusion of bio-tracers as they are plentiful (some of which have just started to reveal their potential in tracing food products [46]) and reflect various interactions between individuals and their immediate environment. Hence, together with technical advancements that trace movements of food products (such as blockchain), using bio-tracer based fingerprinting strategies to verify the origin of food products can contribute to making the food supply chain more transparent, more robust and eventually more sustainable.
Even though working in high dimension could be a very efficient approach, it also comes with its challenges: even if additional dimensions increase the discriminatory power of statistical classifiers, it comes at a cost as probability density estimates are more difficult and thus less accurate [38]. This is where dimension reduction methods, such as PCA, can be utilized, as they allow for working in a simpler space with mathematically-desirable properties (e.g., uncorrelated axes that concentrate the variance). However, one needs to bear in mind that what matters is to keep as much discriminatory power as possible and thus one should realize that, for instance, working with only the few first principal components may not always provide the best authentication tool as dimensions representing a low amount of total variance may still be of major importance to separate a pair of regions or more. Researchers should rather focus on statistical tools that reduce dimensions while maximizing discriminatory power, such as stepwise LDA [47]. Fortunately, the recent boom in artificial intelligence research is bringing considerable methodological advancements in multivariate density estimation and dimension reduction [48].
Taking advantage of data fusion can only be achieved if relentless efforts are made to acquire reliable data that would be securely archived (e.g., within a blockchain) while being widely accessible. This would require creating and maintaining ad hoc digital infrastructure. In our Sockeye example, we only needed 90 samples and 17 bio-tracers to cleanly differentiate a globally ubiquitous species by geographical region; however, we only covered 1 species across 3 spatially coarse regions, and applications involving high diversity and fine spatial scales will require more intensive data and probably the integration of additional classes of bio-tracers. Although we did not consider DNA approaches beyond COI barcoding, we did explore DNA barcodes, well known for their species identification abilities [49,50], as a tool for spatial identification. However, the genotypic variation at the COI gene was small and showed no spatial signal (see Figure A14). Moreover, we did not investigate the temporal variations in bio-tracer distributions for the different regions, which will be a critical step as this would determine the survey frequency required to maintain reliable spatial fingerprints. Overall, the sampling effort and the data required to extensively cover fishing areas experiencing food security concerns with numerous species of interest (and/or at risk) over long periods of time would certainly be bigger by several orders of magnitude.
Ultimately, standardizing sampling protocols, building large databases and employing powerful computational tools will allow researchers and national authorities to create dynamic maps of probability of origin for any food product to be tested [37,51]. There are various strategies to improve food authentication, and employing horizontal data fusion is clearly one of them. Fortunately, we are living in an era where major technical needs have been met, thus horizontal strategies can be employed immediately, but evidently their spread will depend on the balance between the cost of their application and the economic benefits for the fishing industry, which vary across seafood products. That said, horizontal data fusion of bio-tracers could certainly be employed beyond the field of food authentication as it is a general principle where bio-tracers can be applied and combined to determine a wide array of biological properties, be it for determining the origin of a species or the structure of an entire food web.

Furthermore, we are grateful to Marie-Hélène Brice for fruitful discussions throughout the writing of this paper. Lastly, we would like to send a warm thank you to Lafille Lambert for her continuous support, from the very early stages of this study up to the writing of the final version of this article.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Mathematical Proofs
In this first part of the Supplementary Information, we provide a mathematical proof of our main result, i.e., increasing the number of bio-tracers combined increases the probability of assigning a sample to its true origin. To do so, we use several strong simplifying assumptions which are necessary for the rigor of the demonstration. Importantly, the numerical results of the main text fully align with the mathematical ones obtained below.

Appendix A.1. Objectives
We consider a sample of n individuals that belong to one of p existing populations and we aim at determining the population the sample originates from. In order to do so, q properties (e.g., fatty acid profile) are measured for every individual and we assume that the value distributions are known for every population. For the sake of clarity, we refer to these measurements as bio-tracers and we assume that inferring the population of origin equates to inferring the geographic origin of the sample. Below, we use a Bayesian framework to infer the geographic origin of a sample based on bio-tracers and to discuss how the properties of the value distributions affect the inference [36,52].

Appendix A.2. Notations and Definitions
In what follows:
• $n$, $p$ and $q$ are three natural numbers other than zero;
• $\mathbb{N}_n$ is the set of natural numbers ranging from 0 to $n$, where $n \in \mathbb{N}^*$;
• $[X]$ denotes the probability of the event $X$ ($X$ being an event, or a realization of a random variable), and $[X \mid Y]$ the probability of $X$ given $Y$;
• $\mathcal{N}(\mu, \sigma)$ denotes a Gaussian distribution of parameters $(\mu, \sigma)$;
• $E(X)$ denotes the expected value of the random variable $X$.

We consider $q$ bio-tracers and $p$ populations. For any population $j$, $f_j$ denotes the probability density function (pdf) of bio-tracer values as follows:

$$f_j : D_1 \times \cdots \times D_q \to \mathbb{R}^{+}, \quad (s_1, \ldots, s_q) \mapsto f_j(s_1, \ldots, s_q)$$

where $D_l$ denotes the support set of the $l$th bio-tracer (we use the same support set for all areas). Now let $S_1, \ldots, S_n$ be $n$ random vectors of $q$ dimensions. This collection defines the random variables for a sample of size $n$. Then, let $S_k = (S_{k,1}, \ldots, S_{k,q})$ denote the set of random variables describing the bio-tracer profile of the $k$th individual of the sample and $s_k = (s_{k,1}, \ldots, s_{k,q})$ the observed values.

Appendix A.3. Bayesian Approach
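Let $[A_i \mid s]$ be the probability that the individuals of the sample originate from population $i$ given the bio-tracer values observed. For any population $i$ ($i \in \mathbb{N}_p$):

$$[A_i \mid s] = [A_i \mid S_1 = s_1, \ldots, S_n = s_n] \tag{A1}$$

Applying Bayes' theorem yields:

$$[A_i \mid s] = \frac{[s \mid A_i]\,[A_i]}{[s]} \tag{A2}$$

Assuming that $A_1, \ldots, A_p$ is a partition (i.e., those are the only potential populations of origin):

$$[A_i \mid s] = \frac{[s \mid A_i]\,[A_i]}{\sum_{j=1}^{p} [s \mid A_j]\,[A_j]} \tag{A3}$$

Using the bio-tracer distributions and assuming independence among individuals of the sample:

$$[s \mid A_j] = \prod_{k=1}^{n} f_j(s_k) \tag{A4}$$

Equation (A3) becomes:

$$[A_i \mid s] = \frac{[A_i] \prod_{k=1}^{n} f_i(s_k)}{\sum_{j=1}^{p} [A_j] \prod_{k=1}^{n} f_j(s_k)} \tag{A5}$$

In the case where bio-tracers are independent, the densities in (A5) can be written as:

$$f_j(s_k) = \prod_{l=1}^{q} f_{j,l}(s_{k,l}) \tag{A6}$$

where $f_{j,l}$ denotes the marginal density of the $l$th bio-tracer in population $j$, which yields:

$$[A_i \mid s] = \frac{[A_i] \prod_{k=1}^{n} \prod_{l=1}^{q} f_{i,l}(s_{k,l})}{\sum_{j=1}^{p} [A_j] \prod_{k=1}^{n} \prod_{l=1}^{q} f_{j,l}(s_{k,l})} \tag{A7}$$

where the probability $[A_j]$ is prior information. This can be used to complement the information about the sample. For instance, in seafood authentication, it can be used to reflect knowledge on species distribution or fisheries pressure. If no information is available, it is a reasonable assumption to consider populations equipossible, in which case the relative sizes of the populations can be used as weights; if population sizes are comparable (or no information is available), then for any $(i, j)$, $[A_i] = [A_j]$ and (A7) becomes:

$$[A_i \mid s] = \frac{\prod_{k=1}^{n} \prod_{l=1}^{q} f_{i,l}(s_{k,l})}{\sum_{j=1}^{p} \prod_{k=1}^{n} \prod_{l=1}^{q} f_{j,l}(s_{k,l})} \tag{A8}$$

Those assumptions are the same as the ones used by a Naive Bayesian Classifier (one of the three approaches we used in the main text).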
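The posterior probability obtained under independent bio-tracers and equal priors can be sketched numerically. The Python sketch below is an illustration under stated assumptions, not the classifier used in the main text: marginal densities are taken to be Gaussian with known parameters (the main text relies on kernel density estimates), and the function names and parameter values are hypothetical. Log-densities are summed to avoid numerical underflow.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Logarithm of the Gaussian density N(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def posterior_equal_priors(sample, populations):
    """sample: list of individuals, each a list of q bio-tracer values.
    populations: dict name -> list of (mu, sigma), one pair per bio-tracer.
    Returns dict name -> posterior probability, assuming equal priors,
    independent individuals and independent bio-tracers."""
    log_likelihood = {}
    for name, parameters in populations.items():
        total = 0.0
        for individual in sample:
            for value, (mu, sigma) in zip(individual, parameters):
                total += log_normal_pdf(value, mu, sigma)
        log_likelihood[name] = total
    top = max(log_likelihood.values())
    weights = {k: math.exp(v - top) for k, v in log_likelihood.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# One individual sitting exactly at the mean of population A1.
one_tracer = posterior_equal_priors(
    [[0.0]], {"A1": [(0.0, 1.0)], "A2": [(1.0, 1.0)]}
)
two_tracers = posterior_equal_priors(
    [[0.0, 0.0]], {"A1": [(0.0, 1.0)] * 2, "A2": [(1.0, 1.0)] * 2}
)
```

In this toy case, adding a second bio-tracer with the same separation between populations raises the posterior probability of the true origin from about 0.62 to about 0.73.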

The inference can be improved in three ways:
1. Improving the a priori knowledge (which we do not consider here);
2. Considering a larger quantity and/or higher quality of observations;
3. Using more reliable inference techniques.
Below, we focus on how the data quantity and the properties of the distributions themselves influence the success of authentication. We do so as an attempt to better explain how increasing dimensionality can lead to better discriminatory power. To this end, we use simple cases where the mathematical developments are fairly straightforward.
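We consider a simplified situation wherein a sample whose individuals come from one of two existing populations, $A_1$ and $A_2$, is to be assigned to its population of origin based on the distribution of one single bio-tracer. Under these assumptions, (A7) becomes:

$$[A_1 \mid s] = \frac{[A_1] \prod_{k=1}^{n} f_{1,1}(s_{k,1})}{[A_1] \prod_{k=1}^{n} f_{1,1}(s_{k,1}) + [A_2] \prod_{k=1}^{n} f_{2,1}(s_{k,1})} \tag{A9}$$

Taking $[A_1] = [A_2]$ (i.e., a non-informative prior) and assuming $\prod_{k=1}^{n} f_{1,1}(s_{k,1}) > 0$, (A9) becomes:

$$[A_1 \mid s] = \frac{1}{1 + \dfrac{\prod_{k=1}^{n} f_{2,1}(s_{k,1})}{\prod_{k=1}^{n} f_{1,1}(s_{k,1})}} \tag{A10}$$

Assuming that the sample originates from $A_1$, our goal is to understand how the dissimilarities between the two probability distributions impact the ratio $\prod_{k=1}^{n} f_{2,1}(s_{k,1}) / \prod_{k=1}^{n} f_{1,1}(s_{k,1})$ and thus $[A_1 \mid s]$. As the latter probability also describes the success rate with which a sample is ascribed to its true provenance, we sometimes refer to this quantity as performance. Note that because the size of the sample $n$ strongly influences the performance, we also examine how $n$ affects $[A_1 \mid s]$.
We further simplify the problem and posit that $f_{1,1}$ is the density function of a Normal distribution $\mathcal{N}(\mu_1, \sigma_1)$ and $f_{2,1}$ is the density function of a Normal distribution $\mathcal{N}(\mu_2, \sigma_2)$. With a non-informative prior, (A9) becomes:

$$[A_1 \mid s] = \frac{1}{1 + \left(\dfrac{\sigma_1}{\sigma_2}\right)^{n} \exp\left(\sum_{k=1}^{n} \left[\dfrac{(s_{k,1} - \mu_1)^2}{2\sigma_1^2} - \dfrac{(s_{k,1} - \mu_2)^2}{2\sigma_2^2}\right]\right)} \tag{A11}$$

In the two following sections, we will consider two cases where dissimilarity is straightforwardly defined:
1. $\sigma_1 = \sigma_2 = \sigma$, where the dissimilarity will be quantified as $|\mu_1 - \mu_2|$;
2. $\mu_1 = \mu_2 = \mu$, where the dissimilarity will be quantified as $\sigma_1 / \sigma_2$.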

Appendix A.4.2. Identical Variances
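Under the assumption that $\sigma_1 = \sigma_2 = \sigma$, (A11) becomes:

$$[A_1 \mid s] = \frac{1}{1 + \exp(-R)}$$

where:

$$R = \frac{1}{2\sigma^2} \sum_{k=1}^{n} \left[(s_k - \mu_2)^2 - (s_k - \mu_1)^2\right] \tag{A13}$$

Expanding $R$ we obtain:

$$R = \frac{\mu_1 - \mu_2}{\sigma^2} \left(\sum_{k=1}^{n} s_k - \frac{n(\mu_1 + \mu_2)}{2}\right)$$

We are now looking for an expression of the expected performance, i.e., $E([A_1 \mid s])$, for any $n$. As we assume that all individuals of the sample originate from $A_1$, if $X_n = \sum_{k=1}^{n} s_k$ then $X_n \sim N(n\mu_1, \sqrt{n}\sigma)$. As $E(X_n) = n\mu_1$, one can notice that:

$$E(R) = \frac{n(\mu_1 - \mu_2)^2}{2\sigma^2}$$

On a side note, as $x \mapsto \frac{1}{1 + \exp(-x)}$ is concave for all $x > 0$, Jensen's inequality gives the following upper bound:

$$E([A_1 \mid s]) \leq \frac{1}{1 + \exp\left(-\dfrac{n(\mu_1 - \mu_2)^2}{2\sigma^2}\right)}$$

For the exact computation of $E([A_1 \mid s])$, given that $X_n \sim N(n\mu_1, \sqrt{n}\sigma)$, $R$ is normally distributed, $R \sim N(\mu_{LN}, \sigma_{LN})$ with $\mu_{LN} = \frac{n(\mu_1 - \mu_2)^2}{2\sigma^2}$ and $\sigma_{LN} = \frac{\sqrt{n}\,|\mu_1 - \mu_2|}{\sigma}$, so that $[A_1 \mid s]$ follows a logit-normal distribution. The moments do not have closed forms [53] but are straightforwardly computed numerically (see [54]). Importantly, here $\frac{\mu_{LN}}{\sigma_{LN}} = \frac{1}{2}\sigma_{LN}$, so the ratio increases with $\sigma_{LN}$, which is a monotonically increasing function of $|\mu_1 - \mu_2|$ and $n$. Following Frederic and Lad [53], we therefore conclude that $E([A_1 \mid s])$ increases with $|\mu_1 - \mu_2|$ and $n$, and illustrate this in Figure A1.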
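The dependence of the expected performance on $|\mu_1 - \mu_2|$ and $n$ can be checked with a short Monte Carlo sketch. It is an illustration under stated assumptions, not the code behind Figure A1: equal priors, two normal populations with a common standard deviation, and a log-likelihood ratio of the form $R = \frac{1}{2\sigma^2}\sum_k [(s_k - \mu_2)^2 - (s_k - \mu_1)^2]$, so that $[A_1 \mid s] = 1/(1 + \exp(-R))$. The function name and default arguments are hypothetical.

```python
import numpy as np


def expected_performance(mu1, mu2, sigma, n, n_sim=20000, seed=0):
    """Monte Carlo estimate of E([A_1 | s]) when the sample truly comes from A_1.

    Assumes equal priors and N(mu1, sigma) versus N(mu2, sigma) populations.
    """
    rng = np.random.default_rng(seed)
    # n_sim independent samples, each made of n individuals drawn from A_1.
    s = rng.normal(mu1, sigma, size=(n_sim, n))
    # Log-likelihood ratio R in favour of A_1 for each simulated sample.
    r = ((s - mu2) ** 2 - (s - mu1) ** 2).sum(axis=1) / (2.0 * sigma**2)
    # 1 / (1 + exp(-R)), written with logaddexp to avoid overflow.
    return float(np.mean(np.exp(-np.logaddexp(0.0, -r))))


if __name__ == "__main__":
    for n in (1, 5, 25):
        row = [expected_performance(0.0, d, 1.0, n) for d in (0.0, 0.5, 1.0, 2.0)]
        print(n, [round(v, 3) for v in row])
```

When the two means coincide, every simulated sample gives $R = 0$ and the estimate is one half; it rises towards one as either the distance between means or the sample size grows.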
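Under the assumption that $\mu_1 = \mu_2 = \mu$, and with $Y = \sum_{k=1}^{n} \frac{(s_k - \mu)^2}{\sigma_1^2}$, we have $Y \sim \chi^2(n)$ and therefore:

$$E([A_1 \mid s]) = \int_0^{\infty} \frac{f(y)}{1 + \left(\dfrac{\sigma_1}{\sigma_2}\right)^{n} \exp\left(\dfrac{y}{2}\left(1 - \dfrac{\sigma_1^2}{\sigma_2^2}\right)\right)}\, dy$$

where $f$ is the probability density function of $Y$. Using the expression of $f$, we obtain:

$$E([A_1 \mid s]) = \int_0^{\infty} \frac{y^{n/2 - 1} \exp(-y/2)}{2^{n/2}\, \Gamma(n/2) \left[1 + \left(\dfrac{\sigma_1}{\sigma_2}\right)^{n} \exp\left(\dfrac{y}{2}\left(1 - \dfrac{\sigma_1^2}{\sigma_2^2}\right)\right)\right]}\, dy$$

where $\Gamma$ is the gamma function. At first glance, it is hard to determine how $E([A_1 \mid s])$ changes with $n$ and $\sigma_1 / \sigma_2$. To examine this, we posit $x = \sigma_1 / \sigma_2$ and:

$$h(x, n) = \int_0^{\infty} \frac{y^{n/2 - 1} \exp(-y/2)}{2^{n/2}\, \Gamma(n/2) \left[1 + x^{n} \exp\left(\dfrac{y}{2}(1 - x^2)\right)\right]}\, dy$$

One can trivially show that for any $n > 1$, if $x = 1$ then $h(x, n) = \frac{1}{2}$. We further conjecture that $x \mapsto h(\exp(x), n)$ is symmetric on $\mathbb{R}$ and monotonically increasing on $\mathbb{R}^{+}$. We also conjecture that for any reals $n > 1$ and $m > 1$, if $m > n$ then $h(x, m) > h(x, n)$ (for $x \neq 1$). Simulations are presented in Figure A2a.

It is hard to assert whether this is generally true; below, we use simple cases to discuss under what conditions it holds.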
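The behaviour of $h$ can be explored numerically. The sketch below is an illustration under assumptions that are spelled out here rather than taken from the text: two normal populations with a common mean, equal priors, $x = \sigma_1 / \sigma_2$, and $Y \sim \chi^2(n)$ defined as the sum of squared deviations scaled by $\sigma_1^2$, so that the likelihood ratio of $A_2$ to $A_1$ equals $x^n \exp\left(\frac{Y}{2}(1 - x^2)\right)$. It replaces the integral by a Monte Carlo average over draws of $Y$.

```python
import numpy as np


def h(x, n, n_sim=50000, seed=0):
    """Monte Carlo estimate of E([A_1 | s]) for equal means, x = sigma_1 / sigma_2.

    Assumes equal priors and a sample of n individuals drawn from A_1.
    """
    rng = np.random.default_rng(seed)
    y = rng.chisquare(n, size=n_sim)                  # Y ~ chi-square with n d.o.f.
    # log of the likelihood ratio f_2 / f_1 over the whole sample
    log_ratio = n * np.log(x) + 0.5 * y * (1.0 - x**2)
    # 1 / (1 + ratio), computed stably on the log scale
    return float(np.mean(np.exp(-np.logaddexp(0.0, log_ratio))))


if __name__ == "__main__":
    for n in (2, 5, 20):
        print(n, [round(h(x, n), 3) for x in (0.25, 0.5, 1.0, 2.0, 4.0)])
```

The printed rows give a quick, non-rigorous check of the conjectures: the value at $x = 1$ is one half, and the estimates grow as $x$ moves away from 1 or as $n$ increases.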

.2. Simplification
We first assume that the bio-tracers are independent. In this case we have:

$$[A_1 \mid s] = \frac{1}{1 + \prod_{l=1}^{q} \prod_{k=1}^{n} \dfrac{f_{2,l}(s_{k,l})}{f_{1,l}(s_{k,l})}}$$

which is almost equivalent to increasing the number of individuals in the sample, and the demonstrations are similar to the previous ones. For instance, if for any $l$ the bio-tracer values are normally distributed with a common standard deviation $\sigma$, then (A13) becomes:

$$R = \frac{1}{2\sigma^2} \sum_{l=1}^{q} \sum_{k=1}^{n} \left[(s_{k,l} - \mu_{2,l})^2 - (s_{k,l} - \mu_{1,l})^2\right]$$

which yields:

$$R = \frac{1}{\sigma^2} \sum_{l=1}^{q} (\mu_{1,l} - \mu_{2,l}) \left(X_{n,l} - \frac{n(\mu_{1,l} + \mu_{2,l})}{2}\right)$$

where $X_{n,l} = \sum_{k=1}^{n} s_{k,l}$, so that $X_{n,l} \sim N(n\mu_{1,l}, \sqrt{n}\sigma)$ for any $l$. As all $X_{n,l}$ are independent, $R \sim N\left(\frac{n}{2\sigma^2} \sum_{l=1}^{q} (\mu_{2,l} - \mu_{1,l})^2,\ \frac{1}{\sigma}\sqrt{n \sum_{l=1}^{q} (\mu_{2,l} - \mu_{1,l})^2}\right)$. Just as in Appendix A.4.2, we have $\frac{\mu_{LN}}{\sigma_{LN}} = \frac{1}{2}\sigma_{LN}$, so the ratio increases with $\sigma_{LN}$, which is a monotonically increasing function of $\sum_{l=1}^{q} (\mu_{2,l} - \mu_{1,l})^2$ and $n$. Thus, following Frederic and Lad [53], combining bio-tracers (i.e., increasing the number of terms in the sum) increases the performance (see Figure A3). Note that in the simplified situation where $(\mu_{2,l} - \mu_{1,l})^2 = d$ for any $l$, $\sum_{l=1}^{q} (\mu_{2,l} - \mu_{1,l})^2 = qd$, so $q$, $n$ and $d$ play similar roles.

When bio-tracers are not independent, we have the following relationship where $f_{X \mid Y}$ denotes a conditional probability density function. We can only assert that as long as all ratios of conditional probabilities are less than 1, adding bio-tracers increases the overall performance, which should be true on average.

In the rest of the section, we examine the role of correlation among bio-tracers. We do so using a simple case in which bio-tracer values are drawn from multivariate normal distributions, so for any population $j$ and any individual $i$, we have $S_{i,j} \sim N(\mu_j, \Sigma_j)$, where $\Sigma_j$ is the variance-covariance matrix.
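The interchangeable roles of $q$, $n$ and $d$ in the independent case can be illustrated by sampling $R$ directly. This sketch rests on assumptions stated here rather than quoted from the text: equal priors, $q$ independent normal bio-tracers sharing a standard deviation $\sigma$ and a squared mean difference $d$, so that $R \sim N(\mu_{LN}, \sigma_{LN})$ with $\sigma_{LN} = \sqrt{nqd}/\sigma$ and $\mu_{LN} = \sigma_{LN}^2 / 2$. The function name is hypothetical.

```python
import numpy as np


def performance_from_r(n, q, d, sigma=1.0, n_sim=200000, seed=0):
    """Estimate E([A_1 | s]) by sampling the log-likelihood ratio R directly.

    n : number of individuals, q : number of independent bio-tracers,
    d : squared difference between population means for each bio-tracer.
    """
    rng = np.random.default_rng(seed)
    sigma_ln = np.sqrt(n * q * d) / sigma        # standard deviation of R
    mu_ln = 0.5 * sigma_ln**2                    # mean of R
    r = rng.normal(mu_ln, sigma_ln, size=n_sim)
    # Logistic transform of R, computed stably.
    return float(np.mean(np.exp(-np.logaddexp(0.0, -r))))


if __name__ == "__main__":
    # Adding bio-tracers (q) with a fixed sample size and mean difference.
    print([round(performance_from_r(n=5, q=q, d=0.2), 3) for q in (1, 2, 4, 8)])
```

Under these assumptions the estimate depends on $n$, $q$ and $d$ only through their product, so doubling the number of bio-tracers has the same effect as doubling the sample size.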
We start by considering two bio-tracers. In this case, the general form of $\Sigma_j$ is:

$$\Sigma_j = \begin{pmatrix} \sigma_{1,j}^2 & \mathrm{cov}(S_{1,j}, S_{2,j}) \\ \mathrm{cov}(S_{1,j}, S_{2,j}) & \sigma_{2,j}^2 \end{pmatrix}$$

We assume that for any $i$ and $j$, $\sigma_{i,j} = 1$, and we focus on the role of $\mathrm{cov}(S_{1,j}, S_{2,j})$. Note that under this assumption, $\mathrm{cov}(S_{1,j}, S_{2,j})$ is also the correlation between the two bio-tracers ($\rho_j$). As we consider only two populations, we have two correlation values, $\rho_1$ and $\rho_2$, and we further simplify the problem by positing $\rho_1 = \rho_2 = \rho$. In other words, we do not explore cases where different populations have different correlation structures among their bio-tracers. Therefore $\Sigma_1 = \Sigma_2 = \Sigma$, which is of the following form:

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

To illustrate the role of $\rho$, we use (A22) under the new set of assumptions and vary $\rho$ from 0.99 (bio-tracers are extremely correlated) to 0 (bio-tracers are independent) for an increasing sample size (Figure A4). This shows that there is a continuum: from a situation where the two bio-tracers are very correlated and act effectively as a single bio-tracer, up to the situation where we have two uncorrelated bio-tracers, which performs better. In a second step, we examine $[A_1 \mid s]$ against an increasing dimensionality (i.e., the number of uncorrelated bio-tracers), from 1 to 10, for a sample of 25 individuals (Figure A5). This shows that $E([A_1 \mid s])$ increases with dimensionality.

Figure A5. Effect of dimensionality of the set of bio-tracers on the determination of origin. Each point represents the average value from 100,000 simulations.

but more tedious. For instance, if we extend the assumptions of Appendix A.4.2 to $m$ populations (normality and independence among the $l$ populations considered, $\sigma_l = 0$), (A11) becomes: Solving this analytically is beyond the scope of this appendix.
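The effect of correlation between two bio-tracers can be sketched with a small simulation. This is not the code behind Figure A4: it assumes equal priors, unit variances, a shared correlation $\rho$, a population $A_1$ centred at zero, and a population $A_2$ shifted by the same amount `d` on both bio-tracers. The shift and the function name are assumptions chosen for illustration. The posterior uses the full bivariate normal density, so the log-likelihood ratio is a difference of Mahalanobis distances.

```python
import numpy as np


def performance_correlated(rho, n, d=0.5, n_sim=10000, seed=0):
    """Monte Carlo estimate of E([A_1 | s]) for two correlated bio-tracers.

    rho : correlation shared by both populations
    n   : number of individuals per sample (all drawn from A_1)
    d   : mean shift of A_2 on each bio-tracer (illustrative assumption)
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    mu1 = np.zeros(2)
    mu2 = np.full(2, d)
    # Shape (n_sim, n, 2): n_sim samples of n individuals, 2 bio-tracers each.
    s = rng.multivariate_normal(mu1, cov, size=(n_sim, n))
    prec = np.linalg.inv(cov)

    def mahalanobis_sq(x):
        # Squared Mahalanobis distance x^T Sigma^{-1} x for every individual.
        return np.einsum("...i,ij,...j->...", x, prec, x)

    # Log-likelihood ratio in favour of A_1, summed over individuals.
    r = 0.5 * (mahalanobis_sq(s - mu2) - mahalanobis_sq(s - mu1)).sum(axis=1)
    return float(np.mean(np.exp(-np.logaddexp(0.0, -r))))


if __name__ == "__main__":
    for rho in (0.99, 0.5, 0.0):
        print(rho, [round(performance_correlated(rho, n), 3) for n in (5, 10, 25)])
```

With this particular mean shift, the squared Mahalanobis distance between the two populations is $2d^2 / (1 + \rho)$, which falls from $2d^2$ for uncorrelated bio-tracers towards $d^2$ (the single bio-tracer value) as $\rho$ approaches 1. Other shift directions would behave differently.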

Appendix B. Additional Figures and Tables
In this part of the Supplementary Information, we provide supplementary figures and tables that complement the ones in the main text.

Appendix B.5. DNA Barcodes

Figure A14. Unrooted neighbour-joining (NJ) tree based on the p-distance of the 650 bp barcode region of the Cytochrome c Oxidase I gene. The NJ tree was generated using the Barcode of Life Data System V4 (see also boldsystems.org) [55] with the Kimura 2 Parameter distance model [56], and sequences were generated using the LifeScanner DNA sequencing kit (lifescanner.net).