Questions of identity and primordialism are at the center of scientific and public debate. Until recently, the emergence of agriculture, the spread of languages, and the rise and decline of cultures were topics dominated by archeologists. The emergence of aDNA allows paleogeneticists to enter this debate with a discordant set of assumptions about biology and identity [47]. This was not unforeseen, as population genetic analyses excel at identifying individual differences, which can inform archeologically contended subjects such as migration and the degree of admixture or population replacement. However, aDNA analyses also require destroying genetic material, sometimes irrevocably, which can make them impossible to replicate. It is therefore crucial to develop a robust genetic methodology that uses population genetic principles to examine the assumptions made by both archeologists and paleogeneticists. It is reasonable to expect that many of the tools employed to study modern-day genomes will need to be adapted to the four-dimensional environment facilitated by aDNA.
Ancestry informative markers (AIMs) remain some of the most useful tools for addressing population, biomedical, forensic, and evolutionary questions [9]. However, it is unclear to what extent known AIMs are applicable to ancient genomic data, which are characterized by high missingness and haploidy [1
In this study, we defined aAIMs (Figure 1) and sought to identify them using various methods. The number of aAIM candidates detected by each method ranged from 9,000 to 15,000. These numbers are of the same order of magnitude as those of large AIM studies (e.g., [51]) and are reasonable given the potential relatedness of the ancient Eurasian populations and the near absence of heterozygous markers in the data. To determine which of the aAIM candidate sets produced by each method best represents the true population structure, we used the CSS as a benchmark for qualitative and quantitative comparisons.
Identifying an ideal AIM set, one that is small yet includes redundancies (in case of sequencing failure), captures the population structure, and allows the identification of admixed individuals, is one of the challenges of population genetics. We showed that the aAIMs identified through the PD method outperformed those of all other methods, in agreement with previous studies that tested PCA-based methods [25]. In forty percent of the populations, classifications made by the PD method were more accurate than those made using the CSS (Table 1), which highlights the limitations of using markers indiscriminately. This is not surprising, since not all markers are equally informative, and less informative markers (e.g., exonic markers) may mask the population structure, resulting in the misclassification of populations. The notion that “more is better” is hence particularly misguided for aDNA, which harbors a multi-layered population structure in a poor set of markers. The application of the PD aAIMs to admixture mapping, combined with tools that can homogenize cases and controls [16], enables future association studies on aDNA samples (e.g., [3]). Further investigations with additional data may identify formerly common disease-associated markers that became rare and undetectable over time.
The use of PCA to infer population structure is controversial [53], and its use as a clustering method has been criticized [16]. We note that the PD method employs PCA only to produce and replicate a population structure profile of certain subpopulations based on various sets of markers; it does not claim that the PCA-derived profiles represent the true genetic distances between individuals.
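To make the idea of producing and replicating a population structure profile concrete, the following is a minimal sketch only and is not the PD method itself. It assumes haploid 0/1 calls with missing values coded as NaN, fills missing calls with the per-SNP mean (an assumption made purely for illustration), and checks how well the first principal component from a random marker subset tracks the one from the full marker set. The function name `pca_scores` and the simulated data are hypothetical.

```python
import numpy as np

def pca_scores(geno, n_components=2):
    """Project individuals onto the leading principal components.

    geno: (n_individuals, n_snps) array of haploid 0/1 calls, np.nan = missing.
    Missing calls are replaced by the per-SNP mean (illustrative assumption).
    """
    col_mean = np.nanmean(geno, axis=0)
    filled = np.where(np.isnan(geno), col_mean, geno)
    centered = filled - col_mean
    # SVD of the centered matrix: individual scores are U * S.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated example: two strongly differentiated populations, 10% missing calls.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 20, 200
pop_a = (rng.random((n_per_pop, n_snps)) < 0.9).astype(float)
pop_b = (rng.random((n_per_pop, n_snps)) < 0.1).astype(float)
geno = np.vstack([pop_a, pop_b])
geno[rng.random(geno.shape) < 0.1] = np.nan

# Profile from all markers versus a profile from a random subset of 50 markers.
full_profile = pca_scores(geno)
subset = rng.choice(n_snps, size=50, replace=False)
subset_profile = pca_scores(geno[:, subset])

# Agreement of the first PC (absolute value, since the sign of a PC is arbitrary).
pc1_agreement = abs(np.corrcoef(full_profile[:, 0], subset_profile[:, 0])[0, 1])
```

In this sketch the PCA coordinates serve only as a profile to be reproduced by a marker subset; as noted above, they are not treated as true genetic distances between individuals.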
Surprisingly, Infocalc and FST, which are commonly used to identify AIMs [18] and are reported to perform well [57], often underperformed random SNP selections. Not only was FST already shown to be particularly small within continental populations [58], but these methods may also be particularly sensitive to aDNA data, which are both haploid and have high missingness (Figure S9). We also found no relationship between MAF and the performance of the aAIMs (Figure S5). Enrichment for high- or low-MAF SNPs did not guarantee success, although the PD aAIMs harbored more common SNPs than those of most of the underperforming methods.
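The per-SNP statistics discussed above can be sketched for haploid data with missing calls as follows. This is an illustrative sketch, not the implementation used in this study: it assumes 0/1 calls with NaN for missing data, uses a simple Wright-style FST, (H_T - H_S) / H_T, with unweighted population means, and computes the informativeness for assignment (I_n) of Rosenberg et al. (2003), the statistic reported by Infocalc, for biallelic markers. All function names and the toy genotypes are hypothetical.

```python
import numpy as np

def allele_freqs(geno, labels):
    """Per-population frequency of allele 1; geno is (n_ind, n_snps), np.nan = missing."""
    pops = np.unique(labels)
    freqs = np.full((len(pops), geno.shape[1]), np.nan)
    for k, pop in enumerate(pops):
        sub = geno[labels == pop]
        called = np.sum(~np.isnan(sub), axis=0)
        with np.errstate(invalid="ignore"):
            freqs[k] = np.nansum(sub, axis=0) / called  # NaN if nothing was called
    return pops, freqs

def minor_allele_freq(geno):
    """Pooled minor allele frequency per SNP, ignoring missing calls."""
    called = np.sum(~np.isnan(geno), axis=0)
    with np.errstate(invalid="ignore"):
        p = np.nansum(geno, axis=0) / called
    return np.minimum(p, 1 - p)

def wright_fst(freqs):
    """Simple per-SNP FST = (H_T - H_S) / H_T with unweighted population means."""
    p_bar = np.nanmean(freqs, axis=0)
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = np.nanmean(2 * freqs * (1 - freqs), axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(h_t > 0, (h_t - h_s) / h_t, np.nan)

def informativeness(freqs):
    """I_n for biallelic SNPs (natural log); NaN where any population lacks data."""
    def xlogx(x):
        with np.errstate(invalid="ignore", divide="ignore"):
            return np.where(x > 0, x * np.log(x), 0.0)
    k = freqs.shape[0]
    total = np.zeros(freqs.shape[1])
    for allele in (freqs, 1 - freqs):
        p_bar = allele.mean(axis=0)
        total += -xlogx(p_bar) + xlogx(allele).sum(axis=0) / k
    total[np.isnan(freqs).any(axis=0)] = np.nan
    return total

# Toy data: 3 individuals per population, 3 SNPs, some missing calls.
nan = np.nan
geno = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [1, nan, nan],
    [0, 0, 1],
    [0, 1, 0],
    [0, nan, 0],
])
labels = np.array(["A", "A", "A", "B", "B", "B"])

pops, freqs = allele_freqs(geno, labels)
maf = minor_allele_freq(geno)
fst = wright_fst(freqs)
info = informativeness(freqs)
```

In the toy data, the first SNP is a fixed difference between the two populations and the second has identical frequencies in both, so they mark the two extremes of each statistic. With few called alleles per population, as is common in aDNA, the frequency estimates feeding these statistics are themselves noisy, which is one way high missingness can degrade such rankings.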
Our study has several limitations. First, we studied an uneven number of Eurasian populations from various times and locations, causing a skew toward markers that predict central European populations from the Late Neolithic and Bronze Age. A modest attempt to reduce this bias was made by including modern-day African and Asian populations; however, more comprehensive analyses should be carried out when more global genomes are available from consecutive eras. Second, the aAIMs were calculated independently by each method, with individual populations treated as independent, although the PCA and ADMIXTURE plots indicate that central European populations may not be independent. Finally, due to the high missingness of the data, it is likely that our study missed informative markers that could improve the classification accuracy in newly sequenced populations. Therefore, our framework and methods should be applied again when more comprehensive aDNA datasets are available.