Open-Access Worldwide Population STR Database Constructed Using High-Coverage Massively Parallel Sequencing Data Obtained from the 1000 Genomes Project

Achieving accurate STR genotyping by using next-generation sequencing data has been challenging. To provide the forensic genetics community with a reliable open-access STR database, we conducted a comprehensive genotyping analysis of a set of STRs of broad forensic interest obtained from the 1000 Genomes populations. We analyzed 22 STR markers using files of the high-coverage dataset of Phase 3 of the 1000 Genomes Project. We used HipSTR to call genotypes from 2504 samples obtained from 26 populations. We were not able to detect the D21S11 marker. The Hardy-Weinberg equilibrium analysis coupled with a comprehensive analysis of allele frequencies revealed that HipSTR was not able to identify longer alleles, which resulted in heterozygote deficiency. Nevertheless, AMOVA, a clustering analysis performed with STRUCTURE, and a Principal Coordinates Analysis showed a clear-cut separation between the four major ancestries sampled by the 1000 Genomes Consortium. Except for larger Penta D and Penta E alleles, and two very small Penta D alleles (2.2 and 3.2) usually observed in African populations, our analyses revealed that allele frequencies and genotypes offered as an open-access database are consistent and reliable.


Introduction
Next-generation sequencing (NGS), also known as massively parallel or deep sequencing, is a technology that allows millions of DNA fragments to be sequenced in parallel. NGS can deal with several regions or targets simultaneously, enabling variation sites or mutations in the genome to be detected. This technology has allowed worldwide human genetic diversity to be studied for various purposes, including forensic human identification [1][2][3].
Advances in genomics have made NGS techniques more accessible, mostly because of lower costs. Currently, many researchers are performing whole-exome (WES) and even whole-genome (WGS) sequencing to estimate polygenic risk scores and probabilities of developing multifactorial diseases associated with various genetic regions at once, which would be more laborious and costly with traditional methodologies [1].
The 1000 Genomes Consortium is a worldwide collaboration that has produced an extensive catalog of human genetic variation. The consortium has sequenced whole genomes of 2504 individuals belonging to multiple populations derived from five population groups: African, East Asian, European, South Asian, and admixed Americans [4]. These data are freely available at the International Genome Sample Resource website (https://www.internationalgenome.org; accessed on 5 July 2021) and can be used to generate a variant call format (VCF) file with a set of specific command lines [5]. In 2015, during Phase 3 of the Project, the consortium analyzed the genomes of all the individuals by using a combination of low-coverage whole-genome sequencing (WGS), deep exome sequencing, and dense microarray genotyping. The consortium described worldwide patterns of genomic diversity on the basis of Single Nucleotide Polymorphisms (SNPs), indels, and structural variants (SVs), including deletions, insertions, duplications, inversions, and copy-number variants (CNVs), but it did not analyze or study short tandem repeat (STR) markers in depth [6].
STR markers are crucial in human identification. These markers have high polymorphism levels and are particularly useful for interpreting mixtures of biological samples. However, in addition to the issue of small-sized amplicons, genotyping STR markers by using NGS data is difficult because alignment and stutter errors are frequent [7]. Achieving accurate genotyping by employing NGS data has been challenging because these data have high sequencing error rates [8]. Gymrek et al. (2012) managed to obtain and to analyze STR markers from the dataset of the 1000 Genomes Project using lobSTR [9]. Given that high coverage is mandatory for reliable STR genotype calling to be achieved, a primary concern regarding that study was that the data obtained from the 1000 Genomes Project available for lobSTR were generated by employing shallow sequencing coverage (2x-6x), so the calling was potentially susceptible to errors [10].
To circumvent this coverage issue, the New York Genome Center (NYGC) recently re-sequenced the 2504 samples of the panel of Phase 3 of the 1000 Genomes Project with high (30x) coverage, and aligned the sequence data to GRCh38. These publicly available data could be used to call STR markers reliably [5,11].
NGS technology allows dozens of STR markers to be analyzed together with different classes of markers that provide complementary contributions to population genetics and human identification. For example, including SNPs used as predictors of ancestry and phenotypic characteristics into commercial kits that employ capillary electrophoresis is unfeasible, but they can be combined with STR markers in NGS assays [1]. The problem with the NGS technology is the large amount of data generated and the lack of bioinformatic tools to analyze it [1]. Some tools (e.g., lobSTR [9], STRait Razor [12], toaSTR [13], and HipSTR [10], among others) were developed to analyze STR markers by using NGS data. Each tool employs different algorithms and flanking regions to capture STR reads.
Haplotype inference and phasing for STRs (HipSTR) was developed for calling microsatellites specifically from Illumina WGS data. HipSTR was designed to deal with genotyping errors and to obtain more robust STR genotypes. HipSTR accomplishes this by learning locus-specific PCR stutter models with the aid of an EM algorithm, by employing a specialized hidden Markov model to align reads to candidate alleles while accounting for STR artifacts, and by using phased SNP haplotypes to genotype and to phase STR markers. These features have made HipSTR one of the most reliable tools for genotyping STRs from Illumina sequencing data [10,14].
In contrast to other tools, HipSTR can process hundreds of samples at once. It also allows the user to determine the set of STR markers that must be analyzed and the flanking regions that must be used to capture them. In fact, previous studies showed that HipSTR provides accurate genotype calling. HipSTR accuracy was tested by comparing WGS calls from 118 samples to capillary electrophoresis data, which resulted in 98.8% consistency [10,15]. Recently, we compared HipSTR with STRait Razor and toaSTR and found that the three tools present high allele calling accuracy (greater than 97%) [14]. Although data processing with HipSTR is more complex and requires bioinformatics knowledge and some nomenclature adjustments, this tool is currently the fastest and most appropriate to deal with larger datasets, including whole genomes [14].
In this investigation we conducted a comprehensive genotyping analysis of a set of STRs of broad forensic interest obtained from the 1000 Genomes populations, aiming to release a reliable open-access STR database that should contribute to future studies in the field of forensic genetics.
To genotype the 22 STR markers based on the human reference genome GRCh38, we ran the HipSTR algorithm for each individual. For this purpose, we used a BED file with the coordinates of each STR region of interest, which was available in the HipSTR repository [10] (https://hipstr-tool.github.io/HipSTR-tutorial/; accessed on 10 July 2021) as described elsewhere [14]. We applied the calling filter (15% stutter model) and a minimum of eight reads to obtain more reliable genotypes. According to a binomial distribution, this minimum number of reads ensures (p > 0.99) that a homozygous genotype is called because of lack of variability at a given locus and not because the second allele has not been sampled.
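The read-depth threshold can be checked with a short calculation. The sketch below is illustrative only and is not part of the original pipeline: it assumes that each read from a true heterozygote samples either allele with probability 0.5 (no allelic imbalance, stutter, or mapping bias), and the function name is hypothetical.

```python
def prob_both_alleles_sampled(n_reads: int, allele_fraction: float = 0.5) -> float:
    """Probability that n_reads drawn from a true heterozygote include both alleles.

    Assumes independent reads, each sampling the first allele with
    probability `allele_fraction` (0.5 means perfectly balanced alleles).
    """
    p_only_first = allele_fraction ** n_reads
    p_only_second = (1.0 - allele_fraction) ** n_reads
    return 1.0 - p_only_first - p_only_second


if __name__ == "__main__":
    for depth in (6, 7, 8, 10):
        print(depth, round(prob_both_alleles_sampled(depth), 4))
```

Under these assumptions, seven reads give a probability of about 0.9844 and eight reads give about 0.9922, so eight is the smallest depth that exceeds 0.99.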
To perform genotype calling, we used the VCF output file produced by HipSTR and took three parameters into account: the reference allele of each marker, the period (i.e., the length of each STR repeat unit), and the base pair differences (GB) as compared to the reference allele. We adjusted the nomenclature for D19S433, Penta D, Penta E, and vWA by following the recommendations made by Valle-Silva et al. [14]: removal of two repeat units from all D19S433 and vWA alleles called by HipSTR, inclusion of one repeat unit into all Penta D alleles, and removal of two nucleotides from all Penta E alleles. By using IGV software 2.8.2 [16,17] and the HipSTR VizAln function [10], Valle-Silva et al. [14] demonstrated that such adjustments are necessary to prevent some base pairs from shifting in allele calling when compared to the nomenclature established by the ISFG [18].
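The conversion from these three parameters to a repeat-number designation can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the marker keys, and the adjustment table are hypothetical, and the shifts simply restate the adjustments described above in base pairs (assuming a 4 bp repeat unit for D19S433 and vWA and a 5 bp repeat unit for Penta D and Penta E).

```python
# Hypothetical adjustment table (in base pairs) restating the nomenclature
# corrections described in the text; the marker keys are illustrative.
NOMENCLATURE_SHIFT_BP = {
    "D19S433": -2 * 4,  # remove two 4-bp repeat units
    "vWA": -2 * 4,      # remove two 4-bp repeat units
    "PentaD": +1 * 5,   # add one 5-bp repeat unit
    "PentaE": -2,       # remove two nucleotides
}


def allele_label(ref_repeats: float, period: int, gb: int, marker: str = "") -> str:
    """Convert a base-pair difference (GB) into a repeat-number allele label.

    ref_repeats: number of repeat units in the reference allele
    period:      length of the repeat unit in base pairs
    gb:          base-pair difference of the called allele from the reference
    marker:      optional marker name used to look up a nomenclature shift
    """
    ref_bp = round(ref_repeats * period)
    total_bp = ref_bp + gb + NOMENCLATURE_SHIFT_BP.get(marker, 0)
    if total_bp <= 0:
        raise ValueError("allele length must be positive")
    whole, partial = divmod(total_bp, period)
    # Incomplete repeats follow the usual "repeats.extra_bases" notation, e.g. 9.3
    return str(whole) if partial == 0 else f"{whole}.{partial}"
```

For example, a tetranucleotide marker with 10 reference repeats and GB = -1 yields the label 9.3, and a vWA call with 16 reference repeats and GB = 0 is reported as 14 after the two-repeat adjustment.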

Statistical Analysis
We calculated allele frequencies, the Hardy-Weinberg equilibrium, and forensic parameters {Match Probability (MP), Power of Discrimination (PD), Power of Exclusion (PE), and Polymorphism Information Content (PIC)} for each population sample or each population group using GenAlEx 6.5 [19] and STRAF 2.5.1 [20] software.
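For readers who wish to recompute these parameters from the released genotypes, the sketch below uses common textbook definitions: MP as the sum of squared observed genotype frequencies, PD = 1 - MP, PIC from allele frequencies, and PE from the observed heterozygosity h as h^2(1 - 2h(1 - h)^2). These are assumptions about the formulas; GenAlEx and STRAF may apply different estimators or corrections, and all function names here are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod


def forensic_parameters(genotypes):
    """Per-locus forensic parameters from a list of (allele, allele) tuples."""
    n = len(genotypes)
    # Sort alleles so that (a, b) and (b, a) count as the same genotype
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = [c / (2 * n) for c in allele_counts.values()]

    match_prob = sum((c / n) ** 2 for c in geno_counts.values())
    power_disc = 1.0 - match_prob

    sum_sq = sum(p * p for p in freqs)
    cross = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    pic = 1.0 - sum_sq - cross

    h = sum(1 for a, b in genotypes if a != b) / n  # observed heterozygosity
    power_excl = h ** 2 * (1.0 - 2.0 * h * (1.0 - h) ** 2)

    return {"MP": match_prob, "PD": power_disc, "PIC": pic, "PE": power_excl}


def combined_mp(per_locus_mp):
    """Combined match probability, assuming independent loci."""
    return prod(per_locus_mp)


def combined_pe(per_locus_pe):
    """Combined power of exclusion, assuming independent loci."""
    return 1.0 - prod(1.0 - pe for pe in per_locus_pe)
```

The combined values multiply across loci, which is only appropriate if the markers are treated as independent.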
We employed Principal Coordinates Analysis (PCoA) using GenAlEx [19], Analysis of Molecular Variance (AMOVA) using Arlequin [21], and clustering analyses using STRUCTURE 2.3.4 [22] to explore how genetic diversity is distributed across populations of different ethnic backgrounds. We performed STRUCTURE analysis for k ranging from 3 to 6 by applying the correlated allele frequencies model, with 100,000 burn-in steps followed by 100,000 Markov Chain Monte Carlo iterations, in 100 independent runs. We selected the results from the runs with the largest "Estimated Ln Probability of Data" {LnP (D)} and depicted them in bar plots created with Distruct 1.1 [23].
We also compared the allele frequencies estimated from the 1000 Genomes Project dataset with STR data retrieved from the same five major population groups (African, European, East Asian, South Asian, and admixed American) that compose the SPSmart STR browser (PopSTR) [24]. For this purpose, we employed Arlequin software to compare the allele frequencies of each STR marker for a given population group between the two datasets by using FST and an exact test of population differentiation based on genotype frequencies [21]. We made this comparison to verify the reliability of genotype data generated by HipSTR.
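To give an intuition for what such a comparison measures, the sketch below computes a simple heterozygosity-based FST, (H_T - H_S) / H_T, between two sets of allele frequencies for one marker. This is a simplified stand-in and an assumption on our part: Arlequin uses an AMOVA-based estimator with permutation testing, which this example does not reproduce. Function names and the equal weighting of the two datasets are hypothetical choices.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), from an allele -> frequency dict."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst_two_datasets(freqs_a, freqs_b):
    """Heterozygosity-based F_ST between two datasets, weighted equally."""
    alleles = set(freqs_a) | set(freqs_b)
    # Pooled frequencies: simple average of the two datasets
    pooled = {a: (freqs_a.get(a, 0.0) + freqs_b.get(a, 0.0)) / 2 for a in alleles}
    h_s = (expected_heterozygosity(freqs_a) + expected_heterozygosity(freqs_b)) / 2
    h_t = expected_heterozygosity(pooled)
    # A monomorphic pooled sample carries no information on differentiation
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t
```

Identical frequency distributions give a value of 0, whereas two datasets fixed for different alleles give a value of 1.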

Results
The STR genotypes defined for each individual from the newest dataset released by the 1000 Genomes Project are available in Supplementary Table S1 as an open-access database. We excluded the D21S11 marker because we did not succeed in genotyping it (see Discussion). Apart from this marker, the mean coverage for calling genotypes ranged from 37.14 (TPOX) to 52.53 (D12S391) (Table 1). The average successful calling rate was 98.59%; this rate ranged from 84.18% (Penta E) to 100% (CSF1PO, D2S441, D2S1338, D3S1358, D5S818, D8S1179, D22S1045, and TPOX) (Table 2). Table 2 lists the allele frequencies and forensic parameters estimated for the whole dataset. The allele frequencies and forensic parameters estimated for each of the 26 populations (Supplementary Table S2) and the five population groups (Supplementary Table S3) are available as Supplementary Data. In general, the most polymorphic loci in all the populations were D1S1656, D2S1338, D12S391, D18S51, and FGA (Table 2). The analyzed loci were highly informative, with elevated PD values ranging between 86.59% (TPOX) and 97.76% (D1S1656). The combined MP was 5.72 × 10^−27, and the combined PE was 0.99999997. Analysis of each locus in each population (Supplementary Table S2) showed that D22S1045 in PEL (71.61%) and D1S1656 in GBR (97.52%) presented the lowest and the highest PD value, respectively. The combined MP ranged from 1.98 × 10^−25 in ACB to 2.20 × 10^−21 in PEL.
We tested whether genotype frequencies conformed to Hardy-Weinberg equilibrium expectations for each STR marker at the population level (Table 3). Penta E presented heterozygote deficiency in 24 out of the 26 populations, leading to departures from the Hardy-Weinberg equilibrium. This finding indicated that HipSTR incorrectly called many heterozygous genotypes as homozygous. Disregarding Penta E, the number of deviations per marker ranged from one (D13S317 and D16S539) to five (D19S433 and Penta D), and the number of deviations across populations ranged from zero (ASW and CEU) to seven (PUR), with an average of 2.42 departures in each population. When we considered the Bonferroni correction for multiple tests, only 39 departures remained significant, and most of them (61.53%) concerned Penta E.
Principal Coordinates Analysis (PCoA) revealed four different population clusters (Figure 1). The first coordinate separated the cluster of African (AFR) populations on the right side. On the left side, we observed three different population groups: the European (EUR) populations in the upper part, the East Asian (EAS) populations in the lower section, and the South Asian (SAS) populations between them. The CLM, MXL, PEL, and PUR admixed populations clustered with the European populations, while the ACB and ASW populations clustered with the African (AFR) populations, reflecting their ancestry compositions.
We obtained similar results when we conducted the STRUCTURE analysis. Figure 2 depicts the STRUCTURE results derived from runs obtained with k ranging from three to six. When k = 4, each cluster reflected one of the major ancestries of the 1000 Genomes Project. Moreover, each of the six admixed American populations presented varying levels of ancestries from the four biogeographical groups.
To verify the distribution of variance at different levels, we performed AMOVA by assuming a hierarchical structure that gathered the populations in four population groups: AFR, EAS, EUR, and SAS. We did not take the six populations in the AMR population group into account because their admixed compositions would bias the AMOVA results by reducing the proportion of variance between groups. We observed most of the variance within populations (97.12%). Differences between the four population groups accounted for 2.54% of the variance, whereas only 0.34% of the variance occurred due to differences between populations belonging to the same group.
By using FST, we also compared the allele frequencies estimated from the dataset of the 1000 Genomes Project to the STR data retrieved for the same five major population groups (African, European, East Asian, South Asian, and admixed American) that composed the SPSmart STR browser (PopSTR) [24] (Table 4). While the AMR (four), EAS (three), EUR (eight), and SAS (four) population groups presented small numbers of markers with significantly different frequencies between the two datasets, AFR presented 17 significant differences. This pattern might reflect the set of populations that compose the compared groups. Penta E was the only marker that showed significantly different FST values in all comparisons. By leaving AFR and Penta E aside, we observed only 15 significant differences out of 80 comparisons: the mean number of statistically significant differences was 0.75 per marker; this number ranged from zero (eight STR markers) to three (D2S441). When we considered the Bonferroni correction for multiple tests, only three of these 15 FST values remained significant, while six out of 16 significant differences observed for AFR (leaving Penta E aside), and all five Penta E differences remained significantly different.

Discussion
The present study provides the most diverse database of forensic autosomal STR markers obtained from global populations. STR markers display high levels of polymorphism, which makes them attractive for forensic purposes and population genetics studies. This is the first time that the 1000 Genomes high-coverage (~30x) dataset has been used for STR genotyping purposes. Although a few previous initiatives [9,25,26] attempted to genotype forensically relevant STRs, they only dealt with previous low-coverage 1000 Genomes releases (~7.4x), which prevented the acquisition of results or resulted in highly unreliable genotypes due to high rates of allele dropout. Moreover, it should be emphasized that even the most recent paper that presented the high-coverage WGS data did not include STR variants in the results and stated that genotyping STRs from such data remains a considerable challenge [27].
In forensic genetics, STR markers are the most widespread and informative tool for human identification. In spite of the limitations addressed below, such as unreliability of Penta D and Penta E genotypes involving specific alleles, this NGS-based STR database presents reliable allele frequencies that could be used in criminal casework to estimate the rarity of a given STR-based profile from a query sample of unknown or uncertain ancestry in various worldwide populations. This could instantly, and without additional costs, trigger a DNA-based intelligence strategy to guide enquiries [28], providing hints and/or assigning biogeographical origin in many situations, such as a missing person investigation [28,29], leaving only the most complex cases for supplementary analysis with a more suitable set of Ancestry Informative Markers.
Short-read next-generation sequencing is slowly being introduced in forensic labs worldwide. Although such technology is still restricted and expensive, it has become more sensitive, requiring as little as 25 pg of extracted DNA, and is suitable to solve more complex cases, such as discrimination of twins (using STRs, WGS or mtDNA sequencing approaches) and deconvolution of highly unbalanced mixtures (reviewed in [30]). Some criminal [31][32][33], kinship [34] and missing persons [35] casework already benefiting from this has been reported. However, genotyping STR markers by using NGS data, especially WGS assays, may be challenging: accurate genotyping requires high coverage, longer alleles are difficult to detect due to reads of limited sizes, and mutations in flanking regions may lead to null alleles [36]. These and other issues have been addressed by van der Gaag et al. [37] and Valle-Silva et al. [14].
Notwithstanding the challenges addressed here, several studies have demonstrated that STRs can be genotyped by using dedicated bioinformatics tools. Tools such as lobSTR [9], toaSTR [13], STRait Razor [12], and HipSTR [10], among others, have shown consistent and accurate results [14,15]. Moreover, Bornman et al. [8] demonstrated that, by using an NGS approach, CODIS loci could be accurately called even from mixtures.
Particularly for the deconvolution of mixtures, the identification of isometric alleles (i.e., alleles with the same length but containing different repeat sequences) is a necessary task, since it further increases the discriminating power of the currently used STR markers; nevertheless, it is not achieved with traditional PCR and capillary electrophoresis techniques [2,3]. This sequence-based analysis is already feasible with small-scale targeted sequencing assays, particularly those using kits and software solutions tailored for forensic purposes, such as the ForenSeq DNA Signature Prep Kit coupled with the ForenSeq™ Universal Analysis Software (Verogen Inc., San Diego, CA, USA) [38] or the Precision ID GlobalFiler™ NGS STR Panel v2 coupled with the Converge Software NGS Analysis Module (Thermo Fisher Scientific) [39], but it is still a challenge for large-scale WGS assays. In order to achieve this goal concerning big data in the near future, new bioinformatics tools must be developed, or the current ones further improved.
Willems et al. [26] analyzed human STR variation by using lobSTR. These authors employed the data of Phase 1 of the 1000 Genomes Project. The data were generated by using low-sequencing coverage, which is excessively error-prone. In fact, the authors reported difficulties in detecting both alleles in each sample, which resulted in an overall deficit of heterozygotes. As previously addressed, several reasons led us to choose HipSTR to call STR genotypes from this high-coverage dataset of the 1000 Genomes Project. Because HipSTR allows the flanking regions to be customized, almost any STR marker can be evaluated in hundreds of samples at once. At first glance, HipSTR may appear more complex, but it is the most appropriate tool to deal with whole genomes. In addition, a recent evaluation of the performance of this tool revealed high efficiency and accuracy levels [14].
Although HipSTR provides flexibility, the major limitation of this study is the inability to genotype D21S11, which is one of the 20 CODIS loci. Additional limitations are the failure to detect two very small Penta D alleles and the biased allele frequencies of very large Penta D and Penta E alleles, probably because of sequence-specific features, such as the GC content [40][41][42], producing low depth of coverage bias and/or the limited length of the Illumina NGS reads (150 bp paired-end reads). The read-length issue could be circumvented with long-read sequencing technologies, such as those implemented in Pacific Biosciences (PacBio) and Oxford Nanopore platforms. However, one should not expect that long-read sequencing would be suitable for a wide range of forensic samples, which are often degraded and/or available in low amounts [40][41][42][43]. It is noteworthy that, by employing 300 nucleotide-long paired-end reads in a targeted sequencing assay, we successfully genotyped D21S11 with HipSTR, which suggests a sequencing methodology issue rather than a bioinformatics issue [14].
In this study, Penta D and Penta E showed 10.74% and 15.81% of missing data, respectively. By using Illumina sequencing technology, van der Gaag et al. [37] showed that longer alleles of Penta D, Penta E, and FGA presented sequencing errors at the end of the reads, which resulted in null alleles and genotyping errors. As observed for D21S11, this issue was probably related to the impossibility of detecting longer alleles due to read-length constraints. Furthermore, we unexpectedly did not detect two very small Penta D alleles (2.2 and 3.2), which are common in African populations. Supplementary Table S4 compares the allele frequencies estimated in the present study with the allele frequencies obtained from the SPSmart STR browser (PopSTR) [24] for the major population groups. Such straightforward comparison showed that we were not able to detect alleles larger than 18 in Penta E. This failure led directly to Hardy-Weinberg equilibrium deviations (Table S3) due to deficit of heterozygotes in 24 out of the 26 studied populations. Thus, allele frequencies estimated for Penta E were strongly biased toward increased frequencies of shorter alleles and have limited applicability (Supplementary Table S2). The probabilities obtained with the FST analysis (Table 4) supported this conclusion: Penta E presented significant FST values in all five comparisons. Although Penta D and FGA also posed this problem, their undetected alleles usually have low frequencies, except for Penta D alleles 2.2 and 3.2 in African populations (Supplementary Table S4). Therefore, this technical issue did not influence the Hardy-Weinberg equilibrium and FST analyses of these markers as much as it did for Penta E. Although this comparison is valid and helpful, we must emphasize that the compared samples corresponded to distinct population groups. The African population group in PopSTR comprised mainly East African Somalian individuals (404 out of 507 samples), while the African populations in the 1000 Genomes Project samples corresponded to West Africa. Similarly, over 50% of the European population group in PopSTR was composed of U.S. Europeans (1443 out of 2135) [5,24]. Taken together, these results attest that the bioinformatics analysis performed in the present study is robust, and that the distribution of allele frequencies is reliable for all loci except Penta E.
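The mechanism linking undetected long alleles to heterozygote deficiency can be illustrated with a toy example. The sketch below is hypothetical and does not reproduce HipSTR's behavior: it simply assumes that any allele longer than a fixed repeat number is never observed, so that a heterozygote carrying one such allele is reported as homozygous for the shorter allele and a genotype with two such alleles yields no call.

```python
def call_with_length_limit(genotype, max_repeats):
    """Return the genotype that would be reported if long alleles drop out."""
    seen = [a for a in genotype if a <= max_repeats]
    if not seen:
        return None  # both alleles too long: missing data
    if len(seen) == 1:
        return (seen[0], seen[0])  # surviving allele reported as homozygous
    return tuple(seen)


def observed_heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for a, b in called if a != b) / len(called)


if __name__ == "__main__":
    # Invented genotypes, in repeat numbers, for illustration only
    true_genotypes = [(7, 12), (10, 20), (12, 19), (5, 15), (21, 22), (11, 11)]
    reported = [call_with_length_limit(g, max_repeats=18) for g in true_genotypes]
    print(observed_heterozygosity(true_genotypes))
    print(observed_heterozygosity(reported))
```

In this invented sample, five of six true genotypes are heterozygous, but only two of the five reported genotypes are, which also shifts the estimated allele frequencies toward the shorter alleles.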
The most polymorphic loci in the whole dataset of the 1000 Genomes Project were D1S1656, D2S1338, D12S391, D18S51, and FGA. All these markers presented high degrees of polymorphism throughout the world. AMOVA revealed that most of the variance (97.12%) in allele frequencies occurred within populations, corroborating previous studies [44,45]. A study that evaluated human population structure using genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 worldwide populations revealed that the variance within populations accounts for 93 to 95% of genetic variation, while differences among major groups constitute only 3 to 5% [45,46]. Although the numbers of populations and genetic markers are quite different, the larger amount of variance within populations and lower variance among groups observed in the present study may be due either to chance or to the fact that forensic STRs show relatively lower FST than random STRs because of the increased heterozygosity of the former [46]. However, as expected, AMOVA, together with Principal Coordinates Analysis (Figure 1) and the clustering analysis performed with STRUCTURE (Figure 2), confirmed that the four ancestral population groups (AFR, EUR, EAS, and SAS) defined by the 1000 Genomes Consortium did differ significantly from each other. Given that the admixed American populations present different ancestry compositions (Figure 2), most of them clustered with Europeans, while ACB and ASW clustered with Africans (Figure 1).
The results obtained with the STRUCTURE software corroborated the relationship between the different population groups and provided additional support for the reliability of the called genotypes. When k = 3, SAS resembled an admixture between EAS and EUR. A specific cluster for SAS emerged when k = 4. When k = 5, a minor Eurasian (shared between EUR and EAS) component arose. When k = 6, the SAS-shared ancestry with EUR and EAS became more evident. Regarding the admixed American populations, irrespective of the number of clusters considered, ACB and ASW revealed their predominant African origin, CLM and PUR revealed more extensive European ancestry, and MXL and PEL revealed almost equal amounts of European and Amerindian (i.e., EAS) ancestries. These results fully corroborated the distribution of the populations in the PCoA (Figure 1). Additional clusters did not provide increased resolution with straightforward meaning.
The outcome of this population genetics evaluation further corroborates the robustness and reliability of this STR dataset. Beyond the applications already addressed at the beginning of this section, the most important contribution of this open-access genotype dataset probably lies in the fact that it may be used to estimate and establish additional population genetics parameters that may be taken as direct references in many studies that are using the 1000 Genomes Project dataset to retrieve new sets of SNPs, indels and microhaplotypes in various efforts to maximize intelligence from DNA evidence [27,47-50].

Conclusions
We were able to offer a reliable open-access STR database based on the high-coverage (30x) WGS data of Phase 3 of the 1000 Genomes Project generated by the NYGC. However, the limited length of sequencing reads introduces noticeable bias in allele frequencies estimated for Penta D and Penta E. The reliability of this dataset is supported by (a) previous studies attesting that HipSTR is efficient, (b) the Hardy-Weinberg equilibrium analysis, (c) the set of analyses employed to evaluate the interpopulation genetic diversity, and (d) the comparison between the allele frequencies obtained here and the frequencies obtained by other initiatives that used capillary electrophoresis. Although we expect that this open-access database will be of great interest for future forensic studies on population genetics, the current 1000 Genomes Project dataset does not describe human genetic diversity worldwide. In fact, many biogeographical regions, mainly in Oceania and the Americas, have not been sampled, indicating that additional large-scale initiatives may provide further insight into STR diversity in populations worldwide.