Genetic diversity of germplasm is assessed by collecting key information, especially: (i) allele number per locus; (ii) genotype number per locus; (iii) gene diversity; (iv) PIC (polymorphism information content) values; (v) observed and expected heterozygosity; (vi) partition of the diversity into its components within and between populations; and (vii) the genetic distance among the analyzed populations. The analyses are usually performed using a variety of molecular markers grouped into two categories: co-dominant markers, such as SSR (simple sequence repeat) and SNP (single nucleotide polymorphism), which are able to identify the allelic state at each locus, and dominant markers, such as ISSR (inter simple sequence repeats), RAPD (random amplified polymorphic DNA), and AFLP (amplified fragment length polymorphism), which usually have a multi-band pattern and are unable to recognize allelic variants [1]. The latter produce a series of bands with unknown relationships (i.e., they could be allelic variants of the same genes or mark different genome regions). Hence, without knowing the allelic state, each band is recorded as a locus with two possible alleles, band presence (scored as 1) or band absence (scored as 0), and the resulting 0/1 matrix is used in statistical analyses. The papers reviewed here comprise data based on co-dominant markers that were often wrongly recorded as the presence/absence of possible bands, leading to a loss of information on allelic variance and on heterozygosity (observed heterozygosity, Ho).
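The loss of information caused by scoring co-dominant data as band presence/absence can be illustrated with a minimal sketch. The genotypes below are hypothetical SSR allele sizes invented for illustration, and the function names are not taken from any of the software packages discussed here.

```python
from collections import Counter

# Hypothetical SSR genotypes (allele sizes in bp) of four diploid individuals at one locus.
genotypes = [(150, 150), (150, 154), (154, 158), (150, 154)]

def observed_heterozygosity(genos):
    """Fraction of individuals carrying two different alleles (Ho)."""
    return sum(a != b for a, b in genos) / len(genos)

def allele_frequencies(genos):
    """Allele frequencies counted over the 2N gene copies of N diploid individuals."""
    counts = Counter(allele for g in genos for allele in g)
    total = 2 * len(genos)
    return {allele: c / total for allele, c in sorted(counts.items())}

def presence_absence(genos):
    """Dominant-style scoring: one 0/1 column per band, one row per individual."""
    bands = sorted({allele for g in genos for allele in g})
    return bands, [[int(b in g) for b in bands] for g in genos]

print(observed_heterozygosity(genotypes))
print(allele_frequencies(genotypes))
print(presence_absence(genotypes))
```

In the 0/1 matrix each band becomes a separate two-state "locus", so the fact that bands 150, 154, and 158 are alleles of the same locus, and therefore the heterozygosity of each individual, can no longer be recovered from the matrix alone.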

The present paper offers a short and simple guide to the principles that form the basis of the most common analyses. It focuses on some of the most widely used computer programs in population genetics that run under Windows, highlighting the advantages and disadvantages of the various software packages and thus facilitating their appropriate selection and use.

#### 1.1. Hardy–Weinberg Principle

Most of the statistical computations use parameters based on the Hardy–Weinberg principle [2,3]. Here, the basis of the principle and its applications are highlighted. As is widely known, the Hardy–Weinberg principle considers the allele and genotype frequencies for a single locus in a population and states: “allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences”. The conditions under which this holds are: (i) no migration, (ii) no mutation, (iii) no selection, (iv) a population size sufficient to avoid drift, and (v) random mating. Unfortunately, this definition of the Hardy–Weinberg principle does not sufficiently focus on other important consequences of the principle such as: “if a population is in equilibrium it is possible to compute the allele frequencies knowing the genotype frequencies and vice-versa by the formula of binomial square development i.e., (p + q)^{2} = p^{2} + q^{2} + 2pq = 1”, where p^{2} is the frequency of the AA genotype, q^{2} the aa genotype frequency, 2pq the Aa genotype frequency, p the A allele frequency, and q the a allele frequency. This equation is true only for a population in Hardy–Weinberg equilibrium. The above applies if only two alleles, A and a, are possible at the locus. If, instead, three alleles may occur at a locus, the formula is the trinomial square development ((p + q + r)^{2} = p^{2} + q^{2} + r^{2} + 2pq + 2pr + 2qr = 1), and so on for higher numbers of alleles. It should be noted that the square terms (i.e., p^{2}, q^{2}, r^{2}, etc.) are homozygote frequencies while the others (i.e., 2pq, 2pr, 2qr, etc.) are heterozygote frequencies. Considering several alleles, i, each with a frequency, p_{i}, the homozygote frequency is Ʃp_{i}^{2} and the heterozygote frequency can be calculated as the complement of the homozygote frequency (i.e., 2pq = 1 − (p^{2} + q^{2}) or, in general, 1 − Ʃp_{i}^{2}).
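The square development for any number of alleles can be sketched in a few lines of Python. This is only an illustration under the equilibrium assumption; the function names and the example frequencies are hypothetical.

```python
from itertools import combinations_with_replacement

def hw_genotype_frequencies(allele_freqs):
    """Expected genotype frequencies at equilibrium: expansion of (p1 + p2 + ...)^2."""
    if abs(sum(allele_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        pa, pb = allele_freqs[a], allele_freqs[b]
        # Square terms are homozygotes, cross terms (2 * pa * pb) are heterozygotes.
        expected[(a, b)] = pa * pa if a == b else 2 * pa * pb
    return expected

def allele_freqs_from_genotypes(genotype_freqs):
    """Allele frequencies from genotype frequencies (each genotype carries two gene copies)."""
    freqs = {}
    for (a, b), f in genotype_freqs.items():
        freqs[a] = freqs.get(a, 0.0) + f / 2
        freqs[b] = freqs.get(b, 0.0) + f / 2
    return freqs

# Hypothetical three-allele locus.
p = {"A1": 0.5, "A2": 0.3, "A3": 0.2}
genotype_freqs = hw_genotype_frequencies(p)
heterozygotes = sum(f for (a, b), f in genotype_freqs.items() if a != b)
print(genotype_freqs)
print(heterozygotes, 1 - sum(x * x for x in p.values()))  # both equal 1 - sum(p_i^2)
print(allele_freqs_from_genotypes(genotype_freqs))
```

The round trip from allele frequencies to genotype frequencies and back returns the starting values, and the summed heterozygote terms match 1 − Ʃp_{i}^{2}.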

#### 1.2. Genetic Diversity

The gene diversity index is calculated for each locus and population according to Nei [4], utilizing the Hardy–Weinberg formula, $He = 1 - \sum_{i=1}^{n} p_{i}^{2}$, hereafter simplified as He = 1 − Ʃp_{i}^{2}, which is the heterozygosity expected if the population is in Hardy–Weinberg equilibrium. By analogy, the genetic identity (J) is Ʃp_{i}^{2} (homozygotes). However, since He can be computed for all populations, including those with non-random mating systems (e.g., autogamous species, which, being pure lines homozygous at all loci, are by definition not in Hardy–Weinberg equilibrium), the preferred term for He is gene diversity, rather than expected heterozygosity.

In a small population, the allele frequencies per locus can be skewed, especially when compared to large populations [5]. The unbiased heterozygosity is the above-mentioned heterozygosity multiplied by the factor 2n/(2n − 1), where n is the number of individuals sampled [6]. As a result, the larger the population, the smaller the difference between the biased and unbiased expected heterozygosity. This detail is often not sufficiently elaborated upon in the literature, as many papers do not mention whether unbiased or biased He is used.
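The difference between the biased and unbiased estimates can be seen in a short sketch. It assumes that n is the number of diploid individuals sampled; the allele frequencies are hypothetical.

```python
def gene_diversity(allele_freqs):
    """Biased gene diversity: He = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in allele_freqs)

def unbiased_gene_diversity(allele_freqs, n_individuals):
    """He multiplied by the small-sample correction factor 2n / (2n - 1)."""
    if n_individuals < 1:
        raise ValueError("at least one individual is required")
    two_n = 2 * n_individuals
    return gene_diversity(allele_freqs) * two_n / (two_n - 1)

freqs = [0.5, 0.375, 0.125]  # hypothetical allele frequencies at one locus
for n in (5, 10, 100, 1000):
    print(n, gene_diversity(freqs), unbiased_gene_diversity(freqs, n))
```

As n grows, the correction factor approaches 1 and the two estimates converge, which is why the distinction matters mostly for small samples such as 10 genotypes per population.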

The variability between and within populations can be calculated according to Nei [4] by taking into account different allele frequencies in whole populations or only in subpopulations. The nomenclature used is: H_{T} for total observed diversity; H_{S} for within-population diversity; and D_{ST} for the between-population diversity, with H_{T} = H_{S} + D_{ST}.

Similarly, Wright’s fixation indices, F_{IS}, F_{ST}, and F_{IT} [7], are often used; these F-statistics are also based on the expected level of heterozygosity. The measures describe the different levels of population structure, such as variance of allele frequencies within populations (F_{IS}), variance of allele frequencies between populations (F_{ST}), and an inbreeding coefficient of an individual relative to the total population (F_{IT}), all of which are related to heterozygosity at various levels of population structure. The terms mentioned above are related by the formula (1 − F_{IT}) = (1 − F_{IS})(1 − F_{ST}), where I is the individual, S the subpopulation, and T the total population. F_{IT} thus refers to the individual in comparison with the total, F_{IS} is the individual in comparison with the subpopulation, and F_{ST} is the subpopulation in comparison with the total. As shown in Figure 1, total F, indicated by F_{IT}, can be partitioned into F_{IS} (or f) and F_{ST} (or θ).

F_{ST} can be calculated using the formula: F_{ST} = (H_{T} − H_{S})/H_{T}, where H_{T} is the proportion of the heterozygotes in the total population and H_{S} the average proportion of heterozygotes in subpopulations.
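A minimal sketch of the three fixation indices follows. It assumes the common heterozygosity-based definitions F_{IS} = (H_{S} − H_{I})/H_{S} and F_{IT} = (H_{T} − H_{I})/H_{T}, in which H_{I} (a symbol not used in the text above) is the mean observed heterozygosity within subpopulations; the input values are hypothetical.

```python
def f_statistics(h_i, h_s, h_t):
    """Wright's fixation indices from heterozygosities at three levels.

    h_i: mean observed heterozygosity within subpopulations (assumed symbol)
    h_s: mean expected heterozygosity within subpopulations
    h_t: expected heterozygosity in the total population
    """
    if h_s <= 0 or h_t <= 0:
        raise ValueError("h_s and h_t must be positive")
    f_is = (h_s - h_i) / h_s   # individual relative to subpopulation
    f_st = (h_t - h_s) / h_t   # subpopulation relative to total
    f_it = (h_t - h_i) / h_t   # individual relative to total
    return f_is, f_st, f_it

f_is, f_st, f_it = f_statistics(h_i=0.2, h_s=0.4, h_t=0.5)
print(f_is, f_st, f_it)
# Under these definitions the indices are linked multiplicatively.
print((1 - f_it), (1 - f_is) * (1 - f_st))
```

Under these definitions 1 − F_{IS} = H_{I}/H_{S} and 1 − F_{ST} = H_{S}/H_{T}, so their product is H_{I}/H_{T} = 1 − F_{IT}.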

In a series of loci, l, in n populations and using the complementary sum of allele frequency (1 − Ʃp_{i}^{2}), different figures can be obtained. In particular:

For each locus and each population, He = 1 − Ʃp_{i(lg)}^{2}, where p_{i(lg)} is the frequency of the ith allele of the lth locus in the gth population.

The average of the above He over populations gives the within-population genetic diversity for each locus, while the average of these values over all loci gives H_{S}. The formula can thus be written as: H_{S} = Ʃ_{l}[Ʃ_{g}(1 − Ʃp_{i(lg)}^{2})/g]/l, where (1 − Ʃp_{i(lg)}^{2}) is the expected heterozygosity for each locus in each population, g the number of populations, and l the number of loci.

The total genetic diversity, H_{T}, is calculated using the allele frequency, p_{i(l)}, for each locus over all populations and calculating the mean over loci: H_{T} = Ʃ_{l}(1 − Ʃp_{i(l)}^{2})/l.

The between population component of diversity is calculated using the formula: D_{ST} = H_{T} − H_{S}.

The between-population component may also be expressed relative to the total genetic diversity (for each locus and over all loci) as G_{ST} = D_{ST}/H_{T} [4].
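The partition of diversity described above can be sketched as follows. The sketch assumes that the allele frequency over all populations is the unweighted mean of the population frequencies (equal population weights) and that G_{ST} is the between-population component divided by the total diversity; the data layout and the numbers are hypothetical.

```python
def gene_diversity(allele_freqs):
    """He = 1 - sum(p_i^2) for one locus in one population."""
    return 1.0 - sum(p * p for p in allele_freqs)

def nei_partition(freqs):
    """Partition diversity into within- and between-population components.

    freqs[l][g] is the list of allele frequencies at locus l in population g,
    with alleles listed in the same order in every population.
    """
    hs_per_locus, ht_per_locus = [], []
    for populations in freqs:
        g = len(populations)
        # Within-population diversity: mean He over populations.
        hs_per_locus.append(sum(gene_diversity(p) for p in populations) / g)
        # Total diversity: He computed on the mean allele frequencies.
        mean_freqs = [sum(column) / g for column in zip(*populations)]
        ht_per_locus.append(gene_diversity(mean_freqs))
    n_loci = len(freqs)
    h_s = sum(hs_per_locus) / n_loci
    h_t = sum(ht_per_locus) / n_loci
    d_st = h_t - h_s
    g_st = d_st / h_t if h_t > 0 else 0.0
    return {"H_S": h_s, "H_T": h_t, "D_ST": d_st, "G_ST": g_st}

# Two hypothetical loci in two populations.
data = [
    [[1.0, 0.0], [0.0, 1.0]],   # locus 1: populations fixed for different alleles
    [[0.5, 0.5], [0.5, 0.5]],   # locus 2: identical frequencies in both populations
]
print(nei_partition(data))
```

Locus 1 contributes only between-population diversity and locus 2 only within-population diversity, so in this toy data set half of the total diversity lies between populations.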

Table 1 shows an example extracted from Turpeinen et al. [8], where different parameters for three populations were analyzed using two markers. The H_{T} for each locus corresponds to the polymorphism information content (PIC) of that locus, in other words, the capacity of that locus (or, better, marker) to assess polymorphism and diversity. Botstein et al. [9] proposed an adjustment of this value as:

$PIC = 1 - \sum_{i=1}^{n} p_{i}^{2} - \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} 2p_{i}^{2}p_{j}^{2}$

where p_{i} and p_{j} are the population frequencies of the ith and jth alleles and n is the number of alleles. The PIC proposed by Botstein and colleagues [9] subtracts from the He value an additional probability (ƩƩ2p_{i}^{2}p_{j}^{2}) due to the fact that linked individuals do not add information to the overall variation.
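The adjustment can be sketched as He minus the double sum of 2p_{i}^{2}p_{j}^{2} taken over all distinct allele pairs. The function names and the example frequencies are hypothetical.

```python
from itertools import combinations

def gene_diversity(allele_freqs):
    """He = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in allele_freqs)

def pic(allele_freqs):
    """PIC: He minus sum over allele pairs i < j of 2 * p_i^2 * p_j^2."""
    correction = sum(2 * (pi ** 2) * (pj ** 2)
                     for pi, pj in combinations(allele_freqs, 2))
    return gene_diversity(allele_freqs) - correction

for freqs in ([0.5, 0.5], [0.25, 0.25, 0.25, 0.25], [0.9, 0.1]):
    print(freqs, gene_diversity(freqs), pic(freqs))
```

The PIC is never larger than He, and the gap between the two shrinks as the number of alleles grows or as one allele becomes predominant.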

#### 1.3. Genetic Distance

Genetic diversity (He) and genetic identity (J or Ho) are also used to estimate the genetic distance within and between populations, since two populations with high identity in their genes are closer than two with high diversity. If J_{x} = Ʃp_{xi}^{2} is the probability of identity in population x, with p_{xi} the frequency of the i-th allele, and J_{y} = Ʃp_{yi}^{2} is the probability of identity in population y, the probability of identity between the two populations is J_{xy} = Ʃp_{xi}p_{yi}, as described by Nei [10,11]. The normalized identity between populations x and y over all loci is I = J_{xy}/√(J_{x}J_{y}) and, in turn, the genetic distance is D = −ln I = −ln(J_{xy}/√(J_{x}J_{y})). In a small sample set with many loci, any biases can be corrected using Ď = −ln(G_{xy}/√(G_{x}G_{y})), where G_{x} and G_{y} are the means of (2n_{x}J_{x} − 1)/(2n_{x} − 1) and (2n_{y}J_{y} − 1)/(2n_{y} − 1) over the l loci studied, respectively, and G_{xy} = J_{xy} [12]. In this case, Ď could be negative, due to sampling errors, and is then considered as zero.
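Both distances can be sketched as below. The sketch assumes that J_{x}, J_{y}, and J_{xy} are averaged over loci before the ratio is taken, that n_{x} and n_{y} are the numbers of individuals sampled per population, and that alleles are listed in the same order in both populations; the data layout and example values are hypothetical.

```python
import math

def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def nei_standard_distance(pop_x, pop_y):
    """D = -ln(Jxy / sqrt(Jx * Jy)); pop[l] is the allele frequency list at locus l."""
    j_x = _mean(sum(p * p for p in locus) for locus in pop_x)
    j_y = _mean(sum(p * p for p in locus) for locus in pop_y)
    j_xy = _mean(sum(px * py for px, py in zip(lx, ly))
                 for lx, ly in zip(pop_x, pop_y))
    if j_xy == 0:
        return math.inf  # no shared alleles at any locus
    return -math.log(j_xy / math.sqrt(j_x * j_y))

def nei_unbiased_distance(pop_x, pop_y, n_x, n_y):
    """Small-sample corrected distance; negative estimates are set to zero."""
    g_x = _mean((2 * n_x * sum(p * p for p in locus) - 1) / (2 * n_x - 1)
                for locus in pop_x)
    g_y = _mean((2 * n_y * sum(p * p for p in locus) - 1) / (2 * n_y - 1)
                for locus in pop_y)
    g_xy = _mean(sum(px * py for px, py in zip(lx, ly))
                 for lx, ly in zip(pop_x, pop_y))
    if g_xy == 0:
        return math.inf
    return max(-math.log(g_xy / math.sqrt(g_x * g_y)), 0.0)

# One hypothetical locus with two alleles in two populations.
x = [[0.5, 0.5]]
y = [[1.0, 0.0]]
print(nei_standard_distance(x, y))
print(nei_unbiased_distance(x, y, n_x=10, n_y=10))
print(nei_unbiased_distance(x, x, n_x=10, n_y=10))  # negative estimate clamped to 0.0
```

Comparing a population with itself shows why the clamp is needed: the correction lowers G_{x} and G_{y} below J_{xy}, so the raw estimate becomes slightly negative.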

Various software packages can be used to calculate the above-mentioned parameters; they often use different parameters and have their own advantages and disadvantages. In general, for the analyses of genetic diversity, the characteristics required in statistical software are: (i) precision (no bugs), accuracy, and reproducibility; (ii) user friendliness (e.g., no need for command-line scripts); (iii) clear output in terms of graphical options; and (iv) open access. This paper compares some software packages that run under Microsoft Windows and are generally used for population genetic analyses. The software packages assessed are:

Software description and comparison is carried out using examples of data obtained with SSR markers (hence, co-dominant) on nine durum wheat populations from three Ethiopian regions as described by Mondini et al. [20]. For the purpose of this assessment, the analyses of 10 genotypes per population are reported.