MolMarker: A Simple Tool for DNA Fingerprinting Studies and Polymorphic Information Content Calculation
Jahnke, Gizella

Molecular markers and mapping are used to analyze an organism’s genes. They allow the selection of target genomic regions based on marker genotype rather than trait phenotype, facilitate the study of genetic variability and diversity, support the construction of linkage maps, and make it possible to follow individuals or lines carrying particular genes. They may be used to select parental genotypes, remove linkage drag in back-crossing, and select for difficult-to-measure traits. Because genetic diversity in crops is limited, the gene pools of wild crop relatives have been explored for future agricultural production. The development of RFLP (Restriction Fragment Length Polymorphism) for linkage mapping paved the way for other classical approaches such as RAPD (Random Amplified Polymorphic DNA) and AFLP (Amplified Fragment Length Polymorphism). It also raised the need to describe an ideal marker through its polymorphic information content (PIC). The reliability of marker selection depends on the linkage between the marker and the genomic region of interest. Although informativeness must be estimated when designing genetic studies, readily available tools for this are lacking. We previously developed PICcalc to calculate heterozygosity (H) and PIC and thereby simplify molecular investigations. These two values were adjusted for dominant and co-dominant markers (binary and allelic data) to measure polymorphism quality. Due to the popularity of the PICcalc web application, we developed a downloadable version called MolMarker, which offers extra functionality and does not depend on continuous server maintenance.


Introduction
The primary means to study the genetic features of an organism rely on genetic markers and mapping. Molecular markers are the major tools to identify genomic regions involved in the control of traits of interest. They also facilitate selection for the target genomic regions on the basis of marker genotype rather than the phenotype of the concerned trait [1]. For example, these markers play a key role in studies on genetic variability and diversity, construction of linkage maps, and tracking individuals or lines carrying particular genes. They can be used to select and pair parental genotypes or eliminate linkage drag in back-crossing and select traits that are difficult to measure using phenotypic assays [2]. Molecular markers have many other applications, including in phylogenetics and systematics, conservation biology, molecular ecology, developmental biology, forensics, disease testing, and paternity assessment [3].
The pivotal role of molecular markers can be seen in plant breeding, where developing improved varieties is crucial for food security on a global scale. Given the continuously increasing human population, declining agricultural resources, and the stresses generated by climate change, plant breeding is expected to make greater contributions in increasingly shorter time frames [1]. In some cases, due to the lack of genetic diversity in crops, efforts have been made to explore the gene pools of wild species for potential utilization in meeting the future challenges of crop production. Thus, the main aim of breeding programs nowadays is to trace diversity and to find new traits, particularly genes conferring resistance to diseases and pests present in wild genetic resources. This is done to maintain current levels of agricultural productivity, and molecular markers are essential tools in this process.
In recent years, many promising new alternative molecular marker techniques have been developed. This was largely due to rapid growth in genomic research, which initiated a trend away from random DNA markers toward gene-targeted functional markers. Due to the rapid expanse of several public genomic databases and next-generation sequencing technologies, the development of such functional markers located in or near candidate genes of interest has become relatively simple. With the advent of genome sequencing projects, high throughput genotyping-by-sequencing (GBS) methods eliminated the need to create individual genetic markers [4]. However, numerous species lack sufficient genome data for GBS methods, and in these cases, the use of PCR amplification remains an important tool for marker development.
The development of restriction fragment length polymorphism (RFLP) for linkage mapping in humans by Botstein et al. [5] not only created the possibility for the development of other classical methods, such as random amplified polymorphic DNA (RAPD) and amplified fragment length polymorphism (AFLP), but also defined the measure of an ideal marker by describing polymorphic information content (PIC). The reliability of marker selection depends mainly on the strength of linkage between the marker and the genomic region of interest. For the accurate design of genetic studies, such estimates must be calculated to describe the informativeness of the markers. However, easily accessible calculators for that purpose are scarce. To simplify the work of molecular studies, we previously developed the online tool PICcalc [6] for the calculation of heterozygosity (H) [7] and PIC. These two values were adjusted for both dominant and co-dominant markers (both binary and allelic data) to measure the quality or informativeness of the polymorphism of the genetic marker. Currently, PICcalc is the only accessible program that can easily calculate these values for genetic studies in various organisms [8][9][10]. Due to the popularity of and high demand for the PICcalc web application, we sought to develop a downloadable version with additional features that could operate independently of continuous server maintenance. In addition, MolMarker has an easy-to-learn, user-friendly graphical user interface (GUI). Java was used as the programming language, which provides platform independence. The software consists of a core application and associated plugins, which makes it straightforward to incorporate new algorithms. The core application is responsible for the service and display of the GUI, the projects, and the data, as well as for some simple computations.
The plugins carry out the following operations: PIC and H calculation, database editing, construction of dendrograms, calculation of parent-offspring relations, and null allele estimation.
Here, we present our software MolMarker v1.0 (Jahnke G. and Smidla J.; Veszprém, Hungary) (Figure 1), which integrates the key features of PICcalc and also provides various novel functions for genetic marker analyses based on DNA fingerprinting techniques. The user-friendly software has a graphical user interface (GUI) and is platform-independent (Java application).
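As an illustration of the H and PIC values discussed above, the following minimal Java sketch computes both for a single co-dominant locus from its allele frequencies. It assumes the standard formulas H = 1 - Σ p_i² and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j² (the latter following Botstein et al.); the class and method names are hypothetical, the adjustment for dominant markers is not shown, and this is not the MolMarker source code.

```java
/** Sketch: heterozygosity (H) and PIC for one co-dominant locus (hypothetical names). */
final class PicSketch {

    /** H = 1 - sum(p_i^2), where p_i are the allele frequencies of the locus. */
    static double heterozygosity(double[] freqs) {
        double sumSquares = 0.0;
        for (double p : freqs) {
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2. */
    static double pic(double[] freqs) {
        double cross = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            for (int j = i + 1; j < freqs.length; j++) {
                cross += 2.0 * freqs[i] * freqs[i] * freqs[j] * freqs[j];
            }
        }
        return heterozygosity(freqs) - cross;
    }

    public static void main(String[] args) {
        // Assumed example frequencies; they must sum to 1.
        double[] freqs = {0.5, 0.3, 0.2};
        System.out.println("H   = " + heterozygosity(freqs));
        System.out.println("PIC = " + pic(freqs));
    }
}
```

For the example frequencies 0.5, 0.3, and 0.2, this gives H of about 0.62 and PIC of about 0.548. For two equally frequent alleles, PIC reaches its biallelic maximum of 0.375.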

Programming Language and IDE
Java is a general-purpose, object-oriented programming language. Object-oriented means that the basic units of the software developed are the so-called objects which allow the modular structure of the program and its subsequent further development. Another major advantage of this programming language is its platform independence, which means that the software developed can be run on any operating system simply by installing the appropriate "Java Virtual Machine" (JVM) on the computer (operating system) [11,12]. The JVM is available for the vast majority of operating systems in use today.
To develop the software on the Windows Vista operating system, the NetBeans 8.0 integrated development environment was used. This IDE allows programmers to write, compile, test, and debug applications, and then profile and deploy the programs [13]. NetBeans supports not only Java but also other programming languages. The NetBeans IDE can be extended with other modules [14], is free to use, has no restrictions on its use, and effectively supports the creation of GUI applications, allowing the development of user-friendly software [15].

UPGMA Algorithm
The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm [16,17] is a simple hierarchical clustering procedure that reconstructs phylogenetic trees (dendrograms) from a similarity matrix.
This method is the simplest for constructing phylogenetic trees. Its main drawback is that it assumes the same evolutionary rate for all lineages, i.e., the mutation rate is constant over time (molecular clock theory) [18]. This means that the final apices (leaves) are equidistant from the tree root. As it is highly unlikely that each branch will have the same mutation rate, UPGMA often generates a tree with faulty topology. The algorithm generates a rooted, ultrametric tree and has a run time of O(n²) [19].
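The clustering procedure described above can be sketched in Java as follows. This is a minimal illustration, not the MolMarker implementation: the class and method names are hypothetical, the input is assumed to be a symmetric distance matrix (for example, 1 minus the similarity), and the output is only the tree topology in parenthesized form. The naive search for the closest pair makes this sketch O(n³); the O(n²) bound cited above requires a more careful implementation.

```java
import java.util.Arrays;

/** Sketch of UPGMA clustering on a symmetric distance matrix (hypothetical names). */
final class UpgmaSketch {

    /** Returns the tree topology in parenthesized form, e.g. "((A,B),C);". */
    static String cluster(String[] names, double[][] dist) {
        int n = names.length;
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++) {
            d[i] = dist[i].clone();          // work on a copy of the input
        }
        String[] label = names.clone();
        int[] size = new int[n];             // number of leaves in each cluster
        boolean[] active = new boolean[n];
        Arrays.fill(size, 1);
        Arrays.fill(active, true);

        int root = 0;
        for (int step = 0; step < n - 1; step++) {
            // Find the closest pair of active clusters.
            int a = -1, b = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                if (!active[i]) continue;
                for (int j = i + 1; j < n; j++) {
                    if (active[j] && d[i][j] < best) {
                        best = d[i][j];
                        a = i;
                        b = j;
                    }
                }
            }
            // Merge b into a: the new distance is the size-weighted arithmetic mean.
            for (int k = 0; k < n; k++) {
                if (!active[k] || k == a || k == b) continue;
                double avg = (size[a] * d[a][k] + size[b] * d[b][k]) / (size[a] + size[b]);
                d[a][k] = avg;
                d[k][a] = avg;
            }
            label[a] = "(" + label[a] + "," + label[b] + ")";
            size[a] += size[b];
            active[b] = false;
            root = a;
        }
        return label[root] + ";";
    }

    public static void main(String[] args) {
        String[] names = {"A", "B", "C", "D"};
        double[][] dist = {
            {0.0, 0.2, 0.8, 0.8},
            {0.2, 0.0, 0.8, 0.8},
            {0.8, 0.8, 0.0, 0.3},
            {0.8, 0.8, 0.3, 0.0}
        };
        System.out.println(cluster(names, dist)); // prints ((A,B),(C,D));
    }
}
```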

Neighbor-Joining Algorithm
The neighbor-joining algorithm [20,21] is also used to reconstruct phylogenetic trees, and it additionally determines the lengths of the individual branches. In each cycle, the "nearest vertices" of the tree, called neighbors, are selected. This is repeated in each cycle until all vertices are paired [22].
The algorithm takes a distance matrix as input and sequentially modifies the original star topology tree while minimizing the sum of branch lengths, thus approximating the so-called minimum-evolution method [23][24][25]. The algorithm has a run time of O(n³).
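One cycle of neighbor joining can be sketched in Java as shown below. The sketch assumes the usual formulation: the pair minimizing Q(i,j) = (n - 2)·d(i,j) - r_i - r_j is joined (r_i being the row sum of the distance matrix), branch lengths to the new node are derived from the row sums, and the matrix is reduced by one row and column. All names and the example matrix are illustrative and not taken from MolMarker.

```java
/** Sketch of one neighbor-joining cycle on a distance matrix (hypothetical names). */
final class NeighborJoiningSketch {

    /** The pair chosen in one cycle and its branch lengths to the new internal node. */
    static final class Join {
        final int i, j;
        final double lengthI, lengthJ;

        Join(int i, int j, double lengthI, double lengthJ) {
            this.i = i;
            this.j = j;
            this.lengthI = lengthI;
            this.lengthJ = lengthJ;
        }
    }

    /** Selects the neighbors minimizing Q(i,j); requires at least three taxa. */
    static Join selectPair(double[][] d) {
        int n = d.length;
        double[] r = new double[n];                  // row sums
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                r[i] += d[i][k];
            }
        }
        int bi = -1, bj = -1;
        double bestQ = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double q = (n - 2) * d[i][j] - r[i] - r[j];
                if (q < bestQ) {
                    bestQ = q;
                    bi = i;
                    bj = j;
                }
            }
        }
        double li = d[bi][bj] / 2.0 + (r[bi] - r[bj]) / (2.0 * (n - 2));
        return new Join(bi, bj, li, d[bi][bj] - li);
    }

    /** Reduced matrix: the new node sits at index 0, remaining taxa keep their order. */
    static double[][] reduce(double[][] d, Join join) {
        int n = d.length;
        int[] rest = new int[n - 2];
        int idx = 0;
        for (int k = 0; k < n; k++) {
            if (k != join.i && k != join.j) rest[idx++] = k;
        }
        double[][] out = new double[n - 1][n - 1];
        for (int a = 0; a < rest.length; a++) {
            int k = rest[a];
            double toNew = (d[join.i][k] + d[join.j][k] - d[join.i][join.j]) / 2.0;
            out[0][a + 1] = toNew;
            out[a + 1][0] = toNew;
            for (int b = 0; b < rest.length; b++) {
                out[a + 1][b + 1] = d[k][rest[b]];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] d = {
            {0, 5, 9, 9, 8},
            {5, 0, 10, 10, 9},
            {9, 10, 0, 8, 7},
            {9, 10, 8, 0, 3},
            {8, 9, 7, 3, 0}
        };
        Join join = selectPair(d);
        System.out.println("join " + join.i + " and " + join.j
                + ", branch lengths " + join.lengthI + " and " + join.lengthJ);
    }
}
```

Repeating `selectPair` and `reduce` until only two nodes remain yields the full unrooted tree; unlike UPGMA, the two branch lengths of a joined pair may differ, so no molecular clock is assumed.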

The Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm was first formulated by Dempster and colleagues [26]. This algorithm is an iterative method designed to provide maximum likelihood estimates of the parameters of statistical models where the model itself depends on missing or hidden data. The EM iteration consists of the following two steps:
Step 1, E (Expectation): The missing data are estimated by computing their conditional expected value based on the current estimates of the parameters.
Step 2, M (Maximization): Based on the data calculated in the previous step and the existing data, a new estimate of the model parameters is made by maximizing the likelihood function.
The iterations are continued until the difference between the previous and the current value of the likelihood function is less than a predefined, sufficiently small value.
The EM algorithm can be used to estimate the frequency of null alleles in PCR-based genetic markers. Heterozygotes carrying the null allele are indistinguishable from homozygotes carrying the detectable allele, so the null allele can be considered hidden data. A further problem is that if no product is obtained in the PCR reaction (missing data), there are two possible reasons: either the tested individual is homozygous for the null allele at the locus, or the genotyping failed due to some other error.
In the MolMarker software, the EM algorithm developed by Kalinowski and Taper [27] was implemented to estimate null alleles.
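To make the E and M steps concrete, the following Java sketch estimates allele frequencies at one locus with a null allele. It is a deliberately simplified gene-counting variant and not the Kalinowski and Taper method used in MolMarker: it assumes Hardy-Weinberg proportions and treats every sample without a PCR product as a null homozygote, so genotyping failure is not modeled. The class name, method names, and input layout are hypothetical.

```java
import java.util.Arrays;

/** Simplified EM sketch for null allele frequency (hypothetical names and inputs). */
final class NullAlleleEmSketch {

    /**
     * @param homo   homo[i] = individuals showing only visible allele i
     * @param het    het[i]  = visible heterozygotes that carry allele i
     * @param blanks individuals with no product, assumed here to be null homozygotes
     * @return frequencies of the visible alleles, followed by the null allele
     */
    static double[] estimate(int[] homo, int[] het, int blanks, int maxIter, double tol) {
        int k = homo.length;
        int hetCopies = 0;
        int n = blanks;
        for (int i = 0; i < k; i++) {
            n += homo[i];
            hetCopies += het[i];
        }
        n += hetCopies / 2;                  // each heterozygote carries two visible alleles

        double[] p = new double[k + 1];
        Arrays.fill(p, 1.0 / (k + 1));       // start from equal frequencies
        double previous = Double.NEGATIVE_INFINITY;

        for (int iter = 0; iter < maxIter; iter++) {
            double pn = p[k];
            double nullCopies = 2.0 * blanks;
            double[] next = new double[k + 1];
            for (int i = 0; i < k; i++) {
                // E-step: expected number of apparent homozygotes that are allele i / null.
                double carriers = (homo[i] == 0)
                        ? 0.0
                        : homo[i] * 2.0 * pn / (p[i] + 2.0 * pn);
                // M-step: re-count allele copies given the expected genotypes.
                next[i] = (2.0 * homo[i] - carriers + het[i]) / (2.0 * n);
                nullCopies += carriers;
            }
            next[k] = nullCopies / (2.0 * n);
            p = next;

            // Stop when the log-likelihood improves by less than tol.
            double current = logLikelihood(p, homo, het, blanks);
            if (current - previous < tol) break;
            previous = current;
        }
        return p;
    }

    /** Log-likelihood up to an additive constant. */
    static double logLikelihood(double[] p, int[] homo, int[] het, int blanks) {
        int k = homo.length;
        double pn = p[k];
        double ll = 0.0;
        for (int i = 0; i < k; i++) {
            if (homo[i] > 0) ll += homo[i] * Math.log(p[i] * p[i] + 2.0 * p[i] * pn);
            if (het[i] > 0) ll += het[i] * Math.log(p[i]);
        }
        if (blanks > 0) ll += 2.0 * blanks * Math.log(pn);
        return ll;
    }

    public static void main(String[] args) {
        // Assumed counts: 45 "A only", 21 "B only", 30 A/B heterozygotes, 4 blanks.
        double[] p = estimate(new int[]{45, 21}, new int[]{30, 30}, 4, 1000, 1e-12);
        System.out.println(Arrays.toString(p));
    }
}
```

The example counts correspond exactly to Hardy-Weinberg expectations for frequencies 0.5, 0.3, and 0.2 (null) in 100 individuals, so the estimates should approach these values.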

Description of the Software and Its Functionalities
The menu structure of MolMarker is provided in Table 1. After installation, new projects can be created or input files can be read by the software. MolMarker employs semicolon-delimited files as input, described as 'molecular', for isozymes or other types of biochemical markers, or 'genetic' type input files, coded in binary (presence/absence) format. Input files are further described in the manual and example files are provided in the software package. During data management, it is possible to upload the data entered into an online database (Figure 2). The MolMarker.sql file, which is available online (also attached to this article as Supplementary Material), is used to create the web SQL database.
Summary statistics, including allele frequencies, H, and PIC, can be displayed or saved under 'Display/Summary Statistics' or 'Save/Summary Statistics'. Before the allele frequencies are displayed, it is necessary to indicate at which loci a null allele is possible (Figure 3). A screenshot of the summary statistics display is shown in Figure 4.
Similarity matrices can also be obtained based on the Jaccard, simple matching (SM) [17,28], Czekanowski-Dice [29][30][31], and Ochiai [32] coefficients (Figure 5). The first snapshot of relationships among samples is displayed by the UPGMA and Neighbor-Joining methods (Figure 6).
As these methods have been superseded by more advanced approaches, we recommend using MolMarker for data exploration and subjecting the data set to more rigorous analysis with other programs. The parentage analysis option provides a list of possible parent-offspring combinations and the likelihood ratio statistics corresponding to the detected combinations.
Using the intuitive graphical user interface, basic marker statistics for genetic studies can be obtained with MolMarker. The software is open source and can be downloaded for free [33]. The software has been downloaded 351 times since its registration (Figure 7).
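The four similarity coefficients mentioned in this section can be illustrated with a short Java sketch for two binary (presence/absence) band profiles. It assumes the standard definitions based on the counts a (band present in both samples), b and c (present in only one), and d (absent in both); the names are hypothetical and the code is not taken from MolMarker.

```java
/** Sketch of similarity coefficients for two 0/1 band profiles (hypothetical names). */
final class SimilaritySketch {

    /** Returns {a, b, c, d}: both present, only x, only y, both absent. */
    static int[] counts(int[] x, int[] y) {
        int[] c = new int[4];
        for (int i = 0; i < x.length; i++) {
            if (x[i] == 1 && y[i] == 1) c[0]++;
            else if (x[i] == 1) c[1]++;
            else if (y[i] == 1) c[2]++;
            else c[3]++;
        }
        return c;
    }

    // Jaccard, Dice, and Ochiai are undefined (NaN) when a sample has no bands at all.

    static double jaccard(int[] x, int[] y) {
        int[] c = counts(x, y);
        return (double) c[0] / (c[0] + c[1] + c[2]);
    }

    static double simpleMatching(int[] x, int[] y) {
        int[] c = counts(x, y);
        return (double) (c[0] + c[3]) / (c[0] + c[1] + c[2] + c[3]);
    }

    static double dice(int[] x, int[] y) {
        int[] c = counts(x, y);
        return 2.0 * c[0] / (2.0 * c[0] + c[1] + c[2]);
    }

    static double ochiai(int[] x, int[] y) {
        int[] c = counts(x, y);
        return c[0] / Math.sqrt((double) (c[0] + c[1]) * (c[0] + c[2]));
    }

    public static void main(String[] args) {
        int[] sample1 = {1, 1, 0, 0, 1};
        int[] sample2 = {1, 0, 1, 0, 1};
        System.out.println("Jaccard         = " + jaccard(sample1, sample2));
        System.out.println("Simple matching = " + simpleMatching(sample1, sample2));
        System.out.println("Dice            = " + dice(sample1, sample2));
        System.out.println("Ochiai          = " + ochiai(sample1, sample2));
    }
}
```

Only the simple matching coefficient counts shared absences (d) as agreement, which is why it can differ markedly from the other three when most bands are absent in both samples.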

Discussion
Research studies based on molecular markers frequently use a large number of samples or, if the sample size is small, analyze multiple alleles of a single molecular marker to increase the reliability of the study, which makes it almost inconceivable to evaluate the results without computational support. Although the currently available software can process specific data sets, it is often necessary to compare and evaluate research data from different types of markers from several perspectives. No single program currently covers these needs: researchers use numerous (usually 5-10) different programs, many of which are general-purpose spreadsheets or statistical programs, and there is a strong demand for an "all-in-one" downloadable software package.
For example, a general-purpose spreadsheet (e.g., MS Excel) is most commonly used to calculate marker summary statistics, while the calculation of the similarity matrix and dendrograms is carried out in the statistical software package SPSS [34]. For parentage analyses, Identity 1.0 [35] is often employed; it calculates a wide range of statistics but is not user-friendly, which allows errors to accumulate during data entry and is especially cumbersome for large sample sizes. The highly popular web-based application PICcalc [6] was previously used to calculate PIC and H values. For maximum likelihood-based null allele estimation [36], ML-NULL [27] is often used, which also suffers from data entry difficulties.

Conclusions
The primary aim of this study was to develop open-access software with a user-friendly graphical interface that is suitable for the multi-objective evaluation of molecular marker datasets. The goals were achieved using the Java programming language, and further development is possible through the integration of new plugins.
Author Contributions: Conceptualization, G.J. and P.P.; methodology, G.J. and J.S.; software, G.J. and J.S.; validation, G.J. and P.P.; writing-original draft preparation, P.P. and G.J.; writing-review and editing, all authors. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.