Entropy-Based Approach in Selection Exact String-Matching Algorithms

The string-matching paradigm is applied in almost every branch of computer science, and in science in general. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic efficiency is a property of an algorithm's execution, identified with the computational resources the algorithm consumes; resource usage during execution can be measured, and for maximum efficiency the goal is to minimize it. Standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed operations. Without touching on the problematics of computer power consumption or memory usage, which also depend on the algorithm type and the techniques used in algorithm development, we have developed a methodology that enables researchers to choose an efficient algorithm for a specific domain. The efficiency of string searching algorithms is usually evaluated independently of the domain of the texts being searched. This paper presents the idea that algorithm efficiency depends on the properties of the searched pattern and of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through the character comparison count metric, a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and provides, for that domain, a ranking of algorithms according to the patterns' entropy.
The proposed approach is limited to on-line exact string-matching problems and is based on the information entropy of the search pattern. Meticulous empirical testing illustrates the implementation of the methodology and supports its soundness.


Introduction
String-matching processes are part of applications in many areas: information retrieval, information analysis, computational biology, multiple practical software implementations in all operating systems, etc. String matching forms the basis for other computer science fields, and it is one of the most researched areas in theory as well as in practice. The increasing amount and availability of textual data require the development of new approaches and tools to search useful information more effectively in such large amounts of data. Different string-matching algorithms perform better or worse depending on the properties of the pattern and the text. Most exact string-matching algorithms compare the pattern with the text through a window of size m, commonly referred to as the sliding window mechanism (or search window). In the process of comparing the main text T[1…n] and a pattern P[1…m], where m ≤ n, the aim is to find all occurrences, if any, of the exact pattern in the text (Figure 1). The result of comparing the pattern with the text is the information that they match, if they are equal, or that they mismatch. During the comparison phase, the window and the pattern must be of equal length. First, one aligns the window with the text's left end and compares the characters of the window with the characters of the pattern. After an exact match (or a mismatch) of the pattern with the text, the window is moved to the right. The same procedure repeats until the right end of the window reaches the right end of the text [11–15].
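The sliding window mechanism described above can be sketched as a naïve exact search. This is a minimal illustration of the window/comparison loop, not one of the algorithms evaluated later in the paper:

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Slide a window of size m over the text and report every exact match position."""
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):              # align the window at position i
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                          # compare window and pattern left to right
        if j == m:                          # all m characters matched
            occurrences.append(i)
    return occurrences
```

For example, `naive_search("TCGTAACTAACT", "AACT")` returns `[4, 8]`, reporting both occurrences of the pattern.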

Methodology Description
A state-of-the-art survey shows a lack of a platform-independent methodology that would help choose an algorithm for searching a specific string pattern. The proposed approach for evaluating exact string pattern matching algorithms is formalized in a methodology consisting of six steps, shown in Figure 2, to build a model applicable to data sets and algorithms in a given domain. The first step of the proposed methodology is selecting representative texts for domain model building. In the second step, the algorithms are selected; the selection is limited only to the algorithms one wants to consider. After selecting the representative texts and the algorithms, the searching phase for representative patterns starts in the third step. Representative patterns can be text substrings, or they can be randomly generated from the domain alphabet. In the searching phase, all representative patterns are searched for with the algorithms selected in the second step. Search results are collected and expressed in specific metrics. In the fourth step, the patterns' entropy is calculated. In the fifth step, entropy discretization is applied: entropy results are discretized and divided into groups by frequency distribution [16,17].

The last step is to classify the algorithms in the built model and present the obtained ranking of the algorithms according to the proposed approach.

Figure 2. Methodology for building a model based on the entropy approach for string search algorithm selection.

Representative Patterns Sample Size
The sample size of representative patterns is determined by Equation (1) for a finite population [18–20]:

n = (z^2 · p(1 − p) / ε^2) / (1 + z^2 · p(1 − p) / (ε^2 · N))    (1)

where n is the sample size, z is the z-score, ε is the margin of error, N is the population size, and p is the population proportion. The commonly used confidence levels are 90%, 95%, and 99%, each with a corresponding z-score provided by tables (a confidence level of 0.95 is used in the experiment, with z-score 1.65). The margin of error is the maximum distance by which the sample estimate is allowed to deviate from the real value. A population proportion describes the percentage of the population that has the property of interest.
In theory, we are dealing with an unlimited population, since patterns and texts can have an unlimited number of characters. However, in practice, we have to limit the population to a finite number [14,16]. The maximum number of classes in the discretization phase is determined by Equation (2), where n is the total number of observations in the data [16,17]. Also, the range of the data should be calculated by finding the minimum and maximum values. The range is used to determine the class interval, or class width, by Equation (3) [16,17]:

class width = range / number of classes    (3)
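As an illustration, the sample-size and discretization computations can be sketched as follows. This sketch assumes that Equation (1) is the standard finite-population sample-size formula and that Equation (2) follows Sturges' rule, a common textbook choice for the number of classes; both are assumptions made here, not reproductions of the paper's own equations:

```python
import math

def sample_size(N: int, z: float = 1.96, eps: float = 0.01, p: float = 0.5) -> int:
    """Required sample size for a finite population of size N (assumed Equation (1))."""
    n0 = z * z * p * (1 - p) / (eps * eps)   # infinite-population estimate
    return math.ceil(n0 / (1 + n0 / N))      # finite-population correction

def num_classes(n: int) -> int:
    """Maximum number of discretization classes; Sturges' rule is assumed here."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(minimum: float, maximum: float, k: int) -> float:
    """Class interval: the data range divided by the number of classes (Equation (3))."""
    return (maximum - minimum) / k
```

The z-score, margin of error, and population proportion defaults above (1.96, 0.01, 0.5) are illustrative values, not the exact parameters used in the experiment.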

Entropy
Shannon entropy is a widely used concept in information theory, conveying the amount of information contained in a message. Entropy is a standard measure of the state of order or, better, disorder of symbols in a sequence. The entropy of a sequence of characters describes its complexity, compressibility, and amount of information [21–25].
Suppose that the events A1, A2, …, An are defined and that they make a complete set, i.e., the following expression is valid:

p1 + p2 + … + pn = 1,

where pi is the probability of the event Ai. A finite system α holds all events Ai, i = 1, 2, …, n, with their corresponding probabilities pi. The system α will be denoted in the following form (Equation (4)) [22]:

α = (A1, A2, …, An; p1, p2, …, pn)    (4)

The states of the system α are the events Ai, i = 1, 2, …, n. System α is a discrete system with a finite set of states. Every finite system describes some state of uncertainty, because it is impossible to know which state the system is in at a specific time. The goal is to express such uncertainty quantitatively in some way. This means that a function should be defined which joins a specific number to the system α; in that way, the system obtains a measure of its uncertainty [22].
The function which quantitatively measures the uncertainty of a system is called the entropy of the system, and it is defined by the following Equation (5) [22,26]:

H(α) = −(p1 log p1 + p2 log p2 + … + pn log pn)    (5)

Entropy 2021, 23, 31
The entropy of a system is denoted by H(α). If pi = 0, it follows that pi log pi = 0. In information theory, the logarithm base is usually 2, and the entropy unit is called a bit (binary digit). The entropy is zero only if one of the probabilities pi, i = 1, …, n, is equal to 1 and the others are 0. In that case, there is no uncertainty, since it is possible to predict the system's state precisely. In any other case, entropy is a positive number [22,26].
If the system α contains test results, the degree of uncertainty before a test is executed is equal to the entropy of the system α. When the test is executed, the degree of uncertainty is zero. The amount of information gained by test execution is larger if the uncertainty before the test was bigger. The information given after the test, denoted by ϑ(α), is equal to the entropy of the system α (Equation (6)) [22,27]:

ϑ(α) = H(α)    (6)

Another measure from information theory is Kolmogorov complexity. Although Kolmogorov complexity looks similar to Shannon entropy, they are conceptually different measures. Shannon entropy is the smallest number of bits required for the optimal encoding of a string, while Kolmogorov complexity is the minimum number of bits (the minimum description length) from which a particular string can effectively be reconstructed [28–30].

Formal Metric Description
The metrics and quality attributes used for the analysis of string searching algorithms raise several issues: the quality of the framework (when a quality model does not define a metric), the lack of an ontology (when architectural concepts need quantification), the lack of an adequate formalism (when metrics are defined with a formalism that requires a strong mathematical background, which reduces their usability), the lack of computational support (when metrics do not come with tools for their collection), the lack of flexibility (when metric collection tools are not available in open-source form, which limits the ability to modify them), and the lack of validation (when cross-validation is not performed). All these issues complicate determining which properties and measures would be useful in metric selection and result presentation [31].
Two main approaches exist for expressing the speed of an algorithm. The first approach is formal: analyzing algorithm complexity through time efficiency (time complexity, the time required) and space efficiency (space complexity, the memory required). The second approach is empirical: measuring the usage of particular computer resources. Objective and informative metrics should accompany each approach [2,31–35].
Algorithmic efficiency analysis shows the amount of work an algorithm needs for execution, while algorithm performance is a feature of hardware showing how fast the execution will be. Formal metrics are usually used for efficiency analysis. A commonly used formal metric is Big O notation (Landau's symbol), which represents the complexity of an algorithm as a function of the input size and describes an upper bound on the search time in the worst case. Empirical metrics, such as algorithm execution run time (usually presented in milliseconds), processor and memory usage, temporary disk usage, long-term disk usage, power consumption, etc., are usually used for algorithm performance analysis. The run-time metric is difficult to describe analytically, so empirical evaluation through experiments using execution run-time measurements is needed [4,9,11,14,36–42].
The proposed methodology focuses on evaluating the speed of algorithms using the character comparisons (CC) metric. The CC metric is the number of characters of the pattern compared with characters of the text. The character comparison metric is independent of programming language, computational resources, and operating system, which means that it is platform-independent, like formal approaches. At the same time, the CC metric tracks empirical measures such as execution run time in some segments, and it can be programmatically implemented and used like empirical approaches. Thus, this metric combines the formal and the empirical approach [9].
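As a minimal sketch of the CC metric, a brute-force search can be instrumented with a comparison counter. This is an illustrative implementation, not the measurement framework used in the experiments:

```python
def brute_force_cc(text: str, pattern: str) -> tuple[list[int], int]:
    """Brute-force exact search that also returns the character comparison (CC) count."""
    n, m = len(text), len(pattern)
    occurrences, cc = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            cc += 1                         # every pattern/text character test is counted
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            occurrences.append(i)
    return occurrences, cc
```

Searching "AA" in "AAAA" finds positions [0, 1, 2] with 6 comparisons; a more efficient algorithm would report the same occurrences with a lower CC count, which is exactly what the model ranks.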

Methodology Implementation
The application of the proposed methodology is presented in this paper for two domains. For the genome (DNA) domain, the implemented methodology is depicted in Figure 3, and for the natural language domain it is shown in Figure 4. The result is the entropy-based model for selecting the most efficient algorithm for any pattern searched in a particular domain. In the following sections, each step is described in detail.

Selection of Representative Texts for Domain Model Building
For the DNA domain, the selected representative texts are the genome data of four different species. For the natural language domain, the selected representative texts are English texts from the Canterbury Corpus [43]. The length of a DNA sequence, expressed in base pairs (bp), varies from a few thousand to several million and even billion bp. The DNA character strings are formed over the 4-letter alphabet {A, C, G, T}. The lengths of the texts from the Canterbury Corpus are expressed in bytes, and the texts are formed over the English alphabet [a-z|A-Z|0-9|!|]. We used the Bible subset as the text to be searched because it is more representative of natural English text than the other convenient word lists, and it is publicly released [2,21,44,45].
In detail, the following publicly available representative texts are used for model building:
• DNA sequences of nucleotides for the DNA domain, among them Chelonia mydas (green sea turtle; NW_006571126.1 Chelonia mydas unplaced genomic scaffold, CheMyd_1.0 scaffold1, whole genome shotgun sequence, 7.392.783 bp, 7.1 Mb) [47]
• English texts (the Bible subset) from the Canterbury Corpus for the natural language domain

Selection of Algorithms
Seven commonly used string matching algorithms have been chosen to be ranked with the proposed model: brute force/naïve (BF), Boyer-Moore (BM), Knuth-Morris-Pratt (KMP), Apostolico-Crochemore (AC), quick search (QS), Morris-Pratt (MP), and Horspool (HOR) [12,39,51–56]. The selected algorithms belong to the group of software-based algorithms that use exact string-matching techniques with a character comparison approach (the classical approach) [11]. All selected algorithms used in this experiment match their published versions [3,12,39], which might represent the best implementation of the original algorithm [57]. These seven string search algorithms are selected as our baseline for model construction; however, any exact string-matching algorithm that can be evaluated with the character comparison metric can be ranked with the proposed model.

Searching Results for Representative Patterns
For model development, design, and construction in step 3 of the model building (Figure 2), we used 9.725 different patterns. For the DNA domain model, 7.682 patterns are used, and 2.043 patterns of English text from the Canterbury Corpus are used for the natural language domain. The length of patterns ranges from 2 to 32 characters. At least 4.269 patterns for the DNA domain and 1.685 patterns for the natural language domain are needed to achieve a confidence level of 95% that the real value is within ±1% of the surveyed value (Equation (1)). With this confidence level, it can be concluded that the model objectively reflects the modeled domain, since it is constructed with an adequate sample size.

Patterns Entropy Calculation
Searched patterns are grouped into classes according to their entropy. Entropy is calculated using Equations (5) and (6). For example, for P = TCGTAACT, after counting the occurrences of each character in the pattern (A = 2, C = 2, G = 1, T = 3), the probabilities are, respectively, 0.25, 0.25, 0.125, and 0.375. So for the pattern TCGTAACT the calculated entropy is 1.90563906222957. The entropy for a given pattern from the English text, P = "e name of the LORD. And the LORD", is accordingly 3.698391111. Entropy values are rounded to two decimals (i.e., the entropy of the pattern TCGTAACT is 1.91, and the entropy of the English text pattern in the above example is 3.70).
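The worked example above can be reproduced with a short sketch of the entropy calculation:

```python
from collections import Counter
from math import log2

def shannon_entropy(pattern: str) -> float:
    """H(P) = -sum(p_i * log2(p_i)) over the character frequencies of the pattern."""
    total = len(pattern)
    return -sum((c / total) * log2(c / total) for c in Counter(pattern).values())

# P = "TCGTAACT": A=2, C=2, G=1, T=3 -> probabilities 0.25, 0.25, 0.125, 0.375
h = shannon_entropy("TCGTAACT")   # 1.90563906..., rounded to two decimals: 1.91
```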

Entropies Discretization
The next phase is grouping the data into classes, i.e., making a frequency distribution. The calculated entropies are discretized into classes created by frequency distribution, which displays the number of observations or results in a sample or a given interval. Classes do not need to be represented by the same number of patterns. Table 1 is a section of the overall pattern entropy classification for the DNA domain. Table 2 shows the entropy classes after discretization, with the number of patterns in each of them. Table 3 is a section of the overall pattern entropy classification for the natural language domain. The maximum number of classes after applying Equation (2) is 9, and the width of the classes after applying Equation (3) is 0.46. Table 4 shows the entropy classes for the natural language domain after discretization, with the number of patterns in each of them.
Entropy classes containing a small number of patterns affect the model the least, since such patterns are rare and occur in less than 0.5% of cases. Examples of such patterns are TTTTTTTTCTTTTTTT, AAAGAAA, and LL. When a pattern does not belong to any entropy class, the closest entropy class is taken as the relevant class.
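An equal-width frequency-distribution discretization of the computed entropies might look like the following sketch; the paper's actual class boundaries are the ones given in Tables 2 and 4, so this is only an assumed binning scheme:

```python
def discretize(entropies: list[float], k: int) -> list[int]:
    """Group entropy values into k equal-width classes and count patterns per class."""
    lo, hi = min(entropies), max(entropies)
    width = (hi - lo) / k                        # class width: range / number of classes
    counts = [0] * k
    for h in entropies:
        idx = min(int((h - lo) / width), k - 1)  # clamp the maximum into the last class
        counts[idx] += 1
    return counts
```

Classes produced this way have equal width but, as noted above, generally hold different numbers of patterns.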

Classification of Algorithms in the Built Model
The algorithm analysis results, integrated into a model, provide a ranking list of algorithms by their efficiency, measured with the character comparison metric and correlated with the searched pattern entropy. More efficient algorithms perform fewer character comparisons when finding a pattern in a text. Based on the entropy class to which the observed pattern belongs, the model proposes the more efficient algorithm for string matching.
The results presented in Tables 5 and 6 give a ranking list of the selected algorithms grouped by entropy class. The percentages shown in the result tables represent the proportion of pattern searching results for a particular algorithm that are smaller or greater than those of the remaining algorithms inside the quartile. For example, in Table 5, if the searched pattern belongs to entropy class 1 (the number of representative patterns is 9), 55.88% of the searching results for the given entropy class with the QS algorithm are in the first quartile, 14.71% in the second quartile, and 29.41% in the third quartile (Figure 5). When patterns are searched with the BM algorithm, 47.92% of the searching results expressed as CC count are in the first quartile, 23.53% in the second quartile, 25% in the third quartile, and 8.33% in the fourth quartile. In this case, for a given pattern, the built model suggests the QS algorithm as the most efficient one. The selected algorithm is considered the optimal algorithm that will make fewer character comparisons (CC) than the others for most searched patterns belonging to entropy class 1.
Entropy classes in Table 5 are defined in Table 2.
In Table 5, for entropy class 8 (the number of representative patterns searched is 1451), the model shows that the BM algorithm is the most efficient. In 61.95% of cases for patterns in entropy class 8, the BM algorithm made the fewest character comparisons versus the other six algorithms evaluated with the model. In 24.38% of cases BM was second best, in 13.68% of cases it was third, and it was never the worst.
Entropy classes in Table 6 are defined in Table 4. In Table 6, for example, for entropy class 6 (the number of representative patterns searched is 393), the model shows that the QS algorithm is the most efficient. In 70.13% of cases for patterns in entropy class 6, the QS algorithm made the fewest character comparisons versus the other six algorithms evaluated with the model. In 29.87% of cases, QS was second best, and it was never the worst. For entropy class 7 (the number of representative patterns searched is 283), the model shows that the BM algorithm is the most efficient. In 65.02% of cases for patterns in entropy class 7, the BM algorithm made the fewest character comparisons versus the other six algorithms evaluated with the model. In 34.98% of cases, BM was second best, and it was never the worst.
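The quartile-based ranking described above can be sketched as follows. The input layout and function names are assumptions for illustration, not the authors' implementation: for each pattern in an entropy class, the algorithms are ranked by their CC count and binned into quartiles, and the algorithm with the largest first-quartile share is suggested:

```python
def quartile_profile(cc_by_algo):
    """cc_by_algo: {algorithm: [CC count per searched pattern]} for one
    entropy class. Returns, per algorithm, the share of its searching
    results falling in each quartile (a sketch of Tables 5 and 6)."""
    algos = list(cc_by_algo)
    n_pat = len(next(iter(cc_by_algo.values())))
    n_alg = len(algos)
    profile = {a: [0, 0, 0, 0] for a in algos}
    for i in range(n_pat):
        ranked = sorted(algos, key=lambda a: cc_by_algo[a][i])
        for rank, a in enumerate(ranked):
            q = min(rank * 4 // n_alg, 3)    # quartile index 0..3
            profile[a][q] += 1
    return {a: [c / n_pat for c in qs] for a, qs in profile.items()}

def suggest_algorithm(cc_by_algo):
    """Suggest the algorithm with the largest first-quartile share."""
    prof = quartile_profile(cc_by_algo)
    return max(prof, key=lambda a: prof[a][0])

# Hypothetical CC counts for two patterns in one entropy class:
cc = {"QS": [3, 2], "BM": [5, 4], "KMP": [9, 9], "BF": [12, 11]}
print(suggest_algorithm(cc))   # → QS
```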

Methodology Validation and Discussion
For model validation, the seventh and ninth entropy classes (961 and 4692 patterns) were selected for the DNA domain, and the sixth (393 patterns) and ninth (221 patterns) classes were selected for the natural language domain. The model classes chosen for validation have the highest number of representative patterns and are characteristic of the specific domains.
The patterns selected for validation are not part of the pattern set with which the model was created. For the DNA domain model, a different text is also chosen for validation: the DNA domain model is validated with the DNA sequence Homo sapiens isolate HG00514 chromosome 9 genomic scaffold HS_NIOH_CHR9_SCAFFOLD_1, whole genome shotgun sequence, 43,213,237 bp, 39 Mb, as the text [58]. The natural language domain is validated with the natural language text set from the Canterbury Corpus [43]. Before the model validation process, a check was made to see whether the selected patterns were sufficiently representative for model validation. The check was done with the central limit theorem. The set of patterns used in the validation phase has a normal distribution (Figure 6, Mean = 1.900, Std. Dev = 0.064), as does the set of patterns used in model building (Figure 7, Mean = 1.901, Std. Dev = 0.059), which means that the patterns used to validate the model represent the domain.
The discretized character comparisons of the other entropy classes of patterns also follow the normal distribution. The basis of the model validation phase is to verify whether the test results differ from the developed model results presented in Tables 5 and 6.
For comparing the two data sets (model results and test results), the double-scaled Euclidean distance and the Pearson correlation coefficient were used.
The double-scaled Euclidean distance normalizes the raw Euclidean distance into a range of 0–1, where 1 represents the maximum discrepancy between the two variables. The first step in comparing two datasets with the double-scaled Euclidean method is to compute the maximum possible squared discrepancy (md) per variable i of the v observed variables in the data set: md_i = (maximum for variable i − minimum for variable i)², where 0 (0%) is used as the minimum and 1 (100%) as the maximum value for the double-scaled Euclidean distance. The second step produces the scaled Euclidean distance, in which the sum of squared discrepancies per variable is divided by the maximum possible discrepancy for that variable, Equation (7).
The final step is dividing the scaled Euclidean distance by the square root of v, where v is the number of observed variables, Equation (8). The double-scaled Euclidean distance easily turns into a measure of similarity by subtracting it from 1.0 [16,59-62]. Applying Equation (8) to Table 7, column "Scaled Euclidean (d1)", gives a double-scaled Euclidean distance of 0.227. Subtracting the double-scaled Euclidean distance from 1 gives a similarity coefficient of 0.773, or 77%. Table 8 shows the results of the calculated double-scaled Euclidean distance and the corresponding similarity coefficient. Converting the double-scaled Euclidean distance to a context of similarity, it is possible to conclude that the built model matches the validation results with a high degree of similarity. The seventh and ninth classes from the built model for the DNA domain have a similarity coefficient of 77% with their validation results. The sixth and ninth classes from the built model for the natural language domain also show high similarity with their validation results, 80% and 86%, respectively. The results for the validated classes obtained in the validation process are extremely similar to the results from the built model: the proportion of searched pattern character comparisons for a particular algorithm inside each quartile is similar to that of the built model.
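Under the definitions above, Equations (7) and (8) can be sketched as follows, assuming the compared variables are proportions in [0, 1] (so md_i = 1 for every variable), as stated in the text:

```python
from math import sqrt

def double_scaled_euclid(x, y, lo=0.0, hi=1.0):
    """Double-scaled Euclidean distance in [0, 1] between two equal-length
    vectors. Step 1: md = (hi - lo)^2 per variable; step 2: scaled distance
    (Equation (7)); step 3: divide by sqrt(v) (Equation (8))."""
    v = len(x)
    md = (hi - lo) ** 2                      # max possible squared discrepancy
    scaled = sqrt(sum((a - b) ** 2 / md for a, b in zip(x, y)))
    return scaled / sqrt(v)

def similarity(x, y):
    """Similarity coefficient: 1 minus the double-scaled distance."""
    return 1.0 - double_scaled_euclid(x, y)
```

Identical quartile profiles give a similarity of 1.0; maximally discrepant profiles give 0.0, matching the 0-1 normalization described above.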
The Pearson correlation coefficient is used to check the correlation between data from the model and data from the validation phase. Pearson correlation coefficients per class are shown in Table 9. The seventh and ninth classes from the built model for the DNA domain have a linear Pearson correlation with their validation results, as do the sixth and ninth classes from the natural language domain's built model. The Pearson correlation coefficients shown in Figure 8 indicate that the values from the built model (x-axis, Model) and their corresponding validation results (y-axis, Validation) follow each other with a strong positive relationship. Using the double-scaled Euclidean distance in the validation process shows a strong similarity between the built model and the validation results. In addition to the similarity, a strong positive relationship exists between the classes selected from the built model and the validation results, proven by the Pearson correlation coefficient.
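For completeness, the Pearson correlation coefficient used in the validation check can be sketched as (a standard textbook formula, not the authors' code):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between model values (x)
    and their corresponding validation values (y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A value close to +1 indicates the strong positive linear relationship between the model and validation results reported in Table 9 and Figure 8.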
The presented results show that it is possible to use the proposed methodology to build a domain model for selecting an optimal algorithm for exact string matching. Besides optimal algorithm selection for a specific domain, this methodology can be used to improve the efficiency of string-matching algorithms in the context of performance, which correlates with empirical measurements.
The data used to build and validate the model can be downloaded from the website [63].

Conclusions
The proposed methodology for ranking algorithms is based on the properties of the searched string and the properties of the texts being searched. Searched strings are classified according to pattern entropy. The methodology expresses algorithm efficiency using a platform-independent metric, thus not depending on algorithm implementation, computer architecture, or programming language characteristics. This work focuses on classical software-based algorithms that use exact string-matching techniques with a character comparison approach; the methodology cannot be used for other types of algorithms. The character comparison metric is platform-independent in the context of formal approaches, but the number of comparisons directly affects the time needed for algorithm execution and the usage of computational resources.
Studying the methodology, complexity, and limitations of all available algorithms is a complicated and long-term task. The paper discusses, in detail, available metrics for evaluating the properties of string searching algorithms and proposes a methodology for building a domain model for selecting an optimal string searching algorithm. The methodology is based on presenting exact string-matching results to express algorithm efficiency regardless of query pattern length and dataset size. For our baseline analysis, we considered the number of compared characters of each algorithm expressed by the searched string entropy. High degrees of similarity and a strong correlation between the validation results and the built model data have been proven, making this methodology a useful tool that can help researchers choose an efficient string-matching algorithm according to their needs and choose a suitable programming environment for developing new algorithms. All that is needed is a pattern from the specific domain by which the model is built, and the model will suggest the optimal algorithm to use.
The defined model finally selects the algorithm that will most likely incur the fewest character comparisons in pattern matching. This research does not intend to evaluate algorithm logic or programming environments in any way; the main reason for comparing the algorithms' results is the construction of the algorithm selection model. The built model is straightforwardly extendable with other algorithms; all that is required is an adequate training data set. Further research is directed at finding additional string characteristics, besides pattern entropy, that could enhance the developed methodology's precision in selecting more efficient string search algorithms.