Empirical Convergence Theory of Harmony Search Algorithm for Box-Constrained Discrete Optimization of Convex Function

Abstract: The harmony search (HS) algorithm is an evolutionary computation technique inspired by music improvisation. So far, it has been applied to various scientific and engineering optimization problems, including project scheduling, structural design, energy system operation, car lane detection, ecological conservation, model parameter calibration, portfolio management, banking fraud detection, law enforcement, disease spread modeling, cancer detection, astronomical observation, music composition, fine art appreciation, and sudoku puzzle solving. While there are many application-oriented papers, only a few papers exist on how HS performs in finding optimal solutions. Thus, this preliminary study proposes a new approach to show how HS converges on an optimal solution under specific conditions. Here, we introduce a distance concept and prove the convergence based on the empirical probability. Moreover, a numerical example is provided to explain the theorem.


Introduction
In recent years, many researchers have utilized various nature-inspired metaheuristic algorithms for solving scientific and engineering optimization problems. One of the popular algorithms is harmony search (HS), which was inspired by jazz improvisation [1]. As a musician plays a musical note from its memory or randomly, HS generates a value from its memory or randomly. As a new harmony, which is composed of musical notes, is evaluated in each practice and memorized if it is good, a new solution vector, which is composed of values, is evaluated in each iteration and memorized if it performs well. This HS optimization process, which utilizes three basic operations (memory consideration, pitch adjustment, and random selection), continues until it finds an optimal solution vector [2].
When compared with other algorithms, HS shares several similarities with them [1]. HS is similar to tabu search with respect to keeping past vectors in a memory called harmony memory (HM). In addition, HS can use the adaptive parameters HMCR (Harmony Memory Consideration Rate) and PAR (Pitch Adjustment Rate), which play a role similar to the varying temperature in simulated annealing. Furthermore, both HS and the genetic algorithm (GA) manage multiple vectors simultaneously. However, there are also differences between HS and GA. While HS generates a new vector by considering all the existing vectors, GA generates the new vector by considering only two of the existing vectors (the parents). In addition, HS considers each variable in a vector independently, while GA cannot because its major operation is crossover, which keeps the gene sequence (multiple variables) together.
While there are many applications proposed so far, only a few studies have been dedicated to the theoretical background of the HS algorithm. Beyer [26] dealt with the expected population variance of several evolutionary algorithms, and Das et al. [27] proposed an approximated variance of the expectation of solution candidates and discussed the exploratory power of the HS algorithm. However, there has been no study discussing the convergence of the HS algorithm while the convergence of other optimization algorithms has been discussed in some studies [28][29][30][31].
As an evolutionary computation algorithm, HS considers an optimization problem of the form: minimize f(x), x = (x_1, ..., x_n), subject to x_i^L ≤ x_i ≤ x_i^U (i = 1, ..., n), where x_i^L and x_i^U are the lower and upper bounds of each variable. If all the bounds are infinite, the above problem becomes an unconstrained optimization problem. However, most practical optimization problems adopt variables with a certain prefixed range of values and, therefore, become box-constrained problems [32].
In this paper, we propose a convergence theory based on the empirical probability for HS, which is one of the box-constrained optimization algorithms. In particular, we consider convex optimization functions for both single-variable and multiple-variable cases. For this, we define a discrete sequence to prove convergence using a distance concept. This new approach can be applied to algorithms, such as HS, that keep a memory of improved solutions and iteratively update it.

Basic Structure of Harmony Search
The HS algorithm is an optimization method inspired by musical phenomena. This algorithm mimics the performing process that occurs when a musician searches for a better harmonic sound, such as in jazz improvisation. Jazz improvisation seeks a musically pleasing harmony (perfect state) determined by aesthetic standards, just as the optimization process seeks a global solution (perfect state) determined by an objective function. Just as the pitch of each instrument determines the aesthetic quality of the harmony, the value of each decision/design variable determines the solution quality of the objective function.
In order to optimize a problem using HS, we first define the set of all possible candidate values, called the candidate set (universal set) Λ. Let us assume that Λ includes K candidate values, i.e., Λ = {x(1), x(2), ..., x(K)}.
Here, the memory storage HM of the HS algorithm can be expressed as in Equation (2). To initialize HM, we randomly choose values from the universal set and generate as many vectors as HMS (harmony memory size, i.e., the number of vectors stored in HM). The value of the objective function is also kept next to each solution vector.
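As a rough illustration, the initialization step can be sketched in Python (a minimal sketch; the helper name `initialize_hm`, the two-variable candidate sets, and the objective function are hypothetical, not from the paper):

```python
import random

def initialize_hm(candidates, hms, f):
    """Fill the harmony memory with HMS random vectors drawn from the
    candidate sets; each vector is stored next to its objective value."""
    hm = []
    for _ in range(hms):
        x = [random.choice(cand_i) for cand_i in candidates]  # one value per variable
        hm.append((x, f(x)))
    return hm

# Hypothetical example: two discrete variables, minimize f(x) = x1^2 + x2^2
random.seed(0)
candidates = [[-2, -1, 0, 1, 2], [-3, 0, 3]]
hm = initialize_hm(candidates, hms=4, f=lambda x: x[0] ** 2 + x[1] ** 2)
```

Each of the HMS rows holds one solution vector together with its objective value, mirroring Equation (2).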
Once HM is prepared, it is refreshed with better solution vectors iteration by iteration. If a newly generated vector x^New is better than the worst vector x^Worst stored in HM in terms of the objective function value, the new vector replaces the worst one. The value of variable x_i (i = 1, 2, ..., n) can be randomly selected from the set of all candidate discrete values Λ_i = {x_i(1), x_i(2), ..., x_i(K_i)} with a probability of P_R (random selection rate). Otherwise, the value of x_i can be selected from the set of stored values, named HM_i, which is the i-th column of HM, with a probability of P_M (the probability with which the value is selected solely from HM); or, once x_i(k_i^P) is selected from HM_i, it can be slightly adjusted by moving it to a neighboring value x_i(k_i^P + m) with a probability of P_P (the probability with which the value is selected using pitch adjustment), where i = 1, ..., n; k_i^R, k_i^P ∈ {1, ..., K_i}; h ∈ {1, ..., HMS}; P_R + P_M + P_P = 1 (100%); 0 ≤ P_R, P_M, P_P ≤ 1; and m is some non-zero integer such that x_i(k_i^P + m) ∈ Λ_i. Here, m is a predetermined parameter for adjustment, that is, m = ±1, ±2, ±3, etc. For example, if m = ±1, then we take one of the immediate neighbors x_i(k_i^P − 1) or x_i(k_i^P + 1). For P_M and P_P, we first define HMCR by HMCR = 1 − P_R. Then, after we define PAR, P_M and P_P are given by P_M = HMCR·(1 − PAR) and P_P = HMCR·PAR.
Here, for the pitch adjustment, we first select x_i^h randomly from HM_i using a uniform distribution between 0 and 1, i.e., h = Integer(Rand(0, 1) × HMS + 0.5). Then, we identify k_i^P, which satisfies x_i^h = x_i(k_i^P), where k_i^P ∈ {1, ..., K_i}. Next, we further tweak x_i(k_i^P) into x_i(k_i^P + m), where x_i(k_i^P + m) ∈ Λ_i. This is the basic structure of the HS algorithm [33]. Although there are many structural variants of the HS algorithm [34], most variants retain the above-mentioned three operations: memory consideration, pitch adjustment, and random selection.
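A single HS iteration built from the three operations can be sketched as follows (a hedged Python sketch under the notation above; the helper names `improvise` and `update_hm` and the demo parameter values are hypothetical):

```python
import random

def improvise(hm, candidates, hmcr, par, m=1):
    """Generate one new vector with the three HS operations:
    random selection (prob. P_R = 1 - HMCR), memory consideration
    (prob. P_M = HMCR*(1 - PAR)), and pitch adjustment to a value
    m steps away (prob. P_P = HMCR*PAR)."""
    x_new = []
    for i, cand_i in enumerate(candidates):
        if random.random() >= hmcr:                        # random selection
            value = random.choice(cand_i)
        else:                                              # memory consideration
            value = random.choice([vec[i] for vec, _ in hm])
            if random.random() < par:                      # pitch adjustment
                k = cand_i.index(value)
                step = m if random.random() < 0.5 else -m
                if 0 <= k + step < len(cand_i):            # keep the value inside Lambda_i
                    value = cand_i[k + step]
        x_new.append(value)
    return x_new

def update_hm(hm, x_new, f):
    """Replace the worst stored vector when the new vector performs better."""
    worst = max(range(len(hm)), key=lambda h: hm[h][1])
    if f(x_new) < hm[worst][1]:
        hm[worst] = (x_new, f(x_new))

# Hypothetical demo: one variable, minimize f(x) = x^2 on {-5, ..., 5}
random.seed(1)
candidates = [sorted(range(-5, 6))]
f = lambda x: x[0] ** 2
hm = [([c], f([c])) for c in (-5, 5, 4, -3)]               # HMS = 4
for _ in range(200):
    update_hm(hm, improvise(hm, candidates, hmcr=0.9, par=0.3), f)
best = min(fx for _, fx in hm)
```

Because the worst row is replaced only by a better vector, the best stored value can never worsen across iterations.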

Solution Formula for Harmony Search without Pitch Adjustment
Let us formulate the solution for the HS algorithm without pitch adjustment. At the first generation, x_i^1 (i = 1, ..., n) will be chosen with the given probabilities, where i = 1, ..., n; k_i^R ∈ {1, ..., K_i}; h ∈ {1, ..., HMS}; P_R + P_M = 1; and HM_i^0 is the initial harmony memory for x_i. Let x_i^g be the solution with the given probabilities at the g-th generation stage, where HM_i^{g−1} is the newly updated i-th column of HM after the (g − 1)-th generation. At the first generation, x^1 = (x_1^1, x_2^1, ..., x_n^1) is obtained by two operations (random selection or memory consideration). If the newly generated vector x^1 is better than x_{worst}^0, which is the worst vector in HM^0 in terms of the objective function value, then x_{worst}^0 will be replaced with x^1. Otherwise, the worst vector x_{worst}^0 will stay in the memory. The element that we get from this comparison after the first generation is denoted by x_{new}^1. Following a similar procedure, at the g-th generation, if x^g is better than x_{worst}^{g−1}, which is the worst solution vector in HM^{g−1}, then x_{worst}^{g−1} will be replaced with x^g. Otherwise, the worst element x_{worst}^{g−1} will stay in the memory. The element that we get from this comparison after the g-th generation is denoted by x_{new}^g. Using an indicator function, the solution formula for x_{new}^g in Equation (7) can be equivalently represented by: x_{new}^g = x^g · 1{f(x^g) < f(x_{worst}^{g−1})} + x_{worst}^{g−1} · 1{f(x^g) ≥ f(x_{worst}^{g−1})}, where 1{·} is the indicator function that equals 1 when its condition holds and 0 otherwise.
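The per-variable selection rule above can also be written in case form; a reconstructed sketch in the paper's notation (the layout is ours, not the original typesetting):

```latex
x_i^{g} \leftarrow
\begin{cases}
x_i(k_i^R) \in \Lambda_i, & \text{with probability } P_R, \\[2pt]
x_i^{h} \in HM_i^{\,g-1}, & \text{with probability } P_M = 1 - P_R,
\end{cases}
\qquad i = 1, \dots, n.
```

Each variable is drawn either uniformly from its candidate set Λ_i or uniformly from the stored column HM_i^{g−1}.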

Solution Formula for Harmony Search with Pitch Adjustment
Next, let us formulate the solution for the HS algorithm with pitch adjustment. At the first generation, x_i^1 (i = 1, ..., n) will be chosen with the given probabilities, where i = 1, ..., n; k_i^R, k_i^P ∈ {1, ..., K_i}; h ∈ {1, ..., HMS}; P_R + P_M + P_P = 1; and HM_i^0 is the initial harmony memory. Then, let x_i^g be the solution with the given probabilities at the g-th generation stage, where HM_i^{g−1} is the newly updated i-th column of HM after the (g − 1)-th generation. At the first generation, x^1 = (x_1^1, ..., x_n^1) is obtained by three operations (random selection, memory consideration, or pitch adjustment). If x^1 is better than x_{worst}^0, which is the worst vector in HM^0 in terms of the objective function value, then x_{worst}^0 is replaced with x^1, and the result of this comparison is denoted by x_{new}^1. Therefore, after the first generation, HM^1 is updated by substituting x_{worst}^0 with x_{new}^1. Then, at the g-th generation, x_{new}^g ← x^g if x^g performs better than x_{worst}^{g−1}, which is the solution vector that performs the worst in HM^{g−1}. Therefore, after the g-th generation, HM^g is updated by substituting x_{worst}^{g−1} with x_{new}^g. Let x_{new}^g = x̂^g. Then, using an indicator function, the solution formula for x̂^g in Equation (13) can be represented by: x̂^g = x^g · 1{f(x^g) < f(x_{worst}^{g−1})} + x_{worst}^{g−1} · 1{f(x^g) ≥ f(x_{worst}^{g−1})}, where 1{·} is the indicator function that equals 1 when its condition holds and 0 otherwise.

One Variable Case
In this section, we discuss the behavior of the solutions in HM. Let us first consider f(x), a function of one discrete variable x. Without loss of generality, we can assume that Λ = {x(k) ∈ R^1 | k = 1, ..., K} satisfies x(1) < x(2) < ... < x(k) < ... < x(K) after rearrangement, and we have K − 1 subintervals separated by x(1), x(2), ..., x(K) (see Figure 1). In addition, x^1 ≤ ... ≤ x^h ≤ ... ≤ x^{HMS} after rearrangement. The h-th element of HM at the g-th generation will be written as x^{h,g}.
Here, let us define the total ranges of the candidate set Λ and HM: |Λ| = x(K) − x(1) and |HM| = x^{HMS} − x^1. In addition, we define the distances between neighboring endpoints, which are the lengths of the subintervals, as follows: δ_k = x(k + 1) − x(k), k = 1, ..., K − 1, and we define the minimum and maximum values of δ_k as follows: δ_m = min_k δ_k and δ_M = max_k δ_k. Here, note that we have an important relation: |Λ| = Σ_{k=1}^{K−1} δ_k ≥ |HM|. As we mentioned, we deal with the convex function with discrete variables for its minimum optimal solution. However, note that the theory can also be applied to the concave function for the maximum optimal solution without loss of generality. If f(x) has the minimum value at x^{h*}, then x^{h*} can be located in four different ways. Figure 2 shows these four different cases.
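The subinterval quantities can be checked numerically. In this sketch, the candidate values are hypothetical, chosen so that the subinterval lengths match the figures quoted later in Example 1 (2, 1, 0.7, 2.3, 3, 2):

```python
# Hypothetical candidate set x(1) < ... < x(K) on the real line
cand = [0.0, 2.0, 3.0, 3.7, 6.0, 9.0, 11.0]

# delta_k = x(k+1) - x(k): the lengths of the K - 1 subintervals
deltas = [b - a for a, b in zip(cand, cand[1:])]
delta_m, delta_M = min(deltas), max(deltas)   # minimum and maximum subinterval lengths
total_range = cand[-1] - cand[0]              # |Lambda| = x(K) - x(1)
```

By construction, the subinterval lengths sum to the total range |Λ|.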
(a) and (b) show the convex objective functions. In this paper, we focus on the convex cases where x^{h*} is located in the middle of the range of HM; the argument also applies to the concave cases. Moreover, the theory can also be applied to monotone increasing/decreasing objective functions such as (c) and (d) as trivial cases, where x^{h*} is located at the right or left end point.
By Equations (5) and (7), the worst element of HM is replaced with a better element as g is increased. Even though not every generation replaces the worst element of HM with a new one, it is true that the worst element of HM is replaced as g is increased. Now, let us consider the case when HM is updated at the g-th generation. From Figure 2, we can consider the following properties.
(a) & (c): Because x^1 = x_{worst}, the left-end smallest-value point x^1 will be replaced by some value x′, which shows better performance than x^1, except in the case of x′ = x^1. Here, note that x^1 ≤ x′ ≤ x^{HMS}. The smallest-value element x^1 can be newly updated by x′ or x^2. Consider that we are at the g-th generation. Then, the previous x^1 is denoted as x^{1,g−1}, and the newly updated smallest one from HM will be denoted as x^{1,g}. Here, x^{1,g} = x^{2,g−1} if x^{2,g−1} < x′, and x^{1,g} = x′ otherwise. It is clear that x^{1,g} ≥ x^{1,g−1}.
(b) & (d): Because x^{HMS} = x_{worst}, the right end point x^{HMS} will be replaced by some value x′, which shows better performance than x^{HMS}, except in the case of x′ = x^{HMS}. Here, note that x^1 ≤ x′ ≤ x^{HMS}. The largest-value element x^{HMS} can be updated by x′ or x^{HMS−1}. Consider that we are at the g-th generation. Then, the previous x^{HMS} is denoted as x^{HMS,g−1}, and the newly updated largest one from HM will be denoted as x^{HMS,g}. Here, x^{HMS,g} = x^{(HMS−1),g−1} if x^{(HMS−1),g−1} > x′, and x^{HMS,g} = x′ otherwise. It is clear that x^{HMS,g} ≤ x^{HMS,g−1}. Now we prove the following theorem.
Theorem 1. Assume that there exists exactly one solution in the candidate set Λ for the objective function f(x). If D_g = |HM^g| = x^{HMS,g} − x^{1,g} is the length of HM at the g-th generation, then (i) {D_g} is a monotone decreasing sequence as g is increased, and (ii) the values in HM converge.

Proof. (i) As we have seen from properties (a)-(d) above, we have x^{HMS,g} − x^{1,g} ≤ x^{HMS,g−1} − x^{1,g−1}, which means the sequence {D_g} is a monotone decreasing sequence. Note that not every generation updates HM. Therefore, we need a notation for the generations that update HM. If the worst element of HM is replaced by a new element x′ at the g_1-th, g_2-th, ... generations such that x^{HMS,g_l} − x^{1,g_l} < x^{HMS,g_{l−1}} − x^{1,g_{l−1}}, (25) then {g_l | l = 1, 2, 3, ...} is the set of generation numbers that update HM, where each update reduces x^{HMS} − x^1 in HM. It is clear that {g_l} is a subsequence of {g}. Note that, if there is more than one worst solution in HM, then, even if HM is updated, the update may not satisfy Equation (25); the generation number at that time is not an element of {g_l}. Now, either D_{g_l} > 0 or D_{g_l} = 0 is true. Equivalently, either D_{g_l} ≥ δ_m or D_{g_l} = 0 is true because, in the case of D_{g_l} > 0, the two endpoints of HM are distinct candidate values, so D_{g_l} ≥ δ_m. Furthermore, for the smallest subinterval length δ_m, it is easily seen that D_{g_l} ≤ D_{g_{l−1}} − δ_m. Here, we need to check the existence of {g_l} because we are not sure whether HM is steadily updated as g is increased. That is, we are not sure whether x* (= argmin f(x)) is selected at a certain generation. This is finally guaranteed by the relation between the empirical probability and the theoretical probability: if we repeat the generations, x* is surely selected at some generation (see the Remark below). Therefore, each generation number g_l (l = 1, 2, ...) that updates HM exists.
If we repeat this procedure l times, we get: D_{g_l} ≤ D_0 − l·δ_m, where D_0 = |HM^0| is the initial range of HM. Here, {D_{g_l}} is a subsequence of {D_g}. The equality D_{g_l} = D_0 − l·δ_m is unreasonable from a practical point of view; therefore, we assume that D_{g_l} < D_0 − l·δ_m. Since {D_g} is monotone decreasing, {D_{g_l}} is monotone decreasing as well. Furthermore, {D_g} is bounded below by 0; that is, 0 ≤ D_{g_l} < D_0 − l·δ_m. Now, let us define ε_l as follows: ε_l = D_0 − l·δ_m. Since ε_l > D_{g_l} ≥ 0, ε_l can become negative as l increases, but we consider only the case of ε_l > 0. Therefore, let ε_l > 0 for l = 1, 2, 3, ..., L. Since the ε_l (l = 1, 2, 3, ..., L) take discrete values, we see that {ε_l} is a positive decreasing sequence (see Figure 3a). As g is increased, g_1, g_2, ... exist empirically. Therefore, every time HM is updated, ε_l is decreased. That is, as l increases, ε_1 > ε_2 > ... > ε_l > ... > ε_L > 0. (32)
Assume that the solutions in HM converge after the g_t-th generation. Then, D_{g_t} = 0. It is still true that ε_t > 0 because δ_m is the minimum length of the subintervals. Therefore, we continue the procedure even after the solutions in HM converge, until ε_l is very small but positive. We repeat the generations s more times after the g_t-th generation until we satisfy: D_{g_{t+s}} < D_0 − (t + s)·δ_m = ε_{t+s}, with ε_{t+s} > 0, (35) but D_{g_{t+s+1}} < D_0 − (t + s + 1)·δ_m = ε_{t+s+1}, with ε_{t+s+1} ≤ 0, (36) for some positive integer s.
However, Equation (36) is not possible because we assume that ε_l > 0, so we stop the procedure after s more generations. Therefore, ε_{t+s} = ε_L. Here, ε_L is still a positive constant, as in Figure 3b. Since we have either D_{g_l} ≥ δ_m or D_{g_l} = 0, Equations (35)-(37) guarantee that D_{g_{t+s}} = D_{g_L} is now 0. It means that, at the g_L-th generation, it is guaranteed that x^{1,g_L} = x^{2,g_L} = ... = x^{HMS,g_L} = x*. (38) In fact, Equation (38) has already been satisfied from the g_t-th generation, but now the convergence is finally guaranteed mathematically by inequality (34). That is, the values in HM are convergent before the g_L-th generation. Therefore, we can conclude that there exists some t < L, for positive integers t and L, such that HM^{g_t} is convergent.
Therefore, the values in HM converge empirically, which proves the theorem.
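Theorem 1(i) can be observed numerically. The sketch below runs HS-style updates (random selection vs. memory consideration only; the parameter values and candidate set are hypothetical) on a strictly convex f and records D_g. Following the paper's assumption that the accepted new value lies within the current range of HM, the recorded sequence never increases:

```python
import random

random.seed(42)
cand = [0.0, 2.0, 3.0, 3.7, 6.0, 9.0, 11.0]   # hypothetical candidate set
f = lambda x: (x - 3.7) ** 2                  # strictly convex; unique minimizer x* = 3.7
hm = [0.0, 11.0, 9.0, 2.0]                    # HMS = 4 initial values
p_r = 0.3                                     # assumed random-selection rate

lengths = []                                  # D_g = x^(HMS,g) - x^(1,g)
for g in range(300):
    x = random.choice(cand) if random.random() < p_r else random.choice(hm)
    worst = max(range(len(hm)), key=lambda h: f(hm[h]))
    # accept only improving values inside the current HM range (x^1 <= x' <= x^HMS),
    # mirroring the assumption used in properties (a)-(d)
    if f(x) < f(hm[worst]) and min(hm) <= x <= max(hm):
        hm[worst] = x
    lengths.append(max(hm) - min(hm))
```

Because the worst-valued element of a strictly convex f over the stored values is always an endpoint of HM, replacing it with an in-range improving value can only shrink (or keep) the range.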

Remark.
In general probability theory, for each event E of the sample space S, we define n(E) to be the number of times the event E occurs in the first n repetitions of the experiment. Then P(E), the probability of the event E, is defined as: P(E) = lim_{n→∞} n(E)/n. Here, P(E) is defined as the (limiting) proportion of time that E occurs; it is, thus, the limiting frequency of E. P(E) = P_E is called the "theoretical probability," and n(E)/n is called the "empirical probability." The empirical probability approaches the theoretical probability as the number of experiments is increased. This means that, when the repetition number n is small, the event E may not occur at all. However, if we repeat the experiment many times, i.e., if n is large enough, the ratio between n(E) and n converges to P_E. This guarantees that E occurs (n(E) times) if we repeat the experiment many times. Therefore, even when HM does not include the solution that minimizes f(x), if Λ includes this unique solution, say x*, then x* is surely selected as we repeat the iterations. That is: lim_{n→∞} n(E)/n = P_E > 0, where E is the event of selecting x*. In addition, Equation (25) guarantees that x* should be selected as g → ∞. Then, as g increases, we get g_1, g_2, .... As we see from Figure 4, we have D_0 = 11 and δ_1 = 2, δ_2 = 1, δ_3 = 0.7, δ_4 = 2.3, δ_5 = 3, δ_6 = 2, so that δ_m = 0.7. Furthermore, we get D_{g_6} = 0; therefore, t = 6 from Equation (43). However, from δ_m = 0.7, ε_0 = 11, ε_1 = 10.3, ε_2 = 9.6, ε_3 = 8.9, ε_4 = 8.2, ε_5 = 7.5, ε_6 = 6.8.
Thus, we get ε_6 = 6.8 > 0, which is not a very small number yet. Therefore, we repeat the procedure s more times until we satisfy Equations (35) and (36). It can be easily calculated that s = 15; hence, L = t + s = 21. We have either D_{g_l} ≥ δ_m or D_{g_l} = 0 by the definition of D_{g_l}. Therefore, Equation (45) guarantees that D_{g_21} = 0. Figure 4 shows the convergence of HM in Example 1. In fact, we already have D_{g_6} = 0, but at that point it is not yet proven mathematically; the convergence is finally shown after the 15 additional generations.
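The ε_l values quoted in Example 1 can be verified directly from D_0 = 11 and δ_m = 0.7 (a small sketch; only these two constants are taken from the example):

```python
# epsilon_l = D_0 - l * delta_m, the decreasing bound used in the proof of Theorem 1
D0, delta_m = 11.0, 0.7
eps = [round(D0 - l * delta_m, 1) for l in range(22)]   # l = 0, ..., 21
```

The sequence reproduces the values 11, 10.3, 9.6, 8.9, 8.2, 7.5, 6.8 for l = 0, ..., 6 and first becomes non-positive at l = 16.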
Therefore, there exists a positive integer t = 6 that is less than L = 21 such that the solution in HM^{g_6} is convergent. Example 1 thus confirms Theorem 1.
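The empirical/theoretical probability relation used in the Remark can also be illustrated with a quick frequency check (a hypothetical sketch: the candidate set, P_R, and x* are arbitrary). The empirical frequency of drawing x* by random selection approaches its theoretical per-generation probability P_R/K:

```python
import random

random.seed(7)
cand = list(range(10))        # K = 10 hypothetical candidate values
x_star = 3                    # the (assumed) unique optimal value in Lambda
p_r = 0.5                     # assumed random-selection rate
p_theory = p_r / len(cand)    # per-generation chance of picking x* by random selection

n, hits = 100_000, 0
for _ in range(n):
    # event E: random selection fires AND it draws x*
    if random.random() < p_r and random.choice(cand) == x_star:
        hits += 1
p_empirical = hits / n
```

With a large n, the event E is guaranteed to occur, which is exactly what the existence of the updating generations g_1, g_2, ... relies on.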

Multiple Variable Case
Here, let us define the total ranges of the candidate set Λ and HM. Similarly, without loss of generality, we focus on the minimum case of the optimal solution problem in this section, because the theory is easily applied to the maximum case. In addition, we define the distances between neighboring solution candidate vectors, which are the lengths of the subintervals, as follows: δ_k = ||x(k + 1) − x(k)||, k = 1, ..., K − 1.
Then, we define the minimum and maximum values of δ_k as follows: δ_m = min_k δ_k and δ_M = max_k δ_k. Here, note that the elements of HM can be arranged as HM = {x^h ∈ R^n | ||x^1|| ≤ ... ≤ ||x^h|| ≤ ... ≤ ||x^{HMS}||}.
(i) If x^1 = x_{worst}, then x^1 will be replaced by some x′, which shows better performance than x^1. Here, note that ||x^1|| < ||x′|| < ||x^{HMS}||. Therefore, x^1, which has the smallest norm, will be newly replaced. Consider that we are at the g-th generation; then, the previous x^1 is denoted as x^{1,g−1}, and the newly updated smallest vector from HM will be denoted as x^{1,g}. Here, x^{1,g} = x^{2,g−1} if ||x^{2,g−1}|| < ||x′||, and x^{1,g} = x′ otherwise. (53)
(ii) If x^{HMS} = x_{worst}, then x^{HMS} will be replaced by another vector x′, which shows better performance than x^{HMS}. Here, note that ||x^1|| < ||x′|| < ||x^{HMS}||. Therefore, x^{HMS}, which has the largest norm, will be newly replaced. Consider that we are at the g-th generation. Then, the previous x^{HMS} is denoted as x^{HMS,g−1}, and the newly updated largest vector from HM will be denoted as x^{HMS,g}. Here, x^{HMS,g} = x^{(HMS−1),g−1} if ||x^{(HMS−1),g−1}|| > ||x′||, and x^{HMS,g} = x′ otherwise. It is clear that: ||x^{HMS,g} − x^{1,g}|| ≤ ||x^{HMS,g−1} − x^{1,g−1}||.
Corollary 2. Assume that there exists exactly one solution vector in the candidate set Λ for the objective function f(x) = f(x_1, x_2, ..., x_n). In addition, if D_g = |HM^g| = ||x^{HMS,g} − x^{1,g}|| is the length of HM at the g-th generation, then (i) {D_g} is a monotone decreasing sequence as g is increased.
(ii) Furthermore, the values in HM converge.
Proof. Based on the notations defined in Equations (47)-(51), the proof follows directly from Theorem 1.

Conclusions
In this communication, we employed the distance concept and proved the convergence of the HS algorithm based on the empirical probability. The solution behavior of HS for one or more discrete variables was discussed and the given theorem was demonstrated with a numerical example.
For future study, we will expand the theorem to include non-discrete variables, multi-modal functions, and adaptive parameters [35].