Article

Modeling Local Search Metaheuristics Using Markov Decision Processes

by Rubén Ruiz-Torrubiano *,†, Deepak Dhungana *,†, Sarita Paudel and Himanshu Buckchash
IMC University of Applied Sciences Krems, 3500 Krems, Austria
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Algorithms 2025, 18(8), 512; https://doi.org/10.3390/a18080512
Submission received: 10 June 2025 / Revised: 6 August 2025 / Accepted: 13 August 2025 / Published: 14 August 2025

Abstract

Local search metaheuristics like tabu search or simulated annealing are popular heuristic optimization algorithms for finding near-optimal solutions to combinatorial optimization problems. However, it is still challenging for researchers and practitioners to analyze their behavior and to systematically choose one among a vast set of possible metaheuristics for the particular problem at hand. In this paper, we introduce a theoretical framework based on Markov Decision Processes (MDPs) for analyzing local search metaheuristics. This framework not only helps in providing convergence results for individual algorithms, but also provides an explicit characterization of the exploration–exploitation tradeoff and theory-grounded guidance for practitioners in choosing an appropriate metaheuristic for the problem at hand. We present this framework in detail and show how to apply it to the hill climbing and simulated annealing algorithms, including computational experiments.

1. Introduction

Metaheuristics are high-level algorithmic frameworks that provide guidance or strategies for developing heuristic optimization algorithms [1]. Frequently, these strategies use concepts from other disciplines as their underlying inspiration, like the evolution process in biology, the process of cooling a metallic alloy, the cooperative behavior of flocks of birds, and many more [2]. Although they are heuristic in nature, so that in general there is no theoretical guarantee that the algorithm will find the global optimum, metaheuristics have been tremendously successful in practice, being used effectively in applications like logistics (e.g., the vehicle routing problem [3]), network design [4], scheduling [5,6,7,8], finance (e.g., portfolio optimization [9]) and many others.
In general, a distinction can be made between local search (or single-solution) metaheuristics, like simulated annealing (SA) [10], tabu search (TS) [11], variable neighborhood search (VNS) [12] and large neighborhood search (LNS) [13], on the one hand, and population-based metaheuristics, like genetic algorithms (GAs) [14], particle swarm optimization (PSO) [15] or estimation of distribution algorithms (EDAs) [16], on the other. Local search metaheuristics typically start with a candidate solution $x_0$ (not necessarily feasible) and apply stochastic operators to select the next candidate solution from a suitably chosen neighborhood or pool of individuals. By iteratively repeating this process, local search metaheuristics are able to find near-optimal solutions to the problem at hand efficiently, even if the search space is very large. By contrast, population-based metaheuristics start with a group of candidate solutions (often referred to as a population) $P = \{x_1, x_2, \ldots, x_n\}$ and apply different types of operators to generate new candidate solutions, updating the population in the process.
It is well acknowledged that there is a lack of strong theoretical frameworks for understanding and analyzing the behavior of metaheuristic algorithms [17,18]. While theoretical results and convergence analysis do exist for individual algorithms like genetic algorithms and simulated annealing [19,20], existing frameworks are either focused on one type of analysis (like convergence analysis [17]), or they provide only weak results [18]. Moreover, they provide no theory-grounded guidance for practitioners on which algorithm might be most appropriate for a particular task.
In this paper, we fill this gap by proposing a novel framework based on Markov Decision Processes (MDPs) as the underlying model for metaheuristic algorithms. Specifically, in this study we focus on local search metaheuristics, leaving population-based metaheuristics for future work. We show that particular instances of local search metaheuristics can be modeled as policies of a suitably defined MDP, and based on that, we derive different coefficients to model their convergence and exploration–exploitation behavior. By introducing different parameters into these bounds and assigning them a practical meaning, we show which properties have to be taken into account for a given metaheuristic (policy) to perform well on a particular class of problems.
This paper is structured as follows. In Section 2, we review previous relevant work. Section 3 presents the MDP framework for local search metaheuristics and applies it to the hill climbing algorithm as a first example. In Section 4, we apply this framework to simulated annealing, showing how the relevant measures of convergence and exploration–exploitation can be explicitly calculated. We present some computational experiments in Section 5 that provide empirical evidence for our theoretical results. Section 6 concludes the paper and outlines future work.

2. Previous Work

Mathematical analysis of optimization metaheuristics has been carried out in the literature in a variety of ways. The work on the schema theorem by Holland [21] can be considered one of the first attempts at a mathematical formalization of metaheuristic algorithms. However, this result applied only to simple GAs with one-point crossover and binary encoding. One of the most popular lines of mathematical analysis was spearheaded by the Markov chain analysis of genetic algorithms (GAs), which was mostly carried out in the 1990s. For instance, in early work, the authors of [22] proved that GAs with an elitist selection scheme converge to the optimum with probability 1. Suzuki [19,20] proved theoretical results based on Markov chain analysis for GAs, focusing on mutation probabilities, and developed an early comparison with simulated annealing (SA). In [23], the authors applied a non-stationary Markov model to the convergence analysis of genetic algorithms. One of the main drawbacks of using Markov chain theory for metaheuristic analysis is that Markov chains are not expressive enough to handle important aspects of the search process in metaheuristics. Critically, the reward structure and the different operators used in the algorithm can only be expressed implicitly via the transition matrix.
More recent approaches in the literature use other models, like the evolutionary Turing machine [24] and the nested partition algorithm [18]. However, most of these results focus on convergence analysis and do not handle the exploration–exploitation dilemma explicitly. This question was addressed in [25] by experimental means, but no theoretical analysis was provided. Similarly, a case study based on elitist evolutionary algorithms (concretely, random univariate search and evolutionary programming) was presented in [26] using a specially tailored probabilistic framework. An extensive survey on the exploration–exploitation tradeoff in metaheuristics is given in [27]. In [28], a mathematical characterization of the exploration–exploitation tradeoff is given in the form of the optimal contraction theorem. However, their definition of exploration leaves out the possibility that information collected during the search can be used for exploring other areas of the search space.
In contrast to previous approaches, our framework applies Markov decision processes (MDPs) to the problem of modeling the convergence behavior and the exploration–exploitation tradeoff of local search metaheuristics. We consider MDPs to be a more suitable framework than the previously presented approaches, since the reward structure of the general search problem can be represented explicitly. Moreover, transformations and sequences of candidate solutions can be modeled explicitly by means of actions, which in MDPs are naturally linked to rewards. We can therefore model a particular metaheuristic algorithm as a stochastic agent that makes decisions about which candidate solution to visit next, based on the information and rewards obtained so far. We explore the implications of this modeling choice and show in the following sections that the convergence and exploration–exploitation analysis of metaheuristic algorithms can be calculated explicitly.

3. Markov Decision Processes for Local Search Metaheuristics

In this section, we develop the mathematical machinery needed to define an MDP model for local search metaheuristics. Our main insight is that we can model any local search metaheuristic as a stochastic agent that executes some policy on a suitably defined MDP. Critically, we note that even if the goal of a specific metaheuristic is to find the optimum of an objective function, the algorithm itself proceeds by maximizing the total reward over a (possibly) infinite time horizon. Clearly, the rewards are directly connected with the best values found so far of the objective function, but the stochastic agent needs to proceed by gathering information about promising directions in the search space (exploration–exploitation tradeoff). Therefore, the exploration of the search space is guided mainly by the maximization of the total reward.

3.1. Definitions

We start by introducing some preliminaries in the form of notation and definitions. Then we specify the model in detail and describe how to define metaheuristic policies using this framework. In the following, we will mainly follow the conventions and definitions used in [29].
Definition 1 
(Markov Decision Process). A Markov Decision Process (MDP) is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \alpha \rangle$, where:
  • $\mathcal{S}$ is a set of states.
  • $\mathcal{A}(i)$ is a set of actions available in state $i \in \mathcal{S}$.
  • $P(a) = \{p_{ij}(a)\}_{ij}$ is the state transition matrix, given action $a$, where $p_{ij}(a) = P\{S_{t+1} = j \mid S_t = i, A_t = a\}$.
  • $R(a)$ is a reward function, $R_s(a) = \mathbb{E}[r_{t+1} \mid S_t = s, A_t = a]$.
  • $\alpha \in [0, 1]$ is a discount factor.
In words, an MDP is a stochastic decision model where, at each time step, an agent decides which action $a$ to take from a predefined action set $\mathcal{A}(i)$, which might differ for each state $i \in \mathcal{S}$. Most importantly, the transition probabilities to a new state $S_{t+1}$ depend only on the current state $S_t$ and not on the previous history (i.e., the previously visited states). This is known as the Markov property.
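For concreteness, the following Python sketch shows one possible way to represent such a finite MDP in code. The names (MDP, states, actions, transition, reward) are our own illustrative choices and not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MDP:
    """A finite MDP <S, A, P, R, alpha> with tabular states and function-valued components."""
    states: List[int]                              # S: the set of states
    actions: Callable[[int], List[int]]            # A(i): actions available in state i
    transition: Callable[[int, int, int], float]   # p_ij(a) = P{S_{t+1} = j | S_t = i, A_t = a}
    reward: Callable[[int, int], float]            # R_s(a) = E[r_{t+1} | S_t = s, A_t = a]
    alpha: float = 1.0                             # discount factor in [0, 1]
```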
Definition 2. 
A policy $R$ is a decision rule $R = (\pi^1, \pi^2, \ldots, \pi^t, \ldots)$ that can be expressed as a probability distribution over actions, given states:
$\pi_{ia}^t = P\{A_t = a \mid S_t = i\}$
If the policy $R$ is such that, for every state $i \in \mathcal{S}$, there exists an action $a$ with $\pi_{ia}^t = 1$, we say the policy is deterministic; otherwise, we call the policy stochastic. If the policy does not depend on time, we call the policy stationary and simply write $\pi_{ia}$.
The natural problem that now arises is how to find policies that maximize the rewards obtained in some precise sense. This is captured by means of a utility function, which might be, e.g., the total expected reward or the average reward. An MDP can be classified according to its time horizon into finite- or infinite-horizon MDPs. Likewise, we can also distinguish between discrete- and continuous-time MDPs. For local search metaheuristics, we focus on the MDP formulation where:
  1. The time is discrete and the time horizon is infinite.
  2. The utility function is the total expected reward.
Regarding condition (1), we note that the execution time of a metaheuristic is clearly always finite, but this finite time cannot be fixed beforehand (except for the class of metaheuristics that are run for a fixed number of iterations). This is modeled by having terminal states with no outgoing transitions (i.e., the probability of transitioning to any other state is 0, and $\mathcal{A}(i) = \emptyset$ for all terminal states $i \in \mathcal{S}$). Regarding condition (2), we argue that, in general, the main strategy used by all metaheuristics is to search for the optimum indirectly by exploring the search space $\mathcal{X}$ and exploiting promising regions where an optimum $x^* \in \mathcal{X}$ is supposed to be. However, this is done by sampling from the search space in some specific sense and evaluating the objective function $f$ on that sample, thus collecting information about the target function value. If we interpret this sampling and evaluation procedure as collecting rewards, the agent (i.e., the metaheuristic) can be interpreted as trying to maximize the total reward obtained.

3.2. Optimal Policies

In general, we model local search metaheuristic algorithms by means of policies R as defined before. We define the reward of such a policy as the total (discounted) expected reward:
$v_i^\alpha(R) = \sum_{t=1}^{\infty} \mathbb{E}_{i,R}\left\{\alpha^{t-1} r_t\right\}$
where $r_t$ denotes the actual reward collected at time step $t$, the expectation is taken over policy $R$ given initial state $i \in \mathcal{S}$, and $\alpha$ is a discount factor that regulates the value of future rewards. In the following, we will assume $\alpha = 1$, so we can write:
$v_i(R) = \sum_{t=1}^{\infty} \sum_{j,a} P\{S_{t+1} = j \mid S_t = i, A_t = a\}\, r_j(a)$
where $r_j(a)$ denotes the reward obtained in state $j$ after action $a$. Let $\{P(\pi^t)\}_{ij}$ denote the transition matrix of policy $R$, and $\{r(\pi^t)\}_i$ its reward vector. We have
$\{P(\pi^t)\}_{ij} = \sum_{a \in \mathcal{A}} p_{ij}^t(a)\, \pi_{ia}^t \quad \text{for all } i, j \in \mathcal{S}$
$\{r(\pi^t)\}_i = \sum_{a \in \mathcal{A}} r_i^t(a)\, \pi_{ia}^t \quad \text{for all } i \in \mathcal{S}$
The reward of policy $R$ can now be expressed in terms of the transition probabilities as follows:
$v^\alpha(R) = \sum_{t=1}^{\infty} \alpha^{t-1} P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\, r(\pi^t)$
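For a stationary policy the products collapse to powers of a single matrix, $v^\alpha(R) = \sum_{t \geq 1} \alpha^{t-1} P(\pi)^{t-1} r(\pi)$, which for $\alpha < 1$ can be evaluated in closed form by solving $(I - \alpha P(\pi))\, v = r(\pi)$. The following numpy sketch is our own illustration of this computation, with P and r standing for the policy's induced transition matrix and reward vector:

```python
import numpy as np

def policy_value(P: np.ndarray, r: np.ndarray, alpha: float) -> np.ndarray:
    """Total discounted expected reward of a stationary policy.

    P[i, j] = {P(pi)}_ij, r[i] = {r(pi)}_i, and alpha in [0, 1).
    Solves (I - alpha * P) v = r, the closed form of
    v = sum_{t >= 1} alpha^(t-1) * P^(t-1) * r.
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P, r)
```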
Definition 3. 
We call a policy $R^*$ optimal if it is the best achievable policy considering the reward function, i.e.,
$v^\alpha(R^*) = \sup_{R} v^\alpha(R)$

3.3. Local Search Metaheuristics as Policies

We now formally represent an arbitrary local search metaheuristic as a policy in an MDP as follows. In this study, we focus on binary strings of length $n$ as candidate solutions; therefore $\mathcal{X} \subseteq \{0,1\}^n$. From now on, we denote by $x \in \{0,1\}^n$ a binary vector that is a candidate solution for the problem at hand, which might be feasible or not (i.e., it can happen that, during the search process, there is an $x_t$ such that $x_t \notin \mathcal{X}$). In the following, we will assume (without loss of generality) a maximization problem, i.e., find $x^*$ such that $x^* = \operatorname{argmax}_{x \in \mathcal{X}} f(x)$.
Definition 4. 
A criterion $H$ is a function $H: \mathcal{X} \times \mathcal{X} \to \{0, 1\}$ such that $H(x; x') = 1$ if and only if $x'$ is a neighbor of $x$ according to a metaheuristic-specific strategy, and 0 otherwise.
Typically, instantiations of different local search metaheuristics provide one or several criteria $H$ to determine which candidate solutions $x'$ can be considered neighbors of the current solution $x$. For instance, in hill climbing there is typically only one criterion $H$ that defines the possible candidate solutions starting from the current one (for instance, all solutions that can be obtained from the current one by flipping a random bit). However, other metaheuristics like VNS typically require a set of criteria $\mathcal{H} = \{H_1, \ldots, H_n\}$ to be specified, and the algorithm iterates in some form over $\mathcal{H}$.
Definition 5. 
We define a neighborhood $N_H(x)$ as the subset of $\mathcal{X}$ induced by $H$, i.e., all candidate solutions $x' \in \mathcal{X}$ that are neighbors of $x$ according to criterion $H$; therefore, $N_H(x) = \{x' \in \mathcal{X} \mid H(x; x') = 1\}$.
Let $x \in \mathcal{X}$ and $x' \in N_H(x)$. We denote by $x \to x'$ the operator that transforms $x$ into $x'$. For example, the neighborhood of all binary strings located at Hamming distance one from $x$ would be defined by $N_{HD}(x) = \{x' \in \mathcal{X} \mid HD(x; x') = 1\}$. In this case, $x \to x'$ denotes the operator that flips the one bit necessary to turn $x$ into $x'$.
Definition 6 
(Local search MDP). Let our local search MDP be defined by:
  • $\mathcal{S} = \{0, 1, \ldots, 2^n - 1\}$. Any state $i \in \mathcal{S}$ can trivially be converted into a candidate solution as the binary representation of the integer $i$.
  • $\mathcal{A}(i) = \{i \to j \mid j \in \bigcup_k N_k\}$, the set of all possible transformations in all neighborhoods applicable to the current state $i \in \mathcal{S}$.
  • The transition probabilities, for all $i, j \in \mathcal{S}$, $a \in \mathcal{A}(i)$:
    $p_{ij}(a) = \begin{cases} 1/|\mathcal{A}(i)|, & \text{if } a = i \to j \\ 0, & \text{otherwise} \end{cases}$
  • The reward for state $j$ given action $a = i \to j$ is defined by $r_j(a) = f(j) - f(i)$, where $f$ is the objective function (note the slight abuse of notation here, since $f$ is defined on $\mathcal{X}$ rather than $\mathcal{S}$, but the conversion is trivial).
An instantiation of this MDP therefore starts at an arbitrary state $i \in \mathcal{S}$ and visits a number of states according to the transition probabilities before halting at time $T$. The algorithm terminates as soon as it reaches a terminal or absorbing state, i.e., a state $i \in \mathcal{S}$ where the probability of reaching any other state is 0 ($p_{ij}(a) = 0$ for all $j \neq i$, and $p_{ii}(a) = 1$). Note that the rewards are chosen in such a way that, in general, they incentivize making progress towards an optimum $x^*$, setting the stage for the optimization process to advance towards an optimal solution.
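As a concrete illustration of this construction, the sketch below instantiates the local search MDP for bit strings of length n with single-bit-flip transformations. The function names are our own, f is any objective defined on the integer encoding of the candidate solutions, and actions are represented simply by their target state j.

```python
from typing import Callable, List

def actions(i: int, n: int) -> List[int]:
    """A(i): all states j reachable from state i by flipping exactly one of its n bits."""
    return [i ^ (1 << k) for k in range(n)]

def transition_prob(i: int, j: int, a: int, n: int) -> float:
    """p_ij(a) = 1/|A(i)| if a is the transformation i -> j, and 0 otherwise."""
    A = actions(i, n)
    return 1.0 / len(A) if (a == j and j in A) else 0.0

def reward(i: int, j: int, f: Callable[[int], float]) -> float:
    """r_j(a) = f(j) - f(i) for the action a = i -> j."""
    return f(j) - f(i)
```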
In this setting, we identify a policy R with a particular local search metaheuristic.
Example 1 
(Hill climbing). We define a policy $R_{HC}$ for the standard hill climbing algorithm as follows:
  • $\mathcal{A}(i) = \{i \to j \mid j \in N_{HD}(i)\}$ as defined before (we flip one bit), and
  • Let $M(i) = \{j \in \mathcal{S} \mid j = \operatorname{argmax} f(j),\ a = i \to j,\ a \in \mathcal{A}(i)\}$; then
    $\pi_{ia} = \begin{cases} 1/|M(i)|, & \text{if } j \in M(i) \\ 0, & \text{otherwise} \end{cases}$
Note that $R_{HC}$ is a deterministic policy except when $|M(i)| > 1$, in which case ties are broken randomly.
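A minimal sketch of this action-selection rule (our own naming, reusing the one-bit-flip neighborhood above; termination at an absorbing state is left to the caller):

```python
import random
from typing import Callable

def hill_climbing_step(i: int, n: int, f: Callable[[int], float]) -> int:
    """Policy R_HC: among all one-bit-flip neighbors of i, move to a neighbor
    with maximal objective value, breaking ties uniformly at random
    (pi_ia = 1/|M(i)| for targets j in M(i), and 0 otherwise)."""
    neighbors = [i ^ (1 << k) for k in range(n)]   # A(i), represented by target states
    best = max(f(j) for j in neighbors)
    M = [j for j in neighbors if f(j) == best]     # M(i): neighbors maximizing f
    return random.choice(M)
```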
In general, the MDP formulation is defined in such a way that the balance between exploration and exploitation becomes explicit in the definition of the policy. Informally, the agent can either maximize the reward in the short run (thus entering an exploitation phase) or temporarily accept negative rewards (thus entering an exploration phase), depending on the particular trajectory taken from an initial solution $x_0$. Using the previous notions, we define exploration and exploitation explicitly as follows:
Definition 7 
(Exploration and exploitation). Let $a \in \mathcal{A}(i)$ be an action such that $a = i \to j$. We say that $a$ is an exploration action if and only if $r_j(a) \leq 0$. Otherwise, we call $a$ an exploitation action.
Definition 8 
(Exploration–exploitation function). Let $\sigma: \mathcal{A}(i) \to \{0, 1\}$ be a function such that $\sigma(a) = 1$ if and only if $a$ is an exploration action, and $\sigma(a) = 0$ otherwise. We call $\sigma$ the exploration–exploitation function.
We now state our main result, which shows that any local search metaheuristic can be represented by a policy where the balance between exploration and exploitation becomes explicit.
Theorem 1 
(Local search exploration–exploitation theorem). Let $M$ be a local search MDP. For any local search metaheuristic $A$, there exists a policy $R_A$ such that
$\pi_{ia}^t = \alpha_i^A(t)\, P\{\sigma(a) = 1 \mid S_t = i\} + \beta_i^A(t)\, P\{\sigma(a) = 0 \mid S_t = i\}$
Theorem 1 shows that for any metaheuristic A, we can explicitly construct the corresponding policy as a mixture of exploration and exploitation actions.
Proof. 
Let $M_t^+(i) = \{j \in \mathcal{S} \mid f(j) > f(i),\ a = i \to j,\ a \in \mathcal{A}(i),\ S_t = i\}$ and $M_t^-(i) = \{j \in \mathcal{S} \mid f(j) \leq f(i),\ a = i \to j,\ a \in \mathcal{A}(i),\ S_t = i\}$ be the sets of improving and non-improving states given action $a$, respectively. We have:
$\pi_{ia}^t = P\{A_t = a \mid S_t = i\} = P\{A_t = a,\ j \in M_t^+(i) \mid S_t = i\} + P\{A_t = a,\ j \in M_t^-(i) \mid S_t = i\} = \frac{|M_t^+(i)|}{|M_t^+(i)| + |M_t^-(i)|}\, P\{\sigma(a) = 0 \mid S_t = i\} + \frac{|M_t^-(i)|}{|M_t^+(i)| + |M_t^-(i)|}\, P\{\sigma(a) = 1 \mid S_t = i\}$
Therefore,
$\alpha_i^A(t) = \frac{|M_t^-(i)|}{|M_t^+(i)| + |M_t^-(i)|}, \quad \text{and} \quad \beta_i^A(t) = \frac{|M_t^+(i)|}{|M_t^+(i)| + |M_t^-(i)|}$
   □
Theorem 1 and its proof provide hints to several measures of interest. The first one is a measure that can be used for convergence analysis and defined as follows:
$\gamma_i^A(t) := \frac{\beta_i^A(t)}{\alpha_i^A(t)} = \frac{|M_t^+(i)|}{|M_t^-(i)|}$
We call $\gamma_i^A(t)$ the convergence coefficient of local search metaheuristic $A$ (conditioned on initial state $i \in \mathcal{S}$) at time $t$, and denote by $\gamma^A(t)$ the vector of all $\{\gamma_i^A(t)\}_{i \in \mathcal{S}}$ (which we call the convergence vector). We can now study the convergence behavior of local search metaheuristics according to the limit of the components $\gamma_i^A(t)$ as $t \to \infty$. For instance, if $\lim_{t \to \infty} \gamma_i^A(t) = 0$ (for some state $i$), this means that algorithm $A$ has converged, since the improving set shrinks to zero faster than the non-improving set. Note that $\gamma_i^A(t) = 0$ only implies convergence, not optimality (i.e., the solution reached can be a local optimum).
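In practice, $\gamma_i^A(t)$ can be tracked during a run by counting the improving and non-improving neighbors of the current state. A minimal sketch for the one-bit-flip neighborhood (our own naming; the infinite value covers the degenerate case of an empty non-improving set):

```python
from typing import Callable

def convergence_coefficient(i: int, n: int, f: Callable[[int], float]) -> float:
    """gamma_i^A(t) = |M_t^+(i)| / |M_t^-(i)| for the one-bit-flip neighborhood of state i."""
    neighbors = [i ^ (1 << k) for k in range(n)]
    improving = sum(1 for j in neighbors if f(j) > f(i))   # |M_t^+(i)|
    non_improving = len(neighbors) - improving             # |M_t^-(i)|
    return improving / non_improving if non_improving > 0 else float("inf")
```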
The second relevant measure represents an explicit way of evaluating the balance between exploration and exploitation that a specific local search metaheuristic provides. We define
$\delta_{ia}^A(t) := \frac{P\{\sigma(a) = 1 \mid S_t = i\}}{P\{\sigma(a) = 0 \mid S_t = i\}}$
Averaging over all actions, we get
$\delta_i^A = \sum_{t=0}^{\infty} \delta_i^A(t) = \sum_{t=0}^{\infty} \sum_{j \in \mathcal{A}(i)} p_{ij}(a)\, \delta_{ja}^A(t) = \frac{1}{|\mathcal{A}(i)|} \sum_{j \in \mathcal{A}(i)} \sum_{t=0}^{\infty} \delta_{ja}^A(t)$
We call $\delta_i^A$ the exploration–exploitation coefficient of local search metaheuristic $A$ conditioned on initial state $i \in \mathcal{S}$. In vector notation, we write $\delta^A$ and call it the exploration–exploitation coefficient vector. In general, some components of $\delta^A$ may converge while others diverge. Therefore, one possibility is to assess the overall exploration–exploitation behavior of $A$ based on $\delta^A = \max_i \delta_i^A$ and to investigate the individual components of $\delta^A$ in case a more detailed analysis is needed.
We can now define for each metaheuristic a measure of its exploration–exploitation balance as follows:
Definition 9. 
We say that $A$ is a balanced local search metaheuristic if $\delta^A = C$ for some constant $C > 0$, $C \in \mathbb{R}$. Otherwise, if $\delta^A$ diverges to $\infty$, we say that $A$ is exploration-oriented. Finally, if $\delta^A = 0$, we say that $A$ is exploitation-oriented.
Using our local search MDP, we have transformed the problem of convergence analysis and of determining the exploration–exploitation balance of any metaheuristic into the quantitative problem of calculating $\delta_i^A$, $\delta^A$ and the limits of $\gamma_i^A(t)$ as $t \to \infty$ for $i \in \mathcal{S}$. We now apply these notions to the hill climbing heuristic introduced above.
Example 2. 
In the case of hill climbing, for every $i \in \mathcal{S}$ we have:
$\alpha_i^A(t) = 0, \quad \beta_i^A(t) = 1/|M(i)|, \quad \gamma_i^A(t) = 0$
Since $\lim_{t \to \infty} \gamma_i^A(t) = 0$ for all $i \in \mathcal{S}$, hill climbing always converges to a (possibly local) optimum. Regarding the exploration–exploitation behavior, we have $\delta_{ia}(t) = 0$ for all $a \in \mathcal{A}(i)$, $i \in \mathcal{S}$. We therefore obtain $\delta^A = 0$, and hill climbing is exploitation-oriented.

3.4. Summary

For the sake of clarity, we provide a graphical summary of our approach in Figure 1. In this figure, we depict the analysis described in the text as a sequence of five steps and break down each step according to the measures defined.
In the next section, we apply our MDP model to the simulated annealing (SA) metaheuristic and derive some of its convergence and exploration–exploitation properties.

4. Simulated Annealing

Simulated annealing (SA) [10] is a local search metaheuristic inspired by the procedure of cooling a molten solid according to a predefined schedule. The main idea is to start with a high temperature $T$ and decrease this temperature in subsequent iterations, with the goal of reaching the minimum value of an energy function $E$. Transitions from the current candidate solution that improve the objective value are accepted with probability 1. Otherwise, the new solution is accepted with a probability that depends on the magnitude of the difference between the current best objective value and the value of the new candidate solution. In the initial, high-temperature phase, the probability of accepting worse solutions is high (emphasizing exploration), whereas in later stages lower temperatures decrease this probability (emphasizing exploitation). More formally, this acceptance probability can be written as:
$P\{x_{t+1} = x' \mid x_t = x\} = \begin{cases} 1, & \text{if } E(x') < E(x) \\ \exp\left(-\frac{E(x') - E(x)}{T}\right), & \text{otherwise} \end{cases}$
In our case, since by convention we consider maximization problems, we set $E = -f$ without loss of generality. Usually, a geometric cooling scheme is used: $T_{new} = \alpha T_{current}$, where $\alpha \in [0, 1)$. Let $T_0$ denote the initial temperature (a free parameter of the algorithm). Then the temperature at time $t$ is given by $T = \alpha^t T_0$.
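A compact sketch of this acceptance rule with geometric cooling (our own minimal rendering for maximization with $E = -f$; it is not the exact code used in the experiments of Section 5):

```python
import math
import random

def sa_accept(f_current: float, f_candidate: float, T: float) -> bool:
    """Accept improving moves with probability 1; otherwise accept with
    probability exp((f_candidate - f_current) / T) (maximization, E = -f)."""
    if f_candidate > f_current:
        return True
    return random.random() < math.exp((f_candidate - f_current) / T)

def temperature(t: int, T0: float, alpha: float) -> float:
    """Geometric cooling: the temperature at iteration t is alpha**t * T0, with alpha in [0, 1)."""
    return (alpha ** t) * T0
```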
We now define a policy R S A to model the behavior of SA as a stochastic decision agent as follows:
  • Let $M_t^+(i)$ and $M_t^-(i)$ be defined as in Theorem 1.
  • Let us consider action $a = i \to j$ and state $i \in \mathcal{S}$. Using Theorem 1, we obtain
    $\pi_{ia}^t = P\{A_t = a \mid S_t = i\} = \frac{|M_t^+(i)|}{|M_t^+(i)| + |M_t^-(i)|} + \frac{|M_t^-(i)|}{|M_t^+(i)| + |M_t^-(i)|} \exp\left(\frac{f(j) - f(i)}{\alpha^t T_0}\right)$
Now, the convergence coefficients can be calculated as follows:
$\lim_{t \to \infty} \gamma_i^A(t) = \lim_{t \to \infty} \frac{|M_t^+(i)|}{|M_t^-(i)|} = 0, \quad \forall i \in \mathcal{S}$
This holds because, as time approaches infinity under a geometric cooling scheme, the algorithm only accepts improving actions, and in the limit there are more non-improving than improving actions (note that $\alpha < 1$ by definition).
Let us now calculate the exploration–exploitation coefficient:
$P\{\sigma(a) = 1 \mid S_t = i\} = \begin{cases} 0, & \text{if } f(j) > f(i) \\ \exp\left(\frac{f(j) - f(i)}{\alpha^t T_0}\right), & \text{otherwise} \end{cases}$
Using Equations (11) and (12) we get:
$\delta_i^A = \frac{1}{|\mathcal{A}(i)|} \sum_{j \in \mathcal{A}(i)} \sum_{t=0}^{\infty} \delta_{ja}^A(t) = \frac{1}{|\mathcal{A}(i)|} \sum_{j \in \mathcal{A}(i),\, f(j) \leq f(i)} \sum_{t=0}^{\infty} \exp\left(\frac{f(j) - f(i)}{\alpha^t T_0}\right)$
which, for all $i \in \mathcal{S}$ and $\alpha \in [0, 1)$, converges to a positive constant $C > 0$, since:
$\frac{\exp\left(\frac{f(j) - f(i)}{\alpha^{t+1} T_0}\right)}{\exp\left(\frac{f(j) - f(i)}{\alpha^t T_0}\right)} = \exp\left(\frac{(1 - \alpha)(f(j) - f(i))}{\alpha^{t+1} T_0}\right) \xrightarrow{t \to \infty} 0$
Therefore, simulated annealing is a balanced local search metaheuristic according to our definition.
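This series can also be checked numerically for a concrete state by truncating the sum over t. The sketch below is our own illustration, reusing the one-bit-flip neighborhood and borrowing the values $\alpha = 0.995$ and $T_0 = 10$ used in Section 5; note that neighbors with $f(j) = f(i)$ would contribute a divergent series, a case that does not occur for OneMax.

```python
import math
from typing import Callable

def sa_delta(i: int, n: int, f: Callable[[int], float],
             alpha: float = 0.995, T0: float = 10.0, horizon: int = 5000) -> float:
    """Truncated evaluation of delta_i^A for SA with geometric cooling:
    (1/|A(i)|) * sum over non-improving neighbors j of
    sum_{t=0}^{horizon-1} exp((f(j) - f(i)) / (alpha**t * T0))."""
    neighbors = [i ^ (1 << k) for k in range(n)]   # A(i): one-bit-flip neighborhood
    total = 0.0
    for j in neighbors:
        diff = f(j) - f(i)
        if diff <= 0:                              # exploration (non-improving) moves only
            total += sum(math.exp(diff / (alpha ** t * T0)) for t in range(horizon))
    return total / len(neighbors)
```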

5. Computational Experiments

In this section, we present some preliminary computational experiments (the code for the experiments is publicly available at https://github.com/IMC-UAS-Krems/mdp_local_search_experiments, accessed on 12 August 2025) that validate the theoretical results obtained in the previous sections. We calculate the convergence and exploration–exploitation coefficients $\gamma^A(t)$ and $\delta^A(t)$ explicitly for hill climbing and simulated annealing and visualize their temporal evolution, starting from a random initial state, across sample runs on the OneMax problem. In OneMax, the objective is to maximize the number of 1s in an $n$-bit string, starting from a randomly generated solution. It is a simple and widely used benchmark for preliminary tests of search heuristics.
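For reference, a stripped-down version of the hill climbing part of this setup might look as follows (our own simplified sketch, not the exact experiment code; it records $\gamma^A(t)$ at every step of a single OneMax run):

```python
import random
from typing import List

def onemax(x: int) -> int:
    """OneMax objective: the number of 1-bits in the state's binary encoding."""
    return bin(x).count("1")

def hill_climbing_run(n: int = 30, seed: int = 0) -> List[float]:
    """Run hill climbing on OneMax and record gamma^A(t) at every iteration."""
    random.seed(seed)
    x = random.getrandbits(n)
    gammas = []
    while True:
        neighbors = [x ^ (1 << k) for k in range(n)]
        improving = [j for j in neighbors if onemax(j) > onemax(x)]
        non_improving = len(neighbors) - len(improving)
        gammas.append(len(improving) / non_improving if non_improving else float("inf"))
        if not improving:                # absorbing state: no improving neighbor left
            break
        x = random.choice(improving)     # on OneMax every improving move gains exactly +1, so this matches R_HC
    return gammas

print(hill_climbing_run())
```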

5.1. Hill Climbing

Figure 2 shows the temporal evolution of the convergence coefficient across a sample run of the algorithm with string size $n = 30$. As can be seen, the convergence coefficient decreases monotonically until it reaches 0 in the last iteration (red curve), as theoretically predicted. Since each move increases the objective function by exactly 1, the objective value also increases monotonically and linearly until reaching the optimum (blue curve).
In Figure 3, we plot the evolution of both the convergence and the exploration–exploitation coefficients in another run of hill climbing (starting from a new random string). As can be seen, the exploration–exploitation coefficient stays (by definition) at 0, indicating that the algorithm performs only exploitation-oriented moves and no exploration-oriented ones.

5.2. Simulated Annealing

In the case of SA, we show the evolution of the convergence coefficient in Figure 4. Here, the evolution of both $\gamma^A$ and the objective value differs substantially from hill climbing. In the initial stages of the algorithm (while $T$ is still high), $\gamma^A$ fluctuates as expected, alternating between improving and non-improving moves. However, as the temperature decreases and the algorithm focuses on promising regions of the search space, the fluctuations are reduced and $\gamma^A$ reaches its final value of 0, indicating that the algorithm has converged.
Figure 5 visualizes the evolution of the exploration–exploitation coefficient for SA. As expected from the theoretical discussion, the coefficient closely tracks the current temperature: in the initial stages (high temperature), $\delta^A$ is also high, while in later stages, as the temperature decreases, the coefficient decreases as well, signaling the transition from a highly exploration-oriented to an exploitation-oriented behavior. In sum, we conclude that SA has a balanced exploration–exploitation behavior.

6. Conclusions

In this paper, we have presented a new framework for modeling the convergence and exploration–exploitation behavior of local search metaheuristics based on Markov decision processes. This framework provides a novel and intuitive way of characterizing essential properties of local search metaheuristics in a principled manner. We proved that any local search metaheuristic can be analyzed using the tools provided and showed how to apply the framework in two cases: the hill climbing and simulated annealing metaheuristics. In the case of hill climbing, we showed that the framework correctly classifies it as exploitation-oriented, whereas simulated annealing was classified as balanced, which is consistent with the general consensus in the literature. Our study goes beyond the state of the art in metaheuristic analysis, since the tools we develop provide a more expressive methodology than methods based on Markov chains and other probabilistic approaches. As a consequence, the behavior of local search metaheuristics can be modeled in a more realistic way, and the results obtained are consistent with practice.
As future work, we plan to extend our results to other local search metaheuristics like variable neighborhood search (VNS) [12] and large neighborhood search (LNS) [30]. The framework will also be extended to handle population-based metaheuristics (PBMHs) like genetic algorithms (GAs) [14], particle swarm optimization (PSO) [15] and estimation of distribution algorithms (EDAs) [16]. Finally, more empirical studies will be conducted with the goal of explicitly calculating the measures described in this paper for different stages of the optimization process and different instances, potentially leading to new tools for instance space analysis.

Author Contributions

Conceptualization, R.R.-T., D.D., S.P. and H.B.; methodology, R.R.-T., D.D., S.P. and H.B.; validation, R.R.-T., D.D., S.P. and H.B.; formal analysis, R.R.-T.; writing—original draft preparation, R.R.-T.; writing—review and editing, R.R.-T., D.D., S.P. and H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sörensen, K.; Glover, F. Metaheuristics. In Encyclopedia of Operations Research and Management Science; Springer: Boston, MA, USA, 2013; Volume 62, pp. 960–970. [Google Scholar]
  2. Silberholz, J.; Golden, B. Comparison of Metaheuristics. In Handbook of Metaheuristics; Gendreau, M., Potvin, J.Y., Eds.; Springer: Boston, MA, USA, 2010; pp. 625–640. [Google Scholar] [CrossRef]
  3. Lin, S.W.; Lee, Z.J.; Ying, K.C.; Lee, C.Y. Applying hybrid meta-heuristics for capacitated vehicle routing problem. Expert Syst. Appl. 2009, 36, 1505–1512. [Google Scholar] [CrossRef]
  4. Fernandez, S.A.; Juan, A.A.; de Armas Adrián, J.; Silva, D.G.e.; Terrén, D.R. Metaheuristics in Telecommunication Systems: Network Design, Routing, and Allocation Problems. IEEE Syst. J. 2018, 12, 3948–3957. [Google Scholar] [CrossRef]
  5. Pillay, N. A survey of school timetabling research. Ann. Oper. Res. 2014, 218, 261–293. [Google Scholar] [CrossRef]
  6. Kaur, M.; Saini, S. A Review of Metaheuristic Techniques for Solving University Course Timetabling Problem. In Advances in Information Communication Technology and Computing; Goar, V., Kuri, M., Kumar, R., Senjyu, T., Eds.; Lecture Notes in Networks and Systems; Springer: Singapore, 2021; pp. 19–25. [Google Scholar] [CrossRef]
  7. Ecoretti, A.; Ceschia, S.; Schaerf, A. Local search for integrated predictive maintenance and scheduling in flow-shop. In Proceedings of the 14th Metaheuristics International Conference, Syracuse, Italy, 11–14 July 2022. [Google Scholar]
  8. Ceschia, S.; Guido, R.; Schaerf, A. Solving the static INRC-II nurse rostering problem by simulated annealing based on large neighborhoods. Ann. Oper. Res. 2020, 288, 95–113. [Google Scholar] [CrossRef]
  9. Doering, J.; Kizys, R.; Juan, A.A.; Fitó, À.; Polat, O. Metaheuristics for rich portfolio optimisation and risk management: Current state and future trends. Oper. Res. Perspect. 2019, 6, 100121. [Google Scholar] [CrossRef]
  10. Kirkpatrick, S.; Gelatt, C.D., Jr.; Vecchi, M.P. Optimization by simulated annealing. Science 1983, 220, 671–680. [Google Scholar] [CrossRef] [PubMed]
  11. Glover, F.; Laguna, M. Tabu Search; Springer: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
  12. Mladenović, N.; Hansen, P. Variable neighborhood search. Comput. Oper. Res. 1997, 24, 1097–1100. [Google Scholar] [CrossRef]
  13. Shaw, P. Using constraint programming and local search methods to solve vehicle routing problems. In Proceedings of the International Conference on Principles and Practice of Constraint Programming, Pisa, Italy, 26–30 October 1998; pp. 417–431. [Google Scholar]
  14. Holland, J.H. Genetic Algorithms. Sci. Am. 1992, 267, 66–73. [Google Scholar] [CrossRef]
  15. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
  16. Larrañaga, P.; Lozano, J.A. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation; Springer Science & Business Media: Berlin, Germany, 2001; Volume 2. [Google Scholar]
  17. Liu, B.; Wang, L.; Liu, Y.; Wang, S. A unified framework for population-based metaheuristics. Ann. Oper. Res. 2011, 186, 231–262. [Google Scholar] [CrossRef]
  18. Chauhdry, M.H.M. A framework using nested partitions algorithm for convergence analysis of population distribution-based methods. EURO J. Comput. Optim. 2023, 11, 100067. [Google Scholar] [CrossRef]
  19. Suzuki, J. A Markov chain analysis on simple genetic algorithms. IEEE Trans. Syst. Man Cybern. 1995, 25, 655–659. [Google Scholar] [CrossRef]
  20. Suzuki, J. A further result on the Markov chain model of genetic algorithms and its application to a simulated annealing-like strategy. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 1998, 28, 95–102. [Google Scholar] [CrossRef] [PubMed]
  21. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; U Michigan Press: Ann Arbor, MI, USA, 1975. [Google Scholar]
  22. Eiben, A.E.; Aarts, E.H.L.; Van Hee, K.M. Global convergence of genetic algorithms: A markov chain analysis. In Proceedings of the Parallel Problem Solving from Nature, Dortmund, Germany, 1–3 October 1991; Schwefel, H.P., Männer, R., Eds.; Springer: Berlin/Heidelberg, Germany, 1991; pp. 3–12. [Google Scholar] [CrossRef]
  23. Cao, Y.; Wu, Q. Convergence analysis of adaptive genetic algorithms. In Proceedings of the Second International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, Glasgow, UK, 2–4 September 1997; pp. 85–89, ISSN 0537-9989. [Google Scholar] [CrossRef]
  24. Eberbach, E. Toward a theory of evolutionary computation. Biosystems 2005, 82, 1–19. [Google Scholar] [CrossRef] [PubMed]
  25. Cuevas, E.; Diaz, P.; Camarena, O. Experimental Analysis Between Exploration and Exploitation. In Metaheuristic Computation: A Performance Perspective; Cuevas, E., Diaz, P., Camarena, O., Eds.; Intelligent Systems Reference Library, Springer International Publishing: Cham, Switzerland, 2021; pp. 249–269. [Google Scholar] [CrossRef]
  26. Chen, Y.; He, J. Exploitation and Exploration Analysis of Elitist Evolutionary Algorithms: A Case Study. arXiv 2020, arXiv:2001.10932. [Google Scholar]
  27. Xu, J.; Zhang, J. Exploration-exploitation tradeoffs in metaheuristics: Survey and analysis. In Proceedings of the 33rd Chinese Control Conference, Nanjing, China, 28–30 July 2014; pp. 8633–8638, ISSN 1934-1768. [Google Scholar] [CrossRef]
  28. Chen, J.; Xin, B.; Peng, Z.; Dou, L.; Zhang, J. Optimal Contraction Theorem for Exploration–Exploitation Tradeoff in Search and Optimization. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2009, 39, 680–691. [Google Scholar] [CrossRef]
  29. Kallenberg, L.C. Lecture Notes on Markov Decision Processes; University of Leiden: Leiden, The Netherlands, 2022. [Google Scholar]
  30. Pisinger, D.; Ropke, S. Large Neighborhood Search. In Handbook of Metaheuristics; Gendreau, M., Potvin, J.Y., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 99–127. [Google Scholar] [CrossRef]
Figure 1. Graphical summary of the MDP approach outlined in this paper.
Figure 2. Convergence coefficient ($\gamma^A$) for hill climbing on OneMax ($n = 30$).
Figure 3. Exploration–exploitation ($\delta^A$) and convergence ($\gamma^A$) coefficients for hill climbing on OneMax ($n = 30$).
Figure 4. Convergence coefficient ($\gamma^A$) for SA on OneMax ($n = 30$), using $\alpha = 0.995$ and an initial temperature of $T_0 = 10$.
Figure 5. Exploration–exploitation coefficient ($\delta^A$) for SA on OneMax ($n = 30$), along with the temperature, in a sample run with a maximum of 1000 iterations. Convergence is achieved around iteration 800.
