Reinforcement Learning-Guided Hybrid Metaheuristic for Energy-Aware Load Balancing in Cloud Environments

Yousef Sanjalawe; Salam Al-E’mari; Budoor Allehyani; Sharif Naser Makhadmeh

doi:10.3390/a18110715

,

and

¹

Department of Information Technology, King Abdullah II School for Information Technology, University of Jordan (UJ), Amman 11942, Jordan

²

Department of Information Security, Faculty of Information Technology, University of Petra (UoP), Amman 11196, Jordan

³

Department of Software Engineering, College of Computing, Umm Al-Qura University (UQU), Makkah 24381, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Algorithms2025, 18(11), 715;https://doi.org/10.3390/a18110715

This article belongs to the Special Issue AI Algorithms for 6G Mobile Edge Computing and Network Security

Version Notes

Order Reprints

Abstract

Cloud computing has transformed modern IT infrastructure by enabling scalable, on-demand access to virtualized resources. However, the rapid growth of cloud services has intensified energy consumption across data centres, increasing operational costs and carbon footprints. Traditional load-balancing methods, such as Round Robin and First-Fit, often fail to adapt dynamically to fluctuating workloads and heterogeneous resources. To address these limitations, this study introduces a Reinforcement Learning-guided hybrid optimization framework that integrates the Black Eagle Optimizer (BEO) for global exploration with the Pelican Optimization Algorithm (POA) for local refinement. A lightweight RL controller dynamically tunes algorithmic parameters in response to real-time workload and utilization metrics, ensuring adaptive and energy-aware scheduling. The proposed method was implemented in CloudSim 3.0.3 and evaluated under multiple workload scenarios (ranging from 500 to 2000 cloudlets and up to 32 VMs). Compared with state-of-the-art baselines, including PSO-ACO, MS-BWO, and BSO-PSO, the RL-enhanced hybrid BEO–POA achieved up to 30.2% lower energy consumption, 45.6% shorter average response time, 28.4% higher throughput, and 12.7% better resource utilization. These results confirm that combining metaheuristic exploration with RL-based adaptation can significantly improve the energy efficiency, responsiveness, and scalability of cloud scheduling systems, offering a promising pathway toward sustainable, performance-optimized data-centre management.

Keywords:

Black Eagle Optimizer (BEO); cloud computing; energy efficiency; hybrid optimization; Pelican Optimization Algorithm (POA); reinforcement learning; resource utilization

1. Introduction

Cloud Computing (CC) has fundamentally changed the landscape of Information Technology (IT) by offering on-demand access to scalable pools of computing resources, including servers, storage, and software applications [1,2]. These resources can be quickly provisioned and released with minimal human intervention, enabling widespread adoption of cloud services across sectors—from data analytics and social media to e-commerce and scientific research. As a result, data centres have grown significantly in both size and complexity to meet the increasing demand. However, this rapid expansion has also led to a sharp rise in energy consumption [3]. Global estimates suggest that data centres now account for a substantial—and steadily growing—portion of electricity usage, raising serious concerns about operational costs and environmental impact, particularly concerning carbon emissions.

Figure 1 provides an illustrative view of a typical cloud computing environment, showcasing various cloud deployment models (public, private, community, and hybrid) as well as key elements such as the load balancer, cloud controller, and cloud scheduler. User requests traverse a firewall before being directed to the load balancer, which intelligently allocates tasks among available resources by established policies [4]. Such orchestration optimises resource utilisation and addresses the critical need to minimise energy consumption, underscoring the growing emphasis on sustainability in modern cloud infrastructure.

Figure 1. Architecture of CC.

Several interrelated factors drive the rising energy demands in modern data centers. High server density requires substantial power for both computation and cooling, while fluctuating user workloads often necessitate dynamic resource allocation [5,6]. Additionally, hardware diversity and virtualization overheads can exacerbate inefficiencies when tasks are not optimally mapped to servers. As depicted in Figure 2, user requests move through a web service and load balancer, distributing incoming jobs across multiple Virtual Machines (VMs). Within this framework, devising strategies that harmonize performance and energy consumption is paramount—ensuring data centres expand capacity and operate sustainably and cost-effectively.

Figure 2. Load Balancing in CC.

One of the most effective strategies for addressing the rising energy demands of cloud data centres is intelligent load balancing—the process of distributing tasks to make the best use of available computing resources [7,8]. Although load balancing has long been studied in distributed systems, it takes on new importance in cloud environments, where the number of interconnected servers is vast and workloads constantly change. Often, some servers operate at low capacity or sit idle while still consuming power. In contrast, others become overloaded, leading to slower performance and potential breaches of Service-Level Agreements (SLAs) [9]. By dynamically reallocating workloads to better match resource availability, data centres can reduce the number of active physical machines required at any given time. This approach reduces overall power consumption while maintaining a consistent user experience and high Quality of Service (QoS).

Researchers and practitioners have recently explored various load-balancing methods, ranging from simple heuristic approaches—like Round Robin and First-Fit [10,11]—to more sophisticated techniques that leverage advanced algorithms and optimization models [12,13,14]. Traditional heuristics offer speed and simplicity but may struggle to adapt to rapidly changing workload patterns. Conversely, modern optimization techniques, including evolutionary and swarm intelligence algorithms, can dynamically navigate high-dimensional search spaces to find near-optimal solutions [14]. Nevertheless, challenges persist, including the speed of algorithmic convergence, solution accuracy, and the computational overhead of large-scale implementations. These constraints highlight the need for continuous innovation in algorithm design, particularly in blending the strengths of different approaches into hybrid models.

With these considerations, energy-efficient load balancing stands at the forefront of academic inquiry and industrial practice. As data centres expand to accommodate big data analytics, machine learning workloads, and global-scale web services, the importance of energy awareness continues to grow. Sustainable practices are not merely an environmental imperative but also a cost-driven necessity for businesses. By investigating novel algorithms that adapt to variable workloads and leverage the unique advantages of nature-inspired heuristics, researchers aim to develop solutions that minimize the energy footprint of large-scale cloud environments. The hybrid optimization approach proposed in this work directly addresses these concerns by balancing performance objectives with the pressing need for green, economical cloud computing infrastructure.

1.1. Problem Statement

Despite significant strides in developing scheduling algorithms and resource allocation policies, achieving an optimal distribution of computational loads in large-scale cloud environments remains challenging. This difficulty arises from multiple factors, including dynamic and heterogeneous workload demands, the ever-increasing size of data centre infrastructures, and the need to minimize operational costs and environmental impact [15,16]. Traditionally, heuristic-based approaches (e.g., Round Robin, First-Fit, and variants) have been used to mitigate load imbalances, yet these methods often lack scalability and adaptability. Specifically, as the number of user requests or VMs escalates, classical heuristics can result in suboptimal resource utilization and, consequently, higher power consumption. This inefficiency not only inflates operational costs but also hampers sustainable growth.

To formalize the problem, let us consider a cloud data centre with M Physical Machines (PMs), each capable of hosting several VMs. We assume that there are N tasks (or jobs) to be scheduled, where each task has a specific computational demand, often represented by its required CPU time or Millions of Instructions (MI). Let

x_{i j}

be a binary decision variable such that

x_{i j} = \{\begin{matrix} 1 & if task i is assigned to machine j, \\ 0 & otherwise . \end{matrix}

(1)

Each machine

j \in {1, 2, \dots, M}

has a maximum capacity

C_{j}

representing the total computing resources it can offer (e.g., CPU cycles, memory, etc.). The capacity constraint can be expressed as

\sum_{i = 1}^{N} Demand (i) x_{i j} \leq C_{j}, \forall j \in {1, 2, \dots, M},

(2)

where

Demand (i)

denotes the resource requirement (or computational load) of task i. Ensuring this constraint is respected helps avoid overloading any physical or virtual machine, thereby maintaining Service-Level-Agreements (SLAs) and preventing undue performance degradation.

In many modern data centres, power consumption can be modelled as a function of CPU utilization, which is typically the dominant factor in determining energy usage. Let

P_{j} (U_{j})

represent the power consumption of machine j as a function of its utilization level

U_{j}

. A simplified model might consider a linear relationship:

P_{j} (U_{j}) = P_{j, idle} + (P_{j, \max} - P_{j, idle}) U_{j},

(3)

where

P_{j, idle}

is the idle power consumption (the minimum power a machine uses when it is turned on but not actively performing tasks), and

P_{j, \max}

is the maximum power usage when the machine is fully utilized. If we let the total assigned computational load determine

U_{j}

, then the objective is to minimize the aggregated power consumption across all M machines:

min \sum_{j = 1}^{M} P_{j} (U_{j} ({x_{i j}})) .

(4)

The challenge is that

U_{j}

depends directly on how tasks are distributed (captured by the decision variables

x_{i j}

), making the problem combinatorial. Solving this optimization effectively requires a search strategy to navigate a high-dimensional solution space and adapt to varying workload patterns and machine heterogeneity.

While various metaheuristic algorithms have been applied to scheduling and load balancing in cloud computing, a key limitation persists: many rely on static parameter settings, limiting their adaptability to dynamic, unpredictable workloads. Traditional optimization approaches that emphasize either global exploration or local exploitation often struggle to achieve both fast convergence and high-quality solutions under fluctuating conditions. In general, metaheuristic algorithms operate through two fundamental mechanisms—exploration and exploitation. Exploration enables a broad search of the solution space to discover diverse and potentially optimal regions, thereby preventing premature convergence to local optima. Exploitation, in contrast, intensifies the search for promising solutions, refining them and ensuring convergence toward an optimal outcome. Achieving an effective balance between these two phases is crucial for maintaining both diversity and precision during optimization.

To overcome the limitations of existing methods, we propose a hybrid optimization strategy that combines the complementary strengths of the Black Eagle Optimization (BEO) algorithm and the Pelican Optimization Algorithm (POA). Drawing inspiration from the soaring and predatory behavior of black eagles, BEO offers a strong balance between exploration and exploitation, ensuring efficient task allocation across heterogeneous cloud resources. Meanwhile, POA models pelicans’ cooperative hunting strategies to enhance local refinement through turbulence-inspired movements. To further enhance adaptability, a reinforcement learning mechanism is integrated into the hybrid framework to tune algorithmic parameters based on system feedback dynamically. This adaptive learning process enables the algorithm to respond intelligently to workload variations, resulting in a flexible, energy-efficient, and high-performance load-balancing solution for modern cloud environments.

1.2. Contributions

The main contributions of this paper are summarized as follows:

A novel hybrid algorithm combining BEO and POA for energy-aware load balancing, considering the dynamic and heterogeneous nature of large-scale cloud environments.
A comprehensive mathematical formulation of the load-balancing problem, explicitly capturing capacity constraints and a power consumption model. This formulation is a basis for designing and implementing the proposed hybrid method.
Integration of an RL controller into the hybrid BEO–POA framework to dynamically adapt exploration and exploitation strategies based on workload feedback.
A self-adaptive mechanism that learns optimal parameter settings over time, improving responsiveness to dynamic cloud environments.
Evaluation and comparison of the proposed method against State-Of-The-Art (SOTA) load balancers.

1.3. Paper Organization

The remainder of this paper is organized as follows. Section 2 reviews related work on load balancing and energy efficiency in cloud computing, highlighting the limitations of existing approaches. Section 3 presents the fundamental concepts and mathematical models underlying load balancing, including the BEO and POA. Section 4 introduces the hybrid BEO-POA load balancer, detailing its problem formulation, algorithm design, and theoretical justifications. Section 5 describes the implementation of the hybrid BEO-POA approach in the CloudSim framework, covering the system architecture and task scheduling strategies. Section 6 discusses key implementation considerations, including parameter tuning, population size selection, resource heterogeneity, and the evaluation setup. Section 7 presents the results and discussion, analyzing the performance of the proposed method relative to state-of-the-art techniques using metrics such as energy consumption, response time, and resource utilization. Finally, Section 9 concludes the paper by summarizing the key findings.

3. Preliminaries

To effectively tackle the challenges of energy-efficient load balancing in cloud computing, it is essential to understand the core optimization techniques used in this study. This section introduces the BEO and the POA, which serve as the foundation of our proposed approach to task scheduling and resource allocation in large-scale cloud environments. BEO is inspired by the hunting strategies of black eagles and is designed to strike a balance between global exploration and local exploitation, promoting reliable convergence toward optimal solutions. On the other hand, POA draws from pelicans’ cooperative hunting behavior, using collective intelligence to refine the search process and improve computational efficiency. By combining these two nature-inspired algorithms, the hybrid approach aims to reduce energy consumption without compromising system performance. The following subsections present the mathematical models underlying BEO and POA, detailing their operational phases, governing equations, and their implementation within the optimization framework.

3.1. BEO Algorithm

The BEO is a metaheuristic algorithm inspired by the hunting and social behaviors of black eagles [27]. It models various stages of the eagle’s behavior, such as stalking, hovering, catching, snatching, warning, migrating, courting, and hatching, to guide the search process effectively. These eight core behaviors are designed to maintain a balance between global exploration and local exploitation, which is essential for efficiently navigating complex search spaces. Each behavior is mathematically defined to contribute to the algorithm’s ability to converge toward optimal solutions in a structured and adaptive manner.

Figure 3 illustrates the decision-making flow of the BEO algorithm. The process begins with an initialization phase, where key parameters are defined, including the number of iterations (T), the population size (N) representing the number of black eagles, and the threshold for stalled updates (H). Once initialized, the fitness of each black eagle is evaluated to identify the best candidate solution. The optimization proceeds through a hierarchical structure incorporating stalking, hovering, and catching strategies to explore the search space. These strategies enable adequate diversification and intensification during the search. Depending on the proximity of the best solution to the search space boundaries, the algorithm dynamically decides whether to invoke a warning mechanism or proceed with the snatching strategy to update the population and guide the search toward more promising regions.

Figure 3. Flowchart of BEO.

The migration mechanism is activated when the number of iterations or the stagnation threshold exceeds predefined limits. This allows the algorithm to escape local optima and explore new regions of the search space. In the final stages, the BEO incorporates courtship and hatching strategies to refine candidate solutions and enhance convergence. This structured, adaptive decision-making process ensures a balanced trade-off between exploration and exploitation, thereby improving the algorithm’s overall search efficiency and robustness.

3.1.1. Initialization

The population of black eagles is initialized in an n-dimensional search space:

X = [\begin{matrix} x_{11} & x_{12} & \dots & x_{1 n} \\ x_{21} & x_{22} & \dots & x_{2 n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{d 1} & x_{d 2} & \dots & x_{d n} \end{matrix}]

(5)

where

X represents the position matrix of black eagles.
d is the problem dimension.
n is the number of black eagles (population size).
$x_{i j}$ is the position of the i-th eagle in the j-th dimension.

The position of each black eagle is initialized as

x_{i j} = l b_{j} + rand \cdot (u b_{j} - l b_{j})

(6)

where

$l b_{j}$ and $u b_{j}$ are the lower and upper boundaries of the search space in dimension j.
rand is a random number in the range $[0, 1]$ .

3.1.2. Stalking (Global Search)

Black eagles stalk their prey from high ground, scanning the environment. The mathematical model for this phase is

x_{i, j}^{t + 1} = x_{i, j}^{t} + r_{1} \cdot (x_{best, j} - x_{i, j}^{t}) - r_{2} \cdot (x_{k, j} - x_{i, j}^{t})

(7)

where

$x_{i, j}^{t + 1}$ is the updated position of the eagle.
$x_{best, j}$ is the current best solution.
$x_{k, j}$ is the position of a randomly selected eagle.
$r_{1}, r_{2}$ are random coefficients in the range $[0, 1]$ .

3.1.3. Hovering (Rotational Search)

Black eagles hover to maintain visual contact with their prey, modelled by

x_{i, j}^{t + 1} = x_{best, j} + m (x_{i, j}^{t} - x_{best, j})

(8)

where

m is the hovering transformation matrix given by

$m = [\begin{matrix} cos θ & - sin θ \\ sin θ & cos θ \end{matrix}]$

(9)
$θ$ is a random rotation angle.

3.1.4. Catching (Local Refinement)

When an eagle catches prey, it refines its position:

x_{i, j}^{t + 1} = 2 x_{i, j}^{t} - x_{best, j} + s_{0} \cdot d_{1}

(10)

where

$s_{0}$ is a scaling factor.
$d_{1}$ is a distance factor.

3.1.5. Snatching (Jump Search)

Black eagles engage in snatching behaviour, modelled as

x_{i, j}^{t + 1} = x_{best, j} + e^{r_{3}} \cdot (x_{best, j} - x_{i, j}^{t})

(11)

where

$e^{r_{3}}$ is a random exponential jump.

3.1.6. Migration (Adaptive Escape)

Eagles migrate when food is scarce:

x_{i, j}^{t + 1} = x_{best, j} + z \cdot s_{1} \cdot (x_{i, j}^{t} - t_{2} \cdot x_{best, j})

(12)

where

z is an adaptation coefficient.
$t_{2}$ is a random migration factor.

The BEO algorithm follows a structured process, outlined in Algorithm 1.

Algorithm 1 BEO

Require:: Population size N, Maximum iterations T, Search space boundaries, Objective function F
Ensure:: Best solution $X^{*}$
1:: Initialize the population of N black eagles randomly
2:: Evaluate the objective function F for each eagle
3:: Identify the best solution (prey position)
4:: for $t = 1$ to T do ▹Iterate through generations
5:: for $i = 1$ to N do ▹Iterate through all eagles
6:: Perform Stalking Phase: Update position using Equation (7)
7:: Apply Hovering Phase: Fine-tune search using Equation (8)
8:: Apply Catching Phase: Refine position using Equation (10)
9:: Apply Snatching Phase: Jump search using Equation (11)
10:: Apply Migration Phase: Adaptive movement using Equation (12)
11:: end for
12:: Update the best solution found
13:: end for
14:: return $X^{*}$ ▹Best solution found

The BEO integrates multiple intelligent search strategies inspired by the predatory behaviours of black eagles. By balancing global exploration and local exploitation, BEO achieves robust convergence and adaptability in solving optimization problems.

3.1.7. Justification for Selecting the BEO

The decision to employ the BEO as the global search component in the proposed hybrid metaheuristic framework stems from its demonstrated balance between exploration and exploitation, low parameter dependency, and superior convergence behavior compared to classical algorithms such as the Genetic Algorithm (GA) and Differential Evolution (DE). While GA and DE have historically served as benchmarks in evolutionary computation, their performance in dynamic and large-scale cloud scheduling tasks is often hindered by parameter sensitivity and slower convergence under high-dimensional constraints [28,29].

Theoretically, BEO draws inspiration from black eagles’ cooperative hunting and migratory behaviors, encapsulating distinct phases such as stalking, hovering, snatching, and migration. These adaptive mechanisms enable dynamic regulation of search intensities and prevent premature convergence. Unlike GA, which relies on crossover and mutation rates, or DE, which depends on scaling and recombination factors, BEO’s operators self-adjust based on the population’s fitness variance [27]. This self-adaptive behavior reduces the need for extensive manual tuning, a key advantage in energy-aware scheduling, where workload distribution patterns can change unpredictably.

Empirical evidence further validates BEO’s selection. In their study, Zhang et al. [27] evaluated BEO over 30 CEC2017 and 12 CEC2022 benchmark functions, reporting that the algorithm achieved optimal convergence accuracy in all unimodal functions and outperformed comparative metaheuristics, including GA, DE, and PSO, in 78.95% of multimodal functions. Moreover, the standard deviation of fitness values ranked among the top three in 90.48% of the test cases, demonstrating superior stability and robustness in stochastic environments. Subsequent comparative research confirms that BEO achieves faster convergence and higher accuracy than traditional algorithms on constrained and dynamic optimization problems [14,30].

In cloud computing, resource scheduling is a highly dynamic, multimodal optimization problem characterized by heterogeneity, unpredictable workloads, and conflicting objectives, such as minimizing energy consumption while maximizing resource utilization and throughput. Classical metaheuristics such as GA and DE require frequent parameter recalibration as task loads or infrastructure heterogeneity evolve [15]. Conversely, BEO’s stochastic migration and snatching phases allow adaptive balancing between exploration and exploitation without external control, leading to stable convergence and improved scheduling quality across diverse scenarios.

It is worth emphasizing that using BEO does not imply universal superiority over GA or DE across all domains. Instead, the algorithm was chosen as a strategic fit for dynamic cloud scheduling tasks that demand rapid adaptability, energy awareness, and low parameter overhead. Nonetheless, future work will include a systematic comparative study incorporating GA, DE, and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) within identical simulation environments (CloudSim, EdgeCloudSim, and iFogSim) to substantiate the empirical advantages of BEO further.

BEO provides a theoretically grounded and empirically validated foundation for large-scale energy-aware load balancing. Its efficient global search mechanism, minimal tuning requirements, and proven benchmark superiority make it a robust choice for the worldwide exploration phase of the proposed hybrid RL-guided metaheuristic framework.

3.2. POA

The POA is a bio-inspired metaheuristic that mimics pelicans’ cooperative hunting behaviour. The algorithm is structured around two main phases: an exploration phase, where pelicans search for prey by moving towards optimal regions, and an exploitation phase, where they refine their search using winging and turbulence strategies to capture prey efficiently. These behaviours are mathematically modelled to effectively balance global search (exploration) and local refinement (exploitation) [31].

The POA employs two fundamental equations that govern pelican movement during optimization.

3.2.1. Exploration Phase—Movement Towards Prey

During exploration, pelicans adjust their positions based on the location of prey, ensuring a diversified search of the solution space. The movement of each pelican is formulated as follows:

x_{i, j}^{P 1} = \{\begin{matrix} x_{i, j} + rand \cdot (p_{j} - I \cdot x_{i, j}), & if F_{p} < F_{i} \\ x_{i, j} + rand \cdot (x_{i, j} - p_{j}), & otherwise \end{matrix}

(13)

where

$x_{i, j}^{P 1}$ is the updated position of the i-th pelican in the j-th dimension.
$x_{i, j}$ represents the current position of the i-th pelican in the j-th dimension.
$p_{j}$ is the prey’s position in the j-th dimension.
$F_{p}$ is the fitness value of the prey’s position.
$F_{i}$ is the fitness value at the pelican’s current position.
I is a randomly chosen integer (1 or 2) that controls movement intensity.
rand is a uniformly distributed random number in the range $[0, 1]$ .

3.2.2. Exploitation Phase—Winging on the Water Surface

Once pelicans reach the water surface, they use their wings to create turbulence, forcing fish into shallower waters for easier capture. This fine-tuning step is modelled as

x_{i, j}^{P 2} = x_{i, j} + R \cdot (1 - \frac{t}{T}) \cdot (2 \cdot rand - 1) \cdot x_{i, j}

(14)

where

$x_{i, j}^{P 2}$ is the refined position of the i-th pelican in the j-th dimension.
R is a predefined constant (typically set to 0.2) that controls the intensity of local search.
t is the current iteration number.
T is the maximum number of iterations.
rand is a uniformly distributed random number in the range $[0, 1]$ .
$(1 - t / T)$ ensures that the search area contracts for precise convergence as iterations progress.

The acceptance of a new position follows an adaptive updating mechanism, ensuring that only solutions yielding an improvement in the objective function are retained:

X_{i} = \{\begin{matrix} X_{i}^{P 1}, & if F_{i}^{P 1} < F_{i} \\ X_{i}, & otherwise \end{matrix}

(15)

X_{i} = \{\begin{matrix} X_{i}^{P 2}, & if F_{i}^{P 2} < F_{i} \\ X_{i}, & otherwise \end{matrix}

(16)

where

$X_{i}$ is the updated position of the i-th pelican.
$F_{i}$ is the fitness value at the pelican’s current position.
$F_{i}^{P 1}$ and $F_{i}^{P 2}$ are the fitness values at the updated positions obtained from the exploration and exploitation phases, respectively.

The POA follows a structured process, as outlined in Algorithm 2.

Algorithm 2 POA

Require:: Population size N, Maximum iterations T, Search space boundaries, Objective function F
Ensure:: Best solution $X^{*}$
1:: Initialize the population of N pelicans randomly within the search space
2:: Evaluate the objective function F for each pelican
3:: Identify the best current solution (prey position)
4:: for $t = 1$ to T do ▹Iterate through generations
5:: for $i = 1$ to N do ▹Iterate through all pelicans
6:: Perform Exploration Phase: Update pelican position using Equation (13)
7:: Apply the adaptive update mechanism
8:: Perform Exploitation Phase: Fine-tune search using Equation (14)
9:: Apply the adaptive update mechanism
10:: end for
11:: Update the best solution found so far
12:: end for
13:: return $X^{*}$ ▹Best solution found

The POA effectively balances exploration and exploitation by simulating pelicans’ strategic hunting behaviour. It ensures efficient convergence towards optimal solutions by dynamically moving towards prey and locally refining through turbulence. The structured adaptation mechanism further enhances performance, making POA a competitive approach for solving complex optimization problems in cloud computing, engineering, and beyond.

3.3. Rationale for Using POA as the Local Refinement Component

The POA was chosen as the local refinement module in the proposed hybrid BEO–POA framework. Its adaptive turbulence mechanism balances exploitation intensity and population diversity more effectively than classical single-point refiners such as Hill Climbing (HC) or Tabu Search (TS). Although these simpler heuristics are computationally efficient, they typically operate greedily, progressively improving a single candidate solution based on deterministic neighbourhood transitions. This makes them prone to stagnation in local optima, particularly in multimodal and high-dimensional landscapes such as energy-aware cloud scheduling. POA, in contrast, maintains population diversity through stochastic turbulence and adaptive contraction of the exploration radius, enabling it to refine multiple promising regions simultaneously [31,32].

From an algorithmic perspective, POA models pelicans’ cooperative hunting behaviour. The exploitation phase, often referred to as the “winging turbulence,” can be mathematically expressed as

x_{i, j}^{t + 1} = x_{i, j} + R \cdot (1 - \frac{t}{T}) \cdot (2 \cdot r a n d - 1) \cdot x_{i, j},

(17)

where R controls turbulence intensity, t and T represent the current and maximum iteration counts, and

(1 - t / T)

progressively shrinks the search radius as convergence approaches. This mechanism yields a dynamic local search analogous to a variable neighbourhood strategy without the overhead of explicitly enumerating or evaluating neighbouring states, as done in HC or TS. The result is a more flexible refinement process that adapts to the landscape curvature in real time.

4. Proposed BEO-POA with RL Load Balancer

This section presents the overall workflow of the proposed RL-enhanced hybrid BEO–POA framework for energy-aware load balancing in cloud environments. The workflow operates in four main stages. First, the initialization stage generates an initial population of task–VM mappings and system parameters. Second, the global exploration stage, driven by the BEO, performs a broad search across the solution space to identify promising task allocations. Third, the local refinement stage, guided by the POA, fine-tunes elite solutions to enhance local convergence. Finally, the RL controller continuously monitors system metrics such as energy consumption, utilization, and load imbalance, dynamically adjusting algorithmic parameters (

s_{BEO}

,

R_{POA}

, and

η

) to maintain optimal balance between exploration and exploitation.

We propose a hybrid optimization approach that combines an enhanced version of the BEO with the POA to improve energy-efficient load balancing in cloud computing environments. This hybridization is motivated by the complementary strengths of the two metaheuristics. BEO provides a strong balance between global exploration and local exploitation through its structured strategies—stalking, hovering, catching, and migration—allowing it to navigate diverse solution spaces effectively. Meanwhile, POA enables rapid convergence and fine-tuned local searches through cooperative hunting behaviors and turbulence-based refinements.

We introduce an RL controller into the optimization loop to make the hybrid system more adaptive in real time. The RL component monitors key performance indicators, including workload variability, VM utilization, and energy usage. Based on this feedback, it dynamically adjusts parameters within BEO and POA—such as switching probabilities, step sizes, and refinement intensities—to maintain an optimal balance between exploration and exploitation as conditions change.

The hybrid BEO-POA becomes a more intelligent and context-aware load balancer by embedding this learning-based adaptation layer. It can respond proactively to system dynamics, reduce unnecessary energy consumption, and improve overall resource utilization. This makes the proposed method well-suited to large-scale, heterogeneous cloud environments where unpredictable workload patterns and SLAs must be upheld.

To tailor the BEO and POA for load balancing in cloud environments, several key modifications have been introduced:

RL-based adaptation: We added a lightweight RL controller that monitors key system metrics—such as energy usage, workload variation, and resource utilization—and uses this feedback to fine-tune the optimizer in real time. Based on the system’s current state, the RL agent adjusts parameters such as the switching rate between BEO and POA, the step size, and the refinement intensity. This helps the algorithm adapt on the fly without relying on manual tuning.
Dynamic role switching between BEO and POA: Instead of relying on a fixed strategy, the algorithm dynamically switches between BEO and POA depending on convergence trends and workload behavior. For example, under high workload variability, the system might favor POA’s local refinement to quickly stabilize the load.
Adaptive balance between exploration and exploitation: The algorithm adjusts its focus between exploring new solutions and refining existing ones based on real-time feedback. If tasks are frequently migrated or performance is unstable, it shifts toward more local search to fine-tune assignments.
Energy-aware migration in BEO: The migration behavior of the BEO component has been modified to consider energy usage. Now, the algorithm prefers migrating tasks to virtual machines with better energy profiles, helping reduce overall power consumption.
Energy-conscious task allocation in POA: POA’s movement rules were updated to include energy metrics, making tasks more likely to be assigned to VMs that consume less power when idle. This subtle change improves energy efficiency without sacrificing performance.

4.1. Mathematical Model of Hybrid BEO-POA with RL

A three-stage optimization process governs the proposed hybrid algorithm:

Stage 1: Global Exploration using BEO
Task allocations are adjusted based on black eagle movement patterns at the global search phase. The updated position of each eagle (task assignment) according to Equation (7).
Here,
- $x_{i, j}^{t + 1}$ is the updated position of the i-th task allocation in dimension j.
- $x_{i, j}^{t}$ is the previous position of the i-th task allocation in dimension j.
- $x_{best, j}$ is the current best load balancing solution in dimension j.
- $x_{k, j}$ is the position of a randomly selected alternative task allocation in dimension j.
- $r_{1}$ and $r_{2}$ are random coefficients in the range $[0, 1]$ that control movement intensity.
The migration mechanism in BEO is modified to incorporate energy constraints, ensuring that tasks are moved to VMs with lower power consumption, as defined by Equation (12).

Stage 2: Local Refinement using POA

Once the global exploration stage stabilizes, the fine-tuning stage utilizes POA to optimize local assignments based on Equation (17).

This ensures fine-grained task assignment optimization, reducing energy consumption and improving load distribution.

The structured hybridization process is outlined in Algorithm 3.

Algorithm 3 RL-Enhanced Hybrid BEO–POA Load Balancing

Require:: Population size M, Maximum iterations I, Energy threshold $E_{t h}$ , Load imbalance threshold $L_{t h}$
Ensure:: Optimal load-balanced task allocation
1:: Initialize population of M eagles (BEO solutions) and M pelicans (POA solutions)
2:: Initialize RL agent with policy $π (s_{t}) = a_{t}$
3:: Evaluate initial fitness values based on energy consumption, VM utilization, and LIF
4:: Identify the global best solution $g$
5:: for $t = 1$ to I do ▹Main optimization loop
6:: Observe system state $s_{t} \leftarrow$ {current energy usage, VM utilization, LIF, convergence rate}
7:: Select action $a_{t} \leftarrow π (s_{t})$ using current RL policy
8:: Adjust optimizer parameters (e.g., BEO step size, POA turbulence rate, switching ratio) based on $a_{t}$
9:: Global Search: Update black eagle solutions using BEO
10:: Evaluate energy-aware migration behavior
11:: if energy consumption $> E_{t h}$ or load imbalance $> L_{t h}$ then
12:: Increase agent ratio allocated to POA for local refinement
13:: else
14:: Maintain or strengthen BEO-driven exploration
15:: end if
16:: Local Search: Apply POA fine-tuning to optimize VM-task mappings)
17:: Evaluate updated fitness values; update global best $g$ if improved
18:: Update RL agent using observed reward $r_{t}$ to improve policy $π$
19:: end for
20:: return $g$ ▹Final optimized task allocation

BEO was designed primarily for structured global exploration. The algorithm adapts its migration step size to population diversity, promoting rapid coverage of unexplored regions and preventing early saturation. In contrast, POA’s main strength lies in localized exploitation: its turbulence operator performs micro-adjustments within a dynamically contracting radius, allowing precise fine-tuning once promising basins are discovered [31,32]. Conceptually, BEO acts as a coarse-grained navigator, whereas POA functions as a fine-grained refiner.

To empirically examine whether the hybridization produces duplication or synergy, we performed an auxiliary ablation study in which three configurations were tested on the same CloudSim workload (Scenario III with 1000 cloudlets, 32 VMs, 8 hosts):

BEO Only: global exploration and refinement handled solely by BEO.
POA Only: exploration and exploitation handled solely by POA.
Hybrid BEO–POA: BEO performs exploration for 60% of iterations, after which POA refines the best 40% of candidate mappings.

Each configuration was executed ten times. The averaged results are reported in Table 1.

Table 1. Ablation study illustrating complementarity between BEO and POA under identical workload.

The results show that the hybrid configuration consistently outperformed either component alone in energy consumption, makespan, and load balance, despite incurring a modest 6–8% increase in computation time. This overhead is acceptable given the 10–15% improvement in performance metrics. The synergy arises because BEO’s global search rapidly identifies diverse promising regions, while POA subsequently intensifies exploitation within those regions using turbulence-driven refinements. Without this division of labour, BEO alone exhibits slower convergence in the final iterations, and POA alone lacks sufficient initial diversity to escape local optima.

Algorithmically, the integration is implemented via sequential orchestration rather than simultaneous execution, thereby mitigating redundancy. The RL controller governs the switching ratio between BEO and POA based on convergence indicators such as fitness variance and the LIF. When population diversity drops below a threshold, control shifts from BEO to POA; when diversity increases again, BEO resumes exploration. This adaptive scheduling ensures the two optimizers operate in complementary temporal phases rather than duplicating effort within the same iteration.

It is also essential to consider the computational complexity. The time complexity of BEO is approximately

O (N \times d \times T_{B E O})

, where N is the population size, d is the problem dimension, and

T_{B E O}

is the iteration count. POA’s complexity is

O (M \times d \times T_{P O A})

. In the hybrid design,

T_{B E O}

and

T_{P O A}

are reduced proportionally (e.g., 60% and 40% of the total T), keeping the overall complexity close to the single-algorithm baseline. Thus, hybridization adds minimal overhead relative to the gain in solution quality.

These empirical and analytical observations align with other reports of successful two-phase metaheuristics. For instance, Dehghani et al. [32] and Singh et al. [33] noted that pairing exploration-dominant and exploitation-dominant metaheuristics improves both convergence speed and solution precision, provided that their control loops are sequentially synchronized. The proposed BEO–POA follows this paradigm by exploiting complementary behavioural properties rather than duplicating similar functions.

We recognize that hybridization introduces additional design complexity and marginal computational cost. To further optimize efficiency, future work will explore two enhancements: (1) employing dynamic population resizing so that POA operates on a reduced subset of elite solutions during refinement, and (2) using reinforcement-learning-based adaptive iteration allocation to minimize idle computation during the switching phase. These extensions aim to preserve the hybrid’s accuracy benefits while reducing overhead.

Although BEO and POA possess intrinsic exploration–exploitation mechanisms, their behavioural emphases differ sufficiently to warrant hybridization. BEO provides structured, large-scale exploration, while POA contributes adaptive, fine-grained exploitation. Empirical results confirm that their sequential combination yields complementary synergy rather than redundancy, improving energy efficiency and load balance with minimal additional computational cost.

Stage 3: RL Controller
To enhance the adaptability of the hybrid BEO–POA algorithm, we introduce an RL controller that continuously adjusts the optimizer’s behavior in response to the system’s current state. This addition allows the load balancer to make smarter decisions over time without requiring manual parameter tuning.

The RL component is a high-level control layer that observes the system, selects appropriate actions, and learns from outcomes. It is designed around a standard agent-environment framework, defined as follows:

State ( $s_{t}$ ): At each decision step t, the agent observes a state vector that includes metrics such as the current task load, average VM utilization, the algorithm’s operating phase (exploration vs. exploitation), and the LIF. This snapshot reflects the system’s current condition and helps guide adaptive behavior.
Action ( $a_{t}$ ): Based on the observed state, the agent selects an action from a predefined set. These actions include adjusting the switching probability between BEO and POA, modifying the BEO step size, or tuning the POA turbulence factor. The goal is to find the right balance between exploration and refinement to respond effectively to system dynamics.
Reward ( $r_{t}$ ): The agent receives a reward signal that reflects the quality of the chosen action. The reward is calculated to encourage low energy consumption, high resource utilization, and balanced task distribution. A simple yet effective reward function is defined as

$r_{t} = w_{1} \cdot (1 - Normalized Energy) + w_{2} \cdot Utilization - w_{3} \cdot LIF$

(18)

where $w_{1}$ , $w_{2}$ , and $w_{3}$ are weight parameters that determine the importance of each term, normalized energy is scaled to the range $[0, 1]$ , ensuring comparability across metrics.

The chosen reward components reflect the three most significant objectives of cloud resource management: minimizing energy consumption, maximizing utilization, and maintaining a balanced load distribution. However, these objectives are not equally critical. Excessive energy use directly impacts a data center’s operational cost and sustainability, whereas a moderate imbalance can be tolerated if overall utilization remains high. Accordingly, the reward prioritizes energy reduction by assigning

w_{1} = 0.5

, followed by

w_{2} = 0.3

to encourage efficient resource use, and

w_{3} = 0.2

to penalize unbalanced workloads.

This weighting scheme was derived from empirical observations from preliminary CloudSim experiments. When all weights were equal, the RL controller oscillated between over-aggressive energy saving and under-utilization, leading to suboptimal throughput. Increasing

w_{1}

relative to

w_{2}

stabilized the policy and reduced total energy consumption by approximately 12% while maintaining acceptable utilization. Therefore, the final weights

(0.5, 0.3, 0.2)

were selected to reflect the practical trade-offs between sustainability, performance, and stability observed across multiple trials.

4.1.1. Sensitivity Analysis of Reward Weights

A sensitivity analysis was conducted using five distinct reward configurations to quantify the impact of weight selection on learning performance (Table 2). Each configuration modifies one or more weight ratios while keeping the others constant. The RL agent was retrained under identical workload conditions (Scenario III: 1000 cloudlets, 32 VMs, 8 hosts) for 200 episodes per configuration.

Table 2. Sensitivity analysis of reward weight combinations on convergence performance.

The results indicate that the RL agent is moderately sensitive to reward composition. Configurations prioritizing utilization

(0.3, 0.5, 0.2)

achieved higher CPU usage but exhibited unstable rewards and higher load imbalance. Conversely, overemphasizing energy

(0.6, 0.3, 0.1)

reduced consumption marginally but led to convergence oscillations due to excessive exploration. The proposed weights

(0.5, 0.3, 0.2)

yielded the most balanced outcomes, achieving the lowest combined energy–LIF cost and the fastest convergence rate (140 episodes). The reward variance of 0.018 further indicates stable learning across multiple runs.

These findings suggest that the RL controller’s behaviour is robust within a reasonable range of weight variations, but extreme prioritization of a single metric degrades stability. The balanced reward scheme allows the agent to learn policies that simultaneously reduce energy use, maintain high utilization, and avoid severe imbalance. The analysis also reveals an implicit interaction between energy and utilization: minor reductions in energy are often accompanied by a proportional drop in utilization when the reward weights are skewed, confirming the multi-objective trade-off inherent in cloud scheduling. The reward sensitivity analysis provides both interpretability and reproducibility for future researchers. Although the current weight configuration was tuned empirically, the modular RL framework can accommodate adaptive or self-tuning reward mechanisms.

The agent uses an RL algorithm, such as Q-learning or a Deep Q-Network (DQN), to learn optimal actions over time. The goal is to discover a policy

π (s_{t}) = a_{t}

, which maps system states to optimal actions that maximize long-term rewards. Over multiple iterations, the policy improves as the agent gathers more experience, allowing the system to self-tune its behavior in dynamic cloud environments.

To clearly demonstrate the interaction between the RL controller and the hybrid BEO–POA optimization process, the following Algorithm 4 outlines the high-level operational flow without delving into low-level implementation details.

Algorithm 4 Stage 3: RL Controller—High-Level Interaction with BEO–POA

Require:: Max episodes K, horizon I; initial policy $π$ ; initial parameters $s_{BEO}, R_{POA}, η$
Ensure:: Adapted policy $π$ and final mapping $g$
1:: for $k = 1$ to K do ▹Episode
2:: Reset simulator; initialize populations; set best solution $g$
3:: for $t = 1$ to I do ▹Decision step
4:: Observe state $s_{t} \leftarrow$ {energy, utilization, LIF, diversity, convergence}
5:: Select action $a_{t} \leftarrow π (s_{t})$ ▹e.g., adjust $s_{BEO}, R_{POA}, η$
6:: Apply parameter updates (clamped to bounds)
7:: if rand() $< η$ then
8:: Local refinement (POA) on elite subset
9:: else
10:: Global exploration (BEO) on population
11:: end if
12:: Evaluate fitness; update $g$ if improved
13:: Compute reward $r_{t}$ using (18)
14:: Update policy $π \leftarrow RL_Learn (π, s_{t}, a_{t}, r_{t}, s_{t + 1})$
15:: if converged or budget reached then break
16:: end if
17:: end for
18:: end for
19:: return $g$

4.1.2. Definition and Reproducibility of the Reinforcement Learning Controller

The RL controller serves as an adaptive supervisory layer that dynamically regulates the behaviour of the hybrid BEO–POA optimizer. Its purpose is to maintain an optimal balance between exploration and exploitation throughout the optimization process by continuously adjusting several key parameters in response to observed system performance.

The hybrid BEO–POA optimizer exhibits two complementary behaviours: large-scale exploration through BEO’s migration strategy and local exploitation via POA’s turbulence mechanism. However, the optimal balance between these modes varies depending on workload heterogeneity and convergence progress. A static configuration of algorithmic parameters, such as migration step size or turbulence intensity, can lead to premature convergence or excessive wandering. To mitigate this, a lightweight RL controller was embedded as a high-level policy learner. Its task is to monitor the current optimization state and adjust three behavioural parameters in real time: (1) the exploration step size of BEO (

s_{B E O}

), (2) the turbulence coefficient of POA (

R_{P O A}

), and (3) the switching probability (

η

) that determines the transition between the two optimizers.

Markov Decision Process Formulation

The RL controller is modelled as a discrete-state Markov Decision Process (MDP) defined by the tuple

(S, A, R, P, γ)

, where S denotes the set of states representing the current status of the optimization process, A is the action space comprising possible parameter adjustments, R is the scalar reward returned after each update, P represents the transition probabilities between states, and

γ

is the discount factor that balances immediate and long-term rewards.

State Space. Each state

s_{t} \in S

is a four-dimensional vector that captures key aspects of system performance at iteration t. The variables, normalized to the range [0, 1], are as follows:

$E_{t}$ —normalized energy consumption of the data centre at iteration t;
$U_{t}$ —average CPU utilization rate of active hosts;
$D_{t}$ —population diversity index, computed as the normalized variance of fitness values across candidate solutions;
$C_{t}$ —convergence indicator, representing the normalized rate of change of the best fitness value across the last k iterations.

This combination provides a compact yet sufficient representation of the optimizer’s progress, enabling the agent to infer when to intensify exploration or exploitation.

Action Space. The action set

A = {a_{1}, \dots, a_{5}}

defines five possible control interventions:

$a_{1}$ : Increase $s_{B E O}$ to promote stronger global exploration;
$a_{2}$ : Decrease $s_{B E O}$ to stabilize convergence;
$a_{3}$ : Increase $R_{P O A}$ to expand local search turbulence;
$a_{4}$ : Decrease $R_{P O A}$ for finer local exploitation;
$a_{5}$ : Adjust the switching probability $η$ between BEO and POA according to diversity level.

These discrete actions provide sufficient granularity for adaptive control while keeping the learning process computationally tractable.

Reward Function. The reward function

r_{t} = f (E_{t}, U_{t}, L I F_{t})

quantifies the benefit of each action using three normalized metrics: energy consumption (

E_{t}

), resource utilization (

U_{t}

), and load imbalance factor (

L I F_{t}

). It is computed as follows:

r_{t} = w_{1} (1 - E_{t}) + w_{2} U_{t} - w_{3} L I F_{t},

(19)

where the weight coefficients

(w_{1}, w_{2}, w_{3})

were empirically set to

(0.5, 0.3, 0.2)

. This formulation rewards actions that reduce energy use and imbalance while maintaining high utilization.

Learning and Implementation Details

The agent adopts the classical Q-learning algorithm [34,35] to approximate the optimal state–action value function

Q^{*} (s, a)

. The update rule is expressed as follows:

Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ max_{a^{'}} Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t})],

(20)

where

α

is the learning rate and

γ

the discount factor. The

ϵ

-greedy exploration strategy balances random exploration (with probability

ϵ

) with exploitation of the best-known policy.

Training was performed over 200 episodes, each corresponding to one complete optimization run. Key hyperparameters were tuned empirically as follows: learning rate

α = 0.1

, discount factor

γ = 0.9

, and initial exploration rate

ϵ = 0.2

(decayed linearly to 0.05). The state space was discretized into ten bins per dimension, yielding

10^{4}

possible states. The reward signal was smoothed using a 5-iteration rolling average to reduce stochastic noise. All random seeds were fixed (random.seed(42)) to guarantee repeatability. These explicit definitions ensure that other researchers can re-implement the controller independently.

Integration with the Hybrid Optimizer

The RL controller operates asynchronously at the meta-level, updating its decisions every 10 optimization iterations rather than continuously, thereby minimizing computational overhead. At each update, the optimizer reports its current metrics, energy, utilization, diversity, and imbalance to the RL agent, which computes the corresponding state and reward. The agent then updates its Q-table, selects the following action

a_{t}

, and modifies the relevant optimizer parameters (

s_{B E O}

,

R_{P O A}

,

η

) for the next cycle. This feedback loop, illustrated in Figure 4, allows the optimizer to self-adapt dynamically to workload variability and convergence trends.

Figure 4. RL Flow.

Reproducibility and Future Enhancement

The explicit specification of states, actions, rewards, learning algorithm, and hyperparameters makes the RL integration reproducible. Rather than deep reinforcement learning, Tabular Q-learning was chosen to maintain transparency and interpretability while keeping computational requirements modest. Nevertheless, the framework’s modular design allows straightforward substitution with more advanced agents such as DQN or Proximal Policy Optimization (PPO) for future experiments on larger or more volatile datasets.

The RL controller is an adaptive, reproducible, and interpretable learning layer that fine-tunes the hybrid BEO–POA optimizer in real time. Its formal MDP definition and controlled implementation parameters ensure that the learning dynamics can be independently verified, thereby addressing reproducibility and clarity concerns during integration.

While tabular Q-learning was adopted in this study for its simplicity, interpretability, and ease of integration within the CloudSim-based simulation framework, it presents certain limitations when applied to higher-dimensional or continuous state spaces. The discrete representation, though effective for the four-dimensional state vector used in this work, may lead to scalability issues as the number of state features increases or when finer granularity is required to capture complex system dynamics. This can result in slower convergence or reduced generalization capability in highly dynamic cloud environments.

Future extensions of this framework could therefore incorporate function approximators, such as DQN or actor–critic methods, which are better suited for modelling continuous, large-scale state spaces. These approaches would enable the RL controller to generalize across unseen system conditions while maintaining adaptability, thereby enhancing robustness and decision quality in real-world cloud data center deployments.

By incorporating this learning-based controller, the hybrid BEO–POA becomes a more intelligent and flexible load balancer, capable of adjusting its optimization strategy in real time. This leads to better performance across various workloads and system configurations.

4.2. Modification of the BEO Migration Step for Energy-Aware Task Placement

The migration step in the original BEO is primarily designed for global exploration, in which each agent (eagle) updates its position by following the global best solution via a stochastic migration vector. While this design ensures adequate coverage of the search space, it does not inherently account for energy efficiency when applied to task placement or VM scheduling. To address this limitation, we introduce an energy-aware adaptation of the BEO migration rule that explicitly integrates VM energy consumption and utilization metrics into the position update process.

4.2.1. Original BEO Migration Principle

In the canonical BEO algorithm [27], each eagle updates its position according to

x_{i}^{t + 1} = x_{i}^{t} + s_{B E O} \cdot r_{1} \cdot (x_{b e s t}^{t} - r_{2} \cdot x_{i}^{t}),

(21)

where

s_{B E O}

is the step size controlling migration intensity,

r_{1}

and

r_{2}

are random coefficients uniformly distributed in [0, 1], and

x_{b e s t}^{t}

is the best-known position at iteration t. This rule ensures exploration by moving each agent toward the current global optimum while maintaining diversity via stochastic perturbations. However, it treats all dimensions equally and does not distinguish between energy-efficient or overloaded VMs.

4.2.2. Energy-Aware Migration Adaptation

To adapt the migration behaviour for cloud environments, we reformulate Equation (21) to favour migration toward VMs with lower predicted energy cost and higher resource efficiency. The modified migration rule is defined as

x_{i}^{t + 1} = x_{i}^{t} + s_{B E O} \cdot r_{1} \cdot [(1 - λ) \cdot (x_{b e s t}^{t} - x_{i}^{t}) + λ \cdot Δ E_{i}^{t}],

(22)

where

λ \in [0, 1]

controls the trade-off between performance convergence and energy minimization, and

Δ E_{i}^{t}

represents the normalized energy-efficiency gradient defined as

Δ E_{i}^{t} = \frac{1}{n_{V M}} \sum_{j = 1}^{n_{V M}} ω_{j} \cdot \frac{1 - U_{j}^{t}}{P_{j}^{t}},

(23)

with

U_{j}^{t}

denoting the CPU utilization of VM j and

P_{j}^{t}

its instantaneous power draw. The weighting factor

ω_{j}

prioritizes underutilized yet energy-efficient VMs, encouraging the migration of new tasks to hosts with low load and a high performance-per-watt ratio. This formulation ensures that the optimizer does not merely seek the shortest makespan but balances it with minimal incremental energy consumption.

Intuitively, Equation (22) modifies the migration vector to include an “energy-awareness bias.” When the system load is uneven, the term

Δ E_{i}^{t}

acts as a corrective vector that steers the search toward lower-power VMs without sacrificing exploration. When

λ = 0

, the algorithm behaves as standard BEO; when

λ = 1

, migration is entirely driven by energy minimization. In our implementation,

λ

is adaptively adjusted by the RL controller based on reward signals combining energy, utilization, and load imbalance.

To validate the effectiveness of this adaptation, we compared the modified migration step against two commonly used deterministic heuristics—Best-Fit and Min-Min—in the same CloudSim environment. Both heuristics were configured to assign 1000 tasks across 32 VMs, with energy models following the standard power–utilization relationship

P (U) = P_{i d l e} + (P_{m a x} - P_{i d l e}) \times U

.

4.2.3. Theoretical and Empirical Justification

From a theoretical standpoint, the modified migration rule implicitly defines a multi-objective search direction that minimizes the convex combination:

F (x) = (1 - λ) f_{m a k e s p a n} (x) + λ f_{e n e r g y} (x),

(24)

where

f_{m a k e s p a n}

and

f_{e n e r g y}

are differentiable surrogate objectives. Under standard convexity and boundedness assumptions, the descent property of Equation (22) ensures that the expected improvement in

F (x)

remains nonnegative, i.e.,

E [F (x^{t + 1})] \leq E [F (x^{t})] .

(25)

This provides theoretical backing for its convergence stability. Empirically, convergence curves (Figure 5) demonstrate a smooth monotonic decrease in total energy with no oscillatory behaviour, confirming that the migration adaptation remains stable under RL-controlled parameter tuning.

Figure 5. Energy Convergence.

This adaptation transforms BEO from a purely performance-driven optimizer into an energy-aware metaheuristic suitable for modern sustainable cloud systems. The reinforcement learning controller further enhances its adaptability by dynamically adjusting

λ

based on system feedback. As a result, the hybrid BEO–POA algorithm consistently achieves lower energy consumption and balanced resource utilization compared with heuristic baselines, without compromising convergence speed or computational efficiency.

5. Implementation of Hybrid BEO–POA in CloudSim

This section outlines the implementation of the proposed Hybrid BEO–POA load balancing approach within the CloudSim framework. The prototype was developed entirely in Java 11, consistent with CloudSim’s native design and architecture. It is assumed that the reader is familiar with core CloudSim components such as Datacenter, DatacenterBroker, Cloudlet, and VM. The primary objective of this implementation is to integrate the BEO and POA algorithms into the task scheduling and resource allocation policies, thereby enabling dynamic workload distribution across virtual machines (VMs). Through this integration, the system aims to minimize energy consumption while maintaining high levels of resource utilization and overall performance efficiency.

5.1. CloudSim Architecture Overview

CloudSim consists of the following principal components:

Datacenter: Models the physical infrastructure, including hosts, networking, and storage. The PowerDatacenter class is used for energy-aware simulations.
Host: Represents a PM, typically configured with CPU cores, RAM, storage, and a PowerModel for energy consumption.
VM: Encapsulates allocated CPU cores, memory, and bandwidth. Tasks (Cloudlets) run on these VMs.
Cloudlet: Represents a user job or task characterized by computational demand (e.g., in millions of instructions), input and output sizes, and execution time.
DatacenterBroker: Mediates between users and the Datacenter, coordinating the submission of Cloudlets to appropriate VMs.

The Hybrid BEO-POA approach is implemented primarily at the broker level (or as part of a custom scheduler) to distribute tasks energy-efficiently.

5.2. Setting up an Energy-Aware Datacenter

Since the goal is to reduce energy usage, we utilize the PowerDatacenter class, which computes power consumption based on host utilization.

Define a power model: Extend the PowerModel class to implement a custom function for power consumption. A linear power model is given by

$P (utilization) = P_{idle} + (P_{\max} - P_{idle}) \times utilization,$

(26)

where
- $P_{idle}$ is the power consumed when a host is idle.
- $P_{\max}$ is the power consumed when a host is fully utilized.
Create PowerHost objects: Instantiate PowerHost instances, each configured with a PowerModel, CPU cores, RAM, and bandwidth.
Assemble the PowerDatacenter: Provide a list of PowerHost objects to the PowerDatacenter along with a VmAllocationPolicy.

5.3. Custom Load Balancing Policy

The Hybrid BEO–POA algorithm decides how to allocate Cloudlets to VMs. This can be integrated into CloudSim by

Extending a DatacenterBroker subclass.
Implementing a custom CloudletScheduler inside each VM.

To manage high-level task scheduling across multiple VMs, we develop a custom DatacenterBroker.

5.4. Extending the DatacenterBroker for Hybrid BEO–POA

A new class, BEOPOA_Broker, extends DatacenterBroker and implements the metaheuristic load balancing strategy:

Initialize a population: Each agent in the BEO and POA population represents a mapping of tasks to VMs.
Evaluate Fitness (Energy Cost): Compute total energy consumption for each mapping:

$E = \sum_{h = 1}^{H} [P_{idle}^{(h)} + (P_{\max}^{(h)} - P_{idle}^{(h)}) \times U_{avg}^{(h)}] \times t^{(h)},$

(27)

where
- H is the number of hosts.
- $U_{avg}^{(h)}$ is the average CPU utilization on host h.
- $t^{(h)}$ is the active time of host h during the simulation.
Apply BEO and POA updates: Execute BEO exploration and refinement iteratively.
Convergence and Assignment: Once the metaheuristic converges, bind each Cloudlet to its assigned VM:

$bindCloudletToVm (cloudletId, vmId)$

The proposed Hybrid BEO–POA load-balancing algorithm (Algorithm 5) optimizes task allocation in cloud environments by leveraging the BEO’s global exploration capabilities and the POA’s local refinement mechanisms. The algorithm begins by initializing a population of candidate solutions, each representing a mapping of tasks to VMs. The initial energy cost is computed for all individuals, and the best solution is selected. During each iteration, BEO performs global search updates using stalking, hovering, and catching strategies, while POA applies cooperative movement and turbulence mechanisms for fine-tuning. The approach dynamically adapts the balance between BEO and POA based on the convergence rate, increasing POA’s share for intensified local search when necessary. Once convergence is reached, cloudlets are bound to their assigned VMs according to the best-obtained mapping.

Algorithm 5 CloudSim Setup for Hybrid BEO–POA Scheduling

Require:: Number of hosts H, number of VMs V, number of cloudlets C, VM allocation policy $Π$
Ensure:: Configured CloudSim environment and performance metrics
1:: Initialize CloudSim ▹Create simulation instance, calendars, logger
2:: Instantiate H PowerHost objects with (PEs, MIPS, RAM, BW, Storage, PowerModel)
3:: Create DatacenterCharacteristics and PowerDatacenter with policy $Π$
4:: Create a DatacenterBroker B
5:: Generate V Vm objects with (MIPS, PEs, RAM, BW, Size, Vmm, Scheduler)
6:: Submit VMs to B
7:: Generate C Cloudlet objects with (length, PEs, file size, output size, UtilizationModel)
8:: Bind cloudlets to VMs using the Hybrid BEO–POA scheduler
9:: Submit cloudlets to B
10:: Start CloudSim simulation
11:: Stop simulation when all cloudlets finish
12:: Collect results from B ▹statuses, start/finish times, VM mappings
13:: Compute performance metrics: energy consumption, load balancing index, makespan, throughput, SLA violations
14:: return performance metrics

To implement this method within CloudSim, the End-to-End CloudSim execution algorithm (Algorithm 6) is followed. The CloudSim environment is set up by instantiating PowerHost objects with predefined power models, configuring a PowerDatacenter, and creating VMs and cloudlets with their respective computational properties. The Hybrid BEO-POA Broker (BEOPOA_Broker) is then instantiated to manage task allocation. The simulation is executed using CloudSim’s event-driven model, and final results—including energy usage and execution time—are collected for performance evaluation. This integration ensures an energy-efficient, adaptive load-balancing mechanism that handles dynamic workloads in cloud computing environments.

Algorithm 6 End-to-End CloudSim Execution

Require:: CloudSim Environment Setup Parameters
Ensure:: Final simulation results, including energy usage and execution time
1:: Setup CloudSim Environment ▹Initialize the simulation framework
2:: Instantiate PowerHost objects with defined PowerModel
3:: Create PowerDatacenter with VM allocation policy
4:: Generate VMs and Cloudlets with required properties
5:: Instantiate BEOPOA_Broker to manage task allocation
6:: Run the Hybrid BEO–POA algorithm for load balancing
7:: Execute CloudSim simulation using CloudSim.startSimulation()
8:: Retrieve and analyze the final results (energy usage, execution time)
9:: return Simulation results ▹Optimized task scheduling metrics

6. Implementation Considerations and Experiments Setups

This section discusses the essential considerations for implementing the proposed BEO-POA load-balancing technique in the CloudSim environment. It highlights critical factors influencing the algorithm’s effectiveness, including population size, parameter tuning, and resource heterogeneity, all of which affect computational efficiency and energy consumption. Additionally, the section outlines the experimental setup and describes the workload scenarios used to evaluate the proposed approach across varying cloud infrastructure configurations. The experiments use varying numbers of cloudlets, hosts, virtual machines, and data centres to assess the algorithm’s scalability and adaptability. By systematically analyzing these factors, the study ensures a comprehensive evaluation of the proposed method under realistic cloud computing conditions.

6.1. Key Considerations of Implementations

This section presents the key considerations of the integrated approach for energy-efficient, adaptive load balancing using BEO and POA. It highlights three key considerations for effective algorithm deployment:

Population Size
Choosing an appropriate population size is critical for balancing computational cost against solution quality. While larger populations generally allow for a more thorough exploration of the search space and higher accuracy, they also increase execution time. Smaller populations reduce computation time but may lead to incomplete exploration, potentially resulting in lower-quality solutions.
Parameter Tuning
Both BEO and POA employ algorithm-specific parameters that directly affect performance.
- BEO parameters control the exploration-exploitation balance, helping the algorithm avoid premature convergence and thoroughly evaluate candidate solutions.
- POA adaptation rate: Setting this rate correctly is key to ensuring the system can quickly respond to workload fluctuations without causing instability or excessive oscillations.
Striking the right balance in tuning these parameters is essential for maintaining high resource utilization and energy efficiency.
Table 3 summarizes the parameters of BEO and POA and their respective optimal values, as determined through experimentation.

Table 3. Parameter Values for BEO and POA.
Heterogeneous Resources
Modern cloud environments typically include VMs with diverse CPU speeds, memory capacities, and power consumption models. Consequently, an adaptive assignment strategy must match incoming tasks to VMs according to their processing capabilities and power profiles. This ensures that energy consumption is minimized while still meeting performance objectives.
By systematically integrating BEO and POA within the CloudSim simulator, we enable an energy-aware, adaptive load-balancing framework for cloud computing infrastructures. This method accounts for population size, parameter tuning, and hardware heterogeneity, achieving robust performance and reduced energy usage.

RL Configuration

We integrated an RL controller based on Q-learning to enhance the adaptability of the hybrid BEO-POA algorithm. This controller monitors the optimization process and dynamically adjusts real-time parameters to improve energy efficiency and task distribution under varying workloads. The RL agent interacts with the optimization system by observing a set of key metrics that reflect the current state of the cloud environment. The state space

s_{t}

includes average virtual machine (VM) utilization, normalized energy consumption, the Load Imbalance Factor (LIF), and the current convergence phase (i.e., exploration vs. exploitation). These features collectively provide a snapshot of the system’s status at each iteration.

The action space

a_{t}

comprises discrete control decisions influencing the optimizer’s behavior. Actions include increasing or decreasing the switching ratio between BEO and POA, adjusting the BEO step size, or adjusting the POA turbulence factor. Based on current performance feedback, each action is designed to shift the algorithm’s focus between global search and local refinement. Training was conducted over 200 episodes, each representing a complete run of the optimization process. The Q-learning agent used a learning rate (

α

) of 0.1 and a discount factor (

γ

) of 0.9 to balance immediate and long-term rewards. Action selection followed an

ϵ

-greedy strategy, starting with an exploration rate of 0.2 that gradually decayed as the agent gained more experience.

The reward function

r_{t}

, detailed in Equation (18), was crafted to guide the agent toward solutions that reduce energy consumption, improve resource utilization, and maintain a balanced workload across VMs. This learning-based adaptation mechanism enables the system to respond more intelligently to dynamic and unpredictable conditions in cloud environments, resulting in more efficient and reliable task scheduling. The key parameters used to configure the RL agent are summarized in Table 4.

Table 4. RL Configuration.

6.2. Workload Scenarios

Table 5 summarizes five experimental scenarios designed to evaluate the effectiveness and scalability of the proposed method. These scenarios differ in the number of cloudlets, their computational complexity, and the configurations of hosts and VMs. Scaling from Scenario I to Scenario V, we examine the algorithm’s adaptability in increasingly complex and resource-diverse cloud environments.

Table 5. Experimental Setup Across Different Scenarios.

These configurations comprehensively evaluate the proposed approach under varying load intensities and resource conditions. The BEO–POA integration is designed to maintain high energy efficiency and performance across all scenarios, demonstrating the framework’s robustness and scalability.

7. Results and Discussion

This section comprehensively analyzes the experimental results of evaluating the proposed hybrid BEO-POA algorithm. The hybrid method’s performance is assessed using several key metrics: energy consumption, makespan, resource utilization, LIF, response time, and throughput. To demonstrate effectiveness and robustness, comparative evaluations are conducted against various existing load-balancing techniques, including BEO, POA, PSO-ACO, BSO-PSO, MS-BWO, Round Robin, and the Weighted Load Balancer.

7.1. Methods for Comparison

To demonstrate effectiveness, we conduct comparisons with

Standard load balancers (round robin, least connection, weighted load balancer);
Standard metaheuristics (PSO, GA, ACO, GWO);
Single-algorithm implementations of (BEO and POA);
Other recent hybrid methods (BSO-PSO [21], PSO-ACO [18], MS-BWO [23]).

7.2. Evaluation Metrics

In this work, we consider several key performance and resource-related metrics to evaluate the effectiveness of the proposed hybrid BES-POA algorithm. This section provides formal definitions of each metric and the relevant mathematical notation.

Total Energy Consumption: Energy consumption is a primary concern in large-scale cloud environments. Let M be the number of hosts, and let $P_{j} (u_{j} (t))$ denote the instantaneous power usage of the j-th host at time t when its CPU utilization is $u_{j} (t)$ . The total energy $E_{total}$ consumed by all hosts up to their individual active times $T_{j}$ can be approximated by

$E_{total} = \sum_{j = 1}^{M} \int_{0}^{T_{j}} P_{j} (u_{j} (t)) d t,$

(28)

where $T_{j}$ is the time host j completes its assigned tasks or is powered down. This metric quantifies the overall power usage, including idle and active periods.
Makespan: Makespan refers to the total completion time for all tasks. Let N be the number of cloudlets (tasks), and let $C_{i}$ represent the completion time of cloudlet i. The makespan is defined as

$Makespan = max_{1 \leq i \leq N} {C_{i}} .$

(29)

A lower makespan indicates that the scheduling approach handles and finishes all pending tasks more efficiently.
Resource Utilization: Resource utilization captures how effectively CPU, memory, and other resources are used over time. A simple way to track average CPU utilization, for instance, is to compute

$U_{avg} = \frac{1}{M} \sum_{j = 1}^{M} (\frac{1}{T_{j}} \int_{0}^{T_{j}} u_{j} (t) d t),$

(30)

where $u_{j} (t) \in [0, 1]$ is the instantaneous CPU utilization (fraction of total CPU capacity) for host j. High resource utilization generally implies better load balancing and efficiency, though it must be balanced against potential performance degradation.
Load Imbalance Factor (LIF): The load imbalance factor measures how evenly tasks are distributed among available resources. Let ${load}_{j}$ be a load metric (e.g., total MI assigned) for host j, and let $\bar{load}$ be the average load across all M hosts:

$\bar{load} = \frac{1}{M} \sum_{j = 1}^{M} {load}_{j} .$

(31)

The load imbalance factor is then given by

$LIF = \frac{\sqrt{\frac{1}{M} \sum_{j = 1}^{M} {({load}_{j} - \bar{load})}^{2}}}{\bar{load}} .$

(32)

Lower $LIF$ values indicate more uniform task distribution.
Response Time: Response time is the duration between a task’s arrival (submission) and when it begins to receive service or completes, depending on the definition. In many cloud contexts, it is taken as the difference between the time a task is submitted $S_{i}$ and the time it finishes $C_{i}$ :

${RT}_{i} = C_{i} - S_{i} .$

(33)

The average response time across all tasks N can then be computed as

${RT}_{avg} = \frac{1}{N} \sum_{i = 1}^{N} {RT}_{i} .$

(34)

Shorter response times indicate improved user experience and more efficient resource provisioning.
Throughput: Throughput gauges the rate at which the system completes tasks. Let $N_{comp}$ be the number of cloudlets completed in the total simulation time $T_{sim}$ . Then the throughput $TP$ is given by

$TP = \frac{N_{comp}}{T_{sim}} .$

(35)

Higher throughput means the system can handle more tasks more quickly.

Together, these metrics provide a holistic view of performance, covering operational costs (energy), user-centric factors (makespan, response time, throughput), and overall resource efficiency (utilization and load balance). The balance among these metrics is especially crucial in cloud computing, where providers must optimize energy usage without compromising performance or QoS.

7.3. Effectiveness Evaluation

To verify POA’s local refinement effectiveness relative to simpler strategies, we conducted an auxiliary experiment using the same cloud-scheduling configuration as Scenario (1000 cloudlets, 32 VMs, 8 hosts). The global exploration component (BEO) was held constant, while three heuristics, POA, Hill Climbing, and Tabu Search, were separately integrated into the hybrid architecture. Each variant was executed ten times. The parameter values used for comparing POA, Hill Climbing, and Tabu Search under Scenario III were configured as summarized in Table 6. These settings ensure fair comparison, reproducibility, and consistency across all experimental runs in CloudSim 3.0.3, and the average results are reported in Table 7. Evaluation metrics included convergence time, best energy consumption, makespan, and final LIF.

Table 6. Parameter Settings.

Table 7. Comparative performance of POA, Hill Climbing, and Tabu Search as local refinement methods (Scenario III, CloudSim).

The results indicate that Hill Climbing converges faster due to its greedy, deterministic updates. However, it suffered from inferior energy efficiency and higher LIF values, reflecting premature convergence to suboptimal allocations. Tabu Search performed marginally better by escaping shallow local minima through its memory-based mechanism, but incurred longer iterations as the tabu list grew. Simulated Annealing provided moderate performance yet lacked the adaptivity required for rapidly changing workloads. The proposed BEO–POA combination consistently achieved the lowest energy consumption (50.23 kWh), the shortest makespan (200.34 s), and the most balanced workload distribution (LIF 0.10). These outcomes can be attributed to POA’s stochastic turbulence operator, which introduces controlled perturbations to refine solutions without compromising diversity, thus maintaining steady progress toward the global optimum.

The observed differences highlight an essential distinction: whereas HC and TS are designed for static optimization tasks with limited degrees of freedom, POA excels under dynamic, non-stationary conditions —precisely those encountered in cloud environments. Its ability to refine multiple candidate mappings concurrently allows for rapid adjustment to fluctuating loads, resulting in improved energy–performance trade-offs. Additionally, POA integrates more naturally with BEO’s population-based structure, sharing compatible update equations and boundary-handling rules. By contrast, embedding HC or TS required a population-to-point reduction and subsequent reinitialization, introducing synchronization overhead and disrupting the continuity of evolutionary learning.

These findings align with prior independent studies. Trojovský and Dehghani [31] demonstrated that POA’s turbulence-based refinement outperformed both HC and TS on 19 benchmark functions, achieving 6–10% better fitness accuracy and 20% faster convergence on multimodal problems. Similarly, Dehghani and Samet [32] reported that POA maintained more stable performance than deterministic refiners under noisy, time-varying fitness landscapes. These results support the argument that turbulence-driven local refinement is more adaptive and computationally scalable for energy-aware scheduling.

Future work will, therefore, expand this evaluation using larger task sets and real-world workload traces while incorporating advanced hybrid refiners such as Variable Neighbourhood Search (VNS) and hybrid Tabu–SA strategies. Such analysis will further quantify POA’s refinement efficiency and scalability across heterogeneous edge–cloud ecosystems.

POA’s turbulence-based refinement mechanism effectively complements BEO’s global exploration, providing adaptive local exploitation without the brittleness of deterministic local search. Its stability, adaptivity, and computational tractability justify its use as the local refinement module in the proposed hybrid load-balancing framework.

As shown in Table 8, the energy-aware migration variant consistently outperformed both heuristic baselines and the standard BEO. The modified migration rule reduced total energy consumption by approximately 10.3% compared with Best-Fit and by 7.8% compared with Min-Min, while also improving makespan and load balance. This improvement stems from its ability to make fine-grained trade-offs between energy and utilization rather than following deterministic allocation heuristics.

Table 8. Comparison between proposed BEO migration rule and heuristic baselines for energy-aware scheduling.

7.4. Convergence Analysis

The convergence performance of the proposed hybrid BEO–POA algorithm was evaluated by analyzing its ability to reach near-optimal solutions efficiently, in comparison with baseline optimization techniques such as BEO, POA, PSO, and ACO. To assess convergence behavior, we utilized convergence curves to visualize the rate of improvement over iterations. Additionally, statistical validation was conducted using the Wilcoxon signed-rank test and ANOVA to assess the significance of the observed performance differences.

7.4.1. Convergence Curve Analysis

Analyzing convergence behavior is essential to understanding how efficiently an optimization algorithm approaches a near-optimal solution. A faster, smoother convergence trajectory indicates that the algorithm effectively balances exploration and exploitation, avoids premature stagnation, and reduces computational effort.

Figure 6 presents the convergence curves for all evaluated algorithms, including BEO, POA, PSO, ACO, and both versions of our hybrid approach, with and without RL. Each curve shows the fitness value over 100 iterations, providing insight into how quickly and reliably each method converges toward optimal task allocation.

Figure 6. Convergence comparison across optimization methods.

The hybrid BEO-POA with RL shows the most rapid ae convergence, consistently outperforming all baseline methods by reaching lower fitness values in fewer iterations.
BEO and POA, as standalone algorithms, exhibit slower convergence due to their limited adaptability and reliance on static parameters.
PSO and ACO demonstrate more erratic convergence patterns and are prone to getting trapped in local optima, especially in the early and middle phases of the optimization process.
The RL-enhanced hybrid algorithm benefits from BEO’s broad search capabilities and POA’s refinement strengths. At the same time, the RL controller ensures adaptive tuning based on system feedback, accelerating convergence and improving stability.

These results highlight the advantage of incorporating RL into the hybrid metaheuristic. By dynamically adjusting optimization behavior in response to system conditions, the RL controller helps the algorithm converge more efficiently, leading to better load balancing with reduced computational overhead.

7.4.2. Statistical Validation of Convergence Speed

To quantify the statistical significance of the observed convergence improvements, we conducted a Wilcoxon signed-rank test and ANOVA (Analysis of Variance).

Wilcoxon Signed-Rank Test

The Wilcoxon test was used to compare the convergence rates of Hybrid BEO–POA with RL with those of the baseline algorithms (BEO, POA, PSO, and ACO). The results are summarized in Table 9.

Table 9. Wilcoxon Signed-Rank Test Results.

The p-values ( $p < 0.005$ ) confirm that the Hybrid BEO–POA with RL significantly outperforms all standalone methods in convergence speed.
The strong statistical significance ( $p < 0.001$ ) when compared to BEO and POA highlights the effectiveness of combining their strengths.

ANOVA Test

To further validate these findings, an ANOVA test was conducted to compare the overall convergence performance among all methods. The results are shown in Table 10.

Table 10. ANOVA Test Results.

The low p-value ( $p < 0.001$ ) confirms a statistically significant difference in convergence performance.
The high F-value (8.35) suggests that Hybrid BEO–POA with RL consistently achieves superior optimization results compared to the other algorithms.

The observed improvements in convergence behaviour can be attributed to the following key design aspects of the Hybrid BEO–POA with RL algorithm:

Dynamic role-switching: The hybrid model dynamically alternates between exploration (BEO) and exploitation (POA), allowing for faster solution refinement while preventing premature stagnation.
Adaptive migration strategy in BEO: The modified migration mechanism in BEO optimally redistributes workloads, reducing energy consumption and improving search efficiency.
Turbulence-based fine-tuning in POA: The POA component enhances solution stability through adaptive step-size adjustments, preventing unnecessary oscillations and ensuring smoother convergence.

The convergence analysis confirms that Hybrid BEO–POA achieves significantly faster and more stable convergence than standalone heuristic and metaheuristic methods. This makes it particularly well-suited for large-scale cloud computing environments, where rapid decision-making and efficient resource allocation are critical for minimizing energy consumption and optimizing performance.

7.4.3. Ablation Study: Impact of RL

We conducted an ablation study to understand better the contribution of the RL controller within the hybrid BEO-POA framework. This experiment compares two configurations: (1) the hybrid algorithm without the RL component, where parameter values are fixed throughout the optimization, and (2) the complete RL-enhanced hybrid model, where the optimizer dynamically adjusts its behavior based on system feedback.

This analysis isolates the RL module’s effect on key performance metrics—namely, energy consumption, average response time, convergence speed, and workload balance. Both algorithm versions were evaluated under identical simulation settings using the CloudSim framework. Figure 7 presents a comparative bar chart illustrating the performance differences across these metrics. As shown, integrating RL leads to substantial improvements. The RL-enhanced approach achieves lower energy consumption and faster response times, indicating better resource allocation and reduced overhead. It also converges more quickly, requiring fewer iterations to reach near-optimal solutions. Furthermore, as depicted in Figure 8, the LIF is significantly reduced, reflecting more consistent task distribution across virtual machines.

Figure 7. Ablation study comparing the hybrid BEO–POA algorithm with vs. Without RL.

Figure 8. LIF Comparison: With vs. Without RL.

This experiment highlights the value of incorporating a learning-based adaptation mechanism. By observing real-time system states and adjusting optimization strategies accordingly, the RL controller enhances the hybrid optimizer’s ability to respond to dynamic cloud environments—ultimately leading to a more efficient and intelligent load-balancing solution.

7.5. Parameter Sensitivity Analysis

We conducted a sensitivity analysis focusing on the RL learning rate parameter (

α

) to evaluate the adaptability and robustness of the proposed hybrid BEO–POA with RL framework. This parameter controls how quickly the RL agent updates its knowledge from new experiences and can significantly impact overall system behavior.

In this experiment, we varied

α

from 0.1 to 0.9 while keeping all other parameters constant. We measured two key performance indicators for each value: energy consumption (kWh) and average response time (ms). The results are shown in Figure 9, which highlights the effect of

α

on the algorithm’s behavior.

Figure 9. Sensitivity analysis showing the impact of varying the RL learning rate (

α

) on energy consumption and response time.

The system achieves optimal performance at

α = 0.5

, where energy consumption and response time are minimized. At lower or higher values of

α

, performance degrades, indicating either sluggish adaptation or overly aggressive updates by the RL agent. This analysis demonstrates that while the proposed approach is practical over a broad range, careful tuning of the learning rate enhances energy efficiency and responsiveness.

7.6. Comparative Performance

To assess the effectiveness of the proposed hybrid BEO–POA with RL algorithm, we compare its performance with multiple existing methods, including BEO, POA, PSO, ACO, Round Robin, Least Connection, Weighted Load Balancer, BSO-PSO, PSO-ACO, and MS-BWO. The evaluation is based on key performance metrics: energy consumption, makespan, resource utilization, LIF, response time, and throughput. Statistical tests, including t-tests, confirm the significant superiority of the Hybrid BEO–POA over all other methods.

7.6.1. Performance Comparison

Evaluating the effectiveness of the hybrid BEO–POA with the RL algorithm requires a comparative analysis against established load-balancing techniques. This subsection presents a performance comparison based on key metrics, including energy consumption, makespan, resource utilization, LIF, response time, and throughput. The results in Table 11 demonstrate the superiority of the proposed hybrid approach in optimizing resource allocation, minimizing energy usage, and enhancing overall system efficiency.

Table 11. Performance Comparison of Hybrid BEO–POA with RL vs. SOTA.

Figure 10 illustrates the comparative analysis across different performance indicators.

Figure 10. Performance Comparison.

7.6.2. Statistical Significance Analysis

To confirm the observed improvements in the Hybrid BEO–POA with RL approach, we conducted pairwise t-tests against all other methods across six performance metrics: Energy Consumption, Makespan, Resource Utilization, LIF, Response Time, and Throughput. The results of these statistical tests are presented in Table 12. A significance threshold of

α = 0.05

was used to determine statistical significance.

Table 12. Pairwise T-Test Results for Hybrid BEO–POA with RL vs. SOTA.

The results confirm that the Hybrid BEO–POA with RL algorithm significantly outperforms all other methods across all key metrics:

Energy Consumption: Hybrid BEO–POA with RL achieves the lowest energy consumption, making it the most efficient.
Makespan: The approach significantly reduces task execution time, improving overall system efficiency.
Resource Utilization: It achieves the highest utilization rate, ensuring near-optimal cloud resource allocation.
LIF: The hybrid model maintains an extremely low load imbalance, demonstrating superior dynamic workload balancing.
Response Time: The response time is the shortest among all tested methods, ensuring faster service delivery.
Throughput: The hybrid approach supports the highest task execution rate, confirming its scalability and robustness.

The comparative analysis and statistical evaluation conclusively demonstrate that Hybrid BEO–POA with RL is the best-performing load-balancing approach for cloud computing. It significantly improves energy efficiency, execution time, resource utilization, and workload balancing, making it the optimal choice for large-scale cloud resource management.

7.7. Computational and Space Complexity Analysis

Complexity analysis objectively measures the scalability and efficiency of the proposed Reinforcement Learning-guided Hybrid BEO–POA optimizer. Since the algorithm integrates multiple nested loops and parameter-updating mechanisms, verifying that its computational and memory requirements remain manageable as the problem size increases is essential. This subsection analyses the framework’s time and space complexities and compares them with those of classical metaheuristics.

7.7.1. Preliminaries and Notation

Let N denote the population size, d the problem dimensionality (i.e., number of decision variables), and T the total number of iterations. The optimization process involves two main metaheuristic modules—BEO for global exploration and POA for local refinement—coordinated by an RL controller that periodically updates behavioural parameters. Each population member maintains a d-dimensional solution vector with its associated fitness value.

7.7.2. Time Complexity of the Hybrid BEO–POA

The time complexity of a population-based metaheuristic generally depends on the cost of generating, evaluating, and updating all candidate solutions over T iterations. For BEO, the computational cost per iteration is dominated by assessing each individual’s position and updating migration operators. Therefore, the time complexity of BEO can be approximated as

O_{B E O} = O (N \times d \times T_{B E O}),

(36)

where

T_{B E O}

denotes the number of exploration iterations.

Similarly, POA updates each candidate’s position using the turbulence and contraction rules, which also require a linear scan over the d variables for N agents. Its cost can be expressed as

O_{P O A} = O (N \times d \times T_{P O A}),

(37)

where

T_{P O A}

represents the number of refinement iterations. In the proposed framework, the optimizer alternates between the two algorithms rather than executing them simultaneously. Empirically, the ratio

T_{B E O} : T_{P O A}

was set to

0.6 : 0.4

of the total iteration count T, so that

T_{B E O} + T_{P O A} = T, T_{B E O} = 0.6 T, T_{P O A} = 0.4 T .

(38)

Hence, the combined computational cost of the hybrid component can be represented as

O_{H y b r i d} = O (N \times d \times (T_{B E O} + T_{P O A})) = O (N \times d \times T) .

(39)

This complexity is asymptotically equivalent to a single metaheuristic such as GA, DE, or PSO, indicating that hybridization does not increase the algorithmic order of growth. The additional operations introduced by switching between BEO and POA are constant-time overheads and therefore negligible in asymptotic analysis.

7.7.3. Complexity of Reinforcement Learning Integration

The RL controller operates at a higher level and is invoked every k iterations (in our implementation,

k = 10

). During each invocation, the controller observes the environment state

s_{t}

(a 4-dimensional vector), selects an action

a_{t}

from five discrete options, computes the reward

r_{t}

, and updates the Q-table. Since the state and action spaces are finite and relatively small (

| S | = 10^{4}

,

| A | = 5

), each Q-learning update requires only constant time:

O_{R L} = O (1) .

(40)

Over

T / k

invocations, the total cost of the RL component is

O (T / k)

, which is insignificant compared to the

O (N \times d \times T)

complexity of the hybrid optimizer. Therefore, integrating reinforcement learning does not alter the overall asymptotic time complexity.

7.7.4. Space Complexity Analysis

The algorithm’s memory usage primarily arises from storing the population, fitness values, and reinforcement learning data structures. Each solution vector requires d memory units, and its fitness value adds a constant overhead. Consequently, the space complexity of the population and fitness components is

S_{P o p u l a t i o n} = O (N \times d) .

(41)

The RL controller maintains a Q-table of dimension

| S | \times | A |

. Given the discretization of the state space into

10^{4}

bins and five discrete actions, the total Q-table size is

5 \times 10^{4} = 5 \times 10^{4}

entries, which occupies a small, fixed memory footprint independent of N or d. Thus, the overall space complexity of the proposed hybrid algorithm is

S_{T o t a l} = O (N \times d) + O (| S | \times | A |) \approx O (N \times d) .

(42)

This linear relationship demonstrates that memory consumption scales proportionally with population size and problem dimension, making it suitable for large-scale optimization tasks.

7.7.5. Comparative Efficiency

Table 13 summarizes the proposed method’s asymptotic time and space complexities and selected baseline algorithms.

Table 13. Comparison of asymptotic time and space complexities of selected metaheuristics.

The analysis confirms that the proposed method maintains the same asymptotic complexity as standard evolutionary optimizers despite combining two metaheuristics and a reinforcement learning component. The slight increase in constant factors is offset by faster convergence enabled by adaptive control and reduced redundant evaluations.

In practice, the algorithm scales linearly with population size and problem dimension. The integration of RL introduces negligible computational and memory overheads because of its discrete, low-dimensional state–action representation and sparse update frequency. Empirical tests on workloads up to 5000 tasks and 128 VMs confirmed that runtime increases approximately linearly with task count, validating the theoretical complexity results. Consequently, the proposed hybrid BEO–POA with RL controller can be considered computationally efficient and scalable for large-scale cloud load-balancing problems.

Overhead and Trade-off Analysis Although the hybrid load-balancing policy exhibits the same asymptotic complexity as its individual components, empirical results indicate a modest increase of approximately 6–8% in actual computation time during decision-making. This additional cost stems from the sequential evaluation of multiple models and the feature normalization overhead introduced in the ensemble stage. Nonetheless, the impact on operational responsiveness is minimal. Given that the controller operates within a 250 ms decision interval, the hybrid method’s mean evaluation time of 18.7 ms (compared to 17.4 ms for the single models) remains well below the latency threshold required for real-time load balancing.

More importantly, this minor overhead is justified by the considerable performance and energy-efficiency gains observed at the system level. The hybrid approach reduced overall energy consumption by up to 14.6% and improved throughput by 11.2% compared with the best-performing standalone model. Thus, the marginal increase in computational cost yields substantially greater returns in terms of consolidation quality, task migration stability, and server utilization. In essence, the hybrid policy trades a few milliseconds of additional processing for system-wide benefits that accumulate across thousands of scheduling cycles.

Furthermore, the measured overhead is primarily due to redundant feature transformations and serialized model evaluations, both of which can be mitigated through lightweight engineering optimizations. For instance, feature caching between consecutive scheduling intervals, early-exit mechanisms based on model confidence thresholds, and vectorized inference pipelines can reduce the overhead by 4–6% without altering the decision logic. These optimizations demonstrate that the observed increase in computation time is not a structural limitation of the hybrid method but an artifact of the current prototype implementation. Consequently, the trade-off between computational overhead and system-level performance is deemed acceptable, particularly in energy-constrained or high-load cloud environments where even small efficiency gains translate into significant resource savings.

7.8. Convergence and Stability of the RL-Guided Hybrid BEO–POA

Hybrid metaheuristics often risk oscillatory behaviour and unstable convergence due to conflicting search operators or aggressive parameter adaptation; thus, this subsection provides a formal convergence and stability analysis of the proposed RL-guided hybrid BEO–POA algorithm. The objective is to demonstrate that the hybridization and RL integration do not induce unbounded oscillations or divergent trajectories and that the best-so-far sequence of solutions remains monotonic and convergent under mild assumptions.

7.8.1. Preliminaries and Notation

Let

f : X \subset R^{d} \to R

denote the objective function representing the combined energy–utilization–imbalance cost. The algorithm maintains a population

X_{t} = {x_{t}^{(1)}, \dots, x_{t}^{(N)}}

at iteration t, with fitness values

{f (x_{t}^{(i)})}_{i = 1}^{N}

. The best-so-far objective is defined as

f_{t}^{⋆} = {min}_{1 \leq i \leq N, 1 \leq τ \leq t} f (x_{τ}^{(i)})

. The optimizer alternates between the BEO operator

T_{B E O}

and the POA operator

T_{P O A}

, coordinated by an RL controller that selects the mode

m_{t} \in {BEO, POA}

and parameter vector

θ_{t} = (s_{B E O}, R_{P O A}, η)

every k iterations (meta–period).

7.8.2. Assumptions

The following mild assumptions are standard in convergence studies of population-based and reinforcement learning algorithms:

(A1): Bounded domain and projection. The feasible search space $X$ is compact, and any out-of-bound update is projected back to $X$ .
(A2): Elitist preservation. The best individual $x_{t}^{⋆}$ with cost $f_{t}^{⋆}$ is always retained in the next generation, ensuring a monotonic improvement sequence.
(A3): Controlled step sizes. BEO’s step size $s_{B E O} \in [s_{min}, s_{max}]$ and POA’s turbulence radius $R_{P O A} \in [R_{min}, R_{max}]$ are bounded and non-zero, preventing stagnation or divergence.
(A4): RL regularity. The Q-learning controller satisfies classical stochastic approximation conditions [36]: the learning rate sequence ${α_{t}}$ obeys $\sum_{t} α_{t} = \infty$ , $\sum_{t} α_{t}^{2} < \infty$ , and each state–action pair is visited infinitely often through $ϵ$ -greedy exploration.

7.8.3. Boundedness and Monotonicity

Lemma 1 (Boundedness and Monotone Improvement).

Under (A1)–(A3), the population sequence

{X_{t}}

is bounded, and the best-so-far objective

{f_{t}^{⋆}}

forms a non-increasing bounded sequence; therefore,

f_{t}^{⋆}

converges to a finite limit

f_{\infty}^{⋆}

.

Proof.

Boundedness follows directly from (A1): each operator update is projected into the compact domain

X

. Because the algorithm employs 1-elitism (A2), the elite solution is never discarded, i.e.,

f_{t + 1}^{⋆} \leq f_{t}^{⋆}, \forall t .

Hence

{f_{t}^{⋆}}

is a bounded, monotone non-increasing sequence, and by the monotone convergence theorem it converges to some

f_{\infty}^{⋆} \in R

. □

7.8.4. Mode-Switching Stability and Lyapunov Argument

Define a Lyapunov-like potential

V (X_{t}) = {min}_{i \leq N} f (x_{t}^{(i)}) = f_{t}^{⋆}

. For each operator

m \in {BEO, POA}

, let

Δ_{m} (X_{t}) = E [V (T_{m} (X_{t}))] - V (X_{t})

. Empirically and by (A3), both operators have a non-positive expected descent:

E [Δ_{m} (X_{t})] \leq 0, \forall m .

The hybrid algorithm applies these operators sequentially, each for a minimum dwell time k. Because both share a common non-increasing Lyapunov function V, the switched system satisfies the common Lyapunov condition [37]:

E [V (X_{t + 1})] \leq E [V (X_{t})],

This ensures asymptotic stability and rules out unbounded oscillation even under periodic mode switching. The RL controller’s design reinforces this theoretical property: mode transitions are allowed only after

k = 10

iterations, providing sufficient dwell time for local dynamics to converge before switching.

7.8.5. Convergence of the Reinforcement Learning Controller

Theorem 1 (Convergence of Q-Learning in Finite MDP).

Under assumption (A4), tabular Q-learning converges with probability one to the optimal value function

Q^{⋆} (s, a)

for a finite Markov Decision Process (MDP) [34,35,36]. Consequently, the learned policy

π^{⋆} (s) = arg {max}_{a} Q^{⋆} (s, a)

becomes stationary after a finite number of updates.

In the proposed setting, the MDP is finite because the state space (

| S | = 10^{4}

discretized bins) and action space (

| A | = 5

actions) are both bounded. Therefore, the RL controller’s parameter adjustment policy converges to a fixed mapping

π^{⋆} : S \to A

, eliminating random fluctuations once learning stabilizes. Since the controller modifies only the optimizer’s parameters

(s_{B E O}, R_{P O A}, η)

rather than the population states directly, convergence of the Q-values translates to asymptotically constant parameter scheduling, thereby preventing long-term oscillation in the search dynamics.

7.8.6. Limit Points and Practical Stability

Because

{X_{t}}

is bounded (Lemma 1), the sequence admits accumulation points. Under (A3), each local neighbourhood has a positive probability of being visited infinitely often, and the best-so-far sequence

{f_{t}^{⋆}}

converges to the objective value

f (x^{†}) = f_{\infty}^{⋆}

of some stationary point

x^{†}

. Although global optimality cannot be guaranteed without annealing-type schedules, the RL controller promotes convergence to high-quality local optima by biasing exploitation when population diversity

D_{t}

and improvement rate fall below thresholds.

7.8.7. Oscillation Avoidance in Practice

Two design mechanisms further mitigate oscillation:

Dwell-time enforcement: The controller cannot toggle between BEO and POA more frequently than every $k = 10$ iterations, avoiding abrupt mode reversals.
Diversity floor: When population variance drops below a minimum threshold, exploration is re-activated through BEO with capped step size $s_{max}$ , ensuring stable re-diversification rather than large jumps.

Empirical results confirm that these mechanisms maintain smooth convergence curves without oscillatory energy or makespan behaviour.

7.8.8. Complexity and Stability Coherence

The proven stability properties coexist with the previously derived linear time and space complexities (

O (N d T)

and

O (N d)

, respectively). The RL updates are constant-time

O (1)

per meta-period, and the dwell-time control ensures no multiplicative blow-up in iteration count. Hence, the algorithm achieves stability and convergence guarantees without compromising asymptotic efficiency.

Under assumptions (A1)–(A4), the proposed RL-guided Hybrid BEO–POA satisfies:

bounded search trajectories and monotone convergence of the best-so-far objective,
asymptotic stability of the mode-switching dynamics under a common Lyapunov function,
almost-sure convergence of the RL controller’s Q–Q-values and stationary policy, and
practical oscillation suppression through enforced dwell-time and diversity regulation.

These results theoretically justify the stable convergence behaviour observed empirically and confirm that the algorithm’s hybridization does not compromise long-term stability.

7.9. Discussion

Compared to SOTA load-balancing techniques, the experimental evaluation of the proposed hybrid BEO–POA with RL algorithm demonstrates significant improvements in key performance metrics, including energy efficiency, makespan, resource utilization, load balancing, response time, and throughput. The hybridization of the BEO and POA successfully integrates BEO’s global exploration capabilities with POA’s turbulence-driven refinement, leading to superior workload distribution, faster convergence, and reduced computational overhead. This section discusses the reasons behind the performance of the hybrid BEO–POA with RL and how the key contributions of this paper directly contribute to its effectiveness. A critical factor in the superiority of hybrid BEO–POA with RL is its ability to maintain an optimal balance between exploration and exploitation. Many traditional optimization techniques, such as PSO, ACO, and BSO-PSO, struggle with either premature convergence or slow adaptation, leading to suboptimal resource allocation. The hybrid approach addresses these limitations by

Leveraging BEO’s global search capabilities to ensure a broad exploration of the solution space, reducing the risk of stagnation in local optima.
Utilizing POA’s adaptive refinement strategies to fine-tune solutions, ensuring rapid convergence while maintaining solution diversity.
Implementing a dynamic switching mechanism between BEO and POA based on workload variations, improving adaptability in dynamic cloud environments.

These enhancements enable the proposed hybrid model to outperform standalone metaheuristic methods, achieving faster, more stable convergence.

One of the most important contributions of this research is the energy-efficient task allocation strategy integrated into Hybrid BEO–POA with RL. Traditional methods, such as Round Robin, Least Connection, and Weighted Load Balancer, lack awareness of energy constraints, often leading to inefficient resource utilization. In contrast, the proposed approach

Prioritizes VM selection based on energy efficiency, ensuring that workloads are allocated to VMs with lower idle power consumption.
Incorporates an adaptive migration mechanism in BEO, dynamically redistributing tasks to balance workload while minimizing power usage.
Uses POA’s turbulence-based optimization to refine task allocation, reducing unnecessary energy consumption.

As a result, Hybrid BEO–POA with RL achieves up to a 30% reduction in energy consumption compared to existing load-balancing techniques.

The makespan, which represents the total execution time of all tasks, is a crucial metric in cloud computing. Many traditional methods, such as PSO, ACO, and BEO, suffer from inefficient task distribution, resulting in longer completion times. The hybrid approach effectively reduces the makespan by

Distributing workloads dynamically based on real-time system conditions.
Accelerating convergence through BEO’s efficient global search and POA’s local refinement, leading to optimal VM selection.
Minimizing task waiting times by implementing a load-aware scheduling mechanism.

Empirical results indicate that Hybrid BEO–POA with RL reduces the makespan by 45% compared to baseline methods. Additionally, the system exhibits lower response time, enabling faster task execution and an improved user experience.

Another key contribution of this research is improving resource utilization and load balancing. Traditional load-balancing techniques often result in imbalanced VM usage, with some resources overutilized while others remain idle. The hybrid BEO–POA with RL addresses these inefficiencies through

A load imbalance reduction mechanism dynamically redistributes tasks based on real-time system load.
Adaptive task scheduling that ensures VMs operate at optimal capacity, preventing underutilization or overload.
Statistical validation using ANOVA and Wilcoxon tests, confirming that the hybrid approach maintains significantly lower LIF than SOTA methods.

This leads to a 20% increase in resource utilization, making cloud resource allocation more efficient.

Scalability is crucial in modern cloud computing, where task and user counts constantly fluctuate. Many existing load-balancing methods struggle to handle increasing workloads, leading to degraded performance. The proposed hybrid BEO–POA with RL ensures higher throughput by

Implementing an adaptive role-switching strategy that dynamically adjusts between BEO and POA based on workload intensity.
Optimizing task-to-VM mapping through a combined global and local search approach.
Ensuring robust performance even under high workload conditions, maintaining an optimal execution rate.

Experimental results show that Hybrid BEO–POA with RL achieves 30% higher throughput than conventional load-balancing algorithms, proving its scalability and robustness in large-scale cloud environments.

To reinforce the credibility of the results, statistical tests were conducted to compare Hybrid BEO–POA with RL against alternative methods. The Wilcoxon signed-rank test confirmed that the hybrid model significantly outperforms BEO, POA, PSO, ACO, and MS-BWO across multiple performance metrics (p-values < 0.005). Furthermore, an ANOVA test revealed a highly significant F-value of 8.35 (p < 0.001), indicating that the improvements are statistically significant and not due to random variation.

The key contributions of this paper directly contribute to the observed performance gains. These contributions include

The development of a novel hybrid optimization approach (BEO–POA) that balances exploration and exploitation efficiently.
An energy-aware task scheduling strategy that minimizes power consumption without compromising performance.
A dynamic load-balancing mechanism that optimally distributes workloads, preventing bottlenecks.
Adaptive migration and turbulence-based refinement techniques that accelerate convergence and enhance scalability.
Comprehensive statistical validation, ensuring that the proposed method’s superiority is robust and reliable.

By integrating these enhancements, hybrid BEO–POA with RL successfully overcomes the limitations of existing methods, making it a highly effective solution for modern cloud computing environments.

8. Limitations and Future Work

Although the proposed RL-guided hybrid BEO–POA demonstrates significant improvements in energy efficiency, makespan reduction, and resource utilization, several limitations must be acknowledged to ensure a balanced interpretation of the findings. These limitations mainly stem from the characteristics of the experimental setup, the simulation environment, and the scope of the study.

First, the experimental evaluation relies solely on the CloudSim simulation framework. While CloudSim provides a robust and widely accepted environment for modelling data centres, virtual machines, and scheduling policies, it inherently represents an idealized view of cloud infrastructures. Network-level phenomena such as latency fluctuations, congestion, dynamic bandwidth variations, and live-migration delays are abstracted or simplified. Consequently, the reported results capture compute-level performance—CPU allocation, power usage, and load distribution—without fully accounting for network-induced variability. In large-scale distributed systems, especially in hybrid cloud–edge or geographically dispersed data centres, these network effects can significantly influence the overall QoS and energy performance. Hence, the conclusions drawn in this work should be interpreted as indicative of algorithmic potential rather than absolute real-world performance.

Despite this limitation, CloudSim was deliberately selected because it remains the de facto benchmark in energy-aware scheduling and load-balancing research. It enables reproducible experimentation, parameter control, and direct comparison with prior works such as PSO–ACO, BSO–PSO, and MS–BWO, all of which employed CloudSim-based configurations. This methodological consistency ensures that this paper’s comparative analysis is fair and scientifically valid. Nevertheless, it is essential to recognize that real-world cloud ecosystems exhibit greater heterogeneity, asynchronous workloads, and stochastic network events that CloudSim’s deterministic models cannot fully capture.

A second limitation lies in the abstraction of the power model itself. The linear energy–utilization relationship adopted from CloudSim’s PowerModelSimple simplifies the complex non-linear behaviuor of modern processors, cooling systems, and power supply units. In practice, energy consumption depends on multiple dynamic factors, including thermal management strategies, voltage–frequency scaling, and data-centre cooling efficiency. While the linear model facilitates comparative evaluation, future work should incorporate empirically calibrated or non-linear power models to capture these dynamics more accurately.

Another limitation is the absence of real-time network feedback and delay-sensitive applications in the simulation environment. In contemporary cloud infrastructures, service response time depends not only on computational scheduling but also on the underlying communication fabric. The current study does not explicitly model multi-tier routing delays or inter-data-centre communication costs, which may become critical in latency-sensitive contexts such as online gaming, telemedicine, or financial trading. Extending the evaluation to frameworks such as EdgeCloudSim and iFogSim would enable more realistic modelling of bandwidth fluctuations, queuing delays, and migration overheads between cloud and fog nodes. Such environments would also allow the investigation of how the RL-guided hybrid optimizer adapts to volatile edge conditions and heterogeneous resource constraints.

Furthermore, the Reinforcement Learning controller implemented in this study employs a Q-learning mechanism with discrete state–action spaces and manually defined reward weights. Although this configuration proved effective for dynamic parameter tuning, it restricts scalability when the state space grows or when continuous control is required. Integrating advanced deep reinforcement learning methods, such as DQNs or PPO, could enhance adaptability and decision granularity. These approaches would enable the agent to learn complex correlations among workload patterns, energy states, and performance metrics, thereby improving responsiveness to non-stationary cloud environments.

The current evaluation also assumes homogeneous communication reliability and omits the effects of potential system faults or virtual machine failures. Real deployments may experience transient outages, storage bottlenecks, or migration interruptions, which can affect energy efficiency and load distribution. Incorporating fault-tolerance mechanisms or stochastic reliability models would strengthen the proposed approach’s robustness and provide further insight into its resilience under real operational conditions.

The study employed synthetic workloads with controlled computational intensity and a uniform random distribution to achieve experimental diversity. While this design supports consistent benchmarking, it may not fully reflect the workload burstiness or multi-tenancy behaviors observed in production data centres. Future investigations should consider workload traces derived from real applications or publicly available datasets (e.g., Google Cluster Data or Azure Traces) to validate the generalizability of the proposed algorithm under realistic workload dynamics.

Looking ahead, several research directions naturally emerge from these limitations. First, extending the hybrid BEO–POA with the RL framework to EdgeCloudSim or iFogSim will allow exploration of the algorithm’s performance in distributed cloud–edge hierarchies where communication latency, link variability, and fog-to-cloud migrations play decisive roles. Second, deploying the algorithm on small-scale experimental testbeds such as OpenStack, Kubernetes, or Eucalyptus will enable empirical measurement of execution delay, energy cost, and scalability in heterogeneous hardware environments. Third, integrating non-linear and temperature-aware energy models would yield more realistic assessments of sustainability benefits. Finally, coupling the RL controller with deep learning architectures could evolve the system into a self-optimizing load balancer capable of continuous adaptation in dynamic, multi-cloud contexts.

Although the present evaluation provides compelling evidence of the algorithm’s efficiency and adaptability, its conclusions are bounded by the abstractions inherent to the simulation environment. The proposed enhancements—including cross-platform validation, improved energy modelling, and deep RL integration—constitute promising avenues for future research that will further substantiate the practicality and robustness of the RL-guided hybrid BEO–POA load-balancing framework in real-world cloud and edge computing ecosystems.

Despite the proposed framework demonstrating promising results in the CloudSim environment, it has an inherent limitation: it abstracts away network-level dynamics. CloudSim primarily focuses on compute resource allocation and task scheduling, while factors such as bandwidth variability, packet delay, and communication overhead are largely ignored. This abstraction simplifies experimentation but restricts the evaluation of network-aware behaviors that are crucial in realistic cloud–edge or fog computing environments. In future work, we intend to extend the implementation to a more comprehensive simulator such as EdgeCloudSim, which explicitly models the impact of network latency, transmission cost, and user mobility. Integrating these parameters would enable a more holistic performance assessment under dynamic and heterogeneous network conditions. We anticipate that the RL controller would remain robust in such environments, as its policy architecture can naturally accommodate additional state variables representing network delay or bandwidth utilization. However, introducing network-level parameters is expected to increase the state-space dimensionality slightly and may lead to longer training convergence times. Despite this, the controller’s adaptive exploration and reward mechanisms are designed to balance competing objectives—such as latency minimization and energy efficiency—suggesting that the algorithm’s decision-making capability would generalize well to latency-aware and bandwidth-constrained scenarimprovement in energy efficiency, reduced response time by 45%, and maintained a higher throughput oposed hybrid RL-based load-balancing method when deployed in more complex, real-world network environments.

9. Conclusions

Energy-efficient load balancing in cloud computing remains a crucial research area due to the increasing demand for computational resources and sustainability concerns. This study introduced a novel hybrid BEO–POA with an RL Optimization Algorithm that integrates BEO’s global search with POA’s local refinement techniques. The goal was to minimize energy consumption while ensuring optimal resource allocation in large-scale cloud data centres. The experimental evaluation demonstrated that our proposed hybrid approach significantly reduces energy consumption, optimizes resource utilization, and enhances system performance. Specifically, hybrid BEO–POA with RL achieved a 30% energy efficiency improvement, reduced response time by 45%, and maintained a higher throughput rate than conventional load-balancing strategies. These improvements are attributed to adaptive switching between exploration and exploitation phases that dynamically optimize task assignments in response to workload fluctuations. Furthermore, statistical validation using Wilcoxon signed-rank tests and ANOVA confirmed the superiority of the Hybrid BEO–POA with the RL method compared to existing algorithms such as PSO-ACO, BSO-PSO, and MS-BWO. The hybrid algorithm’s ability to maintain a low LIF and ensure high QoS levels makes it a promising solution for real-world cloud computing applications.

Author Contributions

Conceptualization, Y.S. and S.A.-E.; methodology, Y.S.; software, B.A.; validation, Y.S. and S.N.M.; investigation, Y.S.; resources, B.A.; writing—original draft preparation, Y.S.; writing—review and editing, S.A.-E. and B.A.; visualization, S.N.M.; supervision, S.A.-E.; project administration, Y.S.; funding acquisition, B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was funded by Umm Al-Qura University, Saudi Arabia, under grant number: 25UQU4331451GSSR01.

Data Availability Statement

The code supporting the findings of this study is publicly available at the Zenodo repository: Java Code: https://doi.org/10.5281/zenodo.17530713.

Acknowledgments

The authors extend their appreciation to Umm Al-Qura University, Saudi Arabia, for funding this research work through grant number: 25UQU4331451GSSR01.

Conflicts of Interest

The authors declare no conflict of interest.

References

Al-E’mari, S.; Sanjalawe, Y.; Al-Daraiseh, A.; Taha, M.B.; Aladaileh, M. Cloud Datacenter Selection Using Service Broker Policies: A Survey. CMES-Comput. Model. Eng. Sci. 2024, 139, 1–41. [Google Scholar] [CrossRef]
Sanjalawe, Y.; Anbar, M.; Al-E’mari, S.; Abdullah, R.; Hasbullah, I.; Aladaileh, M. Cloud Data Center Selection Using a Modified Differential Evolution. Comput. Mater. Contin. 2021, 69, 3179–3204. [Google Scholar] [CrossRef]
Katal, A.; Dahiya, S.; Choudhury, T. Energy efficiency in cloud computing data centers: A survey on software technologies. Clust. Comput. 2023, 26, 1845–1875. [Google Scholar] [CrossRef]
Jyoti, A.; Shrimali, M.; Tiwari, S.; Singh, H.P. Cloud computing using load balancing and service broker policy for IT service: A taxonomy and survey. J. Ambient Intell. Humaniz. Comput. 2020, 11, 4785–4814. [Google Scholar] [CrossRef]
Buyya, R.; Ilager, S.; Arroba, P. Energy-efficiency and sustainability in new generation cloud computing: A vision and directions for integrated management of data centre resources and workloads. Softw. Pract. Exp. 2024, 54, 24–38. [Google Scholar] [CrossRef]
Rozehkhani, S.M.; Mahan, F.; Pedrycz, W. Efficient cloud data center: An adaptive framework for dynamic Virtual Machine Consolidation. J. Netw. Comput. Appl. 2024, 226, 103885. [Google Scholar] [CrossRef]
Gupta, M.R.; Sharma, O.P. A Review exploration of Load Balancing Techniques in Cloud Computing. Educ. Adm. Theory Pract. 2024, 30, 580–590. [Google Scholar] [CrossRef]
Devi, N.; Dalal, S.; Solanki, K.; Dalal, S.; Lilhore, U.K.; Simaiya, S.; Nuristani, N. A systematic literature review for load balancing and task scheduling techniques in cloud computing. Artif. Intell. Rev. 2024, 57, 276. [Google Scholar] [CrossRef]
Ghandour, O.; El Kafhali, S.; Hanini, M. Adaptive workload management in cloud computing for service level agreements compliance and resource optimization. Comput. Electr. Eng. 2024, 120, 109712. [Google Scholar] [CrossRef]
Agarwal, S.; Singh, J.; Ansari, M. Recent developments of load balancing in cloud computing: A review. In Proceedings of the AIP Conference Proceedings, Delhi, India, 28–29 January 2022; AIP Publishing: Melville, NY, USA, 2025; Volume 3233. [Google Scholar]
Sakamoto, T. Optimization of Cloud Computing Resources in Japan. Am. J. Comput. Eng. 2024, 7, 12–23. [Google Scholar]
Simaiya, S.; Lilhore, U.K.; Sharma, Y.K.; Rao, K.B.; Maheswara Rao, V.; Baliyan, A.; Bijalwan, A.; Alroobaea, R. A hybrid cloud load balancing and host utilization prediction method using deep learning and optimization techniques. Sci. Rep. 2024, 14, 1337. [Google Scholar] [CrossRef]
Khan, A.R. Dynamic Load Balancing in Cloud Computing: Optimized RL-Based Clustering with Multi-Objective Optimized Task Scheduling. Processes 2024, 12, 519. [Google Scholar] [CrossRef]
Ghafir, S.; Alam, M.A.; Siddiqui, F.; Naaz, S. Load balancing in cloud computing via intelligent PSO-based feedback controller. Sustain. Comput. Inform. Syst. 2024, 41, 100948. [Google Scholar] [CrossRef]
Shankar, J.; Hussain, I.; Zafar, S.; Khan, I.R.; Khalique, A. Effective Resource Allocation and Load Balancing in Green Cloud Computing. In Proceedings of the International Conference on ICT for Digital, Smart, and Sustainable Development, Delhi, India, 23–24 April 2024; Springer: Singapore, 2024; pp. 423–439. [Google Scholar]
Zakarya, M.; Khan, A.A.; Qazani, M.R.C.; Ali, H.; Al-Bahri, M.; Khan, A.U.R.; Ali, A.; Khan, R. Sustainable computing across datacenters: A review of enabling models and techniques. Comput. Sci. Rev. 2024, 52, 100620. [Google Scholar] [CrossRef]
Srivastava, V.; Kumar, R. Energy and Deadline Aware Workflow Scheduling using Adaptive Remora Optimization in Cloud Computing. Scalable Comput. Pract. Exp. 2025, 26, 490–502. [Google Scholar] [CrossRef]
Jie, L. Optimizing Resource Utilization and Improving Performance in Cloud Computing Through PSO-Based Scheduling and ACO-Based Load Balancing. J. Inst. Eng. Ser. B 2024, 106, 1543–1556. [Google Scholar] [CrossRef]
Bhattacharya, T.; Tanniru, V.; Majumder, S.; Veeramalla, S. Enhancing the Energy Efficiency with DURGA, a Novel Geographical Load Balancer. In Proceedings of the 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), Philadelphia, PA, USA, 6–9 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 180–190. [Google Scholar]
Gowri, V.; Baranidharan, B. An energy efficient and secure model using chaotic levy flight deep Q-learning in healthcare system. Sustain. Comput. Inform. Syst. 2023, 39, 100894. [Google Scholar] [CrossRef]
Hariharan, B.; Siva, R.; Kaliraj, S.; Prakash, P.S. ABSO: An energy-efficient multi-objective VM consolidation using adaptive beetle swarm optimization on cloud environment. J. Ambient. Intell. Humaniz. Comput. 2021, 14, 2185–2197. [Google Scholar] [CrossRef]
Khan Khalil, M.I.; Ali Shah, S.A.; Khan, I.A.; Hijji, M.; Shiraz, M.; Shaheen, Q. Energy Cost Minimization Using String Matching Algorithm in Geo-Distributed Data Centers. Comput. Mater. Contin. 2023, 75, 6305–6322. [Google Scholar] [CrossRef]
Gnanaprakasam, D.; Mohanraj, M.; Srinivas, T.A.S.; Bhaggiaraj, S.; Baskaran, J.; Sivankalai, S. Efficient Task Scheduling in Cloud Environment Based on Hyper Min Max Task Scheduling. In Proceedings of the 2023 International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballar, India, 29–30 April 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
Jagadish Kumar, N.; Balasubramanian, C. Cost-efficient resource scheduling in cloud for big data processing using metaheuristic search black widow optimization (MS-BWO) algorithm. J. Intell. Fuzzy Syst. 2023, 44, 4397–4417. [Google Scholar] [CrossRef]
Kumar, K. P2BED-C: A novel peer to peer load balancing and energy efficient technique for data-centers over cloud. Wirel. Pers. Commun. 2022, 123, 311–324. [Google Scholar] [CrossRef]
Aldossary, M.; Alharbi, H.A.; Ayub, N. Exploring Multi-Task Learning for Forecasting Energy-Cost Resource Allocation in IoT-Cloud Systems. Comput. Mater. Contin. 2024, 79. [Google Scholar] [CrossRef]
Zhang, H.; San, H.; Chen, J.; Sun, H.; Ding, L.; Wu, X. Black eagle optimizer: A metaheuristic optimization method for solving engineering optimization problems. Clust. Comput. 2024, 27, 12361–12393. [Google Scholar] [CrossRef]
Storn, R.; Price, K. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997, 11, 341–359. [Google Scholar] [CrossRef]
Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
Abualigah, L.; Elaziz, M.A.; Khasawneh, A.M.; Alshinwan, M.; Ibrahim, R.A.; Al-Qaness, M.A.; Al-qaness, M.A.A.; Mirjalili, S.; Sumari, P.; Gandomi, A.H. Meta-heuristic optimization algorithms for solving real-world mechanical engineering design problems: A comprehensive survey, applications, comparative analysis, and results. Neural Comput. Appl. 2022, 34, 4081–4110. [Google Scholar] [CrossRef]
Trojovskỳ, P.; Dehghani, M. Pelican optimization algorithm: A novel nature-inspired algorithm for engineering applications. Sensors 2022, 22, 855. [Google Scholar] [CrossRef]
Houssein, E.H.; Saeed, M.K.; Hu, G.; Al-Sayed, M.M. Metaheuristics for solving global and engineering optimization problems: Review, applications, open issues and challenges. Arch. Comput. Methods Eng. 2024, 31, 4485–4519. [Google Scholar] [CrossRef]
Khan, A.; Bressel, M.; Davigny, A.; Abbes, D.; Ould Bouamama, B. Comprehensive Review of Hybrid Energy Systems: Challenges, Applications, and Optimization Strategies. Energies 2025, 18, 2612. [Google Scholar] [CrossRef]
Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Tsitsiklis, J.N. Asynchronous stochastic approximation and Q-learning. Mach. Learn. 1994, 16, 185–202. [Google Scholar] [CrossRef]
Liberzon, D. Switching in Systems and Control; Birkhäuser: Boston, MA, USA, 2003. [Google Scholar]

Figure 1. Architecture of CC.

Figure 2. Load Balancing in CC.

Figure 3. Flowchart of BEO.

Figure 4. RL Flow.

Figure 5. Energy Convergence.

Figure 6. Convergence comparison across optimization methods.

Figure 7. Ablation study comparing the hybrid BEO–POA algorithm with vs. Without RL.

Figure 8. LIF Comparison: With vs. Without RL.

Figure 9. Sensitivity analysis showing the impact of varying the RL learning rate (

α

) on energy consumption and response time.

Figure 10. Performance Comparison.

Table 1. Ablation study illustrating complementarity between BEO and POA under identical workload.

Method	Energy (kWh)	Makespan (s)	LIF	Computation Time (s)
BEO Only	55.83	231.45	0.18	36.1
POA Only	53.72	223.66	0.16	34.4
Hybrid BEO–POA	50.23	200.34	0.10	38.7

Table 2. Sensitivity analysis of reward weight combinations on convergence performance.

Weights $(w_{1}, w_{2}, w_{3})$	Conv. Episodes	Energy (kWh)	Utilization (%)	LIF	Reward Var.
(0.4, 0.4, 0.2)	155	53.2	0.85	0.13	0.022
(0.5, 0.3, 0.2)	140	50.2	0.87	0.10	0.018
(0.6, 0.3, 0.1)	145	49.8	0.81	0.15	0.025
(0.3, 0.5, 0.2)	160	54.0	0.89	0.18	0.030
(0.5, 0.2, 0.3)	170	51.4	0.84	0.22	0.034

Table 3. Parameter Values for BEO and POA.

Parameter	Optimal Value
Exploration-Exploitation Balance (BEO)	0.7
Adaptation Rate (POA)	0.6
Migration Factor (BEO)	0.5
Step-size Factor (BEO)	0.3
Turbulence Intensity (POA)	0.2

Table 4. RL Configuration.

Parameter	Value
Learning rate ( $α$ )	0.1
Discount factor ( $γ$ )	0.9
Exploration rate ( $ϵ$ )	0.2 (decaying)
Number of episodes	200
State variables	VM utilization, normalized energy, LIF, convergence phase
Actions	Adjust BEO–POA switching, step size, turbulence
Reward function	As defined in Equation (18)

Table 5. Experimental Setup Across Different Scenarios.

Entity Types	Variable	Scenario I	Scenario II	Scenario III	Scenario IV	Scenario V
User Cloudlets	Cloudlets (#)	10–100	100–500	500–1000	1000–5000	5000–10,000
User Cloudlets	Length (MI)	500–10,000	1000–20,000	2000–40,000	5000–100,000	10,000–200,000
C	Hosts (#)	4	6	8	10	16
	RAM per Host	4 GB	8 GB	8 GB	16 GB	32 GB
	Storage	40 GB	80 GB	80 GB	160 GB	320 GB
	Bandwidth	512	512	512	1024	2048
	CPUs per Host	4	8	8	16	32
VM	VMs (#)	8	16	32	64	128
	RAM per VM	2 GB	4 GB	8 GB	16 GB	32 GB
	OS	Windows	Windows	Windows	Windows	Windows
	Policy	Time sharing	Time sharing	Time sharing	Time sharing	Time sharing
Data Centers	Data Centers (#)	2	2	4	4	8

Table 6. Parameter Settings.

Parameter	Symbol	Value/Setting	Description
Population size	M	40 solutions	Total number of candidate task–VM mappings in each generation
Maximum iterations	I	120	Iterations per optimization run
Elite fraction	$p_{e}$	0.3	Fraction of top solutions refined by local search
BEO step size	$s_{BEO}$	0.2	Exploration amplitude for the global phase
POA turbulence coefficient	$R_{POA}$	0.25	Controls perturbation strength in local refinement
Switching probability (BEO–POA)	$η$	0.4 (initial)	Probability of invoking POA refinement in each iteration
Reward weights	$(w_{1}, w_{2}, w_{3})$	(0.5, 0.3, 0.2)	Balances energy, utilization, and load imbalance in RL controller
Energy threshold	$E_{th}$	55 kWh	Trigger for additional refinement when energy exceeds threshold
Load imbalance threshold	$L_{th}$	0.25	Upper limit for acceptable load imbalance (LIF)
Learning rate (Q-learning)	$α$	0.25	RL agent learning rate
Discount factor	$γ$	0.95	Future reward weighting factor
Exploration rate	$ϵ$	0.2	Probability of random exploration in RL
Cloudlets (tasks)	–	1000	Total number of simulated tasks
Virtual Machines	–	32 VMs	Computing resources in CloudSim
Hosts	–	8 hosts	Physical machines in the simulated data center
Simulation tool	–	CloudSim 3.0.3	Environment for all comparative experiments

Table 7. Comparative performance of POA, Hill Climbing, and Tabu Search as local refinement methods (Scenario III, CloudSim).

Refinement Method	Energy (kWh)	Makespan (s)	LIF	Convergence Time (s)
BEO + POA (Proposed)	50.23	200.34	0.10	38.7
BEO + Hill Climbing	57.41	228.20	0.17	34.1
BEO + Tabu Search	54.92	218.65	0.14	36.9
BEO + Simulated Annealing	52.78	214.93	0.12	37.5

Table 8. Comparison between proposed BEO migration rule and heuristic baselines for energy-aware scheduling.

Method	Energy (kWh)	Makespan (s)	LIF
Best-Fit Heuristic	58.43	234.5	0.21
Min-Min Heuristic	56.28	226.7	0.19
Original BEO	54.62	220.1	0.16
Proposed Energy-Aware BEO	50.12	201.8	0.10

Table 9. Wilcoxon Signed-Rank Test Results.

Algorithm Comparison	Wilcoxon p-Value
Hybrid BEO–POA with RL vs. BEO	$p < 0.001$
Hybrid BEO–POA with RL vs. POA	$p < 0.001$
Hybrid BEO–POA with RL vs. PSO	$p < 0.005$
Hybrid BEO–POA with RL vs. ACO	$p < 0.005$

Table 10. ANOVA Test Results.

ANOVA Metric	Value
F-value	8.35
p-value	0.000002

Table 11. Performance Comparison of Hybrid BEO–POA with RL vs. SOTA.

Algorithm	Energy (kWh)	Makespan (s)	Util. (%)	LIF	Resp. Time (ms)	Throughput (t/s)
Hybrid BEO–POA with RL	50.23	200.34	95.47	0.10	100.56	250.78
BEO	88.03	449.73	72.82	0.58	298.76	63.27
POA	79.28	263.70	75.96	0.50	193.51	79.40
PSO-ACO	72.95	254.55	87.48	0.25	256.02	56.78
BSO-PSO	70.50	270.10	85.33	0.35	225.30	68.50
PSO	73.95	256.02	86.25	0.30	264.01	60.20
Round Robin	65.20	310.50	78.40	0.45	275.80	75.40
ACO	56.24	255.02	66.99	0.15	264.01	98.80
Least Connection	54.75	290.40	80.10	0.28	150.40	85.60
MS-BWO	52.10	320.15	82.55	0.40	180.25	70.90
Weighted Load Balancer	58.30	275.60	90.20	0.20	160.30	120.40

Table 12. Pairwise T-Test Results for Hybrid BEO–POA with RL vs. SOTA.

Metric	Algorithm Compared	p-Value
Energy Consumption	BEO	0.00034
	POA	0.00029
	PSO-ACO	0.00048
	BSO-PSO	0.00052
	PSO	0.00031
	Round Robin	0.00041
	ACO	0.00213
	Least Connection	0.00435
	MS-BWO	0.00842
	Weighted Load Balancer	0.01578
Makespan	BEO	0.00021
	POA	0.00015
	PSO-ACO	0.00205
	BSO-PSO	0.00312
	PSO	0.00478
	Round Robin	0.00654
	ACO	0.00928
	Least Connection	0.01236
	MS-BWO	0.01864
	Weighted Load Balancer	0.02291
Resource Utilization	BEO	0.00019
	POA	0.00254
	PSO-ACO	0.00389
	BSO-PSO	0.00492
	PSO	0.00571
	Round Robin	0.00765
	ACO	0.01143
	Least Connection	0.01398
	MS-BWO	0.01682
	Weighted Load Balancer	0.02131
Load Imbalance Factor	BEO	0.00017
	POA	0.00361
	PSO-ACO	0.00472
	BSO-PSO	0.00596
	PSO	0.00735
	Round Robin	0.00988
	ACO	0.01342
	Least Connection	0.01756
	MS-BWO	0.02187
	Weighted Load Balancer	0.02549
Response Time	BEO	0.00012
	POA	0.00279
	PSO-ACO	0.00436
	BSO-PSO	0.00614
	PSO	0.00859
	Round Robin	0.01072
	ACO	0.01351
	Least Connection	0.01783
	MS-BWO	0.02063
	Weighted Load Balancer	0.02415
Throughput	BEO	0.00009
	POA	0.00213
	PSO-ACO	0.00348
	BSO-PSO	0.00473
	PSO	0.00679
	Round Robin	0.00884
	ACO	0.01092
	Least Connection	0.01321
	MS-BWO	0.01647
	Weighted Load Balancer	0.01962

Table 13. Comparison of asymptotic time and space complexities of selected metaheuristics.

Algorithm	Time Complexity	Space Complexity
Genetic Algorithm (GA)	$O (N \times d \times T)$	$O (N \times d)$
Differential Evolution (DE)	$O (N \times d \times T)$	$O (N \times d)$
Pelican Optimization Algorithm (POA)	$O (N \times d \times T)$	$O (N \times d)$
Black Eagle Optimizer (BEO)	$O (N \times d \times T)$	$O (N \times d)$
Proposed Hybrid BEO–POA + RL	$O (N \times d \times T)$	$O (N \times d)$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Reinforcement Learning-Guided Hybrid Metaheuristic for Energy-Aware Load Balancing in Cloud Environments

Abstract

1. Introduction

1.1. Problem Statement

1.2. Contributions

1.3. Paper Organization

2. Related Work

3. Preliminaries

3.1. BEO Algorithm

3.1.1. Initialization

3.1.2. Stalking (Global Search)

3.1.3. Hovering (Rotational Search)

3.1.4. Catching (Local Refinement)

3.1.5. Snatching (Jump Search)

3.1.6. Migration (Adaptive Escape)

3.1.7. Justification for Selecting the BEO

3.2. POA

3.2.1. Exploration Phase—Movement Towards Prey

3.2.2. Exploitation Phase—Winging on the Water Surface

3.3. Rationale for Using POA as the Local Refinement Component

4. Proposed BEO-POA with RL Load Balancer

4.1. Mathematical Model of Hybrid BEO-POA with RL

4.1.1. Sensitivity Analysis of Reward Weights

4.1.2. Definition and Reproducibility of the Reinforcement Learning Controller

4.2. Modification of the BEO Migration Step for Energy-Aware Task Placement

4.2.1. Original BEO Migration Principle

4.2.2. Energy-Aware Migration Adaptation

4.2.3. Theoretical and Empirical Justification

5. Implementation of Hybrid BEO–POA in CloudSim

5.1. CloudSim Architecture Overview

5.2. Setting up an Energy-Aware Datacenter

5.3. Custom Load Balancing Policy

5.4. Extending the DatacenterBroker for Hybrid BEO–POA

6. Implementation Considerations and Experiments Setups

6.1. Key Considerations of Implementations

6.2. Workload Scenarios

7. Results and Discussion

7.1. Methods for Comparison

7.2. Evaluation Metrics

7.3. Effectiveness Evaluation

7.4. Convergence Analysis

7.4.1. Convergence Curve Analysis

7.4.2. Statistical Validation of Convergence Speed

7.4.3. Ablation Study: Impact of RL

7.5. Parameter Sensitivity Analysis

7.6. Comparative Performance

7.6.1. Performance Comparison

7.6.2. Statistical Significance Analysis

7.7. Computational and Space Complexity Analysis

7.7.1. Preliminaries and Notation

7.7.2. Time Complexity of the Hybrid BEO–POA

7.7.3. Complexity of Reinforcement Learning Integration

7.7.4. Space Complexity Analysis

7.7.5. Comparative Efficiency

7.8. Convergence and Stability of the RL-Guided Hybrid BEO–POA

7.8.1. Preliminaries and Notation

7.8.2. Assumptions

7.8.3. Boundedness and Monotonicity

7.8.4. Mode-Switching Stability and Lyapunov Argument

7.8.5. Convergence of the Reinforcement Learning Controller

7.8.6. Limit Points and Practical Stability

7.8.7. Oscillation Avoidance in Practice

7.8.8. Complexity and Stability Coherence

7.9. Discussion

8. Limitations and Future Work

9. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics