Reinforcement Learning-Based Handover Algorithm for 5G/6G AI-RAN

Safiullin, Ildar A.; Ashaev, Ivan P.; Korobkov, Alexey A.; Gaysin, Artur K.; Nadeev, Adel F.

doi:10.3390/inventions11010008

by Ildar A. Safiullin, Ivan P. Ashaev, Alexey A. Korobkov, Artur K. Gaysin and Adel F. Nadeev *

Radioelectronics and Telecommunication Systems Department, Kazan National Research Technical University Named After A.N. Tupolev-KAI, Karl Marx Str. 10, 420111 Kazan, Russia

* Author to whom correspondence should be addressed.

Inventions 2026, 11(1), 8; https://doi.org/10.3390/inventions11010008

Submission received: 21 November 2025 / Revised: 30 December 2025 / Accepted: 9 January 2026 / Published: 10 January 2026

(This article belongs to the Section Inventions and Innovation in Electrical Engineering/Energy/Communications)


Abstract

The increasing number of Base Stations (BSs) and connected devices, coupled with their mobility, makes mobility management increasingly challenging, so advanced handover (HO) management technologies are required. This paper focuses on the ping-pong HO problem. To address it, we propose an algorithm using Reinforcement Learning (RL) based on the Double Deep Q-Network (DDQN). The novelty of our approach is to assign specialized RL agents to users based on their mobility patterns, which simplifies the learning process. The proposed algorithm is evaluated on the ns-3 platform, which was chosen for its ability to replicate real-world scenarios. The baseline handover algorithm based on Events A2 and A4 is used for comparison. The results show that the proposed approach reduces the number of HOs by more than four times on average, which yields a more stable data rate and increases it by up to two times in the best case.

Keywords:

AI-RAN; O-RAN; RAN Intelligent Controller; RIC; handover; reinforcement learning; mobility management; ping-pong effect

1. Introduction

Promising fifth-generation wireless mobile networks, such as 5G New Radio (NR), represent a significant advancement over previous systems in data rate, latency, connectivity, reliability, and energy efficiency [1,2]. Compared to 4G Long-Term Evolution (LTE), 5G NR aims for data speeds up to 10 Gbps and latency reduced to 1 ms [1,2,3]. Moreover, the network should support up to 10⁶ connections per km² and be 100 times more energy efficient [3]. A robust collective dynamic routing method may be necessary [4]. At the same time, channel bandwidth should increase by 5–20 times, reaching 100 MHz in the bands up to 6 GHz, and up to 400 MHz in millimeter-wave bands [1,3]. Deploying 5G NR in millimeter waves enables wider channels but reduces coverage area due to physical channel properties at these frequencies. All these features result in a massive increase in data volume transmitted and processed within 5G networks.

Along with the development of 5G networks, the concept of Open Radio Access Networks (O-RANs) [5] is actively developing. This approach rejects the traditional closed architecture of mobile networks built on single-vendor equipment and instead promotes a distributed structure of base stations (BSs) with standardized open interfaces between its elements. This allows the use of diverse hardware and software from different manufacturers within one network. Furthermore, it is proposed to implement a RAN Intelligent Controller (RIC), which will perform intelligent management functions in O-RAN networks [5,6,7]. RIC provides logical functions for real-time network and resource optimization and plays a crucial role in enabling disaggregation, virtualization, interoperability, flexibility, and programmability [5,6,7].

RIC can be used to collect large volumes of data in 5G networks, which can be analyzed using Artificial Intelligence (AI) and Machine Learning (ML) algorithms. For example, AI/ML algorithms can be used to solve the following problems: network resource allocation, network fault detection and management, network load prediction and optimization, providing protection for both individual devices and the overall network, ensuring Quality of Service (QoS), and user grouping and mobility prediction [8].

The open nature of O-RAN enables the utilization of advanced data processing techniques, particularly ML methods such as Reinforcement Learning (RL). During the RL training stage, an agent interacts with the environment to learn the actions that maximize its reward. As discussed in [9], RL has been successfully applied to mobility management in 5G Ultra-Dense Small Cell (UDSC) networks.

In addition, active research is currently underway in the field of 6G. For example, in [10], the authors discuss possible 6G O-RAN architecture options, propose use cases, and highlight challenges. In [11], the authors propose a Next Generation AI on RAN testbed to address the design challenges of 6G networks. This testbed enables the Graphical Processing Unit (GPU) to accelerate Layer 1 and Layer 2 processing. Moreover, the GPU is simultaneously used to process L1/L2 data in real time and train an AI model to predict traffic.

This work focuses on the problem of user mobility management with the goal of improving handover (HO) efficiency. To solve this problem, we propose an RL agent with offline learning. The proposed solution demonstrates that the HO process can be optimized and reduces the ping-pong effect compared to baseline algorithms.

The key contributions of this paper can be summarized as follows:

For RL agent training, we propose to use data that has undergone primary multilinear processing: tensor decomposition and clustering of factor matrices;
We propose to train the RL agent based on identified trends after multilinear data processing;
We develop a method in which the RL agent manages the HO process based on user classification by mobility type;
Simulation results show that in a ping-pong HO scenario, the proposed algorithm avoids a large number of HOs while maintaining a high data rate.

The rest of the paper is organized as follows. The remainder of Section 1 presents the HO problem and the importance of solving it in mobile networks and introduces relevant research. Section 2 describes the proposed solution based on the RL algorithm. Section 3 provides a description of the simulation scenarios, collects the obtained results, and analyzes them. Section 4 discusses the features of the proposed algorithm and its advantages and disadvantages, and suggests steps for future research. Section 5 summarizes the main results and conclusions of this paper.

1.1. Formulation of HO Problem

HO is the process of handing over a user’s services from one BS to another. The HO process typically involves user movement. A classic HO scenario is a user moving from the coverage area of one BS (serving BS) to the coverage area of another BS (neighbor BS).

The HO process is described by HO parameters: in the baseline version, these are the threshold, the Hysteresis Margin (HM), and the Time To Trigger (TTT). These parameters define the logic of the HO process over time. The process is driven by measurements that the user, or User Equipment (UE), evaluates and reports to the BS. Possible measurements according to the 3rd Generation Partnership Project (3GPP) include Reference Signal Received Power (RSRP), Reference Signal Received Quality (RSRQ), and Signal-to-Noise and Interference Ratio (SINR).

3GPP describes the HO procedure in terms of message exchange in TS 38.300 [12]. Furthermore, in TS 38.331 Release 15 [13], 3GPP defines six types of HO events. These traditional (baseline) events were already introduced in the LTE standard. They are triggered based on thresholds, and when the network operates on a single frequency there is a risk that interference from neighbor BSs becomes so significant that the service quality of the serving BS is already too low at the time the HO decision is made. On the user side, this may lead to a Radio Link Failure (RLF) [14].

As shown in Figure 1 for baseline Event A3, when the SINR of the serving BS falls below the threshold and the SINR of the neighbor BS exceeds that of the serving BS by the HM value, the serving BS starts the HO process once the TTT has expired. This paper does not address the mathematical modeling of the handover process; such modeling is described in sufficient detail in [15].
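The trigger logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the 3GPP procedure: the function name `first_handover_index`, the representation of the TTT as a number of consecutive samples, and all numeric values are hypothetical.

```python
from typing import Optional, Sequence


def first_handover_index(
    serving_sinr: Sequence[float],
    neighbor_sinr: Sequence[float],
    threshold_db: float,
    hysteresis_db: float,
    ttt_samples: int,
) -> Optional[int]:
    """Return the sample index at which HO is triggered, or None.

    Entry condition (as in the text): the serving SINR is below the
    threshold and the neighbor SINR exceeds the serving SINR by the HM.
    The condition must hold for ttt_samples consecutive samples (TTT).
    """
    held = 0  # consecutive samples for which the condition has held
    for i, (s, n) in enumerate(zip(serving_sinr, neighbor_sinr)):
        if s < threshold_db and n > s + hysteresis_db:
            held += 1
            if held >= ttt_samples:
                return i
        else:
            held = 0  # condition broken: the TTT timer restarts
    return None


# Hypothetical SINR traces in dB, one value per measurement period.
serving = [12.0, 9.0, 6.0, 4.0, 3.0, 2.0, 1.0]
neighbor = [2.0, 4.0, 7.0, 8.0, 9.0, 10.0, 11.0]
print(first_handover_index(serving, neighbor, 8.0, 3.0, 3))  # 5
```

Resetting the counter whenever the condition is violated is what makes the TTT filter out short-lived fluctuations; a longer TTT delays the HO but suppresses more of them.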

In Release 16 and subsequently in Release 17, 3GPP added Conditional HO (CHO) to ensure reliable and seamless user mobility [14,16]. Unlike baseline HO, which is executed immediately upon receiving the HO command, CHO is executed only when a corresponding specified condition is met [16]. Additionally, Releases 16 and 17 introduced interference-, distance-, and time-based events, as well as events for Unmanned Aerial Vehicles (UAVs).

However, it is worth clarifying that in reality, ideal BS coverage areas that look like a circle or a cell do not exist. When deploying a network in an urban area, it is always necessary to take into account built-up areas that interfere with signal propagation. Furthermore, it is practically impossible to account for multipath signal propagation in space, which leads to channel frequency selectivity.

These signal propagation difficulties lead to one of the critical types of HO, which is known in the literature as the ping-pong effect [17]. Some examples of ping-pong HO are shown in Figure 2. Ping-pong HO is defined as HO from BS 1 to BS 2 and then back to BS 1 if the Time-of-Stay (TS) at BS 2 is less than a pre-defined Minimum TS (MTS) value [18].
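The definition above translates directly into a counting rule over a handover log. The sketch below assumes a hypothetical log format of `(time, source BS, target BS)` tuples sorted by time; the function name and the sample log are illustrative only.

```python
from typing import List, Tuple

Handover = Tuple[float, int, int]  # (time in s, source BS, target BS)


def count_ping_pong(handovers: List[Handover], mts: float) -> int:
    """Count HOs that return to the previous BS after a stay shorter than mts."""
    count = 0
    for prev, curr in zip(handovers, handovers[1:]):
        t_prev, src_prev, dst_prev = prev
        t_curr, src_curr, dst_curr = curr
        time_of_stay = t_curr - t_prev  # time spent at dst_prev
        returns = src_curr == dst_prev and dst_curr == src_prev
        if returns and time_of_stay < mts:
            count += 1
    return count


# BS 1 -> BS 2 at 1.0 s and back at 1.4 s is a ping-pong for MTS = 1 s;
# the later returns happen after stays of several seconds and are not.
log = [(1.0, 1, 2), (1.4, 2, 1), (5.0, 1, 2), (9.0, 2, 1)]
print(count_ping_pong(log, mts=1.0))  # 1
```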

1.2. Review of Existing Articles

Currently, there are quite a few good surveys that examine the HO problem from various angles. For example, in [19], the authors examine the evolution of the HO process from LTE to NR, highlighting the differences, advantages, and future challenges.

In [20], the authors highlight new aspects of the HO process added to NR systems, such as CHO and the use of uplink channel state. They conduct an in-depth analysis of existing methods for solving the HO problem. The category of algorithmic solutions includes works using AI/ML models, fuzzy logic, game theory, and big data analysis.

In [21], the authors focus on the problems of optimizing the HO process in Beyond 5G (B5G) systems. The authors present and classify several HO optimization schemes and algorithms implemented in previous, current, and future mobile communication systems. Furthermore, they provide the advantages and disadvantages of the classified optimization algorithms.

In [22,23], the authors explore possible algorithmic development options for solving the HO problem in B5G and 6G systems. In [22], they focus on innovative approaches that leverage camera and lidar technologies, along with the use of UAVs functioning as base stations. Meanwhile, ref. [23] outlines envisioned developments for 6G systems based on contemporary concepts, with a strong emphasis on AI and ML models as potential solutions for the future.

In [24], the authors focus on the challenges associated with HO and mobility management in 5G/6G networks. They highlight the importance of aligning mobility and HO strategies with sustainable development goals to reduce energy consumption and optimize resource utilization. The study emphasizes the integration of AI/ML to enhance the sustainability and efficiency of mobility and HO management.

Summarizing [20,21,22,23,24], it can be concluded that hopes are placed on solutions based on AI/ML models as the solutions of the future.

1.2.1. Application of AI/ML Algorithms in Mobile Networks

A detailed analysis of the application of AI/ML models in 5G and B5G systems is provided in [8]. For example, AI/ML models are used to predict user mobility and assign users to groups, to ensure security and safety for both the user device and the entire network, to provide appropriate QoS and Quality of Experience (QoE), and to utilize cloud computing and UAVs as BS. Additionally, the idea is voiced that the network should have an orchestrator that can manage various AI/ML models.

The authors in [25] study the application of AI/ML solutions in LTE systems with obstacles in the signal path. Several ML methods are used, followed by an analysis of the results. Using an Artificial Neural Network (ANN), it is possible to achieve acceptable performance results in terms of throughput and the probability of successful completion of the download.

In [26], the authors investigate the feasibility and accuracy of real-time mobile network throughput prediction and HO initialization prediction in 4G and 5G systems. Information-rich network state records from public transportation systems are used as input. A Deep Neural Network (DNN) model is proposed to find temporal patterns of throughput changes in fixed-mobility scenarios.

In [27], the authors propose using a Fuzzy Logic Control (FLC) scheme to achieve seamless HO when users move in multi-radio access networks. Simulation results show that the proposed scheme enables controlled mobility optimization in terms of the ping-pong effect, RLF, and delay in various mobile station speed scenarios.

The authors in [28] propose a context-aware HO admission system that relies on passenger mobility data, train trajectories, trip times and frequencies, network load, and SINR values. The use of Supervised Learning (SL) is proposed for mobility prediction and the avoidance of unnecessary HO during passenger movements.

In [29], the authors present a technique for self-optimization of the HO process using SL based on a regression tree for B5G systems. The RSRP and RSRQ values, as well as the user speed, are used as model features. Based on these parameters, the proposed solution optimizes HO through adaptive TTT prediction.

In [30], the authors propose a mobility management mechanism for 5G based on a SARSA-based RL algorithm, with intelligent adaptation of TTT and HM values. Furthermore, a Kalman filter is used to predict the future signal quality of serving and neighboring cells. TTT and HM parameters are adapted using an epsilon-greedy policy.

In [31], the authors consider the use of Long Short-Term Memory (LSTM) to optimize the HO process through QoE prediction. Environmental scenarios with coverage holes are studied. The proposed solution is implemented as an xApp, which receives RSRP/RSRQ measurements from UE of surrounding BS and makes a decision on the feasibility of successful data download for each BS.

The authors in [32] propose an LSTM-based xApp for HO event prediction to optimize network operation cost/performance. Furthermore, the authors evaluated the proposed algorithm based on real-world measurements obtained in a real O-RAN network.

In [33], the authors focus on the tasks of resource allocation and HO in 5G O-RAN systems and propose a two-step solution. The first step forecasts the network metrics of the next time slots using LSTM; the second step makes the HO decision based on the predicted metrics.

1.2.2. Application of RL Algorithm in Mobility Management

The paper [9] provides a detailed analysis of the application of the RL algorithm to mobility management in 5G. Furthermore, the paper discusses how ML algorithms can help in various HO scenarios and provides future directions and challenges for 5G networks.

The authors in [34,35] propose an autonomous scheme based on Double Deep Reinforcement Learning (DDRL) to minimize the HO rate in 5G systems. The authors use SNR instead of SINR as an input because they assume that millimeter-wave antennas form directional beams, in which case the contribution of inter-cell interference is negligible.

In [36], the authors focus on improving the HO process and formulate an optimization problem whose goal is to achieve fairness in the data transfer rate between UE and reduce the HO number. This problem is NP-hard, so two DRL-based algorithms are proposed: a centralized one and a multi-agent one. The proposed algorithms make it possible to almost completely avoid ping-pong handoffs.

The authors in [37] examine in detail a method for optimizing HO in 5G networks using the RL algorithm and show that the HO problem can be formulated as a subclass of RL problems called Contextual Multi-Arm Bandit (CMAB). The proposed RL agent processes RSRP reports and selects appropriate handoff actions according to the RL framework to maximize long-term benefit.

In [38], the authors focus on improving overall network throughput. To this end, they propose an xApp based on DRL with the Proximal Policy Optimization (PPO) approach. RSRQ measurements are used as input.

2. Materials and Methods

2.1. Description of the Overall Concept for HO Management

The main idea of the proposed algorithm is to divide users into groups and then manage HO separately for each user group. The target audience of the proposed solution is users with high mobility, as they are more susceptible to inefficient HO. It is also proposed to take into account the fact that mobile users can only move along fixed trajectories (roads), the configuration of which is dictated by urban development. These facts are taken into account in this paper to optimize the HO process.

The proposed method (Figure 3) consists of two parts: the first part operates in non-real time (non-RT), and the second part processes data in near-real time (near-RT). In the O-RAN paradigm, these parts are located in non-RT RIC and near-RT RIC, respectively. The authors refer to [39], which describes a model for interaction between O-RAN elements for mobility management using existing O-RAN interfaces.

The received measurement reports are used in both branches of the proposed algorithm. The non-RT data processing chain, implemented in an rApp, is described below. Data in the form of measurement reports are entered into a shared database, which collects measurement results for each user within a single cluster of BSs. Once a sufficient number of reports has been collected, the stored data are processed.

The accumulated reports are represented as a three-dimensional tensor of the form N_BS × N_snapshot × N_UE, where N_BS is the number of BSs, N_snapshot is the number of time samples, and N_UE is the number of users.
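As an illustration of this layout, the sketch below arranges individual reports into a nested list indexed as [BS][snapshot][UE]. The report format, the function name `build_tensor`, and the use of a fill value for missing reports are assumptions made for the example, not details given in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical report format: (ue_id, snapshot_index, bs_id) -> measured value
Reports = Dict[Tuple[int, int, int], float]


def build_tensor(reports: Reports, n_bs: int, n_snapshot: int, n_ue: int,
                 fill: float = 0.0) -> List[List[List[float]]]:
    """Arrange reports as a nested list of size N_BS x N_snapshot x N_UE."""
    tensor = [[[fill] * n_ue for _ in range(n_snapshot)] for _ in range(n_bs)]
    for (ue, snap, bs), value in reports.items():
        tensor[bs][snap][ue] = value
    return tensor


reports = {(0, 0, 0): -10.5, (0, 0, 1): -14.0, (1, 2, 1): -11.0}
tensor = build_tensor(reports, n_bs=2, n_snapshot=3, n_ue=2)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 2 3 2
```

How missing reports are handled (here a constant fill value) influences any later decomposition and would need to be chosen with care in practice.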

The collected tensor is then decomposed into factor matrices using Canonical Polyadic Decomposition (CPD) [40]. The calculated factor matrices are used to identify key groups of highly mobile users. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm [41] is used to identify the groups. Clustering allows one to obtain grouped values of the measured parameter for user groups with similar mobility patterns. The mobility pattern of a specific user group is determined by the speed and trajectory of the users’ movements.
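For readers unfamiliar with CPD, the sketch below computes a rank-R decomposition of a three-way tensor by alternating least squares (ALS), one common way to obtain the factor matrices. It assumes NumPy is available; the choice of ALS, the function names, and the synthetic data are assumptions, since the paper does not state which CPD algorithm is used.

```python
import numpy as np


def khatri_rao(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Column-wise Kronecker product; u's row index varies slowest."""
    return np.einsum("ir,jr->ijr", u, v).reshape(u.shape[0] * v.shape[0], -1)


def cp_als(x: np.ndarray, rank: int, n_iter: int = 100, seed: int = 0):
    """Rank-`rank` CPD of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    i, j, k = x.shape
    a = rng.random((i, rank))
    b = rng.random((j, rank))
    c = rng.random((k, rank))
    # Mode-n unfoldings whose column order matches khatri_rao above.
    x1 = x.reshape(i, j * k)
    x2 = np.moveaxis(x, 1, 0).reshape(j, i * k)
    x3 = np.moveaxis(x, 2, 0).reshape(k, i * j)
    for _ in range(n_iter):
        # Each step solves a linear least-squares problem for one factor.
        a = x1 @ khatri_rao(b, c) @ np.linalg.pinv((b.T @ b) * (c.T @ c))
        b = x2 @ khatri_rao(a, c) @ np.linalg.pinv((a.T @ a) * (c.T @ c))
        c = x3 @ khatri_rao(a, b) @ np.linalg.pinv((a.T @ a) * (b.T @ b))
    return a, b, c


def reconstruct(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Sum of rank-one terms a_r (outer) b_r (outer) c_r."""
    return np.einsum("ir,jr,kr->ijk", a, b, c)


# Synthetic rank-1 tensor of size N_BS x N_snapshot x N_UE = 2 x 50 x 8.
rng = np.random.default_rng(1)
x = np.einsum("i,j,k->ijk",
              rng.random(2) + 0.5, rng.random(50) + 0.5, rng.random(8) + 0.5)
a, b, c = cp_als(x, rank=1, n_iter=5)
rel_err = np.linalg.norm(x - reconstruct(a, b, c)) / np.linalg.norm(x)
print(c.shape, rel_err < 1e-8)
```

In this layout the third factor matrix `c` has one row per user, so its rows could serve as per-user features for the clustering step; which factor matrices are actually clustered is described in [44].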

These grouped measurements are then used to train a classification model. The classification problem is currently a relevant one, and we are exploring possible solutions. Naive and full Bayesian classifiers [42] are considered as possible options. Time series classification using the shapelet transform [43] can also be considered.
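To make the naive Bayes option concrete, the following sketch implements a small Gaussian naive Bayes classifier. The Gaussian likelihood, the class name, and the two-dimensional toy features are assumptions for illustration; the paper leaves the choice of classifier open.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


class GaussianNaiveBayes:
    """Gaussian naive Bayes: features are independent given the class."""

    def fit(self, x: List[Sequence[float]], y: List[int]) -> "GaussianNaiveBayes":
        groups: Dict[int, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(x, y):
            groups[label].append(row)
        self.stats = {}
        for label, rows in groups.items():
            n = len(rows)
            means = [sum(col) / n for col in zip(*rows)]
            # A small floor on the variance avoids division by zero.
            variances = [
                max(sum((v - m) ** 2 for v in col) / n, 1e-6)
                for col, m in zip(zip(*rows), means)
            ]
            self.stats[label] = (math.log(n / len(x)), means, variances)
        return self

    def predict(self, row: Sequence[float]) -> int:
        def log_posterior(label: int) -> float:
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(row, means, variances)
            )
        return max(self.stats, key=log_posterior)


# Hypothetical per-user feature vectors labeled with their mobility group.
features = [[-1.0, 0.2], [-1.2, 0.1], [-0.9, 0.3],
            [1.0, 2.0], [1.1, 2.2], [0.8, 1.9]]
groups = [0, 0, 0, 1, 1, 1]
model = GaussianNaiveBayes().fit(features, groups)
print(model.predict([-1.1, 0.2]), model.predict([0.9, 2.1]))  # 0 1
```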

Furthermore, the results of the DBSCAN algorithm are used to pre-train RL agents. Each user group is assigned a separate RL agent to manage mobility. As a supplement to the process of training an RL agent based on clustering results, pre-training based on data from baseline HO algorithms can be considered.
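The grouping that precedes agent assignment can be illustrated with a compact DBSCAN implementation. This is a simplified sketch with a brute-force neighbor search; the feature points, the parameter values, and the mapping of clusters to agent names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: List[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Label each point with a cluster index, or NOISE (-1)."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: expand
    return labels


# Hypothetical per-user feature points, e.g., rows of a factor matrix.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (10.0, 0.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
# One (hypothetical) specialized agent per cluster; noise gets none.
agents = {c: f"agent_{c}" for c in sorted(set(labels)) if c != NOISE}
print(labels, agents)
```

Users labeled as noise belong to no group, which matches the case in which baseline HO methods remain responsible for them.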

For a more detailed description of the tensor-based method for determining various patterns of users with high mobility, the authors refer to [44].

2.2. Proposed HO Algorithm Based on RL

Next, an overview is given of the second data processing branch, which operates in near-RT and can be implemented as an xApp. Here, received measurement reports are fed into the processing chain in real time, without being stored in a database as in the non-RT chain. Depending on the selected classification model, the received data can be preprocessed, for example, using a Kalman filter if the main trend in parameter changes needs to be identified. A classifier is then applied, which determines the user’s group membership in near-RT. Classification has two possible outcomes: the user is successfully assigned to one of the high-mobility groups, or the user does not belong to any of them.
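The Kalman preprocessing mentioned above can be as simple as a scalar filter with a random-walk state model. The sketch below is one such minimal variant; the noise variances and the sample values are assumptions, and a real deployment might use a richer state model.

```python
from typing import List


def kalman_smooth(measurements: List[float], process_var: float,
                  measurement_var: float) -> List[float]:
    """Scalar Kalman filter with a random-walk state model."""
    estimate = measurements[0]
    error_var = measurement_var
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed to drift slowly (random walk).
        error_var += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


# Hypothetical RSRQ samples in dB that fluctuate around a stable level.
noisy = [-10.0, -12.0, -10.0, -12.0, -10.0, -12.0]
trend = kalman_smooth(noisy, process_var=0.01, measurement_var=1.0)
print([round(v, 2) for v in trend])
```

A small process variance relative to the measurement variance yields a smoother trend at the cost of a slower reaction to genuine changes.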

If the users are successfully classified, they are assigned to the corresponding RL agent. An RL agent’s responsibilities include monitoring metrics such as RSRP, RSRQ, current BS load, previous measurement results, etc. Moreover, based on the observed metrics, the RL agent will initialize the HO process for the user in the corresponding group.

If the user does not belong to a high-mobility group, then baseline methods will be used, for example, based on Events A3 and A5. Furthermore, this classification option can serve as a safety net in the event of incorrect operation of the RL agent, as well as during its training.

It is worth clarifying that the training process for RL agents in non-RT must be continued even if already-trained RL agents are available. This approach will allow agents to adapt to the slowly changing characteristics of the BS cluster as a whole.

The general diagram of the RL agent’s interaction with the external environment is shown in Figure 4. In this paper, the state space is represented by RSRQ values from each user and BSs load values. Therefore, the state for the i-th user can be written as follows:

s_i = {RSRQ_BS₁, RSRQ_BS₂, …, RSRQ_BS_N, BSs_i, L_BS₁, L_BS₂, …, L_BS_N},    (1)

where RSRQ_BS_j is the RSRQ value reported by the user for the j-th BS, BSs_i is the index of the serving BS, L_BS_j is the current load of the j-th BS, and N is the number of BSs.
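A state of this form is a flat vector with 2N + 1 entries. The helper below shows one way to assemble it; the function name, the zero-based BS index, and the sample values (RSRQ in dB, load as a fraction) are assumptions for illustration.

```python
from typing import List, Sequence


def build_state(rsrq: Sequence[float], serving_bs: int,
                load: Sequence[float]) -> List[float]:
    """Concatenate per-BS RSRQ, the serving-BS index, and per-BS load, as in (1)."""
    if len(rsrq) != len(load):
        raise ValueError("rsrq and load must have one entry per BS")
    if not 0 <= serving_bs < len(rsrq):
        raise ValueError("serving_bs must index one of the BSs")
    return [*rsrq, float(serving_bs), *load]


# Two BSs (N = 2): the state has 2N + 1 = 5 entries.
state = build_state(rsrq=[-10.5, -13.0], serving_bs=0, load=[0.2, 0.9])
print(state)  # [-10.5, -13.0, 0.0, 0.2, 0.9]
```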

The action space consists of switching to any BS in the scenario under consideration. The action for the i-th user is therefore the index of the next serving BS, a_i = [BSj], where BSj is the index of the BS that will serve the i-th user at the next point in time.

The reward is an integral function that takes into account RSRQ from both the serving BS and neighbor BSs. The function also takes into account the load of the serving and neighbor BSs. Furthermore, to minimize the ping-pong effect, the number of HO in a given time interval is taken into account. Specifically, the reward function is formulated as follows:

R(V_RSRQ, V_L_BS, N_HO) = α₁(V_RSRQ(t) − V_RSRQ(t − 1)) + α₂(V_L_BS(t) − V_L_BS(t − 1)) + α₃N_HO(t),    (2)

where V_RSRQ(t) is a vector with RSRQ values from all users for each BS at time t, V_L_BS(t) is a vector with load values for each BS at time t, N_HO(t) is the number of HOs at time t, and α_i, i = 1, 2, 3, are weighting factors that adjust the impact of the RSRQ values, the network load, and the number of HOs, respectively.
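A scalar reward of the form (2) can be sketched as follows. Two points are assumptions of this example rather than statements from the paper: the vector differences are reduced to scalars by summing their elements, and the penalties for load growth and for HOs are expressed through negative weights α₂ and α₃ chosen by the caller.

```python
from typing import Sequence


def reward(rsrq_now: Sequence[float], rsrq_prev: Sequence[float],
           load_now: Sequence[float], load_prev: Sequence[float],
           n_ho: int, alphas: Sequence[float]) -> float:
    """Scalar version of (2); vector differences are summed (an assumption)."""
    a1, a2, a3 = alphas
    d_rsrq = sum(n - p for n, p in zip(rsrq_now, rsrq_prev))
    d_load = sum(n - p for n, p in zip(load_now, load_prev))
    return a1 * d_rsrq + a2 * d_load + a3 * n_ho


# RSRQ improved by 1 dB in total, the load is unchanged, two HOs occurred.
r = reward(rsrq_now=[-10.0, -12.0], rsrq_prev=[-11.0, -12.0],
           load_now=[0.5, 0.5], load_prev=[0.5, 0.5],
           n_ho=2, alphas=(1.0, -1.0, -0.5))
print(r)  # 0.0: the RSRQ gain is cancelled by the HO penalty
```

With these hypothetical weights, a 1 dB RSRQ gain is exactly offset by two HOs, which illustrates how α₃ trades signal quality against HO frequency.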

The Double Deep Q-Network (DDQN) algorithm was chosen for the RL agent because of its short convergence time, which is very important for training an RL agent in a running network. DDQN consists of two networks with the same architecture: (i) the online network, which is updated when interacting with the environment, and (ii) the target network, a stable version of the online network whose weights are periodically copied from the online network. The architecture of each network is shown in Figure 5. Each network consists of three fully connected layers (FC1, FC2, FC3) with a Rectified Linear Unit (ReLU) activation function between them.
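The role of the two networks is easiest to see in the learning target of Double DQN: the online network selects the next action, and the target network evaluates it. The sketch below computes this target from precomputed Q-values; the discount factor `gamma` and the numeric values are assumptions, since the paper does not report its hyperparameters.

```python
from typing import Sequence


def ddqn_target(reward: float, q_online_next: Sequence[float],
                q_target_next: Sequence[float], gamma: float,
                done: bool) -> float:
    """Double DQN target: online net selects the action, target net evaluates it."""
    if done:
        return reward  # no bootstrapping beyond the end of an episode
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# The online network prefers action 1; the target network values it at 0.3.
y = ddqn_target(reward=1.0, q_online_next=[0.2, 0.8],
                q_target_next=[0.5, 0.3], gamma=0.9, done=False)
print(round(y, 2))  # 1.27
```

A plain DQN target would instead use the maximum of the target network's own values (0.5 here, giving 1.45); decoupling selection from evaluation is what reduces the overestimation of Q-values.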

Possible actions to be taken include the following:

Initializing HO to any neighbor BS;
Deciding not to initiate any HO.
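Since the action is defined as the index of the next serving BS, both options in the list above can be represented by a single index. The sketch below assumes that selecting the index of the current serving BS means that no HO is initiated; this interpretation and the function name are assumptions.

```python
from typing import Optional


def decode_action(action: int, serving_bs: int, n_bs: int) -> Optional[int]:
    """Return the target BS for an HO, or None when no HO is initiated."""
    if not 0 <= action < n_bs:
        raise ValueError("action must be a valid BS index")
    return None if action == serving_bs else action


print(decode_action(action=1, serving_bs=0, n_bs=2))  # 1: HO to BS 1
print(decode_action(action=0, serving_bs=0, n_bs=2))  # None: stay
```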

In this paper, we trained an RL agent from scratch and also explored the possibility of further training a pre-trained RL agent. Training was conducted over 60 episodes. The results were averaged over 10 runs with randomly generated fading and user locations in the initial route zone.

To pre-train the model, data obtained during a baseline HO could be used, which significantly reduced training time. In this case, the action chosen by the RL model is replaced by the action chosen by the baseline HO.
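One possible reading of this pre-training scheme is sketched below: transitions are built from a trace logged while the baseline HO rule was in control, with the baseline's action stored in place of the agent's own choice. The trace format, the function names, and the strongest-cell rule used as the baseline are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Transition = Tuple[State, int, float, State]  # (state, action, reward, next state)


def collect_pretraining_data(
    states: List[State],
    rewards: List[float],
    baseline_policy: Callable[[State], int],
) -> List[Transition]:
    """Build transitions whose actions come from the baseline HO rule."""
    buffer: List[Transition] = []
    for t in range(len(states) - 1):
        action = baseline_policy(states[t])  # replaces the agent's own choice
        buffer.append((states[t], action, rewards[t], states[t + 1]))
    return buffer


def strongest_cell(state: State) -> int:
    """Hypothetical baseline: pick the BS with the highest RSRQ."""
    return max(range(len(state)), key=state.__getitem__)


# Logged per-BS RSRQ values and the rewards observed between snapshots.
states = [[-10.0, -12.0], [-12.0, -10.0], [-11.0, -11.0]]
rewards = [0.0, 1.0]
buffer = collect_pretraining_data(states, rewards, strongest_cell)
print([action for _, action, _, _ in buffer])  # [0, 1]
```

The resulting buffer could then be replayed through the usual Q-learning update before the agent starts to act on its own.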

3. Results

3.1. Simulation Setup

Network Simulator 3 (ns-3) is chosen as the simulation environment. All scenarios use non-standalone 5G NR networks. The managing LTE BS is given a very low radiated power so that the scenario approximates a homogeneous 5G network. All 5G BSs have the same radiated power and the same channel bandwidth. The simulation is conducted with two 5G BSs; the main simulation scenario parameters are presented in Table 1.

A total of three simulation scenarios are considered, which are shown in Figure 6. In the first scenario, users move from left to right from the service area of BS 1 toward BS 2. The main goal of this scenario is to test the operation of the proposed algorithm under normal network conditions.

In the second and third scenarios, users move from top to bottom along the boundary between the two cells. Here, the baseline approach is expected to produce a ping-pong effect, characterized by a large number of handoffs between BSs in a short period of time. The proposed algorithm is expected to reduce the ping-pong effect while maintaining the overall system data rate.

The difference between scenarios 2 and 3 is that in the second scenario, only moving users need to be served. In the third scenario, an additional load is added to the second BS. This is performed using static users, the number of which is M = 10. In this case, the proposed algorithm’s ability to account for BS load is tested.

It is assumed that the proposed algorithm will connect moving users to the first BS, which has no additional load. Even if the signal strength of the second BS exceeds the signal strength of the serving BS, the proposed algorithm will maintain service and will not hand off users.

3.2. Results

Comparing the proposed algorithm with a classic (baseline) handover algorithm allows for a more relevant performance assessment. Therefore, this article presents the results of a comparison with the baseline HO algorithm based on Event A2 and Event A4, which will be referred to as the A2A4 algorithm hereinafter. For the A2A4 algorithm, the threshold value is set to 3 dB based on [45].

In the first scenario, users move from left to right, passing through the service areas of two cells. The simulation results for the baseline A2A4 HO algorithm and the proposed RL HO algorithm are shown in Figure 7. As we can see, at the boundary of the service areas of two cells, some variance occurs in HO for the A2A4 HO case. This is due to the fast fading that occurs at the cell boundary. As a result, the received signal exceeded the threshold value several times, causing the user to HO from one BS to another.

The RL agent shows results similar to those of the baseline HO in this scenario, but with fewer HOs: on average, about 5 HOs for the baseline against 1 HO for the RL agent. No additional HOs occur at the cell boundary. At the same time, the data transfer rate becomes smoother because the pauses in data transfer caused by multiple HOs are absent.

In [18], the ping-pong HO rate is defined as the ratio of the number of ping-pong HOs to the total number of successful HOs, excluding HO failures. In these scenarios, unsuccessful HOs are not modeled. For the baseline HO algorithm, the number of ping-pong HOs is equal to 5, which corresponds to a ping-pong HO rate of 100%. These values are calculated with an MTS value of 1 s, according to [18]. For the proposed algorithm, the number of ping-pong HOs is equal to zero.

Figure 8 shows the simulation results for a scenario of user movement along the service boundary of two cells, from top to bottom. As in the first scenario, the RL agent reduces the number of HOs while responding to some short-term changes between the power levels of the two BSs within a period of 6 to 7.5 s. In this case, the number of HOs decreased on average by a factor of about six, from 24 HOs to 3–4 HOs, with a comparable average data rate.

For the baseline HO algorithm, 19 of the 22 HOs are ping-pong HOs, which corresponds to a ping-pong HO rate of 86%. For the proposed algorithm, the ping-pong HO rate is 33%, since only one of the three HOs is a ping-pong HO.

In the overloaded BS scenario, the RL agent initiates the HO procedure only at the very beginning of the simulation (Figure 9). Even with a significant difference in signal strength, HO is not initiated. However, the data transfer rate remains stable and high.

Baseline HO yields significantly inferior results. The number of HO in this case is significantly higher, resulting in significant system signaling overhead and a lower average data rate. To compensate for the low data rate, BS must allocate additional resources between 8 and 10 s. This is characterized by a data rate peak, which places an additional load on the scheduler and may impact QoS for other users, if any.

As in the previous scenario, the number of HO decreased, while the average data transfer rate with the RL algorithm increased from approximately 7.5 Mbps to approximately 18 Mbps. The number of HO for the default case is about 40 per iteration against 1–2 HO for the RL case. The reason for the increasing data rate is fewer reconnections and pauses in transmissions. Peaks in the data rate are triggered by the data stored in the buffer for transmission. When we have no retransmissions, the base station provides additional resources to transmit data.

For the baseline HO algorithm, 26 of the 30 HOs are ping-pong HOs, which corresponds to a ping-pong HO rate of approximately 87%. For the proposed algorithm, there are no ping-pong HOs at all.
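The ping-pong HO rates of the baseline algorithm can be recomputed from the counts reported for the three scenarios. The helper below applies the definition from [18]; the function name and the convention of returning zero when no HO took place are assumptions.

```python
def ping_pong_rate(n_ping_pong: int, n_successful: int) -> float:
    """Ping-pong HO rate in percent; 0 when no HO took place."""
    if n_successful == 0:
        return 0.0
    return 100.0 * n_ping_pong / n_successful


# (ping-pong HOs, total HOs) of the baseline A2A4 algorithm per scenario.
baseline = {"scenario 1": (5, 5), "scenario 2": (19, 22), "scenario 3": (26, 30)}
for name, (pp, total) in baseline.items():
    print(f"{name}: {ping_pong_rate(pp, total):.1f}%")
# scenario 1: 100.0%
# scenario 2: 86.4%
# scenario 3: 86.7%
```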

4. Discussion

The proposed approach of separating users into groups based on their mobility patterns leads to simpler and more specialized RL agents. Trained specialized agents are not suitable for other scenarios because their observation scope is very limited. State-of-the-art approaches have a much wider scope, which improves their generalization but also requires much more time for RL convergence. Further generalization makes an RL model suitable for a wider spectrum of scenarios, but it also makes the model less specialized and leads to lower performance than that of specialized models. In addition, small pre-trained specialized RL agents can be easily scaled to high-performance systems owing to their simplicity, and they require only a short time for convergence.

For future work, we plan to use an RL agent pre-trained on prepared data to reduce the number of iterations required for convergence. As our main goal, we also plan to investigate a setup with multiple groups and a classification model that selects the specialized RL agent. Furthermore, a mechanism for activating the traditional handover algorithm should be considered. It would provide more robustness in unexpected cases or when the performance of the RL agent is low, guaranteeing a good user experience. Moreover, a more complex reward function for optimizing more KPIs will be investigated in the future.

5. Conclusions

This paper presented a possible application of RL-driven HO based on the O-RAN architecture and studied the effectiveness of specialized RL agents on typical problem cases of mobility management. The concept is based on dividing users into separate groups according to their mobility patterns. A specialized RL agent is assigned to each group, which reduces the possible state space, as users in a group have similar patterns of environmental parameter changes. As a result, each RL agent solves a more specialized case, and the function approximating the action utility becomes significantly simpler, which reduces the computational complexity and convergence time of the RL model. The RL agent learned to ignore short-term dips in signal strength, thereby avoiding the ping-pong effect. The results show a higher data rate and a significantly reduced number of handovers. In the case of an overloaded cell, the algorithm developed during training an “understanding” that switching to a BS with a high RSRQ does not lead to an increase in data transmission rate.

Author Contributions

Conceptualization, A.F.N.; methodology, I.A.S., I.P.A. and A.F.N.; software, I.P.A.; validation, I.P.A.; formal analysis, I.A.S., I.P.A., A.A.K., A.K.G. and A.F.N.; investigation, I.A.S. and I.P.A.; data curation, I.P.A.; writing – original draft preparation, I.A.S. and I.P.A.; writing – review and editing, I.A.S., I.P.A., A.A.K., A.K.G. and A.F.N.; visualization, I.A.S. and I.P.A.; supervision, A.A.K. and A.F.N.; project administration, A.K.G.; funding acquisition, A.F.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Russian Science Foundation, grant number 23-69-10084, https://rscf.ru/project/23-69-10084/, accessed on 7 November 2025.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

3GPP	Third Generation Partnership Project
4G	Fourth Generation
5G	Fifth Generation
6G	Sixth Generation
AI	Artificial Intelligence
ANN	Artificial Neural Network
B5G	Beyond 5G
BS	Base Station
CPD	Canonical Polyadic Decomposition
CHO	Conditional Handover
CMAB	Contextual Multi-Armed Bandit
DBSCAN	Density-Based Spatial Clustering of Applications with Noise
DRL	Deep Reinforcement Learning
DNN	Deep Neural Network
DDRL	Double Deep Reinforcement Learning
DDQN	Double Deep Q-Network
FLC	Fuzzy Logic Control
GPU	Graphics Processing Unit
HM	Hysteresis Margin
HO	Handover
LSTM	Long Short-Term Memory
LTE	Long-Term Evolution
Near-RT	Near-Real-Time
Non-RT	Non-Real-Time
NR	New Radio
ML	Machine Learning
O-RAN	Open Radio Access Network
PPO	Proximal Policy Optimization
QoE	Quality of Experience
QoS	Quality of Service
ReLU	Rectified Linear Unit
RIC	RAN Intelligent Controller
RL	Reinforcement Learning
RSRP	Reference Signal Received Power
RSRQ	Reference Signal Received Quality
RT	Real Time
SINR	Signal-to-Interference-plus-Noise Ratio
SL	Supervised Learning
SNR	Signal-to-Noise Ratio
TTT	Time-to-Trigger
UAV	Unmanned Aerial Vehicle
UDSC	Ultra-Dense Small Cells
UE	User Equipment

Figure 1. Example of HO mechanism based on A3 Event.

Figure 2. Examples of ping-pong HO.

Figure 3. Scheme of the proposed solution. On the left side, in the non-RT RIC, the measurements collected in Step 1 are passed to each rApp. In Step 7, the classification results from rApp 2 are passed to xApp 1. In Step 9, the ML training results from rApp 3 are passed to xApp 2. In the center, the blue rectangle represents the cluster of BSs. The dashed lines show user movements. On the right side, in the near-RT RIC, xApp 2 manages the handover execution within the BS cluster.

Figure 4. Overview of general RL algorithm. The dashed lines show user movements.

Figure 5. Online and target network architecture.

Figure 6. Simulation scenarios. The dash-dotted line from left to right shows the user movement in the first scenario. The dashed line from top to bottom shows the user movement in the second and third scenarios.

Figure 7. Simulation results for scenario 1.

Figure 8. Simulation results for scenario 2.

Figure 9. Simulation results for scenario 3.

Table 1. Scenario parameters.

Parameter Name	Value
Number of BSs	2
Number of UEs with high mobility	5
Number of static UEs (scenario 3)	10
Transmit power of BSs, dBm	30
Channel model	3GPP Urban Macro
HO threshold for baseline algorithm, dBm	3
Antenna mode	Isotropic
Center frequency, GHz	3.5
Channel bandwidth, MHz	20
Mobility model	Constant velocity
HARQ transmission	Enabled
RRC model	Ideal
Traffic model	UDP, 2.56 Mbps
