Novel Positioning Scheme Based on Supervised Deep Reinforcement Learning for Indoor Wireless Localization

Sun, Youngghyu; Kim, Kyounghun; Lee, Seongwoo; Seon, Joonho; Kim, Soohyun; Kim, Jinyoung

doi:10.3390/electronics15102203

by Youngghyu Sun, Kyounghun Kim, Seongwoo Lee, Joonho Seon, Soohyun Kim and Jinyoung Kim *

Department of Electronic Convergence Engineering, Kwangwoon University, Seoul 01897, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2203; https://doi.org/10.3390/electronics15102203

Submission received: 30 April 2026 / Revised: 17 May 2026 / Accepted: 19 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue Advanced Indoor Localization Technologies: From Theory to Application)

Abstract

In this paper, a supervised deep reinforcement learning (SDRL)-based positioning scheme is proposed for indoor wireless localization. The proposed scheme formulates the positioning problem as a Markov decision process and introduces a target-aware reward design based on the artificial potential field (APF) to alleviate the sparse reward problem commonly encountered in search-based reinforcement learning. In the proposed scheme, supervision is provided at the reward level by incorporating the target position into the reward design, rather than at the action level via expert demonstrations. A multi-scale action set with 49 candidates is further adopted to provide a favorable trade-off between estimation accuracy and search efficiency. An anchor-based environment construction strategy is developed by selecting the four strongest reference points (RPs) and transforming their coordinates with respect to the strongest RP. Simulation results show that the proposed scheme achieves a mean absolute error (MAE) below 0.8 m and success rates above 99.1% within 1 m and 99.2% within 2 m under the default Bluetooth Low Energy setting, while the convex-valid rate of the anchor-based environment exceeds 99.5%. Compared with existing methods, the proposed scheme reduces the MAE by approximately 92.3%. Ablation studies confirm that multi-scale actions reduce the average search steps by approximately 69.5% compared with a single-scale baseline. The proposed scheme also retains stable performance across BLE, Wi-Fi, and Zigbee infrastructures when trained under a representative path-loss setting without retraining and maintains sub-meter accuracy under mild shadow fading. These results confirm that the proposed scheme can improve positioning accuracy and search efficiency for indoor wireless localization.

Keywords: positioning; indoor wireless localization; supervised deep reinforcement learning; artificial potential field; multi-scale action

1. Introduction

1.1. Background

Indoor wireless localization has emerged as a key research area, attracting substantial interest from both academia and industry owing to its wide range of applications [1]. Indoor wireless localization aims to accurately estimate the position of devices within indoor environments and has shown considerable potential to enhance user experience [2]. Moreover, the integration of indoor wireless localization with Internet of Things (IoT) devices opens up new opportunities for smart home and smart building applications [3]. Despite these promising applications, achieving high-precision indoor wireless localization remains a significant challenge [4]. Unlike outdoor environments, indoor environments are inherently complex and subject to severe signal attenuation and multi-path effects caused by walls, furniture, and other obstacles. Moreover, their dynamic nature, characterized by the continuous movement of people and objects, introduces time-varying propagation conditions that further complicate the positioning problem [5]. These characteristics make it difficult to obtain reliable and consistent position estimates using conventional techniques. Therefore, advanced algorithms have been actively investigated to achieve accurate and robust indoor wireless localization performance [6].

1.2. Related Works

In pursuit of high-precision indoor wireless localization, positioning methods using received signal strength (RSS) have become a prominent approach. In ref. [7], multi-lateration methods incorporating zone selection and virtual position-based compensation were proposed, where RSS is utilized to estimate target distances and positions. RSS fingerprint-based positioning methods can generally be divided into two categories. The first directly predicts the precise geographical coordinates of a target based on the measured RSS. The second partitions the environment into uniformly sized grids and then uses the RSS fingerprints to determine which predetermined grid or zone contains the target.

RSS fingerprint-based methods incorporating deep learning (DL) have gained considerable popularity owing to their robustness and compatibility with the installed infrastructure in indoor environments [8]. Key advantages of DL include the ability to inherently extract representative features from the data samples as well as the ability to jointly optimize feature learning and inference through end-to-end trainable architectures [9].

A major challenge for DL-based methods using RSS fingerprints is that they require large-scale labeled training databases to achieve the desired accuracy. In general, the data collection phase requires substantial human effort to obtain a sufficiently diverse labeled dataset. Moreover, the methods are constrained by fingerprint maps that become outdated as the environment changes. Therefore, DL-based methods exhibit limited adaptability across deployments and dynamic environments, and their performance degrades over time. Grid-based formulations pose a further difficulty: determining an appropriate grid size typically requires prior knowledge of the environment. When the grid resolution needs to be changed, position estimators often have to be retrained, and achieving higher estimation precision requires a finer-resolution grid over the entire search area.

To alleviate the challenges of DL-based methods with RSS fingerprints, several recent studies have employed deep reinforcement learning (DRL) [10,11,12,13]. Unlike conventional DL-based methods, which rely heavily on labeled datasets, DRL enables an agent to learn optimal policies through repeated interaction with the environment. Through sequential observation of environmental states and execution of actions, the agent progressively refines its decision-making policy. Consequently, DRL not only estimates the instantaneous position but also maximizes long-term cumulative rewards, thereby reducing the reliance on extensive data collection and labeling.

In ref. [10], a DRL framework incorporating a deep generative model was proposed to extract features from both labeled and unlabeled data, enabling the agent to learn policies efficiently and robustly for IoT services under limited data availability. In ref. [11], an unsupervised DRL-based positioning method was proposed in which the positioning process is formulated as a grid search task; the scarcity of labeled data is addressed through a novel reward mechanism that leverages landmark extraction from unlabeled RSS signals. In refs. [12,13], DRL-based frameworks that progressively narrow the search space were proposed for 3D positioning. In ref. [12], the agent learns an optimal policy for hierarchical partitioning of the 3D environment, achieving high positioning accuracy with substantially reduced computational complexity. In ref. [13], by learning a policy that focuses on the most probable sub-regions, the framework further improves both the precision and efficiency of target positioning.

Although these DRL-based approaches alleviate some of the data-dependency issues of DL-based methods, they still exhibit several limitations when applied to RSS-based positioning. First, DRL formulations based on grid- or area-search suffer from the sparse reward problem, as the agent receives informative feedback only when it reaches the target region, which often requires hundreds of exploratory steps before an effective policy emerges. Second, their estimation accuracy is fundamentally constrained by the predefined resolution of the grid or the search area, so that achieving finer accuracy inevitably incurs higher time complexity. Third, similar to DL-based methods, these approaches show limited adaptability across diverse indoor scenarios, since any change in the layout or reference node deployment requires costly retraining of the DRL agent. These limitations motivate the development of a learning framework that provides denser reward signals, supports flexible resolution, and adapts to varying deployment conditions without retraining.

1.3. Contributions and Organization

To address these limitations, a positioning scheme based on supervised deep reinforcement learning (SDRL) is proposed. Unlike conventional SDRL approaches, which rely on expert demonstrations or labeled action trajectories, the proposed scheme provides supervision through a target-aware, reward-shaped DRL formulation rather than through imitation of expert actions. In this paper, SDRL refers to a two-stage framework comprising an offline training stage and an online position estimation stage. In the offline training stage, target-related information is incorporated into the reward design so that the agent can learn a target-directed policy, whereas in the online position estimation stage, the trained agent sequentially performs position estimation without directly using the target position. In the proposed framework, the positioning problem is formulated as a Markov decision process (MDP), where an agent sequentially estimates the target position by interacting with an RSS-based environment. For policy learning, the proximal policy optimization (PPO) algorithm is adopted so that the agent can learn an optimal policy for sequential positioning. Moreover, to mitigate the sparse reward problem commonly encountered in search-based DRL, an artificial potential field (APF) is employed in the reward function to provide dense and continuous guidance during the learning process. To address the rigid trade-off caused by fixed-resolution search, multi-scale movement actions are introduced so that the balance between estimation accuracy and search efficiency can be flexibly controlled. In addition, for improved deployment flexibility, only four reference points (RPs) are selected for each estimation, and their coordinates are transformed with respect to the RP with the strongest RSS to construct an initial environment. As a result, the proposed positioning scheme reduces the dependence on exhaustive labeled data, alleviates the difficulty of learning from sparse rewards, and improves adaptability to varying indoor deployment conditions. The main research questions addressed in this paper are summarized as follows:

RQ1: How can RSS-based indoor positioning be formulated as a reinforcement learning problem in which the sparse reward problem of conventional DRL-based search methods is alleviated, while reducing dependence on dense labeled fingerprint data?
RQ2: How can the trade-off between estimation accuracy and search efficiency, inherent to fixed grid-resolution DRL formulations, be flexibly controlled within a single unified policy?
RQ3: How can a learned positioning policy be made adaptable across diverse RP deployments and propagation conditions without retraining?

As answers to these research questions, the main contributions of this paper can be summarized as follows:

An SDRL-based positioning framework is proposed by modeling RSS-based indoor localization as an MDP. By explicitly incorporating the target position into the reward design, the proposed framework enables the agent to learn target-directed search behavior more effectively than conventional DRL-based positioning approaches.
An APF-based reward scheme with multi-scale movement actions is proposed. The APF-based reward scheme alleviates the sparse reward problem by providing dense and continuous learning signals, whereas the multi-scale movement actions overcome the limitation of fixed grid-resolution by enabling a flexible trade-off between estimation accuracy and time complexity.
An anchor-based environment construction strategy is introduced that uses only four RPs and transforms their coordinates with respect to the RP with the strongest RSS. This strategy reduces dependency on scenario-specific layouts and improves the adaptability of the proposed positioning scheme across diverse indoor environments.

The rest of this paper is organized as follows. Section 2 presents the SDRL framework for RSS-based indoor localization, including the MDP formulation, the PPO algorithm, and the anchor-based environment construction strategy. Section 3 describes the proposed positioning scheme. Section 4 describes the simulation settings and results. Finally, Section 5 concludes this paper.

2. Proposed Supervised Deep Reinforcement Learning Framework

2.1. Markov Decision Process Formulation for Positioning

RSS has been widely utilized for indoor localization due to its practical availability and location-dependent characteristics [14,15]. Unlike positioning methods requiring dedicated ranging or angle-measurement hardware, RSS can be directly obtained from widely deployed wireless communication systems such as Bluetooth, Wi-Fi, and Zigbee, making it a cost-effective solution for indoor positioning. In indoor environments, RSS is affected by the distance between a device and RPs [16]. The RSS values observed from multiple RPs can serve as meaningful information for estimating the target position. For this reason, RSS provides a practical basis for formulating the indoor positioning problem considered in this paper.

The overall structure of the proposed SDRL framework is depicted in Figure 1.

The indoor wireless localization environment consists of a user and N RPs. The user collects signals, including coordinates of RPs, and the RSS values are measured from the N RPs. Using the wireless signal path-loss model [11] as expressed in (1), the four RPs nearest to the user are chosen to establish an environment of the SDRL framework.

d = 10^{\frac{\mathrm{RSS} - b}{-10 n}},

(1)

where d represents the distance between the RP and the user, n is the path-loss exponent that governs the rate of signal attenuation with distance, and b is the reference RSS in dBm measured at $d = 1$ m. Both parameters are determined empirically for each indoor propagation environment. Based on the selected RPs, the user position is estimated through interaction between the agent and the environment. To achieve stable and efficient learning, the agent is trained using an actor-critic-based SDRL framework.
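The distance conversion in (1) and the selection of the four strongest RPs can be illustrated with a minimal sketch. The function names and the default values of b and n below are hypothetical placeholders chosen for illustration; the paper states only that both parameters are determined empirically for each environment.

```python
def rss_to_distance(rss_dbm, b=-60.0, n=2.0):
    """Invert the path-loss model of (1): d = 10^((RSS - b) / (-10 n)).

    b (reference RSS at 1 m, in dBm) and n (path-loss exponent) are
    illustrative defaults, not values taken from the paper.
    """
    return 10.0 ** ((rss_dbm - b) / (-10.0 * n))


def select_strongest_rps(rss_dbm, k=4):
    """Return the indices of the k RPs with the largest RSS, strongest first."""
    ranked = sorted(range(len(rss_dbm)), key=lambda i: rss_dbm[i], reverse=True)
    return ranked[:k]


# Example: six RPs observed by the user (RSS in dBm).
rss = [-70.0, -50.0, -90.0, -60.0, -55.0, -80.0]
selected = select_strongest_rps(rss)
distances = [rss_to_distance(rss[i]) for i in selected]
```

Under this monotonic model, a stronger RSS always maps to a shorter estimated distance, which is why selecting the four strongest RPs is equivalent to selecting the four nearest ones.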

In general, SDRL approaches often use expert demonstrations, labeled state–action pairs, or supervised pretraining to guide policy learning [17]. In such approaches, supervision is usually provided in the form of desired actions or trajectories, and the agent learns to imitate or initialize a policy from externally provided examples. By contrast, the SDRL framework considered in this paper does not rely on expert demonstrations or labeled action trajectories. Instead, supervision is introduced through the reward design during the agent training stage. Specifically, the target position is used only in the training stage to construct a target-aware reward function, allowing the agent to learn a policy that selects actions leading toward the target. In the online position estimation stage, the target position is no longer given to the agent; the trained policy sequentially estimates the user position. Therefore, the term “supervised” in the proposed SDRL framework refers to reward-level supervision rather than to action-level supervision. Similar to reward-shaped DRL [18], prior task knowledge is injected via the reward signal instead of imitation. The term SDRL is adopted to emphasize that the target information serves as a supervisory cue during training, while preserving the standard reinforcement learning interaction between the agent and the environment.

Based on this SDRL formulation, the positioning problem can be modeled as an MDP [11]. The MDP consists of three major components: state, action, and reward function. The state is defined as a six-dimensional vector composed of the current agent position and the RSS values measured from the four selected RPs, where the selected RPs are those providing the strongest RSS values among all available RPs. The action space is designed as a multi-scale action set $A$ to overcome the accuracy-efficiency limitation imposed by a fixed movement resolution. Specifically, $A$ consists of two-dimensional movement vectors in which each axis independently takes one value from $\{-1, -0.1, -0.01, 0, 0.01, 0.1, 1\}$, yielding 49 action candidates per decision step. The three scales $\{1, 0.1, 0.01\}$ support coarse, intermediate, and fine movements, so that the agent can perform rapid coarse exploration when far from the target and fine adjustment near the target within a single unified policy. Through repeated interactions with the environment, the agent learns an effective policy for target-directed position estimation.
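The multi-scale action set can be enumerated directly as the Cartesian product of the seven per-axis step values, which gives 7 × 7 = 49 candidates. The following sketch uses hypothetical names (`STEP_VALUES`, `ACTIONS`, `apply_action`) and assumes that a discrete action index selects one movement vector.

```python
from itertools import product

# Per-axis step values: three scales (1, 0.1, 0.01) in both directions, plus zero.
STEP_VALUES = (-1.0, -0.1, -0.01, 0.0, 0.01, 0.1, 1.0)

# Each action is a 2D movement vector (dx, dy); the axes vary independently.
ACTIONS = list(product(STEP_VALUES, repeat=2))


def apply_action(position, action_index):
    """Move the agent by the movement vector selected by a discrete action index."""
    dx, dy = ACTIONS[action_index]
    return (position[0] + dx, position[1] + dy)


# A coarse step along x combined with an intermediate step along y.
new_position = apply_action((2.0, 3.0), ACTIONS.index((1.0, -0.1)))
```

Because the two axes are independent, a single action can mix scales, for example a coarse move along x with a fine correction along y.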

While the multi-scale action design addresses the resolution-efficiency trade-off in the agent’s movements, the reward function should be designed to counter another well-known limitation of search-based DRL, namely the sparse reward problem. To provide dense and continuous learning signals throughout the search, the reward function is constructed based on the APF method. The underlying principle of the APF method is to model the motion of an agent as being driven by a virtual potential field, in which goals and obstacles are represented by attractive and repulsive potentials, respectively [19]. Following this principle, the target is treated as a source of attractive force, namely the gravitational force (GF), acting on the agent, while the selected RPs are treated as sources of repulsive force (RF). The GF is derived from the negative gradient of an attractive potential, as given in (2).

F_{att} = -\nabla U_{att} = -k d_{tar},

(2)

where $k$ and $d_{tar}$ denote a positive coefficient and the agent’s distance from the target, respectively. The magnitude of the GF increases as the agent moves farther from the target, thereby encouraging the agent to move toward the target location. Likewise, the RF is derived from the negative gradient of a repulsive potential field as expressed in (3).

F_{rep} = -\nabla U_{rep} = \begin{cases} m \left(\frac{1}{d_{RP}} - \frac{1}{δ}\right) \frac{1}{d_{RP}^{2}} \frac{\partial d_{RP}}{\partial X}, & d_{RP} \leq δ, \\ 0, & d_{RP} > δ, \end{cases}

(3)

where

\frac{\partial d_{RP}}{\partial X} = \left(\frac{\partial d_{RP}}{\partial x}, \frac{\partial d_{RP}}{\partial y}\right).

Here, $m$, $d_{RP}$, and $δ$ are a positive coefficient, the agent’s distance from the RP, and the maximal range of influence of the RP, respectively. In particular, the maximal range of influence is assigned differently to each selected RP according to the RSS ranking. Specifically, a smaller maximal range of influence is assigned to the RP with a stronger RSS, whereas a larger maximal range of influence is assigned to the RP with a weaker RSS. This design is motivated by the fact that an RP with stronger RSS is more likely to be located closer to the user and, therefore, should affect the agent within a more localized region.
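The two forces in (2) and (3) can be sketched as 2D vectors. This is an illustrative reading rather than the authors' implementation: the attractive force is written in vector form as -k (X - X_tar), whose magnitude is k d_tar, and the gradient of d_RP with respect to X is taken as the unit vector pointing from the RP to the agent. The function names, the default coefficients, and the zero-distance guard are assumptions.

```python
import math


def attractive_force(agent, target, k=1.0):
    """Vector form of (2): magnitude k * d_tar, directed from the agent to the target."""
    return (-k * (agent[0] - target[0]), -k * (agent[1] - target[1]))


def repulsive_force(agent, rp, delta, m=1.0):
    """Vector form of (3): non-zero only within the influence range delta of the RP."""
    dx, dy = agent[0] - rp[0], agent[1] - rp[1]
    d = math.hypot(dx, dy)
    # Outside the influence range the force vanishes; d == 0 is guarded because
    # (3) is undefined there (this guard is an assumption, not from the paper).
    if d > delta or d == 0.0:
        return (0.0, 0.0)
    scale = m * (1.0 / d - 1.0 / delta) / d ** 2
    # (dx / d, dy / d) is the gradient of d_RP with respect to the agent position.
    return (scale * dx / d, scale * dy / d)


f_att = attractive_force((3.0, 4.0), (0.0, 0.0), k=2.0)
f_rep = repulsive_force((1.0, 0.0), (0.0, 0.0), delta=2.0)
```

Assigning a smaller delta to an RP with stronger RSS, as described above, simply shrinks the disc in which `repulsive_force` returns a non-zero vector.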

Using the APF method, the GF and RF can be converted into a reward or penalty according to the action taken by the agent. The APF-based reward function is shown in (4).

R_{APF} = \begin{cases} R_{att} + R_{rep,i}, & d_{RP_{i}} \leq δ_{i}, \\ R_{att}, & \text{otherwise}, \end{cases}

(4)

where

\begin{aligned} R_{att} &= (d_{max} - d_{tar}^{t}) (d_{tar}^{t} - d_{tar}^{t+1}), \\ R_{rep,i} &= (δ_{i} - d_{RP_{i}}^{t}) (d_{RP_{i}}^{t+1} - d_{RP_{i}}^{t}). \end{aligned}

Here, $d_{max}$ denotes the distance between the agent’s initial location and the target, $δ_{i}$ and $d_{RP_{i}}$ denote the maximal range of influence of the $i$-th selected RP and the agent’s distance from it, and the superscript denotes the time step. The true target position is required only during the offline training stage for two specific purposes: (i) evaluating $F_{att}$ in (2), which depends on $d_{tar}$, and (ii) computing $R_{att}$ in (4). In practice, this ground-truth information is obtained through an offline data collection phase of conventional fingerprint-based methods [6,8]; however, the labeling burden is significantly reduced because the proposed framework requires only a sparse set of labeled target positions to shape the reward, rather than a fine-grained labeled grid that covers the entire indoor area. Once the policy has been trained, the online estimation stage operates entirely without ground-truth target information, relying solely on real-time RSS measurements and the learned policy. Through this target-aware reward design, the proposed SDRL framework can provide dense and continuous learning signals, thereby mitigating the sparse reward problem that commonly arises in search-based positioning tasks.
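A minimal sketch of the APF-based reward in (4) is given below. Two points are assumptions rather than statements from the paper: the repulsive term is summed over every selected RP whose influence range contains the agent, and the range test uses the distance at time t. The function and argument names are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def apf_reward(pos_t, pos_next, target, d_max, rps, deltas):
    """APF-based reward of (4) for one transition pos_t -> pos_next."""
    d_tar_t = dist(pos_t, target)
    d_tar_next = dist(pos_next, target)
    # R_att: positive when the agent gets closer to the target.
    reward = (d_max - d_tar_t) * (d_tar_t - d_tar_next)
    for rp, delta in zip(rps, deltas):
        d_rp_t = dist(pos_t, rp)
        if d_rp_t <= delta:
            # R_rep,i: positive when the agent moves away from an RP it is close to.
            reward += (delta - d_rp_t) * (dist(pos_next, rp) - d_rp_t)
    return reward


# One step toward the target that also moves away from a nearby RP.
r = apf_reward((4.0, 0.0), (3.0, 0.0), target=(0.0, 0.0), d_max=5.0,
               rps=[(5.0, 0.0), (10.0, 0.0)], deltas=[2.0, 2.0])
```

Every transition receives a non-trivial reward whenever the distance to the target changes, which is the dense feedback that the sparse terminal reward alone cannot provide.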

When the agent reaches a position sufficiently close to the target, a positive terminal reward is assigned, and the episode is terminated. By contrast, when the updated position exceeds the feasible search boundary, a negative reward is imposed. Accordingly, the proposed MDP formulation enables the agent to learn an effective positioning policy that achieves a desirable balance between search efficiency and estimation accuracy.

2.2. PPO-Based Policy Learning

PPO is a policy gradient-based reinforcement learning algorithm that aims to improve the policy while keeping the training behavior stable [20]. In PPO, a neural network parameterizes the policy, and the agent learns it by maximizing the expected cumulative reward through repeated interactions with the environment. PPO has been widely adopted to stabilize policy learning by constraining excessive policy updates. As a result, PPO can be a viable solution for the proposed search-based positioning framework.

A key component of PPO is importance sampling, which measures the discrepancy between the updated policy and the previous policy through their probability ratio. The sampling ratio is represented as (5).

r_{t} (θ) = \frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})},

(5)

where $θ$ is the policy parameter, and $π_{θ}(a_{t} | s_{t})$ and $π_{θ_{old}}(a_{t} | s_{t})$ denote the new and old policies, respectively. This ratio indicates how much the probability of selecting an action $a_{t}$ at a state $s_{t}$ changes after the policy update.

To avoid overly large policy changes, PPO employs a clipping mechanism in the surrogate objective. The clipped policy loss is expressed as (6).

L_{t}^{CLIP}(θ) = E_{t} [\min (r_{t}(θ) A_{t}, \text{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) A_{t})],

(6)

where $\text{clip}(\cdot)$ is the clipping function, $ϵ$ is a hyperparameter controlling the allowable update range, and $A_{t}$ denotes the advantage function. The advantage function represents how much better the selected action is than the average expected action at the given state. By clipping the probability ratio within a predefined interval, PPO suppresses excessively large updates and thereby improves training robustness.
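The probability ratio in (5) and the clipped surrogate in (6) can be illustrated with a small numerical sketch. The expectation is approximated by a batch mean, and the value ϵ = 0.2 is only an illustrative default; the function names are hypothetical.

```python
import math


def probability_ratio(logp_new, logp_old):
    """r_t(theta) in (5), computed from log-probabilities for numerical stability."""
    return math.exp(logp_new - logp_old)


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch estimate of L^CLIP in (6)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        # The minimum keeps the more pessimistic of the two estimates.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


# Three samples: the ratio is too large, too small, and unchanged.
objective = clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])
```

In the first sample the gain from raising an already favored action is capped at 1.2, and in the second the penalty is evaluated at 0.8 instead of 0.5, so neither sample rewards pushing the ratio outside the interval [0.8, 1.2].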

The overall objective function of PPO consists of the clipped policy objective, the state-value loss, and the entropy bonus, as given by (7).

L_{t}^{PPO}(θ) = E_{t} [L_{t}^{CLIP}(θ)] - c_{1} L_{t}^{VF}(θ) + c_{2} S(s_{t}),

(7)

where $L_{t}^{CLIP}(θ)$ is the clipped policy objective, $L_{t}^{VF}(θ)$ is the value-function loss, $S(s_{t})$ is the entropy bonus of the policy, and $c_{1}$ and $c_{2}$ are weighting coefficients. The value loss helps the critic network estimate the state value more accurately, whereas the entropy term encourages sufficient exploration and reduces the risk of premature convergence to suboptimal policies. Through this composite objective, PPO achieves stable policy optimization while preserving the exploration capability of the agent.
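The composite objective in (7) can be assembled from its three terms as sketched below. The paper does not specify the form of the value loss or the entropy term, so a squared-error value loss and the entropy of a categorical policy over the discrete actions are assumed here; the coefficients 0.5 and 0.01 are illustrative, and all names are hypothetical.

```python
import math


def value_loss(values, returns):
    """Mean squared error between critic estimates and return targets (assumed form)."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


def categorical_entropy(probs):
    """Entropy of a discrete action distribution; zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_objective(clip_objective, v_loss, entropy, c1=0.5, c2=0.01):
    """Composite objective of (7): clipped term minus value loss plus entropy bonus."""
    return clip_objective - c1 * v_loss + c2 * entropy


# A uniform policy over 49 actions has the maximum possible entropy, log(49).
uniform = [1.0 / 49] * 49
total = ppo_objective(0.8, value_loss([1.0, 2.0], [0.0, 4.0]),
                      categorical_entropy(uniform))
```

The signs mirror (7): the objective is maximized, so a larger value loss lowers it and a larger entropy raises it.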

Based on the objective function in (7), the policy parameters are iteratively updated by gradient ascent as

θ \leftarrow θ + α \nabla_{θ} L_{t}^{PPO}(θ),

(8)

where $α$ denotes the learning rate. In the proposed positioning framework, PPO is adopted to train the agent so that it can learn a stable and effective policy for target-directed position estimation under the designed MDP. Owing to its clipped update structure and actor-critic framework, PPO is well-suited to the proposed SDRL framework, where stable convergence and reliable policy improvement are essential.

2.3. Anchor-Based Environment Construction Strategy

In order to improve the adaptability of the proposed positioning scheme across different indoor environments, an anchor-based environment construction strategy is adopted. In RSS-based positioning, the absolute coordinates of RPs may vary significantly depending on the deployment scenario, even when the relative signal characteristics observed by the user are similar. If the localization environment is directly represented using the original RP coordinates, the learned policy may become highly dependent on a specific layout, which degrades its adaptability to different RP deployment conditions. To address this issue, the proposed positioning scheme constructs a transformed local environment using only the four selected RPs and redefines their coordinates with respect to a common RP.

The coordinates of the four selected RPs are denoted by

g_{i} = [x_{i}, y_{i}]^{T}, \quad i \in \{1, 2, 3, 4\},

(9)

where the selected RPs correspond to the four strongest RSS values among all available RPs. Among them, the RP providing the strongest RSS is chosen as the anchor RP, since it is expected to be the closest RP to the user. The coordinate of the anchor RP is expressed by

g_{anc} = [x_{anc}, y_{anc}]^{T}.

(10)

Then, the coordinates of all selected RPs are transformed by subtracting the anchor RP coordinate as

\tilde{g}_{i} = g_{i} - g_{anc}, \quad i \in \{1, 2, 3, 4\}.

(11)

Through this transformation, the anchor RP is always located at the origin, i.e.,

\tilde{g}_{anc} = [0, 0]^{T}.

(12)

Using the same coordinate transformation, the target position and the current agent position are also represented in the transformed local coordinate system. The original target and agent coordinates are denoted by $p^{tar}$ and $p_{t}$, respectively. Their transformed coordinates are given by

\tilde{p}^{tar} = p^{tar} - g_{anc},

(13)

\tilde{p}_{t} = p_{t} - g_{anc}.

(14)

Accordingly, the position estimation process is carried out in a local coordinate system centered at the RP with the strongest RSS. This transformation preserves the relative geometry among the selected RPs, the agent, and the target, while eliminating unnecessary dependence on absolute global coordinates.
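The transformation in (9)–(14) amounts to subtracting the anchor coordinate from every point, as the following sketch shows. The function name `to_anchor_frame` and the example coordinates are hypothetical.

```python
def to_anchor_frame(rp_coords, rss_dbm, points):
    """Shift the selected RPs and any extra points so the strongest RP is the origin.

    rp_coords: (x, y) coordinates of the four selected RPs.
    rss_dbm:   RSS value of each selected RP, in the same order.
    points:    additional points to transform, e.g. the target and the agent.
    """
    anchor_idx = max(range(len(rss_dbm)), key=lambda i: rss_dbm[i])
    ax, ay = rp_coords[anchor_idx]

    def shift(p):
        return (p[0] - ax, p[1] - ay)

    return [shift(g) for g in rp_coords], [shift(p) for p in points], (ax, ay)


rps = [(10.0, 10.0), (12.0, 10.0), (12.0, 13.0), (10.0, 13.0)]
rss = [-62.0, -55.0, -70.0, -66.0]
local_rps, (local_target,), anchor = to_anchor_frame(rps, rss, [(11.0, 11.5)])
```

Since the same offset is subtracted from every point, all pairwise distances are unchanged. An estimate produced in the local frame can be mapped back to global coordinates by adding the returned anchor coordinate.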

The main advantage of this strategy is that it enables the policy network to focus on the relative spatial relationship implied by the RSS measurements rather than memorizing scenario-specific RP layouts. In other words, even if the absolute RP deployment changes from one indoor environment to another, the transformed environment can still exhibit a similar structural pattern when the relative arrangement of the strongest RPs is comparable. As a result, this strategy can improve the deployment flexibility of the proposed SDRL framework while enhancing robustness across various localization scenarios.

In addition, the feasible search region is determined based on the transformed coordinates of the selected RPs. Therefore, the agent performs position estimation in a compact local search space constructed from the most relevant neighboring RPs instead of the entire global environment. This not only reduces the complexity of the positioning problem but also improves the consistency of the state representation and reward design, leading to more accurate and reliable positioning results. Overall, the proposed anchor-based environment construction strategy can provide an effective basis for stable policy learning and enhanced adaptability in RSS-based indoor localization.

3. Proposed Positioning Scheme

The block diagram of the proposed positioning scheme is illustrated in Figure 2.

The proposed scheme consists of two main stages: an offline training stage and an online position estimation stage. In both stages, RSS measurements are first obtained from all available RPs, and the four RPs providing the strongest RSS values are selected to construct a local positioning environment. After the RP selection, a geometric validity check is performed to determine whether the four selected RPs form a convex quadrilateral. This step is introduced to ensure that the transformed local environment is geometrically valid and that the agent can search within a consistent feasible region bounded by the selected neighboring RPs. The convex quadrilateral condition is adopted for three reasons: (i) it allows the selected RPs to enclose a bounded region for constructing a consistent transformed coordinate system; (ii) it enables the APF reward gradients in (2)–(4) to provide stable attractive and repulsive guidance within the region; and (iii) it provides a stable bounding box for the agent’s search space during PPO training, preventing the policy from extrapolating into geometrically inconsistent regions that were not encountered during training.
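One possible way to implement the geometric validity check is sketched below. The paper does not state how convexity is tested, so this is an assumption: because the RPs are selected by RSS rank rather than in boundary order, the sketch tries each cyclic ordering of the four points and accepts the configuration if some ordering yields a strictly convex polygon, meaning that all consecutive turns have the same non-zero orientation. The function names are hypothetical.

```python
from itertools import permutations


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def is_strictly_convex(polygon, eps=1e-9):
    """True if every consecutive turn has the same non-zero orientation."""
    n = len(polygon)
    signs = []
    for i in range(n):
        c = cross(polygon[i], polygon[(i + 1) % n], polygon[(i + 2) % n])
        if abs(c) <= eps:  # three collinear vertices: degenerate configuration
            return False
        signs.append(c > 0)
    return all(signs) or not any(signs)


def forms_convex_quadrilateral(points):
    """True if the four points are the vertices of a convex quadrilateral."""
    first, rest = points[0], points[1:]
    # Fixing the first point covers every cyclic ordering in both directions.
    return any(is_strictly_convex([first, *perm]) for perm in permutations(rest))


valid = forms_convex_quadrilateral([(0, 0), (1, 1), (1, 0), (0, 1)])
invalid = forms_convex_quadrilateral([(0, 0), (4, 0), (0, 4), (1, 1)])
```

The second example fails because one RP lies inside the triangle formed by the other three, so no ordering of the points encloses a convex region.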

If the selected RPs satisfy the convex quadrilateral condition, the RP with the strongest RSS is chosen as the anchor RP, and the coordinates of the selected RPs, as well as the target and agent positions, are transformed into a local coordinate system. Based on this transformed environment, the agent performs sequential position estimation using the PPO-based SDRL framework with the APF-based reward design and multi-scale movement actions. During the training phase, the policy is progressively optimized through interactions with the constructed local environments, whereas in the online position estimation stage, the pre-trained policy is applied to new RSS observations. The detailed procedures of the training stage and the online position estimation stage are described in Section 3.1 and Section 3.2, respectively.

3.1. Offline Training Stage

In the offline training stage, the agent learns a target-directed policy through repeated interactions with the transformed RSS-based environment. For each training episode, RSS values are obtained from all available RPs with respect to a given target position. Among them, the four RPs providing the strongest RSS values are selected to construct the local environment, since they are expected to contain the most relevant spatial information for position estimation. After the RP selection, the geometric configuration of the selected RPs is examined to determine whether the four points form a convex quadrilateral. This validity check is important because the proposed local search space and coordinate transformation strategy assume a consistent neighboring RP structure that can define a feasible bounded region for the agent. If the selected RPs do not form a valid convex quadrilateral, the current episode is excluded from policy learning. This exclusion is adopted for three reasons. First, the APF reward in (4) assumes that the target lies within a bounded region defined by the selected RPs; for non-convex configurations, this region is ill-defined and the repulsive force may produce gradients that guide the agent away from the target, contaminating the learning signal. Second, the anchor-based coordinate transformation in (9)–(14) implicitly assumes that the four RPs span a meaningful local frame; non-convex configurations produce degenerate or inconsistent local frames that introduce distributional shift in the state representation seen by the policy. Third, including such degenerate samples introduces high-variance gradient signals that destabilize PPO training. Geometrically invalid configurations are handled separately at inference time by the fallback mechanism described in Section 3.2, so excluding them from training does not compromise the framework’s ability to produce an estimate for every test sample.

If the selected RPs satisfy the convex quadrilateral condition, the anchor-based environment construction strategy described in Section 2.3 is applied. Specifically, the RP with the strongest RSS is chosen as the anchor RP, and the coordinates of the selected RPs, the target, and the initial agent position are transformed with respect to the anchor RP so that the positioning process is carried out in a local coordinate system. Using the current agent position and the RSS values of the four selected RPs, the state is set up based on the transformed environment. Then, the agent selects an action from the predefined multi-scale action set according to the current policy. After the action is executed, the agent position is updated, and the next state is obtained. At the same time, the APF-based reward is computed by considering the progress toward the target, the repulsive influence of the selected RPs, and the terminal conditions. The resulting transition tuple is stored and used for PPO-based policy learning. The actor network learns a policy that maximizes the total reward by interacting with valid local environments iteratively. The critic network, meanwhile, estimates the state value to stabilize the learning process.
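The anchor-based construction can be illustrated with a short sketch. It assumes that the transformation is a pure translation by the anchor RP coordinates, which is consistent with the inverse step of adding the anchor coordinates back in the online stage; any additional operations in (9)–(14) are not modeled, and all function names are hypothetical.

```python
def build_local_environment(rp_coords, rss_values, num_selected=4):
    """Select the strongest RPs and express them relative to the anchor RP.

    rp_coords: list of (x, y) tuples in the global frame.
    rss_values: list of RSS values (dBm), one per RP; larger means stronger.
    Returns (selected_indices, anchor_xy, local_rp_coords).
    """
    ranked = sorted(range(len(rss_values)),
                    key=lambda i: rss_values[i], reverse=True)
    selected = ranked[:num_selected]
    anchor = rp_coords[selected[0]]  # RP with the strongest RSS
    local_rps = [(rp_coords[i][0] - anchor[0], rp_coords[i][1] - anchor[1])
                 for i in selected]
    return selected, anchor, local_rps


def to_local(xy, anchor):
    """Translate a global position into the anchor-centered frame."""
    return (xy[0] - anchor[0], xy[1] - anchor[1])


def to_global(xy, anchor):
    """Inverse translation: add the anchor coordinates back."""
    return (xy[0] + anchor[0], xy[1] + anchor[1])
```

In this sketch the anchor RP always maps to the origin of the local frame, so the state seen by the policy does not depend on where the four RPs sit in the global layout.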

The training procedure is repeated until the policy converges or the maximum number of training episodes is reached. As a result, the agent acquires a stable position estimation policy that is trained only on geometrically valid local RP configurations. The detailed training procedure of the proposed positioning scheme is summarized in Algorithm 1.

Algorithm 1 Offline Training Stage of Proposed Scheme

1: Input: multi-scale action set $A$, number of episodes $N_{epi}$, maximum number of steps $T_{max}$
2: Output: trained PPO policy $\pi_{\theta}^{*}$
3: Initialize the actor network, critic network, and PPO-related hyperparameters
4: for episode $= 1$ to $N_{epi}$ do
5:     Generate a target position $p^{tar}$
6:     Obtain RSS measurements from all $N$ RPs using (1)
7:     Select the four RPs with the strongest RSS values
8:     if the convex quadrilateral condition of the four RPs is not satisfied then
9:         continue (exclude this episode from policy learning)
10:     end if
11:     Transform the coordinates using (9)–(14)
12:     Initialize the current state $s_{0}$
13:     for step $t = 0$ to $T_{max} - 1$ do
14:         Select an action $a_{t} \in A$ by the PPO policy $\pi_{\theta}(a_{t} | s_{t})$
15:         Update the agent position by the selected action
16:         Obtain the next state $s_{t+1}$
17:         Compute the reward using (4)
18:         Check the terminal condition or boundary condition
19:         if the terminal condition is satisfied then
20:             break
21:         end if
22:     end for
23:     Update the actor and critic networks using the PPO clipped objective in (8)
24: end for
25: Return the trained policy $\pi_{\theta}^{*}$

3.2. Online Position Estimation Stage

In the online position estimation stage, the trained PPO policy is used to estimate the position of a user from new RSS observations. In contrast to the training stage, this stage does not involve policy updates. Instead, the learned policy is directly exploited to determine movement actions sequentially in the transformed local environment. Given the RSS observations from all available RPs, the four RPs providing the strongest RSS values are first selected, since they contain the most relevant local spatial information for the current target. Then, the geometric validity of the selected RP configuration is examined by checking whether the four points form a convex quadrilateral. This step ensures that the online estimation process is performed only when the selected neighboring RPs can define a proper local search region consistent with the environment used during training.

If the selected RPs form a valid convex quadrilateral, the anchor-based environment construction strategy is applied in the same manner as in the training stage. Specifically, the RP with the strongest RSS is chosen as the anchor RP, and the coordinates of the selected RPs and the current agent position are transformed with respect to the anchor RP to form the local coordinate system used for online inference. After the transformed environment is constructed, the initial state is generated from the current agent position and the RSS values of the selected four RPs. Based on this state, the trained PPO policy selects an action from the multi-scale action set. The selected action updates the current agent position, and the next state is formed accordingly. This process is repeated until the stopping criterion is satisfied. Since the online stage uses the same state representation, action space, and transformed local environment as the training stage, the learned policy can be consistently applied to unseen positioning scenarios. Finally, the estimated position obtained in the transformed local coordinate system is converted back to the original coordinate system by adding the coordinate of the anchor RP. Through this procedure, the proposed scheme can perform position estimation online without requiring additional retraining for each new RSS observation, while preserving the deployment flexibility enabled by the transformed environment representation.

If the selected four RPs do not satisfy the convex quadrilateral condition, the transformed local environment is not constructed. In this case, the proposed implementation uses the mean coordinate of the four selected RPs as a fallback estimate of the target position. This fallback is adopted for three reasons. First, the mean coordinate corresponds to the centroid of the selected neighboring RPs, which provides a simple approximation of the user position when no further geometric structure can be exploited. Second, it is a bounded, deterministic estimate that always lies within the convex hull of the four selected RPs, which avoids catastrophic outliers. Third, it weights the four RPs uniformly, which is the maximum-entropy weighting when no RP is assumed to be more informative than the others, making it a safe default in the absence of additional structural assumptions. Although this fallback estimation does not exploit the learned policy, it provides a simple and stable estimate for geometrically invalid RP configurations. The detailed online position estimation procedure is summarized in Algorithm 2.

Algorithm 2 Online Position Estimation Stage of Proposed Scheme

1: Input: trained PPO policy $\pi_{\theta}^{*}$, multi-scale action set $A$, maximum number of steps $T_{max}$
2: Output: estimated target position $\hat{p}^{tar}$
3: Obtain RSS measurements from all $N$ RPs
4: Select the four RPs with the strongest RSS values
5: if the convex quadrilateral condition of the four RPs is not satisfied then
6:     Return the mean coordinate of the selected four RPs
7: end if
8: Transform the coordinates using (9)–(14)
9: Initialize the current state $s_{0}$
10: for step $t = 0$ to $T_{max} - 1$ do
11:     Select an action $a_{t} \in A$ by the trained PPO policy $\pi_{\theta}^{*}(a_{t} | s_{t})$
12:     Update the agent position by the selected action
13:     Obtain the next state $s_{t+1}$
14:     if the terminal condition is satisfied then
15:         break
16:     end if
17: end for
18: Convert the estimated position by the inverse coordinate transformation
19: Return the estimated target position $\hat{p}^{tar}$
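The control flow of Algorithm 2 can be sketched as below. This is only an outline under stated assumptions: the trained policy, the geometric validity check, and the stopping criterion are passed in as caller-supplied callables (`policy`, `is_valid_geometry`, `should_stop`), the state is assumed to be the six-dimensional tuple of local agent position and four selected RSS values, the transformation is assumed to be a translation by the anchor RP, and the initial local position and `max_steps` default are hypothetical placeholders.

```python
def estimate_position(rp_coords, rss_values, policy, is_valid_geometry,
                      should_stop, start_local=(0.0, 0.0), max_steps=200):
    """Online estimation loop with the centroid fallback.

    policy(state) -> (dx, dy): hypothetical stand-in for the trained policy.
    is_valid_geometry(rps) -> bool: hypothetical convexity check.
    should_stop(state) -> bool: hypothetical stopping criterion.
    """
    # Select the four RPs with the strongest RSS values.
    ranked = sorted(range(len(rss_values)),
                    key=lambda i: rss_values[i], reverse=True)[:4]
    rps = [rp_coords[i] for i in ranked]

    # Fallback: mean coordinate of the four selected RPs.
    if not is_valid_geometry(rps):
        return (sum(x for x, _ in rps) / 4.0, sum(y for _, y in rps) / 4.0)

    anchor_x, anchor_y = rps[0]  # strongest RP is the anchor
    rss_selected = tuple(rss_values[i] for i in ranked)
    x, y = start_local

    for _ in range(max_steps):
        state = (x, y, *rss_selected)
        dx, dy = policy(state)          # select an action
        x, y = x + dx, y + dy           # update the agent position
        if should_stop((x, y, *rss_selected)):
            break

    # Inverse transformation: add the anchor coordinates back.
    return (x + anchor_x, y + anchor_y)
```

Passing the policy and the checks as arguments keeps the sketch independent of any particular reinforcement learning library; no policy update occurs anywhere in the loop, in line with the online stage described above.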

4. Simulation Results

In this section, the performance of the proposed positioning scheme is evaluated through extensive simulations. First, the simulation environment and parameter settings are introduced. Then, the performance of the proposed scheme is examined under different communication environments. Thereafter, ablation studies are conducted to analyze the effects of the multi-scale action design and the adopted reinforcement learning algorithm. In addition, the impacts of the reference point deployment structure and the number of reference points are investigated. The proposed scheme is compared with existing positioning methods, and its adaptability is verified across BLE (Bluetooth Low Energy), Wi-Fi, and Zigbee environments. Finally, the robustness of the proposed scheme under realistic indoor propagation effects is examined by extending the path-loss model with a log-normal shadow-fading term.

4.1. Simulation Setup

The proposed positioning scheme was simulated under three practical environments: indoor BLE [21], Wi-Fi [22], and Zigbee [7]. The parameter values of the path-loss model were taken directly from the empirical measurements reported in [7,21,22]. Specifically, the BLE parameters were derived from the indoor Bluetooth 5.0 measurements [21]. The Wi-Fi parameters follow the IEEE 802.11 measurements using five different mobile devices with Wi-Fi 4 modules [22]. The Zigbee parameters correspond to the IEEE 802.15.4 measurements at 2.4 GHz using Z1 wireless modules with the CC2420 RF transceiver [7]. The simulation environment layout is illustrated in Figure 3. A total of 20 RPs are deployed over a 125 m × 65 m indoor area in a rectangular grid configuration.

The simulation parameters are summarized in Table 1.

The performance in each environment was evaluated using 100,000 test samples. The hyperparameters were tuned empirically based on the complexity of the method. These parameter settings were used throughout the experiments unless otherwise stated.

Among the three communication environments described above, BLE is adopted as the default environment for the ablation studies, RP deployment analyses, and comparison with existing methods, unless otherwise stated. This choice is motivated by practical considerations. Compared to Wi-Fi and Zigbee, BLE can provide lower power consumption and deployment cost, making it one of the most widely adopted communication infrastructures in real-world indoor localization systems [21].

All simulations were conducted on a workstation equipped with an AMD Ryzen Threadripper PRO 5975WX CPU, a single NVIDIA GeForce RTX 4090 GPU, and 128 GB of RAM.

4.2. Performance Metrics

To quantitatively evaluate the proposed positioning scheme, several performance metrics are considered from the perspectives of estimation accuracy, search efficiency, and applicability of the transformed local environment. First, the positioning error for each test sample is defined as the Euclidean distance between the estimated position $\hat{p}_{j} = [\hat{x}_{j}, \hat{y}_{j}]^{T}$ and the true target position $p_{j} = [x_{j}, y_{j}]^{T}$, i.e.,

$$e_{j} = \| \hat{p}_{j} - p_{j} \|_{2}. \tag{15}$$

Based on the positioning error, the mean absolute error (MAE) and root mean squared error (RMSE) are computed as

$$\mathrm{MAE} = \frac{1}{M} \sum_{j=1}^{M} e_{j}, \tag{16}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{j=1}^{M} e_{j}^{2}}, \tag{17}$$

where $M$ denotes the number of test samples. For each metric, the mean value and standard deviation over test samples are reported to evaluate not only the average performance but also its statistical stability across test samples.

In addition to the average error metrics, the cumulative distribution function (CDF) of the positioning error is presented to provide a more comprehensive evaluation of the error distribution. The CDF indicates the probability that the positioning error is less than or equal to a given threshold, thereby allowing the overall performance as well as the robustness of the proposed scheme to be visually analyzed. From the CDF curves, representative percentile-based metrics can be obtained. In particular, the median error (50th percentile) represents the typical positioning performance, while the 90th percentile positioning error characterizes the tail of the error distribution, i.e., the error bound met by 90% of the test samples, which is relevant under high-reliability requirements [23,24,25].
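The error metrics defined above can be computed with a few lines of code. The sketch below is illustrative only; the function names are hypothetical, and the percentile uses linear interpolation between order statistics, which is one common convention and not necessarily the one used to produce the reported results.

```python
import math


def positioning_errors(estimates, targets):
    """Euclidean error e_j between each estimated and true position."""
    return [math.hypot(ex - tx, ey - ty)
            for (ex, ey), (tx, ty) in zip(estimates, targets)]


def mae(errors):
    """Mean of the per-sample positioning errors."""
    return sum(errors) / len(errors)


def rmse(errors):
    """Square root of the mean squared positioning error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def percentile(errors, q):
    """q-th percentile (0-100) with linear interpolation (assumed convention)."""
    ordered = sorted(errors)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)
```

For example, `percentile(errors, 50)` gives the median error and `percentile(errors, 90)` gives the 90th percentile error read off the empirical CDF.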

Furthermore, to evaluate the reliability of the proposed scheme under predefined accuracy requirements, the success rate (SR) within an error threshold is additionally considered [26,27,28]. The success rate is defined as the proportion of test samples for which the positioning error does not exceed a given threshold $e_{th}$, i.e.,

$$\mathrm{SR}(e_{th}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}(e_{j} \leq e_{th}), \tag{18}$$

where $\mathbf{1}(\cdot)$ denotes the indicator function. In this paper, multiple thresholds (e.g., 0.5 m, 1 m, and 2 m) are considered to evaluate the performance under different accuracy requirements. This metric provides a practical interpretation of positioning performance by indicating how frequently the proposed scheme satisfies a given accuracy constraint.
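As a small illustration of the success rate in (18), the following sketch counts the samples whose error does not exceed the threshold. The function name and the example error values are hypothetical.

```python
def success_rate(errors, threshold):
    """Fraction of samples with error <= threshold (equality counts as success)."""
    return sum(1 for e in errors if e <= threshold) / len(errors)


# Hypothetical errors in meters, evaluated at the three thresholds used in the text.
example_errors = [0.2, 0.5, 0.9, 1.5, 3.0]
sr_half = success_rate(example_errors, 0.5)
sr_one = success_rate(example_errors, 1.0)
sr_two = success_rate(example_errors, 2.0)
```

Since SR is the empirical CDF evaluated at the threshold, it is non-decreasing in the threshold.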

To evaluate the computational efficiency of the online position estimation stage, the number of steps required until the stopping criterion is satisfied is also recorded. The mean and standard deviation of the step count are used to characterize the convergence speed and stability of the proposed method. This metric is particularly important because the proposed multi-scale action design aims to provide a favorable trade-off between estimation accuracy and search efficiency, in which the average step count serves as the primary indicator of efficiency.

Finally, to assess the applicability of the proposed transformed local environment, the frequency with which the selected four RPs satisfy the convex quadrilateral condition is measured. This quantity is expressed as the convex-valid rate (CVR), defined as the ratio of valid RP configurations to the total number of test samples, i.e.,

$$\mathrm{CVR} = \frac{1}{M} \sum_{j=1}^{M} I_{j}, \tag{19}$$

where $I_{j}$ is an indicator function that takes a value of 1 when the condition is satisfied and 0 otherwise. Since the proposed positioning process is performed only when the selected RPs form a valid convex quadrilateral, this metric indicates how often the proposed framework can be effectively applied in practice.

4.3. Performance Evaluation of Proposed Scheme in Different Communication Infrastructures

This subsection evaluates the performance of the proposed scheme under BLE, Wi-Fi, and Zigbee to verify its robustness under heterogeneous wireless signal conditions. Estimation accuracy is assessed by MAE and RMSE together with the median and 90th percentile errors, while efficiency and applicability are analyzed through the average step count and the CVR.

The main performance metrics of the proposed scheme across the three communication infrastructures are presented in Table 2.

The proposed scheme achieves stable and accurate positioning performance across all three communication environments. The MAE remains below 0.8 m and the 90th percentile error stays within approximately 0.97 m in every environment, while the CVR is consistently above 0.99, confirming the practical applicability of the transformed local environment regardless of the communication infrastructure. Although Wi-Fi yields slightly better accuracy in absolute terms, with the lowest MAE and median error among the three environments, the differences across BLE, Wi-Fi, and Zigbee are at the centimeter level for all accuracy metrics and lie close to or within the corresponding standard deviations. These results indicate that the proposed scheme delivers comparable sub-meter accuracy across diverse wireless infrastructures.

Efficiency, however, differs substantially across environments. Wi-Fi requires only about 36 steps on average, compared with roughly 107 in Zigbee and 110 in BLE, whereas the CVR remains almost constant across all three cases, confirming that the applicability of the transformed local environment is practically independent of the communication setting. The differences in accuracy and step count can be attributed to the path-loss exponent n, which governs the sensitivity of the RSS to distance. The Wi-Fi environment has the largest n, producing more discriminative RSS variations that allow the agent to resolve neighboring positions more clearly. By contrast, the smaller n of BLE yields a weaker spatial gradient and hence larger errors and longer search; Zigbee lies between the two extremes, consistent with its intermediate accuracy and step count. The CDFs of the positioning error for the BLE, Wi-Fi, and Zigbee infrastructures are depicted in Figure 4.
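The role of the path-loss exponent can be made concrete with a short sketch. It assumes the standard log-distance model, RSS(d) = P0 − 10·n·log10(d/d0), as a stand-in for the path-loss model in (1), which is not reproduced here; the reference power, reference distance, and exponent values below are illustrative assumptions, not the parameters of Table 1.

```python
import math


def rss_dbm(distance_m, n, p0_dbm=-40.0, d0_m=1.0):
    """Log-distance path-loss model (assumed form); p0_dbm and d0_m are illustrative."""
    return p0_dbm - 10.0 * n * math.log10(distance_m / d0_m)


def rss_gap_db(d_near_m, d_far_m, n):
    """RSS difference between two distances; the reference power cancels out."""
    return rss_dbm(d_near_m, n) - rss_dbm(d_far_m, n)


# Gap between positions 10 m and 11 m from an RP for two illustrative exponents.
gap_small_n = rss_gap_db(10.0, 11.0, n=2.0)
gap_large_n = rss_gap_db(10.0, 11.0, n=3.0)
```

Under this model the gap equals 10·n·log10(d_far/d_near), so it grows linearly with n: a larger exponent separates neighboring positions by more decibels, which is consistent with the explanation given above.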

The Wi-Fi curve is clearly left-shifted relative to BLE and Zigbee, while all three curves rise steeply, indicating that small estimation errors dominate regardless of the communication infrastructure.

4.4. Ablation Studies

This subsection presents ablation studies on two key components of the proposed framework: the multi-scale action design and the policy-learning algorithm. Each component is modified while the rest is held fixed, and performance is assessed by MAE, RMSE, CDF-based error analysis, the average step count, and the SR.

4.4.1. Comparison Between Multi-Scale and Single-Scale Actions

The multi-scale action design is compared with two single-scale baselines using fixed step sizes of 1 m and 0.1 m. The comparison targets both estimation accuracy (MAE, RMSE, CDF-based error analysis) and practical efficiency (step count, SR). The estimation accuracy, search efficiency, and threshold-based reliability of the single-scale and multi-scale action configurations are shown in Table 3.

The single-scale (1 m) configuration yields the smallest MAE and RMSE and the highest SR(0.5 m), but this accuracy is obtained at the cost of a markedly larger average step count. The proposed multi-scale design accepts a slight increase in MAE and RMSE while reducing the step count by more than half, and still achieves SR(1 m) and SR(2 m) above 0.99. This confirms that multi-scale actions provide a far better accuracy-efficiency trade-off than a coarse fixed step. In contrast, the single-scale (0.1 m) configuration exhibits the largest MAE and RMSE and an almost negligible SR(0.5 m), demonstrating that an excessively fine fixed step prevents the agent from approaching the target region quickly and degrades the overall positioning performance.

A more comprehensive view of the positioning error distribution is displayed in Figure 5, which shows the CDFs of the positioning error for the three action-scale configurations.

The proposed multi-scale design lies between the two single-scale curves in terms of error distribution while outperforming both when the step cost is considered jointly.

The reduction in the average step count enabled by the multi-scale design also has a direct implication for the real-time inference on resource-constrained mobile terminals. The policy network adopted here is a compact fully-connected actor whose input is the six-dimensional state vector and whose output is the 49-dimensional action distribution, with no convolutional or attention layers, so that its per-step computational cost is dominated by a small number of dense matrix-vector multiplications. Recent on-smartphone measurements of lightweight neural networks of comparable or larger scale provide a useful reference for the expected latency. At the per-inference scale, a lightweight convolutional neural network (CNN) for Wi-Fi fingerprinting reports an on-device latency of 198 μs on a Redmi Note 8 [29]. At the per-task scale, a CNN-based audio localization system on Android smartphones reports an on-device latency of 14 ms for single-frame estimation and 180 ms for multi-frame estimation [30]. Under the conservative assumption that the per-step inference latency of the proposed policy network matches the 198 μs reported in [29], the average step count of 110.5 in the BLE setting translates into an estimated per-fix latency of approximately 22 ms. This per-fix latency lies between the single-frame and multi-frame latencies reported in [30], and is therefore consistent with the latency regime of real-time on-smartphone localization systems.
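The latency estimate above is a simple product of the average step count and the assumed per-step latency. The sketch below reproduces that arithmetic; the function name is hypothetical, and the 198 μs default is the figure borrowed from [29] as an assumption, not a measurement of the proposed policy network.

```python
def per_fix_latency_ms(avg_steps, per_step_latency_us=198.0):
    """Estimated latency of one position fix, in milliseconds."""
    return avg_steps * per_step_latency_us / 1000.0


# BLE setting: average of 110.5 steps per fix.
ble_latency_ms = per_fix_latency_ms(110.5)  # 110.5 * 198 us = 21.879 ms
```

Under the same assumption, the estimate scales linearly with the step count, so the roughly 36 steps observed for Wi-Fi would correspond to about 7 ms per fix.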

4.4.2. Comparison Between PPO and DQN

The proposed PPO-based policy learning is compared with a deep Q-network (DQN) [31] baseline under the same MDP and hyperparameter setting. The estimation performance of PPO and DQN under the proposed framework is summarized in Table 4.

PPO significantly outperforms DQN across every metric. PPO attains MAE and RMSE of 0.79 m and 1.19 m, whereas DQN yields 5.66 m and 9.37 m; the 90th percentile error of DQN is more than sixteen times that of PPO, indicating much poorer tail behavior. The average online step count also differs by approximately six times, so PPO not only converges to a significantly more accurate policy but also reaches its estimates with substantially fewer interactions.

4.5. Performance Under Different Reference Point Deployments

This subsection evaluates the robustness of the proposed scheme under different RP deployment configurations, including variations in both spatial arrangement and density. Accuracy is assessed by MAE, RMSE, and percentile-based metrics, while applicability is measured through the CVR.

4.5.1. Performance Under Diamond Deployment

The default rectangular deployment is compared with a diamond RP deployment illustrated in Figure 6.

The main performance metrics of the proposed scheme under rectangular and diamond RP deployment patterns are provided in Table 5.

The rectangular deployment yields substantially better accuracy than the diamond one. MAE increases from 0.79 m to 2.11 m and RMSE from 1.19 m to 4.71 m when switching to diamond. The strongest four RPs in the rectangular layout are more likely to form a geometrically regular neighborhood, which is more favorable for both the transformed coordinate system and the subsequent policy-based search.

The search in the diamond deployment terminates in fewer steps on average, but together with the large error increase, this reduction reflects premature termination rather than faster convergence. The CVR drops from 0.995 in the rectangular case to 0.861 in the diamond case, indicating that the strongest four RPs form a valid convex quadrilateral much less frequently and that the applicability of the transformed local environment is reduced accordingly.

Table 6 disaggregates the positioning performance of the proposed scheme into two disjoint subsets according to the outcome of the convex-validity check: valid-policy cases and fallback cases. In the rectangular deployment, valid-policy cases account for 99.5% of the test samples and achieve an MAE of 0.728 m, whereas the remaining 0.5% of fallback cases exhibit an MAE of 12.786 m. In the diamond deployment, valid-policy cases account for 86.1% of the test samples with an MAE of 0.684 m, while the remaining 13.9% of fallback cases exhibit an MAE of 10.980 m. Notably, the valid-policy MAE in the diamond layout is slightly smaller than in the rectangular layout, indicating that when the convex-validity condition is satisfied, the diamond layout can provide slightly more favorable local geometries for the policy.

Two observations are drawn from this disaggregation. First, in both deployments, the valid-policy subset alone achieves sub-meter MAE, confirming that the proposed positioning scheme provides accurate and reliable performance whenever the convex-validity condition is satisfied. The degradation of the global metrics—most pronounced in the diamond deployment, where the global MAE rises to 2.11 m—is therefore not attributable to the trained policy itself, but rather to the fact that a non-negligible fraction of samples fall into the fallback subset. Second, this observation indicates that the performance of the proposed scheme is strongly affected by the geometric validity of the local RP structure: as long as the four strongest RPs at the user location are guaranteed to form a convex quadrilateral, the proposed scheme is expected to deliver sub-meter accuracy regardless of the global RP deployment pattern. In practical deployments, this condition can be promoted by ensuring sufficient RP density and by avoiding highly anisotropic layouts such as the diamond pattern.
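Because the valid-policy and fallback subsets are disjoint and together cover all test samples, the global MAE is the share-weighted average of the two subset MAEs. The following sketch checks this decomposition against the values quoted above; the function names are hypothetical, and small deviations from the reported global figures are expected from rounding.

```python
def global_mae(valid_fraction, mae_valid, mae_fallback):
    """Share-weighted average of the valid-policy and fallback MAEs."""
    return valid_fraction * mae_valid + (1.0 - valid_fraction) * mae_fallback


def fallback_share_of_error(valid_fraction, mae_valid, mae_fallback):
    """Fraction of the global MAE contributed by the fallback subset."""
    total = global_mae(valid_fraction, mae_valid, mae_fallback)
    return (1.0 - valid_fraction) * mae_fallback / total


# Values quoted in the text for Table 6.
rect_mae = global_mae(0.995, 0.728, 12.786)     # about 0.79 m
diamond_mae = global_mae(0.861, 0.684, 10.980)  # about 2.11 m
diamond_share = fallback_share_of_error(0.861, 0.684, 10.980)
```

With these inputs, the fallback subset accounts for roughly 72% of the global MAE in the diamond deployment, even though it covers only 13.9% of the samples.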

Building on the above observations, a limitation of the proposed scheme should also be acknowledged: its performance can be constrained under highly irregular RP layouts. This limitation arises from two complementary mechanisms. First, irregular RP placements increase the likelihood that the four strongest RPs fail to form a convex quadrilateral, which in turn enlarges the proportion of samples handled by the fallback mechanism; this subset incurs substantially larger positioning errors than the valid-policy subset. Second, even when the convex-validity condition is satisfied, the resulting local RP structures under irregular layouts can be geometrically heterogeneous and may differ noticeably from those encountered during training. This can introduce a distributional shift in the state representation seen by the policy and thereby hinder the policy from learning a single representation that generalizes across these structures. Together, these two mechanisms suggest that, although the proposed scheme delivers reliable sub-meter accuracy when the local RP structure is geometrically regular, additional considerations may be necessary to maintain the same level of performance under highly irregular deployment patterns.

The CDFs of the positioning error under the two deployment patterns are displayed in Figure 7.

The rectangular CDF lies clearly to the left of the diamond CDF, consistent with the much larger MAE and RMSE of the diamond case.

4.5.2. Performance Under Sparse and Dense Deployments

The impact of RP density on positioning performance is investigated by considering sparse, default, and dense deployment scenarios. The RP layouts utilized in the sparse and dense deployment scenarios are displayed in Figure 8.

A comparison is made across sparse, default, and dense deployments utilizing 10, 20, and 40 RPs. Reducing the number of RPs is expected to weaken the RSS-based spatial information, whereas increasing it improves the geometric validity of the local environment. The performance of the proposed scheme under the 10-RP, 20-RP, and 40-RP deployment configurations is provided in Table 7.

As shown in Table 7, the positioning performance improves as the RP density increases. Doubling the RP count from 10 to 20 reduces the MAE by approximately 75% and raises CVR from 0.790 to 0.995, whereas a further doubling to 40 yields only an additional 18% reduction in MAE. These results indicate that once the RP density is sufficient for the strongest four RPs to consistently form a valid convex quadrilateral, the marginal benefit of further densification diminishes rapidly. From a practical deployment perspective, the default 20-RP configuration therefore offers a favorable balance between sub-meter accuracy and the cost of additional RPs, while denser deployments remain attractive for applications requiring shorter search time, and sparse 10-RP deployments are not recommended for general use given the resulting degradation in both accuracy and validity. The increase of SR(0.5 m) from 0.083 to 0.295 with denser deployment further indicates that fine-grained sub-0.5 m precision benefits most from RP densification.

Table 8 disaggregates the positioning performance of the proposed scheme under different RP densities into valid-policy cases and fallback cases. In the sparse 10-RP deployment, valid-policy cases account for 79.0% of the test samples with an MAE of 1.090 m, while the remaining 21.0% of fallback cases exhibit an MAE of 10.920 m. In the rectangular 20-RP deployment, valid-policy cases account for 99.5% of the test samples with an MAE of 0.728 m, while only 0.5% of fallback cases exhibit an MAE of 12.786 m. In the dense 40-RP deployment, all test samples satisfy the convex-validity condition, and the valid-policy subset achieves an MAE of 0.645 m.

Two insights are derived from this disaggregation. First, the proportion of fallback cases varies systematically with the RP density: it decreases from 21.0% under the sparse 10-RP deployment to 0.5% under the rectangular 20-RP deployment, and vanishes entirely under the dense 40-RP deployment. As the fallback subset incurs substantially larger positioning errors than the valid-policy subset, this trend explains a large portion of the global MAE difference observed across the three deployments. Second, the valid-policy MAE itself also varies with the RP density, decreasing from 1.090 m under the sparse deployment to 0.645 m under the dense deployment. This trend indicates that the RP density also influences the performance of the trained policy itself, in addition to controlling the proportion of fallback cases.
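As a consistency check, the global MAE values implied by the subset statistics quoted above, and the relative reductions between densities, can be worked out as follows. The global values are reconstructed from the rounded subset figures, so they may differ slightly from the entries of Table 7.

```latex
\begin{align}
\mathrm{MAE}_{10} &\approx 0.790 \times 1.090 + 0.210 \times 10.920 \approx 3.154~\text{m}, \\
\mathrm{MAE}_{20} &\approx 0.995 \times 0.728 + 0.005 \times 12.786 \approx 0.788~\text{m}, \\
\mathrm{MAE}_{40} &= 0.645~\text{m}, \\
\frac{\mathrm{MAE}_{10} - \mathrm{MAE}_{20}}{\mathrm{MAE}_{10}} &\approx \frac{3.154 - 0.788}{3.154} \approx 75.0\%, \\
\frac{\mathrm{MAE}_{20} - \mathrm{MAE}_{40}}{\mathrm{MAE}_{20}} &\approx \frac{0.788 - 0.645}{0.788} \approx 18.1\%.
\end{align}
```

In the 10-RP case, the fallback term (about 2.29 m) is more than twice the valid-policy term (about 0.86 m), which illustrates why the proportion of fallback cases explains a large portion of the global MAE difference.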

The variation of the valid-policy MAE with the RP density can be attributed to the fact that a change in RP density inevitably alters the local geometry observed by the policy. As the RP density decreases, the four strongest RPs at a given user location are more widely separated, and the resulting local RP structures span larger and more heterogeneous regions than those encountered under the denser deployments used during training. This induces a distributional shift in the state representation seen by the policy, which is consistent with the geometry-sensitivity mechanism already identified in Section 4.5.1 for the diamond layout. Conversely, as the RP density increases, the local structures become more compact and more similar to the training distribution, and the valid-policy MAE accordingly improves. The analysis in Table 6 and Table 8 indicates that the performance of the proposed scheme is governed by the local RP geometry along two complementary dimensions as follows: the proportion of geometrically valid configurations and the geometric similarity between the valid local structures and those encountered during training. Both factors are jointly influenced by the RP density and the global deployment layout.

4.6. Comparison with Existing Methods

The proposed scheme is compared with DQN-based [11], DL-based [32], zone-selection [7], and multi-lateration methods [7] under identical simulation conditions. Performance is reported in terms of MAE, RMSE, median and 90th percentile errors, SR, and, for search-based methods, the average step count. The performance of the proposed and existing positioning methods is compared in Table 9.

The proposed method achieves the best performance across every metric. Its median error is at least an order of magnitude smaller than those of the compared baselines, and its MAE and RMSE are several times lower than the DL-based, zone-selection, and multi-lateration methods, and substantially lower than that of the DQN-based method. Among the search-based methods, the proposed scheme requires an average of 110.5 steps, compared to 120.0 for DQN, yet attains dramatically lower estimation errors, indicating a much more target-directed policy.

The performance of the DQN- and DL-based methods may be affected by their convergence behavior under the adopted training setting. Since the same general hyperparameter setting described in Section 4.1 is applied to all methods to ensure a direct comparison, these two baselines may be insufficiently converged, and their large residual errors may partly reflect incomplete training rather than purely intrinsic limitations. Both the proposed scheme and the DQN-based method are implemented with fully-connected networks of the same capacity.

The main hyperparameters of each baseline are summarized as follows. For the DQN-based method, the principal hyperparameters include the neural-network capacity, the grid resolution, the training episodes and steps, the learning rate, the discount factor, the optimizer, the $\epsilon$-greedy schedule, the replay-memory size, the replay-memory start size, the target-network update period, and the minibatch size. Among these, the training episodes and steps, the learning rate, the discount factor, the replay-memory size, the target-network update period, and the $\epsilon$-greedy schedule are expected to have the largest influence on DQN performance, through their roles in determining the state-space coverage achievable during training, the stability of the Q-value update, the long-horizon credit assignment, the stability of the temporal-difference-bootstrapped target, and the exploration-exploitation balance.

For the DL-based method, the principal hyperparameters include the neural-network capacity, the grid resolution of the fingerprint database, the training epochs, the optimizer, the learning rate, and the minibatch size. Among these, the grid resolution, the neural-network capacity, the learning rate, and the training epochs are expected to have the largest influence, with the grid resolution setting a fundamental upper bound on the achievable accuracy and the remaining three jointly determining the convergence point of the regression network.

The values reported in [11,32] are adopted directly where they are explicitly specified, while the remaining values follow the same general setting described in Section 4.1. It should also be noted that the DQN-based method reported here follows the action space and reward structure of [11] rather than those of the proposed scheme. The suitability of DQN for the action space and reward structure of the proposed scheme is therefore evaluated separately in the ablation study of Section 4.4.2, where PPO and DQN are compared under the same MDP. Under this controlled comparison, PPO is shown to be better matched to this MDP than DQN, which is consistent with the observation that the proposed scheme converges to a stable and accurate policy under the same training regime. Hence, the comparison also highlights the practical trainability of the proposed scheme, and the choice of PPO is a deliberate algorithmic choice matched to its action space and reward structure rather than an artifact of insufficient DQN tuning.

The CDFs of the positioning error for the proposed and existing positioning methods are depicted in Figure 9.

The proposed method exhibits the most left-shifted curve, while the baselines lie distinctly to the right, particularly the DQN- and DL-based methods with their broad right tails.
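As a minimal sketch of how the curves in such a CDF plot relate to the tabulated metrics, the following Python fragment evaluates the empirical CDF at the 1 m and 2 m thresholds and extracts a nearest-rank 90th percentile. The error values are hypothetical and are not taken from the simulations; treating SR($\tau$) as the fraction of test samples with an error of at most $\tau$ and using the nearest-rank percentile are assumptions made for illustration only.

```python
import math
import statistics
from bisect import bisect_right


def empirical_cdf(errors, threshold):
    """Fraction of positioning errors that are <= threshold (metres)."""
    ordered = sorted(errors)
    return bisect_right(ordered, threshold) / len(ordered)


def percentile(errors, q):
    """Nearest-rank percentile of the errors, with q given in percent."""
    ordered = sorted(errors)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-sample errors in metres (illustrative, not paper data).
errors = [0.4, 0.7, 0.8, 0.9, 0.95, 1.6, 12.0, 0.6, 0.75, 0.85]

sr_1m = empirical_cdf(errors, 1.0)   # assumed meaning of SR(1 m)
sr_2m = empirical_cdf(errors, 2.0)   # assumed meaning of SR(2 m)
p90 = percentile(errors, 90)
mae = statistics.fmean(errors)
median = statistics.median(errors)

print(f"SR(1 m)={sr_1m:.2f}  SR(2 m)={sr_2m:.2f}  P90={p90:.2f} m")
print(f"MAE={mae:.3f} m  median={median:.3f} m")
```

In this toy sample, a single large outlier pulls the mean well above the median, which is the kind of right-tail effect that separates the MAE from the median error in a heavy-tailed error distribution.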

A key advantage of the proposed framework is that it requires substantially fewer labeled location samples than DL-based fingerprinting methods. Considering a 125 m × 65 m indoor area, the DL-based method [32] requires a dense fingerprint database, with each fingerprint corresponding to a labeled $(x, y)$ location. Under a typical 1 m grid spacing, this corresponds to approximately 8125 labeled samples; even with a coarser 2 m grid, approximately 2050 samples are still required. By contrast, the proposed scheme uses 300 training episodes, each requiring one labeled target position to compute the APF-based reward, yielding only 300 labeled samples in total. This represents a reduction of approximately 96.3% in labeled-location supervision relative to the 1 m grid baseline (300 versus 8125 samples). The reduction is enabled by the fact that supervision in the proposed framework is injected through the reward signal rather than through dense labeled fingerprints, so that a sparse coverage of target locations is sufficient to shape a target-directed policy.
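The label-count arithmetic above can be checked with a short calculation. The sketch below assumes one labeled fingerprint per grid cell, so the count is the area divided by the squared spacing; the exact figure for a given grid depends on how boundary points are counted, which the text does not specify, so the 2 m value obtained here differs slightly from the rounded figure quoted above. The helper name `grid_labels` is illustrative.

```python
AREA_WIDTH_M = 125    # indoor area width from the text
AREA_HEIGHT_M = 65    # indoor area height from the text
EPISODES = 300        # offline training episodes, one labeled target each


def grid_labels(width, height, spacing):
    """Labeled fingerprints needed, assuming one per spacing x spacing cell."""
    return (width * height) / spacing ** 2


labels_1m = grid_labels(AREA_WIDTH_M, AREA_HEIGHT_M, 1)
labels_2m = grid_labels(AREA_WIDTH_M, AREA_HEIGHT_M, 2)

# Relative reduction in labeled locations with respect to each grid baseline.
reduction_1m = 1 - EPISODES / labels_1m
reduction_2m = 1 - EPISODES / labels_2m

print(f"1 m grid: {labels_1m:.0f} labels, reduction {100 * reduction_1m:.1f}%")
print(f"2 m grid: {labels_2m:.0f} labels, reduction {100 * reduction_2m:.1f}%")
```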

4.7. Adaptability Across Different Indoor Communication Infrastructures

This subsection evaluates the adaptability of the proposed scheme across different communication infrastructures. The model is trained under a representative path-loss setting ($n = 1.6$, $b = -30.0$) and directly tested, without retraining, in BLE, Wi-Fi, and Zigbee environments. This setting represents an idealized indoor line-of-sight (LOS)-like condition [33]. Path-loss exponents below 2 are commonly observed in corridor-like LOS environments due to waveguide-type reflections, and a relatively strong reference RSS $b$ represents a short-distance anchor power. The training setting, therefore, does not correspond to a specific communication system but to a representative condition from which the learned policy is expected to transfer.

The performance of the proposed scheme in the BLE, Wi-Fi, and Zigbee environments is shown in Table 10.

The proposed scheme maintains stable performance across all three environments despite training only under the representative setting. BLE and Wi-Fi yield nearly identical results, indicating consistently strong performance without environment-specific retraining. The threshold-based reliability is also almost identical across environments, with SR(1 m) and SR(2 m) close to 0.994 in every case. The proposed framework, therefore, maintains robust and comparable reliability even when the signal propagation differs across communication infrastructures.

This consistent performance can be attributed to the anchor-based environment construction. Because the state and search are defined in a local coordinate system anchored at the strongest RP, the learned policy focuses on the relative spatial structure implied by RSS rather than on a specific absolute deployment or propagation setting. The CDFs of the positioning error for the three communication infrastructures are presented in Figure 10.

As shown in Figure 10, the BLE and Wi-Fi curves almost overlap, while the Zigbee curve is only slightly shifted to the right, and all three curves rise steeply, confirming the cross-infrastructure robustness of the proposed framework.
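The anchor-based environment construction credited above for this robustness can be sketched as follows. The fragment selects the four strongest RPs and expresses their coordinates relative to the strongest one. It assumes that the transformation is a pure translation; whether the actual construction also rotates or rescales the local frame is not stated in this section. The function name `build_anchor_frame`, the RP coordinates, and the RSS readings are hypothetical.

```python
def build_anchor_frame(rp_positions, rss, k=4):
    """Pick the k strongest RPs and shift them so the strongest is the origin."""
    strongest = sorted(range(len(rss)), key=lambda i: rss[i], reverse=True)[:k]
    anchor_x, anchor_y = rp_positions[strongest[0]]
    local = [
        (rp_positions[i][0] - anchor_x, rp_positions[i][1] - anchor_y)
        for i in strongest
    ]
    return strongest, local


# Hypothetical RP coordinates (m) and RSS readings (dBm), for illustration.
rps = [(0, 0), (25, 0), (0, 20), (25, 20), (50, 0), (50, 20)]
rss = [-80.0, -62.0, -85.0, -66.0, -70.0, -73.0]

indices, local = build_anchor_frame(rps, rss)

# Moving the whole deployment does not change the local frame.
shifted = [(x + 100, y + 40) for x, y in rps]
_, local_shifted = build_anchor_frame(shifted, rss)

print(indices, local, local == local_shifted)
```

Because the local frame is unchanged when the whole deployment is shifted, a policy defined on it cannot depend on the absolute position of the RPs, which is in line with the explanation given for the cross-infrastructure results.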

4.8. Performance Under Realistic Indoor Propagation Effects

This subsection evaluates the robustness of the proposed scheme under realistic indoor propagation effects that are not captured by the log-distance path-loss model in (1). Real indoor RSS is influenced by various effects such as shadowing, multi-path fading, non-line-of-sight (NLOS) propagation, device orientation, body blockage, temporal variation, and other environmental factors. Among these factors, shadowing is known to be a dominant large-scale stochastic component in time-averaged RSS measurements [34]. This is because small-scale multi-path fluctuations are mitigated by RSS averaging, while NLOS attenuation can be partially reflected in the shadow-fading variance. Therefore, the path-loss model is expanded to

$$\mathrm{RSS} = b - 10 n \log_{10} d + X_{\sigma}, \quad X_{\sigma} \sim \mathcal{N}(0, \sigma^{2}), \tag{20}$$

where $\sigma = 2$ dB represents mild shadowing typical of open-plan indoor environments and $\sigma = 4$ to 6 dB represents moderate to strong shadowing characteristic of office and industrial indoor environments [33]. All other simulation parameters are kept identical to those of Section 4.1.
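A minimal sketch of how RSS samples could be drawn from the shadow-fading model in (20) is given below. It uses the BLE parameters of Table 1 and assumes that the distance $d$ is expressed in metres with a 1 m reference distance, which is not stated explicitly in this section. The function name `rss_dbm`, the 10 m test distance, the sample size, and the random seed are illustrative choices.

```python
import math
import random
import statistics


def rss_dbm(d, n, b, sigma=0.0, rng=random):
    """Log-distance RSS with zero-mean Gaussian shadowing of std sigma (dB)."""
    shadow = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return b - 10.0 * n * math.log10(d) + shadow


N_BLE, B_BLE = 1.89, -64.29          # BLE path-loss parameters from Table 1
rng = random.Random(0)               # fixed seed for repeatability

# Noise-free RSS at an assumed distance of 10 m.
clean = rss_dbm(10.0, N_BLE, B_BLE)

# Shadowed samples at the same distance with sigma = 4 dB.
samples = [rss_dbm(10.0, N_BLE, B_BLE, sigma=4.0, rng=rng) for _ in range(20000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)

print(f"clean={clean:.2f}  mean={sample_mean:.2f}  std={sample_std:.2f}")
```

The shadowing term leaves the mean RSS at a given distance unchanged and only spreads the individual readings around it, which is why its effect is felt mainly through per-sample perturbations.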

Table 11 summarizes the performance of the proposed scheme under each shadow-fading level.

Under mild shadowing ($\sigma = 2$ dB), the proposed scheme retains sub-meter accuracy with an MAE of 0.949 m and SR(1 m) above 0.97, while the CVR remains above 0.97. Under moderate shadowing ($\sigma = 4$ dB), the MAE increases to 1.956 m and the CVR drops to 0.911, indicating a noticeable but still moderate degradation. Under strong shadowing ($\sigma = 6$ dB), the MAE further increases to 3.235 m and the CVR drops to 0.861. The robustness observed up to $\sigma = 2$ dB can be attributed to two mechanisms of the proposed scheme. First, the dense and continuous APF-based reward signal stabilizes the learned policy against per-sample RSS perturbations. Second, the anchor-based environment construction operates on the relative ranking of the four strongest RPs, which is comparatively less sensitive to additive RSS variations than the absolute RSS values themselves. The degradation observed at larger $\sigma$ values primarily reflects the increasing probability that the shadow-fading perturbation alters this relative ranking, which in turn reduces the CVR and shifts a larger fraction of test samples into the fallback subset. The step count rises from 110.50 at $\sigma = 0$ to 155.10 at $\sigma = 2$ dB and then decreases to 136.61 and 112.40 at $\sigma = 4$ and 6 dB; this decrease reflects increased premature termination of episodes rather than faster convergence, consistent with the simultaneous MAE increase.
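The ranking argument above can be illustrated with a small Monte Carlo experiment. The sketch below estimates how often shadowing changes the set of the four strongest RPs relative to the noise-free case. It is only a proxy: it does not compute the CVR, and the regular 5 × 4 RP grid, the 0.1 m distance floor, the number of trials, and the function names are assumptions made for illustration rather than the deployment of Figure 3.

```python
import math
import random

N_BLE, B_BLE = 1.89, -64.29   # BLE path-loss parameters from Table 1

# Hypothetical layout: 20 RPs on a regular 5 x 4 grid in a 125 m x 65 m area.
RPS = [(12.5 + 25.0 * i, 8.125 + 16.25 * j) for i in range(5) for j in range(4)]


def clean_rss(distance):
    """Noise-free log-distance RSS; distance is floored at 0.1 m (assumption)."""
    return B_BLE - 10.0 * N_BLE * math.log10(max(distance, 0.1))


def top4(values):
    """Index set of the four largest values."""
    order = sorted(range(len(values)), key=values.__getitem__, reverse=True)
    return frozenset(order[:4])


def flip_rate(sigma, trials=2000, seed=0):
    """Fraction of random targets whose four-strongest-RP set changes."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        x, y = rng.uniform(0, 125), rng.uniform(0, 65)
        clean = [clean_rss(math.hypot(x - px, y - py)) for px, py in RPS]
        noisy = [value + rng.gauss(0.0, sigma) for value in clean]
        if top4(clean) != top4(noisy):
            flips += 1
    return flips / trials


rates = {sigma: flip_rate(sigma) for sigma in (0.0, 2.0, 4.0, 6.0)}
for sigma, rate in rates.items():
    print(f"sigma={sigma:.0f} dB: selected RP set changed in {100 * rate:.1f}% of trials")
```

Since the same seed is reused for every $\sigma$, the target positions and the underlying noise draws are shared across levels, so differences between the reported rates come from the noise scale alone.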

5. Conclusions

In this paper, an SDRL-based positioning scheme has been proposed for RSS-based indoor localization. To address the limitations of existing DL- and DRL-based approaches, the positioning problem was formulated as an MDP, and a target-aware reward design based on the APF was introduced to provide dense and informative learning signals. In addition, a multi-scale action design was employed to overcome the limitation of fixed-resolution search, while an anchor-based environment construction strategy was developed by selecting the four strongest RPs and transforming their coordinates with respect to the strongest RP.

Extensive simulation results demonstrate the effectiveness and robustness of the proposed scheme across different wireless environments and evaluation scenarios. Under the default BLE setting, the proposed scheme achieves an MAE of 0.788 m, an SR(1 m) of 99.1%, an SR(2 m) of 99.2%, and a CVR exceeding 99.5%, while also maintaining stable positioning performance across Wi-Fi and Zigbee environments. The ablation studies further verify the effectiveness of the multi-scale action design and the advantage of PPO over DQN; in particular, the multi-scale actions reduce the average number of search steps by approximately 69.5% relative to the single-scale (1 m) baseline while maintaining SR(1 m) and SR(2 m) above 99%. In addition, the simulations with different RP deployments show that the performance of the proposed framework is closely related to the geometric validity of the local RP structure. Finally, the proposed scheme reduces the MAE by approximately 86.1% compared with DQN trained under the same MDP (Table 4) and by approximately 90.8%, 96.0%, and 96.6% compared with the DL-, zone-selection-, and multi-lateration-based methods (Table 9), respectively, confirming that it outperforms existing positioning methods in terms of both estimation accuracy and search efficiency.
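The relative reductions quoted above can be recomputed from the rounded table entries, as in the following sketch. The baseline labels and the helper name `reduction` are illustrative. Two observations follow from the arithmetic: the 86.1% figure for DQN is reproduced by the DQN row of Table 4 (same MDP as the proposed scheme), whereas the DQN-based row of Table 9 corresponds to a reduction of about 98.9%; and values recomputed from rounded entries may differ from the quoted ones in the last digit.

```python
def reduction(proposed, baseline):
    """Relative reduction (%) of `proposed` with respect to `baseline`."""
    return 100.0 * (1.0 - proposed / baseline)


MAE_PROPOSED = 0.788  # proposed scheme, BLE (Tables 4 and 9)

# Baseline MAE values in metres, copied from the tables.
mae_baselines = {
    "DQN, same MDP (Table 4)": 5.657,
    "DQN-based (Table 9)": 72.571,
    "DL-based (Table 9)": 8.594,
    "Zone-selection (Table 9)": 19.502,
    "Multi-lateration (Table 9)": 22.838,
}

mae_reduction = {
    name: reduction(MAE_PROPOSED, value) for name, value in mae_baselines.items()
}

# Search-step reduction: multi-scale versus single-scale (1 m), Table 3.
step_reduction = reduction(110.50, 361.85)

for name, value in mae_reduction.items():
    print(f"{name}: MAE reduced by {value:.1f}%")
print(f"Average search steps reduced by {step_reduction:.1f}%")
```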

The cross-infrastructure experiment also demonstrated that the model trained under a representative path-loss setting can sustain consistent performance across diverse communication environments without additional retraining. The additional evaluation under shadow fading confirmed that the framework retains sub-meter accuracy under mild shadow-fading conditions, with multipath and NLOS effects partially absorbed into the shadow-fading variance under standard indoor propagation models. These results confirm that the proposed framework provides an efficient, resilient, and adaptable solution for RSS-based indoor localization. However, although the proposed scheme reliably achieves sub-meter average accuracy, it does not consistently achieve sub-0.5 m precision and is therefore not intended for applications requiring centimeter-level localization. Future work will aim to extend the proposed method to dynamic environments, incorporate additional sensing modalities to achieve sub-0.5 m precision, investigate alternative mechanisms for handling non-convex RP geometries, and validate the framework through real-world experiments considering various propagation effects.

Author Contributions

Conceptualization, Y.S.; methodology, Y.S., K.K. and S.L.; software, Y.S. and K.K.; validation, Y.S. and J.S.; formal analysis, Y.S. and S.L.; investigation, Y.S. and S.K.; resources, Y.S. and J.S.; data curation, Y.S. and S.L.; writing—original draft preparation, Y.S.; writing—review and editing, Y.S., K.K., S.L., J.S., S.K. and J.K.; visualization, K.K. and S.K.; supervision, J.K.; project administration, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the authors.

Acknowledgments

This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) (RS-2025-23524307) and supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2026-RS-2023-00258639) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

APF	Artificial Potential Field
BLE	Bluetooth Low Energy
CDF	Cumulative Distribution Function
CVR	Convex-Valid Rate
DL	Deep Learning
DNN	Deep Neural Network
DQN	Deep Q-Network
DRL	Deep Reinforcement Learning
GF	Gravitational Force
IoT	Internet of Things
LOS	Line of Sight
MAE	Mean Absolute Error
MDP	Markov Decision Process
P90	90th Percentile Error
PPO	Proximal Policy Optimization
RF	Repulsive Force
RL	Reinforcement Learning
RMSE	Root Mean Squared Error
RP	Reference Point
RSS	Received Signal Strength
SDRL	Supervised Deep Reinforcement Learning
SR	Success Rate

References

1. Langlois, C.; Tiku, S.; Pasricha, S. Indoor localization with smartphones: Harnessing the sensor suite in your pocket. IEEE Consum. Electron. Mag. 2017, 6, 70–80.
2. Hayward, S.J.; van Lopik, K.; Hinde, C.; West, A.A. A survey of indoor location technologies techniques and applications in industry. Internet Things 2022, 20, 100608.
3. Gufran, D.; Tiku, S.; Pasricha, S. STELLAR: Siamese multiheaded attention neural networks for overcoming temporal variations and device heterogeneity with indoor localization. IEEE J. Indoor Seamless Position. Navig. 2023, 1, 115–129.
4. Singh, J.; Tyagi, N.; Singh, S.; Ali, F.; Kwak, D. A systematic review of contemporary indoor positioning systems: Taxonomy, techniques, and algorithms. IEEE Internet Things J. 2024, 11, 34717–34733.
5. Zhang, L.; Wu, S.; Zhang, T.; Zhang, Q. Learning to locate: Adaptive fingerprint-based localization with few-shot relation learning in dynamic indoor environments. IEEE Trans. Wirel. Commun. 2023, 22, 5253–5264.
6. Singh, N.; Choe, S.; Punmiya, R. Machine learning based indoor localization using Wi-Fi RSSI fingerprints: An overview. IEEE Access 2021, 9, 127150–127174.
7. Booranawong, A.; Sengchuai, K.; Buranapanichkit, D.; Jindapetch, N.; Saito, H. RSSI-based indoor localization using multi-lateration with zone selection and virtual position-based compensation methods. IEEE Access 2021, 9, 46223–46239.
8. He, S.; Chan, S.-H.G. Wi-Fi fingerprint-based indoor positioning: Recent advances and comparisons. IEEE Commun. Surv. Tutor. 2016, 18, 466–490.
9. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
10. Mohammadi, M.; Al-Fuqaha, A.; Guizani, M.; Oh, J.-S. Semisupervised deep reinforcement learning in support of IoT and smart city services. IEEE Internet Things J. 2018, 5, 624–635.
11. Li, Y.; Hu, X.; Zhuang, Y.; Gao, Z.; Zhang, P.; El-Sheimy, N. Deep reinforcement learning (DRL): Another perspective for unsupervised wireless localization. IEEE Internet Things J. 2020, 7, 6279–6287.
12. Dou, F.; Lu, J.; Xu, T.; Huang, C.-H.; Bi, J. A bisection reinforcement learning approach to 3-D indoor localization. IEEE Internet Things J. 2021, 8, 6519–6535.
13. Sun, Y.G.; Kim, S.H.; Kim, D.I.; Kim, J.Y. Area-selective deep reinforcement learning scheme for wireless localization. IEICE Trans. Fundam. 2025, E108-A, 883–887.
14. Zafari, F.; Gkelias, A.; Leung, K.K. A survey of indoor localization systems and technologies. IEEE Commun. Surv. Tutor. 2019, 21, 2568–2599.
15. Cheng, Y.; Zhang, L. MAE-Based Radio Map Construction for Wi-Fi Fingerprint Indoor Localization. IEEE Commun. Lett. 2025, 29, 2008–2012.
16. Yan, W.; Yin, F.; Gao, J.; Wang, A.; Tian, Y.; Chen, R. Attentional Graph Meta-Learning for Indoor Localization Using Extremely Sparse Fingerprints. IEEE Trans. Mob. Comput. 2026, 25, 8718–8734.
17. Correia, A.; Petkov, N.; Pereira, G. A survey of demonstration learning. Robot. Auton. Syst. 2024, 176, 104673.
18. Hu, Y.; Wang, W.; Jia, H.; Wang, Y.; Chen, Y.; Hao, J.; Wu, F.; Fan, C. Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020.
19. Tang, Y.; Chen, H.; Ma, Z.; Jin, Z.; Yin, H. Application of artificial potential field method in three-dimensional path planning for UAV considering 5G communication. IEEE Access 2024, 12, 79238–79250.
20. Lee, S.; Seon, J.; Sun, Y.G.; Kim, S.H.; Kyeong, C.; Kim, D.I.; Kim, J.Y. Novel architecture of energy management systems based on deep reinforcement learning in microgrid. IEEE Trans. Smart Grid 2024, 15, 1646–1658.
21. Yang, S.; Wang, B. Residual based weighted least square algorithm for Bluetooth/UWB indoor localization system. In Proceedings of the 2017 36th Chinese Control Conference; IEEE: New York, NY, USA, 2017; pp. 5959–5963.
22. Laoudias, C.; Kolios, P.; Panayiotou, C. Differential signal strength fingerprinting revisited. In Proceedings of the 2014 International Conference on Indoor Positioning and Indoor Navigation; IEEE: New York, NY, USA, 2014; pp. 30–37.
23. Wu, Z.; Hu, P.; Liu, S.; Pang, T. Attention mechanism and LSTM network for fingerprint indoor localization in complex indoor environments. Sensors 2024, 24, 1398.
24. Martín-Frechina, S.; Dura, E.; Miralles, I.; Torres-Sospedra, J. A systematic review of Wi-Fi and BLE-based indoor localization systems. Sensors 2025, 25, 6946.
25. Li, J.; Guo, X.; Han, Z.; Wang, H.; Cao, T.; Yu, L. Indoor localization method based on regional division with IFCM. Electronics 2019, 8, 559.
26. Ayinla, S.L.; Abd Aziz, A.; Drieberg, M.; Susanto, M.; Laouiti, A. RELoc: An enhanced 3D WiFi fingerprinting indoor localization framework. Sensors 2026, 26, 326.
27. Ioannou, I.; Vassiliou, V.; Raspopoulos, M. Adaptive multi-stage hybrid localization for RIS-aided 6G indoor networks. Sensors 2026, 26, 1084.
28. Schmidt, S.O.; Cimdins, M.; John, F.; Hellbrück, H. SALOS—A UWB Single-Anchor Indoor Localization System. Sensors 2024, 24, 2428.
29. Kargar-Barzi, A.; Farahmand, E.; Taheri Chatrudi, N.; Mahani, A.; Shafique, M. An edge-based WiFi fingerprinting indoor localization using convolutional neural network and convolutional auto-encoder. IEEE Access 2024, 12, 85050–85060.
30. Küçük, A.; Ganguly, A.; Hao, Y.; Panahi, I.M.S. Real-time convolutional neural network-based speech source localization on smartphone. IEEE Access 2019, 7, 169969–169978.
31. Lee, D.; Sun, Y.G.; Kim, S.H.; Sim, I.; Hwang, Y.M.; Shin, Y.; Kim, D.I.; Kim, J.Y. DQN-Based adaptive modulation scheme over wireless communication channels. IEEE Commun. Lett. 2020, 24, 1289–1293.
32. Zhang, W.; Liu, K.; Zhang, W.; Zhang, Y.; Gu, J. Deep neural networks for wireless localization in indoor and outdoor environments. Neurocomputing 2016, 194, 279–287.
33. International Telecommunication Union Radiocommunication Sector (ITU-R). Recommendation ITU-R P.1238-13: Propagation Data and Prediction Methods for the Planning of Indoor Radiocommunication Systems and Radio Local Area Networks in the Frequency Range from 300 MHz to 450 GHz; ITU: Geneva, Switzerland, 2025.
34. Rappaport, T.S. Wireless Communications: Principles and Practice, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2002.

Figure 1. Overall structure of the proposed SDRL framework.

Figure 2. Block diagram of the proposed positioning scheme.

Figure 3. Layout of the simulation environment with 20 RPs.

Figure 4. CDF of the positioning error under different communication infrastructures: (a) BLE, (b) Wi-Fi, (c) Zigbee.

Figure 5. CDF of the positioning error under different action-scale configurations.

Figure 6. Layout of the simulation environment with diamond RP deployment.

Figure 7. CDF of the positioning error under different RP deployment patterns.

Figure 8. Layout of the simulation environment under sparse and dense RP deployments: (a) 10 RPs, (b) 40 RPs.

Figure 9. CDF of the positioning error for the proposed and existing methods: DQN-based method [11], DL-based method [32], and zone-selection and multi-lateration methods [7].

Figure 10. CDF of the positioning error under the representative path-loss setting across different communication infrastructures: (a) BLE, (b) Wi-Fi, (c) Zigbee.

Table 1. Simulation parameters.

Parameter	Value
Indoor area size	125 m × 65 m
Number of RPs	20
Communication infrastructure (path-loss model parameters)	BLE ( $n = 1.89$ , $b = - 64.29$ ), Wi-Fi ( $n = 3.9$ , $b = - 35.0$ ), Zigbee ( $n = 2.53$ , $b = - 14.05$ )
Number of episodes (offline/online)	300/1000
Number of steps (offline/online)	5000/1000
PPO coefficients $c_{1}$ , $c_{2}$	0.5, 0.01
Actor learning rate	0.0001
Critic learning rate	0.0001
Clip range	0.05
Discount factor	0.99
Fully-connected layers (actor/critic)	3
Hidden layer neurons	128, 128, 128
Optimizer	Adam

Table 2. Performance of the proposed scheme under different communication infrastructures.

Environment	MAE (m)	RMSE (m)	Median Error (m)	P90 Error (m)	Steps	CVR
BLE	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$0.776 \pm 0.008$	$0.972 \pm 0.003$	$110.50 \pm 3.50$	$0.995 \pm 0.002$
Wi-Fi	$0.712 \pm 0.031$	$1.155 \pm 0.181$	$0.695 \pm 0.012$	$0.958 \pm 0.004$	$36.428 \pm 0.567$	$0.995 \pm 0.002$
Zigbee	$0.787 \pm 0.027$	$1.201 \pm 0.155$	$0.771 \pm 0.009$	$0.969 \pm 0.003$	$106.694 \pm 2.160$	$0.994 \pm 0.002$

Table 3. Performance comparison under different action-scale configurations in BLE environment.

Method	MAE (m)	RMSE (m)	Steps	SR(0.5 m)	SR(1 m)	SR(2 m)
Multi-scale	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$110.50 \pm 3.50$	$0.177 \pm 0.012$	$0.991 \pm 0.003$	$0.992 \pm 0.003$
Single-scale (1 m)	$0.711 \pm 0.029$	$1.148 \pm 0.173$	$361.85 \pm 9.39$	$0.319 \pm 0.013$	$0.932 \pm 0.008$	$0.991 \pm 0.002$
Single-scale (0.1 m)	$1.017 \pm 0.031$	$1.338 \pm 0.166$	$191.77 \pm 2.25$	$0.001 \pm 0.001$	$0.994 \pm 0.002$	$0.994 \pm 0.002$

Table 4. Performance comparison between PPO- and DQN-based methods in BLE environment.

Method	MAE (m)	RMSE (m)	Median Error (m)	P90 Error (m)	Steps	SR(0.5 m)	SR(1 m)	SR(2 m)
PPO	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$0.776 \pm 0.008$	$0.972 \pm 0.003$	$110.50 \pm 3.50$	$0.177 \pm 0.012$	$0.991 \pm 0.002$	$0.992 \pm 0.003$
DQN	$5.657 \pm 0.216$	$9.372 \pm 0.382$	$2.635 \pm 0.193$	$16.180 \pm 1.229$	$601.21 \pm 12.94$	$0.121 \pm 0.008$	$0.417 \pm 0.013$	$0.451 \pm 0.014$

Table 5. Performance of the proposed scheme under different RP deployment patterns in BLE environment.

Deployment	MAE (m)	RMSE (m)	Steps	CVR
Rectangular	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$110.50 \pm 3.50$	$0.995 \pm 0.002$
Diamond	$2.111 \pm 0.129$	$4.710 \pm 0.253$	$59.42 \pm 1.62$	$0.861 \pm 0.011$

Table 6. Performance of the valid-policy and fallback subsets in the different RP deployments.

Deployment	Subset	Share (%)	MAE (m)	RMSE (m)
Rectangular	Valid-policy	99.5	$0.728 \pm 0.009$	$0.771 \pm 0.014$
Rectangular	Fallback	0.5	$12.786 \pm 0.900$	$12.907 \pm 0.888$
Diamond	Valid-policy	86.1	$0.684 \pm 0.008$	$0.725 \pm 0.007$
Diamond	Fallback	13.9	$10.980 \pm 0.454$	$12.530 \pm 0.486$

Table 7. Performance comparison across different RP density configurations in BLE environment.

RP Density	MAE (m)	RMSE (m)	Median Error (m)	P90 Error (m)	Steps	CVR	SR(0.5 m)	SR(1 m)	SR(2 m)
10 RPs	$3.153 \pm 0.142$	$5.486 \pm 0.205$	$0.947 \pm 0.008$	$11.614 \pm 0.601$	$414.14 \pm 11.50$	$0.790 \pm 0.012$	$0.083 \pm 0.007$	$0.638 \pm 0.015$	$0.683 \pm 0.015$
20 RPs (default)	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$0.776 \pm 0.009$	$0.972 \pm 0.003$	$110.50 \pm 3.50$	$0.995 \pm 0.002$	$0.177 \pm 0.012$	$0.991 \pm 0.002$	$0.992 \pm 0.003$
40 RPs	$0.645 \pm 0.007$	$0.694 \pm 0.006$	$0.684 \pm 0.012$	$0.955 \pm 0.004$	$31.44 \pm 0.46$	$1.000 \pm 0.000$	$0.295 \pm 0.014$	$1.000 \pm 0.000$	$1.000 \pm 0.000$

Table 8. Performance of the valid-policy and fallback subsets in the different RP density configurations.

RP Density	Subset	Share (%)	MAE (m)	RMSE (m)
10 RPs	Valid-policy	79.0	$1.090 \pm 0.029$	$1.325 \pm 0.040$
10 RPs	Fallback	21.0	$10.920 \pm 0.303$	$11.699 \pm 0.281$
20 RPs (default)	Valid-policy	99.5	$0.728 \pm 0.009$	$0.771 \pm 0.014$
20 RPs (default)	Fallback	0.5	$12.786 \pm 0.900$	$12.907 \pm 0.888$
40 RPs	Valid-policy	100.0	$0.645 \pm 0.007$	$0.694 \pm 0.006$
40 RPs	Fallback	0.0	–	–

Table 9. Performance comparison between the proposed and existing positioning methods in BLE environment.

Method	MAE (m)	RMSE (m)	Median Error (m)	P90 Error (m)	Steps	SR(1 m)	SR(2 m)
Proposed	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$0.776 \pm 0.008$	$0.972 \pm 0.003$	$110.50 \pm 3.50$	$0.991 \pm 0.002$	$0.992 \pm 0.003$
DQN-based	$72.571 \pm 1.074$	$79.576 \pm 1.002$	$69.865 \pm 1.672$	$118.607 \pm 1.226$	$120.00 \pm 0.00$	$0.001 \pm 0.001$	$0.001 \pm 0.001$
DL-based	$8.594 \pm 0.161$	$10.091 \pm 0.209$	$7.636 \pm 0.172$	$15.466 \pm 0.391$	–	$0.013 \pm 0.003$	$0.052 \pm 0.006$
Zone-selection	$19.502 \pm 0.342$	$22.881 \pm 0.396$	$17.332 \pm 0.413$	$35.589 \pm 0.845$	–	$0.002 \pm 0.001$	$0.011 \pm 0.003$
Multi-lateration	$22.838 \pm 0.547$	$28.494 \pm 1.207$	$19.039 \pm 0.529$	$42.425 \pm 1.193$	–	$0.002 \pm 0.001$	$0.009 \pm 0.002$

Table 10. Performance of the proposed scheme trained under a representative path-loss setting across different communication infrastructures.

Environment	MAE (m)	RMSE (m)	Median Error (m)	P90 Error (m)	SR(1 m)	SR(2 m)
BLE	$0.702 \pm 0.029$	$1.168 \pm 0.171$	$0.696 \pm 0.014$	$0.965 \pm 0.005$	$0.994 \pm 0.002$	$0.994 \pm 0.002$
Wi-Fi	$0.699 \pm 0.031$	$1.168 \pm 0.190$	$0.686 \pm 0.015$	$0.963 \pm 0.004$	$0.994 \pm 0.002$	$0.994 \pm 0.002$
Zigbee	$0.772 \pm 0.032$	$1.191 \pm 0.178$	$0.759 \pm 0.012$	$0.970 \pm 0.004$	$0.994 \pm 0.003$	$0.994 \pm 0.003$

Table 11. Performance of the proposed scheme under varying shadow-fading levels in the BLE environment.

$\sigma$ (dB)	MAE (m)	RMSE (m)	P90 Error (m)	SR(0.5 m)	SR(1 m)	SR(2 m)	Steps	CVR
0.0	$0.788 \pm 0.029$	$1.186 \pm 0.169$	$0.972 \pm 0.003$	$0.177 \pm 0.012$	$0.991 \pm 0.002$	$0.992 \pm 0.003$	$110.50 \pm 3.50$	$0.995 \pm 0.002$
2.0	$0.949 \pm 0.050$	$1.955 \pm 0.250$	$0.979 \pm 0.003$	$0.179 \pm 0.011$	$0.973 \pm 0.005$	$0.975 \pm 0.005$	$155.10 \pm 5.50$	$0.978 \pm 0.004$
4.0	$1.956 \pm 0.161$	$5.065 \pm 0.444$	$1.321 \pm 0.734$	$0.177 \pm 0.013$	$0.908 \pm 0.009$	$0.910 \pm 0.009$	$136.61 \pm 4.82$	$0.911 \pm 0.009$
6.0	$3.235 \pm 0.261$	$8.153 \pm 0.575$	$11.354 \pm 1.847$	$0.183 \pm 0.011$	$0.859 \pm 0.011$	$0.860 \pm 0.011$	$112.40 \pm 4.17$	$0.861 \pm 0.012$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, Y.; Kim, K.; Lee, S.; Seon, J.; Kim, S.; Kim, J. Novel Positioning Scheme Based on Supervised Deep Reinforcement Learning for Indoor Wireless Localization. Electronics 2026, 15, 2203. https://doi.org/10.3390/electronics15102203


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
