2.1. Markov Decision Process Formulation for Positioning
RSS has been widely utilized for indoor localization due to its practical availability and location-dependent characteristics [14,15]. Unlike positioning methods requiring dedicated ranging or angle-measurement hardware, RSS can be directly obtained from widely deployed wireless communication systems such as Bluetooth, Wi-Fi, and Zigbee, making it a cost-effective solution for indoor positioning. In indoor environments, RSS is affected by the distance between a device and RPs [16]. The RSS values observed from multiple RPs can serve as meaningful information for estimating the target position. For this reason, RSS provides a practical basis for formulating the indoor positioning problem considered in this paper.
The overall structure of the proposed SDRL framework is depicted in Figure 1.
The indoor wireless localization environment consists of a user and $N$ RPs. The user receives signals that carry the RP coordinates and measures the RSS from each of the $N$ RPs. Using the wireless signal path-loss model [11] expressed in (1), the four RPs estimated to be nearest to the user, i.e., those with the strongest RSS, are chosen to establish the environment of the SDRL framework.

$$\mathrm{RSS}(d) = b - 10\,n\,\log_{10}(d) \tag{1}$$

where $d$ represents the distance between the RP and the user in meters, $n$ is the path-loss exponent that governs the rate of signal attenuation with distance, and $b$ is the reference RSS in dBm measured at 1 m. Both parameters are determined empirically for each indoor propagation environment. Based on the selected RPs, the user position is estimated through interaction between the agent and the environment. To achieve stable and efficient learning, the agent is trained using an actor-critic-based SDRL framework.
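The RP selection step can be illustrated with a minimal sketch. It assumes the common log-distance form RSS(d) = b - 10 n log10(d) with d in meters; the parameter values, the RP layout, and the function names are hypothetical and serve only as an illustration.

```python
import math

def rss_from_distance(d, n, b):
    """Log-distance path-loss model: RSS in dBm at distance d (meters)."""
    return b - 10.0 * n * math.log10(d)

def distance_from_rss(rss, n, b):
    """Invert the model to obtain a distance estimate in meters."""
    return 10.0 ** ((b - rss) / (10.0 * n))

def select_strongest_rps(rps, k=4):
    """Return the k RPs with the strongest RSS, strongest first."""
    return sorted(rps, key=lambda rp: rp["rss"], reverse=True)[:k]

# Hypothetical parameters and RP layout (not taken from the paper).
n, b = 2.5, -40.0
user = (3.0, 2.0)
layout = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0), (10.0, 10.0), (12.0, 1.0)]
rps = [{"xy": xy, "rss": rss_from_distance(math.dist(user, xy), n, b)} for xy in layout]
selected = select_strongest_rps(rps)
```

Because the modeled RSS decreases monotonically with distance, ranking RPs by RSS is equivalent to ranking them by estimated distance under this model.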
In general, SDRL approaches often use expert demonstrations, labeled state–action pairs, or supervised pretraining to guide policy learning [17]. In such approaches, supervision is usually provided in the form of desired actions or trajectories, and the agent learns to imitate or initialize a policy from externally provided examples. By contrast, the SDRL framework considered in this paper does not rely on expert demonstrations or labeled action trajectories. Instead, supervision is introduced through the reward design during the agent training stage. Specifically, the target position is used only in the training stage to construct a target-aware reward function, allowing the agent to learn a policy that selects actions leading toward the target. In the online position estimation stage, the target position is no longer given to the agent; the trained policy sequentially estimates the user position. Therefore, the term “supervised” in the proposed SDRL framework refers to reward-level supervision rather than to action-level supervision. Similar to reward-shaped DRL [18], prior task knowledge is injected via the reward signal instead of imitation. The term SDRL is adopted to emphasize that the target information serves as a supervisory cue during training, while preserving the standard reinforcement learning interaction between the agent and the environment.
Based on this SDRL formulation, the positioning problem can be modeled as an MDP [11]. The MDP consists of three major components: state, action, and reward function. The state is defined as a six-dimensional vector composed of the current two-dimensional agent position and the RSS values measured from the four selected RPs, where the selected RPs are those providing the strongest RSS values among all available RPs. The action space is designed as a multi-scale action set $\mathcal{A}$ to overcome the accuracy-efficiency limitation imposed by a fixed movement resolution. Specifically, $\mathcal{A}$ consists of two-dimensional movement vectors in which each axis independently takes one of seven values spanning three step scales, yielding $7 \times 7 = 49$ action candidates per decision step. The three scales of $\mathcal{A}$ support coarse, intermediate, and fine movements, so that the agent can perform rapid coarse exploration when far from the target and fine adjustment near the target within a single unified policy. Through repeated interactions with the environment, the agent learns an effective policy for target-directed position estimation.
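A minimal sketch of the state and action construction is given below. The step sizes are hypothetical, and the sketch assumes that the per-axis value set contains zero plus a signed step at each of three scales, which is one natural way to obtain seven values per axis and hence 49 actions.

```python
from itertools import product

# Hypothetical step sizes in meters (coarse, intermediate, fine);
# the actual values are not taken from the paper.
SCALES = (1.0, 0.5, 0.1)

def build_action_set(scales=SCALES):
    """Every (dx, dy) pair where each axis is 0 or +/- one of the scales."""
    axis_values = sorted({0.0} | set(scales) | {-s for s in scales})
    return list(product(axis_values, repeat=2))

def build_state(agent_xy, rss_values):
    """Six-dimensional state: agent (x, y) followed by four selected RSS values."""
    if len(rss_values) != 4:
        raise ValueError("expected RSS from exactly four selected RPs")
    return (agent_xy[0], agent_xy[1], *rss_values)

ACTIONS = build_action_set()
state = build_state((0.0, 0.0), (-48.2, -51.7, -55.0, -60.3))
```

Under this assumption, a single discrete policy head with 49 outputs can choose between a large jump and a small correction at every step, rather than switching between separate coarse and fine policies.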
While the multi-scale action design addresses the resolution-efficiency trade-off in the agent’s movements, the reward function should be designed to counter another well-known limitation of search-based DRL, namely the sparse reward problem. To provide dense and continuous learning signals throughout the search, the reward function is constructed based on the APF method. The underlying principle of the APF method is to model the motion of an agent as being driven by a virtual potential field, in which goals and obstacles are represented by attractive and repulsive potentials, respectively [19]. Following this principle, the target is treated as a source of attractive force, namely the gravitational force (GF), acting on the agent, while the selected RPs are treated as sources of repulsive force (RF). The GF is derived from the negative gradient of an attractive potential, as given in (2).

$$F_{g} = -\nabla\left(\tfrac{1}{2}\,k\,\rho_{g}^{2}\right) = -k\,\rho_{g}\,\nabla\rho_{g} \tag{2}$$

where $k$ and $\rho_{g}$ denote the positive coefficient and the agent’s distance from the target, respectively. The magnitude of the GF, $k\rho_{g}$, increases as the agent moves farther from the target, thereby encouraging the agent to move toward the target location. Likewise, the RF is derived from the negative gradient of a repulsive potential field, as expressed in (3).

$$F_{r} = -\nabla U_{r} \tag{3}$$

where

$$U_{r} = \begin{cases} \tfrac{1}{2}\,m\left(\dfrac{1}{\rho_{r}} - \dfrac{1}{\rho_{0}}\right)^{2}, & \rho_{r} \le \rho_{0} \\ 0, & \rho_{r} > \rho_{0} \end{cases}$$

Here, $m$, $\rho_{r}$, and $\rho_{0}$ are the positive coefficient, the agent’s distance from the RP, and the maximal range of influence of the RP, respectively. The maximal range of influence is assigned differently to each selected RP according to its RSS ranking: a smaller range is assigned to an RP with a stronger RSS, whereas a larger range is assigned to an RP with a weaker RSS. This design is motivated by the fact that an RP with a stronger RSS is more likely to be located close to the user and should therefore affect the agent only within a more localized region.
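The two forces can be sketched as follows, assuming the classical APF potentials (a quadratic attractive potential and an inverse-distance repulsive potential that vanishes beyond the range of influence). The coefficient values, the candidate ranges, and the function names are hypothetical.

```python
import math

def attractive_force(agent, target, k):
    """Negative gradient of U = 0.5 * k * rho**2; points toward the target."""
    return (-k * (agent[0] - target[0]), -k * (agent[1] - target[1]))

def repulsive_force(agent, rp, m, rho0):
    """Negative gradient of U = 0.5 * m * (1/rho - 1/rho0)**2 inside range rho0."""
    dx, dy = agent[0] - rp[0], agent[1] - rp[1]
    rho = math.hypot(dx, dy)
    if rho == 0.0 or rho > rho0:
        # Outside the range of influence; the force is also left at zero
        # exactly at the RP, where the potential is undefined.
        return (0.0, 0.0)
    mag = m * (1.0 / rho - 1.0 / rho0) / rho ** 2
    return (mag * dx / rho, mag * dy / rho)  # points away from the RP

def influence_ranges(rss_values, ranges=(1.0, 1.5, 2.0, 2.5)):
    """Give the smallest range to the strongest RSS and the largest to the weakest."""
    order = sorted(range(len(rss_values)), key=lambda i: rss_values[i], reverse=True)
    out = [0.0] * len(rss_values)
    for rank, i in enumerate(order):
        out[i] = ranges[rank]  # hypothetical range values in meters
    return out
```

The attractive force grows linearly with the distance to the target, while each repulsive force acts only inside its own range, so a strongly received RP perturbs the agent in a smaller neighborhood than a weakly received one.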
Using the APF method, the GF and RF can be converted into a reward or a penalty according to the action taken by the agent. The APF-based reward function is given in (4). The reward in (4) involves the distance between the agent’s initial location and the target, and its superscripts denote the time step. The true target position is required only during the offline training stage for two specific purposes: (i) evaluating the GF in (2), which depends on the agent’s distance from the target, and (ii) computing the target-dependent terms in (4). In practice, this ground-truth information is obtained through an offline data collection phase, as in conventional fingerprint-based methods [6,8]; however, the labeling burden is significantly reduced because the proposed framework requires only a sparse set of labeled target positions to shape the reward, rather than a fine-grained labeled grid that covers the entire indoor area. Once the policy has been trained, the online estimation stage operates entirely without ground-truth target information, relying solely on real-time RSS measurements and the learned policy. Through this target-aware reward design, the proposed SDRL framework can provide dense and continuous learning signals, thereby mitigating the sparse reward problem that commonly arises in search-based positioning tasks.
When the agent reaches a position sufficiently close to the target, a positive terminal reward is assigned, and the episode is terminated. By contrast, when the updated position exceeds the feasible search boundary, a negative reward is imposed. Accordingly, the proposed MDP formulation enables the agent to learn an effective positioning policy that achieves a desirable balance between search efficiency and estimation accuracy.
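The terminal and boundary rules can be sketched as a single training-time transition function. This is only an illustration: the goal radius, the reward magnitudes, and the choice to keep the agent in place (without ending the episode) after a boundary violation are assumptions, and the dense APF-based reward is passed in as a precomputed value rather than reproduced here.

```python
import math

def step_outcome(agent, action, target, bounds, shaped_reward,
                 goal_radius=0.2, goal_reward=10.0, boundary_penalty=-5.0):
    """Apply one move and return (new_position, reward, done).

    bounds is (xmin, xmax, ymin, ymax); shaped_reward is the dense reward
    for an ordinary step. All default values are hypothetical.
    """
    x, y = agent[0] + action[0], agent[1] + action[1]
    xmin, xmax, ymin, ymax = bounds
    if not (xmin <= x <= xmax and ymin <= y <= ymax):
        # Leaving the feasible region: penalize and keep the previous position.
        return agent, boundary_penalty, False
    if math.dist((x, y), target) <= goal_radius:
        # Close enough to the target: terminal reward, episode ends.
        return (x, y), goal_reward, True
    return (x, y), shaped_reward, False
```

The target argument is needed only during training; at inference time the learned policy is rolled out without evaluating any reward.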
2.2. PPO-Based Policy Learning
PPO is a policy gradient-based reinforcement learning algorithm that aims to improve the policy while keeping the training behavior stable [20]. In PPO, a neural network parameterizes the policy, and the agent learns it by maximizing the expected cumulative reward through repeated interactions with the environment. PPO has been widely adopted to stabilize policy learning by constraining excessive policy updates. As a result, PPO can be a viable solution for the proposed search-based positioning framework.
A key component of PPO is importance sampling, which measures the discrepancy between the updated policy and the previous policy through their probability ratio. The sampling ratio is given in (5).

$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \tag{5}$$

where $\theta$ is the policy parameter, and $\pi_{\theta}$ and $\pi_{\theta_{\mathrm{old}}}$ denote the new and old policies, respectively. This ratio indicates how much the probability of selecting an action $a_t$ at a state $s_t$ changes after the policy update.
To avoid overly large policy changes, PPO employs a clipping mechanism in the surrogate objective. The clipped policy objective is expressed in (6).

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{6}$$

where $\mathrm{clip}(\cdot)$ is the clipping function, $\epsilon$ is a hyperparameter controlling the allowable update range, and $\hat{A}_t$ denotes the advantage function. The advantage function represents how much better the selected action is than the policy’s average behavior at the given state. By clipping the probability ratio to the interval $[1-\epsilon,\, 1+\epsilon]$, PPO suppresses excessively large updates and thereby improves training robustness.
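The probability ratio and the clipping rule can be sketched in a few lines of plain Python. The sketch follows the standard PPO clipped surrogate, min(r A, clip(r, 1 - eps, 1 + eps) A), averaged over a batch; the function names and the default clipping value are illustrative assumptions.

```python
import math

def probability_ratio(logp_new, logp_old):
    """r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * adv, r_clipped * adv)
    return total / len(ratios)
```

With eps = 0.2, a ratio of 1.5 paired with a positive advantage of 2.0 contributes 1.2 × 2.0 = 2.4 instead of 3.0, so the objective offers no extra gain for moving the policy further in that direction. With a negative advantage of -2.0, the same ratio contributes the unclipped -3.0, because the minimum keeps the more pessimistic term.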
The overall objective function of PPO consists of the clipped policy objective, the state-value loss, and the entropy bonus, as given in (7).

$$L(\theta) = \mathbb{E}_t\!\left[L^{\mathrm{CLIP}}_t(\theta) - c_1\,L^{\mathrm{VF}}_t(\theta) + c_2\,S[\pi_{\theta}](s_t)\right] \tag{7}$$

where $L^{\mathrm{CLIP}}$ is the clipped policy objective, $L^{\mathrm{VF}}$ is the loss of the value function, $S$ is the entropy bonus of the policy, and $c_1$ and $c_2$ are the weighting coefficients. The value loss helps the critic network estimate the state value more accurately, whereas the entropy term encourages sufficient exploration and reduces the risk of premature convergence to suboptimal policies. Through this composite objective, PPO achieves stable policy optimization while preserving the exploration capability of the agent.
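The composite objective can be illustrated with a small sketch that combines the three terms. It assumes a mean-squared-error value loss and the Shannon entropy of a categorical action distribution, as in the standard PPO formulation; the coefficient defaults are commonly used values and are not taken from the paper.

```python
import math

def value_loss(values, returns):
    """Mean squared error between critic predictions and return targets."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)

def entropy(probs):
    """Shannon entropy (in nats) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_objective(l_clip, l_vf, ent, c1=0.5, c2=0.01):
    """Quantity to maximize: clipped surrogate - c1 * value loss + c2 * entropy."""
    return l_clip - c1 * l_vf + c2 * ent
```

For a policy over 49 discrete actions, the entropy term is bounded above by ln 49 ≈ 3.89 nats, which is reached when all actions are equally likely.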
Based on the objective function in (7), the policy parameters are iteratively updated by gradient ascent as

$$\theta \leftarrow \theta + \alpha\,\nabla_{\theta} L(\theta)$$

where $\alpha$ denotes the learning rate. In the proposed positioning framework, PPO is adopted to train the agent so that it can learn a stable and effective policy for target-directed position estimation under the designed MDP. Owing to its clipped update structure and actor-critic framework, PPO is well-suited to the proposed SDRL framework, where stable convergence and reliable policy improvement are essential.
2.3. Anchor-Based Environment Construction Strategy
In order to improve the adaptability of the proposed positioning scheme across different indoor environments, an anchor-based environment construction strategy is adopted. In RSS-based positioning, the absolute coordinates of RPs may vary significantly depending on the deployment scenario, even when the relative signal characteristics observed by the user are similar. If the localization environment is directly represented using the original RP coordinates, the learned policy may become highly dependent on a specific layout, which degrades its adaptability to different RP deployment conditions. To address this issue, the proposed positioning scheme constructs a transformed local environment using only the four selected RPs and redefines their coordinates with respect to a common RP.
The coordinates of the four selected RPs are denoted by

$$\mathbf{p}_i = (x_i,\, y_i), \quad i = 1, 2, 3, 4,$$

where the selected RPs correspond to the four strongest RSS values among all available RPs. Among them, the RP providing the strongest RSS is chosen as the anchor RP, since it is expected to be the closest RP to the user. The coordinate of the anchor RP is expressed by

$$\mathbf{p}_{a} = (x_a,\, y_a).$$

Then, the coordinates of all selected RPs are transformed by subtracting the anchor RP coordinate as

$$\tilde{\mathbf{p}}_i = \mathbf{p}_i - \mathbf{p}_a = (x_i - x_a,\; y_i - y_a), \quad i = 1, 2, 3, 4.$$

Through this transformation, the anchor RP is always located at the origin, i.e.,

$$\tilde{\mathbf{p}}_a = (0,\, 0).$$

Using the same coordinate transformation, the target position and the current agent position are also represented in the transformed local coordinate system. The original target and agent coordinates are denoted by $\mathbf{p}_{\mathrm{tar}}$ and $\mathbf{p}_{\mathrm{agt}}$, respectively. Their transformed coordinates are given by

$$\tilde{\mathbf{p}}_{\mathrm{tar}} = \mathbf{p}_{\mathrm{tar}} - \mathbf{p}_a, \qquad \tilde{\mathbf{p}}_{\mathrm{agt}} = \mathbf{p}_{\mathrm{agt}} - \mathbf{p}_a.$$

Accordingly, the position estimation process is carried out in a local coordinate system centered at the RP with the strongest RSS. This transformation preserves the relative geometry among the selected RPs, the agent, and the target, while eliminating unnecessary dependence on absolute global coordinates.
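The anchor-based transformation amounts to a translation by the coordinate of the strongest RP, which can be sketched as follows. The data layout (a list of coordinate and RSS pairs) and the function name are assumptions made for illustration.

```python
def to_anchor_frame(rps, agent, target=None):
    """Translate the four strongest RPs, the agent, and (optionally) the target
    into a local frame whose origin is the RP with the strongest RSS.

    rps is a list of ((x, y), rss) pairs in global coordinates.
    """
    ranked = sorted(rps, key=lambda item: item[1], reverse=True)[:4]
    ax, ay = ranked[0][0]  # anchor RP: strongest RSS

    def shift(p):
        return (p[0] - ax, p[1] - ay)

    local_rps = [(shift(xy), rss) for xy, rss in ranked]
    local_target = shift(target) if target is not None else None
    return local_rps, shift(agent), local_target
```

The target argument is optional because the target position is available only during training. Since the mapping is a pure translation, all pairwise distances among the RPs, the agent, and the target are unchanged.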
The main advantage of this strategy is that it enables the policy network to focus on the relative spatial relationship implied by the RSS measurements rather than memorizing scenario-specific RP layouts. In other words, even if the absolute RP deployment changes from one indoor environment to another, the transformed environment can still exhibit a similar structural pattern when the relative arrangement of the strongest RPs is comparable. As a result, this strategy can improve the deployment flexibility of the proposed SDRL framework while enhancing robustness across various localization scenarios.
In addition, the feasible search region is determined based on the transformed coordinates of the selected RPs. Therefore, the agent performs position estimation in a compact local search space constructed from the most relevant neighboring RPs instead of the entire global environment. This not only reduces the complexity of the positioning problem but also improves the consistency of the state representation and reward design, leading to more accurate and reliable positioning results. Overall, the proposed anchor-based environment construction strategy can provide an effective basis for stable policy learning and enhanced adaptability in RSS-based indoor localization.
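One simple way to derive a local search region from the transformed RP coordinates is an axis-aligned bounding box with a margin. The text does not specify the exact rule, so the bounding-box construction and the margin value below are assumptions for illustration only.

```python
def search_region(local_rps_xy, margin=1.0):
    """Axis-aligned box (xmin, xmax, ymin, ymax) around the selected RPs.

    local_rps_xy holds the anchor-frame RP coordinates; margin (meters) is a
    hypothetical padding so that positions slightly outside the RP hull
    remain reachable.
    """
    xs = [p[0] for p in local_rps_xy]
    ys = [p[1] for p in local_rps_xy]
    return (min(xs) - margin, max(xs) + margin, min(ys) - margin, max(ys) + margin)

def inside(region, point):
    """True if the point lies within the search region (borders included)."""
    xmin, xmax, ymin, ymax = region
    return xmin <= point[0] <= xmax and ymin <= point[1] <= ymax
```

Because the region is built from anchor-frame coordinates, its extent depends only on the spacing of the four selected RPs and not on where they sit in the global floor plan.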