1. Introduction
Integrated sensing and communication (ISAC) has shown great potential for intelligent networks and the Internet of Things (IoT). It is widely recognized as a key driver for the evolution of future wireless networks [
1]. Numerous innovative applications, including autonomous mobility [
2], virtual/augmented reality [
3], and location-based services (LBSs) [
4], require close collaboration between wireless communication and environmental sensing in beyond-fifth generation (B5G) and sixth-generation (6G) networks [
5,
6]. As a typical application of ISAC, indoor localization through wireless signals plays a crucial role across various scenarios. Indoors, satellite signals are typically reflected or absorbed by building structures, making satellite-assisted technologies insufficient for accurate positioning [
7,
8]. As a result, indoor localization has become an increasingly critical area of research.
As the fifth generation of mobile networks, 5G provides significant advantages for indoor localization, including better energy efficiency, low latency, and the ability to support massive device connectivity. Among the various reference signals in 5G networks, the synchronization signal–reference signal received power (SS-RSRP) captures essential indoor radio propagation characteristics and is widely accessible on commercial devices without extra hardware cost [
9,
10], making it a more practical feature than positioning reference signals (PRSs) and channel state information (CSI). Motivated by these advantages, this study focuses on indoor localization methods based on SS-RSRP.
A fingerprint-based localization system typically operates in two phases: offline and online. As illustrated in
Figure 1, in the offline phase, the area of interest is partitioned into reference points (RPs), where received signal strength (RSS) measurements from available base stations (BSs) are collected to construct an offline database. In the online phase, RSS measurements of a mobile device (MD) from unknown locations are fed into a localization model trained on a server to estimate the position [
11,
12]. Owing to their robustness and adaptability in complex and dynamic indoor environments, fingerprinting-based approaches have become some of the most widely adopted solutions for indoor positioning.
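The two-phase workflow described above can be illustrated with a minimal sketch. It uses a plain k-nearest-neighbors matcher rather than the deep model studied in this paper, and the survey values, coordinates, and function names (`build_database`, `locate`) are hypothetical placeholders chosen only for illustration.

```python
import math

def build_database(survey):
    """Offline phase: average the RSS vectors collected at each reference point."""
    database = {}
    for position, readings in survey.items():
        n = len(readings)
        database[position] = [sum(column) / n for column in zip(*readings)]
    return database

def locate(database, rss, k=3):
    """Online phase: average the positions of the k closest stored fingerprints."""
    ranked = sorted(database.items(), key=lambda item: math.dist(item[1], rss))
    nearest = [position for position, _ in ranked[:k]]
    x = sum(p[0] for p in nearest) / len(nearest)
    y = sum(p[1] for p in nearest) / len(nearest)
    return (x, y)

# Hypothetical survey: RP coordinates (m) -> repeated RSS readings (dBm) from three BSs.
survey = {
    (0.0, 0.0): [[-70.0, -85.0, -90.0], [-72.0, -83.0, -92.0]],
    (5.0, 0.0): [[-80.0, -75.0, -88.0], [-78.0, -77.0, -86.0]],
    (0.0, 5.0): [[-82.0, -90.0, -74.0], [-84.0, -88.0, -76.0]],
}
database = build_database(survey)
estimate = locate(database, [-71.0, -84.0, -91.0], k=1)
print(estimate)
```

With `k=1` the estimate is simply the reference point whose stored fingerprint is closest to the query; larger `k` averages several candidate positions.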
Fingerprint-based localization methods can be broadly categorized into traditional machine learning (ML) algorithms and deep learning (DL) approaches. Classical ML algorithms, such as k-nearest neighbors (KNN) [
13], support vector machines (SVM) [
14], and random forests (RF) [
15], have long been employed for indoor localization but are primarily designed for static positioning. In contrast, DL methods leverage the powerful feature extraction capabilities of neural networks, typically achieving higher localization accuracy than traditional ML approaches, and have gained considerable traction in trajectory prediction tasks. For example, some studies have utilized convolutional neural networks (CNNs) [
16,
17] to enable accurate feature extraction. Long short-term memory (LSTM) networks [
18,
19,
20] are also applied to achieve higher accuracy at the cost of increased computational complexity. Recently, transformer architectures have gradually been introduced into wireless signal processing and localization due to their powerful global modeling capabilities and efficient handling of sequential data [
21,
22,
23]. Accordingly, this study concentrates on DL-based fingerprint localization.
However, indoor localization of mobile devices in ISAC still faces two major challenges: (1) the construction of a detailed map and (2) the dynamic prediction of device positions during movement. These challenges and the proposed solutions are outlined as follows.
Challenge 1: The positioning accuracy of fingerprinting algorithms strongly depends on the size and coverage of the dataset. Collecting wireless signal measurements through on-site surveys is both time-consuming and labor-intensive, and achieving full coverage of the localization area is often impractical. Existing data augmentation methods mainly focus on static point enhancement, with limited consideration for the diversity enhancement of dynamic trajectories.
Challenge 2: In indoor scenarios involving device mobility, location estimation becomes inherently dynamic, with variations in movement states and instability in received signals substantially degrading the accuracy of trajectory prediction. Traditional fingerprinting-based localization systems, which rely on single signal measurements, are particularly vulnerable to wireless interference and often suffer from large deviations in positioning results.
Motivated by the above challenges, we propose a trajectory-based localization framework that combines data augmentation with a multi-head attention model. Our trajectory-based data augmentation (TDA) algorithm employs a generative adversarial network (GAN) to synthesize fingerprints at unmeasured locations, while a path generation procedure links synthetic and real points to form continuous trajectories. This expands the trajectory map without additional data collection and enriches fingerprint diversity. Because the extended map consists of trajectories rather than isolated points, the generated data can be applied directly to trajectory prediction, overcoming a limitation of prior studies. The augmented dataset is then exploited by a multi-head attention model to capture spatio-temporal dependencies in dynamic SS-RSRP sequences. Furthermore, an auxiliary loss with directional constraints is introduced to refine trajectory prediction, ensuring closer alignment with the ground truth. The proposed method thus constructs the trajectory map at low cost and fully leverages trajectory information to enhance the accuracy of trajectory prediction.
The main contributions of this study are summarized as follows.
- (1) Trajectory-based data augmentation: In TDA, synthetic fingerprints at unmeasured locations are generated using a conditional Wasserstein generative adversarial network (CWGAN). An artificial path generation algorithm then links real and synthetic points to create diverse trajectories and construct an extended trajectory map, thereby improving both the density and variability of the training dataset.
- (2) Attention-based localization model with an auxiliary loss: We employ a multi-head attention mechanism to learn spatial–temporal dependencies and dynamic variations in collected SS-RSRP sequences. To reduce trajectory prediction error, an auxiliary loss based on directional constraints is incorporated during model training, resulting in predicted trajectories that more closely match the ground truth.
- (3) Validation of performance: Extensive real-world experiments in a 5G system with device mobility are conducted to evaluate the performance. The results show that our method significantly outperforms existing localization methods.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the system architecture and details the proposed framework. Section 4 describes the experimental setup and discusses the results. Finally, Section 5 concludes the paper.
4. Experiment and Performance Analysis
4.1. Experimental Setup
To evaluate the effectiveness of the proposed localization approach, we conducted experiments in a hall on the first floor of an academic building. The hall covers an area of approximately 500 m². As illustrated in Figure 8, six 5G BSs are deployed across different floors: two on the first floor, one on the second floor, and three on the third floor. For clarity, BSs on different floors are marked in different colors in the figure. Because of the walls and floors between the BSs and the hall, as well as the height differences in BS deployment, wireless signals are attenuated and obstructed by the building's structure. The positioning scenario can therefore be regarded as a non-line-of-sight (NLOS) environment, which poses significant challenges to accurate localization. To maintain a clear space for regular human activity, a 10 m × 7 m area at the center of the hall was designated as the positioning area. During data collection, the environment contained moving pedestrians, contributing to a realistic and dynamic testing scenario.
During the offline training phase, data were collected using a mobile robot equipped with a 16-line light detection and ranging (LiDAR) sensor and a Huawei P40 smartphone mounted on its top, as shown in
Figure 9. The robot autonomously navigated predefined trajectories within the designated area. During movement, the smartphone continuously captured SS-RSRP from surrounding 5G BSs, while the robot simultaneously recorded its 2D coordinates using SLAM-based localization. The robot maintained a constant speed of 1 m/s to simulate pedestrian movement and recorded both position and signal measurements at 1 s intervals. At each time step, one sample comprises the SS-RSRP readings from all BSs and the corresponding 2D location. All SS-RSRP values were preprocessed by missing-value imputation and min–max normalization. The data were then segmented into sequences using a sliding window with a length of
L. In total, 2096 sequences were collected and split into training and validation sets at a ratio of 8:2.
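The preprocessing steps described above (missing-value imputation, min–max normalization, and sliding-window segmentation) can be sketched as follows. The text does not specify the imputation rule or the normalization scope, so the floor value of -140 dBm, the global (rather than per-BS) scaling, and all function names are assumptions made only for illustration.

```python
def impute_missing(samples, fill_value=-140.0):
    """Replace missing readings (None) with an assumed floor value in dBm."""
    return [[fill_value if v is None else v for v in row] for row in samples]

def min_max_normalize(samples):
    """Scale all readings to [0, 1] using the global minimum and maximum (assumed scope)."""
    lo = min(v for row in samples for v in row)
    hi = max(v for row in samples for v in row)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [[(v - lo) / span for v in row] for row in samples]

def sliding_windows(samples, positions, length):
    """Pair each window of `length` consecutive samples with the matching positions."""
    return [
        (samples[i:i + length], positions[i:i + length])
        for i in range(len(samples) - length + 1)
    ]

# Hypothetical recording: one sample per second from two BSs at 1 m/s, so RPs are 1 m apart.
raw = [[-80.0, -95.0], [-82.0, None], [-85.0, -97.0],
       [-90.0, -99.0], [-88.0, -100.0], [-86.0, None]]
positions = [(float(i), 0.0) for i in range(len(raw))]

normalized = min_max_normalize(impute_missing(raw))
windows = sliding_windows(normalized, positions, length=4)
print(len(windows))
```

Six consecutive samples and a window length of 4 yield three overlapping sequences, each paired with the positions recorded at the same time steps.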
During the online testing phase, the mobile device was rebooted and reconnected to BSs. Unlike the training phase, where movement followed a predetermined trajectory, the robot moved randomly throughout the entire localization area to construct the test dataset. This approach aims to closely simulate real-world positioning scenarios, where the target MD does not always follow straight trajectories. The collected test data were also preprocessed through missing-value imputation and normalization. Consecutive samples were segmented into multiple sequences of length
L, which were then fed into the pretrained localization model to output predicted coordinates. The test set comprised 300 sequences, ensuring no overlap with the training set. Detailed experimental settings are summarized in
Table 4.
All experiments were performed on a computer with a 13th Gen Intel Core i5-13400 processor (2.50 GHz) and 16 GB of RAM, running Windows 10. The software environment included Python 3.8 and PyTorch 2.3.0.
4.2. Accuracy of Location Estimation
Compared with existing methods, our approach expands the trajectory map without additional data collection cost and achieves more accurate and robust trajectory prediction using a multi-head attention mechanism and a direction-constrained auxiliary loss. To evaluate the performance of our method, we compare it with traditional ML algorithms and leading DL methods. ML algorithms include KNN [
13], SVM [
14], and RF [
15]. In addition, several leading DL methods are compared, including those that exploit sequential inputs for trajectory localization (e.g., CNN/RNN architectures [
26,
27,
34] and attention-based models [
22,
23]), as well as localization approaches enhanced through data augmentation [
34,
40]. For a fair comparison, all methods used the same test points. The localization results of each method are summarized and compared in
Table 5. For all baseline methods, the relative percentage improvement in MAE achieved by the proposed approach was evaluated. The results indicate that our proposed model achieves improved localization accuracy, with average positioning errors reduced by at least 47% and 34% compared to ML and DL methods, respectively. In the following, we discuss these results in detail.
In general, DL approaches outperform traditional ML methods in localization accuracy. Traditional ML algorithms typically generate predictions based on individual samples, making them more vulnerable to signal fluctuations and environmental noise. As shown in
Figure 10, ML methods yield average localization errors in the range of 2 m to 3 m. In contrast, DL approaches deliver superior localization performance, benefiting from their greater capacity for hierarchical feature extraction and representation learning.
To assess the benefits of trajectory-based localization, we first compare the proposed method with point-based DL approaches, such as CNN-LSTM and GAN. CNN-LSTM captures inter-feature dependencies but ignores temporal information, while GAN-based methods augment data at unmeasured points yet still perform single-point localization. Environmental dynamics, such as pedestrian activity, may lead to distribution shifts between training and deployment data, thereby degrading localization accuracy. Our model leverages multi-head attention to model global relationships across multiple steps rather than relying solely on instantaneous measurements. By using sequence data to capture spatio-temporal correlations and mitigate NLOS and multipath effects, our method outperforms these approaches by over 49% and keeps the MAE below 2 m, as shown in
Table 5.
To further validate the effectiveness of the proposed attention-based localization model, we compared it with other trajectory prediction methods. A test set was collected along a trajectory near the boundary of the localization region.
Figure 11 presents both the predicted trajectories and the step-wise localization errors. The results show that our method yields predictions much closer to the ground truth and consistently outperforms other sequence-based approaches. WiFiNet mainly captures local features through convolution, limiting its ability to model dynamic trajectory changes. LSTMs improve sequential prediction but depend on step-by-step processing of the wireless signals, which reduces robustness in complex environments. In contrast, the attention mechanism captures global temporal dependencies, making our model more resilient in predicting dynamic movement.
For data augmentation, TDA is introduced to construct an extended trajectory map. Trajectory prediction requires diverse trajectories, rather than isolated point data, to enhance the model's feature extraction and fitting capability. VITAL and ANVIL apply simple image-domain augmentations at existing locations, such as adding Gaussian noise or adjusting brightness and contrast. Although the GAN-based method generates new fingerprints at unmeasured locations to enrich the dataset, it still lacks path continuity. These approaches therefore provide limited trajectory diversity. In addition, unlike MapLoc, which trains and fine-tunes models separately on the generated and original datasets, our approach constructs an extended trajectory map on a mixed dataset. This enables simultaneous learning of fingerprint features from both existing and unmeasured locations, thereby reducing the risk of overfitting to a single type of dataset. Our method achieves an MAE of 1.09 m, which is 34.73%, 35.88%, and 45.50% lower than VITAL, ANVIL, and MapLoc, respectively. The 80% and 90% CDF errors are only 1.50 m and 2.23 m. These results validate the effectiveness of constructing extended trajectory maps via TDA in significantly enhancing trajectory prediction accuracy.
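The reported metrics can be computed as in the sketch below. It assumes that the MAE is the mean Euclidean distance between predicted and true positions and that the 80% and 90% CDF errors are the corresponding empirical percentiles of that distance. The function names and sample trajectory are hypothetical.

```python
import math

def localization_errors(predicted, truth):
    """Euclidean distance (m) between each predicted and true position."""
    return [math.dist(p, t) for p, t in zip(predicted, truth)]

def mean_absolute_error(errors):
    """Average localization error over all test samples."""
    return sum(errors) / len(errors)

def cdf_error(errors, level):
    """Smallest error e such that at least `level` of the samples have error <= e."""
    ordered = sorted(errors)
    # round() guards against floating-point noise in level * n before taking the ceiling
    index = math.ceil(round(level * len(ordered), 9)) - 1
    return ordered[max(index, 0)]

# Hypothetical test set: the truth is the origin, and predictions are 1 m to 10 m away.
truth = [(0.0, 0.0)] * 10
predicted = [(float(i), 0.0) for i in range(1, 11)]

errors = localization_errors(predicted, truth)
mae = mean_absolute_error(errors)
cdf80 = cdf_error(errors, 0.8)
cdf90 = cdf_error(errors, 0.9)
print(mae, cdf80, cdf90)
```

Reporting percentile errors alongside the mean shows how heavy the tail of large deviations is, which the mean alone can hide.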
In summary, the proposed method effectively augments the dataset and expands the trajectory map, achieving the lowest localization error by leveraging TDA and the multi-head attention mechanism.
4.3. Ablation Study
To evaluate the contribution of each component of the proposed model, we conducted an ablation study to analyze the impact of TDA and the localization network on localization accuracy.
4.3.1. Validity of TDA
The performance of indoor localization for moving MDs is strongly influenced by both the number of trajectories and their spatial distribution. To improve localization accuracy, we propose TDA to construct an extended trajectory map, as described in Algorithm 1.
We compared the localization performance with and without TDA. Based on empirical observations, the number of unmeasured locations is set equal to that of the original RPs. As shown in
Figure 12, without TDA, the localization model quickly overfits the training set, yielding a low training loss but consistently high validation loss due to distribution mismatch. With TDA, the model converges faster, and the loss gap between training and validation decreases. These results indicate that TDA improves both data diversity and training efficiency.
We further evaluated the performance of TDA under varying numbers of generated samples. As shown in
Figure 13, “+200% Gen” indicates that the number of generated trajectories is twice that of the original ones. The results show a consistent decrease in localization error with more generated samples, confirming the effectiveness of TDA. Using only the raw dataset achieves an MAE of 1.55 m, with CDF80% and CDF90% errors of 2.09 m and 2.83 m, respectively. With 100% generated samples, these errors decrease to 1.31 m, 1.88 m, and 2.45 m, representing reductions of 15.48%, 10.05%, and 13.43%, respectively. With 300% generated samples, the errors drop further to 1.09 m (MAE), 1.50 m (CDF80%), and 2.23 m (CDF90%), an improvement of approximately 30% in MAE. However, further increasing the number of generated samples yields diminishing improvements and higher training costs. Therefore, “+300% Gen” offers the best trade-off between accuracy improvement and efficiency. These results further validate that TDA substantially enhances localization accuracy.
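The percentage reductions quoted above follow directly from the reported errors. The short check below recomputes them, with the helper name `relative_reduction` introduced only for this illustration.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Errors in meters, as reported in the text for the raw and augmented datasets.
raw = {"MAE": 1.55, "CDF80": 2.09, "CDF90": 2.83}
gen_100 = {"MAE": 1.31, "CDF80": 1.88, "CDF90": 2.45}
gen_300 = {"MAE": 1.09, "CDF80": 1.50, "CDF90": 2.23}

gains_100 = {k: round(relative_reduction(raw[k], gen_100[k]), 2) for k in raw}
gains_300 = {k: round(relative_reduction(raw[k], gen_300[k]), 2) for k in raw}
print("+100% Gen", gains_100)
print("+300% Gen", gains_300)
```

For "+300% Gen", this gives reductions of roughly 29.7% (MAE), 28.2% (CDF80%), and 21.2% (CDF90%), so the approximately 30% figure corresponds to the MAE.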
4.3.2. Impact of Multi-Head Attention and Auxiliary Loss
We conducted ablation experiments to evaluate the impact of the multi-head attention mechanism and the auxiliary loss on localization accuracy. By selectively removing the attention mechanism and the auxiliary loss, we assessed their individual contributions to error reduction. As shown in
Figure 14, the combination of the two mechanisms achieves the lowest localization error, with an MAE of 1.09 m, representing a reduction of 0.16 m compared to using either the attention mechanism or the auxiliary loss alone and a reduction of 0.74 m relative to using neither. With the constraint of the auxiliary loss, the predicted movement directions between consecutive positions more closely follow the true trajectory, with fewer abrupt turns, thereby reducing overall localization error. Notably, the multi-head attention and the auxiliary loss reduce the localization error by approximately 31.87% and 12.9%, respectively. These findings demonstrate that incorporating the auxiliary loss and attention mechanism significantly contributes to accuracy improvement.
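To illustrate how a directional constraint can penalize abrupt turns, the sketch below adds a cosine-based direction term to a squared position error. The exact form of the auxiliary loss is not given in this section, so the cosine formulation, the weight of 0.1, and all function names are assumptions rather than the loss actually used in the experiments.

```python
import math

def displacements(track):
    """Step vectors between consecutive 2D positions."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(track, track[1:])]

def direction_penalty(predicted, truth, eps=1e-8):
    """Mean of (1 - cosine similarity) between predicted and true step directions."""
    steps = list(zip(displacements(predicted), displacements(truth)))
    total = 0.0
    for (px, py), (tx, ty) in steps:
        norm = math.hypot(px, py) * math.hypot(tx, ty)
        total += 1.0 - (px * tx + py * ty) / (norm + eps)
    return total / len(steps)

def position_loss(predicted, truth):
    """Mean squared Euclidean distance between predicted and true positions."""
    return sum(math.dist(p, t) ** 2 for p, t in zip(predicted, truth)) / len(predicted)

def total_loss(predicted, truth, weight=0.1):
    """Position loss plus a weighted direction penalty (the weight is an assumed value)."""
    return position_loss(predicted, truth) + weight * direction_penalty(predicted, truth)

# Hypothetical tracks: both predictions are 0.5 m off at every step,
# but one is parallel to the true path while the other zigzags across it.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
parallel = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]
zigzag = [(0.0, 0.5), (1.0, -0.5), (2.0, 0.5), (3.0, -0.5)]

print(total_loss(parallel, truth), total_loss(zigzag, truth))
```

The two predictions have the same position error, yet the zigzag track receives a larger total loss because its step directions deviate from the true heading. A direction term is meant to discourage this kind of behavior.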
4.4. Hyperparameter Analysis
In this section, we conduct experiments to examine the impact of key hyperparameters, including the sequence length, the number of encoder layers, and the number of attention heads. To ensure a fair comparison, we focus solely on the parameters of the localization model, excluding data augmentation, which has been discussed in the previous section.
We first compare the average localization error under different sequence lengths. As shown in Table 6, shorter sequences are more susceptible to noise from signal fluctuations, while increasing L helps smooth out dynamic variations and yields more stable localization results. When L = 4, the model achieves the lowest MAE of 1.55 m. Further increasing L leads to a slight degradation in performance, because the current position is most strongly influenced by recent movements and longer sequences may introduce noise from outdated or irrelevant observations. Longer sequences also incur higher computational costs. A sequence length (i.e., sliding window size) of L = 4 is therefore found to be optimal.
To examine the impact of the number of attention heads, we fixed the number of layers in the multi-head attention encoder and varied the number of heads. As shown in
Table 7, different head configurations result in minor performance differences. Using four attention heads achieved the lowest MAE values of 1.08 m and 1.64 m on the validation and test sets, respectively. We then varied the number of encoder layers. The results presented in
Table 8 indicate that the model achieved lower localization errors when the encoder consisted of two or three layers. However, the performance gap between these two configurations was negligible on the validation set. Deeper architectures enhanced the model’s feature extraction capability but also increased the risk of overfitting and computational overhead. Considering the trade-off between model complexity and localization performance, we set the encoder depth to 2, which achieved an MAE of 1.03 m on the validation set and 1.55 m on the test set. Moreover, even without data augmentation, our model still outperforms ANVIL and VITAL by over 8%. These experimental results confirm the effectiveness and robustness of the proposed localization model when utilizing SS-RSRP sequences for positioning.
4.5. Complexity Analysis
In a fingerprinting-based indoor positioning system, the complexity of the offline construction phase does not affect the real-time performance of the online matching phase. The time complexity of the offline phase, which mainly comprises TDA and the training of the localization model, is analyzed in detail below.
The time complexity of the CWGAN network is expressed as follows:
where
E is the number of epochs,
N is the number of training samples,
and
are the numbers of layers in the generator and discriminator, and
and
are the hidden units. This complexity is comparable to that of a GAN. The only additional computational cost arises from the path generation algorithm. Given
G generated trajectories, the complexity is
, which is negligible compared with the training complexity of the GAN network.
The computational complexity of the proposed localization model is dominated by the attention mechanism and is comparable to that of prior attention-based approaches, including ANVIL and VITAL. Since the sequence length (
L) is much smaller than the embedding dimension in multi-head attention, the complexity can be expressed as follows:
where
H represents the number of attention heads and
is the dimension of keys/queries per head.
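For orientation, the standard per-layer cost of a Transformer-style encoder can be written out as below. This is the textbook expression and not necessarily the exact formula intended here. The symbols d (embedding dimension) and C_layer (cost of one encoder layer) are introduced only for this sketch, with d_k denoting the per-head key/query dimension and d = H d_k assumed.

```latex
\begin{align}
  d &= H \, d_k, \\
  C_{\text{layer}} &= \mathcal{O}\!\left( L^{2} d + L d^{2} \right), \\
  C_{\text{layer}} &\approx \mathcal{O}\!\left( L \, (H d_k)^{2} \right)
  \quad \text{when } L \ll d .
\end{align}
```

The first term of the second line covers the attention scores and the weighted sum over the sequence. The second term covers the linear projections and the feed-forward sublayer. With a short window such as L = 4, the projection term dominates, which is consistent with the remark that the sequence length is much smaller than the embedding dimension.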
Table 9 provides a detailed comparison of the computational complexities of different models, including the total training time and the inference time for a single test sample. Here, GFLOPs denote the giga floating point operations required by the localization model. For traditional machine learning algorithms, the complexity is negligible, since they do not involve explicit model training. During the offline phase, data augmentation (if applicable) and model training are only performed once. The reported training time includes both data augmentation and the training of the localization model.
In terms of GFLOPs, the proposed model is less complex than other attention-based methods such as VITAL and ANVIL. This advantage stems from the fact that VITAL introduces additional computational overhead through patch embedding, while our model adopts a more lightweight multi-head attention structure than ANVIL. Regarding training time, methods with data augmentation incur additional costs, since both the augmentation process and the training of the localization model must be considered. However, unlike the complex kernel function computation of Gaussian process regression applied in MapLoc, CWGAN generates new samples with only a single forward propagation. This makes our model more computationally efficient while also producing diverse samples beyond simple interpolation, offering greater feasibility in real-world deployment. Other trajectory-based methods (e.g., WiFiNet and LSTM) achieve shorter training times without data augmentation but fail to capture sufficient sample diversity and thus show limited accuracy on small datasets. Since model training occurs only offline, training time does not affect online efficiency. Our method achieves an online inference latency of about 2 ms. Overall, the computational overhead of our method remains acceptable and constitutes a reasonable trade-off for enhanced localization performance.
5. Conclusions and Future Work
In this work, we proposed an attention-based indoor localization system leveraging an extended trajectory map constructed via a novel trajectory-based data augmentation method. By employing a conditional Wasserstein generative adversarial network (CWGAN), synthetic fingerprints were generated at unmeasured locations, and a path generation algorithm enriched trajectory diversity. To improve trajectory prediction, a multi-head attention model with a direction-constrained auxiliary loss effectively captured spatial–temporal dependencies in SS-RSRP sequences. Extensive experiments in a real 5G indoor environment demonstrate that the proposed system significantly outperforms existing methods, reducing the average localization error by at least 34%. The findings confirm that integrating trajectory-based data augmentation with attention modeling enhances robustness and accuracy for positioning in next-generation wireless networks.
As part of future work, we will extend our study to large-scale indoor scenarios where environmental dynamics—such as furniture rearrangement, pedestrian density variations, and layout changes—may lead to distribution shifts and performance degradation. To enhance robustness under such conditions, we plan to investigate data filtering and noise suppression techniques, as well as domain adaptation and transfer learning methods, to improve the generalization capability of the proposed system across diverse environments.