Our approach employs deep learning techniques while relying on minimal prior knowledge of the underlying physics of radio signal propagation. Specifically, we utilize a pretrained neural network and fine-tune it on the available training data, leveraging extensive data augmentation strategies to enhance generalization and mitigate the limitations imposed by data scarcity.
4.1. Data Preprocessing
To ensure consistency and enhance feature representation, a series of systematic preprocessing steps were applied to the dataset. These steps standardize the input data, address variations in geometry and dimensions, and facilitate the integration of additional contextual information relevant to specific tasks. In alignment with the requirements of Tasks 1, 2, and 3, the dataset was prepared as follows:
For Task 1, the input data consist of three channels: reflectance, transmittance, and distance. Each input image is padded to form a square and subsequently resized to 518 × 518 pixels. Padding is applied using a value of −1, since the value 0 carries meaningful information within the dataset. For Tasks 2 and 3, frequency information must also be incorporated. It is encoded as a single additional channel in which every pixel is assigned a uniform value equal to the carrier frequency in GHz. In Task 3, the antenna configuration must be explicitly represented. To encode the antenna's radiation pattern, an additional input channel of the same spatial dimensions is introduced, in which each pixel value corresponds to the antenna gain at the angle between the respective pixel position and the antenna location. This representation provides a structured visualization of the antenna's signal coverage, offering spatial context for signal propagation and station placement, as illustrated in Figure 1d.
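A simplified sketch of this channel construction is given below; the helper name, array layout, and use of bilinear resizing are illustrative assumptions rather than exact implementation details.

```python
import numpy as np
import torch
import torch.nn.functional as F

def build_input(reflectance, transmittance, distance,
                freq_ghz=None, gain_map=None, out_size=518):
    """Stack the per-task channels, pad to a square with -1, and resize.

    reflectance, transmittance, distance: 2-D arrays of equal shape (Task 1).
    freq_ghz: scalar carrier frequency in GHz, encoded as a uniform channel (Tasks 2 and 3).
    gain_map: 2-D array of per-pixel antenna gain toward the antenna location (Task 3).
    """
    channels = [reflectance, transmittance, distance]
    if freq_ghz is not None:
        channels.append(np.full_like(reflectance, freq_ghz, dtype=np.float32))
    if gain_map is not None:
        channels.append(gain_map)

    x = torch.from_numpy(np.stack(channels).astype(np.float32))  # (C, H, W)

    # Pad to a square with -1, since 0 is a meaningful value in this dataset.
    _, h, w = x.shape
    side = max(h, w)
    x = F.pad(x, (0, side - w, 0, side - h), value=-1.0)

    # Resize the square image to the model's input resolution.
    x = F.interpolate(x.unsqueeze(0), size=(out_size, out_size),
                      mode="bilinear", align_corners=False).squeeze(0)
    return x
```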
4.1.1. Normalization
We conducted an exploratory analysis of the provided dataset to determine the value ranges for each input channel. Based on these observations, we selected channel-specific normalization factors: 25 for reflectance, 20 for transmittance, 200 for distance, 10 for frequency, and 40 for the radiation pattern. Each channel was subsequently normalized by dividing its values by the corresponding factor. The resulting normalized channels serve as the input for the models.
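A compact sketch of this step, assuming the channels are kept in a named dictionary:

```python
# Channel-wise normalization factors chosen from the exploratory analysis.
NORM_FACTORS = {
    "reflectance": 25.0,
    "transmittance": 20.0,
    "distance": 200.0,
    "frequency": 10.0,
    "radiation_pattern": 40.0,
}

def normalize(channels):
    """Divide each named channel by its factor; `channels` maps names to arrays."""
    return {name: values / NORM_FACTORS[name] for name, values in channels.items()}
```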
4.1.2. Data Augmentation
We applied a series of data augmentation techniques to enhance the model’s generalization capabilities and robustness to variations in the input data. The following augmentation strategies were employed:
MixUp augmentation: MixUp is a data augmentation technique that generates synthetic training samples by linearly interpolating pairs of input images and their corresponding labels. This method enhances the model’s ability to generalize by encouraging smoother decision boundaries. Additionally, MixUp blends the frequency channels of two inputs, effectively creating intermediate frequency values and enabling the model to generalize to unseen frequencies. In our training pipeline, MixUp was applied to 75% of the training samples, while the remaining 25% were left unchanged. This balanced approach ensures that the model benefits from both augmented and original data, promoting diversity in the learned feature representations while preserving alignment with the original data distribution.
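A minimal sketch of this step is given below; the Beta-distribution parameter is an assumed value, as the distribution of the mixing coefficient is not specified here.

```python
import torch

def mixup(x1, y1, x2, y2, alpha=0.2, p=0.75):
    """Blend two (input, label) pairs; applied to ~75% of training samples.

    The Beta parameter `alpha` is an assumed value, not specified in the text.
    Because the frequency channel is blended as well, the mixed sample carries
    an intermediate carrier-frequency value.
    """
    if torch.rand(1).item() > p:
        return x1, y1  # leave ~25% of the samples unchanged
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```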
Rotation and flipping: These augmentations introduce variations in the spatial orientation of input samples, improving the model’s invariance to geometric transformations. During training, each input sample is randomly rotated by 0°, 90°, 180°, or 270°, with equal probability. Since no transformation occurs when 0° is selected, rotation is effectively applied to approximately 75% of the training data. Similarly, flipping is applied, where each input sample, along with its corresponding label, is either left unchanged or flipped horizontally and/or vertically. These transformations mitigate overfitting to specific orientations and structural layouts, allowing the model to learn orientation-invariant representations and improve its generalization to unseen data.
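A sketch of these geometric augmentations, assuming tensors whose trailing dimensions are spatial:

```python
import torch

def random_rotate_flip(x, y):
    """Apply the same random 90-degree rotation and flips to the input and its label."""
    k = int(torch.randint(0, 4, (1,)))                    # 0, 90, 180, or 270 degrees
    x, y = torch.rot90(x, k, dims=(-2, -1)), torch.rot90(y, k, dims=(-2, -1))
    if torch.rand(1).item() < 0.5:                        # horizontal flip
        x, y = torch.flip(x, dims=(-1,)), torch.flip(y, dims=(-1,))
    if torch.rand(1).item() < 0.5:                        # vertical flip
        x, y = torch.flip(x, dims=(-2,)), torch.flip(y, dims=(-2,))
    return x, y
```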
Cropping and resizing: Cropping is used to simulate partial observations of the input data, forcing the model to learn from varying spatial contexts. This augmentation is applied to 75% of the training samples, while the remaining 25% remain unaltered. The crop size is randomly selected within a range between half the size of the input image (259 × 259 pixels) and the full size (518 × 518 pixels). Following cropping, each extracted region is resized to 518 × 518 pixels to maintain consistency in the input dimensions required by the model.
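A sketch of this augmentation, assuming (C, H, W) tensors and bilinear resizing:

```python
import torch
import torch.nn.functional as F

def random_crop_resize(x, y, p=0.75, full=518, half=259):
    """Crop a random square between 259x259 and 518x518 and resize it back to 518x518."""
    if torch.rand(1).item() > p:
        return x, y                                       # ~25% of samples stay unaltered
    size = int(torch.randint(half, full + 1, (1,)))       # random crop size
    top = int(torch.randint(0, full - size + 1, (1,)))
    left = int(torch.randint(0, full - size + 1, (1,)))
    x = x[..., top:top + size, left:left + size]
    y = y[..., top:top + size, left:left + size]

    def back_to_full(t):                                  # resize (C, h, w) -> (C, 518, 518)
        return F.interpolate(t.unsqueeze(0), size=(full, full),
                             mode="bilinear", align_corners=False).squeeze(0)

    return back_to_full(x), back_to_full(y)
```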
Figure 2 depicts the inputs obtained after applying the aforementioned augmentations.
4.1.3. Training, Validation, and Testing Splits
For our experiments, the dataset was partitioned into training, validation, and testing sets according to predefined criteria. Across all tasks, the training set comprises buildings #1 to #19, while buildings #20 to #22 are allocated to the validation set, and buildings #23 to #25 are reserved for testing. For Tasks 2 and 3, the training set includes carrier frequencies of 868 MHz and 3500 MHz, whereas the 1800 MHz frequency is exclusively used for validation and testing. For Task 3, the first three antenna configurations are included in the training set, while configuration #4 is designated for validation, and configuration #5 is held out for testing. As a result, the validation and test sets contain previously unseen buildings across all tasks. Additionally, in Tasks 2 and 3, the validation and test sets include carrier frequencies not present in the training data. In Task 3, the model is further challenged by evaluating its performance on unseen antenna configurations in the validation and test sets.
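For reference, these split criteria can be summarized as a small configuration literal (a hypothetical encoding, with Task 3 shown in full; Tasks 1 and 2 drop the antenna and/or frequency fields):

```python
SPLITS = {
    "train":      {"buildings": range(1, 20),  "freqs_mhz": [868, 3500], "antennas": [1, 2, 3]},
    "validation": {"buildings": range(20, 23), "freqs_mhz": [1800],      "antennas": [4]},
    "test":       {"buildings": range(23, 26), "freqs_mhz": [1800],      "antennas": [5]},
}
```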
4.2. Neural Network Design
Our neural network consists of three parts: a DINOv2 vision transformer [26] used as the encoder, a UPerNet convolutional decoder [27], and a neck that connects the ViT-based encoder to the convolutional decoder. We use the ViT-B/14 version of DINOv2 with pretrained weights. First, the input image is passed through a convolutional layer that outputs a three-channel image; this is done for compatibility with DINOv2's input, so that the pretrained weights of the network can be leveraged. The resulting image is passed to the encoder to obtain the image embeddings. The embeddings from all 14 layers (the first layer, the 12 hidden layers, and the output layer) are then passed through a linear layer that reduces the embedding size from 768 to 256 for Task 1 and to 512 for Tasks 2 and 3. The low-dimensional embeddings are then reshaped into square feature maps and fed to the convolutional neck.
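A condensed sketch of this encoder stage is given below; loading the backbone through torch.hub and the 1 × 1 kernel of the input adapter are assumptions, and for brevity only the transformer-block outputs are gathered, whereas the full model also projects the first and final layers.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """Input adapter, DINOv2 ViT-B/14 backbone, and per-layer linear projection."""

    def __init__(self, in_channels, proj_dim=256):        # proj_dim = 512 for Tasks 2 and 3
        super().__init__()
        self.to_rgb = nn.Conv2d(in_channels, 3, kernel_size=1)   # match DINOv2's 3-channel input
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        self.proj = nn.Linear(768, proj_dim)

    def forward(self, x):                                  # x: (B, C, 518, 518)
        x = self.to_rgb(x)
        layers = self.backbone.get_intermediate_layers(x, n=12)  # tuple of (B, 37*37, 768)
        grids = []
        for tokens in layers:
            b, n, d = tokens.shape
            side = int(n ** 0.5)                           # 37 patches per side at 518 x 518
            grid = self.proj(tokens).transpose(1, 2).reshape(b, -1, side, side)
            grids.append(grid)                             # square feature maps for the neck
        return grids
```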
The neck consists of a convolutional layer with kernel size 1 that maps the neck input dimension to a predefined depth for each layer, followed by a bilinear resize operation that scales the image by a predefined factor for each layer, and, finally, another convolutional layer with kernel size 3 that does not alter the depth of the input tensor. Earlier layers use a higher scale factor and a smaller depth, while later layers use a smaller scale factor and a higher depth. Specifically, the scale factors for each layer are {14, 14, 14, 8, 8, 8, 4, 4, 4, 2, 2, 2, 1, 1}, and the convolution depths are {16, 16, 16, 32, 32, 32, 64, 64, 64, 128, 128, 128, 256, 256} for Task 1 and {32, 32, 32, 64, 64, 64, 128, 128, 128, 256, 256, 512, 512} for Tasks 2 and 3.
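One stage of the neck can be sketched as follows (layer names are illustrative):

```python
import torch.nn as nn

class NeckLayer(nn.Module):
    """One neck stage: 1x1 conv to the target depth, bilinear upscaling by the
    layer's factor, and a 3x3 conv that keeps the depth unchanged."""

    def __init__(self, in_dim, depth, scale):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, depth, kernel_size=1)
        self.resize = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.refine = nn.Conv2d(depth, depth, kernel_size=3, padding=1)

    def forward(self, x):
        return self.refine(self.resize(self.reduce(x)))

# Task 1 configuration from the text; 256 is the projected embedding size.
scales = [14, 14, 14, 8, 8, 8, 4, 4, 4, 2, 2, 2, 1, 1]
depths = [16, 16, 16, 32, 32, 32, 64, 64, 64, 128, 128, 128, 256, 256]
neck = nn.ModuleList(NeckLayer(256, d, s) for s, d in zip(scales, depths))
```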
The activations are then passed to the UPerNet [27] decoder to obtain the output. To reinforce the room borders, we also concatenate the reflectance and transmittance channels with the activations obtained from the neck before feeding them to the decoder. The sigmoid function is then applied to the output, and the result is multiplied by 160 to obtain the final prediction. The model architecture is shown in Figure 3.
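A sketch of this final stage is shown below; the exact way the two extra channels are matched to the multi-scale neck activations is an assumption (here they are resized to each feature map's resolution before concatenation).

```python
import torch
import torch.nn.functional as F

def predict(neck_feats, reflectance, transmittance, decoder, scale=160.0):
    """Concatenate the border channels, decode, and map the output to the target range.

    neck_feats : list of (B, C_i, H_i, W_i) activations from the neck.
    reflectance, transmittance : (B, 1, 518, 518) channels reinforcing room borders.
    decoder : UPerNet-style decoder taking a list of multi-scale feature maps.
    """
    borders = torch.cat([reflectance, transmittance], dim=1)
    enriched = [
        torch.cat([f, F.interpolate(borders, size=f.shape[-2:],
                                    mode="bilinear", align_corners=False)], dim=1)
        for f in neck_feats
    ]
    out = decoder(enriched)
    # Sigmoid bounds the output to (0, 1); scaling by 160 maps it to the target range.
    return torch.sigmoid(out) * scale
```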