A Physically Aware Residual Learning Framework for Outdoor Localization in LoRaWAN Networks

Bolatbek, Askhat; Beyca, Ömer Faruk; Zholamanov, Batyrbek; Nurgaliyev, Madiyar; Dosymbetova, Gulbakhar; Almen, Dinara; Saymbetov, Ahmet; Yertaikyzy, Botakoz; Orynbassar, Sayat; Kapparova, Ainur

doi:10.3390/fi18040216


by Askhat Bolatbek ¹, Ömer Faruk Beyca ², Batyrbek Zholamanov ¹, Madiyar Nurgaliyev ¹, Gulbakhar Dosymbetova ¹,*, Dinara Almen ¹, Ahmet Saymbetov ¹,*, Botakoz Yertaikyzy ¹, Sayat Orynbassar ¹ and Ainur Kapparova ¹

¹ Faculty of Physics and Technology, Al-Farabi Kazakh National University, 71 Al-Farabi, Almaty 050040, Kazakhstan
² Department of Industrial Engineering, Istanbul Technical University, Istanbul 34485, Türkiye
* Authors to whom correspondence should be addressed.

Future Internet 2026, 18(4), 216; https://doi.org/10.3390/fi18040216

Submission received: 16 March 2026 / Revised: 10 April 2026 / Accepted: 17 April 2026 / Published: 18 April 2026

(This article belongs to the Section Internet of Things)


Abstract

The rapid growth of large-scale Internet of Things (IoT) deployments in urban environments requires accurate and energy-efficient localization methods for low-power wireless devices. In long-range wide-area networks (LoRaWAN), traditional GPS-based positioning is often impractical due to energy consumption constraints and signal propagation challenges in urban areas. This study proposes a hybrid localization system that integrates weighted centroid localization (WCL) with a machine learning (ML) regression model to improve outdoor positioning accuracy. The proposed approach first estimates approximate transmitter coordinates using a physically grounded WCL method based on received signal strength indicator (RSSI) measurements. These initial estimates are subsequently refined by ML models trained to learn nonlinear residual corrections. In addition to random partitioning, a spatial data splitting strategy is proposed and evaluated using a publicly available LoRaWAN dataset. The experimental results demonstrate that the hybrid WCL framework combined with a multilayer perceptron (MLP) significantly outperforms other ML models. The proposed method achieves a mean localization error of 160.47 m and a median error of 73.78 m. Compared to the baseline model, the integration of WCL reduces the mean localization error by approximately 29%, highlighting the effectiveness of incorporating physically interpretable priors into localization models.

Keywords:

LoRaWAN; outdoor localization; RSSI fingerprinting; machine learning; multilayer perceptron; residual learning


1. Introduction

In recent years, IoT technologies have been increasingly used in various sectors, including environmental monitoring, intelligent transportation, industrial automation, and urban infrastructure management, resulting in rapid growth in the number of low-power wireless devices operating in large-scale networks [1]. The number of connected devices is expected to grow from 19.8 billion in 2025 to more than 40.6 billion by 2034, which further amplifies the need for scalable, energy-efficient, and cost-effective IoT solutions [2]. In such applications, sensor data is of practical value when the location of its source is known [3,4]. Global navigation satellite systems (GNSS), such as GPS, remain the dominant solution for outdoor positioning and can provide meter-level accuracy in open-air conditions [5]. However, integrating GNSS receivers into large-scale IoT deployments is often impractical due to their relatively high energy consumption, additional hardware cost, and the degradation of GNSS positioning performance in dense urban environments, where satellite signals are affected by blockage, non-line-of-sight (NLoS) reception, and multipath propagation [6]. These limitations have motivated research into alternative localization methods that use radio measurements from the existing wireless communication infrastructure, eliminating the need for specialized positioning equipment [7]. LoRaWAN has attracted considerable attention due to its long communication range, low power consumption, and the availability of public network infrastructure for large-scale urban IoT systems [8,9]. LoRaWAN allows battery-powered devices to operate for years, making this technology an energy-efficient solution for monitoring and location applications in smart cities [10]. Existing approaches to localization in LoRaWAN networks can generally be divided into time-based and RSSI-based methods.
Time difference of arrival (TDoA) methods use precise timestamps at multiple gateways and can achieve localization errors on the order of 10–100 m under favorable conditions [11]. However, TDoA-based approaches rely on specialized infrastructure, including nanosecond-level synchronization between gateways, and are highly sensitive to multipath propagation, clock instability, and NLoS conditions. These practical constraints significantly limit their scalability and real-world applicability in large-scale public LoRaWAN deployments [12].

In contrast, RSSI-based localization does not require additional hardware or time synchronization between gateways and is widely supported in LoRaWAN infrastructures [13]. Traditional RSSI-based methods typically rely on path loss modeling or geometric triangulation. However, in complex urban environments, RSSI measurements are highly susceptible to shadowing, building density, multipath fading, and temporal variability of the radio channel, leading to large localization errors with range-based models [14]. Unlike traditional methods based on direct processing of RSSI or TDoA, ML algorithms are able to identify complex nonlinear dependencies in radio data. Fingerprint-based localization combined with ML methods has recently been widely studied [15]. Various ML algorithms have been applied to RSSI-based localization, including k-nearest neighbors (kNN), which estimates location by finding the most similar signal fingerprints in a fingerprint database, and support vector regression (SVR), which models nonlinear dependencies between RSSI features and spatial coordinates. Furthermore, artificial neural networks (ANNs), which are capable of learning complex propagation patterns using multi-layer nonlinear representations, have demonstrated superior performance in urban NLoS environments where signal behavior is highly irregular [16]. Despite the progress achieved, ML-based LoRaWAN localization methods remain limited in large-scale public deployments. Analyses conducted on real urban datasets show that RSSI observations are inherently sparse, heterogeneous, and subject to dynamically varying gateway visibility, resulting in incomplete and non-uniform feature representations that degrade model reliability and generalization capability [17]. Furthermore, many traditional ML models perform direct regression of geographic coordinates from raw RSSI data without incorporating physically interpretable priors.
In dense urban environments, where signal propagation is strongly influenced by NLoS conditions and nonlinear attenuation patterns, such purely data-driven formulations often demonstrate instability and limited transferability across spatial regions [18].

To address the limitations described above, this study aims to improve RSSI-based localization in public LoRaWAN networks by introducing a hybrid framework that integrates WCL with residual learning. The proposed approach combines a physically interpretable localization prior with data-driven residual regression. This combination improves robustness and generalization capability under sparse, heterogeneous, and predominantly NLoS urban conditions. Within this framework, multiple regression models are evaluated at the residual stage, including an MLP, kNN, Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), enabling a systematic comparison of ML methods and neural networks under identical physical prior distributions and feature representations.

Main contributions:

Proposed a physically grounded hybrid localization framework, integrating WCL with residual learning to improve RSSI-based positioning in public LoRaWAN networks.
Conducted a systematic evaluation of multiple residual regression models, including MLP, kNN, XGBoost, and LightGBM, allowing a controlled comparison of ML methods and neural networks under identical physical priors and feature representations.
Introduced a spatial data splitting method based on grid partitioning to ensure geographic separation between training and testing regions, providing a realistic assessment of model performance in previously unseen urban areas.

2. Related Works

This paper explores the localization capabilities implemented using the intrinsic characteristics of LoRa technology, which allows for coordinate determination without the use of additional equipment or specialized infrastructure. Table 1 provides a brief overview of outdoor LoRaWAN localization methods reported in the literature, all of which were evaluated using data collected in the city of Antwerp [19].

The table presents the approaches proposed and evaluated by various researchers, along with the corresponding localization accuracy metrics, including the mean and median localization error and the R² score, where available. In [19], a kNN-based localization method was evaluated, where the optimal value of the parameter k was determined through hyperparameter tuning, resulting in a mean localization error of 398.40 m and a median error of 273.03 m. While the kNN-based fingerprinting approach demonstrates moderate localization accuracy, it relies solely on similarity matching in the RSSI space and does not incorporate physically interpretable priors or mechanisms for spatial generalization, which may limit its robustness in heterogeneous and geographically disjoint urban environments. Authors in [20] proposed an RSSI-based fingerprint localization method for LoRaWAN networks, achieving a mean localization error of 291.51 m using a branched Convolutional Neural Network (CNN) architecture enhanced with Squeeze-and-Excitation (SE) blocks. While deep convolutional models improve nonlinear feature extraction, they still rely on direct coordinate regression from RSSI fingerprints and do not explicitly incorporate physically interpretable priors, which may limit their robustness across spatially disjoint regions. Authors in [16] investigated outdoor localization using LoRa technology with the application of several ML models, such as kNN, CNN, SVR, ANN, XGBoost, and LightGBM. The authors proposed a hybrid architecture that combines convolutional feature extraction with gradient-based regression, achieving the best performance with a mean localization error of 244.51 m for the hybrid model, compared to 248.72 m for XGBoost and 249.57 m for LightGBM. 
Despite these improvements, the approach remains fully data-driven and does not explicitly decompose the problem into coarse physical estimation and structured residual correction, leaving open questions regarding interpretability and stability under varying gateway density.

In [21], an ensemble learning-based approach was proposed that integrates RSSI measurements with nanosecond-level timestamp information using a kNN combined with a Random Forest Regressor (RFR), achieving a mean localization error of 332.63 m and a median error of 193.63 m. TDoA-based localization relies on gateway-side infrastructure, specifically on multiple gateways providing precise and synchronized reception timestamps. In contrast, the present study focuses on localization from readily available radio features without assuming timing-based infrastructure as the main design basis.

Table 1. Comparison of outdoor LoRaWAN localization methods.

Paper	Method	Mean (m)	Median (m)	R² Score	Input Features	Preprocessing	Dataset Size	Year
[19]	kNN	398.4	273.0	N/A	RSSI	Missing gateway receptions filled with −200 dBm	123,529	2018
[18]	Random Forest	340	N/A	0.91	RSSI	Missing gateway RSSI set to −200 dBm; RSS transformed into normalized/exponential/powed forms; StandardScaler; PCA with 95% retained variance, reducing 72 features to 40 components, only messages with ≥3 gateways	55,259	2020
[18]	Range-based	700		N/A
[18]	kNN weighted	343		0.90
[18]	SVR	1155		0.55
[18]	Linear SGD	784		0.72
[22]	K-means + Weighted Kernel Regression	346.03	158.41	N/A	RSSI	Keep only messages received by ≥3 gateways; discard 75,556 messages with fewer than 3 gateways; remove gateways with <1% visibility; represent missing reception as −200 dBm	54,873	2022
[23]	RF	351	N/A	N/A	RSSI	RSS transformation into normalized/exponential/powed forms; StandardScaler; PCA with 95% retained variance, reducing 72 features to 40 components; −200 dBm mapped to 0 in positive representation	130,430	2023
[23]	Range-based	735.37	N/A	N/A	RSSI		130,430	2023
[20]	CNN + SE	291.51	147.55	0.93	RSSI, SF	Remove 28 inactive gateways; RSSI representations: Positive/Normalized/Exponential/Powed; StandardScaler + MinMaxScaler	130,430	2024
[17]	kNN	313.30	217.98	N/A	N/A	N/A	N/A	2025
[17]	Neural Network	277.61	163.49	N/A	N/A	N/A	N/A	2025
[16]	k-NN	284.58	160.01	0.9216	RSSI, SNR, SF, Estimated Signal Power (ESP)	Each message converted to one sample; missing RSSI filled with −200 dBm, then replaced by −128 dBm; logarithmic transform + min-max normalization for RSSI; min-max normalization for non-RSSI features and targets	130,430	2025
[16]	CNN	319.57	219.96
[16]	SVR	320.90	194.38
[16]	ANN	279.66	171.70
[16]	XGBoost	248.72	145.54
[16]	LightGBM	249.57	146.72
[16]	Hybrid Model	244.51	130.39
Our work	WCL + MLP	160.47	73.78	0.968	RSSI, SNR, SF, gateway observation mask, statistical features	Removed messages without valid GPS or without reception by at least one gateway; removed gateways with activity below 1%; removed messages received by fewer than 3 gateways; built sparse RSSI/SNR matrices, binary observability matrix, and aggregated statistics	54,874	2026

In study [22], a hierarchical clustering-based approach for urban LoRaWAN localization was proposed, combining K-means clustering, kernel density estimation, Kullback-Leibler divergence, and weighted kernel regression. The method achieved a median localization error of 158.41 m and a mean error of 346.03 m. Although the median error is competitive, the relatively high mean error indicates the presence of large localization deviations in certain regions, suggesting sensitivity to outliers and spatially irregular propagation conditions. In study [18], RSSI fingerprint-based localization methods were evaluated, achieving a mean localization error of 340 m, while path-loss-based ranging methods using propagation models resulted in a mean error of 700 m. In [23], a Random Forest model was applied for distance estimation, followed by modified trilateration, using RSSI, signal-to-noise ratio (SNR), and spreading factor (SF) as input features. The proposed approach achieved a mean localization error of 735.37 m, whereas fingerprint-based localization attained a mean error of 351 m. These studies demonstrate that range-based localization methods, whether based on analytical path-loss models or ML-assisted distance estimation, remain highly sensitive to NLoS conditions and urban channel variability, leading to significantly higher localization errors and limited robustness in practical LoRaWAN deployments. In [17], ML approaches including the kNN algorithm and neural networks were evaluated, achieving mean localization errors of 313.30 m and 277.61 m, respectively. While hyperparameter optimization yields noticeable improvements, the study primarily focuses on model-level optimization rather than architectural reformulation of the localization pipeline.

Overall, although RSSI-based LoRaWAN localization has progressed significantly, most existing methods rely either on purely data-driven regression or analytical ranging models, without combining a physically grounded coarse prior with residual learning. This limitation motivates the development of hybrid frameworks that enhance robustness and spatial generalization.

3. Methodology

3.1. LoRaWAN Technology

LoRaWAN technology is a communication protocol that uses LoRa as its physical layer [24]. The network is organized using a centralized star-of-stars architecture, as illustrated in Figure 1. End devices transmit data exclusively to nearby gateways over the LoRa radio interface and do not communicate directly with each other. The gateways act as transparent relays, forwarding the received uplink frames to a central network server via IP-based backhaul connections such as Ethernet, Wi-Fi, or cellular networks.

In LoRa-based systems, the data rate varies from 300 bps to 50 kbps, depending largely on the SF, which ranges from 7 to 12. The SF determines the symbol duration in LoRa modulation: higher values improve sensitivity and coverage at the expense of lower data rates. In urban areas, LoRa-based systems can offer a communication range of up to 5 km, while in rural areas the range can reach 15 km. The high receiver sensitivity of LoRa systems is a key factor in enabling long-distance communication, as it provides a large link budget [25]. These characteristics make LoRaWAN well-suited for Smart City applications requiring wide-area coverage, low power consumption, and scalable IoT connectivity.
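The dependence of data rate on SF can be illustrated with the commonly used nominal LoRa bit-rate expression, SF × (BW / 2^SF) × CR. The sketch below is illustrative only: the 125 kHz bandwidth and the 4/5 coding rate are assumed settings that are not stated in this section, and the function name is hypothetical.

```python
def lora_bit_rate(sf: int, bandwidth_hz: float = 125_000.0, coding_rate: float = 4 / 5) -> float:
    """Nominal LoRa bit rate in bit/s: SF * (BW / 2**SF) * CR.

    bandwidth_hz and coding_rate defaults are assumptions, not values from the paper.
    """
    if not 7 <= sf <= 12:
        raise ValueError("SF must be between 7 and 12")
    symbol_rate = bandwidth_hz / (2 ** sf)  # symbols per second; halves with each SF step
    return sf * symbol_rate * coding_rate


if __name__ == "__main__":
    for sf in range(7, 13):
        print(f"SF{sf}: {lora_bit_rate(sf):8.1f} bit/s")
```

Under these assumed settings, SF12 yields roughly 0.3 kbps, in line with the lower bound quoted above, while the 50 kbps upper figure is not reproduced by this expression with these settings.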

Although LoRaWAN is an energy-efficient and low-cost solution for outdoor localization, it is important to note several inherent limitations. In particular, LoRaWAN localization is complicated by sparse and irregular gateway visibility, since packets are often received by only a limited subset of gateways. Moreover, public LoRaWAN infrastructures are typically designed for connectivity, which limits anchor geometry and makes precise outdoor localization more challenging. In addition, RSSI-based localization has fundamental limitations due to the physical nature of the radio channel. In dense urban environments, multipath propagation, signal fading, and NLoS conditions lead to significant RSSI fluctuations.

One of the most significant limitations is the variability of transmit power. In LoRaWAN networks, adaptive data rate mechanisms can dynamically adjust transmit power, and if this change is not explicitly accounted for, the localization algorithm may misinterpret a decrease in signal strength as an increase in propagation distance. Consequently, without accounting for the dynamics of transmit power and propagation distortions, RSSI cannot be considered a reliable distance metric, even under idealized conditions.
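The size of this effect can be illustrated with a log-distance path loss model. This model, the path loss exponent n, and the reference distance d_0 are assumptions made here for illustration; they are not specified in this section. Under that model, an unreported transmit power reduction of ΔP (in dB) is indistinguishable from an increase in distance from d to d':

```latex
\mathrm{RSSI}(d) = P_{tx} - PL(d_0) - 10\, n \log_{10}\!\left(\frac{d}{d_0}\right)
\quad\Longrightarrow\quad
\frac{d'}{d} = 10^{\Delta P / (10 n)}
```

For example, with ΔP = 3 dB and a hypothetical exponent n = 3, the ratio is 10^{0.1} ≈ 1.26, so the apparent distance would be overestimated by about 26%.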

3.2. Research Architecture

The overall research methodology is presented in Figure 2, which illustrates the architecture of the proposed hybrid localization system. The architecture consists of sequential blocks that clearly demonstrate all stages of data processing. The process begins with the acquisition of real LoRaWAN measurements, followed by data preprocessing and structured feature construction. Subsequently, two types of data partitioning are defined: random splitting and spatial splitting. In each case, the dataset is divided into training and testing subsets, where the test set is completely isolated and not used during model training or hyperparameter tuning, thereby ensuring an objective evaluation.

For each data split, localization models are trained under identical conditions. First, standalone ML models are evaluated, including MLP, kNN, XGBoost, and LightGBM. These models were selected because of their relevance and the strong performance reported for RSSI-based localization tasks in recent literature [16,17]. Then, the proposed hybrid framework is applied, in which the physically grounded WCL first provides a coarse position estimate, followed by a residual regression model employing ML algorithms that refines this estimate through corrective adjustment. In the final stage, the predicted coordinates are evaluated using standard localization metrics, enabling a quantitative comparison of accuracy and robustness under both random and spatial data partitioning strategies.

3.3. Dataset Collection

This study uses a publicly available LoRaWAN dataset collected in central Antwerp. This dataset was published by Aernouts et al. [19] and was obtained from large-scale measurements conducted over an operational public LoRaWAN network. This dataset is well-suited for localization research in Smart City environments due to its extensive urban coverage and real-world operational conditions. During data collection, twenty Antwerp Post vehicles were equipped with Firefly X1 GNSS receivers, which continuously recorded the vehicle locations (latitude and longitude) and GPS signal quality. This location data was periodically transmitted using IM880B-L radio modules. The LoRaWAN measurements were conducted over an urban area of approximately 53 km², covering the center of Antwerp (Figure 3).

We chose the most recent version of the dataset, released on 19 July 2019. This version was chosen due to its expanded metadata and improved temporal resolution, including nanosecond-precise reception timestamps and comprehensive gateway-specific information. A total of 130,430 messages were recorded by 72 LoRaWAN gateways over several months. The dataset provides LoRaWAN transmission parameters and reception data for each gateway, such as RSSI, SNR, and time of arrival. RSSI values in the dataset are expressed in dBm.

Analysis of the raw dataset shows that the RSSI distribution is highly asymmetric, with a significant number of invalid measurements (−200 dBm). As shown in Figure 4, most valid RSSI values are concentrated in the range from −120 to −105 dBm, indicating predominantly weak signal conditions in the urban LoRaWAN network. In addition, the majority of messages are received by only 2–4 gateways, despite the presence of 72 deployed gateways, which highlights the inherent sparsity of the network and justifies the necessity of preprocessing. This observation indicates that not all deployed gateways contribute equally to localization performance.

3.4. Data Preprocessing

Each raw message record contains the geographic coordinates of the mobile node i, physical-layer transmission parameters such as the SF, and signal measurements including the RSSI and SNR reported by multiple gateways. Preliminary analysis revealed that a substantial portion of these gateways appear only sporadically and therefore do not provide a consistent contribution to the localization task. In the first stage, the complete JSON file was loaded, and messages without valid GPS coordinates or without reception by at least one gateway were discarded.

For each valid message, sparse RSSI and SNR matrices were constructed, together with a binary observability matrix M, defined as:

$$M_{i,g} = \begin{cases} 1, & \text{if message } i \text{ is received by gateway } g, \\ 0, & \text{otherwise.} \end{cases}$$ (1)

Accordingly, the RSSI matrix $X \in \mathbb{R}^{N \times G}$ contains entries $X_{i,g}$ only for observed gateway–message pairs, while missing measurements are preserved as undefined values. To enrich the representation of each transmission, a set of aggregated statistical features was computed. These include the number of gateways that successfully received message i:

$$n_i = \sum_{g=1}^{G} M_{i,g},$$ (2)

the mean RSSI value:

$$\mu_i^{\mathrm{RSSI}} = \frac{1}{n_i} \sum_{g : M_{i,g} = 1} X_{i,g},$$ (3)

the corresponding standard deviation:

$$\sigma_i^{\mathrm{RSSI}} = \sqrt{\frac{1}{n_i} \sum_{g : M_{i,g} = 1} \left(X_{i,g} - \mu_i^{\mathrm{RSSI}}\right)^{2}},$$ (4)

and the maximum observed RSSI:

$$X_i^{\max} = \max_{g : M_{i,g} = 1} X_{i,g}.$$ (5)

The average SNR across all receiving gateways was also computed. In the subsequent cleaning stage, gateways with insufficient coverage were removed by applying a minimum activity threshold, retaining only those gateways that received at least 1% of all messages. In addition, messages received by fewer than three gateways were discarded, as they provide limited spatial information and negatively affect localization robustness. This preprocessing choice is consistent with prior studies [18,22], where messages with fewer than three receiving gateways were also excluded. At the same time, this should be regarded as a limitation of the present method, since the current framework is not designed for extremely sparse-observability cases with fewer than three available gateways. After filtering, the final dataset consists of RSSI, SNR, and MASK matrices, the SF vector, aggregated statistical features, and the corresponding ground-truth latitude and longitude coordinates. The processed dataset is used as input for subsequent training and evaluation of the localization models. Following the preprocessing and filtering stages, the final dataset contains 54,874 valid LoRaWAN messages and 44 active gateways. The reduction in the number of gateways reflects the exclusion of low-activity gateways that do not provide statistically reliable localization information.
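The filtering and feature-aggregation steps described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: missing measurements are assumed to be stored as NaN, the gateway activity filter is assumed to be applied before the three-gateway message filter, and the function and key names are hypothetical.

```python
import numpy as np


def filter_and_aggregate(rssi, snr, min_gateway_activity=0.01, min_gateways=3):
    """rssi, snr: (N, G) float arrays with NaN where gateway g did not receive message i."""
    mask = ~np.isnan(rssi)  # observability matrix, Eq. (1)

    # Keep gateways that received at least the given fraction of all messages.
    keep_gw = mask.mean(axis=0) >= min_gateway_activity
    rssi, snr, mask = rssi[:, keep_gw], snr[:, keep_gw], mask[:, keep_gw]

    # Keep messages received by at least `min_gateways` of the remaining gateways.
    n = mask.sum(axis=1)  # Eq. (2)
    keep_msg = n >= min_gateways
    rssi, snr, mask, n = rssi[keep_msg], snr[keep_msg], mask[keep_msg], n[keep_msg]

    stats = {
        "n_gateways": n,
        "rssi_mean": np.nanmean(rssi, axis=1),  # Eq. (3)
        "rssi_std": np.nanstd(rssi, axis=1),    # Eq. (4), population form (1/n)
        "rssi_max": np.nanmax(rssi, axis=1),    # Eq. (5)
        "snr_mean": np.nanmean(snr, axis=1),
    }
    return rssi, snr, mask.astype(np.uint8), stats, keep_msg, keep_gw
```

Returning `keep_msg` and `keep_gw` lets the same selection be applied to the SF vector and the ground-truth coordinates, so that all arrays stay aligned after filtering.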

3.5. Data Splitting

Reliable evaluation of RSSI-based outdoor localization systems requires careful data partitioning due to the strong spatial correlation between samples. In urban LoRaWAN networks, geographically adjacent measurements often share similar propagation characteristics, gateway visibility patterns, and environmental conditions. If such spatial dependencies are not properly controlled, overly optimistic performance estimates may be obtained. For this reason, two complementary data splitting strategies are employed in this study: random splitting and spatial splitting.

3.5.1. Random Split

The dataset is randomly divided into training and test subsets in an 80% to 20% ratio, respectively. The split is performed using a fixed random seed value of 42 to ensure reproducibility. After preprocessing, random partitioning results in 43,899 training samples and 10,975 test samples. The training and test subsets exhibit approximately uniform spatial distribution across the study area. This configuration evaluates the interpolation capability of the model within the observed geographic region. Since neighboring samples may be present in both subsets, random splitting reflects model performance under similar propagation conditions.

3.5.2. Spatial Split

To evaluate realistic generalization capability, a spatial splitting strategy is employed in which the geographic area is divided into grid cells. A grid size of 50 m × 50 m was selected to ensure proper spatial separation of the data. A smaller grid may lead to spatial data leakage between the training and test sets, whereas a larger grid may merge areas with heterogeneous signal propagation conditions. Therefore, this scale provides a realistic assessment of the model’s performance in an urban LoRaWAN environment. Each measurement point is assigned to a grid cell according to its spatial location. Samples belonging to the same grid cell are grouped together, and the division into training and test sets is performed at the group level. Entire grid cells are assigned exclusively either to the training set or to the test set. This approach guarantees geographic separation between training and testing regions. Consequently, the model is required to predict transmitter positions in areas not observed during training, thereby better reflecting practical LoRaWAN deployment scenarios. In this study, a fixed random seed value of 777 is used, resulting in 40,640 training samples and 14,234 test samples.
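A grid-based group split of this kind can be sketched as follows. The sketch assumes that positions are already expressed in meters in a local Cartesian system, and that 20% of the occupied cells are assigned to the test set; the cell-level test fraction is an assumption, since it is not stated above. The function name is hypothetical, and because the random generator may differ from the one used in the study, the seed of 777 will not necessarily reproduce the reported sample counts.

```python
import numpy as np


def spatial_grid_split(x_m, y_m, cell_size_m=50.0, test_fraction=0.2, seed=777):
    """Return boolean train/test masks such that every grid cell lies entirely in one subset."""
    # Integer grid-cell index of every sample along each axis.
    cx = np.floor(np.asarray(x_m) / cell_size_m).astype(np.int64)
    cy = np.floor(np.asarray(y_m) / cell_size_m).astype(np.int64)
    cells = np.stack([cx, cy], axis=1)

    # Map each sample to the id of its occupied (cx, cy) cell.
    unique_cells, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = cell_id.reshape(-1)  # guard against NumPy versions returning a 2-D inverse

    # Shuffle the cells (not the samples) and assign whole cells to the test set.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(unique_cells))
    n_test_cells = max(1, int(round(test_fraction * len(unique_cells))))
    test_mask = np.isin(cell_id, shuffled[:n_test_cells])
    return ~test_mask, test_mask, cell_id
```

Because the split is made over cells rather than samples, the share of test samples generally differs from the share of test cells, which depends on how densely each cell is populated.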

3.6. Proposed Hybrid Localization Framework

The localization task is formulated as estimating the transmitter position from sparse multi-gateway RSSI observations. In dense urban LoRaWAN deployments, direct regression from raw RSSI vectors is unstable due to nonlinear propagation effects and variable gateway visibility. To improve robustness, we adopt a two-stage hybrid framework that integrates a physics-based coarse estimate with a data-driven residual refinement. The first stage provides a WCL prior, while the second stage learns a residual correction. Let the i-th message be received by a subset of gateways $G_i \subseteq \{1, \dots, G\}$. For each receiving gateway $g \in G_i$, we observe a received signal strength $\mathrm{RSSI}_{i,g}$ (and, optionally, $\mathrm{SNR}_{i,g}$).

The complete pseudocode of the proposed hybrid localization framework is presented in Algorithm 1.

Algorithm 1. Pseudocode of the proposed hybrid localization framework

Input: Raw LoRaWAN dataset D
Output: Final predicted coordinates $\hat{x}$

1: Preprocess D:
    remove invalid messages;
    construct RSSI, SNR, and mask matrices, with the mask defined by Equation (1);
    compute SF and aggregated statistical features using Equations (2)–(5).
2: Filter data:
    remove low-activity gateways;
    remove messages with fewer than 3 observed gateways.
3: Split filtered data into training and test subsets using either random split or spatial split.
4: Estimate effective gateway anchors from training data only:
    for each gateway, retain strongest observations and compute the RSSI-weighted effective anchor using Equation (6) with weights from Equation (7);
    compute the gateway reliability factor using Equation (8).
5: Compute WCL coarse position estimates for all samples using Equation (9) with WCL weights defined in Equation (10).
6: Convert coordinates to local Cartesian space.
7: Form residual targets on training set according to Equation (11).
8: Train the residual regression model on the constructed features to learn the residual mapping defined in Equation (12).
9: Predict residuals on the test set.
10: Obtain final coordinates using Equation (13).
11: Convert predictions back to geographic coordinates.
12: Evaluate localization performance using the geodesic distance error defined in Equation (15).
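Steps 7 to 10 of Algorithm 1 can be sketched as follows. The sketch assumes the residual is defined as the ground truth minus the WCL estimate and that the final position is the WCL estimate plus the predicted residual. `LinearResidualModel` is a hypothetical ridge least-squares stand-in used only to keep the example self-contained; it is not one of the regressors evaluated in the study (MLP, kNN, XGBoost, LightGBM), and any model exposing `fit` and `predict` could take its place.

```python
import numpy as np


class LinearResidualModel:
    """Stand-in residual regressor (ridge least squares); hypothetical, for illustration only."""

    def __init__(self, ridge=1e-6):
        self.ridge = ridge
        self.coef_ = None

    def fit(self, features, residuals):
        design = np.hstack([features, np.ones((len(features), 1))])  # append bias column
        gram = design.T @ design + self.ridge * np.eye(design.shape[1])
        self.coef_ = np.linalg.solve(gram, design.T @ residuals)
        return self

    def predict(self, features):
        design = np.hstack([features, np.ones((len(features), 1))])
        return design @ self.coef_


def fit_predict_residual(model, f_train, xy_train, wcl_train, f_test, wcl_test):
    """All positions are (n, 2) arrays in local Cartesian meters."""
    residual_train = xy_train - wcl_train   # step 7: residual targets (assumed sign convention)
    model.fit(f_train, residual_train)      # step 8: train the residual regressor
    residual_pred = model.predict(f_test)   # step 9: predict residuals on the test set
    return wcl_test + residual_pred         # step 10: corrected coordinates
```

Keeping the WCL prior outside the learned model means the regressor only has to represent the correction, which is typically smaller in magnitude than the absolute coordinates.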

The goal is to estimate the position $x_i \in \mathbb{R}^{2}$ of the transmitter. For learning and computation, geographic coordinates (lat, lon) are converted into a local Cartesian system (x, y) in meters. All results are finally reported as geodesic (Haversine) distance errors (Equation (15)). The WCL stage therefore provides a physically interpretable coarse localization prior, which serves as the foundation for the second-stage data-driven refinement. For all considered models, the input feature vector consists of RSSI, SNR, SF, the gateway observation mask, and aggregated statistical features. All these input parameters are constructed and defined during the data preprocessing stage, where raw LoRaWAN measurements are transformed into a structured representation suitable for learning. The resulting input features used for model training are summarized in Table 2.
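The coordinate handling described above can be sketched as follows. The specific projection is not stated in this section, so the sketch assumes a simple equirectangular approximation around a reference point, which is usually adequate over a city-scale area. The mean Earth radius, the function names, and the reference point are assumptions introduced for illustration; the Haversine function follows the standard great-circle formula.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)


def to_local_xy(lat_deg, lon_deg, lat0_deg, lon0_deg):
    """Equirectangular projection around (lat0, lon0); returns meters east (x) and north (y)."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    lat0, lon0 = np.radians(lat0_deg), np.radians(lon0_deg)
    x = EARTH_RADIUS_M * (lon - lon0) * np.cos(lat0)
    y = EARTH_RADIUS_M * (lat - lat0)
    return x, y


def to_lat_lon(x, y, lat0_deg, lon0_deg):
    """Inverse of to_local_xy; returns latitude and longitude in degrees."""
    lat0, lon0 = np.radians(lat0_deg), np.radians(lon0_deg)
    lat = lat0 + np.asarray(y) / EARTH_RADIUS_M
    lon = lon0 + np.asarray(x) / (EARTH_RADIUS_M * np.cos(lat0))
    return np.degrees(lat), np.degrees(lon)


def haversine_m(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1_deg, lon1_deg, lat2_deg, lon2_deg))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
```

Predictions made in the local (x, y) frame would be passed through `to_lat_lon` before computing `haversine_m` against the ground-truth coordinates, so that errors are reported as geodesic distances.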

3.6.1. Weighted Centroid Localization (WCL)

The first stage computes a coarse but physically interpretable position estimate using WCL. A standard WCL formulation assumes that each gateway has a known anchor position $a_g \in \mathbb{R}^{2}$, and the location of a message is estimated as a weighted average of these anchors. WCL has been widely studied as a low-complexity localization approach, and prior work has shown that its performance strongly depends on the weighting strategy, RSSI quality, and anchor geometry, while improved weight-correction schemes can substantially enhance accuracy under noisy wireless conditions [26]. In addition, analytical studies have demonstrated that WCL remains attractive because of its simplicity, but its estimation error and bias are affected by shadowing, propagation conditions, and network topology [27]. Other theoretical analyses have further confirmed that RSSI-based weighting is generally more effective than purely distance-based proximity weighting in weighted-centroid localization schemes [28]. However, in many practical datasets, precise gateway coordinates may be unavailable, or the effective “radio center” of a gateway may differ from its physical installation point due to urban blockage and directional effects. Therefore, instead of using nominal gateway positions, we estimate effective gateway anchors from the training data.

For each gateway g, consider all training samples where the gateway is observed. Denote the index set of these samples by

Ω_{g}

. To reduce the influence of far-range receptions and heavy NLOS distortions, we retain only the strongest fraction of observations. Let

α \in (0, 1)

be the “top-strong” fraction, and define

I_{g} \subseteq Ω_{g}

as the indices corresponding to the top α RSSI values for gateway g. We estimate the effective anchor as a weighted average of the true training coordinates

x_{i}

:

a_{g} = \frac{\sum_{i \in I_{g}} w_{i, g} x_{i}}{\sum_{i \in I_{g}} w_{i, g} + ε}

(6)

where ε is a small constant for numerical stability. The weight

w_{i, g}

is chosen as an exponential function of RSSI:

w_{i, g} = \exp (\frac{{RSSI}_{i, g} - {RSSI}_{ref}}{τ})

(7)

Here,

τ > 0

is a temperature parameter controlling how sharply strong RSSI values dominate the weighted average, and

{RSSI}_{ref}

is a reference RSSI level used to stabilize scaling (in our implementation a constant value is used). In addition, we compute a simple gateway reliability factor

r_{g}

based on the fraction of training samples in which gateway g is observed:

r_{g} = \frac{1}{N_{tr}} \sum_{i \in train} 1 [g \in G_{i}]

(8)

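where N_{tr} is the number of training samples, 1[·] is the indicator function, and G_{i} is the set of gateways that received message i. Given effective anchors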

\{a_{g}\}

, for each message i we compute the WCL estimate as:

{\hat{x}}_{i}^{WCL} = \frac{\sum_{g \in G_{i}} {\tilde{w}}_{i, g} a_{g}}{\sum_{g \in G_{i}} {\tilde{w}}_{i, g} + ε}

(9)

{\tilde{w}}_{i, g} = r_{g} \cdot \exp (\frac{{RSSI}_{i, g} - {RSSI}_{ref}}{τ})

(10)

The WCL estimate provides a stable initial guess even with sparse observations. Nevertheless, in urban environments it often exhibits systematic biases due to propagation anomalies that cannot be captured by a simple monotonic RSSI weighting.
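Equations (6)–(10) can be illustrated with a short pure-Python sketch. It is not the authors' implementation: the data layout (a list of `(xy, observations)` pairs), the value of `RSSI_REF`, and the reading of `min_gw` as a minimum number of training observations per gateway are assumptions made for the example.

```python
import math
from collections import defaultdict

RSSI_REF = -100.0  # assumed reference level in dBm; the text only says a constant is used
EPS = 1e-9         # small constant for numerical stability


def rssi_weight(rssi, tau):
    """Exponential RSSI weight of Eqs. (7) and (10), without the reliability factor."""
    return math.exp((rssi - RSSI_REF) / tau)


def fit_anchors(train, tau=8.0, top_strong=0.15, min_gw=1):
    """Estimate effective anchors (Eq. 6) and reliability factors (Eq. 8).

    train: list of ((x, y), {gateway_id: rssi}) pairs in local Cartesian metres.
    min_gw: assumed meaning, gateways with fewer observations are skipped.
    """
    per_gw = defaultdict(list)  # gateway -> [(rssi, (x, y)), ...]
    for xy, obs in train:
        for g, rssi in obs.items():
            per_gw[g].append((rssi, xy))

    anchors, reliability = {}, {}
    for g, rows in per_gw.items():
        if len(rows) < min_gw:
            continue
        rows.sort(key=lambda r: r[0], reverse=True)          # strongest first
        keep = rows[:max(1, math.ceil(top_strong * len(rows)))]
        w = [rssi_weight(r, tau) for r, _ in keep]
        denom = sum(w) + EPS
        anchors[g] = (
            sum(wi * xy[0] for wi, (_, xy) in zip(w, keep)) / denom,
            sum(wi * xy[1] for wi, (_, xy) in zip(w, keep)) / denom,
        )
        reliability[g] = len(rows) / len(train)
    return anchors, reliability


def wcl_estimate(obs, anchors, reliability, tau=8.0):
    """WCL position for one message (Eq. 9); gateways without an anchor are ignored."""
    num_x = num_y = denom = 0.0
    for g, rssi in obs.items():
        if g not in anchors:
            continue
        w = reliability[g] * rssi_weight(rssi, tau)
        num_x += w * anchors[g][0]
        num_y += w * anchors[g][1]
        denom += w
    return num_x / (denom + EPS), num_y / (denom + EPS)
```

Note that a message heard by no anchored gateway yields the origin in this sketch, because the numerator and the weight sum are both zero; a practical system would need an explicit fallback for that case.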

To correct WCL’s systematic errors, we use a second stage that learns a residual correction from data. Instead of predicting absolute coordinates directly, the model predicts the residual vector between the ground truth and the WCL estimate:

e_{i} = x_{i} - {\hat{x}}_{i}^{WCL}

(11)

A regression model

f_{θ} (z_{i})

is trained to approximate this residual:

{\hat{e}}_{i} = f_{θ} (z_{i})

(12)

{\hat{x}}_{i} = {\hat{x}}_{i}^{WCL} + {\hat{e}}_{i}

(13)

where

z_{i}

denotes the feature vector constructed from the radio measurements and auxiliary information.
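The two-stage formulation of Equations (11)–(13) can be sketched independently of the chosen regressor. In the example below, `MeanResidualRegressor` is a deliberately trivial, hypothetical stand-in for f_θ that only removes a constant bias; any model with `fit` and `predict` methods could take its place.

```python
def residual_targets(true_xy, wcl_xy):
    """Eq. (11): residual between ground truth and the WCL estimate, per sample."""
    return [(tx - wx, ty - wy) for (tx, ty), (wx, wy) in zip(true_xy, wcl_xy)]


def apply_correction(wcl_xy, predicted_residuals):
    """Eq. (13): final estimate = WCL estimate + predicted residual."""
    return [(wx + ex, wy + ey) for (wx, wy), (ex, ey) in zip(wcl_xy, predicted_residuals)]


class MeanResidualRegressor:
    """Hypothetical stand-in for f_theta: always predicts the mean training residual."""

    def fit(self, features, residuals):
        n = len(residuals)
        self.mean_ = (sum(e[0] for e in residuals) / n,
                      sum(e[1] for e in residuals) / n)
        return self

    def predict(self, features):
        # Eq. (12), here ignoring the features entirely
        return [self.mean_ for _ in features]
```

Because the regressor only sees residual targets, swapping in a different model changes neither the WCL stage nor the final reconstruction step.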

The motivation for the residual formulation is that the WCL stage already provides a physically grounded coarse approximation

{\hat{x}}_{i}^{WCL}

of the true transmitter position

x_{i}

. Therefore, instead of learning the full mapping from radio features to absolute coordinates, the second-stage regressor learns only the residual term

x_{i} - {\hat{x}}_{i}^{WCL}

. This reformulation reduces the effective regression range and shifts the learning problem toward modeling structured correction terms caused by urban propagation distortions, NLoS effects, and irregular gateway visibility. As a result, the second stage solves a narrower and more stable regression task than direct coordinate prediction from sparse LoRaWAN observations.

This interpretation is also supported empirically by the comparative results presented in Section 4. If the WCL prior did not simplify the second-stage learning task, its inclusion would not lead to such consistent improvements across different regressors. However, the WCL-enhanced variants systematically outperform their corresponding direct-regression baselines, especially in the case of WCL + MLP, which indicates that the coarse physically grounded prior makes the subsequent regression problem more structured and easier to learn.

The WCL parameter ranges were chosen to cover practically meaningful and interpretable operating modes. The temperature parameter τ ∈ {6.0, 8.0, 10.0} was chosen to support the analysis of tighter, intermediate, and smoother RSSI weighting. The fraction of strongest samples, top_strong ∈ {0.05, 0.10, 0.15}, was chosen to represent narrow, moderate, and broader subsets of strong signals when evaluating effective anchors. The minimum gateway support, min_gw ∈ {300, 500}, was used to compare a less stringent and more stringent reliability threshold when including the gateway in the calculation. These ranges were intentionally kept compact to allow for an interpretable sensitivity analysis while still covering the main practically relevant WCL configurations.
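For illustration, the parameter ranges listed above can be enumerated as a grid. This sketch assumes that the sensitivity analysis covers the full Cartesian product of the three ranges, which the text does not state explicitly; the names `WCL_GRID` and `wcl_configurations` are hypothetical.

```python
from itertools import product

# Ranges quoted in the text; evaluating the full product is an assumption.
WCL_GRID = {
    "tau": [6.0, 8.0, 10.0],
    "top_strong": [0.05, 0.10, 0.15],
    "min_gw": [300, 500],
}


def wcl_configurations(grid=WCL_GRID):
    """Return every combination of WCL parameters as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

Under this assumption the grid contains 3 × 3 × 2 = 18 configurations, which is small enough to evaluate exhaustively.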

Within the proposed hybrid framework, the residual vector is learned using different regression models (MLP, kNN, XGBoost, and LightGBM) under a unified two-stage formulation. This ensures that all compared ML approaches operate on the same WCL-based prior and an identical feature representation. Such a residual formulation reduces the regression range that each regressor must learn, since the WCL prior already captures the coarse spatial structure.

3.6.2. Multilayer Perceptron (MLP)

We use an MLP [29] regressor to output a 2-D residual

[∆ x, ∆ y]

. Previous studies have demonstrated that MLP models can achieve reliable performance in RSS-based localization under channel impairments such as multipath propagation and fading, and may also serve as effective components of more advanced neural localization frameworks [30]. In addition, MLP-based models have shown good accuracy in coordinate regression for related localization tasks under noisy measurement conditions, supporting their use as flexible nonlinear estimators for spatial prediction problems [31]. To reduce sensitivity to outliers, the network is trained using the Huber loss:

L_{δ} (e_{i}, {\hat{e}}_{i}) = \begin{cases} \frac{1}{2} {‖e_{i} - {\hat{e}}_{i}‖}_{2}^{2}, & {‖e_{i} - {\hat{e}}_{i}‖}_{2} \leq δ, \\ δ {‖e_{i} - {\hat{e}}_{i}‖}_{2} - \frac{1}{2} δ^{2}, & \text{otherwise,} \end{cases}

(14)

where δ is the Huber transition parameter. The choice of Huber loss for the residual regression stage is motivated by the inherent instability of urban LoRaWAN RSSI observations. Training uses early stopping on a validation split and standard feature normalization (z-score standardization) fitted on the training set. Figure 5 shows the architecture of a deep neural network with residual connection designed to regress geographic coordinates from LoRaWAN radio signal features.
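As a concrete reading of Equation (14), the following sketch evaluates the Huber loss on the Euclidean norm of the 2-D residual error. The function name is hypothetical, and the norm-based form follows the equation as written; deep learning frameworks commonly apply the Huber loss per coordinate instead, so the training code used in the study may differ.

```python
import math


def huber_loss(e, e_hat, delta):
    """Huber loss of Eq. (14) on the L2 norm of the 2-D error e - e_hat."""
    r = math.hypot(e[0] - e_hat[0], e[1] - e_hat[1])
    if r <= delta:
        return 0.5 * r * r                     # quadratic region for small errors
    return delta * r - 0.5 * delta * delta     # linear region limits outlier influence
```

With δ = 20, an error of 5 m gives a loss of 12.5 (quadratic branch), whereas an error of 50 m gives 800 (linear branch) instead of the 1250 that a pure squared loss would assign.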

The first part of the model consists of two sequential fully connected blocks of 512 neurons each. Each block includes a linear (Dense) transformation, followed by batch normalization, a ReLU activation function, and a Dropout layer for regularization. A residual connection is implemented between these two blocks: the output of the first block is summed with the output of the second block. After the residual block, the data passes through the regression part of the network, consisting of a 256-neuron layer with ReLU activation and Dropout, followed by a 128-neuron layer with ReLU. The final output layer contains two neurons without an activation function and produces the 2-D residual displacement (∆x, ∆y) in the local Cartesian coordinate system. For the MLP model, a grid search was used to optimize the key hyperparameters, including the learning rate, dropout rate, Huber loss delta parameter, and batch size.

3.6.3. k-Nearest Neighbors (kNN)

In the proposed hybrid framework, kNN [32] is used as a non-parametric residual regression model after the WCL stage. Instead of predicting absolute coordinates, kNN estimates the residual between the WCL coarse position and the ground truth by identifying the most similar training samples in the standardized feature space and averaging their residual vectors. The number of neighbors, distance metric, and weighting scheme are selected via cross-validation. By operating on the WCL-based prior, kNN learns a reduced correction range, which improves stability compared to direct coordinate regression.
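A minimal sketch of kNN residual regression is given below. It assumes Euclidean distance and uniform weights, whereas the study selects the metric, weighting scheme, and number of neighbors by cross-validation; the function names are hypothetical.

```python
import math


def standardize(train_rows, rows):
    """z-score rows using the mean and standard deviation of train_rows only."""
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(d)]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train_rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def knn_residual(query, train_feats, train_residuals, k=3):
    """Average the residual vectors of the k nearest training samples."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(query, train_feats[i]))
    nearest = order[:k]
    return (sum(train_residuals[i][0] for i in nearest) / len(nearest),
            sum(train_residuals[i][1] for i in nearest) / len(nearest))
```

The returned residual is then added to the WCL estimate of the query sample to obtain the final position.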

3.6.4. Extreme Gradient Boosting (XGBoost)

In the proposed hybrid framework, XGBoost [33] is employed as a tree-based ensemble model to learn the residual mapping between the WCL coarse estimate and the ground-truth coordinates. Rather than predicting absolute positions, XGBoost estimates the residual displacement in the local Cartesian space, which is subsequently added to the WCL output to obtain the final localization result. For multi-output regression, two boosted models are trained using a MultiOutputRegressor wrapper. The main hyperparameters, including the number of trees, learning rate, maximum depth, and subsampling ratios, are optimized through Bayesian optimization on the training set to ensure stable generalization. This residual formulation enables XGBoost to capture nonlinear dependencies while remaining anchored to the physically grounded WCL prior.

3.6.5. Light Gradient Boosting Machine (LightGBM)

LightGBM is a gradient boosting framework based on decision trees that employs a leaf-wise tree growth strategy for improved computational efficiency and modeling flexibility [34]. Within the proposed hybrid framework, LightGBM is used as a residual regression model following the WCL stage. Instead of predicting absolute coordinates, it learns the residual displacement between the WCL coarse estimate and the ground-truth position in the local Cartesian space. For multi-output regression, two LightGBM models are trained via a MultiOutputRegressor wrapper, with key hyperparameters optimized through Bayesian optimization.

The hyperparameters and training settings used for the WCL, MLP, kNN, XGBoost, and LightGBM models are summarized in Table 3.

3.7. Performance Evaluation Metrics

Model performance is evaluated using the geodesic distance error between predicted and ground-truth GPS coordinates. Since the models operate in a local Cartesian coordinate system, predicted positions are converted back to latitude and longitude before computing the Haversine distance:

d (y_{i}, {\hat{y}}_{i}) = 2 R \cdot \arcsin (\sqrt{\sin^{2} (\frac{∆ φ}{2}) + \cos (φ) \cos (\hat{φ}) \sin^{2} (\frac{∆ λ}{2})})

(15)

where R denotes the Earth radius, φ and λ represent latitude and longitude in radians, and ∆φ and ∆λ are the latitude and longitude differences between the ground-truth and predicted positions. Localization accuracy is summarized using the mean error, median error, R² score, and the cumulative distribution function (CDF) of positioning errors. The CDF is particularly important in localization studies, as it characterizes the full distribution of errors, so that a few large outliers do not disproportionately influence the evaluation of system reliability.
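Equation (15) and the summary statistics can be sketched in a few lines of standard-library Python. The Earth radius value and the helper names are assumptions for the example, and the 100 m threshold is only one illustrative point on the empirical CDF.

```python
import math
from statistics import mean, median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees (Eq. 15)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))  # clamp rounding error


def cdf_at(errors, threshold):
    """Empirical CDF: fraction of samples with error <= threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


def summarize(errors):
    """Mean, median, and one CDF point for a list of errors in metres."""
    return {"mean": mean(errors), "median": median(errors), "cdf_100m": cdf_at(errors, 100.0)}
```

For example, errors of 50 m, 100 m, and 450 m give a mean of 200 m but a median of 100 m, which shows how a single large error inflates the mean while leaving the median and the lower part of the CDF unchanged.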

The Friedman test was used to evaluate whether the compared localization models exhibited statistically significant overall differences in sample-wise localization error [35]. This test is a non-parametric alternative to repeated-measures ANOVA and is suitable for multiple related samples, since all models were evaluated on the same test instances and the resulting error distributions were non-normal. The test was performed separately for the random-splitting and spatial-splitting scenarios using only the common test samples available for all compared models within each split. For each test sample, model errors were converted to ranks, and the Friedman statistic was computed as:

Q = \frac{12 n}{k (k + 1)} \sum_{j = 1}^{k} {(\bar{r_{j}} - \frac{k + 1}{2})}^{2}

(16)

where

n

is the number of common test samples,

k

is the number of models, and

\bar{r_{j}}

is the mean rank of the

j

-th model.

\bar{r_{j}} = \frac{1}{n} \sum_{i = 1}^{n} r_{i, j}

(17)

where

r_{i, j}

denotes the rank of the

j

-th model for the

i

-th test sample. Under the null hypothesis of equal performance, the statistic approximately follows a chi-square distribution with

k - 1

degrees of freedom. A

p

-value below 0.05 was considered to indicate statistically significant overall differences among the models. Lower mean-rank values indicate better overall localization performance.
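The computation in Equations (16) and (17) can be sketched as follows. The sketch assigns average ranks to tied errors and omits the tie correction that statistical packages typically apply to Q; the p-value, which requires the chi-square survival function, is also left out. The function names are hypothetical.

```python
def rank_row(errors):
    """Ranks within one test sample (1 = smallest error); ties share the average rank."""
    order = sorted(range(len(errors)), key=lambda j: errors[j])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1          # average of the 1-based positions pos..end
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    return ranks


def friedman(error_matrix):
    """Return (Q, mean_ranks) for an n x k matrix of per-sample model errors."""
    n, k = len(error_matrix), len(error_matrix[0])
    ranks = [rank_row(row) for row in error_matrix]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]   # Eq. (17)
    q = 12 * n / (k * (k + 1)) * sum((m - (k + 1) / 2) ** 2 for m in mean_ranks)  # Eq. (16)
    return q, mean_ranks
```

As a small check, three test samples on which three models are always ordered the same way give mean ranks of 1, 2, and 3 and Q = 6, the maximum possible value n(k − 1) for n = 3 and k = 3.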

4. Results and Discussion

This section presents a comprehensive evaluation of the proposed hybrid framework under both spatial and random splitting strategies. The analysis examines localization accuracy using mean error, median error, and R² metrics, as well as the spatial distribution of prediction errors across the deployment area. A spatial split is employed to evaluate the model’s generalization capability to previously unseen geographic regions, while a random split is used as a baseline reference scenario.

Table 4 presents a comparative evaluation of baseline ML models and their WCL-enhanced counterparts for outdoor LoRaWAN localization. The results demonstrate that the integration of WCL as a coarse localization prior leads to a substantial improvement in positioning accuracy for all considered models. In particular, the proposed WCL + MLP approach achieves the lowest mean and median localization errors of 160.47 m and 73.78 m, respectively, along with the highest R² score of 0.968. Compared to the baseline MLP model, which yields a mean localization error of 226.45 m, the incorporation of WCL reduces the mean error by approximately 29%, confirming the effectiveness of the proposed hybrid approach. This significant reduction highlights the advantage of combining physics-based coarse localization with data-driven learning, resulting in improved robustness and accuracy under heterogeneous and sparse LoRaWAN measurement conditions. Table 4 also reports the standard deviation of localization error (Std), the maximum localization error (Max), and the huge-error rate for errors above 1000 m, which provide a more explicit characterization of error dispersion and severe failure cases. A limited number of very large localization errors is still present for all compared methods, which explains the relatively high Std value. The best-performing configuration of MLP was obtained with a learning rate of 0.0012, dropout of 0.2, Huber delta of 20, and a batch size of 256.

Table 5 presents the localization performance of the ML models and their versions with WCL under spatial splitting. The best results are achieved by the WCL + MLP model, with a mean error of 157.95 m, a median error of 55.59 m, and the highest R² value (0.962). Compared to the pure MLP model (218.90 m), the addition of WCL reduces the mean error by approximately 28%, confirming the effectiveness of the hybrid approach. Overall, integrating WCL improves both accuracy and robustness, with WCL + MLP demonstrating the most reliable performance among all evaluated methods. Table 5 further reports the standard deviation of localization error (Std), the maximum localization error (Max), and the huge-error rate for errors above 1000 m, providing a more detailed view of error dispersion and extreme failure behavior under geographically unseen conditions. Although a small number of very large localization errors remain, which leads to relatively high Std values across all models, the proposed WCL + MLP model still shows a lower spread of errors and a lower frequency of severe failures than the compared methods.
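The relative reductions quoted for Tables 4 and 5 follow directly from the reported mean errors, as the short check below shows; the helper name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Mean errors in metres, taken from the text (MLP vs. WCL + MLP).
random_split = relative_reduction(226.45, 160.47)
spatial_split = relative_reduction(218.90, 157.95)
```

This gives about 29.1% for the random split and 27.8% for the spatial split, consistent with the rounded values of 29% and 28%.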

While Table 4 and Table 5 were introduced to characterize uncertainty and severe failure cases, they also allow an empirical interpretation of the residual learning effect. The reduction in mean and median localization error suggests improved correction of systematic error, whereas the lower standard deviation and lower huge-error rate indicate reduced dispersion and fewer severe failures. Thus, the residual formulation appears to improve both bias-related and variance-related components of localization error.

To further confirm the claim of reduced effective training complexity, loss curves were plotted for the baseline MLP and the proposed WCL + MLP model (Figure 6).

The results show that WCL + MLP converges more favorably and reaches a lower loss value. This supports the view that introducing a physically based WCL prior makes the second-stage problem more structured and easier to learn.

Figure 7 shows the cumulative distribution function (CDF) of localization errors for the baseline machine learning models and their WCL-enhanced variants under random splitting. The WCL-enhanced models generally produce left-shifted curves, indicating smaller localization errors over most of the distribution.

This trend is particularly noticeable near the 90th percentile, where the best WCL-enhanced models reach substantially lower error thresholds than their corresponding baselines. Specifically, 90% of errors are below 409.93 m for WCL + MLP, 464.94 m for WCL + LightGBM, and 445.81 m for WCL + XGBoost, compared to 558.84 m, 595.99 m, and 594.70 m for the corresponding baseline models. The improvement for kNN is smaller, but the WCL variant still reduces the 90% error threshold from 661.09 m to 619.83 m. These results demonstrate that the WCL prior improves not only the central portion of the error distribution but also performance in the more difficult cases. As a result, the hybrid variants are more robust, with fewer large localization errors.

Figure 8 shows the cumulative distribution function of localization errors for the baseline machine learning models and their WCL-enhanced variants under spatial splitting. This setting is more demanding because the test data come from geographically unseen regions, which makes the results more informative. Under these conditions, the WCL-enhanced models perform better, supporting the use of a physically based prior before the residual correction step. The most pronounced improvement is observed for WCL + MLP, whose curve remains consistently shifted to the left relative to the baseline MLP model. Specifically, 90% of errors are less than 404.84 m for WCL + MLP, compared to 538.83 m for MLP. A similarly strong effect is observed for LightGBM, where the 90% error threshold decreases from 729.15 m to 535.52 m after incorporating WCL. For XGBoost, the gain is smaller but still present, with 90% of errors being less than 530.52 m for WCL + XGBoost compared to 532.79 m for the baseline model. These results indicate that the WCL prior is particularly useful for reducing the frequency of large localization errors in previously unobserved urban areas.

A different behavior is observed for kNN. While the baseline kNN model remains competitive in terms of its error distribution, the WCL-enhanced variant does not show the same consistent advantage as the neural and boosted models. Near the 90th percentile, the error threshold changes only slightly, from 681.93 m for kNN to 664.64 m for WCL + kNN. This suggests that the advantage of the WCL prior depends not only on the quality of the rough estimate but also on how effectively the regressor can exploit this prior. Overall, under spatial splitting, the WCL prior is most effective for MLP and LightGBM, while WCL + MLP provides the most stable and robust performance among all the compared models.

The CDF analysis and the detailed percentile error values for the random and spatial splitting configurations demonstrate that the integration of WCL shifts the entire error distribution toward lower values in both evaluation scenarios. In both geographically overlapping and geographically disjoint configurations, the WCL-enhanced models exhibit a higher proportion of localization results within smaller error thresholds, reduced variability, and improved robustness.

To visualize the spatial behavior of localization errors and identify potential outliers, Figure 9b presents the geographic distribution of positioning errors for the best-performing model, WCL + MLP. The map illustrates the spatial distribution of errors under the random split configuration across the urban area. The results show that in most parts of the city, the localization error predominantly falls within the 100–300 m range, indicating stable performance in both central and peripheral regions. A significant number of points exhibit errors below 100 m, demonstrating the model’s ability to achieve high positioning accuracy under favorable signal propagation conditions and sufficient gateway visibility. These low-error estimates are spatially distributed throughout the study area rather than concentrated within a single cluster. Only a limited number of points correspond to large-error cases exceeding 1000 m, and such extreme errors do not form pronounced spatial clusters, indicating that the hybrid framework effectively mitigates severe outliers.

In contrast, Figure 9a presents the spatial error distribution for the MLP model. The map reveals a noticeably larger proportion of errors in the 300–500 m and 500–1000 m ranges, particularly within dense urban areas. Compared to the hybrid approach, the pure MLP model exhibits a higher concentration of medium-to-large deviations and a more heterogeneous spatial error pattern. Large-error regions appear more widespread, suggesting reduced robustness to complex propagation conditions.

Figure 10a illustrates the spatial distribution of localization errors for the MLP model under spatial splitting. Compared to the random splitting scenario, the spatial error pattern appears somewhat more structured, with a moderate reduction in regions characterized by medium errors in the 300–500 m range.

However, noticeable clusters of medium and large deviations persist, particularly in dense urban areas and peripheral zones, indicating sensitivity to heterogeneous propagation conditions in the absence of WCL. Figure 10b presents the results for the WCL + MLP model. Under spatial splitting, the hybrid framework demonstrates an even more consistent error pattern than in the random split scenario. The majority of localization errors are concentrated within the 100–300 m range, while areas with errors below 100 m remain widely distributed across the deployment region. Importantly, large-error cases exceeding 1000 m are rare and do not form concentrated clusters in specific areas.

The results obtained under random and spatial data splitting strategies indicate a reduction in medium and large localization errors and a more geographically consistent distribution of prediction errors. While random splitting reflects statistical prediction accuracy, spatial splitting provides a more rigorous evaluation of generalization capability to previously unseen regions. Together, both protocols confirm the reliability and stability of the proposed WCL-based approach.

To further support the comparative analysis of localization performance, a non-parametric statistical evaluation was conducted using the Friedman test. The corresponding results are summarized in Table 6 and Table 7. Table 6 reports the overall significance of performance differences among the compared models for the random and spatial splitting scenarios, while Table 7 presents the mean ranks derived from the Friedman procedure, which provide an interpretable measure of the relative performance of each method across the test samples.

Table 6 shows that the Friedman test identifies statistically significant overall differences among the compared models in both evaluation scenarios (

p < 0.001

). This result indicates that the observed variation in localization error is not random and confirms that the models do not perform equivalently under either random or spatial splitting.

Table 7 presents the mean ranks of the compared methods, where lower values correspond to better overall performance. In both scenarios, WCL + MLP achieves the lowest mean rank, indicating the most consistent localization accuracy among all evaluated models. Under random splitting, WCL + XGBoost and kNN also demonstrate relatively strong performance, whereas the LightGBM and XGBoost models obtain the highest mean ranks. Under spatial splitting, WCL + MLP remains the best-performing method, while XGBoost becomes the strongest baseline model. In contrast, WCL + kNN yields the highest mean rank in the spatial setting, suggesting the weakest overall performance in that scenario.

Overall, the statistical analysis confirms that the compared models differ significantly in their localization behavior, and the mean-rank results further indicate that WCL + MLP is the most robust and consistently accurate method across both evaluation settings. This finding supports the effectiveness of combining WCL with residual learning based on MLP for outdoor LoRaWAN localization.

To assess the robustness of the WCL stage, a parametric analysis was performed using three key parameters: τ, top_strong, and min_gw. The results, presented in Figure 11, demonstrate that the accuracy depends on the choice of parameters but remains quite robust over the studied range. For min_gw = 300, the best result was obtained with τ = 10 and top_strong = 0.05, with a median error of 74.04 m, while for min_gw = 500, the minimum was observed at τ = 8 and top_strong = 0.15, with a median error of 73.78 m.

Overall, the analysis showed that a more stringent min_gw threshold, a moderate τ value, and a wider proportion of strong RSSI observations provide the most robust initial coordinate estimate.

Table 8 presents the results of an additional ablation analysis in terms of mean localization error for the full model and its simplified variants. As can be seen, the full WCL + MLP residual model achieves the lowest mean localization error of 160.47 m. When the residual correction is replaced by direct coordinate refinement in the WCL + MLP direct refine variant, the mean error increases to 230.73 m, indicating that the improvement is not due only to the introduction of the WCL prior, but specifically to the residual learning formulation. Removing statistical features, SNR, and the gateway observation mask further degrades the performance to 230.98 m, 239.13 m, and 261.75 m, respectively. Using WCL only results in a much larger mean localization error of 564.45 m, confirming that WCL in the proposed approach serves as a coarse, physically interpretable initial estimate rather than a sufficiently accurate standalone localization method. Overall, the results show that the best performance is achieved only when the WCL prior, residual correction, and complete feature representation are used jointly.

Figure 12 presents a group permutation importance analysis for the trained hybrid model. The gateway observation mask has the greatest impact on performance: permuting it increases the median localization error by 306.82 m. RSSI and SNR are the next most important feature groups, while aggregated statistical features make an additional, but smaller, contribution. SF has the least impact in the current configuration. These results indicate that the explicit gateway observation mask is a key factor determining the performance of the proposed method.
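Group permutation importance can be sketched as follows. All columns of a feature group are shuffled across samples with the same permutation, and the increase in median error relative to the unpermuted data is recorded. The callable `errors_for`, which maps a feature matrix to per-sample localization errors, is a hypothetical wrapper around the trained model, and a single shuffle per group is assumed here rather than an average over repeats.

```python
import random
from statistics import median


def group_permutation_importance(errors_for, X, groups, seed=0):
    """Increase in median error after jointly permuting each feature group.

    errors_for: callable taking a list of feature rows and returning per-sample errors.
    X: list of feature rows (lists of floats).
    groups: dict mapping a group name to the list of column indices in that group.
    """
    rng = random.Random(seed)
    base = median(errors_for(X))
    importance = {}
    for name, cols in groups.items():
        perm = list(range(len(X)))
        rng.shuffle(perm)
        X_perm = [row[:] for row in X]          # copy so the original data stay intact
        for i, src in enumerate(perm):
            for c in cols:                      # same permutation for every column in the group
                X_perm[i][c] = X[src][c]
        importance[name] = median(errors_for(X_perm)) - base
    return importance
```

Permuting a whole group at once, rather than one column at a time, keeps correlated features such as the per-gateway mask entries consistent with each other while breaking their link to the target.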

In terms of computational efficiency, the WCL + MLP hybrid model exhibits only a moderate increase in cost compared to the baseline MLP. Specifically, training time increases from 276.94 s to 312.49 s, and inference time from 1.5522 s to 1.5806 s. The latency per sample increases only marginally, from 0.1414 s to 0.1440 s, and the model size grows from 1.94 MB to 2.48 MB. For the fastest model, XGBoost, the latency is 8.61 ms, compared to 8.65 ms for WCL + XGBoost. Both execution time and memory footprint are critical practical factors for real-world localization systems, especially when models must operate with limited computing resources or in near-real-time settings. The observed increases are small, so the addition of the WCL-based coordinate prior can be considered an acceptable tradeoff for the corresponding improvement in localization accuracy. All experiments were conducted on a system with an Intel Core i7-12700H (12th Gen, 2.30 GHz) processor and 16 GB RAM (4800 MT/s).

A limitation of the present evaluation is that the use of a single dataset does not fully establish the generalizability of the proposed method to other urban environments. Nevertheless, large open datasets for urban outdoor LoRaWAN localization remain very limited, which is why the Antwerp dataset continues to serve as one of the main public references in this area. In addition, the benchmark itself contains a source of irreducible uncertainty, since the geographic labels were collected from moving vehicles and may not perfectly match the true transmitter position at the exact reception moment. Therefore, part of the observed localization error should be interpreted as benchmark label uncertainty rather than purely algorithmic error.

In addition to the quantitative comparison, the obtained results should also be interpreted from the perspective of practical applicability. The proposed method is aimed at large-scale low-power IoT applications, including asset logistics, environmental monitoring, and smart city systems, which represent typical LoRaWAN deployment scenarios. In such systems, the use of GNSS is technically possible; however, when thousands of battery-powered devices are involved, it often proves to be economically and energetically inefficient.

From this perspective, the achieved improvement has practical significance. In dense urban environments, localization errors on the order of several hundred meters are often too coarse for practical use, since they may assign a device to the wrong facility, administrative zone, or road segment. A reduction in the mean error and, especially, a reduction in the median error to below 100 m increases the usefulness of the system by enabling more reliable localization at the zone or object level. This is particularly important for distributed air-quality monitoring, municipal infrastructure supervision, and low-cost asset tracking, where the goal is not navigation-grade accuracy, but correct assignment of measurements or assets to the corresponding operational zone. Thus, the proposed method narrows the gap between costly GNSS-based positioning and traditional coarse localization based on LPWAN radio measurements.

5. Conclusions

This work presents a hybrid outdoor localization system for LoRaWAN networks that combines WCL with ML regression. The proposed approach was compared with several models, including MLP, kNN, XGBoost, and LightGBM, under both random and spatial data partitioning strategies. The results demonstrate that incorporating a physically grounded WCL prior significantly enhances model stability and generalization capability. In contrast to direct-regression models, the hybrid architecture benefits from a geometrically informed initial estimate, which reduces large localization errors and improves robustness under non-uniform signal propagation conditions in urban environments.

In conclusion, the study underscores the importance of integrating interpretable physical models with ML techniques for scalable and reliable outdoor localization in LoRaWAN networks. Future research will focus on cross-environment generalization capability, adaptive mechanisms for dynamically varying signal propagation conditions, and a more detailed investigation of spatial split settings and density-aware generalization behavior. It is also worth noting that a dedicated sensitivity analysis of the grid size can be included as a direction for future work.

Author Contributions

Conceptualization, A.B. and B.Z.; methodology, A.B. and B.Z.; software, M.N. and G.D.; validation, D.A. and A.S.; formal analysis, A.K. and G.D.; investigation S.O. and G.D.; writing—original draft preparation, A.B. and B.Z.; writing—review and editing, Ö.F.B. and A.S.; visualization, B.Y. and S.O.; supervision, A.S.; project administration, M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dritsas, E.; Trigka, M. Machine Learning for Blockchain and IoT Systems in Smart Cities: A Survey. Future Internet 2024, 16, 324.
Taylor, P. Number of IoT connections worldwide 2022–2034. 2026. Available online: https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/ (accessed on 19 February 2026).
Amini Gougeh, R.; Zilic, Z. Systematic Review of IoT-Based Solutions for User Tracking: Towards Smarter Lifestyle, Wellness and Health Management. Sensors 2024, 24, 5939.
Chen, H.; Xing, F.; Yang, Q.; Shu, Y.; Shi, Z.; Chen, J.; Tao, Z. A Lightweight Mobile-Anchor-Based Multi-Target Outdoor Localization Scheme Using LoRa Communication. IEEE Trans. Green Commun. Netw. 2023, 7, 1607–1619.
Liu, T.; Liu, J.; Wang, J.; Zhang, H.; Zhang, B.; Ma, Y.; Sun, M.; Lv, Z.; Xu, G. Pseudolites to Support Location Services in Smart Cities: Review and Prospects. Smart Cities 2023, 6, 2081–2105.
Hu, Z.; Xu, S.; Guo, J.; Li, Z. Non-Line-of-Sight GNSS Signal Classification for Urban Navigation Based on Machine Learning: Comparison and Validation. Adv. Space Res. 2025, 75, 7817–7834.
Zholamanov, B.; Saymbetov, A.; Nurgaliyev, M.; Orynbassar, S.; Dosymbetova, G.; Kapparova, A.; Kuttybay, N.; Koshkarbay, N.; Yershov, E.; Bolatbek, A.; et al. Indoor Localization of LoRa Wireless Modules Based on RSSI Fingerprint Method Using Transfer Learning. Results Eng. 2025, 28, 108383.
Mekki, K.; Bajic, E.; Chaxel, F.; Meyer, F. A Comparative Study of LPWAN Technologies for Large-Scale IoT Deployment. ICT Express 2019, 5, 1–7.
Ertürk, M.A.; Aydın, M.A.; Büyükakkaşlar, M.T.; Evirgen, H. A Survey on LoRaWAN Architecture, Protocol and Technologies. Future Internet 2019, 11, 216.
Sherazi, H.H.R.; Grieco, L.A.; Imran, M.A.; Boggia, G. Energy-Efficient LoRaWAN for Industry 4.0 Applications. IEEE Trans. Industr. Inform. 2021, 17, 891–902.
Podevijn, N.; Plets, D.; Trogh, J.; Martens, L.; Suanet, P.; Hendrikse, K.; Joseph, W. TDoA-Based Outdoor Positioning with Tracking Algorithm in a Public LoRa Network. Wirel. Commun. Mob. Comput. 2018, 2018, 1864209.
Asaad, S.M.; Maghdid, H.S. A Comprehensive Review of Indoor/Outdoor Localization Solutions in IoT Era: Research Challenges and Future Perspectives. Comput. Netw. 2022, 212, 109041.
Nurgaliyev, M.; Bolatbek, A.; Zholamanov, B.; Saymbetov, A.; Kopbay, K.; Yershov, E.; Orynbassar, S.; Dosymbetova, G.; Kapparova, A.; Kuttybay, N.; et al. Machine Learning Based Localization of LoRa Mobile Wireless Nodes Using a Novel Sectorization Method. Future Internet 2024, 16, 450.
Jondhale, S.R.; Jondhale, A.S.; Deshpande, P.S.; Lloret, J. Improved Trilateration for Indoor Localization: Neural Network and Centroid-Based Approach. Int. J. Distrib. Sens. Netw. 2021, 17, 155014772110539.
Zholamanov, B.; Saymbetov, A.; Nurgaliyev, M.; Bolatbek, A.; Dosymbetova, G.; Kuttybay, N.; Orynbassar, S.; Kapparova, A.; Koshkarbay, N.; Beyca, Ö.F. RSSI Fingerprint-Based Indoor Localization Solutions Using Machine Learning Algorithms: A Comprehensive Review. Smart Cities 2025, 8, 153.
Keleşoğlu, N.; Halama, M.; Strzoda, A. Enhancing LoRa-Based Outdoor Localization Accuracy Using Machine Learning. IEEE Access 2025, 13, 129432–129450.
Bagherian, M.H.; Tehrani, Y.H.; Atarodi, S.M. Enhancing LoRaWAN Localization Efficiency: Leveraging Machine Learning and Optimized Parameter Tuning for a 60% Improvement. Measurement 2025, 242, 116064.
Janssen, T.; Berkvens, R.; Weyn, M. Benchmarking RSS-Based Localization Algorithms with LoRaWAN. Internet Things 2020, 11, 100235.
Aernouts, M.; Berkvens, R.; van Vlaenderen, K.; Weyn, M. Sigfox and LoRaWAN Datasets for Fingerprint Localization in Large Urban and Rural Areas. Data 2018, 3, 13.
Lutakamale, A.S.; Myburgh, H.C.; de Freitas, A. RSSI-Based Fingerprint Localization in LoRaWAN Networks Using CNNs with Squeeze and Excitation Blocks. Ad Hoc Netw. 2024, 159, 103486.
Pandangan, Z.A.; Talampas, M.C.R. Hybrid LoRaWAN Localization Using Ensemble Learning. In Proceedings of the 2020 Global Internet of Things Summit (GIoTS); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
Li, Y.; Barthelemy, J.; Sun, S.; Perez, P.; Moran, B. Urban Vehicle Localization in Public LoRaWan Network. IEEE Internet Things J. 2022, 9, 10283–10294.
Islam, K.Z.; Murray, D.; Diepeveen, D.; Jones, M.G.K.; Sohel, F. Machine Learning-based LoRa Localisation Using Multiple Received Signal Features. IET Wirel. Sens. Syst. 2023, 13, 133–150.
LoRa Alliance Geolocation Whitepaper, 2018. Available online: https://resources.lora-alliance.org/whitepapers/lora-alliance-geolocation-whitepaper (accessed on 19 February 2026).
Milarokostas, C.; Tsolkas, D.; Passas, N.; Merakos, L. A Comprehensive Study on LPWANs With a Focus on the Potential of LoRa/LoRaWAN Systems. IEEE Commun. Surv. Tutor. 2023, 25, 825–867.
Yu, W.; Yu, C. Application of Weighted Centroid Algorithm Based on Weight Correction in Node Localization of Wireless Sensor Networks. Sci. Rep. 2025, 15, 23400.
Magowe, K.; Giorgetti, A.; Kandeepan, S.; Yu, X. Accurate Analysis of Weighted Centroid Localization. IEEE Trans. Cogn. Commun. Netw. 2019, 5, 153–164.
Abbas, A.M. Analysis of Weighted Centroid-Based Localization Scheme for Wireless Sensor Networks. Telecommun. Syst. 2021, 78, 595–607.
Pinkus, A. Approximation Theory of the MLP Model in Neural Networks. Acta Numer. 1999, 8, 143–195.
Mahdavi, F.; Zayyani, H.; Rajabi, R. RSS Localization Using an Optimized Fusion of Two Deep Neural Networks. IEEE Sens. Lett. 2021, 5, 7501104.
Liu, H.; Fan, K.; He, B.; Wang, W. Unmanned Aerial Vehicle Acoustic Localization Using Multilayer Perceptron. Appl. Artif. Intell. 2021, 35, 537–548.
Zhang, M.-L.; Zhou, Z.-H. ML-KNN: A Lazy Learning Approach to Multi-Label Learning. Pattern Recognit. 2007, 40, 2038–2048.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems; NeurIPS Foundation: Long Beach, CA, USA, 2017; Volume 30.
Pereira, D.G.; Afonso, A.; Medeiros, F.M. Overview of Friedman’s Test and Post-Hoc Analysis. Commun. Stat. Simul. Comput. 2015, 44, 2636–2653.

Figure 1. LoRaWAN network architecture.

Figure 2. The architecture of the proposed hybrid localization method.

Figure 3. LoRaWAN dataset collected in Antwerp city center. The blue dots represent the real GPS coordinates [19].

Figure 4. Distribution of RSSI values in the LoRaWAN dataset.

Figure 5. Architecture of an MLP.

Figure 6. Loss curves of the baseline MLP and the proposed WCL + MLP model: (a) training loss; (b) validation loss.

Figure 7. CDF of localization errors for ML models and hybrid model with WCL under random splitting.

Figure 8. CDF of localization errors for ML models and hybrid model with WCL under spatial splitting.

Figure 9. Spatial distribution of localization errors under the random split: (a) MLP; (b) WCL + MLP.

Figure 10. Spatial distribution of localization errors under the spatial split: (a) MLP; (b) WCL + MLP.

Figure 11. Heat map of WCL parameter sensitivity.

Figure 12. Group-wise permutation importance analysis.

Table 2. Input features used for model training.

Category	Input Features	Source	Description
Radio signal	RSSI, SNR	Obtained from gateway receptions	Characterize radio channel conditions
Transmission	SF	Extracted from LoRaWAN metadata	Represent transmission configuration
Gateway observation	Gateway observation mask	Derived from gateway reception patterns	Encodes spatial visibility of the transmitted packet
Statistical	Mean, minimum, maximum, standard deviation	Computed during preprocessing	Provide statistical summaries of multi-gateway observations
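
As a rough illustration of the gateway observation mask and the statistical features in Table 2, the sketch below derives them from the per-gateway RSSI vector of a single packet. The sentinel value `NO_SIGNAL = -200.0`, the function name `summarize_packet`, and the choice of the population standard deviation are assumptions made for this example and are not confirmed by the article.

```python
from statistics import mean, pstdev

NO_SIGNAL = -200.0  # assumed placeholder for "gateway did not receive the packet"


def summarize_packet(rssi_per_gateway):
    """Return (mask, stats) for one packet's per-gateway RSSI readings in dBm."""
    # Gateway observation mask: 1 where the gateway heard the packet, else 0.
    mask = [0 if r <= NO_SIGNAL else 1 for r in rssi_per_gateway]
    heard = [r for r, m in zip(rssi_per_gateway, mask) if m]
    if not heard:
        raise ValueError("packet was not received by any gateway")
    stats = {
        "mean": mean(heard),
        "min": min(heard),
        "max": max(heard),
        "std": pstdev(heard),  # population std; a sample std is equally plausible
    }
    return mask, stats


# Hypothetical packet heard by three of five gateways.
mask, stats = summarize_packet([-200.0, -105.0, -98.0, -200.0, -113.0])
print(mask)
print(stats)
```

Computing the statistics only over receiving gateways keeps the placeholder value from dominating the mean and minimum; whether the authors did the same is not stated here.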

Table 3. Hyperparameters and training settings used for the evaluated models.

Model	Hyperparameter Search
WCL	τ ∈ {6.0, 8.0, 10.0}; top_strong ∈ {0.05, 0.10, 0.15}; min_gw ∈ {300, 500}
MLP	learning_rate ∈ {5 × 10⁻⁴, 8 × 10⁻⁴, 1.2 × 10⁻³}; dropout ∈ {0.10, 0.20, 0.30}; delta ∈ {20.0, 30.0, 50.0}; batch_size ∈ {128, 256}
kNN	n_neighbors ∈ {5, 10, 20, 30}; weights ∈ {uniform, distance}
XGBoost	max_depth ∈ [4, 10]; learning_rate ∈ [0.01, 0.1] (log scale); subsample ∈ [0.6, 1.0]; colsample_bytree ∈ [0.6, 1.0]; min_child_weight ∈ [1, 10]; gamma ∈ [0.0, 5.0]; reg_lambda ∈ [0.1, 10.0] (log scale); reg_alpha ∈ [0.0, 5.0]
LightGBM	num_leaves ∈ [32, 256]; learning_rate ∈ [0.01, 0.1]; max_depth ∈ [3, 12]; min_child_samples ∈ [5, 50]; subsample ∈ [0.6, 1.0]; colsample_bytree ∈ [0.6, 1.0]
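
To give a sense of the size of the discrete search spaces in Table 3, the following sketch enumerates every WCL and MLP configuration. It assumes an exhaustive grid over the listed values; the article does not state here which search strategy was actually used, and the helper name `expand` is hypothetical.

```python
from itertools import product

# WCL search space as listed in Table 3.
wcl_grid = {
    "tau": [6.0, 8.0, 10.0],
    "top_strong": [0.05, 0.10, 0.15],
    "min_gw": [300, 500],
}

# MLP search space as listed in Table 3.
mlp_grid = {
    "learning_rate": [5e-4, 8e-4, 1.2e-3],
    "dropout": [0.10, 0.20, 0.30],
    "delta": [20.0, 30.0, 50.0],
    "batch_size": [128, 256],
}


def expand(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


wcl_candidates = expand(wcl_grid)
mlp_candidates = expand(mlp_grid)
print(len(wcl_candidates), len(mlp_candidates))
```

Under this assumption there are 18 WCL configurations and 54 MLP configurations, so tuning both stages jointly would require up to 972 runs, whereas tuning them one after the other requires 72.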

Table 4. Localization performance of the ML models and hybrid model with WCL under random splitting.

Method	Mean Error (m)	Median Error (m)	Std (m)	Max (m)	Error > 1000 m (%)	R² Score
WCL + MLP	160.47	73.78	251.04	4350.76	1.55	0.968
WCL + kNN	249.63	152.65	307.38	3367.04	3.25	0.944
WCL + LightGBM	201.45	135.41	238.73	3815.18	1.28	0.966
WCL + XGBoost	190.09	122.15	233.76	3604.47	1.23	0.967
MLP	226.45	131.43	290.33	3845.81	2.65	0.950
kNN	256.56	138.63	348.09	4537.04	4.18	0.932
LightGBM	261.03	183.99	288.97	3977.26	2.58	0.943
XGBoost	259.65	182.27	286.02	3993.79	2.52	0.944
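
The relative gain from adding the WCL prior can be checked directly from the mean errors in Table 4. The short calculation below is only a worked example on the reported values; the function name `relative_reduction` is introduced here for illustration.

```python
# Mean localization errors (m) under random splitting, copied from Table 4.
mean_error = {"MLP": 226.45, "kNN": 256.56, "LightGBM": 261.03, "XGBoost": 259.65}
mean_error_wcl = {"MLP": 160.47, "kNN": 249.63, "LightGBM": 201.45, "XGBoost": 190.09}


def relative_reduction(baseline, hybrid):
    """Percentage by which the hybrid mean error is lower than the baseline."""
    return 100.0 * (baseline - hybrid) / baseline


reductions = {
    model: relative_reduction(mean_error[model], mean_error_wcl[model])
    for model in mean_error
}
for model, pct in reductions.items():
    print(f"WCL + {model}: {pct:.1f}% lower mean error than {model} alone")
```

For the MLP this gives roughly 29%, matching the figure quoted in the abstract. The gain is about 27% for XGBoost and 23% for LightGBM, but under 3% for kNN.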

Table 5. Localization performance of the ML models and hybrid model with WCL under spatial splitting.

Method	Mean Error (m)	Median Error (m)	Std (m)	Max (m)	Error > 1000 m (%)	R² Score
WCL + MLP	157.95	55.59	237.12	4564.52	1.49	0.962
WCL + kNN	262.58	123.31	324.89	5734.09	3.72	0.919
WCL + LightGBM	220.74	112.23	268.91	5750.24	2.04	0.942
WCL + XGBoost	217.02	110.65	266.71	5987.5	2.06	0.943
MLP	218.90	109.10	274.23	5621.16	2.12	0.941
kNN	257.39	85.72	365.49	5210.43	4.65	0.909
LightGBM	365.91	283.64	302.82	2854.76	4.16	0.915
XGBoost	219.90	111.86	273.73	5965.13	2.22	0.940

Table 6. Friedman test results for overall comparison of localization models.

Split	n	k	Q	p
random	10,975	8	7890.852	<0.001
spatial	8348	8	3813.087	<0.001

Table 7. Mean rank of localization models obtained from the Friedman test.

Method	Mean Rank for Random Split	Mean Rank for Spatial Split
WCL + MLP	3.237	3.163
WCL + kNN	4.533	5.415
WCL + LightGBM	4.280	4.594
WCL + XGBoost	3.765	4.518
MLP	5.030	4.547
kNN	4.261	4.681
LightGBM	5.483	4.688
XGBoost	5.407	4.391
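
The Friedman statistic in Table 6 can be approximately recovered from the mean ranks in Table 7 with the standard formula Q = 12n / (k(k + 1)) · Σ (R̄_j − (k + 1)/2)², where n is the number of test samples and k the number of methods. The sketch below applies this formula without a tie correction; it is a consistency check on the reported values, not a reproduction of the authors' procedure.

```python
def friedman_q(mean_ranks, n):
    """Friedman statistic from per-method mean ranks over n blocks (no tie correction)."""
    k = len(mean_ranks)
    expected = (k + 1) / 2.0  # mean rank of every method under the null hypothesis
    spread = sum((r - expected) ** 2 for r in mean_ranks)
    return 12.0 * n / (k * (k + 1)) * spread


# Mean ranks copied from Table 7, in the row order of that table.
random_ranks = [3.237, 4.533, 4.280, 3.765, 5.030, 4.261, 5.483, 5.407]
spatial_ranks = [3.163, 5.415, 4.594, 4.518, 4.547, 4.681, 4.688, 4.391]

# Sample counts n copied from Table 6.
q_random = friedman_q(random_ranks, n=10975)
q_spatial = friedman_q(spatial_ranks, n=8348)
print(f"random split:  Q = {q_random:.1f} (reported 7890.852)")
print(f"spatial split: Q = {q_spatial:.1f} (reported 3813.087)")
```

Both values agree with Table 6 to within about 1%. The remaining gap is plausibly due to tie handling and the rounding of the mean ranks to three decimals, although that explanation is an assumption.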

Table 8. Results of ablation analysis.

Methods	WCL + MLP Residual	WCL + MLP Direct Refine	WCL + MLP No Stats	WCL + MLP No SNR	WCL + MLP No Mask	WCL Only	MLP Only
Mean error (m)	160.47	230.73	230.98	239.13	261.75	564.45	226.45
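
The residual variant in Table 8 trains the regressor on the offset between the true position and the WCL prior rather than on absolute coordinates. The sketch below shows how such a target could be formed. The softmax-style weighting with temperature `tau`, the function name `wcl_estimate`, and the gateway layout are all assumptions for illustration; the exact WCL weighting used in the article may differ.

```python
import math


def wcl_estimate(gateways, rssi, tau=8.0):
    """Weighted centroid of gateway positions, weighted by exp(RSSI / tau)."""
    strongest = max(rssi)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((r - strongest) / tau) for r in rssi]
    total = sum(weights)
    x = sum(w * gx for w, (gx, _) in zip(weights, gateways)) / total
    y = sum(w * gy for w, (_, gy) in zip(weights, gateways)) / total
    return x, y


# Hypothetical gateway positions in local metres and RSSI readings in dBm.
gateways = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
rssi = [-95.0, -110.0, -103.0]
true_position = (220.0, 310.0)

prior = wcl_estimate(gateways, rssi)
# Residual target: what the regressor is trained to predict on top of the prior.
residual_target = (true_position[0] - prior[0], true_position[1] - prior[1])
# At inference time the predicted residual is added back to the WCL prior;
# a perfect prediction is used here only to show the bookkeeping.
refined = (prior[0] + residual_target[0], prior[1] + residual_target[1])
print(prior, residual_target, refined)
```

Because the prior is a convex combination of gateway positions, it always lies inside their convex hull, so the regressor only has to learn a bounded correction. This may help explain why, in Table 8, the residual variant (160.47 m) outperforms both WCL alone (564.45 m) and the MLP alone (226.45 m).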

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bolatbek, A.; Beyca, Ö.F.; Zholamanov, B.; Nurgaliyev, M.; Dosymbetova, G.; Almen, D.; Saymbetov, A.; Yertaikyzy, B.; Orynbassar, S.; Kapparova, A. A Physically Aware Residual Learning Framework for Outdoor Localization in LoRaWAN Networks. Future Internet 2026, 18, 216. https://doi.org/10.3390/fi18040216

