3.2. Datasets
To capture diverse propagation conditions, a collection of publicly available datasets from both peer-reviewed studies and open repositories covering atmospheric, smart campus, indoor building, indoor library, and landslide scenarios is employed for training and evaluation of machine learning models for path loss prediction. All models use transmitter–receiver distance and carrier frequency as input features to ensure a consistent and comparable feature space across datasets. The datasets include both empirical measurements and synthetic channel data generated using the NYUSIM channel simulator. The synthetic dataset is used to emulate high-frequency propagation conditions (mmWave) that are difficult to capture through large-scale measurements, while energy-aware analysis is interpreted in a comparative and scenario-based manner rather than as a hardware-specific deployment prediction. The synthetic dataset spans sub-GHz to millimeter-wave frequencies and includes multiple propagation environments characterized by varying shadowing levels, multipath effects, and line-of-sight conditions (LOS/NLOS) [35]. Combining simulated and measured datasets improves coverage of propagation regimes and supports evaluation of model generalization across heterogeneous environments.
To address potential domain shift between heterogeneous real-world and synthetic datasets, a dataset-isolated evaluation strategy is adopted. Specifically, no joint training across datasets is performed. All models are trained, validated, and evaluated independently within each dataset using a consistent cross-validation protocol, without any cross-dataset data sharing or joint optimization. This design ensures that no information leakage occurs between domains and that each dataset is treated as a distinct propagation domain with its own underlying statistical characteristics. As a result, biases arising from distribution mismatch between simulated and real-world data are mitigated by the evaluation design. Consequently, performance analysis focuses strictly on within-dataset comparisons rather than direct comparison of absolute metrics across datasets, enabling a controlled and fair assessment of model behavior under heterogeneous but internally consistent propagation conditions.
It is important to clarify that the cross-band evaluation is conducted at the level of large-scale propagation modeling, where path loss is represented in the logarithmic (dB) domain as a function of distance and frequency. Under this abstraction, different frequency regimes share a common statistical representation based on distance-dependent attenuation and log-normal variability, allowing consistent evaluation within the proposed framework without implying equivalence of the underlying microscopic fading mechanisms.
The synthetic dataset (NYUSIM) is included in a complementary manner to extend coverage to high-frequency propagation regimes, and is not intended to substitute real-world measurements, thereby avoiding cross-domain interference that could arise from distribution mismatch between simulated and real-world propagation conditions.
The evaluated models include classical propagation models [1,2], multiple machine learning regressors [10,17], ensemble methods [21,22,23,24], and deep learning architectures, including CNNs [9]. These models are selected to capture nonlinear and environment-dependent propagation behaviors.
The selected datasets were chosen to ensure a comprehensive evaluation of model performance across heterogeneous and realistic propagation conditions. Specifically, they cover diverse environments, including indoor (Indoor Building, Indoor Library), outdoor urban (Smart Campus), terrain-affected (Landslide), and high-frequency scenarios (Atmospheric dataset generated using the NYUSIM channel simulator). This selection enables the analysis of key propagation effects such as multipath, shadowing, and LOS/NLOS conditions. In addition, the datasets span multiple frequency bands from sub-GHz to millimeter-wave ranges, and include both real-world measurements and synthetically generated data, ensuring coverage of different propagation regimes while maintaining physical realism. Overall, this strategy was adopted to assess the generalization capability and robustness of the models across heterogeneous deployment scenarios rather than optimizing performance for a specific environment. In addition to model evaluation, the selected datasets enable system-level analysis by providing representative path loss distributions that can be directly mapped to link budget requirements, transmission power estimation, and energy-aware node lifetime analysis.
The datasets employed in this study represent a wide range of propagation environments relevant to heterogeneous wireless systems, including WSN, IoT, short-range wireless, and cellular communication scenarios. They include a millimeter-wave dataset generated using the NYUSIM channel simulator covering Urban Microcell scenarios at 7.125 GHz, 24.25 GHz, 52.60 GHz, and 71 GHz [35]; a smart campus dataset collected at Covenant University at 1800 MHz with 3617 measured samples obtained via drive tests [36]; two indoor datasets, one from a multi-floor academic building using LoRa in the sub-GHz band with structural descriptors such as number of floors and walls [37], and another from a 3.5 GHz measurement campaign in a large library environment with dense structural variability [38]; and a terrain-affected dataset from a landslide-prone area in Thailand at 2400 MHz with transmitter–receiver distances ranging from 0.5 m to 50 m and repeated measurements to mitigate small-scale fading effects. The authors provide additional path loss measurement and simulation data as supplemental material to the article by [39]. Each dataset implicitly corresponds to a different communication technology and deployment scenario (e.g., sub-GHz LPWAN, short-range wireless, WiFi, and cellular systems), which justifies the use of technology-specific transmission power consumption models in the subsequent energy-aware analysis. This dataset-isolated strategy ensures evaluation within statistically consistent domains without cross-domain generalization.
Together, these datasets provide a comprehensive benchmark for evaluating model generalization across diverse propagation regimes while supporting deployment-level tasks such as transmission power planning, fade margin estimation, energy consumption analysis, and node lifetime prediction under realistic and heterogeneous conditions. Although some datasets contain additional environmental descriptors, these are excluded from the learning process to maintain a unified feature space and ensure fair cross-dataset comparability without dataset-specific tuning. This design choice ensures that the learned models remain agnostic to environment-specific descriptors, allowing the derived performance and uncertainty metrics (e.g., RMSE) to be consistently interpreted across datasets for subsequent system-level analysis.
To ensure methodological rigor, reproducibility, and consistency, a unified experimental pipeline is adopted across all datasets. The feature space is restricted to transmitter–receiver distance (d) and carrier frequency (f), and all distance values are converted to the units required by each propagation model implementation. A standardized preprocessing procedure is applied [20], without dataset-specific calibration of analytical models, preserving their intrinsic formulations. Model evaluation is conducted using 5-fold cross-validation with data shuffling and a fixed random seed [19]. Performance is assessed using Mean Absolute Error (MAE), RMSE, and the coefficient of determination (R²). RMSE is additionally used as a proxy for prediction uncertainty and, under explicitly validated statistical conditions, as an empirical indicator of dispersion consistent with shadowing variability, enabling its subsequent use in fade margin estimation and energy-aware link budget analysis [1,2,15,16,19,20]. All experiments are implemented within a consistent pipeline to ensure fair benchmarking across model classes. The use of a unified preprocessing and evaluation pipeline ensures that performance differences reflect propagation characteristics rather than methodological inconsistencies, reducing potential bias when analyzing heterogeneous datasets. The implementation used Python 3.12.10 with NumPy 2.1.3, Pandas 2.2.3, SciPy 1.15.2, and scikit-learn 1.6.1 for data processing and classical machine learning models. Deep learning models were developed using TensorFlow 2.19.0, while XGBoost 3.1.1 was used for gradient boosting experiments. Visualization was performed using Matplotlib 3.10.1 and Seaborn 0.13.2. Fixing the software stack in this way supports reproducibility and consistency across all experiments.
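The evaluation protocol described above (shuffled 5-fold cross-validation with a fixed seed, scored with MAE, RMSE, and R²) can be sketched in a few lines. This is a standard-library illustration only, not the authors' code: the seed value 42 and the function names are assumptions, and in the stated software stack scikit-learn's `KFold(n_splits=5, shuffle=True, random_state=...)` would be the natural equivalent.

```python
import math
import random


def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train, test) index lists for shuffled k-fold cross-validation.

    The seed value is an assumption; the text only states that it is fixed.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # one reproducible shuffle
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equally long sequences in dB."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant
    return mae, rmse, r2
```

Each model would be fitted on the `train` indices and scored on the `test` indices of every fold, with the per-fold metrics averaged within one dataset only, in line with the dataset-isolated strategy.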
Summary statistics for each dataset, including number of samples, distance and frequency ranges, and environment type, are provided in Table 1.
For each dataset, the regression task is defined using a constrained feature space consisting exclusively of transmitter–receiver distance d (meters) and carrier frequency f (MHz) [1,2]:
$$\mathbf{x} = [d, f]$$
The target variable is the measured path loss in decibels [7]:
$$y = PL_{\mathrm{meas}}$$
All datasets [32,33,35,36,37] were processed using a unified preprocessing pipeline to ensure methodological consistency. Column names were standardized, and only the universally available features (distance, frequency, and path loss) were retained. Samples with invalid entries (non-positive distance or frequency) or missing values were removed [20]. No dataset-specific feature engineering or environment-dependent adjustments were introduced, ensuring a fair benchmarking procedure across heterogeneous propagation scenarios [19,20]. The large-scale propagation behavior between a transmitter and a receiver is described by the path loss PL(d,f), where d denotes the transmitter–receiver distance in meters and f the carrier frequency in MHz. The received power is defined as:
$$P_{rx} = P_{tx} - PL(d,f)$$
where all quantities are expressed in logarithmic units (dBm and dB) and antenna gains and hardware losses are incorporated into the effective path loss term [1,2]. This formulation provides the foundation for the subsequent integration of propagation modeling with link budget analysis and energy-aware system design, where predicted path loss values are translated into transmission power requirements and node lifetime estimates under realistic deployment assumptions.
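The cleaning rules stated above (standardize column names, keep only distance, frequency, and path loss, drop rows with missing or non-positive entries) can be illustrated with a small standard-library sketch. The column aliases and output names below are hypothetical, since the raw dataset headers are not listed in the text.

```python
import math

# Hypothetical header aliases; the real dataset column names are not given.
ALIASES = {
    "distance": "distance_m", "dist": "distance_m", "d": "distance_m",
    "frequency": "frequency_mhz", "freq": "frequency_mhz", "f": "frequency_mhz",
    "path_loss": "path_loss_db", "pathloss": "path_loss_db", "pl": "path_loss_db",
}
KEEP = ("distance_m", "frequency_mhz", "path_loss_db")


def clean_rows(rows):
    """Return rows reduced to the three shared features, with invalid rows removed."""
    cleaned = []
    for row in rows:
        std = {}
        for key, value in row.items():
            name = ALIASES.get(key.strip().lower().replace(" ", "_"))
            if name is not None:  # environment descriptors are dropped here
                std[name] = value
        try:
            values = [float(std[name]) for name in KEEP]
        except (KeyError, TypeError, ValueError):
            continue  # missing column, None, or non-numeric entry
        if any(math.isnan(v) for v in values):
            continue  # NaN counts as a missing value
        distance, frequency, _ = values
        if distance <= 0 or frequency <= 0:
            continue  # non-positive distance or frequency is invalid
        cleaned.append(dict(zip(KEEP, values)))
    return cleaned
```

Because no dataset-specific branch appears in the function, the same call can be applied unchanged to every dataset, which is the property the unified pipeline relies on.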
The corresponding publicly accessible repositories, DOI references, and dataset resources associated with References [35,36,37,38,39] are explicitly provided in the Data Availability Statement to ensure transparency and reproducibility.
3.3. Propagation Models
To establish a robust comparative baseline for wireless channel characterization, a set of widely adopted empirical and semi-empirical propagation models is implemented, namely the Free Space Path Loss (FSPL), Okumura–Hata, COST-231 Hata, Log-Distance, Egli, and ITU indoor models. The selection of these models is guided by the heterogeneous nature of the datasets employed in this study, which encompass indoor, outdoor urban, terrain-affected, and high-frequency propagation scenarios. Accordingly, the chosen models collectively cover the corresponding propagation regimes: indoor-oriented formulations such as the ITU model are aligned with indoor datasets, urban macrocell models such as Okumura–Hata and COST-231 are representative of outdoor urban environments, general-purpose models such as FSPL and Log-Distance provide baseline formulations applicable across conditions, while terrain-sensitive models such as Egli capture irregular propagation effects relevant to environments with non-uniform terrain characteristics. This alignment ensures that the evaluation framework reflects realistic deployment conditions and enables a consistent assessment of both environment-specific applicability and cross-environment generalization capability. These models are extensively used in wireless communication system design and are applicable across heterogeneous environments, including rural, suburban, urban, and indoor scenarios [28]. Consistent parameterization is applied across all models to ensure comparability, with frequency expressed in MHz and distance converted to kilometers where required by specific formulations, while antenna heights are fixed at representative values of 30 m for the base station and 1.5 m for the receiver.
The parameters used in the propagation models are summarized in Table 2.
The FSPL model provides a theoretical lower bound for signal attenuation under ideal Line-of-Sight (LoS) conditions:
$$PL_{\mathrm{FSPL}} = 20\log_{10}(d) + 20\log_{10}(f) + 32.44$$
where d is the distance in kilometers and f is the operating frequency in MHz.
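The Okumura–Hata model is an empirical formulation derived from extensive field measurements:
$$PL_{\mathrm{Hata}} = 69.55 + 26.16\log_{10}(f) - 13.82\log_{10}(h_t) - a(h_r) + \left[44.9 - 6.55\log_{10}(h_t)\right]\log_{10}(d)$$
where d is in kilometers, $h_t$ and $h_r$ are the base station and mobile antenna heights in meters, and $a(h_r)$ represents the mobile antenna height correction factor.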
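The COST-231 Hata model extends the Okumura–Hata formulation to higher frequency bands (1500–2000 MHz):
$$PL_{\mathrm{COST}} = 46.3 + 33.9\log_{10}(f) - 13.82\log_{10}(h_t) - a(h_r) + \left[44.9 - 6.55\log_{10}(h_t)\right]\log_{10}(d) + C$$
where C is an environmental constant (0 dB for suburban and medium-sized cities, and 3 dB for metropolitan areas).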
The Log-Distance model generalizes FSPL by introducing a path loss exponent:
$$PL(d) = PL(d_0) + 10\,n\log_{10}\!\left(\frac{d}{d_0}\right)$$
where n denotes the path loss exponent, which depends on the propagation environment, and $d_0$ denotes the reference distance (typically 1 m).
The Egli model accounts for antenna heights and partial obstruction:
where $h_t$ and $h_r$ denote the transmitting and receiving antenna heights, respectively. This notation is consistent with the global variable definitions provided in Table 2.
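The ITU indoor model incorporates structural attenuation effects:
$$PL_{\mathrm{ITU}} = 20\log_{10}(f) + N\log_{10}(d) + L_f(n) - 28$$
where f is in MHz, d is in meters, N is the distance power loss coefficient, and $L_f(n)$ represents the floor penetration loss for n floors.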
These models collectively provide a comprehensive baseline for evaluating propagation behavior across diverse environments [28].
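As an illustration of how such closed-form baselines are evaluated, the sketch below implements three of the models named above in their standard textbook forms, with frequency in MHz. It is not the authors' implementation: the function names are hypothetical, the default antenna heights follow the 30 m and 1.5 m values stated in the text, and the Hata height correction shown is the small-to-medium-city variant, which is an assumption.

```python
import math


def fspl_db(d_km, f_mhz):
    """Free-space path loss in dB (distance in km, frequency in MHz)."""
    return 20 * math.log10(d_km) + 20 * math.log10(f_mhz) + 32.44


def log_distance_db(d_m, pl_d0_db, n, d0_m=1.0):
    """Log-distance path loss in dB relative to a reference distance d0."""
    return pl_d0_db + 10 * n * math.log10(d_m / d0_m)


def hata_urban_db(d_km, f_mhz, h_t_m=30.0, h_r_m=1.5):
    """Okumura-Hata urban path loss in dB.

    Assumes the small/medium-city mobile antenna correction a(h_r).
    """
    log_f = math.log10(f_mhz)
    a_hr = (1.1 * log_f - 0.7) * h_r_m - (1.56 * log_f - 0.8)
    return (69.55 + 26.16 * log_f - 13.82 * math.log10(h_t_m) - a_hr
            + (44.9 - 6.55 * math.log10(h_t_m)) * math.log10(d_km))
```

With a path loss exponent of n = 2 and the reference loss taken from `fspl_db` at 1 m, `log_distance_db` reproduces free-space loss, which is a quick consistency check on unit handling (kilometers in one function, meters in the other).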
3.4. Machine Learning Models
To model the nonlinear relationship between propagation parameters and measured path loss, several machine learning regression algorithms are employed, including ensemble methods, kernel-based models, and neural networks [17,40]. The selection of these models is guided by the need to provide a comprehensive set of widely adopted regression approaches capable of capturing nonlinear propagation behavior across heterogeneous datasets. In particular, ensemble methods such as Random Forests [21], Gradient Boosting [22], XGBoost [23], and Extremely Randomized Trees [24] are included due to their strong ability to improve generalization through the aggregation of multiple decision trees and their robustness to noise and dataset variability, as established in classical classification and regression tree (CART) methodology [41]. A Decision Tree Regressor is also considered as a baseline model for reference.
Support Vector Regression (SVR) with an RBF kernel is employed to capture nonlinear relationships in high-dimensional feature spaces [31], while a Multi-Layer Perceptron (MLP) trained via backpropagation [42] is used as a representative feedforward neural network model capable of approximating complex mappings between input features and path loss. In addition, CNN architectures have shown competitive performance in specific scenarios [9,43]. The CNN configuration (layers, epochs, optimizer, and other hyperparameters) was selected through automated hyperparameter tuning to ensure fair and reproducible performance across datasets, following the experimental setup in [43]. The inclusion of these models ensures coverage of different regression paradigms, enabling a consistent and fair evaluation of their predictive performance and generalization capability across heterogeneous propagation scenarios.
All models use two input features (distance and frequency) to predict path loss in dB. Feature standardization is applied to ensure stable convergence. Model evaluation is performed using 5-fold cross-validation [19], and performance is assessed using MAE, RMSE, and R².
Residuals are further analyzed using the Shapiro–Wilk test [44] to assess their statistical distribution. RMSE is interpreted as a practical measure of prediction dispersion and, under approximately Gaussian residual conditions, as an estimator of shadowing variability. This interpretation is applied cautiously in cases where the Gaussian assumption does not strictly hold.
To further enhance reproducibility and methodological transparency, the hyperparameter configurations of the implemented machine learning and deep learning models, together with the preprocessing, cross-validation, and statistical evaluation settings adopted throughout the experimental pipeline, are summarized in Appendix A (Table A1 and Table A2).
3.5. Node Lifetime Estimation
Based on the developed modeling and evaluation pipeline, the framework translates the obtained path loss predictions into system-level operational metrics by linking them to transmission power requirements and energy usage in battery-constrained WSN–IoT nodes. The estimated path loss is integrated within a link budget formulation to derive the minimum transmission power required for reliable communication, directly affecting per-transmission energy expenditure and, consequently, the expected node lifetime. It is important to clarify that the proposed framework does not perform power control or dynamic transmission policy optimization. Instead, it serves as a recommendation-oriented framework that translates link budget requirements into transmission power levels, intended for system-level design guidance rather than real-time control. To ensure reproducibility and consistency with established energy-aware modeling approaches for wireless sensor networks, all assumptions and parameters used in the proposed framework are explicitly defined and grounded in prior literature on WSN energy consumption and communication system design [4,5,6,25,26,27]. The assumptions and parameter values are displayed in Table 3. It is emphasized that the contribution of this work does not lie in the individual formulation of these components, which are well established in the literature, but in their integration through a structured and statistically grounded transformation process that enables consistent propagation of prediction uncertainty into system-level performance metrics.
The considered network operation follows typical WSN–IoT behavior, where nodes transmit periodically and remain in low-power sleep states between transmissions [4,6,25]. The device and communication parameters are selected to reflect realistic operating conditions reported in WSN and IoT deployments and are aligned with the characteristics of the datasets and their corresponding application scenarios [4,5,25,39]. These parameters are summarized in Table 4.
These parameters are not tied to a specific hardware platform but represent typical configurations reported in the literature for comparable deployment scenarios.
Given that wireless transmission represents the dominant component of energy consumption in such systems [4,5,6,26,27], accurate estimation of transmission power becomes critical for achieving energy-efficient operation. For each dataset, the model with the lowest RMSE is selected as the best-performing model for reliability-aware evaluation. In this context, this selection defines the reference model for generating power level suggestions rather than implying any control mechanism. Consequently, the overall framework remains recommendation-oriented rather than selection-driven. Furthermore, RMSE is interpreted as a proxy for prediction uncertainty and, under explicitly validated statistical conditions, as an approximate indicator of dispersion consistent with shadowing variability. This interpretation is conditional and does not imply that residuals exclusively represent physical channel effects.
3.5.1. Prediction Error, Channel Variability, and Fade Margin
Propagation model accuracy is evaluated using the RMSE calculated on an independent test dataset. For $N$ samples with measured path loss $PL_{\mathrm{meas},i}$ and predicted values $PL_{\mathrm{pred},i}$, the RMSE is defined as [19]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(PL_{\mathrm{pred},i} - PL_{\mathrm{meas},i}\right)^2}$$
Large-scale propagation effects are commonly modeled using the log-normal shadowing model, where path loss consists of a deterministic component plus a Gaussian random variable in the logarithmic (dB) domain with standard deviation $\sigma$ [1,2].
Let the prediction residual be
$$\varepsilon_i = PL_{\mathrm{meas},i} - PL_{\mathrm{pred},i}$$
The Mean Squared Error (MSE) of the residual distribution is [1,2]:
$$\mathrm{MSE} = \mathbb{E}\left[\varepsilon^2\right] = \sigma^2 + \mu^2$$
where $\sigma^2$ denotes the variance of the residuals and $\mu$ denotes the mean residual (bias). The bias quantifies systematic over- or under-estimation, while $\sigma^2$ is associated with the shadowing component of large-scale propagation.
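The split of the mean squared error into residual variance plus squared bias can be checked numerically. The following is a minimal sketch with made-up path loss values in dB; the function name is hypothetical and the population variance is used so that the identity holds exactly.

```python
import math
import statistics


def error_decomposition(measured_db, predicted_db):
    """Return (bias, variance, mse, rmse) of the residuals measured - predicted."""
    residuals = [m - p for m, p in zip(measured_db, predicted_db)]
    bias = statistics.fmean(residuals)                      # mean residual
    variance = statistics.pvariance(residuals, mu=bias)     # population variance
    mse = statistics.fmean([r * r for r in residuals])
    return bias, variance, mse, math.sqrt(mse)


# Illustrative values only; they do not come from any of the datasets.
bias, variance, mse, rmse = error_decomposition(
    [100.0, 102.0, 98.0, 105.0],
    [99.0, 100.0, 99.0, 101.0],
)
# Residuals are [1, 2, -1, 4]: bias = 1.5, variance = 3.25, mse = 5.5
```

In this toy case the RMSE (about 2.35 dB) is noticeably larger than the residual standard deviation (about 1.80 dB) because the bias is not negligible, which is exactly the situation the zero-bias check is meant to detect.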
The proposed methodology does not assume a direct equivalence between RMSE and the shadowing standard deviation (σ), but instead evaluates whether RMSE can serve as an approximate estimator of dispersion consistent with shadowing variability under explicitly defined statistical conditions. First, the absence of systematic error is assessed by testing whether the residual mean is statistically indistinguishable from zero (bias ≈ 0) using a two-sided hypothesis test at a predefined significance level (e.g., α = 0.05), i.e., H₀: μ = 0 versus H₁: μ ≠ 0; this condition is satisfied when the corresponding p-value exceeds the significance threshold, indicating no evidence against the null hypothesis. Under the assumption of unbiased residuals, the Mean Squared Error reduces to MSE ≈ σ², and consequently RMSE ≈ σ, which is interpreted here as an approximation of dispersion rather than a strict physical equivalence. Next, the normality of the residuals is evaluated using the Shapiro–Wilk test [44]; a p-value greater than α indicates no statistically significant deviation from Gaussianity. Under the joint conditions of negligible bias and approximate normality, RMSE can be interpreted as an empirical measure of dispersion in the logarithmic domain that is consistent with shadowing variability under log-normal propagation assumptions [1,2,20], which are widely adopted in wireless channel modeling literature [1,2,28].
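The zero-bias condition can be illustrated with a standard-library sketch. Note the assumption: the p-value below uses the large-sample normal approximation to the t distribution, so it is only indicative for small samples. In the stated software stack, `scipy.stats.ttest_1samp` would provide the exact test and `scipy.stats.shapiro` the Shapiro–Wilk normality check; neither is reimplemented here, and the function name is hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev


def zero_mean_test(residuals, alpha=0.05):
    """Two-sided test of H0: mean residual = 0.

    Returns (t_stat, p_value, unbiased), where `unbiased` is True when H0
    is not rejected at level alpha. Uses a normal approximation, which is
    an assumption that is reasonable only for large samples.
    """
    n = len(residuals)
    standard_error = stdev(residuals) / math.sqrt(n)
    t_stat = fmean(residuals) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, p_value > alpha
```

Only when this check and the normality check both pass is RMSE read as an approximation of σ; otherwise it is carried forward as a generic uncertainty indicator.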
This interpretation assumes that residuals predominantly reflect variability in the logarithmic (dB) domain. Nevertheless, in machine learning and deep learning models, residuals may also encompass contributions from modeling errors, approximation limitations, and dataset-specific biases. Consequently, RMSE should not be interpreted as a direct physical measurement of shadowing variability, but rather as an empirical measure of dispersion that remains consistent with large-scale propagation variability under appropriate statistical conditions. When these conditions are not strictly satisfied, RMSE serves as a robust indicator of prediction uncertainty. In such cases, its application in subsequent analysis is treated as an engineering approximation that enables consistent and comparative system-level evaluation across heterogeneous scenarios.
Due to stochastic shadowing, the received signal power may occasionally fall below the receiver sensitivity $S_{rx}$, leading to communication outages. This uncertainty is addressed by introducing a fade margin.
For a target link reliability $p$, the fade margin is [1,2]:
$$M = z_p\,\sigma$$
where $z_p$ is the standard normal quantile. Classical outage probability analysis for log-normal shadowing shows that a reliability level of 95% corresponds to $z_{0.95} \approx 1.645$ [1,2,20]. Therefore, for a target reliability level of 95%, the fade margin can be approximated as:
$$M \approx 1.65 \cdot \mathrm{RMSE}$$
under the assumption of approximately Gaussian and unbiased residuals. In this work, this relation is not interpreted as a direct physical estimation of shadowing variability, but as a statistically grounded approximation that enables the propagation of prediction uncertainty into reliability-aware system design. In particular, the RMSE-derived fade margin should be understood as an effective margin that captures the combined impact of propagation-induced variability, model approximation errors, and dataset-related uncertainties in the logarithmic (dB) domain.
For datasets where the Gaussianity or zero-bias assumptions are not strictly satisfied, the resulting fade margin is treated as an engineering approximation that preserves consistency across heterogeneous scenarios rather than as a strict representation of physical channel parameters. This interpretation ensures that the proposed formulation remains robust to deviations from ideal statistical conditions while enabling a unified and reproducible mapping from prediction error to communication reliability, link budget requirements, and energy-aware system design. Consequently, the framework does not rely on a strict equivalence between RMSE and shadowing standard deviation, but instead employs RMSE as a consistent uncertainty propagation metric within a deployment-oriented analytical pipeline applicable to both Gaussian and non-Gaussian residual regimes. While more advanced uncertainty-aware models could be employed, the use of RMSE provides a model-agnostic and practically interpretable metric, enabling consistent comparison across heterogeneous modeling approaches.
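The mapping from a dispersion estimate to a fade margin for a target reliability can be reproduced with the standard normal quantile. A minimal sketch, assuming Gaussian, zero-mean residuals in dB; the function name and the 6 dB dispersion value are illustrative only.

```python
from statistics import NormalDist


def fade_margin_db(sigma_db, reliability=0.95):
    """Fade margin in dB as the one-sided normal quantile times the dispersion."""
    z = NormalDist().inv_cdf(reliability)  # about 1.645 for 95%
    return z * sigma_db


# Illustrative dispersion estimate (for example an RMSE of 6 dB).
margin_95 = fade_margin_db(6.0)         # roughly 9.9 dB
margin_99 = fade_margin_db(6.0, 0.99)   # a stricter target needs a larger margin
```

Changing the reliability target only changes the quantile, so the same RMSE value can be reused to compare margins across reliability levels within one dataset.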
3.5.2. Link Budget and Transmission Power Estimation
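The required transmit power is determined using the link budget [1]:
$$P_{req} = S_{rx} + PL_{base} + M$$
where $S_{rx}$ is the receiver sensitivity, $M$ is the fade margin, and $PL_{base}$ denotes the median path loss value of the corresponding dataset.
Since commercial radios support discrete transmission power levels $\mathcal{P} = \{P_1, \dots, P_K\}$, the suggested transmit power is [1,2]:
$$P_{tx} = \min\{P_k \in \mathcal{P} \mid P_k \ge P_{req}\}$$
ensuring operation at the lowest available power level satisfying the link budget requirement and providing link-level reliability under the assumed propagation conditions.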
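The link budget and the mapping onto discrete radio power levels can be sketched as follows. All numbers (sensitivity, median path loss, margin, and the set of supported levels) are hypothetical and not taken from Tables 3 to 5; returning `None` when no level suffices is a design choice of this sketch, not something the text specifies.

```python
def required_tx_power_dbm(sensitivity_dbm, pl_base_db, margin_db):
    """Minimum transmit power in dBm: sensitivity plus path loss plus fade margin."""
    return sensitivity_dbm + pl_base_db + margin_db


def suggest_tx_level(p_req_dbm, levels_dbm):
    """Lowest supported level that meets the requirement, or None if none does."""
    feasible = [p for p in levels_dbm if p >= p_req_dbm]
    return min(feasible) if feasible else None


# Hypothetical radio and link values, for illustration only.
p_req = required_tx_power_dbm(sensitivity_dbm=-120.0, pl_base_db=118.0, margin_db=9.9)
level = suggest_tx_level(p_req, levels_dbm=[2, 5, 8, 11, 14])
```

Here the requirement of about 7.9 dBm is rounded up to the 8 dBm level, so the quantization step of the radio, not only the predicted path loss, determines the transmit current that enters the energy model.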
In practical wireless transceivers, transmission power levels are discrete and the corresponding current consumption follows a nonlinear relationship with output power due to power amplifier efficiency and hardware constraints. This behavior has been extensively reported in wireless sensor network and IoT device studies [4,5,6,26,27]. To capture this effect, technology-aware transmission power consumption profiles are adopted, as summarized in Table 5. These profiles are derived from representative device characteristics reported in the literature and reflect typical operating conditions across different communication technologies, including low-power wide-area networks (LPWAN), short-range wireless systems, and cellular IoT deployments [4,5,6,26,27,28].
The use of discrete transmission levels ensures consistency with practical radio implementations, where transmission power cannot be continuously adjusted but is selected from predefined hardware-supported levels [1,2]. Accordingly, the incorporation of RMSE-derived fade margin into the link budget should be interpreted as a mechanism for propagating prediction uncertainty into system design, rather than as a direct estimation of physical channel parameters.
3.5.3. Energy Consumption and Node Lifetime Estimation
Energy consumption is estimated using Coulomb counting [4,5,6]:
During each reporting interval $T_{interval}$, the node transmits for a fixed transmission duration and remains in sleep mode for the remaining time. The charge consumed per cycle, $Q_{cycle}$, is calculated as [4,5,6]:
where the overhead term accounts for hardware overhead such as DC–DC inefficiencies and wake-up energy.
To account for aging and environmental effects, the usable battery charge $Q_{battery}$ is approximated as [5,25]
The number of supported reporting cycles is
$$N_{cycles} = \frac{Q_{battery}}{Q_{cycle}}$$
and the expected node lifetime is [5,6]
$$L = N_{cycles} \cdot T_{interval}$$
expressed in days. In practical deployments, lifetime estimates may be capped at approximately 10 years due to hardware aging and maintenance constraints [5,6].
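The duty-cycle accounting described above can be sketched as follows. Several choices here are assumptions rather than the authors' formulation: the overhead is modeled as an additive charge per cycle, the usable battery charge is passed in directly instead of being derived from a derating rule, the 10-year cap is taken as 3650 days, and all numeric values are illustrative rather than taken from Tables 3 to 5.

```python
SECONDS_PER_HOUR = 3600.0
SECONDS_PER_DAY = 86400.0


def charge_per_cycle_mah(i_tx_ma, t_tx_s, i_sleep_ma, t_interval_s,
                         q_overhead_mah=0.0):
    """Charge drawn in one reporting cycle, in mAh.

    Assumes one transmission of length t_tx_s per interval, sleep for the
    rest of the interval, and an additive overhead charge (an assumption).
    """
    active = i_tx_ma * t_tx_s / SECONDS_PER_HOUR
    sleep = i_sleep_ma * (t_interval_s - t_tx_s) / SECONDS_PER_HOUR
    return active + sleep + q_overhead_mah


def lifetime_days(q_usable_mah, q_cycle_mah, t_interval_s, cap_days=3650.0):
    """Expected lifetime in days, capped at roughly ten years."""
    cycles = q_usable_mah / q_cycle_mah
    return min(cycles * t_interval_s / SECONDS_PER_DAY, cap_days)


# Illustrative node: 40 mA for 1 s every 10 min, 2 uA sleep, 2000 mAh usable.
q_cycle = charge_per_cycle_mah(i_tx_ma=40.0, t_tx_s=1.0,
                               i_sleep_ma=0.002, t_interval_s=600.0)
days = lifetime_days(q_usable_mah=2000.0, q_cycle_mah=q_cycle, t_interval_s=600.0)
```

Because the transmit current depends on the selected discrete power level, rerunning this calculation with the current of the next higher level shows directly how a larger fade margin shortens the estimated lifetime.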
The proposed framework pipeline for node lifetime estimation is shown in Figure 2.
Algorithm 1 summarizes the complete workflow and serves as the sole procedural representation of the proposed framework.
| Algorithm 1: System-Level Workflow |
| Input: |
| Dataset D = {distance d, frequency f, measured path loss PLmeas} |
| Step 1: Input dataset |
| x = [d, f], y = PLmeas |
| Step 2: Model training and prediction |
| For each model m ∈ {Propagation model, ML, DL}: |
| Predict: |
| PLpred = f_m(x) |
| Compute: |
| RMSE_m = √((1/N) Σ (PLpred − PLmeas)^2) |
| m* = argmin_m RMSE_m |
| Step 3: Residual computation |
| ε = PLmeas − f_{m*}(x) |
| Step 4: Statistical validation (interpretation metrics) |
| Compute: |
| p_bias (two-sided t-test for zero-mean residuals), |
| p_normality (Shapiro–Wilk test for Gaussianity) |
| Step 5: Dispersion representation |
| σest ← RMSE_m* (empirical dispersion proxy under validated statistical assumptions) |
| Step 6: Reliability margin |
| M = 1.65 · σest |
| Step 7: Link budget |
| Preq = Srx + PLbase + M |
| Step 8: Transmission power mapping |
| Ptx = min{Pk ∈ P|Pk ≥ Preq} |
| Step 9: Energy consumption |
| Qcycle = f(Ptx) |
| Step 10: Node lifetime |
| L = (Qbattery/Qcycle) · T_interval |
| Output: |
| m*, RMSEm*, Ptx, Qcycle, L |