LoRaWAN Path Loss Measurements in an Urban Scenario Including Environmental Effects

Abstract: LoRaWAN is a widespread protocol by which Internet of things end nodes (ENs) can exchange information over long distances via their gateways. To deploy the ENs, it is mandatory to perform a link budget analysis, which allows for determining adequate radio parameters like path loss (PL). Thus, designers use PL models developed based on theoretical approaches or empirical data. Some previous measurement campaigns have been performed to characterize this phenomenon, primarily based on distance and frequency. However, previous works have shown that weather variations also impact PL, so using the conventional approaches and available datasets without capturing important environmental effects can lead to inaccurate predictions. Therefore, this paper delivers a data descriptor that includes a set of LoRaWAN measurements performed in Medellín, Colombia, including PL, distance, frequency, temperature, relative humidity, barometric pressure, particulate matter, and energy, among other things. This dataset can be used by designers who need to fit highly accurate PL models. As an example of the dataset usage, we provide some model fittings including log-distance, and multiple linear regression models with environmental effects. This analysis shows that including such variables improves path loss predictions with an RMSE of 1.84 dB and an $R^2$ of 0.917.


Introduction
The Internet of things (IoT) is an enabling paradigm of Industry 4.0 that uses sensors to extract environment-aware data in diverse applications, such as domotics [1], smart energy [2], and precision agriculture [3], among other things. The data collected is further stored and analyzed to perform classifications or regressions, helping organizations make decisions about their processes [4]. Although there are applications where the end nodes (ENs) transmit information over short distances and have unlimited energy resources (e.g., domotics), there are also some cases where the ENs must be deployed in hard-to-access places where the sensors' information has to be transmitted over long distances, and where changing batteries is difficult or impossible, e.g., in forest fire monitoring [5], regulating the water level in dams [6], and landslide detection [7]. Thus, when the application has low energy and long-distance constraints, low power wide area networks (LPWANs) are used because they exhibit a good compromise between range and power consumption [8].
One of the most widespread protocols for LPWANs is LoRaWAN [9], which has gained popularity for IoT deployments because it operates in unlicensed bands, consumes low energy, and covers wide ranges compared to other competitors like narrow-band IoT (NB-IoT) or Sigfox [10].
Because LoRaWAN is a wireless sensor network (WSN) protocol [11], deploying ENs in the field requires previous network planning and link budget analyses. These analyses help establish the network parameters that achieve reliable connectivity at low energy consumption, so designers can choose different radio elements, including antennas' geometries, antennas' gains, allowed attenuations caused by cables and connectors, expected path loss (PL) and shadowing features posed by the channel, and transmission powers [12]. Consequently, the link budget calculation guarantees that the received signal strength indicator (RSSI) on the gateway (GW) side is sufficiently large to be demodulated correctly. The PL effects can be estimated by using theoretical models (e.g., Friis [13], and two-ray [12]) or empirical models (e.g., Okumura-Hata [14] and log-distance models [12]) to accomplish the link margin goals. More specifically, the Friis approach considers that the PL depends only on distance and frequency and does not consider multipath phenomena that cause shadowing [15]. Besides, the two-ray model considers a theoretical approximation of the line of sight (LoS) ray and the ray reflected over the ground, so it partially considers shadow fading effects [12].
However, because the multipath phenomenon in real applications is very diverse and complex, empirical models based on measurement campaigns are also proposed. For instance, the Okumura-Hata approach [14] provides a closed-form expression derived from collected data in Tokyo, Japan. It depends on the distance between the EN and the GW, the antenna heights, and the frequency. This model was fitted for large cells where the antennas' heights are from 30 to 100 m; however, in WSN/IoT networks, ENs' antennas are close to the ground (for example, in precision agriculture [16]), causing shadow fading effects of up to 14 dB [12], so the model may not be suitable for IoT applications, considering that the maximum transmitter power is about 20 dBm (e.g., LoRaWAN [9]). Because of these limitations, a log-distance path loss model (LDPLM) is also fitted from field data, including a shadow fading term, which is modeled considering that its probability density function (PDF) follows a lognormal distribution. According to [17], the statistical validity of the LDPLM must meet the following conditions: (i) it must pass an analysis of variance (ANOVA) test, by which the log-distance weight (also known as the path-loss exponent) is analyzed to check its significance, and (ii) the residual error/shadow fading term must be log-normally distributed, homoscedastic, and uncorrelated. However, in the proposed models, these tests are rarely addressed, and in the cases where they are handled, normality is not always met [18,19].
Due to the limitations mentioned above, some previous datasets have tackled the problem of modeling the radio frequency features in LoRaWAN networks to improve PL predictions. For instance, in [20], the authors collected 665 samples in the city of Beirut, Lebanon, logging timestamp, distance, frequency (868 MHz), RSSI, signal-to-noise ratio (SNR), GW coordinates, and spreading factor (SF), with a fixed bandwidth (BW) of 125 kHz and a fixed payload of 37 bytes. An LDPLM was enhanced by adding a new feature based on the EN antenna height. After fitting the corresponding model, they found a PL standard deviation of 7.2 dB.
Another approach can be found in [21], wherein the authors collected some operational aspects of LoRaWAN in Brno, Czech Republic. Regarding PL modelling, the authors collected data for two months, logging timestamp, RSSI, SNR, GW coordinates, EN coordinates, time on air (ToA), frequency (868 MHz), SF, payload size, and frame count. They found that the RSSI fluctuated up to 50 dB, concluding that the conventional propagation models may lead to significant inaccuracies in PL prediction. Furthermore, in [22], the authors deployed nine GWs in central London, UK, and collected timestamp, frequency (868 MHz), RSSI, SNR, SF, and payload size. However, this dataset does not include the distance between the GW and the ENs, so its use is mainly oriented to optimizing network parameters, and PL modelling is impossible.
Moreover, in [23], the authors provide a dataset for localization/tracking purposes based on fingerprinting techniques, which do not require the base station positions. This dataset provides the RSSI of 68 base stations, timestamp, SF, and EN coordinates, for three months, in the city center of Antwerp, Belgium. As an application, this approach shows a fingerprint location using clustering techniques, particularly KNN [24], achieving a mean error of 398.4 m. Thus, this dataset is unsuitable for PL modelling purposes.
In addition, because our dataset can be mainly used for path loss modeling, we retrieved the most recent approaches for LoRaWAN to further exhibit our dataset's contribution. For instance, Anzum et al. [25] proposed a LoRaWAN path loss model to characterize the attenuation in oil palm crops by using an LDPLM mainly based on the distance between the ENs and the GW and the number of canopies and trunks throughout the communication path. Alobaidy et al. [26] fitted a semiempirical machine-learning-based path loss model for LoRaWAN links combining the Friis model with a stepwise multiple linear regression that depends on the frequency, bandwidth, antennas' heights, spreading factor, and distance. Batalha et al. [27] performed a measurement campaign by using LoRaWAN in a suburban environment and fitted close-in and floating intercept LDPLMs that depend on the distance between the ENs and the GW; then, they compared their performance versus the Okumura-Hata model. Bianco et al. collected path loss measurements in a mountain environment, fitted an LDPLM by using the distance as a predictor variable, and used it in tracking and rescue applications. Callebaut et al. [28] evaluated coverage and path loss in urban, forest, and coastal environments and fitted a two-slope LDPLM to assess the protocol's reliability in each scenario. Finally, El Chall et al. [20] proposed different LDPLMs for indoor, campus, and city environments. The contributions regarding datasets and path loss models are summarized in Table 1.
As presented in Table 1, the measurement campaigns and path loss models for LoRaWAN are mainly based on distance, frequency, and antennas' heights. However, previous studies have shown that PL variability is also accentuated by changes in environmental variables like temperature [30], relative humidity [31], barometric pressure [32], and particulate matter [33], and these effects have not been measured in the available LoRaWAN datasets. Therefore, this paper provides a comprehensive LoRaWAN measurement campaign carried out in an urban environment, in Medellín, Colombia, for four months. Our measurement setup includes one GW and four fixed ENs located from 2 km to 8 km away. The dataset has up to 930,000 observations, including geometric conditions (distance and antennas' heights), link budget features (transmitter powers, antennas' gains, cables and connectors attenuations, carrier frequency, SF, and frame length), propagation variables (RSSI, SNR, ToA, effective signal power (ESP), noise power ($P_n$), and consumed energy) and environmental variables (temperature, relative humidity, barometric pressure, and particulate matter). The main contribution of building this dataset is the inclusion of the environmental variables because designers can fit more accurate path loss and shadowing models depending on weather variations.
The rest of this paper is organized as follows. Section 2 briefly introduces the main features of the LoRaWAN protocol. Section 3 specifies the logged fields in the dataset. Section 4 shows the experimental setup from the ENs' construction to the database logging. Section 5 shows a possible application of PL modelling using the dataset, including a lognormal combined path loss and shadowing (CPLS) model and an environment-based CPLS model that improves the prediction errors and increases the correlation factor. Finally, Section 6 shows the conclusions.

LoRaWAN Outline
This section provides a brief description of the LoRaWAN protocol. It includes the architecture, spectrum utilization, modulation characteristics, and transmission power control (TPC) strategies.

Architecture
The network architecture of LoRaWAN [9] is depicted in Figure 1. It includes four entities: (i) the ENs, which are in charge of sensing variables and transmitting them via the LoRaWAN radio frequency (RF) protocol, (ii) the GWs, which receive the information from the ENs and retransmit it to a server via LoRaWAN TCP/IP, (iii) the network server (NS), which receives the information from the GWs, deletes duplicates, and exposes some services to send information to the applications by using different protocols like message queue telemetry transport (MQTT) or reverse hypertext transport protocol (Reverse HTTP), and (iv) application servers that receive information from the NS and store it, analyze it, or visualize it. In particular, if an EN needs to transmit a frame, it sends a broadcast message that various GWs can receive. In this case, the same frame can be retransmitted by more than one GW, so the NS filters the information and puts it into an MQTT broker or a cloud-based server.

Spectrum Usage
LoRaWAN operates in the industrial, scientific, and medical (ISM) bands, which are unlicensed in most countries, so users do not have to pay any fee for their utilization [34]. In the case of Europe, the accepted frequency ranges are from 863 MHz to 870 MHz, divided into 16 upload/download channels of BW equal to 125 kHz. In the case of America, the accepted frequency ranges are from 902 MHz to 928 MHz, divided into 64 uplink channels with a BW of 125 kHz, eight uplink channels with a BW of 500 kHz, and eight downlink channels with a BW of 500 kHz. Our dataset was generated in the approved band for America (US902-928). Because the ISM bands are intensively used, designers must comply with duty cycle policies that impose a waiting period between transmissions. For instance, in Europe, the duty cycles must be under 1%, and in America, there are no duty cycle limitations, but a transmission must not last more than 400 ms.

Modulation
LoRaWAN uses chirp spread spectrum (CSS) for modulation [35]. This spread spectrum technique helps reduce interference, multipath, and fading effects [23]. If the central carrier frequency is $f_0$, a chirp signal changes its frequency linearly in the interval $(f_0 - BW/2, f_0 + BW/2)$ during a symbol time $T_s$. In addition, the channel bandwidth is divided according to the SF parameter (i.e., the number of bits per symbol), which takes values from 6 to 12. Thus, the spectrum is divided into $2^{SF}$ parts so that a symbol can start at any of $2^{SF}$ different initial frequencies. The initial frequency determines the value to be transmitted, so each symbol can encode one of $2^{SF}$ values. Furthermore, a coding rate (CR) parameter defines the proportion of redundancy bits added to the frame for error correction and can take values of 4/5, 4/6, 4/7, and 4/8. The symbol time $T_s$ is defined in Equation (1):
$$T_s = \frac{2^{SF}}{BW}. \quad (1)$$
It can be noticed that increasing the SF by one doubles the symbol time $T_s$. In addition, the SF modifies the receiver sensitivity, allowing negative SNR limit values, as shown in Table 2. In this way, increasing the SF allows broader coverage but increases $T_s$. Because the consumed energy depends on the transmission power and $T_s$, increasing the SF also increases the consumed energy.
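As a minimal sketch of the symbol-time relation described above, the following Python snippet evaluates $T_s = 2^{SF}/BW$ for the SFs used in this dataset. The function name `symbol_time` is hypothetical and introduced only for illustration.

```python
def symbol_time(sf: int, bw_hz: float) -> float:
    """Return the LoRa symbol time T_s = 2**SF / BW in seconds."""
    if not 6 <= sf <= 12:
        raise ValueError("SF must be between 6 and 12")
    return (2 ** sf) / bw_hz


if __name__ == "__main__":
    # Each SF increment doubles the symbol time at a fixed bandwidth.
    for sf in range(7, 11):
        print(f"SF={sf}: T_s = {symbol_time(sf, 125e3) * 1e3:.3f} ms")
```

At a BW of 125 kHz, the loop makes the doubling of $T_s$ per SF step explicit, which is the mechanism behind the longer time on air at high SFs.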

Transmission Power Control
LoRaWAN includes an adaptive data rate (ADR) algorithm as a TPC scheme, which is used to improve energy consumption dynamically. The idea behind this scheme is to change the transmission parameters (transmission power and SF) according to the channel state, particularly the behavior of the SNR. As previously discussed, when the SF increases, the receiver tolerates a lower (more negative) SNR limit [36], as shown in Table 2. The SNR limit is the worst SNR the receiver can tolerate to demodulate the received signal adequately. In the case of LoRaWAN, the SNR limit can be negative; that is, the received signal power can be less than the noise floor power. The steps to implement the ADR algorithm are discussed as follows [37].

1. The EN enables the ADR scheme and informs the GW. Hence, the NS can change the transmission power and the SF. In addition, the EN sends information with SF = 12 to guarantee that the data can reach the gateway.
2. The base station collects 20 values of SNR ($SNR_{measured}$) from the node and sends them to the network server.
3. The network server takes the maximum value of the 20 $SNR_{measured}$ samples, for example, 5 dB. It also takes the current SF.
4. The network server calculates the margin $M_e$ as $M_e = SNR_{measured} - SNR_{limit} - LM$ (Equation (2)), where $LM$ is the link margin, used as a security term to achieve reliable communications. For example, $LM$ = 10 dB and $SNR_{limit}$ = −20 dB (SF = 12, from Table 2), so $M_e$ = 5 dB − (−20 dB) − 10 dB = 15 dB. It means an excess of 15 dB in the link budget, which causes a waste of energy.
5. The network server calculates $M_e$ again with a lower spreading factor (the margin must always remain greater than zero to guarantee a stable link). For example, for SF = 7, $SNR_{limit}$ = −7.5 dB, so $M_e$ = 5 dB − (−7.5 dB) − 10 dB = 2.5 dB.
6. Because there is an excess of 2.5 dB, the network server also lowers the transmission power.
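The margin calculation in the steps above can be sketched as follows. This is only an illustration of the arithmetic, not the network server's actual implementation: the function names are hypothetical, and the SNR limits for SF = 8 to 11 are assumed from commonly quoted SX127x figures because only the SF = 7 and SF = 12 values are stated in the text.

```python
# SNR demodulation limits per SF in dB. The SF7 and SF12 values are quoted
# in the text; the intermediate values are assumed (typical SX127x figures).
SNR_LIMIT_DB = {7: -7.5, 8: -10.0, 9: -12.5, 10: -15.0, 11: -17.5, 12: -20.0}


def adr_margin(snr_samples, sf, link_margin_db=10.0):
    """Margin M_e = max(SNR_measured) - SNR_limit(SF) - LM, in dB."""
    return max(snr_samples) - SNR_LIMIT_DB[sf] - link_margin_db


def lowest_viable_sf(snr_samples, link_margin_db=10.0):
    """Smallest SF whose margin stays above zero, or None if none does."""
    for sf in sorted(SNR_LIMIT_DB):
        if adr_margin(snr_samples, sf, link_margin_db) > 0:
            return sf
    return None


if __name__ == "__main__":
    samples = [5.0, 1.0, -3.0]  # hypothetical SNR readings in dB
    print(adr_margin(samples, 12))    # margin at SF = 12
    print(adr_margin(samples, 7))     # margin at SF = 7
    print(lowest_viable_sf(samples))  # lowest SF with a positive margin
```

With a best SNR of 5 dB and LM = 10 dB, the sketch reproduces the 15 dB and 2.5 dB margins of the worked example; any remaining positive margin is what the transmission power reduction in the last step would consume.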

Data Description
The given dataset contains a comma-separated values file with the measurements of four ENs and one GW in Medellín, Colombia. The database includes 930,753 observations from October 2021 to March 2022, with a mean sample time of 60 s. According to the regulations of the ISM bands for US915, the maximum transmission time is 400 ms [38]. Therefore, we transmitted up to 242 bytes with SF = 7, 125 bytes with SF = 8, 53 bytes with SF = 9, and 11 bytes with SF = 10. These frame sizes and SFs guarantee that the transmission time is less than 400 ms (https://www.thethingsnetwork.org/airtime-calculator, accessed on 14 December 2020). Furthermore, because each node transmitted data every 60 s and the maximum transmission time was 400 ms, we obtained a duty cycle of $(0.4/60.4) \times 100 = 0.662\%$ (obtained with SF = 7 and a frame size of 242 bytes), which is recommended for a fair use of the spectrum.
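The duty cycle figure above follows from simple arithmetic, sketched here in Python. The helper name `duty_cycle_percent` is hypothetical; the 0.4 s airtime and 60 s waiting period are the values stated in the text.

```python
def duty_cycle_percent(airtime_s: float, idle_s: float) -> float:
    """Share of each transmit/idle cycle spent on air, in percent."""
    return airtime_s / (airtime_s + idle_s) * 100.0


if __name__ == "__main__":
    # Worst case: 400 ms on air followed by a 60 s wait.
    print(f"{duty_cycle_percent(0.4, 60.0):.3f} %")
```

Because 0.4 s is the upper bound on airtime, the actual duty cycle for shorter frames is lower than this value.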
The fields in the dataset are described as follows. A statistical description of the numerical dataset fields is shown in Table 3. In addition, the empirical distributions of the most representative variables are depicted in Figure 2. These descriptions help us understand how the data is distributed. For instance, it can be noticed from Figure 2a that the SFs are uniformly distributed from 7 to 10 for ENs 1, 2, and 4; however, EN3 used only SF = 10 because the distribution of its SNR exhibited a mean of −15 dB (Figure 2j), which means that SFs of 7 to 9 are not large enough to demodulate the received signals (Table 2). To guarantee a uniform distribution of SF, we disabled the ADR scheme and controlled the SF manually. It also can be noticed that the carrier frequencies used are uniformly distributed overall (Figure 2c). Regarding the environmental variables, it can be noticed that the conditions correspond to tropical weather. For instance, temperatures ranged from 13.9 °C to 35.1 °C and were concentrated around 20 to 30 °C (Figure 2d). Furthermore, relative humidity was concentrated in high values, showing the common behavior in a tropical environment (Figure 2e). Moreover, particulate matter was concentrated in low values for EN4 because it is located inside a campus surrounded by a forest; however, there are two peaks at 28 and 50 µg/m³, which are caused by a rock mine near the campus (Figure 2g). In addition, it can be noticed that the distribution of the experimental path loss is Gaussian-bell-shaped as expected [12] (Figure 2h). Finally, the distributions of consumed energy for ENs 1, 2, and 4 are similar; nevertheless, EN3 has its energy concentrated around 0.1 J, which was caused by the fixed SF = 10 that guaranteed that the received signal could be demodulated.
Regarding the packet delivery rate (PDR) of each EN, we obtained 95.1%, 85.2%, 81.6%, and 86.35% for EN1, EN2, EN3, and EN4, respectively. These PDRs can be explained from the SNRs obtained for each EN, as depicted in Figure 2j. According to Table 2, varying the SF allows the signal power level to fall below the noise power level by up to 20 dB. Furthermore, as we will see in Section 4.4, we distributed the SF uniformly for each EN from 7 to 10. Therefore, we obtained PDRs according to the SNR of each EN. For instance, EN1 achieved SNRs over 0 dB, so a PDR of 95.1% is expected because most packets were delivered successfully. On the other hand, EN3 achieved the lowest PDR (81.6%) because its mean SNR is approximately −15 dB, so using a low SF can cause a loss of packets.
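The dataset also includes the effective signal power (ESP) metric, which is defined as the signal power at the receiver without including the noise power (Equation (3)), and the noise power $P_n$ (Equation (4)) [39]:
$$ESP = RSSI + SNR - 10\log_{10}\left(1 + 10^{0.1\,SNR}\right), \quad (3)$$
$$P_n = RSSI - 10\log_{10}\left(1 + 10^{0.1\,SNR}\right). \quad (4)$$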
The ESP and $P_n$ are relevant metrics by which to evaluate the quality of LoRaWAN radio links instead of RSSI and receiver sensitivity (traditionally, RSSI > Sensitivity + LM) because successful demodulation is achieved when the SNR exceeds the SNR limit by at least the link margin, i.e., $ESP_{(dB)} > P_{n(dB)} + SNR_{limit} + LM$, where $SNR_{limit}$ is negative for LoRaWAN (Table 2). The ESP and $P_n$ empirical distributions are depicted in Figure 3, where it can be noticed that the ESP is always under the $P_n$, which shows that LoRaWAN can withstand very adverse channel conditions.
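As a hedged illustration of how ESP and $P_n$ relate to the logged RSSI and SNR, the sketch below assumes the commonly used decomposition in which the RSSI is the sum of signal and noise powers, so that $ESP = RSSI + SNR - 10\log_{10}(1 + 10^{0.1\,SNR})$ and $P_n = ESP - SNR$. The function names are hypothetical.

```python
import math


def esp_dbm(rssi_dbm: float, snr_db: float) -> float:
    """Signal-only power, assuming RSSI measures signal plus noise."""
    return rssi_dbm + snr_db - 10 * math.log10(1 + 10 ** (0.1 * snr_db))


def noise_power_dbm(rssi_dbm: float, snr_db: float) -> float:
    """Noise-only power under the same assumption (equals ESP - SNR)."""
    return rssi_dbm - 10 * math.log10(1 + 10 ** (0.1 * snr_db))


if __name__ == "__main__":
    rssi, snr = -120.0, -15.0  # hypothetical reading from a distant node
    print(f"ESP = {esp_dbm(rssi, snr):.2f} dBm")
    print(f"P_n = {noise_power_dbm(rssi, snr):.2f} dBm")
```

For a strongly negative SNR, the RSSI is dominated by noise, so $P_n$ stays close to the RSSI while the ESP falls roughly |SNR| dB below it.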

Experimental Setup
We have deployed a LoRaWAN setup to collect the variables included in this dataset. The system architecture follows the model previously explained in Figure 1. In particular, we implemented the network shown in Figure 4. First, we deployed four ENs and one GW in different locations in the urban area of Medellín, Colombia, keeping LoS between each EN and the GW. Medellín is a medium-sized city with an area of 328 km² and about 4 million people. The city is located in the central part of the Andes Mountain Range, and its topography is a valley surrounded by mountains. The GW is connected to the Internet using an Ethernet connection. Once the GW receives a frame from an EN, it resends it to the selected NS. We used The Things Network (TTN), a widespread open-source NS. Because there are several LoRaWAN GWs in Medellín, we filtered the duplicated information in the NS to preserve only data from our GW. Subsequently, we enabled the MQTT broker provided by TTN and used a cloud-based MySQL database server to subscribe to the broker and get all the messages the NS receives. The following subsections describe each component in detail.

End Nodes
We selected the Pycom LoPy4 platform (https://pycom.io/product/lopy4/, accessed on 14 December 2022) to assemble and program the ENs because it meets the processing and communications needs, and previous approaches have used it [20]. The LoPy4 is a system-on-chip platform that embeds an Xtensa® dual-core 32-bit LX6 microcontroller and four IoT radios, including LoRaWAN, Sigfox, Wi-Fi, and Bluetooth. Regarding LoRaWAN, the platform includes an SX1276 radio for the ISM bands at 433, 868, and 915 MHz. In addition, the processor can communicate with the environmental sensors by using different buses like serial peripheral interface (SPI), inter-integrated circuit (I2C), and universal asynchronous receiver and transmitter (UART). Each EN also has a transmission antenna. EN1, EN2, and EN3 use an omnidirectional antenna (Mobile Mark ref. PSKN3-900 (https://www.mobilemark.com/product/pskn3-900-1900s/, accessed on 14 December 2022)) with a peak gain of 3 dBi. Because these antennas are omnidirectional, their mounting angle was considered to be 90°. In addition, EN4 uses a 4-element Yagi-Uda antenna (Pulse Larsen ref. YA6900W (https://www.tessco.com/product/890-960mhz-8dbi-4-elementyagi-antenna-57677, accessed on 14 December 2022)) with a peak gain of 8.8 dBi, which was mounted such that the boom was parallel to the ground, with the directors perpendicular to the ground. Figure 5 depicts the antennas' mounting positions. Furthermore, we performed an S11 and S21 analysis of the antennas and connectors by using a vector network analyzer (VNA). In particular, the S11 parameter delivers information on how much power is reflected from the antenna when a transmitter signal is supplied, so it allows designers to know how efficient the used antenna is. In addition, the S21 parameter measures how much power is transferred from port 1 to port 2 of the VNA, so it is used to determine attenuations/losses from cables and connectors. Carrying out these measurements guaranteed that the antenna gains and losses were accurate. An example of the S11 parametrization of our antennas is depicted in Figure 6.

Sensors
Each EN includes a set of sensors that captures weather variations. A brief description of each sensor is provided as follows.

• An Aosong DHT22 sensor for temperature (accuracy: ±0.5 °C) and relative humidity (accuracy: ±2%). Regarding temperature, this sensor includes a transistor-based transducer. Concerning relative humidity, it consists of a capacitive sensor. The sensor uses a one-wire communication protocol to send the current values to the LoPy4 microcontroller, lowering the errors in the analog-to-digital conversion.
• A Bosch BMP280 sensor for barometric pressure (accuracy: ±1 hPa). This sensor operates in a range of 300-1100 hPa, which is suitable for use in Medellín, Colombia because the barometric pressure is up to 900 hPa. This sensor embeds an I2C interface to communicate with the microcontroller, lowering the errors in analog-to-digital conversion.
• A Honeywell HPMA115S0 sensor for particulate matter PM2.5 (accuracy: ±15%). Its operation is based on laser scattering, which detects and counts particles with concentrations up to 1000 µg/m³. The sensor communications are based on the RS232 protocol. Again, this communication reduces the errors in analog-to-digital conversion.
• In addition, we included a Texas Instruments INA219 energy sensor (accuracy ±0.5%) in the printed circuit board to quantify the consumption under different radio configurations. It is based on a shunt resistor that can monitor voltages up to 26 VDC and currents up to 5 A. The sensor sends the information to the microcontroller by using the I2C protocol.
• We added a Stevenson screen to protect the sensors from rainy conditions without losing accuracy and to avoid possible sensor saturations.
We also designed and built the ENs' testbeds as shown in Figure 7. The electronics are inside an IP65 box to protect them from rain and condensation caused by high relative humidity.

Gateway
According to the architecture shown in Figure 4, the ENs must send the information to a GW that serves as a relay between the field data and the NS. Hence, we have selected, programmed, and deployed a Dragino LG308 GW (https://www.dragino.com/products/lora-lorawan-gateway/item/140-lg308.html, accessed on 14 December 2022) that incorporates two Semtech radios (SX1257 and SX1301). Both radios can demodulate signals with a power greater than −140 dBm in 10 different channels simultaneously. Each radio uses a panel antenna (Wilson Electronics ref. 311155 (https://www.wilsonamplifiers.com/one-additional-panel-antenna-kit-for-db-pro-311155-k1/, accessed on 14 December 2022)) with a peak gain of 4.4 dBi. Furthermore, we checked the S11 parameter of each antenna to get an accurate experimental path loss. Finally, the GW receives the ENs' frames via LoRaWAN radio frequency (RF) and resends them to the TTN via LoRaWAN IP.

Frame Configuration
Each EN packs the sensors' information in a binary frame. To transmit the readings as integers, we multiply the measurements by 10 before sending them, and then the NS divides them by 10, preserving one decimal place as shown in Table 4, so the corresponding frame length is 74 bits. Moreover, we added some dummy bits to enlarge the frame length (six different sizes), so the dataset includes the effects of this parameter on energy consumption. Furthermore, we used a BW of 125 kHz and iterated over various subbands in the ISM US915 band (903.9, 904.1, 904.3, 904.5, 904.7, 904.9, 905.1, and 905.3 MHz).
We also varied the SF with values of 7, 8, 9, and 10, according to the spectrum usage regulations. In summary, there are six frame configurations, eight subbands, and four SFs, so we transmit in 192 different radio configurations. Once the GW resends the information to TTN, the latter runs a payload formatter where the sensors' readings are decoded, and the RSSI, ToA, and SNR are added to the final payload. Finally, TTN exposes an MQTT broker where the payloads are published, so we used a cloud-based server as an MQTT subscriber to get the messages and store them in a MySQL database.
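The count of 192 radio configurations and the scale-by-10 encoding can be sketched as follows. This is an illustration under assumptions rather than the firmware used in the campaign: the frame sizes are represented by placeholder indices because the six actual sizes are not listed here, the last subband is assumed to be 905.3 MHz following the 200 kHz channel spacing, and the helper names are hypothetical.

```python
from itertools import product

# Subbands in MHz; the last entry (905.3) is assumed from the 200 kHz spacing.
SUBBANDS_MHZ = [903.9, 904.1, 904.3, 904.5, 904.7, 904.9, 905.1, 905.3]
SPREADING_FACTORS = [7, 8, 9, 10]
FRAME_SIZE_IDS = range(6)  # placeholders for the six frame lengths


def encode_tenths(value: float) -> int:
    """Scale a reading by 10 and round, as done on the EN before sending."""
    return round(value * 10)


def decode_tenths(raw: int) -> float:
    """Undo the scaling on the server side, keeping one decimal place."""
    return raw / 10


if __name__ == "__main__":
    configs = list(product(FRAME_SIZE_IDS, SUBBANDS_MHZ, SPREADING_FACTORS))
    print(len(configs))                        # number of radio configurations
    print(decode_tenths(encode_tenths(23.46)))  # reading after a round trip
```

Sending integers in tenths keeps the frame compact while bounding the quantization error to 0.05 units, which is below the stated accuracy of the sensors.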

Network Deployment
The ENs and the GW were deployed in the northern part of Medellín, where there is a valley with two mountains on each side, as shown in Figure 8. This topography allowed us to guarantee LoS conditions between each EN and the GW and avoid the obstruction of the Fresnel zones, as shown in Figure 9 and as analyzed subsequently. The radius of the $n$th Fresnel zone at the midpoint of the link can be calculated by
$$r_n = \frac{1}{2}\sqrt{n \lambda d}, \quad (5)$$
where $r_n$ is the radius of the $n$th Fresnel zone, $n$ is the considered Fresnel zone, $\lambda$ is the signal wavelength, and $d$ is the distance between the transmitter and the receiver. To establish successful links, it is recommended that the first Fresnel zone (F1) is clear by more than 60% [12]. In our case, $d$ is 2.11 km, 3.42 km, 5.32 km, and 8.21 km for EN1, EN2, EN3, and EN4, respectively, and $\lambda$ = 0.33 m, so the F1 radii are 13.25 m, 16.87 m, 21.04 m, and 26.14 m. Because the worst-case clearances of our links are 3.9, 4.3, 3.2, and 2.2 times the F1 radius (Figure 9), we exceed the minimum clearance of 60%, so our links can be modeled with LoS criteria. Thus, it can be concluded that the provided dataset is not dependent on the geometry of the terrain and can be used in other locations with similar weather characteristics. The coordinates of each device in the network and its corresponding antenna height (h) and altitude (Alt) are shown in Table 5.
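The F1 radii quoted above can be checked with a short calculation. The sketch assumes the midpoint form $r_n = \frac{1}{2}\sqrt{n \lambda d}$, which reproduces the reported values to within a few centimeters, and a 900 MHz carrier for the wavelength; the function name is hypothetical.

```python
import math


def fresnel_radius_mid(n: int, wavelength_m: float, distance_m: float) -> float:
    """Radius of the n-th Fresnel zone at the midpoint of a link, in meters."""
    return 0.5 * math.sqrt(n * wavelength_m * distance_m)


if __name__ == "__main__":
    wavelength = 3e8 / 9e8  # about 0.33 m
    links_m = {"EN1": 2110, "EN2": 3420, "EN3": 5320, "EN4": 8210}
    for name, d in links_m.items():
        r1 = fresnel_radius_mid(1, wavelength, d)
        # 60% of F1 is the usual minimum clearance for a LoS link.
        print(f"{name}: F1 = {r1:.2f} m, 60% clearance = {0.6 * r1:.2f} m")
```

The midpoint is where the Fresnel ellipsoid is widest, so clearing it there is the most demanding case along the path for a given obstacle height.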

Path Loss and Shadowing Modelling
This section describes how this dataset can be used for path loss and shadowing modelling. Whereas path loss is caused by the attenuation of the power radiated by the transmitter and the channel effects, shadowing is caused by wave phenomena like absorption, reflection, scattering, and diffraction [12]. Both effects are shown in Figure 10. We fit an LDPLM and a multiple linear regression (MLR) that includes the environmental variables. Furthermore, we compare both approaches based on the RMSE and the coefficient of determination $R^2$. Of course, different models can be used for path loss modelling [12][13][14], so this procedure just serves as an example of how to use this dataset. Moreover, some machine learning techniques can also be used to estimate path loss and shadowing [40].
Figure 10. Theoretical behavior of path loss and shadowing. The dissipation caused by the channel, i.e., the ratio between the transmitted power $P_t$ and the received power $P_r$, changes according to the logarithm of the distance $d$ [12].

Data Preparation
To develop the CPLS models that can be seen as application examples of our dataset, we followed the process depicted in Figure 11. After collecting the in-field measurements, we processed the outliers using the Mahalanobis distance $M_d$ [41]. This distance measures how far an observation is from the dataset's distribution; in addition, it is not affected by the scales of the predictor variables and considers the covariance between them. According to [41], a row is considered an outlier if $M_d$ is greater than the tabulated value of the $\chi^2$ distribution with $n - 1$ degrees of freedom and with a significance level $p$ = 0.001, where $n$ is the number of variables. Because the variables that could be used from the dataset for modeling purposes are distance, frequency, SF, frame length, temperature, relative humidity, barometric pressure, particulate matter, time on air, energy, and experimental path loss, we have $n$ = 11 variables. Thus, the tabulated value of the $\chi^2$ distribution with 10 degrees of freedom and a $p$-value = 0.001 is 29.59. Then, we calculated $M_d$ for all the observations and removed those whose value was greater than 29.59.
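For reference, the outlier rule described above can be written compactly as follows. The symbols $\mathbf{x}$ (one observation), $\boldsymbol{\mu}$ (vector of sample means), and $\boldsymbol{\Sigma}$ (sample covariance matrix) are introduced here for illustration. Note that many references apply the $\chi^2$ threshold to the squared distance $M_d^2$; the text does not state which form was used, so the inequality below simply mirrors the wording of this section.

```latex
% Mahalanobis distance of an observation x from the dataset's distribution
M_d(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top}
                  \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}

% Outlier rule with n = 11 variables, i.e., n - 1 = 10 degrees of freedom
\mathbf{x} \text{ is an outlier if } M_d(\mathbf{x}) > \chi^2_{p = 0.001,\; 10} = 29.59
```

Because $\boldsymbol{\Sigma}^{-1}$ rescales and decorrelates the variables, a reading in hPa and a reading in °C contribute on a comparable footing, which is why the distance is insensitive to the variables' scales.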
In the second step, we split the dataset into two subsets: one for training and one for testing. The idea behind this division is to adjust the models' parameters with the training subset and assess the model's ability to perform accurate predictions with new and unseen data (the test subset). In our case, we divided the whole dataset into 80% of the rows for training and 20% for testing, as commonly recommended in the literature [42] and in previous path loss model approaches [25,26]. We divided the rows into both subsets by using a random split with a fixed seed, performed with the caTools library in RStudio, to ensure the reproducibility of the results. We decided to use a uniformly distributed random split to correctly capture the CPLS changes caused by the environmental variables, which are very diverse in tropical regions like Colombia. Furthermore, using a uniformly distributed random split does not alter the distributions of the predictor variables. The relation between the training subset, the testing subset, and the whole data frame in time is depicted in Figure 12, where it can be noticed that the subsets' proportions are the same for each time interval.
In the third step, we divided our training subset into five folds to carry out k-fold cross-validation. This process helps reduce the risk of overfitting when calculating the models' parameters [42]. The procedure is as follows: (i) divide the training subset into folds a, b, c, d, and e, (ii) calculate the models' parameters with folds a, b, c, and d, and calculate the RMSE with fold e, (iii) repeat step (ii), rotating the fold that is held out, and (iv) calculate the average and the standard deviation of the RMSE over all the iterations. If the obtained model weights achieve a low average RMSE and a low standard deviation, the model is considered to generalize well. Finally, the fourth step applies the obtained model to unseen data, i.e., the testing set, and calculates the RMSE and R².
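Steps (i) to (iv) can be sketched generically in Python. The function names, the interleaved fold assignment, and the mean-only baseline model used in the example are illustrative assumptions; the rows are assumed to be already shuffled, as they would be after a random split.

```python
import math
import statistics

def k_fold_rmse(rows, fit, predict, k=5):
    """Return (mean RMSE, standard deviation of the RMSE) over k folds.

    rows: list of (x, y) pairs; fit(train_rows) -> model; predict(model, x) -> y_hat.
    """
    folds = [rows[i::k] for i in range(k)]  # interleaved folds of near-equal size
    scores = []
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit(training)
        squared_errors = [(y - predict(model, x)) ** 2 for x, y in validation]
        scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return statistics.mean(scores), statistics.stdev(scores)

def fit_mean(train_rows):
    """Baseline model: always predict the mean target of the training rows."""
    return sum(y for _, y in train_rows) / len(train_rows)

def predict_mean(model, x):
    """The baseline ignores its input and returns the stored mean."""
    return model
```

Any pair of `fit` and `predict` callables can be plugged in, so the same loop would serve both a log-distance fit and a multiple linear regression.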

Log-Distance Combined Path Loss and Shadowing Model
A standard method to fit a combined path loss and shadowing (CPLS) model from empirical data is the LDPLM (Equation (6)), where K is a dimensionless constant that depends on the characteristics of the antennas and the average channel attenuation, d is the distance, d_0 is the far-field distance, γ is the path loss exponent, and ψ is a lognormally distributed random variable that characterizes shadowing. Its probability density function (PDF) is presented in Equation (7), where ξ = 10/ln(10), σ_ψ is the standard deviation of ψ, and µ_ψ is the mean of ψ. The CPLS has a LoS component attributed to the path loss and a stochastic component associated with the shadowing phenomenon, characterized by Equation (7). The LDPLM is treated as a linear regression model in which the input feature is the logarithm of the distance, K is the intercept, and ψ is the residual error, i.e., the difference between the actual measurements and the model's predictions. The LDPLM in Equation (6) can therefore be fitted by least-squares optimization, as expressed in Equation (8).

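$$\frac{\partial}{\partial\gamma}\sum_{i=1}^{n}\left[M_i(d_i)-\widehat{PL}(d_i)\right]^{2}=0, \qquad (8)$$
where M_i(d_i) corresponds to the i-th field measurement at the distance d_i, \widehat{PL}(d_i) is the LDPLM prediction of Equation (6) at that distance, and n is the number of measurements. The far-field distance d_0 is calculated by
$$d_0 \geq \frac{2D^{2}}{\lambda}, \qquad (9)$$
where D = 25 cm is the largest antenna dimension and λ = (3 × 10^8)/(9 × 10^8) = 0.33 m is the wavelength, so d_0 ≥ 0.375 m. We therefore adopted a far-field distance d_0 = 1 m.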
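The far-field bound can be checked numerically with the Fraunhofer criterion 2D²/λ, which reproduces the 0.375 m quoted in the text. The function and constant names below are illustrative.

```python
SPEED_OF_LIGHT = 3e8  # m/s, rounded as in the text

def fraunhofer_distance(largest_dimension_m, frequency_hz):
    """Minimum far-field distance 2 * D^2 / lambda, in metres."""
    wavelength = SPEED_OF_LIGHT / frequency_hz
    return 2 * largest_dimension_m ** 2 / wavelength

# D = 25 cm and a 900 MHz carrier give a bound of about 0.375 m,
# so the adopted reference distance d0 = 1 m lies in the far field.
d0_min = fraunhofer_distance(0.25, 9e8)
```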
After optimizing the model weights using our training subset, we found a path loss exponent γ = 2.739 and K = 1.75. This exponent agrees with the typical interval for urban microcells [12]. Then, for both the training and test sets, (i) we computed the LDPLM values, and (ii) we obtained the RMSE, the standard deviation s, and the R² (Table 6). The RMSE can be interpreted as the ability of the model to forecast the PL component, the standard deviation s captures the ability of the model to preserve the same error when a different training subset is used, and the R² can be associated with shadowing. The RMSE of the LDPLM is similar for both subsets, so the fitted model can forecast the PL with an RMSE of 2.46 dB. Likewise, the R² for both subsets was 0.85, so the model explains 85% of the shadowing variability.
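A least-squares fit of a log-distance model, together with the RMSE and R² metrics, can be sketched as below. The code assumes the model takes the form PL = K + 10·γ·log10(d/d_0) with K acting as the intercept in dB; this form, the function names, and the closed-form simple-regression solution are assumptions made for illustration, not the authors' implementation.

```python
import math

def fit_ldplm(distances_m, path_losses_db, d0=1.0):
    """Ordinary least squares for PL = K + 10 * gamma * log10(d / d0)."""
    x = [10 * math.log10(d / d0) for d in distances_m]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(path_losses_db) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, path_losses_db))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    gamma = sxy / sxx               # slope: path loss exponent
    k = y_mean - gamma * x_mean     # intercept
    return k, gamma

def predict_ldplm(k, gamma, distance_m, d0=1.0):
    """Path loss in dB predicted by the fitted log-distance model."""
    return k + 10 * gamma * math.log10(distance_m / d0)

def rmse_and_r2(measured, predicted):
    """Root-mean-square error and coefficient of determination."""
    n = len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    mean = sum(measured) / n
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return math.sqrt(ss_res / n), 1 - ss_res / ss_tot
```

With d_0 = 1 m the ratio d/d_0 reduces to the distance in metres, so the regressor is simply 10·log10(d).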

Multiple Linear Regression Model
Because previous works have shown that environmental variables affect the CPLS, namely temperature [30], relative humidity [30], barometric pressure [43], particulate matter [33], and SNR [29], we proposed an enhancement of the LDPLM that considers the log-distance term together with the effects of these variables, as shown in Equation (10):
$$PL = \beta_0 + 10\gamma\log_{10}(d) + 20\log_{10}(f) + \beta_1 T + \beta_2 RH + \beta_3 BP + \beta_4 PM + \beta_5 SNR + \psi, \qquad (10)$$
where β_0 is the model intercept (equivalent to K in Equation (6)), γ is the path loss exponent, β_i, i ∈ {1, …, 5}, are the model predictors, f is the frequency (Hz), T is the temperature (°C), RH is the relative humidity (%), BP is the barometric pressure (hPa), PM is the particulate matter (µg/m³), SNR is the signal-to-noise ratio (the received signal power divided by the noise floor power level, in dB), and ψ is the shadow fading term expressed in Equation (6). As in Equation (6), the path loss exponent is multiplied by ten, assuming a far-field distance d_0 = 1 m. The frequency weight is fixed to 20, following the Friis model [13]. To fit the MLR model, we applied least-squares optimization by differentiating the set of equations in (11):
$$\frac{\partial}{\partial\theta_k}\sum_{i=1}^{n}\left[C_i - K - 10\gamma\log_{10}(d_i) - \beta_1 T_i - \beta_2 RH_i - \beta_3 BP_i - \beta_4 PM_i - \beta_5 SNR_i\right]^{2} = 0, \qquad (11)$$
where K is the model intercept (β_0 in Equation (10)), θ_k ∈ {β_1, …, β_5, γ} are the model predictors, C_i = M_i − 20 log10(f_i), M_i is the i-th experimental path loss observation, n is the number of observations, and d_i, f_i, T_i, RH_i, BP_i, PM_i, and SNR_i are the corresponding distance, frequency, temperature, relative humidity, barometric pressure, particulate matter, and SNR for the i-th experimental path loss observation.
The model coefficients found after the optimization are listed in Table 7. They can be explained as follows: (i) as expected, the CPLS increases with distance, in line with the predictions of theoretical models like Friis [13] or two-ray [12]; furthermore, γ = 2.203 agrees with the empirical values usually found for microcells [12], (ii) CPLS is directly proportional to the changes in temperature [30], (iii) relative humidity is also directly proportional to path loss [30], (iv) because a high barometric pressure causes a high water vapor concentration, the effect of barometric pressure on CPLS is proportional, causing signal attenuation [32], and (v) the SNR has a negative impact on path loss, that is, when the SNR is worse, the path loss increases [29].
We obtained an RMSE for the test set of 1.84 dB and an R² equal to 0.9177, outperforming the values obtained with the LDPLM. Remarkably, the R² increases from 0.85 (LDPLM) to 0.917 (MLR), exhibiting a better behavior regarding shadowing, so the inclusion of environmental variables helps improve the accuracy of the CPLS model. This can be verified in Figure 13, where we depict the histograms of the prediction errors of the LDPLM and MLR CPLS models: (i) for the LDPLM, the prediction error is not normally distributed, and (ii) for the MLR model, the prediction error is normally distributed, which fixes the normality problem, and the errors are more concentrated around the mean. Furthermore, Figure 14 shows that the values predicted by the MLR CPLS model are more concentrated around the regression line because this method exhibits a higher R². Finally, we depict the error distributions of each model with respect to distance in the box plots of Figure 15, where it can be seen that distance does not affect the relative errors of the models, which are around 1.5%. The accuracy gained by including the environmental variables in the CPLS models can be exploited for energy reduction through TPC strategies and for localization or tracking tasks [29].
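The least-squares fit of the MLR model can be sketched by solving the normal equations. The sketch assumes the model has the form described in the text (intercept, 10·γ·log10(d), a fixed 20·log10(f) term, and linear terms in T, RH, BP, PM, and SNR); the dictionary keys, function names, and pure-Python solver are hypothetical choices, and the intercept is fitted jointly with the other weights.

```python
import math

PREDICTORS = ("T", "RH", "BP", "PM", "SNR")  # hypothetical column names

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, p):
            factor = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, p))
        x[r] = (m[r][p] - tail) / m[r][r]
    return x

def fit_mlr(rows):
    """rows: dicts with keys pl (dB), d (m), f (Hz), T, RH, BP, PM, SNR.

    Returns (K, gamma, [beta_1, ..., beta_5]).
    """
    design, target = [], []
    for r in rows:
        design.append([1.0, 10 * math.log10(r["d"])] + [r[name] for name in PREDICTORS])
        target.append(r["pl"] - 20 * math.log10(r["f"]))  # C_i: frequency term removed
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * c for row, c in zip(design, target)) for a in range(p)]
    k, gamma, *betas = solve(xtx, xty)
    return k, gamma, betas

def predict_mlr(k, gamma, betas, r):
    """Path loss in dB predicted for one observation r."""
    env = sum(b * r[name] for b, name in zip(betas, PREDICTORS))
    return k + 10 * gamma * math.log10(r["d"]) + 20 * math.log10(r["f"]) + env
```

With raw field values (for example, barometric pressure in hPa next to an intercept column), the normal equations can become ill-conditioned; a QR-based or library least-squares routine is likely a safer choice in practice.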

Conclusions
This paper described an empirical dataset of different variables in a LoRaWAN deployment, including the collection methodology and possible applications. Our dataset contains the information of four ENs and one GW in Medellín, Colombia, in an urban environment, collected over four months. In addition to the traditional variables considered in other measurement campaigns, our experimental setup included sensors for measuring temperature, relative humidity, barometric pressure, and particulate matter (PM2.5). These variables can be used to fit accurate path loss and shadowing models that can be applied to different tasks like TPC or positioning. We also showed how our dataset can be used for CPLS modeling with LDPLM and MLR models, indicating that the inclusion of the environmental variables improves the forecast of shadowing over the traditional LDPLM, raising the R² from 0.85 to 0.917. Furthermore, our dataset also includes the energy consumption of each transmission, the SF, the SNR, and the frame size, so it can be used to study energy behavior under different transmitter configurations.

Figure 2. Empirical distributions of the collected variables per node. In (a), we depict the SF usage from 7 to 10. In (b), we show the number of collected samples at each distance. In (c), we show the number of collected samples at each frequency. In (d-j), we depict the experimental distributions per node for temperature, relative humidity, barometric pressure, particulate matter, path loss, energy, and SNR, respectively.

Figure 3. ESP and noise power of ENs.

Figure 4. LoRaWAN deployment to capture the information contained in the dataset.

Figure 5. Mounting angles of the ENs' antennas. The omnidirectional antennas were mounted perpendicular to the ground. The Yagi-Uda antenna was mounted with the boom parallel to the ground.

Figure 6. S11 characterization for an EN. It can be noticed that the antennas used have acceptable behavior regarding the S11 parameter, with returns of less than 3%.

Figure 9. Line of sight for the communication between the GW and the ENs. Credits: Radio Mobile.

Figure 10. Theoretical behavior of path loss and shadowing. The dissipation caused by the channel, i.e., the ratio between the transmitted power P_t and the received power P_r, changes according to the logarithm of the distance d [12].
Figure 11. Model fitting process. We (i) collected and cleaned the database, (ii) divided the dataset into training and testing, (iii) trained the models, and (iv) evaluated the models' performances.

Figure 13. Shadowing distribution using the LDPLM and MLR models. It can be noticed that the errors are more concentrated around the mean when using the MLR model, and the distribution is bell-shaped.

Figure 14. Predicted versus measured CPLS for LDPLM and MLR models. It can be noticed that the MLR CPLS predictions are more concentrated around the regression line than those predicted by the LDPLM CPLS model.
1. index: Sequential number that identifies the corresponding observation.
2. timestamp: Date and time mark of the current observation. It is in format yyyy-mm-dd hh:mm:ss.
grx: Receiver (GW) antenna gain (characterized with a vector network analyzer), in dBi. The measured gain was 4.161 dBi.
12. frequency: Carrier frequency, in Hz. The experiments were performed in the US902-928 ISM band.
13. frame_length: Number of bytes of the current transmission's payload.
14. temperature:

Table 3. Statistical description of the dataset fields.

Table 4. Frame configuration sent from the ENs to the GW.

Table 6. Errors of fitted models.

Table 7. Model predictors for the MLR CPLS model.