K-Means Spreading Factor Allocation for Large-Scale LoRa Networks

Low-power wide-area networks (LPWANs) are emerging rapidly as a fundamental Internet of Things (IoT) technology because of their low-power consumption, long-range connectivity, and ability to support massive numbers of users. With its high growth rate, Long-Range (LoRa) is becoming the most adopted LPWAN technology. This research work contributes to the problem of LoRa spreading factor (SF) allocation by proposing an algorithm on the basis of K-means clustering. We assess the network performance considering the outage probabilities of a large-scale unconfirmed-mode class-A LoRa Wide Area Network (LoRaWAN) model, without retransmissions. The proposed algorithm allows for different user distribution over SFs, thus rendering SF allocation flexible. Such distribution translates into network parameters that are application dependent. Simulation results consider different network scenarios and realistic parameters to illustrate how the distance from the gateway and the number of nodes in each SF affects transmission reliability. Theoretical and simulation results show that our SF allocation approach improves the network’s average coverage probability up to 5 percentage points when compared to the baseline model. Moreover, our results show a fairer network operation where the performance difference between the best- and worst-case nodes is significantly reduced. This happens because our method seeks to equalize the usage of each SF. We show that the worst-case performance in one deployment scenario can be enhanced by 1.53 times.


Introduction
The Internet of Things (IoT) is the integration of modern electronic devices, smart sensors, internet protocols, and wireless communications technologies. IoT applications are rapidly gaining popularity in many domains such as industrial operations, smart parking, augmented maps, healthcare, smart cars, and smart homes [1][2][3][4][5]. According to a Gartner Inc. report, there will be around 26 billion IoT devices deployed worldwide by 2020 [6]. In the Statista report, it is predicted that there will be over 75 billion IoT devices worldwide by 2025 [7].
In the modern era, the spectacular growth and transformation of wireless connectivity are driven by the IoT paradigm, with technologies having attributes of large-scale network infrastructure with low-cost sensors connected to the Internet. In this context, low-power wide-area networks (LPWANs) are quite popular in terms of prototypes, standards, and on the commercial level because of their Unlike Sigfox, LoRa can be deployed locally, i.e., without the need for a cellular infrastructure, and has higher bit rates. By contrast, NB-IoT is an expensive technology having the pros of low latency and high quality of service (QoS) [26]. In [27], the authors compare different LPWAN technologies (Bluetooth, ZigBee, SigFox, and LoRa) and discuss LoRa with respect to code rate (CR), bandwidth (BW), and SF but without considering the influence of Rayleigh fading and path loss attenuation. Theoretical and simulation results show that SF, BW, and CR influence the ToA of a packet. Larger SFs and CRs result in higher ToA of LoRa packets. Conversely, ToA reduces with larger bandwidths.
The work in [28] proposes two different algorithms named EXPLoRa-SF and EXPLoRa-AT and shows in simulation results that these algorithms perform considerably better than the LoRaWAN adaptive rate strategy (ADR). EXPLoRa-AT delivers higher bit rates in the event of higher traffic loads, while EXPLoRa-SF allocates SFs at the different subgroups of end-devices depending on the received signal strength indicator (RSSI). The results demonstrate that the data extraction rate (DER) drops dramatically for higher SFs and larger numbers of end-devices. The authors, however, assume a short range and dense network in their analysis.
The EXPLoRa approach is further extended in [29], K-means is applied to identify the non-circular crowded region, and all the nodes inside that area are assumed to have same SF. On the other hand, in the proposed work the geometry of network is circular, with six annuluses representing the range of individual SFs. We have analyzed the scalability and the performance of the uplink LoRa model considering Rayleigh fading, connection H 1 , capture Q 1 , and coverage probabilities H 1 Q 1 in the presence of interfering signals using the same SF. The considerations of H 1 and H 1 Q 1 are missing in [28,29]. Moreover, in our model, we consider a dense and wide network (radius of several kilometers) and analyze the performance by considering the maximum distance of individual SF boundaries from the gateway.
Another scientific study used K-means for the classification of end-devices into three groups based on traffic characteristics with different priorities. The grouping of end-devices was computed in terms of priority-based transmission instead of SF allocation [30].
In [31,32], SF distribution is mainly based on the power level of the signals that the gateway receives from the end-devices and gateway sensitivity, without considering the location of end-devices. As a drawback, SF allocation was disturbed because of high-density buildings, and 53.2% of the end-devices were forced to use SF12. Furthermore, in [28][29][30][31], only network-level simulators such as ns-3 and LoRaSim are used, which abstracts some characteristics of the physical layer that are incorporated in our analysis. Conversely, our study evaluated the performance of the proposed SF allocation algorithm considering the analytical model, realistic parameters, and averaging over 10 5 random deployment of the Poisson point process (PPP) by Monte Carlo computer simulations, which match with the theoretical results.
The tree-based spreading factor clustering algorithm (TSCA) for SF allocation in multihop LoRA networks is introduced in [33]. This approach offloads the data traffic in many sub-networks, which are linked to a sink node assigning a specific SF according to network clustering, thus enabling parallel frame transmission with multiple SFs. The authors show that TSCA increases the network performance in a network with rectangular geometry.
A single gateway uplink model considering path loss attenuation and Rayleigh fading is designed in [19], utilizing stochastic geometry to model network interference and then disconnection and collision probabilities. Such a model is further extended in [20], in which the authors propose a scheme that considers message replication and gateways with multiple receive antennas/decoders to attain time and spatial diversity. They demonstrate that the number of users and traffic density directly affects the performance of the LoRa network and that sending multiple message copies is beneficial for low-density networks. Both of these studies adopt equal radius SF allocation approaches. Unlike [19,20], our work considers K-means-based fair SF allocation of nodes in LoRa networks.
Recently, several studies have addressed the problems associated with automatic repeat request (ARQ) and contributed to downlink reliability in LoRaWAN applications. The sequential transmission of downlink frames, saturation of duty cycle, and half-duplex nature of LoRa gateway radios are marked as the major shortcomings for the downlink transmission [34,35]. Furthermore, these works also highlight the significance of gateway selection algorithm to prevent traffic losses due to sequential transmission of downlink frames and duty cycle limitations.
One experimental study evaluates the performance of a LoRa network at a 125 kHz bandwidth and SF7 for a sailing monitoring model, and the measurements show a 60.49% packet loss at the maximum distance of 3284 m [36]. Another LoRaWAN-based indoor environment monitoring system composed of 331 sensor nodes is deployed at the University of Oulu, where the gateway is installed at a distance of ∼180 m and 24 m above the ground [37]. The measurements performed at SF7 show a maximum 11.33% packet error rate (PER), which can be due to co-spreading factor interference because all 331 end-devices use the same SF. As illustrated in [19], nodes using the same SF face co-spreading factor interference. The motivation behind our work is to propose a suitable SF allocation algorithm for a large-scale LoRa network to efficiently utilize the different data rates. To enhance SF allocation, we propose a novel algorithm, based on the machine learning technique called K-means clustering, for effectively allocating the SFs.

System Model
We considerN uniformly distributed smart devices inside an uplink class-A LoRaWAN network without retransmissions, utilizing a single channel within radio range of R km and a circular area of V = πR 2 around a single gateway. Figure 1 illustrates a deployment withN = 500 and R = 3 km. The gateway is at the origin, and nodes are distributed uniformly in V = 28.26 km 2 . Note that such model captures the characteristics of telemetry applications such as those in smart cities and smart buildings. For instance, the University of Oulu Smart Campus has a LoRaWAN network constantly monitoring several sensors such as temperature, luminosity, and CO 2 [37].  The LoRa modulation bit rate is defined as [14] where 4 4+CR is the effective coding rate, ranging from 4 5 to 4 8 , while CR denotes the LoRa coding rate configuration, varying from 1 to 4. In our work, we assume CR = 1, and the LoRa uplink channel aggregated bit rate is expressed as bitrate U = ∑ 12 i=7 Rb i = 12.17 kbps. For instance, Table 1 shows the characteristics of 9 byte LoRa packets with explicit header and CRC modes enabled and BW = 125 kHz.

Uplink Outage Probability
The uplink transmission of nodes is based on the ALOHA protocol, and the probability of collision in ALOHA networks is high when many stations are connected [38]. In LoRa, simultaneous signals of different SFs are quasi-orthogonal because the inter-SF rejection gain varies from 16 to 36 dB [39]. Therefore, for the sake of simplicity, our work does not inspect inter-SF interference and focuses on co-SF interference only, which is stronger.
In this paper, the uplink model includes the influence of Rayleigh fading and path loss attenuation as the baseline model [19] for performance analysis, where g d k = λ 4πd k η is the path loss attenuation function, η ≥ 2 is the path loss exponent, λ is the wavelength, and h k is the fading in the link between the k-th node and the gateway. Let us consider the transmitted signal of a single LoRa node s 1 t to examine the impact of co-SF interference originated due to simultaneous transmission of nodes with same SF. The mathematical expression of the received signal at the gateway can be expressed as where n t is additive white Gaussian noise with zero mean and variance N = −174 + NF + 10log 10 (BW) dBm, NF is the noise figure of the receiver, and −174 dBm/Hz is the thermal noise spectral density constant. We consider that an outage of the received signal in an uplink channel can take place in the two scenarios [19]. First, if the signal-to-noise ratio (SNR) of the received packet is less than the SF specific threshold q SF , then the node is considered disconnected. Second, if the signal-to-interference ratio (SIR) between the target-received packet and any other concurrent signals of the same SF and frequency channel is less than 6 dB, then it is considered as a collision.

Outage Condition I
The distance of the end-device to the gateway in a wireless transmission domain is crucial.
The instantaneous SNR can be expressed as SNR = P 1 |h 1 | 2 g(d i ) N , where P 1 is the transmit power of end-device 1 in mW and |h 1 | 2 is the squared envelop of the channel coefficient. Communication is only possible when the SNR of the received signal at the gateway is less than the reception threshold q SF . Thus, the first outage condition, the connection probability, is defined as [19] where d 1 (in meters) is the distance of the desired end-device from the gateway.

Outage Condition II
A collision in LoRa end-device transmission takes place if the SIR of the desired signal with respect to interference from the same SF and frequency channel is less than 6 dB, i.e., if the desired signal is at least four times stronger than the interference. We model this outage condition based on [19], where interference is approached by considering the strongest interfering device. According to [19], the highest interference comes from the end-device k * .
The probability that no collision occurs or that the strongest interfering signal is at least 6 dB below the desired one, termed the capture probability, is The probability above depends on the distribution of X k * = |h k * | 2 g(d k * ). The cumulative distribution function (CDF) of X k * is derived in [19] and is denoted as F X k * . Thus, Moreover, in [19] the authors present an approximation for (5) that is only accurate at the edges of each annulus. This paper considers only the exact probability in (5).

Coverage Probability
The probability that defines whether a selected end-device is in coverage and can successfully communicate with the gateway is termed the coverage probability. It is the product of H 1 and Q 1 . The average coverage probability ℘ c can be achieved by deconditioning the location of the individual node by averaging over the network coverage area V = πR 2 , i.e., [19] The average coverage probability of a individual SF annulus is also inspected. It indicates the probability of an end-device at distance d 1 in the annulus i by considering the connection and capture probabilities and is defined as [20] where l i+1 is the radius of the outer circle and l i is the radius of the inner circle of the ith annulus.

Proposed SF Allocation Algorithm
In this paper, we propose an SF allocation algorithm, i.e., an algorithm to define the range of each SF annulus. Our solution uses the K-means machine learning algorithm [40], used in the process of vector quantization in data mining by clustering. It is a non-deterministic, numerical, and iterative approach. The main objective of the K-means algorithm is to find the minimum cost function, defined as the distance between each point in the data set and its nearest centroid. The distance between the cluster centers and data elements typically assumes the Euclidean distance. K-means clustering method can efficiently achieve robust clustering results when dealing with large data sets. The K-means algorithm first arbitrarily chooses K points from the data set, which indicate the initial centroids. The remaining points are then clustered to the closest centroid, and the coordinates of centroids are recalculated, iteratively, until the cost function converges.
Consequently, it is important to choose the appropriate number of centroids during the initialization procedure because the area of each annulus π(l 2 i+1 − l 2 i ) increases towards the higher SFs in a strategy based on equal distance steps per SF, which results in the growth of node density due to uniform distribution. That is why it is essential to select the sequences for K-means iterations that can provide larger values of K clusters for higher SFs. In order to avoid an extensive number of nodes in an individual SF, there should be a fair difference between the inner (l i ) and outer (l i+1 ) radii of annulus. In the proposed work, the annulus area is directly dependent on the difference between the K clusters for two consecutive iterations.
In our approach, we use five iterations of K-means. We start by computing the boundaries of the outermost SF ring, SF12, and then proceed to define the inner boundaries for lower SFs. For each iteration, K clusters are selected to develop the centroids of end-devices in the LoRa network covered by a single gateway. Four different mathematical sequences listed in Table 2-a Fibonacci series, square numbers, Wythoff array, and arithmetic series-are used to assign the values of K for each iteration.  16 11 In our work, K-means operates iteratively. Each iteration defines the set of nodes at the outer SF ring. In each iteration, the algorithm seeks the set of K centroids C that minimizes the average of the distances between any node and its closest centroid, i.e., where ED k is the set of devices at the k-th iteration, X i is a device in ED k , and C k is the closest centroid of X i . The function dist(x 1 , x 2 ) computes the Euclidean distance between x 1 and x 2 . This procedure returns the collection of K centroids of network nodes, whereas C = {C 1 , . . . C K }. After computing the centroids, the algorithm determines the boundary of C, so that [C x , C y ] = boundary(C), which determines the 2-D vector of border points around the Cartesian coordinates of the centroids. Then, it separates the nodes that are inside of the centroid boundary, forming the set I = [I x , I y ], where I x and I y are vectors storing the coordinates of the inner nodes in each of the Cartesian dimensions.
In the next step, the maximum absolute value of each dimension of I is calculated to set the radius as l i = max|I x |+max|I y | 2 , which defines the limit of the SF ring i. The procedure repeats to determine the boundaries of the remaining SF rings (l 5 , . . . l 1 ).
The steps involved in the proposed SF allocation technique are described in Algorithm 1. It repeats the process five times to allocate nodes for SF12-SF8.At the end, the remaining nodes use SF7. Initially, the SF12 outer limit is set to the network radius. In each iteration, the number of clusters is assigned to K depending on the chosen mathematical series (as mentioned in Table 2). For the first iteration, the algorithm considers all of theN nodes inside the set ED. In line 4, it computes the K-means of ED, which returns the centroids C = {C 1 , . . . C K } by (8). Since nodes in the set I are inside the boundary of centroids, line 7 computes the inner limit of SF ring l i . This process is repeated iteratively until l 1 is calculated for the allocation of SF8 and the remaining nodes are assigned to SF7 (line 11). Note that the set of nodes ED is updated at the end of each iteration by removing the nodes that were already allocated to an SF (line 9).
Compute the new SF ring limit 8:  Figure 2. The radius of the network circular area is R = 3 km, and therefore, the outer limit of SF12 is l 6 = R = 3 km. The first iteration of the algorithm defines the inner boundary of SF12, l 5 , as shown in Figure 2. After excluding the devices inside the SF12 ring from ED k , the algorithm runs a new iteration and defines l 4 , i.e., the inner boundary of the SF11 ring. The iterations continue until l 1 is defined and the complete network geometry is obtained.  Figure 3 depicts the clusters of nodes, centroids, and gateway at the origin for the last K-means iteration. The nodes outside l 1 are allocated to SF8, and the remaining nodes are assigned to SF7. The SF distribution ofN = 500 nodes based on the proposed approach considering the Fibonacci series for clustering is shown in Figure 4. The number of nodes in SF rings depends on the chosen series because of the distinct number of clusters for each sequence.   The area of each annulus also varies according to the mathematical series, which also affects the network performance. An important aspect to take into consideration is the selection of the number of clusters (K); if the difference between the clusters of two consecutive iterations is too big, that will result in a large number of end-devices in that specific region and, as a consequence, the probability of collisions and of co-SF interference will be high. In the same way, SF7 will have a larger coverage area and more nodes if K is high for the last iteration.

Numerical Results and Discussion
In this section, we evaluate the scalability and performance of the proposed methodology by means of computer simulations. The results are based on a p 0 = 1% duty cycle, BW = 125 kHz, η = 2.75 path loss exponent, 868 MHz European frequency band, and network radius of R = 3 km. Table 3 summarizes the parameters considered for the results.

SF Allocation and Scalability Analysis
As discussed in Section 4, we used mathematical series to assign the numbers of clusters for K-means iterations. The Fibonacci series has the shortest ranges for SF7, 715 m forN = 500. The distance between the SFs and the distribution of nodes can be changed by modifying the number of clusters (K). On the other hand, for the same number of nodes, the Square series has a longer range for SF7, which is 1201 m, and it contains more end-devices. This type of configuration is due to the higher value of K clusters for iterations (see Table 2). While in the case of Fibonacci and equal-distance-based SF allocation in [19,20], the network has fewer nodes in SF7, and the number of nodes in each SF region increases considerably towards the higher SFs, as shown in Table 4. The average SF ranges for the Fibonacci series, square series, Wythoff array and arithmetic series are shown in Table 5.  The boxplot is a standard process to quantify the variability of data on the basis of five parameters, i.e., the minimum, first quartile (25%), median, third quartile (75%), and maximum. Th distance of the SFs boundary from the gateway forN = 500 is demonstrated in Figure 5 for each of the considered series. The median of every SF is identical to distances provided in Table 5. As depicted in the graphical results, SF7 and SF8 have a large disparity in range and number of nodes for different scenarios, while SF11 and SF12 have nearly close coverage areas for all scenarios. Furthermore, we can also clearly observe that variation in K clusters selection has a direct effect on SF allocation based on the proposed methodology. The number of nodes in each SF are illustrated in Table 4. Square-series-based networks demonstrate five times more nodes in SF7 as compared to the reference model.
The large values of K clusters for the last iterations result in longer radii that directly increase the region of SF7 and keep SF8 further away from the gateway. These different scenarios can be used according to different situations and requirements of LoRa applications. An approach based on the Fibonacci series is beneficial for applications where fewer nodes in lower SFs are required, while the Wythoff array, square series, and arithmetic series have wider regions for SF7, and thus can be used in setups where more nodes are required in SF7 inside the radio range of nearly 1200 m to provide highest data rate (R b = 5.47 kbps, see Table 1). Several studies examined the performance of LoRa networks and show that the success probability of data packets decreases for higher SFs. In our work, we consider the effect of changing the coverage range and varying the number of devices for individual SFs.

Performance Analysis
After the application of the proposed SF allocation algorithm, we investigated the performance of the resulting LoRa network. The theoretical results were verified by Monte Carlo simulations. In the figures, each marker represents the average over 10 5 random deployments of the Poisson point process (PPP) for a single gateway LoRa uplink model, considering an end-device at d 1 meters from the gateway. In Figure 6, the solid lines demonstrate the theoretical results, while marker points of the same color illustrate the simulation outputs. The simulated results align with the theoretical ones. Within the context of the previously discussed mathematical sequences, we considered and examine the impact of different SF allocation scenarios on connection probability H 1 , capture probability Q 1 , and coverage probability H 1 Q 1 against the distance from the gateway. As expected, the distance of the end-device from the gateway has considerable influence on connection probability H 1 . In the case of the Fibonacci series, the model has a better success probability for lower SFs as compared to the square series, arithmetic series, and Wythoff array. This fact is due to the difference in a range of individual SF boundaries for said SF allocation schemes and the distance-dependent SNR threshold q SF . Furthermore, in the scenario of the square series, SF7 has large coverage areas of 1201 m (see Table 5), which affects path loss attenuation and the instantaneous SNR. The SNR threshold q SF , however, remains the same at −6 dB (see Table 1). As a consequence, the outage condition in (3) slightly degrades the network connection probability. Although a performance boost is illustrated during transitions of end-devices into the next SF because of the lower value of q SF , the performance of the previous SF has a direct consequence on the next SF.
Moving towards the capture probability Q 1 , unlike H 1 , it considers co-SF interference. Q 1 declines gradually with increasing SF, as illustrated in Figure 6. This trend is because of two major factors including ToA and the number of nodes in each annulus. ToA grows exponentially with SF, thus for the higher SFs, the wireless channel remains occupied for a long time slot, which increases the risk of collisions between simultaneously transmitted LoRa packets. In the same way, the number of end-devices in an individual annulus increases for higher SFs due to the uniform distribution of nodes in the circular coverage area, as demonstrated in Table 4. As a result, the network experiences co-SF interference that degrades the quality of transmission. For the cases with the square series, arithmetic series, and Wythoff array (Figure 6a,c,d), the model has a larger coverage area and more nodes in SF7, resulting in higher co-spreading factor interference, which is the major reason behind the lower performance of the network for these specific scenarios. Although we are sacrificing network quality for lower SFs with fewer nodes and high success probabilities, as presented in Figure 7, we improved the performance of the network for the higher SFs and regions with more nodes, where the network performance was weak in the baseline model from [19], which considers fixed distance steps from the gateway to define the SF allocation. Moreover, SF allocation based on the square series (Figure 6a) has better network performance, which happens because of improved gain in Q 1 for higher SFs as compared to the Fibonacci series, arithmetic series, and Wythoff array. In Figure 8, we present the performance of the square series based on the proposed SF allocation algorithm in comparison with the baseline model. First, we observe that the capture probability Q 1 and coverage probability H 1 Q 1 of the baseline model outperform the proposed algorithm in the region of radius R > 1000 m from the gateway. Here, it is worth mentioning that there are fewer nodes in this region as compared to the remaining area of the network. The proposed algorithm surpasses the baseline model with a gain in capture probability Q 1 (up to 53% for SF12 in Figure 8b). In addition, the nodes closer to the gateway always have better behavior in contrast with nodes far away, so that the network can tolerate a lower success probability with a boost in performance of farther end-devices. As expected, there was more change in success probability in the higher SFs region as compared to baseline model because of the fair distribution of SF by the proposed algorithm. Figure 8b shows the difference between the baseline model and the square series in terms of outage probabilities through the course of distance. The zero level on the y-axis (success probability) shows no difference, while there is a positive/negative gain either on the upper or lower side of that level. There is up to 16.73% growth in Q 1 demonstrated by the end-devices present in higher SFs. Moreover, we also consider the evaluation of the average coverage probability of the networks for different numbers of nodes (N) ranging from 300 to 700, and results demonstrate that ℘ c drops exponentially towards higherN. Figure 9 depicts the average coverage probability of different numbers of end-devices. The SF allocation schemes deployed using the square series demonstrated a better performance gain than all other scenarios, including the reference model. It was followed by the Wythoff array, arithmetic series, and Fibonacci series, in that order. The proposed SF allocation scheme overcomes the performance of the baseline model by an overall growth of around 5% in its ℘ c . For instance, taking the square series into account, atN = 500 there is a boost in the average coverage probability (6) from ℘ c = 41.9% to 46.81% compared with the reference model. On the other hand, SF allocation schemes deployed using the Fibonacci series showed the least-improved network coverage probability compared to all other user distribution series. We also investigated the performance of the proposed model taking into account the variation of different parameters on Fibonacci series. As seen in Figure 10a,b, H 1 is agnostic to the number of nodes and duty cycle. It is assessed at 0.1% and 1%, which is within the duty cycle range specified by ETSI for LoRa applications [34,35]. Furthermore, node density and the duty cycle demonstrate a negative impact on Q 1 because of co-SF interference caused by increasing medium usage fromN = 501 nodes toN = 1005 nodes. Likewise, in Figure 10c, the path loss exponent η illustrates network connection degradation at 2.65 from 2.5 because H 1 depends on the distance, while the the capture probability (Q 1 ) is not dependent on the path loss exponent. In the case of one frequency channel, the transmit power P k of nodes can be up to 20 dBm (100 mW) [11,41]. In order to evaluate the effect of different transmit powers, in Figure 10d we raised the transmit power from 14 dBm to 19 dBm. The results demonstrate that the transmit power of 19 dBm causes a better connection probability as compared to 14 dBm. Nevertheless, variations of the path loss exponent and transmit power do not affect Q 1 considerably because it is much more dependent on the number of nodes.

Discussion
The architecture of LoRaWAN consists of end-devices, a gateway, network server (NetServer), and application server [15]. The NetServer is mainly responsible for the overall management of the network. The dynamic configuration of the SF by the NetServer is already possible in LoRaWAN during the network join procedure or through specific MAC commands. In fact, these features are used in LoRaWAN when the adaptive data rate (ADR) mechanism is active. For our approach to be implemented in practice, the NetServer could run our algorithm periodically or when the number of connected devices changes significantly, and then issue the required MAC commands to reconfigure the devices that need to change their SF. Therefore, as the current LoRaWAN specification is already able to dynamically allocate SFs, our proposal only changes the way the proper allocation is calculated at the NetServer, and is therefore feasible in practice.

Conclusions
This paper has presented a novel SF allocation technique for a large-scale LoRa network using the K-means clustering machine learning algorithm. The authors also analyzed the impact of the distance of end-devices from the gateway and the number of nodes in each SF on network performance. In this work, four different scenarios are considered, which have different distances for the SF boundaries and variations in the number of nodes in an individual SF. Such fair distribution results in a better average coverage probability in the higher SFs, while dealing with the maximum number of nodes. Numerical findings show that our SF allocation algorithm outperforms the reference model not only in terms of success probability but also in regards to fair resource distribution. The evaluated theoretical and simulation results are useful for an in-depth understanding of large and dense LoRa networks. Our resource allocation method can handle dense and large circular coverage areas for LoRa sensors using distinct numbers of clusters instead of equal-radius-based SF allocation [19,20], while the techniques in [28,29,33] are designed for short-range networks. The studies [19,20,28,29,33] only highlighted the performance of networks for fixed parameters. In contrast to them, our work inspects different scenarios by obeying the restrictions of ETSI standards.