Graph-based Semi-supervised Learning for Indoor Localization Using Crowdsourced Data

Indoor positioning based on the received signal strength (RSS) of the WiFi signal has become the most popular solution for indoor localization. In order to realize the rapid deployment of indoor localization systems, solutions based on crowdsourcing have been proposed. However, compared to conventional methods, crowdsourced RSS values are more erroneous and can result in large localization errors. To mitigate the negative effect of the erroneous measurements, a graph-based semi-supervised learning (G-SSL) method is used to exploit the correlation between the RSS values at nearby locations to estimate an optimal RSS value at each location. Before using the G-SSL method, the Linear Regression (LR) algorithm is used to solve the device diversity problem in crowdsourcing systems. Since the spatial distribution of the APs is sparse, the Compressed Sensing (CS) method is applied to precisely estimate the location of the APs. Based on the location of the APs and a simple signal propagation model, the RSS difference between different locations is calculated and used as an additional constraint to improve the performance of G-SSL. Furthermore, to exploit the sparsity of the weights used in the G-SSL, we use the CS method to reconstruct these weights more accurately and further improve the performance of the G-SSL. Experimental results show improvements in both the smoothness of the radio map and the localization accuracy.


Introduction
Indoor location-based services (LBS), such as indoor positioning, tracking and navigation, have been receiving a lot of attention in recent years [1,2]. However, it remains a challenge to provide users with accurate and robust location estimates. The Global Positioning System (GPS) is the most widely used localization system and provides precise positioning in outdoor environments. However, due to the lack of sufficient signal strength in most indoor areas, GPS is not a reasonable solution for indoor environments. Therefore, various alternatives to GPS have been proposed for indoor localization. Examples include, but are not limited to, methods using Ultra-Wideband, Ultrasound, Infrared and Radio Frequency signals [2][3][4][5][6]. These alternatives provide good localization accuracy for many applications; however, they require additional infrastructure, which is a disadvantage for large-scale deployment.
With the growing deployment of WiFi access points in indoor environments and the widespread use of mobile devices such as smart phones, WiFi received signal strength (RSS)-based indoor localization methods are getting popular due to their low deployment cost and relatively high localization accuracy.
In general, there are two main categories of localization methods that use WiFi RSS readings. The first category comprises those methods that rely on the radio propagation model of the WiFi signal in indoor environments as well as the locations of the WiFi Access Points (AP). Specifically, the RSS readings from different access points are used to estimate the distance of a mobile device from those access points. Then a triangulation method is used to estimate the location of the mobile device. The second category includes those methods that are based on WiFi RSS fingerprints, also known as fingerprint-based methods. Originally proposed by P. Bahl et al. [7], various fingerprint-based localization systems have been designed and developed during the last decade [7][8][9].
Typically, fingerprint-based methods consist of an offline phase followed by an online phase [7]. In the offline phase, RSS values from different WiFi access points are measured at some known locations throughout the indoor area. These locations are referred to as Reference Points (RP) and the measured RSS vector for each RP is called a fingerprint. All fingerprints and their corresponding RPs are stored in a database called the radio map. In the online phase, a user's position can be estimated by comparing the RSS values measured by the user with the RSS fingerprints stored in the radio map.
A disadvantage of the offline phase of the fingerprint-based methods is the time and labor required to collect a sufficient number of fingerprints throughout the indoor area. In addition, the RSS value of an AP at a certain location can change over time due to a number of reasons including, but not limited to, multipath fading, shadowing, and moving objects and people [10]. To mitigate these RSS fluctuations, a large number of RSS measurements are collected at every reference point in the offline training phase. However, collecting more RSS measurements at each location makes the offline phase even more time-consuming and labour-intensive. Several works have been proposed to reduce the workload of the offline phase [11][12][13]. The crowdsourcing method has been shown to be a promising approach to solving this problem [14][15][16]. In a crowdsourcing-based system, each user can contribute to the construction and updating of the radio map. Consequently, the number of RSS values collected in the offline training phase is greatly reduced. On the other hand, RSS measurements collected by users moving in the environment are potentially more erroneous than those collected by experts at the exact locations of the reference points.
One of the problems in a crowdsourcing localization system is that numerous mobile devices are used to build the radio map in the offline training phase and to provide LBS for the device holders in the online phase. Because mobile devices are equipped with different WLAN adapters, the RSS values they collect depend on the adapter. As a result, different data collection devices may have different signal sensing capacities and yield different data distributions. Numerous studies show that, due to hardware differences, the RSS values collected by different devices can differ by more than 25 dB [17][18][19]. Therefore, the localization accuracy is degraded significantly by RSS variations across devices.
Another issue in indoor localization is the knowledge of the location of the access points. In most fingerprint-based methods, the location of the access points is considered to be unknown. This is a convenient simplifying assumption in many situations, especially when the signal strengths are measured in a passive mode. However, the knowledge of the location of the access points can enhance the localization accuracy. This is especially important since the location of an access point can be estimated using signal processing techniques [20]. The location of an access point can then be used to correlate the received signal strength across neighbouring locations, as will be discussed in this paper.
In this paper, in order to deal with the device diversity problem, the Linear Regression (LR) algorithm is used to mine the intrinsic relationship between the RSS values collected by different devices. Using the LR algorithm, the device diversity problem is solved automatically and uniform RSS values are obtained, which enables the algorithms that follow. On the basis of the graph-based semi-supervised learning (G-SSL) method, we propose the RSS difference-aware G-SSL (RG-SSL) method and the RSS difference-aware sparse graph SSL (RSG-SSL) method to smooth the RSS values collected in the offline training phase and improve the localization results. Before smoothing the RSS measurements using the G-SSL method, the locations of the APs need to be known. Since the spatial distribution of the APs is sparse, the Compressed Sensing (CS)-based method of [20] is used to precisely estimate the AP locations. Based on the signal propagation model, the RSS difference between two locations is calculated with respect to the locations of the RPs and APs. Furthermore, the RG-SSL method is proposed to smooth the radio map in the offline training phase. By leveraging the RSS readings in the local neighbourhood, the effect of noise and erroneous measurements can be reduced to obtain a higher localization accuracy. Finally, the sparsity of the graph is discussed and the RSG-SSL method is used to obtain better RSS smoothing and localization results.
The rest of the paper is organized as follows. The related works are given and discussed in Section 2. Section 3 formulates the indoor localization problem. In Section 4, the device diversity problem in the crowdsourcing localization system is solved by the linear regression method. The CS-based AP positioning method is explained in Section 5, which also presents experiments with the proposed method. In Section 6, the RG-SSL method is proposed with some experimental results. Finally, we explain the RSG-SSL method in Section 7 and provide the localization results using RSG-SSL. Section 8 concludes the paper.

Background and Related Works
C. Feng et al. in [2] and J. J. Pan et al. in [11] proposed the CS-based method and the G-SSL method, respectively, to reduce the workload of the radio map construction in the offline phase. Both methods aim to reduce the number of reference points (RP) and RSS measurements. Also, [14][15][16] explore crowdsourcing-based methods to reduce the deployment workload by engaging the users to participate in radio map construction.
In [21], an RSS pre-processing method called the "sliding correlation time window filter" (SCTW) is used to reduce the noise in the measured RSS values. Similarly, in this paper, a sliding time window is used to average the RSS values collected at every RP to improve the accuracy of the RSS measurements. However, this filter only uses a small number of the RSS values in the radio map, and most of the information in the radio map is discarded.
M. Hasani et al. [22] used a path-loss model to improve the reliability of the measured RSS values. In the offline phase of their method, a set of channel parameters is estimated for each access point. In the online phase, the user's location is found based on the RSS values calculated from the stored channel parameters. Their method results in reliable localization thanks to the stability of the estimated channel parameters. In [23], S. Latif et al. proposed a D-model to estimate the radio signal strength in indoor areas. The experiments in their paper showed that the proposed D-model is capable of estimating the RSS values with high accuracy. Their method also models the wall attenuation more accurately than the method of [22]. Although the simulation results showed that the proposed method is suitable for RFID positioning systems, the results are not satisfactory when the method is used in a WiFi positioning system.
Inspired by the signal propagation approach, we proposed the signal propagation-based outlier reduction technique (SPORT) to smooth the RSS measurements in both the offline phase and the online phase and to improve the localization accuracy [24]. In this method, we investigate the relationship between the RSS values at adjacent locations using a signal propagation model and show that the outliers can be corrected using this model. Experimental results show that SPORT greatly smooths the radio map and improves the localization accuracy.
In order to minimize the fluctuation of RSS values, M. S. Rahman Sakib et al. [25] developed a method using a Particle Filter (PF). Particle filters are used to perform non-linear and non-Gaussian estimations. However, in the online phase, a large number of particles have to be used in order to obtain a high positioning accuracy. Consequently, the computational cost is high, which may be unacceptable for some indoor positioning applications.
L. Ma et al. [26] proposed a method based on singular value thresholding (SVT) to recover the missing RSS values in both the offline and online phases. In that paper, the authors argued that the positioning performance degrades significantly when some of the APs are occasionally turned off, such as in a green WLAN system. Therefore, they proposed an SVT-based method to estimate the missing RSS values both in the radio map and in the online RSS readings. They showed that their SVT-based method could achieve an acceptable positioning performance.

Problem Formulation
Suppose a set of $\ell$ RPs are selected throughout the indoor area and $M$ APs are visible at each RP location. In the offline training phase, we collect the $i$-th fingerprint $(c_i, r_i)$ at RP $S_i$, where $c_i = (x_i, y_i)^T$ is the geographical coordinate of $S_i$ and $r_i$ is an $M \times 1$ RSS vector. We refer to these fingerprints as labelled data. In the online phase, the user's location can be estimated by comparing the RSS vector $r_k$ collected at the unknown location of the user, $S_k$, with the fingerprints in the radio map. If $r_k$ is similar to a particular $r_i$, then we reason that the user's location $S_k$ must be close to the RP location $S_i$.
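The online matching step described above can be illustrated with a nearest-neighbour lookup in signal space. The following is a minimal sketch, not the matching rule used in this paper: the function name `knn_locate`, the choice of Euclidean distance between RSS vectors, the averaging over `k` neighbours and the toy radio map are all assumptions made for illustration.

```python
import numpy as np

def knn_locate(radio_map_rss, rp_coords, r_k, k=3):
    """Estimate a location as the mean coordinate of the k closest fingerprints.

    radio_map_rss: (n_rp, M) array, one fingerprint r_i per row.
    rp_coords:     (n_rp, 2) array, one coordinate c_i = (x_i, y_i) per row.
    r_k:           (M,) RSS vector measured online at the unknown location.
    """
    # Euclidean distance in signal space between r_k and every fingerprint.
    dists = np.linalg.norm(radio_map_rss - r_k, axis=1)
    nearest = np.argsort(dists)[:k]
    return rp_coords[nearest].mean(axis=0)

# Toy radio map (hypothetical values): three RPs on a line, two APs.
radio_map_rss = np.array([[-50.0, -70.0], [-60.0, -60.0], [-70.0, -50.0]])
rp_coords = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
estimate = knn_locate(radio_map_rss, rp_coords, np.array([-51.0, -69.0]), k=1)
```

With `k=1` the estimate is simply the coordinate of the RP whose fingerprint is most similar to `r_k`; larger `k` trades resolution for robustness to noisy fingerprints.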
In practice, the RSS values measured by a mobile device are subject to multiple sources of noise, such as multipath fading and shadowing. Figure 1 illustrates the histogram of 100 RSS values from a single AP at a particular location inside the Bahen Building at the University of Toronto. The RSS values are distributed over a wide range, from −70 dBm to −50 dBm. Occasionally, we cannot receive any power from this AP and a value of −110 dBm is used to denote the missing RSS value. Figure 2 shows the RSS value from a single AP throughout the fourth floor of the Bahen Building after removing the −110 dBm measurements and averaging the RSS values at each location.
Next, we explain how we apply the G-SSL method to reduce the effect of noise in the radio map. Consider a set of $u$ locations within the localization area that are not associated with RSS measurements; we call them unlabelled data. In addition to these unlabelled locations, there are $\ell$ labelled RP locations as explained previously. Consequently, we have $\ell + u$ locations of labelled and unlabelled data. In the G-SSL method, a weighted graph is constructed using both labelled and unlabelled data. In this graph, the vertices represent the training data and all the vertices are connected by edges. The edge weight matrix, which is calculated from the training data, represents the relationship between the vertices by assigning a weight to each edge connecting two vertices. Each vertex of the graph corresponds to a location, and the weighted edges between vertices represent the relationship between both the RSS values and the locations corresponding to those vertices. As mentioned earlier in this section, measured RSS values in an indoor environment are affected by different types of noise. However, in the graph representation of the G-SSL method, any two vertices are related not only by the RSS values measured at those vertices but also by the physical locations corresponding to them. Therefore, the G-SSL is able to reduce the effect of noise in the measured RSS values by incorporating both RSS and location information.
Next, we explain the G-SSL method in more detail. Suppose $\Omega = (V, E)$ denotes the graph of the G-SSL method. The vertex set is defined as $V = \{c_1, c_2, \ldots, c_{\ell}, c_{\ell+1}, \ldots, c_{\ell+u}\}$, where the first $\ell$ elements are the location coordinates of the labelled data and the next $u$ elements are the location coordinates of the unlabelled data. For every edge between two vertices at $S_i$ and $S_j$, we can calculate its weight $w_{ij}$, which indicates the similarity between the two vertices and takes values in the range $[0, 1]$, with 0 indicating no similarity between the vertices. The result is an $(\ell + u) \times (\ell + u)$ weight matrix $W$ containing all the calculated weights. The graph edges are undirected, so the edge $(i, j)$ (weighted by $w_{ij}$) and the edge $(j, i)$ (weighted by $w_{ji}$) are the same edge in the graph, which means $w_{ij} = w_{ji}$. In addition, the edge $(i, i)$ does not exist; therefore, there are $\frac{1}{2}\left[(\ell + u)^2 - (\ell + u)\right]$ edges in the graph. In summary, only this number of graph weights needs to be calculated, and the weight matrix $W$ is symmetric. To calculate the weights, we use the well-known heat-kernel function:
$$w_{ij} = \exp\left(-\frac{d^2(S_i, S_j)}{\tau}\right) \tag{1}$$
where $d^2(S_i, S_j)$ is the square of the Euclidean distance between locations $S_i$ and $S_j$, and $\tau$ is an application-dependent parameter which controls how quickly the weight decreases.
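As an illustration of how the weight matrix $W$ can be built from the coordinates of the labelled and unlabelled locations, the sketch below evaluates a heat kernel on all pairs of vertices. It assumes the kernel has the form $\exp(-d^2/\tau)$; the exact normalisation of $\tau$, the function name `heat_kernel_weights` and the toy coordinates are assumptions for illustration only.

```python
import numpy as np

def heat_kernel_weights(coords, tau):
    """Return the symmetric weight matrix W for an (n, 2) array of coordinates.

    Assumes w_ij = exp(-d^2(S_i, S_j) / tau); the diagonal is set to zero
    because the edge (i, i) does not exist in the graph.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise coordinate differences
    sq_dist = np.sum(diff ** 2, axis=-1)             # squared Euclidean distances
    W = np.exp(-sq_dist / tau)
    np.fill_diagonal(W, 0.0)
    return W

# Three hypothetical locations on a line, at 0 m, 1 m and 4 m.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
W = heat_kernel_weights(coords, tau=2.0)
```

A small `tau` makes the weights decay quickly, so only very close locations influence each other; a large `tau` couples distant locations as well.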
The G-SSL uses $W$ to estimate the labels of the unlabelled data using the relationship between different vertices in the graph. The result is a set of estimated labels $\hat{r}_i$ for $i \in \{1, 2, \ldots, \ell + u\}$. If $c_i$ is close to $c_j$, the estimated label $\hat{r}_i$ is close to the given label $r_j$ for all $j \in \{1, 2, \ldots, \ell\}$. The estimated labels $\hat{r}_i$ have to satisfy two conditions. First, for the labelled data, since the labels are already known, the estimated labels $\hat{r}_i$ must be close to the real labels. For the labelled data $(c_i, r_i)$, we should have $\hat{r}_i = r_i$. This condition is enforced by minimizing the following loss function:
$$\min_{\hat{R}} \sum_{i=1}^{\ell} \left\| r_i - \hat{r}_i \right\|^2 \tag{2}$$
where $\hat{R}$ is the $M \times (\ell + u)$ matrix of all estimated RSS values and $\|\cdot\|$ is the Euclidean norm.
The second condition is that the graph should be smooth. The smoothness of the graph comes from the fact that data points which are close to each other should have similar labels. To satisfy the smoothness condition, the estimated labels $\hat{r}_i$ and $\hat{r}_j$ should minimize the following loss function:
$$\min_{\hat{R}} \sum_{i,j=1}^{\ell+u} w_{ij} \left\| \hat{r}_i - \hat{r}_j \right\|^2 \tag{3}$$
If $c_i$ and $c_j$ are close to each other, the weight $w_{ij}$ is large, and the labels $\hat{r}_i$ and $\hat{r}_j$ must be close in order for the whole term to be minimized. On the other hand, if $c_i$ and $c_j$ are far away from each other, the weight $w_{ij}$ is very small and the choice of the labels does not have much effect on the minimization.
Hence, the labels that satisfy both conditions above can be estimated using:
$$\hat{R}^{*} = \arg\min_{\hat{R}} \sum_{i=1}^{\ell} \left\| r_i - \hat{r}_i \right\|^2 + \gamma \sum_{i,j=1}^{\ell+u} w_{ij} \left\| \hat{r}_i - \hat{r}_j \right\|^2 \tag{4}$$
where $\gamma$ is the weight of the smoothness term. It is an application-dependent design parameter that controls which of the two terms is more important. In conclusion, the first term of Equation (4) penalizes the difference between the actual labels and the estimated labels, and the second term ensures the smoothness of the graph.
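Because both terms of the G-SSL objective are quadratic in the estimated labels, the minimizer can be obtained from a linear system. The sketch below assumes the objective is the fitting term over labelled nodes plus $\gamma$ times the smoothness term summed over all ordered pairs $(i, j)$; under that assumption, setting the gradient to zero gives $\hat{R}(J + 2\gamma L) = RJ$, where $J$ is a diagonal 0/1 matrix marking the labelled nodes and $L = D - W$ is the graph Laplacian. The function name `gssl_smooth` and the toy graph are hypothetical, and a different summation convention would change the factor in front of $L$.

```python
import numpy as np

def gssl_smooth(R, labelled, W, gamma):
    """Closed-form G-SSL estimate under the stated assumptions.

    R:        (M, n) RSS matrix; columns of unlabelled nodes may hold zeros.
    labelled: (n,) boolean mask, True where a measured label exists.
    W:        (n, n) symmetric weight matrix with zero diagonal.
    gamma:    weight of the smoothness term.
    """
    J = np.diag(labelled.astype(float))
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W
    A = J + 2.0 * gamma * L                  # symmetric; invertible for a connected
                                             # graph with at least one labelled node
    # Solve R_hat @ A = R @ J, i.e. A @ R_hat.T = (R @ J).T since A is symmetric.
    return np.linalg.solve(A, (R @ J).T).T

# Chain of three nodes; the middle node has no measurement.
W_chain = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
R_meas = np.array([[-50.0, 0.0, -70.0]])
mask = np.array([True, False, True])
R_hat = gssl_smooth(R_meas, mask, W_chain, gamma=0.5)
```

In this toy chain the unlabelled middle node receives the average of its two neighbours, while the labelled end nodes are pulled slightly toward it; a smaller `gamma` keeps the labelled values closer to their measurements.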
The proposed G-SSL-based RSS smoothing method for crowdsourcing is summarized in the system diagram shown in Figure 3. In the offline phase, since the actual coordinates of $S_i$ and $S_j$ are already known, the LR algorithm is used to obtain uniform RSS values. Then the locations of the APs are estimated by the CS method. Finally, the RSS values are smoothed by the G-SSL method. In the online phase, the data collected simultaneously from the sensors on the mobile device can be used to estimate the relative displacement between $S_i$ and $S_j$, that is, the distance $d(S_i, S_j)$. Then the collected RSS values are processed by the LR method. After that, the RSS values can be smoothed using the calculated distance $d(S_i, S_j)$. Finally, we get a more accurate positioning result.

Linear Regression Algorithm against Device Diversity Problem
In existing experimental systems, the same device is used to collect the RSS values in both the offline phase and the online phase. However, when crowdsourcing is widely applied to indoor localization systems, a large number of different mobile devices are used to establish the radio map. In the online phase, users also carry a variety of mobile devices that differ from the devices used to build the radio map. In this section, the linear regression (LR) algorithm is proposed to solve the device diversity problem in RSS-based crowdsourcing localization systems.
We define $X$ and $Y$ as the signal spaces of two different devices. Assume that the fingerprint $r_X \in X$ is the nearest neighbor of the online point $r_Y \in Y$. As described above, although they were collected at close physical locations, the RSS values differ noticeably. In order to solve the device diversity problem, the relationship between different devices has to be studied, so that the RSS values collected by different devices can be processed to bring $r_Y$ closer to $r_X$. Mathematically,
$$X \approx f(Y). \tag{5}$$
By learning $f$, the radio map built by the training device can be used to localize any other device.
To explore the mapping function between the RSS values collected by distinct devices, the RSS values of different training/tracking device pairs are plotted against each other in Figure 4. Every point in the figure represents the RSS values measured by two different devices from the same AP at the same location. For example, the top right subplot in Figure 4 represents the RSS values measured by the Lenovo laptop and the Huawei mobile device. Figure 4 shows a linear correlation between the RSS values measured by different devices. Hence, the following linear model can be employed as the mapping function:
$$f(r_Y) = a\, r_Y + b \tag{6}$$
where $(a, b)$ are the coefficients of the mapping function.

Pre-Processing of RSS Values
In a typical WLAN localization scenario, the RSS values collected by the mobile device are subject to multiple sources of noise, such as multipath fading and shadowing. To mitigate these RSS fluctuations, a large number of RSS measurements are collected from each AP at every location. Let $RSS_{li} = \{rss_1, rss_2, \ldots, rss_p\}$ be the set of RSS values collected at location $l$ from the $i$-th AP. As shown in Figure 4, if we cannot receive any power from the AP, a value of −110 dBm is used to denote the missing RSS value.
In order to obtain a high localization accuracy, the first step in the localization system is to stabilize the collected RSS values prior to the localization process. To overcome the fluctuations, the average of the collected RSS values is calculated. If the filled values of −110 dBm were included in this average, they would produce meaningless RSS values, distort the localization process and lead to erroneous location estimates. As a result, the average is calculated over the collected RSS values excluding the filled ones:
$$\bar{r}_{li} = \frac{\sum_{k=1}^{p} rss_k \, \mathbb{1}(rss_k \neq -110)}{\sum_{k=1}^{p} \mathbb{1}(rss_k \neq -110)} \tag{7}$$
The average value $\bar{r}_{li}$ is used to build the radio map in the offline training phase and to estimate the current location in the online localization phase.
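The averaging step can be sketched in a few lines. The function name `mean_rss` is hypothetical, and returning the fill value when every sample is missing is an assumption made here for completeness; the text does not specify that case.

```python
import numpy as np

MISSING_RSS_DBM = -110.0  # fill value used when no power is received from the AP

def mean_rss(samples, missing=MISSING_RSS_DBM):
    """Average the RSS samples of one AP at one location, ignoring fill values."""
    samples = np.asarray(samples, dtype=float)
    valid = samples[samples != missing]
    if valid.size == 0:
        # Assumption: if the AP was never heard, keep the fill value.
        return missing
    return float(valid.mean())

avg = mean_rss([-60.0, -62.0, -110.0, -58.0])
```

Including the fill value in this example would drag the average down to −72.5 dBm, which corresponds to none of the actual measurements.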

Linear Regression Algorithm against Device Diversity Problem
Before using the linear regression method, the parameters $a$ and $b$ in Equation (6) have to be computed. Since outliers appear frequently in the collected RSS values and seriously affect the performance of the linear least squares (LLS) algorithm, the fast least trimmed squares (FAST-LTS) algorithm is used in this paper.
When the number of measured RSS values is $c$, the FAST-LTS solution for linear regression with intercept is given by
$$\min_{a, b} \sum_{i=1}^{h} d_{(i)}^2 \tag{8}$$
where $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(c)}^2$ are the ordered squared residuals of the $c$ samples under the linear model (6). Given the $h$-subset $H_{old}$ of all nearest neighbors, the C-step is used to compute $a$ and $b$ as follows [27]:
1. Compute $a_{old}$ and $b_{old}$ as the least squares regression estimator based on $H_{old}$.
2. Compute the residuals $d_{old}(i)$ for $i = 1, \ldots, c$.
3. Sort the absolute values of these residuals, so that they are arranged in ascending order.
4. Let $H_{new}$ be the subset consisting of the nearest neighbors corresponding to the first $h$ absolute residuals in the sequence.
5. Compute $a_{new}$ and $b_{new}$ as the least squares regression estimator based on $H_{new}$.
Repeating the C-step with numerous initial subsets $H_{old}$ yields many candidate regression coefficients. The approximate solution is the candidate with the smallest $\sum_{i=1}^{h} d(i)^2$. After obtaining the regression coefficients $a$ and $b$, $r_X$ is transformed with the fitted linear mapping so that $r_X \in Y$. As a result, both $r_X$ and $r_Y$ belong to the same signal space, a uniform radio map can be built using $r_X$ and $r_Y$ in the offline training phase, and a higher positioning accuracy can be obtained in the online phase.
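The C-step procedure above can be sketched as follows. This is a simplified illustration rather than the full FAST-LTS algorithm of [27]: it starts every run from a random $h$-subset and applies a fixed number of C-steps, and the names `c_step` and `fast_lts`, the number of restarts and the synthetic data are assumptions. Here `x` stands for the RSS values of the device to be mapped and `y` for the RSS values of the reference device at the same locations.

```python
import numpy as np

def c_step(x, y, subset, h):
    """One concentration step: fit on `subset`, keep the h smallest |residuals|."""
    a, b = np.polyfit(x[subset], y[subset], deg=1)   # least squares line y = a*x + b
    residuals = np.abs(y - (a * x + b))              # residuals of all c samples
    return np.argsort(residuals)[:h]                 # indices forming H_new

def fast_lts(x, y, h, n_starts=20, n_steps=5, seed=0):
    """Approximate least trimmed squares fit by repeated C-steps."""
    rng = np.random.default_rng(seed)
    best_cost, best_a, best_b = np.inf, None, None
    for _ in range(n_starts):
        subset = rng.choice(len(x), size=h, replace=False)   # random initial H_old
        for _ in range(n_steps):
            subset = c_step(x, y, subset, h)
        a, b = np.polyfit(x[subset], y[subset], deg=1)
        cost = np.sum((y[subset] - (a * x[subset] + b)) ** 2)  # trimmed sum of squares
        if cost < best_cost:
            best_cost, best_a, best_b = cost, a, b
    return best_a, best_b

# Synthetic device pair: a linear relation with a few gross outliers.
x = np.linspace(-90.0, -40.0, 40)
y = 0.9 * x - 5.0
y[::8] += 30.0                       # five corrupted samples
a_hat, b_hat = fast_lts(x, y, h=30)
```

Because the trimmed objective only counts the `h` best-fitting samples, the five corrupted samples end up outside the final subset and do not bias the fitted line, which is the motivation for preferring LTS over ordinary least squares here.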
To verify the LR method, five distinct devices, namely Lenovo, Huawei, Samsung, Xiaomi and Coolpad, are used to collect RSS values at all RPs, and the linear regression coefficients are calculated based on the measured RSS values and the corresponding coordinates. Once the regression coefficients are obtained, all the RSS values can be mapped into the same signal space by the LR method and a uniform radio map can be built. Using the processed radio map, the user's location can be estimated with high accuracy in the online phase.
In our localization system, we use the Lenovo device as the standard device, and all the RSS values collected by the other devices are mapped into the signal space of the Lenovo device. We take the (Huawei, Lenovo) pair as an example. As shown in Figure 5, the collected data are more stable after the pre-processing of RSS values, and the linear regression coefficients can be calculated by the LTS method. Using the coefficients, the RSS values collected by the Huawei device can be mapped into the signal space of the Lenovo device. We compare the original and the transformed RSS values collected by the Huawei device with the RSS values collected by the Lenovo device; the comparison result is shown in Figure 6. From the figure, we can see that the difference in signal distribution between the devices is reduced significantly. Accordingly, a uniform radio map can be built in the offline phase and the positioning performance can be improved in the online phase.

Automatic Device-Transparent Algorithm for Crowdsourcing Indoor Localization System
Based on the LR method, the device diversity problem can be solved. However, the LR method relies on the premise that the RSS values being compared were collected at the same coordinates. In the offline training phase, the RSS values used to build the radio map have been labelled, so they meet the prerequisite of the LR method and the device diversity problem can be solved automatically. In the online localization phase, the coordinates of the RSS values are unknown, which means the LR method cannot be used directly. Therefore, we use the correlation ratio computed from the Pearson product-moment correlation coefficient to roughly label the RSS values collected by an unknown device:
$$t = \frac{\sum_{k=1}^{m} \left(r_Y^k - \bar{r}_Y\right)\left(r_X^k - \bar{r}_X\right)}{\sqrt{\sum_{k=1}^{m} \left(r_Y^k - \bar{r}_Y\right)^2}\,\sqrt{\sum_{k=1}^{m} \left(r_X^k - \bar{r}_X\right)^2}} \tag{11}$$
where $m$ is the number of APs, $r_Y^k$ and $r_X^k$ are the RSS values measured from the $k$-th AP, $\bar{r}_Y = \frac{1}{m}\sum_{k=1}^{m} r_Y^k$ is the average of the RSS values from the tracking device and $\bar{r}_X = \frac{1}{m}\sum_{k=1}^{m} r_X^k$ is the mean of the RSS values measured by the training device in a fingerprint.
The absolute value of the Pearson correlation ratio lies in the range $[0, 1]$, where 1 indicates the highest linear correlation between the RSS values and 0 indicates the least similarity. In the online phase, when the RSS vector $r_Y$ is acquired, the similarity between the online point and every fingerprint $r_X$ in $X$ is measured by $t$. Given a threshold $t_{th}$, the fingerprints in the radio map $X$ whose correlation with $r_Y$ reaches $t_{th}$ form the set of nearest-neighbor fingerprints of $r_Y$, as defined in Equation (12).
Based on the nearest neighbors in Equation (12), the RSS data collected in the online phase can be labelled roughly, and the LR method proposed in the previous section is used to train the mapping function.
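The rough labelling step can be sketched as a correlation search over the radio map. The function name `pearson_neighbours`, the toy radio map and the decision to threshold the signed correlation (so that anti-correlated fingerprints are not selected) are assumptions for illustration; the exact selection rule of Equation (12) is not reproduced here. The sketch also assumes that no fingerprint is constant across APs, since the correlation would then be undefined.

```python
import numpy as np

def pearson_neighbours(r_y, radio_map, t_th):
    """Return the indices of fingerprints whose Pearson correlation with r_y is >= t_th.

    r_y:       (m,) online RSS vector from the tracking device.
    radio_map: (n, m) array, one fingerprint r_X per row.
    """
    r_y_c = r_y - r_y.mean()                                   # centre the online vector
    map_c = radio_map - radio_map.mean(axis=1, keepdims=True)  # centre each fingerprint
    num = map_c @ r_y_c
    den = np.linalg.norm(map_c, axis=1) * np.linalg.norm(r_y_c)
    t = num / den
    return np.flatnonzero(t >= t_th), t

# Hypothetical radio map with three fingerprints over three APs.
radio_map = np.array([[-50.0, -60.0, -70.0],
                      [-70.0, -60.0, -50.0],
                      [-55.0, -80.0, -56.0]])
r_y = np.array([-45.0, -56.0, -64.0])   # same trend as row 0, shifted by the device
idx, t = pearson_neighbours(r_y, radio_map, t_th=0.9)
```

Since the correlation coefficient is invariant to a linear change of scale and offset, the first fingerprint is still recognised as a neighbour even though the tracking device reports systematically different RSS values, which is what makes it usable before the LR mapping has been learned.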
In summary, in the offline phase, because the coordinates of the collected RSS data are already known, the LR algorithm can be used to eliminate the device diversity problem directly. As a result, a uniform radio map can be built in the offline phase. In the online phase, the RSS values collected by the unknown device are first localized roughly using the Pearson correlation coefficient. Then the RSS values are mapped into the signal space of the radio map using the LR algorithm. Finally, we get a more accurate positioning result.

AP Localization Using Compressed Sensing Method
Typically, fingerprint-based localization methods do not rely on the location of the APs. In other words, the AP locations are assumed to be unknown. Nonetheless, better localization can be achieved if one can estimate the AP locations. Next, we discuss a compressed sensing (CS)-based approach to estimate the AP locations.
Consider a set of $N$ discrete locations (grid points) throughout the indoor area. Suppose a set of $M$ access points can be seen at each location. It is a practical assumption that the number of grid points is much larger than the number of access points in the indoor area, i.e., $M \ll N$. We use this assumption to apply a CS-based method to recover the locations of the APs.
Compressed Sensing is a signal processing technique that can efficiently reconstruct a signal by exploiting the sparsity and incoherence properties of the signal [28][29][30]. For the $i$-th AP, we define a vector $\theta_i$ of size $N$ that indicates the location of the AP by assigning a one to one of its $N$ elements and zero to the rest. For example, if $\theta_i(n) = 1$, then the location of the $i$-th AP is estimated to be the location of the $n$-th grid point in the indoor area. Concatenating these vectors for all $M$ APs results in a so-called index matrix $\Theta_{N \times M}$:
$$\Theta = [\theta_1, \theta_2, \ldots, \theta_M] \tag{13}$$
According to the CS theory, rather than measuring the $M$-sparse signal or its sparse representation $\Theta$ directly, compressive noisy RSS measurements in an $\ell$-dimensional space are used. These compressive measurements are obtained by multiplying a random matrix by the original signal:
$$y = \Phi \Psi \Theta + \varepsilon \tag{14}$$
where $\varepsilon$ accounts for the measurement noise and
1. $y_{\ell \times M}$ are the compressive noisy RSS measurements.
2. $\Phi_{\ell \times N}$ is the measurement matrix. Each row of this matrix represents the location of one RP, with an element of 1 indicating the grid point at which the RP is located. Thus, RSS values are collected only at the locations of the RPs instead of at every point of the grid, which reduces the workload in the offline phase.
3. $\Psi_{N \times N}$ is the sparsity basis on which the measured signals have the sparse coefficients $\Theta$. In this matrix, $\Psi_{ij} = RSS(d_{ij})$ indicates the RSS value collected at grid point $i$ from the AP located at grid point $j$, for all $1 \le i \le N$ and $1 \le j \le N$.
Assume that the transmission power of an AP is $P_t$ (dBm). Then $RSS(d)$ is calculated based on the empirical indoor propagation model of [20], given in (15), where $d$ is the physical distance from the transmitter (AP) to the receiver.
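The locations of the APs can be recovered by the following $\ell_0$-minimization:
$$\hat{\Theta} = \arg\min_{\Theta} \|\Theta\|_0 \quad \text{s.t.} \quad y = \Phi \Psi \Theta + \varepsilon \tag{16}$$
Unfortunately, solving (16) is both numerically unstable and NP-hard. Therefore, $\ell_1$-minimization is used to recover the AP locations:
$$\hat{\Theta} = \arg\min_{\Theta} \|\Theta\|_1 \quad \text{s.t.} \quad y = \Phi \Psi \Theta + \varepsilon \tag{17}$$
This is a convex optimization problem and various methods have been proposed to find the solution, such as BP [31], OMP [32] and SP [33]. In this paper, we use the OMP algorithm.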
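To make the recovery step concrete, the sketch below implements a basic orthogonal matching pursuit and applies it to a one-dimensional toy layout with a single AP. It is an illustration under stated assumptions, not the experimental setup of this paper: the function name `omp`, the noise-free measurements, the 20-point grid, the RP positions and in particular the log-distance expression used to fill `Psi` are hypothetical stand-ins for the propagation model of [20].

```python
import numpy as np

def omp(A, y, sparsity):
    """Greedy sparse recovery of theta from y ~ A @ theta (orthogonal matching pursuit)."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    col_norms = np.linalg.norm(A, axis=0)
    for _ in range(sparsity):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual) / col_norms
        if support:
            correlations[support] = 0.0
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected columns jointly, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    theta = np.zeros(A.shape[1])
    theta[support] = coef
    return theta

# Toy layout: N = 20 grid points on a line, RPs at six of them, one AP at grid point 7.
grid_x = np.arange(20.0)
rp_idx = np.array([0, 4, 8, 12, 16, 19])
dist = np.abs(grid_x[:, None] - grid_x[None, :])
Psi = -40.0 - 20.0 * np.log10(dist + 1.0)   # hypothetical RSS(d) in dBm, not the model of [20]
Phi = np.eye(20)[rp_idx]                    # each row selects the grid point of one RP
theta_true = np.zeros(20)
theta_true[7] = 1.0
y = Phi @ Psi @ theta_true                  # noise-free RSS seen at the RPs
theta_hat = omp(Phi @ Psi, y, sparsity=1)
```

For $M$ APs the same routine would be run once per column of $y$, each time with sparsity 1, since every $\theta_i$ has a single non-zero entry.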
To evaluate the performance of the proposed CS-based AP localization algorithm, a small number of APs on the fourth floor of the Bahen Building at the University of Toronto have been localized. Figure 7 shows the AP localization results. As seen in the figure, all the AP locations are estimated with a high level of accuracy. Although the localization results contain some errors, these errors have only a limited effect on the RSS smoothing method proposed later.

Figure 7. AP localization results using the CS method (legend: original target positions, recovered positions).

RSS Difference-Aware Graph-Based Semi-Supervised Learning RSS Smoothing Method
The G-SSL method tries to set the same value for $\hat{r}_i$ and $\hat{r}_j$ if the coordinates $c_i$ and $c_j$ at locations $S_i$ and $S_j$ are similar. However, since the distance between each RP and each of the unlabelled locations is known, we can use this information to estimate the expected difference in RSS based on the known locations of the APs and the radio propagation model. Thus, we define $Rd(S_i, S_j)$ as the estimated RSS difference between $r_i$ and $r_j$ at locations $S_i$ and $S_j$. We change the smoothing constraint to reflect that the difference between the estimated RSS values $\hat{r}_i$ and $\hat{r}_j$ should be close to $Rd(S_i, S_j)$. Accordingly, the smoothness term of (4) is modified, which yields the cost function (18).

Estimation of Rd(S_i, S_j)
Consider one of the APs as shown in Figure 8. The location of the AP, $c_{AP}$, can be estimated using the CS-based method in [20]. We use the indoor signal propagation model in [34]. Therefore, the RSS value at location $S_i$ can be calculated as
$$r_i = P - 10\alpha \log_{10}(d_i) + h_i \tag{19}$$
where $d_i$ denotes the distance between the location of the $i$-th measurement and the AP, $P$ is the transmission power of the AP, $\alpha$ is the propagation loss exponent and $h_i$ is the combined effect of path loss, fading, and shadowing. Using this model and assuming $h_i = h_j$, we derive the following expression for $Rd(S_i, S_j)$:
$$Rd(S_i, S_j) = |r_i - r_j| = 10\alpha \left| \log_{10}\frac{d_j}{d_i} \right| \tag{20}$$

Offline Training Phase
In the offline training phase, the coordinates $c_i$ and $c_j$ of the RPs $S_i$ and $S_j$ are already given in the radio map, and the location of the AP, $c_{AP}$, can be calculated precisely using the CS-based method. Thus, the Euclidean distances $d_i$ and $d_j$ between the RPs and the AP can be obtained. Finally, the RSS difference $R_d(S_i, S_j)$ can be calculated directly using (20).
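The offline computation can be sketched as follows. The sketch assumes a log-distance propagation model, so that the RSS difference is $10\alpha\log_{10}(d_j/d_i)$; the function name `rss_difference`, the default value of `alpha` and the coordinates are hypothetical and are not taken from the experiments.

```python
import math

def rss_difference(c_i, c_j, c_ap, alpha=3.0):
    """Expected RSS difference r_i - r_j between two RPs for a single AP.

    Assumes a log-distance path-loss model with equal shadowing at both RPs.
    """
    d_i = math.dist(c_i, c_ap)   # Euclidean distance from RP i to the AP
    d_j = math.dist(c_j, c_ap)   # Euclidean distance from RP j to the AP
    if d_i <= 0.0 or d_j <= 0.0:
        raise ValueError("an RP must not coincide with the AP location")
    return 10.0 * alpha * math.log10(d_j / d_i)

# Hypothetical coordinates in metres: RP j is ten times farther from the AP.
rd = rss_difference(c_i=(1.0, 0.0), c_j=(10.0, 0.0), c_ap=(0.0, 0.0), alpha=2.0)
```

Swapping the two RPs flips the sign of the result, which reflects that the RP closer to the AP is expected to observe the stronger signal.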

Online Localization Phase
In the online localization phase, since the actual location of $S_j$ is unknown, $d_j$ cannot be calculated directly. However, the distance $d(S_i, S_j)$ between the two locations can be estimated using inertial sensor data and step counting algorithms, and $d_i$ can then be calculated. We can therefore use $d_j = d_i - d(S_i, S_j)$ when the mobile device moves towards the AP, or $d_j = d_i + d(S_i, S_j)$ when the mobile device moves away from the AP.
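A minimal sketch of the online case is given below. It again assumes a log-distance propagation model, and the names `online_rss_difference`, `step_dist` and `moving_away` are hypothetical; how the direction of motion relative to the AP is determined is outside the scope of this sketch.

```python
import math

def online_rss_difference(d_i, step_dist, moving_away, alpha=3.0):
    """Expected r_i - r_j when d_j is inferred from the walked distance.

    d_i         distance from the known location S_i to the AP (metres)
    step_dist   walked distance d(S_i, S_j) estimated by step counting (metres)
    moving_away True if the device moves away from the AP, False if towards it
    """
    d_j = d_i + step_dist if moving_away else d_i - step_dist
    if d_i <= 0.0 or d_j <= 0.0:
        raise ValueError("distances to the AP must be positive")
    return 10.0 * alpha * math.log10(d_j / d_i)

# Hypothetical values: start 5 m from the AP and walk 5 m away from it.
rd_away = online_rss_difference(5.0, 5.0, moving_away=True, alpha=2.0)
# Same start, walking 2.5 m towards the AP instead.
rd_towards = online_rss_difference(5.0, 2.5, moving_away=False, alpha=2.0)
```

Note that the two formulas for $d_j$ treat the motion as radial with respect to the AP, so the sketch inherits that approximation.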

Finding the Optimal Solution
The cost function in (18) can be rewritten as the four-part expression (21). In order to find the optimal solution, we need the derivative of the cost function with respect to $\hat{R}$. Since the cost function in (21) is not convex, we use the gradient descent method to solve the optimization problem. Next, we derive the derivative of each part of the cost function in (21).

The first part of (21), $C_1$, is given by (22), where $R = [r_1\ r_2\ \ldots\ r_{\ell+u}]$ is the RSS matrix; if the label of a node is not given, we use $0_{M \times 1}$ instead. $J = \mathrm{diag}(\delta_1, \delta_2, \ldots, \delta_{\ell+u})$ is a Hermitian indicator matrix, where $\delta_i = 1$ means that the corresponding $i$-th node in the graph is labelled and $\delta_i = 0$ otherwise. Using (22), $\partial C_1 / \partial \hat{R}$ is obtained.

The second part of (21) can be rearranged in terms of the graph Laplacian $L = D - W$, where $D = \mathrm{diag}(\mu_1, \mu_2, \ldots, \mu_{\ell+u})$ and $\mu_i = \sum_j w_{ij}$ is the degree of the $i$-th node. The derivative of the third part of the cost with respect to $\hat{R}$ is equal to 0.

The last part of (21), $C_4$, is expressed in terms of $\kappa_{ij} = w_{ij} R_d(S_i, S_j)$. In order to find $\partial C_4 / \partial \hat{R}$, we first find $\partial C_4 / \partial \hat{r}_n$ for $1 \le n \le \ell + u$ using [34]. Since $\kappa_{ij} = w_{ij} R_d(S_i, S_j)$, we have $\kappa_{ni} = \kappa_{in}$, which simplifies the expression. $\partial C_4 / \partial \hat{R}$ is then obtained by collecting the resulting vectors column by column, where $G = [g_1\ g_2\ \ldots\ g_{\ell+u}]$.

Finally, in order to find the optimal solution, we set $\partial C / \partial \hat{R} = 0$. Using (23), (25) and (29), we obtain (31).

In summary, to find the optimal solution, initialize $G = 0_{M \times (\ell+u)}$. Then use the following iterative procedure: first, find $\hat{R}$ as the solution of (31); second, update $G$ based on the result of the first step and the definition of $G$. Repeat the two steps until convergence.
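To make the structure of the linear solve concrete, the sketch below works through a simplified variant, not the formulation of this paper. It assumes a single AP (so the RSS matrix is a vector) and a signed, antisymmetric expected difference `rho[i, j]` for $r_i - r_j$. Under these assumptions the cost $\sum_i \delta_i (r_i - y_i)^2 + \tfrac{\gamma}{2}\sum_{i,j} w_{ij}(r_i - r_j - \rho_{ij})^2$ is convex, and setting its gradient to zero gives $(J + \gamma L)\, r = J y + \gamma g$ with $g_n = \sum_j w_{nj}\rho_{nj}$, so no iteration over $G$ is needed. The names `smooth_rss`, `rho` and `gamma`, and all numerical values, are hypothetical.

```python
import numpy as np

def smooth_rss(y, labelled, W, rho, gamma=1.0):
    """Solve (J + gamma * L) r = J y + gamma * g for a single AP.

    y        measured RSS per node (any value at unlabelled nodes)
    labelled 0/1 vector, 1 where a measurement is available
    W        symmetric non-negative weight matrix with zero diagonal
    rho      antisymmetric matrix, rho[i, j] = expected r_i - r_j
    """
    J = np.diag(np.asarray(labelled, dtype=float))
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    g = (W * rho).sum(axis=1)               # g_n = sum_j w_nj * rho_nj
    return np.linalg.solve(J + gamma * L, J @ y + gamma * g)

# Hypothetical example: four nodes on a line at 1, 2, 4 and 8 m from the AP.
alpha, P = 2.0, -30.0
d = np.array([1.0, 2.0, 4.0, 8.0])
r_true = P - 10.0 * alpha * np.log10(d)                # assumed log-distance model
rho = 10.0 * alpha * np.log10(d[None, :] / d[:, None])  # expected r_i - r_j

# Chain graph: each node is connected to its immediate neighbours.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

labelled = np.array([1, 0, 0, 1])   # only the two end nodes are measured
y = r_true * labelled               # unlabelled entries are set to 0

r_hat = smooth_rss(y, labelled, W, rho)
```

Because the expected differences in this toy case are exact, the two unlabelled nodes are filled in with their true values. With noisy labels the solution trades off fidelity to the measurements against consistency with the expected differences through `gamma`.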

Experimental Results
In order to verify our method, we collected RSS data on the 4th floor of the Bahen Building at the University of Toronto. The radio map was constructed using a step-counter-assisted RSS measurement method, in which sensor information from the accelerometer is used to estimate the distance between the RPs. Using this system, a radio map consisting of 251 RPs throughout the entire 4th floor of the Bahen Building was created in less than 30 minutes. However, the resulting radio map has only 5 RSS measurements at each RP. Consequently, it is more error-prone than traditionally generated radio maps, in which hundreds of measurements are collected for each RP. The proposed localization procedure is tested on a sequence of 35 test points collected on a path from Room 4000 (top of Figure 9) to Room 4148 (bottom of Figure 9).
The RSS values from a single AP in the original radio map and the test points are shown in Figures 10a and 11a, respectively. As can be seen, although the RSS values are generally consistent with the signal propagation model, there are large fluctuations at some RPs. To eliminate the negative effects caused by these fluctuations, the proposed RG-SSL method is applied to smooth the RSS values. In order to obtain more accurate results, 125 unlabelled data points throughout the whole 4th floor of the Bahen Building are considered. The following steps are repeated until all the labelled points are smoothed:
1. Set one of the labelled points as unlabelled.
2. Use the rest of the labelled points, the 125 unlabelled points and the RG-SSL method to estimate the RSS value of the above unlabelled point.
For comparison, the G-SSL, SCTW and SPORT methods are also simulated in this paper. The simulation results are shown in Figures 10 and 11. From Figures 10b and 11b we can see that the RG-SSL method successfully smooths the radio map and mitigates the effect of signal fluctuation. Because most of the information in the radio map is discarded in the SCTW algorithm, it cannot achieve the optimal result. In the SPORT algorithm, due to the variability of the parameters in the signal propagation model, only a suboptimal solution for the RSS values can be obtained. In the G-SSL algorithm, all the collected RSS values are used to correct the outliers, which leads to a better result. Furthermore, in RG-SSL the RSS difference between different locations is used to improve the G-SSL method, and the estimated RSS values are more accurate. As a result, although the radio map and the online RSS values are also smoothed by the other algorithms, their errors are larger than those in Figures 10b and 11b, especially in the upper part of the corridor. The larger errors in the RSS values of the radio map and the online data inevitably result in larger localization errors.
The localization results obtained directly from the original radio map and test point data are shown in Figure 12a. Compared with the actual locations in Figure 9, there are some significant errors in the localization results, as certain distinct test points have been erroneously localized to a single location. The localization results using the modified radio maps are shown in Figure 12b-e. We see that the localization results are improved compared to the results in Figure 12a. Most of the test points that were erroneously localized to one location in Figure 12a are now localized to the correct distinct locations. These incorrect estimates caused a large localization error in the original method, and they are greatly reduced by both the RG-SSL method and the other methods. The localization results of the proposed RG-SSL method are closer to the actual locations than the results of the other methods. Furthermore, the trajectory obtained in Figure 12b is smoother than those in Figure 12c-e.
The performance gain of the RG-SSL method can be seen in the cumulative distribution function (CDF) of the localization error for the RG-SSL method and the other methods, as shown in Figure 13 and Table 1. The proposed localization method outperforms the other methods. As discussed above, the RSS values smoothed by RG-SSL are more accurate than those smoothed by the other algorithms, and the localization accuracy is increased by 3.5% relative to G-SSL and SPORT, 9.8% relative to the SCTW method and 20.6% relative to the original data. The average localization error has been reduced from 2.89 m to 2.07 m and, notably, the maximum localization error has been reduced from 10 m to 4 m.
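The summary statistics used in this comparison (the value of the error CDF at a given distance, the average error and the maximum error) can be computed as in the sketch below. The error values are made up for illustration only and are not the measurements behind Figure 13 or Table 1; the function name `error_cdf` is likewise hypothetical.

```python
import numpy as np

def error_cdf(errors, threshold):
    """Fraction of test points whose localization error is at most `threshold`."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors <= threshold))

# Made-up localization errors in metres (illustration only).
errors = [0.5, 1.0, 1.5, 2.5, 4.0]

cdf_at_2m = error_cdf(errors, 2.0)     # share of points within 2 m
mean_error = float(np.mean(errors))    # average localization error
max_error = float(np.max(errors))      # worst-case localization error
```

Reporting the CDF at a fixed distance together with the mean and maximum error shows both the typical behaviour and the effect of outliers, which is why all three appear in the comparison.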

Sparse Graph Construction for RG-SSL Using CS Method
Since the radio map is constructed using a step-counter-assisted RSS measurement method, the coordinates of each RP calculated by this method contain a considerable amount of noise. In the proposed RG-SSL method, the heat-kernel function is used to construct the graph and to calculate the edge weights based on the Euclidean distance. However, the Euclidean distance, and consequently the weights, are very sensitive to noise.
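For reference, a heat-kernel weight matrix can be sketched as below. The Gaussian form $w_{ij} = \exp(-\|c_i - c_j\|^2 / (2\sigma^2))$ and the bandwidth `sigma` are assumptions made for this example; the exact kernel and parameter used in the experiments are not restated here, and the coordinates are hypothetical.

```python
import numpy as np

def heat_kernel_weights(coords, sigma=1.0):
    """Edge weights w_ij = exp(-||c_i - c_j||^2 / (2 sigma^2)), zero on the diagonal."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise coordinate differences
    sq_dist = (diff ** 2).sum(axis=-1)               # squared Euclidean distances
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                         # no self-loops
    return W

# Hypothetical RP coordinates in metres: two close points and one distant point.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
W = heat_kernel_weights(coords, sigma=1.0)
```

The weight between the two nearby points is about 0.61, while the weight to the distant point is vanishingly small. Because the weight depends exponentially on the squared distance, errors in the step-counter coordinates translate directly into errors in the weights.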
The accuracy of the generated graph greatly affects the positioning performance. When the vertices in the graph are far away from each other, the graph weight is much smaller than the graph weight calculated for neighboring vertices. Therefore, the graph weight matrix is sparse. Since the CS method is robust to noisy data, we can use it to estimate the graph weight matrix [35]. As mentioned in Section 2, we denote the vertex set $V = \{c_1, c_2, \ldots, c_\ell, c_{\ell+1}, \ldots, c_{\ell+u}\}$. Given the measurement matrix $A$ and the matrix of unknown reconstruction coefficients $W$, we can reconstruct a sparse $W$ from $V = AW$ using:

$$\min_{W} \|W\|_0 \quad \text{s.t.} \quad V = AW,$$

where $\|\cdot\|_0$ denotes the $\ell_0$-norm. The $\ell_0$-norm minimization is NP-hard. However, if the solution is sparse enough, the following convex $\ell_1$-norm minimization can be used to solve the sparse representation problem:

$$\min_{W} \|W\|_1 \quad \text{s.t.} \quad V = AW.$$

Suppose the noise in the collected RSS is denoted by $\xi$. Then,

$$V = AW + \xi = B\tilde{W},$$

where $B = [A\ \ I]$ and $\tilde{W} = [W^T\ \ \xi^T]^T$. Thus the $\ell_1$-norm minimization can be rewritten as:

$$\min_{\tilde{W}} \|\tilde{W}\|_1 \quad \text{s.t.} \quad V = B\tilde{W}.$$

For each $c_i$ in the vertex set, the measurement matrix $B_i$ is constructed as $B_i = [c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_{\ell+u}, I]$ and $w_i$ is calculated using $\ell_1$-norm minimization:

$$\min_{w_i} \|w_i\|_1 \quad \text{s.t.} \quad c_i = B_i w_i,$$

where $w_i$ is the $i$-th column of the matrix $W$. The graph weights $w_{ij}$ are then obtained from the vectors $w_i$, where $i, j \in \{1, 2, \ldots, \ell + u\}$ and $w_i(j)$ denotes the $j$-th element of the vector $w_i$.
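The per-vertex sparse coding step can be sketched as follows. Instead of the equality-constrained $\ell_1$ problem, the sketch solves the closely related Lagrangian form $\min_w \tfrac{1}{2}\|c_i - B_i w\|_2^2 + \lambda \|w\|_1$ with the iterative soft-thresholding algorithm (ISTA); this substitution, the function names `ista` and `vertex_coefficients`, the value of `lam` and the toy vertex matrix are all assumptions made for illustration. The final mapping from the coefficients to the graph weights $w_{ij}$ is not covered.

```python
import numpy as np

def ista(B, c, lam=0.01, n_iter=500):
    """Minimize 0.5 * ||c - B w||^2 + lam * ||w||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2      # 1 / (largest singular value)^2
    w = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = w - step * (B.T @ (B @ w - c))      # gradient step on the quadratic term
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w

def vertex_coefficients(V, i, lam=0.01):
    """Sparse coefficients of vertex i over all other vertices.

    V is a (d x n) matrix whose columns are the vertex vectors. The identity
    block appended to B_i absorbs the noise, and its coefficients are dropped.
    """
    d, n = V.shape
    others = np.delete(V, i, axis=1)            # all columns except vertex i
    B_i = np.hstack([others, np.eye(d)])        # B_i = [c_1 .. c_{i-1} c_{i+1} .. c_n, I]
    w_tilde = ista(B_i, V[:, i], lam=lam)
    return w_tilde[: n - 1]                     # keep only the vertex coefficients

# Hypothetical vertex vectors (columns), e.g. 2-D coordinates in metres.
V = np.array([[0.0, 1.0, 2.0, 10.0],
              [0.0, 0.0, 0.0, 5.0]])
coeffs = vertex_coefficients(V, i=1)
```

The parameter `lam` controls the sparsity of the coefficients: larger values drive more entries to exactly zero, at the cost of a looser fit to the vertex being represented.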

Experimental Results
Since the labels of all the vertices in the graph are necessary for the sparse reconstruction of the graph weight matrix, the CS method can only be used in the offline phase. The weight matrices calculated by the heat-kernel function and the CS method are shown in Figure 14a,b, respectively. Each pixel in the figure represents the weight value $w_{ij}$ between two vertices, with $0 \le w_{ij} \le 1$. A larger value of $w_{ij}$ between vertices $S_i$ and $S_j$ means a stronger correlation between them. If the vertices around the vertex $S_i$ have strong correlations with $S_i$, we can obtain a more accurate RSS estimate for $S_i$. As we can see from Figure 14a, since the measurements are noisy, the weight matrix contains some errors. In this weight matrix, the weight values between different vertices are very small, which means that the relationship between different vertices is very weak. Therefore, the information transferred between different vertices is inaccurate and the RSS values estimated using this weight matrix are not accurate enough. As a result, the localization accuracy is reduced by the inaccurate relationship between different locations. Owing to the sparsity of the graph and the robustness of the CS method to noise, the weight matrix is recovered more precisely by the CS method than by the traditional heat-kernel function. The relationship between different vertices is much clearer in Figure 14b than in Figure 14a. Comparing Figure 14b with Figure 14a, the graph weight values calculated by the CS method are much larger than those obtained using the heat-kernel function. As a result, more useful information can be exchanged between different vertices using the matrix in Figure 14b. Therefore, the estimated RSS values are more accurate than those calculated using the heat-kernel function, as shown in Figure 15a,b. Based on the matrix calculated by the CS method, the localization results in Figure 15b are more accurate. Figure 16 shows that the cumulative probability is 71.4% at a location error of 2 m, and the localization accuracy is increased by 7.9% relative to RG-SSL, 11.5% relative to G-SSL and SPORT, and 17.1% relative to the SCTW algorithm. Thanks to the more accurate radio map, the maximum localization error has been further reduced to 3.5 m. Meanwhile, the average localization error has been reduced from 2.07 m to 1.98 m. In summary, the RSG-SSL algorithm is more robust to noise; it performs better than the RG-SSL algorithm and much better than the other algorithms. By using the RSG-SSL method, the localization accuracy of a crowdsourcing-based WLAN indoor localization system is improved significantly. As a result, the localization system can provide a better service.

Conclusions
In this paper, the effect of noise and erroneous measurements in crowdsourced data is reduced by exploiting the relationship between the RSS values at different locations. First, the LR method is used to solve the device diversity problem in the crowdsourcing system automatically. After the uniform radio map is obtained, the RG-SSL method is proposed to improve the localization accuracy by smoothing the RSS values and using label propagation to better estimate the radio map. The relationship between the RSS values is represented using a weighted graph connecting different locations. Additionally, the RSS difference is introduced into the traditional G-SSL method to achieve a better performance. In order to obtain the RSS difference, a CS-based method is used to precisely estimate the locations of the APs. Noisy RSS values can be corrected using the proposed RG-SSL method, resulting in a higher localization accuracy. Owing to the sparsity of the weighted graph in G-SSL, the weighted graph is reconstructed more accurately by the CS method than by the traditional heat-kernel function, which is the idea behind the proposed RSG-SSL method. The experiments performed at the University of Toronto show that the RG-SSL method yields a smoothed radio map and smoothed online RSS values, and that the localization accuracy is improved. The RSG-SSL method applied in the offline phase also results in an improved performance.

Figure 1. Histogram of 100 RSS values of a single AP measured at a location.

Figure 2. RSS values of an AP over the corridor area of the fourth floor of the Bahen Building, University of Toronto.

Figure 3. The system view of the proposed G-SSL based localization.

Figure 4. Linear correlation between RSS values for different devices.
is an indicator function.

Figure 8. Mobile device is moving away from AP.

Figure 9. Actual locations of test points.

Figure 10. Comparison of signal distribution of radio map. (a) Original radio map; (b) RG-SSL method; (c) G-SSL method; (d) SCTW method; (e) SPORT method.

Figure 11. Comparison of signal distribution of test data. (a) Original radio map; (b) RG-SSL method; (c) G-SSL method; (d) SCTW method; (e) SPORT method.

Figure 13. Cumulative distribution function of the localization error.

Figure 14. Comparison of weighted graphs. (a) Weighted graph calculated by heat kernel method; (b) Weighted graph calculated by CS method.

Figure 15. Smoothed signal distribution of radio map and localization results using RSG-SSL. (a) Smoothed signal distribution of radio map; (b) Localization results.

Figure 16. Cumulative distribution function of localization error.

Table 1. Comparison of different algorithms.