Transfer Learning for Wireless Fingerprinting Localization Based on Optimal Transport

Wireless fingerprinting localization (FL) systems identify locations by building radio fingerprint maps, aiming to provide satisfactory location solutions for complex environments. However, the radio map changes easily, and building a new one is costly. One research focus is therefore to transfer knowledge from old radio maps to a new one. Feature-based transfer learning methods map the source and target fingerprints to a common latent domain and then minimize the maximum mean discrepancy (MMD) between the empirical distributions in that latent domain. In this paper, optimal transport (OT)-based transfer learning is adopted to map fingerprints directly from the source domain to the target domain by minimizing the Wasserstein distance, so that the data distributions of the two domains are better matched and the positioning performance in the target domain is improved. Two channel models are used to simulate the transfer scenarios, and a test on public measured data further verifies that OT-based transfer learning achieves better accuracy and performance when the radio map changes in FL, indicating the importance of the method in this field.


Introduction
In the mode of pervasive computing, people can acquire and process information at any time, in any place, and in any way. Location information is essential for pervasive computing. Satellite positioning technology meets most outdoor location acquisition requirements, and indoor positioning technologies are constantly emerging to cover the "last meter" of positioning [1]. Classified by wireless technology, the existing methods include Wi-Fi positioning, Bluetooth positioning, ultra-wideband positioning, lidar positioning, and so on. Classified by measurement technique, they include time of arrival (TOA) positioning, angle of arrival (AOA) positioning, received signal strength (RSS) positioning [2], channel state information (CSI) positioning [3], etc. Classified by algorithm, they include triangulation, direct positioning, fingerprint localization (FL), and so on. Compared with other techniques, FL can be applied to complex environments, has a wide application scope, and is easy to implement [2][3][4][5].
FL avoids the need for artificial modeling of complex indoor wireless channels and is typically achieved through machine learning techniques such as classification or regression algorithms. As with traditional machine learning applications, FL usually assumes that the training fingerprint data (also called radio map) are sampled from the same distribution as the test data. However, in practice, many factors will cause fingerprint distribution to change. For example, when using RSS fingerprints as features, changes in Access Point (AP) number/location, positioning environment, or equipment parameters will all lead to changes in channel parameters, resulting in changes in the location fingerprint distribution. As a result, the model's accuracy obtained from the old training data will decrease or even fail. This can be addressed by seeking transfer learning techniques.
Transfer learning studies how to transfer the knowledge from the old data domain to the new data domain to deal with the problem that there is no labeled data (unsupervised) or insufficient labeled data (semi-supervised) in the new domain. Transfer learning is a new paradigm in machine learning technology, which has been successfully applied in many machine learning fields, such as the biomedical field [6,7], language and text recognition field [8], graph neural network field [9], detection field [10], etc., and also in positioning [11].
The existing studies on FL transfer learning mainly consider three scenarios: time transfer [12][13][14], device transfer [15,16], and space transfer [17]. Time transfer refers to transferring the collected fingerprints or trained model knowledge from one time to another. Device transfer is the transfer from one device to another. Spatial transfer refers to the transfer from one area to another. Whichever kind of transfer is considered, the essential problem is that the fingerprint distribution has changed from one state to another. Feature-based transfer learning has recently been verified to be effective in FL problems [18]. However, the frequently used maximum mean discrepancy (MMD) distance does not consider the detailed differences between the two distributions.
In this paper, a novel transfer learning method based on optimal transport (OT) [19] is applied to FL. By introducing the Laplacian regularization and jointly learning mechanism, a smoother mapping function can be learned to improve the algorithm's robustness further. We use both the free space channel model and the multi-wall model to simulate the proposed method's performance and analyze the reason why the OT-based transfer learning performance is good. The performance of the algorithm is further verified on the public measured data set. The results indicate that OT technology is significant to the transfer learning problem in FL.

Wireless Fingerprinting Localization
RADAR is one of the earliest wireless fingerprint positioning systems [20]. It adopts the KNN regression positioning method: a classic FL method with a simple algorithm and good robustness. Subsequently, support vector machine [21], decision tree [22], neural network [23], and other machine learning methods are also used in fingerprint positioning. These methods are deterministic methods, which map the location fingerprint to a specific location. Another type of method is called the probability method, which considers the position fingerprint's randomness and maps a fingerprint into the probability density of the position. The usual methods are naive Bayes [24], probability kernel regression [25], conditional random field [26], and so on.

Transfer Learning in the Wireless Fingerprinting Localization
The research on the transfer learning of wireless FL started at about the same time as the general transfer learning. Yin et al. proposed a time transfer method of radio fingerprint map in 2005 [12], which uses regression analysis to learn the predictive relationship between the received signal strength of mobile devices and the received signal strength of reference points on sparse locations, to update the radio map in real-time according to the predictive relationship. LeManCoR [13] is a kind of adaptive method for fingerprint mapping based on the Manifold co-regularization. The LuMA method [15] takes into account both time transfer and device transfer. In addition, there are studies that consider multi-device transfer [16] and spatial transfer [17] in fingerprint positioning. All the above methods require the target domain to have a small amount of labeled data, so they belong to semi-supervised transfer learning. Furthermore, they all adopt the manifold alignment algorithm. TrHMM [14] is also a time transfer learning method for FL. Unlike the previous method, it is a parameter transfer learning method based on the Hidden Markov model. None of these methods adopt the latest available transfer learning algorithm.
General feature-based transfer learning methods include Transfer Component Analysis (TCA) [27], Joint Distribution Adaptation (JDA) [28], Balanced Distribution Adaptation (BDA) [29], etc. These methods are types of unsupervised transfer learning, where there is no label information in the target domain. An improved method based on general feature-based transfer learning has recently been applied successfully to FL [18]. However, the method has many hyperparameters, and the performance may worsen when they are not chosen carefully. A fingerprint transfer learning method with fewer hyperparameters and robust performance is therefore needed for practical applications.

Optimal Transport
The theory of optimal transport (OT) originates from the "Monge problem" described by the mathematician Gaspard Monge (1746–1818): how to move a sand pile to another place and change its shape into a predefined one at minimal cost [19]. It was later used by mathematicians to compare two probability distributions. In recent years, the appearance of approximate solvers that scale to high-dimensional problems has set off a revolution in OT technology. It has been successfully applied to a variety of problems in image science (such as color or texture processing), graphics (for shape processing), and machine learning (for regression, classification, and generative modeling), as well as to the problem of transfer learning [30].

Problem Description
FL is a method of associating location information with its fingerprint and then using parametric or nonparametric models for location identification. Specific environmental features are the basis for creating fingerprint information. Wireless FL uses wireless signal features to create a fingerprint of the position, also known as a wireless fingerprint. The fingerprints in the area of interest constitute a radio map. Various wireless channel measurements can be selected as site-specific signal features, such as TOA, AOA, RSS, CSI, etc. Sometimes, they can be fused to form a higher-dimensional feature space. The signal feature is mapped to the position fingerprint in a preset way. Then, a sample associated with position $u$ can be expressed as $x(u) = (x_1(u), ..., x_m(u)) \in \mathcal{X}^m$, and its conditional probability density is denoted as $P_{X|U}$, where $\mathcal{X}^m$ represents the m-dimensional fingerprint space.
Wireless FL usually includes two stages: the offline stage and the online stage. A radio map is first created in the offline stage; it contains pairs of coordinates and corresponding fingerprints within the location area. Then, the radio map and the learning algorithm are used to obtain the position recognition function $g : x \to u$, which maps a fingerprint to its estimated coordinate. In the online stage, the target's estimated coordinate $g(x)$ is obtained from the online fingerprint $x$ and the function $g$.
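As an illustration of the two stages, the following minimal sketch builds a toy radio map offline and uses KNN regression (the approach of RADAR [20]) as the position recognition function g online. The function name `knn_locate`, the toy RSS values, and the choice k = 2 are assumptions made for this example only and are not taken from the experiments of this paper.

```python
import math

def knn_locate(radio_map, x, k=3):
    """Estimate a coordinate for online fingerprint x by averaging the
    coordinates of the k offline fingerprints closest to x.

    radio_map: list of (fingerprint, coordinate) pairs collected offline.
    """
    # Rank the offline fingerprints by Euclidean distance to the online one.
    ranked = sorted(radio_map, key=lambda pair: math.dist(pair[0], x))
    nearest = ranked[:k]
    dim = len(nearest[0][1])
    # Average the coordinates of the k nearest neighbours.
    return tuple(sum(u[d] for _, u in nearest) / len(nearest) for d in range(dim))

# Hypothetical offline radio map: two RSS values (dBm) per 2-D coordinate.
radio_map = [
    ((-40.0, -70.0), (0.0, 0.0)),
    ((-50.0, -60.0), (5.0, 0.0)),
    ((-60.0, -50.0), (10.0, 0.0)),
    ((-70.0, -40.0), (15.0, 0.0)),
]

# Online stage: locate a new fingerprint.
estimate = knn_locate(radio_map, (-52.0, -58.0), k=2)
```

If the fingerprint distribution drifts between the offline and online stages, the nearest neighbours found in the old radio map no longer correspond to nearby positions, which is the failure mode that transfer learning addresses.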
In transfer learning of FL, it is assumed that the data in the offline and online phases come from different distributions. Mathematically, let the source domain be $\mathcal{D}_s = \{\mathcal{X}^m_s, P^s_X(x)\}$ and the task in the source domain be $\mathcal{T}_s = \{\mathcal{U}^d_s, g_s(\cdot)\}$, where the subscript or superscript "s" denotes the source domain; replacing it with "t" denotes the target domain. The following text follows this notation. $\mathcal{X}^m$ represents the m-dimensional fingerprint space, and $\mathcal{U}^d$ represents the d-dimensional position space. In this paper, $d = 2$ is set, and the superscripts "m" and "d" are omitted below when the meaning is clear. $g(\cdot)$ stands for the position recognition function. The joint distribution of the offline phase, $P^s_{X,U}(x, u)$, is associated with the source domain, and the joint distribution of the online phase, $P^t_{X,U}(x, u)$, is associated with the target domain. In the transfer learning setting, $P^s_{X,U}(x, u) \neq P^t_{X,U}(x, u)$. Transfer learning of FL studies how to transfer the knowledge about location from the old radio map (in the source domain) to the new one (in the target domain), so as to make full use of the knowledge of the source domain to optimize the task of FL in the target domain. In practice, multiple instances of fingerprints in a domain can be observed, either with or without a location label.
Assume that the fingerprint-coordinate pairs $D_s = \{(x_i, u_i) \mid x_i \in \mathcal{X}_s, u_i \in \mathcal{U}_s, i = 1, ..., n_s\}$ were observed in the source domain, $n_s$ samples in total.
In the target domain, the unlabeled fingerprint set $D_t = \{x_i \mid x_i \in \mathcal{X}_t, i = n_s + 1, ..., n_s + n_t\}$ and the labeled fingerprint set $D_t' = \{(x_i, u_i) \mid x_i \in \mathcal{X}_t, u_i \in \mathcal{U}_t, i = n_s + n_t + 1, ..., n_s + n_t + n_t'\}$ were observed, including $n_t$ and $n_t'$ samples, respectively; in general, the labeled target set is much smaller than the other two. When $n_t' = 0$, it is called unsupervised transfer learning; otherwise, it is called semi-supervised transfer learning.
The transfer learning task of FL is to estimate the position recognition function $\hat{g}_t(x; D_s, D_t, D_t')$ in the target domain according to the observed fingerprint samples in the source domain and the target domain, so as to minimize the generalization error of the target domain. The generalization error is expressed as follows,

$$\epsilon_t(\hat{g}_t) = \mathbb{E}_{(x,u) \sim P^t_{X,U}}\left[\ell\left(\hat{g}_t(x), u\right)\right],$$

where $\ell(\cdot, \cdot)$ is a loss function measuring the positioning error. When $\mathcal{X}_s = \mathcal{X}_t$, it is called homogeneous transfer learning. This paper considers unsupervised homogeneous transfer learning. Hypotheses must be made to theoretically guarantee that transfer learning succeeds [31]. The following hypotheses are often used in the transfer learning problem.
• Class imbalance hypothesis: the distribution of labels in the two domains is different, i.e., $P^s_Y(y) \neq P^t_Y(y)$, but the conditional probability distribution of the feature is the same, i.e., $P^s_{X|Y}(x|y) = P^t_{X|Y}(x|y)$.
• Covariate shift (covariance offset) hypothesis: the marginal distribution of the two domains is different, that is, $P^s_X(x) \neq P^t_X(x)$, but the conditional probability distribution of the label is the same, that is, $P^s_{Y|X}(y|x) = P^t_{Y|X}(y|x)$ (equivalent to the learning function $g_s = g_t = g$).
In wireless FL, the typical scenarios that require transfer learning can be summarized into two cases:
1. The channel parameters on one or more links are changed;
2. The channel parameters of a local region are changed.
Whichever case is considered, the above hypotheses are too strong to be satisfied. First, the class imbalance hypothesis requires that the distribution of fingerprints at each location remain the same, which no longer holds once the channel parameters change. Second, the covariance offset hypothesis requires that the position recognition function be the same in both domains. Therefore, transfer learning algorithms based on these two hypotheses are prone to fail in FL.
In addition, the feature-based transfer learning approach assumes the existence of a pair of mapping functions $\{\phi_s(\cdot), \phi_t(\cdot)\}$ that map features from the source and the target domains to a common latent domain, while the labels remain unchanged. Therefore, the learning function in the target domain $g_t(\cdot)$ is approximately replaced by $g_l : \phi_s(x^s) \to u^s$, the position recognition function trained in the latent domain.

Transfer Component Analysis
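Transfer Component Analysis (TCA) [27] is one of the most common feature-based transfer learning methods; its principle is to adapt the marginal distribution of the features. TCA learns cross-domain transfer components in a reproducing kernel Hilbert space (RKHS) by minimizing the MMD distance between the mapped source-domain and target-domain samples. Let the numbers of samples in the source domain and target domain be $n_s$ and $n_t$, respectively; the MMD distance between them is [32]

$$D(X_s, X_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x^s_i) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x^t_j) \right\|^2_{\mathcal{H}}, \tag{2}$$

where $X_s$ and $X_t$ represent the fingerprint sample matrices of the source domain and target domain, respectively, and $\phi(\cdot)$ is the feature map into the RKHS $\mathcal{H}$. However, $\phi$ is usually highly nonlinear, and it is difficult to minimize the MMD distance directly. The distance can be converted into the kernel form $\mathrm{tr}(KL)$, so the problem is converted to kernel matrix learning and written as a semi-definite program (SDP) of the form

$$\min_{K \succeq 0} \ \mathrm{tr}(KL) - \lambda\, \mathrm{tr}(K), \quad \text{subject to constraints on } K,$$

where $\lambda \geq 0$ is a trade-off parameter and $K$ is the kernel matrix defined on all data. Let $K_{s,s}$, $K_{t,t}$, and $K_{s,t}$, respectively, represent the Gram matrices defined on the source domain, target domain, and cross-domain data:

$$K = \begin{bmatrix} K_{s,s} & K_{s,t} \\ K_{s,t}^T & K_{t,t} \end{bmatrix}.$$

The elements of the matrix $L$ are calculated as

$$L(i, j) = \begin{cases} \dfrac{1}{n_s^2}, & x_i, x_j \in X_s, \\ \dfrac{1}{n_t^2}, & x_i, x_j \in X_t, \\ -\dfrac{1}{n_s n_t}, & \text{otherwise}. \end{cases}$$

The solution can instead be constructed with a dimension reduction method, that is, by solving for the $m$ leading eigenvectors of $(KLK + \mu I)^{-1} KHK$, where $H = I - \frac{1}{n_s + n_t} \mathbf{1}\mathbf{1}^T$ is the centering matrix, which avoids the computational cost of solving the SDP. TCA is the base method in feature-based transfer learning; many other methods, such as JDA [28] and BDA [29], extend it.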
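To make the kernel form of the MMD distance concrete, the sketch below evaluates the (biased) empirical squared MMD between two small sample sets, which equals tr(KL) for the kernel matrix K and coefficient matrix L described above. The Gaussian kernel, its bandwidth `gamma`, and the toy samples are illustrative assumptions; this is not the TCA optimization itself, only the distance it minimizes.

```python
import math

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative bandwidth choice."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def mmd_squared(xs, xt, kernel=rbf):
    """Biased empirical estimate of the squared MMD, i.e. tr(KL)."""
    ns, nt = len(xs), len(xt)
    # Mean kernel value within the source block (weights 1 / ns^2).
    k_ss = sum(kernel(a, b) for a in xs for b in xs) / ns ** 2
    # Mean kernel value within the target block (weights 1 / nt^2).
    k_tt = sum(kernel(a, b) for a in xt for b in xt) / nt ** 2
    # Mean kernel value across domains (weights -1 / (ns * nt), counted twice).
    k_st = sum(kernel(a, b) for a in xs for b in xt) / (ns * nt)
    return k_ss + k_tt - 2.0 * k_st

# Hypothetical toy fingerprints: the target is the source shifted by 3 in the first feature.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = [(x + 3.0, y) for x, y in source]

mmd_same = mmd_squared(source, source)     # identical samples
mmd_shifted = mmd_squared(source, target)  # shifted samples
```

The distance vanishes for identical sample sets and grows as the two sets move apart in the kernel feature space.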

Basic Method
In this paper, we apply the optimal transport (OT) method to the transfer learning of FL. Different from the feature-based methods (such as TCA), it is assumed that the domain drift is caused by an unknown transformation $T : \mathcal{X}_s \to \mathcal{X}_t$ from the distribution of the source domain to the distribution of the target domain. In FL, the corresponding physical interpretation of the drift may be a change of fingerprint acquisition conditions, changes in environmental parameters, changes in noise conditions, or other unknown processes. Suppose the transformation $T$ preserves the conditional distribution of the location label in this process, namely,

$$P_s(u \mid x^s) = P_t(u \mid T(x^s)).$$

This means that the transformation preserves the information of the position decision function, so the position estimator in the target domain can be approximated by an estimator trained on the source samples after they are mapped to the target domain, $T(x^s)$, as shown in the schematic diagram in Figure 1. In this way, the knowledge about the location recognition function is transferred from the old radio map to the new one. From the perspective of probability, $T$ transforms the marginal measure $p^s_X$ of the fingerprint on the source domain into its image measure, denoted $T\#p^s_X$. It is another measure on $\mathcal{X}_t$, which satisfies

$$T\#p^s_X(A) = p^s_X\left(T^{-1}(A)\right) \quad \text{for all Borel subsets } A \subset \mathcal{X}_t.$$

If $T\#p^s_X = p^t_X$, $T$ is called a transport map from $p^s_X$ to $p^t_X$. Under this assumption, $X_t$ comes from the same probability distribution as $T\#p^s_X$. Therefore, the principle of solving the transfer learning problem in FL through OT is the same as that in [16]:
1. The probability measures $p^s_X$ and $p^t_X$ are estimated using $X_s$ and $X_t$.
2. A transport map $T$ from $p^s_X$ to $p^t_X$ is found.
3. The labeled samples $X_s$ are transported with $T$, and then the target domain estimator $g_t(\cdot)$ is trained with the transformed samples.
The key point is to find the right transport $T$. OT finds $T$ by minimizing the transport cost $C(T)$:

$$C(T) = \int_{\mathcal{X}_s} c\left(x, T(x)\right) \, dp^s_X(x),$$

where the cost function $c : \mathcal{X}_s \times \mathcal{X}_t \to \mathbb{R}^+$ is a distance function in the feature space $\mathcal{X}$. $C(T)$ can be interpreted as the total energy required to move the fingerprint probability mass $p^s_X$ to the fingerprint probability mass $T\#p^s_X$. The solution to the OT problem defined by the Monge problem is

$$T_0 = \arg\min_{T} \int_{\mathcal{X}_s} c\left(x, T(x)\right) \, dp^s_X(x), \quad \text{s.t. } T\#p^s_X = p^t_X. \tag{9}$$

The problem in (9) is combinatorial, and its feasible set is nonconvex. Therefore, solving the Monge problem is difficult. The Kantorovich form of OT is a convex relaxation. Define $\Pi$ as the set of probabilistic couplings $\gamma \in \mathcal{P}(\mathcal{X}_s \times \mathcal{X}_t)$ whose marginal measures are $p^s_X$ and $p^t_X$. The Kantorovich problem requires finding the probabilistic coupling that minimizes the following formula,

$$\gamma_0 = \arg\min_{\gamma \in \Pi} \int_{\mathcal{X}_s \times \mathcal{X}_t} c(x^s, x^t) \, d\gamma(x^s, x^t),$$

where $\gamma$ can be regarded as the joint probability measure with marginal measures $p^s_X$ and $p^t_X$, also known as the transport map (or transport plan). The above formula has been proved to define a distance between distributions, which is called the Wasserstein distance or Earth Mover's distance. The Wasserstein distance of order $n$ between $p^s_X$ and $p^t_X$ is defined as

$$W_n(p^s_X, p^t_X) = \left( \min_{\gamma \in \Pi} \int_{\mathcal{X}_s \times \mathcal{X}_t} d(x^s, x^t)^n \, d\gamma(x^s, x^t) \right)^{1/n},$$

where $d$ is a metric and the cost is $c = d^n$. Compared with the MMD distance defined in formula (2), the Wasserstein distance better describes the contour and detail differences of the two distributions.
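A small numerical illustration of this point, under simplifying assumptions: for two equal-size one-dimensional samples with uniform weights, the optimal coupling matches the sorted samples, so the order-2 Wasserstein distance can be computed by sorting. The sketch compares it with the plain difference of sample means (the quantity a linear-kernel MMD would measure) on two hypothetical samples that share a mean but differ in spread. The function names and sample values are assumptions for this example.

```python
def wasserstein2_1d(xs, xt):
    """Order-2 Wasserstein distance between two equal-size 1-D samples
    with uniform weights; in 1-D the optimal coupling pairs sorted samples."""
    assert len(xs) == len(xt)
    pairs = zip(sorted(xs), sorted(xt))
    return (sum((a - b) ** 2 for a, b in pairs) / len(xs)) ** 0.5

def mean_gap(xs, xt):
    """Absolute difference of the two sample means."""
    return abs(sum(xs) / len(xs) - sum(xt) / len(xt))

# Hypothetical samples: same mean, different variance.
wide = [-2.0, -1.0, 0.0, 1.0, 2.0]
narrow = [-0.5, -0.25, 0.0, 0.25, 0.5]

gap = mean_gap(wide, narrow)        # blind to the variance mismatch
w2 = wasserstein2_1d(wide, narrow)  # reflects the variance mismatch
```

Here the mean difference is zero although the two samples clearly differ, while the Wasserstein distance is strictly positive.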
As only discrete samples are obtained in practice, only the empirical distributions of $p^s_X$ and $p^t_X$ are available, denoted as

$$p^s_X = \sum_{i=1}^{n_s} p^s_i \delta_{x^s_i}, \qquad p^t_X = \sum_{i=1}^{n_t} p^t_i \delta_{x^t_i},$$

where $\delta_{x_i}$ and $p_i$ represent the Dirac function and the probability mass at the sample $x_i$, respectively. $p_i$ belongs to the probability simplex, namely, $\sum_i p_i = 1$. Then, define the set of probability coupling matrices as follows,

$$B = \left\{ \gamma \in (\mathbb{R}^+)^{n_s \times n_t} \mid \gamma \mathbf{1}_{n_t} = p^s_X, \ \gamma^T \mathbf{1}_{n_s} = p^t_X \right\},$$

where $\mathbf{1}_d$ is the d-dimensional all-ones column vector. The discrete version of the Kantorovich form of OT can be expressed as

$$\gamma_0 = \arg\min_{\gamma \in B} \langle \gamma, C \rangle_F, \tag{14}$$

where $\langle \cdot, \cdot \rangle_F$ represents the Frobenius product, $C \geq 0$ represents the cost matrix, and $C(i, j) = c(x^s_i, x^t_j)$ represents the cost required to transport the probability mass from $x^s_i$ to $x^t_j$. For simplicity, the squared Euclidean distance is used as the cost in this paper, that is,

$$C(i, j) = \left\| x^s_i - x^t_j \right\|^2_2.$$

In general, $\gamma_0$ is a sparse matrix containing at most $n_s + n_t - 1$ nonzero elements. Problem (14) is a linear program, which can be solved by the simplex method [33]; however, the complexity is high. Entropy-regularized OT adds an entropy regularization term $\Omega_e(\gamma)$, namely,

$$\gamma_0 = \arg\min_{\gamma \in B} \langle \gamma, C \rangle_F + \lambda_e \Omega_e(\gamma),$$

where $\Omega_e(\gamma) = \sum_{i,j} \gamma(i, j) \log \gamma(i, j)$ is the negative entropy of $\gamma$, and $\lambda_e$ represents the corresponding regularization coefficient. By adding the entropy term, a smoother transport map is obtained, and a more efficient algorithm, called Sinkhorn-Knopp [34], can be used. After $\gamma_0$ is solved, barycentric mapping [35] gives the images of all source domain samples in the target domain:

$$\hat{X}_s = \mathrm{diag}\left(\gamma_0 \mathbf{1}_{n_t}\right)^{-1} \gamma_0 X_t. \tag{16}$$

The purpose of OT transfer learning is to correctly recover the transport map from the data distribution in the source domain to the data distribution in the target domain; which classes of transformations can be recovered in general has not been established theoretically. However, it has been proved that an affine transformation between discrete distributions can be recovered without loss [30].
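The entropy-regularized problem and the barycentric mapping can be sketched in a few lines. The following dependency-free example is a minimal illustration of Sinkhorn-Knopp scaling followed by barycentric mapping, assuming uniform marginals, a squared Euclidean cost, and a hypothetical toy data set in which the target is an affine image of the source. The regularization value `reg`, the iteration count, and all names are assumptions for this example; for real experiments a tested library such as the toolbox in [39] is preferable.

```python
import math

def sinkhorn(p_s, p_t, cost, reg=1.0, n_iter=500):
    """Entropy-regularized OT by Sinkhorn-Knopp scaling.
    Returns the coupling gamma as a list of rows."""
    ns, nt = len(p_s), len(p_t)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(nt)] for i in range(ns)]
    u = [1.0] * ns
    v = [1.0] * nt
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p_s[i] / sum(kernel[i][j] * v[j] for j in range(nt)) for i in range(ns)]
        v = [p_t[j] / sum(kernel[i][j] * u[i] for i in range(ns)) for j in range(nt)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(nt)] for i in range(ns)]

def barycentric_map(gamma, x_t):
    """Map each source sample to the gamma-weighted mean of the target samples."""
    dim = len(x_t[0])
    mapped = []
    for row in gamma:
        mass = sum(row)
        mapped.append(tuple(
            sum(row[j] * x_t[j][d] for j in range(len(x_t))) / mass
            for d in range(dim)))
    return mapped

# Hypothetical toy data: the target is the affine image 2x + 2 of the source.
x_s = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x_t = [(2.0, 2.0), (4.0, 2.0), (2.0, 4.0)]
cost = [[sum((a - b) ** 2 for a, b in zip(xs_i, xt_j)) for xt_j in x_t] for xs_i in x_s]
p_s = [1.0 / len(x_s)] * len(x_s)
p_t = [1.0 / len(x_t)] * len(x_t)

gamma = sinkhorn(p_s, p_t, cost, reg=1.0)
mapped = barycentric_map(gamma, x_t)
```

Because of the entropy term the coupling is smooth rather than a permutation, so each mapped source sample is pulled slightly toward the other target samples; a smaller `reg` sharpens the coupling at the price of slower convergence and a risk of numerical underflow in this naive implementation.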

Laplacian Regularization
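In FL, fingerprints that are intuitively "close" in the source domain should also be "close" when transported to the target domain, and vice versa. Let $\hat{x}^s_i$ represent the value after the source domain sample $x^s_i$ is mapped to the target domain (and $\hat{x}^t_i$ the value after the target domain sample $x^t_i$ is mapped to the source domain); the Laplacian regularization term is introduced:

$$\Omega_l(\gamma) = \frac{\alpha}{n_s^2} \sum_{i,j} S_s(i, j) \left\| \hat{x}^s_i - \hat{x}^s_j \right\|^2_2 + \frac{1 - \alpha}{n_t^2} \sum_{i,j} S_t(i, j) \left\| \hat{x}^t_i - \hat{x}^t_j \right\|^2_2,$$

where $S_s(i, j) \geq 0$ is the element of the sample similarity matrix $S_s$ in the source domain, and $S_t$ is defined similarly for the target domain. $\alpha$ is a hyperparameter, which represents the importance factor of the Laplacian regularization in the source domain. When the marginal distributions are uniform, the above formula can be further simplified by formula (16) to (up to a constant factor)

$$\Omega_l(\gamma) = \alpha \, \mathrm{tr}\left(X_t^T \gamma^T L_s \gamma X_t\right) + (1 - \alpha) \, \mathrm{tr}\left(X_s^T \gamma L_t \gamma^T X_s\right),$$

where $L_s = \mathrm{diag}(S_s \mathbf{1}) - S_s$ is the Laplacian matrix associated with the graph $S_s$; similarly, $L_t = \mathrm{diag}(S_t \mathbf{1}) - S_t$. When using Laplacian regularization, we solve the following problem,

$$\gamma_0 = \arg\min_{\gamma \in B} \langle \gamma, C \rangle_F + \lambda_e \Omega_e(\gamma) + \lambda_l \Omega_l(\gamma), \tag{19}$$

where $\lambda_l$ represents the coefficient of the Laplacian regularization. The subsequent mapping process is the same as for the Sinkhorn algorithm.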
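The simplification to a trace form rests on a standard identity for graph Laplacians: for a symmetric similarity matrix S with Laplacian L = diag(S1) − S and a set of points y_i stacked as the rows of Y, the weighted sum of squared pairwise distances equals 2 tr(Yᵀ L Y) (the constant factor 2 can be absorbed into the regularization coefficient). The sketch below checks this identity numerically on a hypothetical three-sample example; the similarity values and point coordinates are assumptions for illustration.

```python
def laplacian(S):
    """Graph Laplacian L = diag(S 1) - S of a similarity matrix S."""
    n = len(S)
    return [[(sum(S[i]) if i == j else 0.0) - S[i][j] for j in range(n)]
            for i in range(n)]

def pairwise_penalty(S, Y):
    """sum_{i,j} S[i][j] * ||y_i - y_j||^2 over all ordered pairs."""
    n = len(S)
    return sum(S[i][j] * sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
               for i in range(n) for j in range(n))

def trace_form(L, Y):
    """tr(Y^T L Y), computed as sum_{i,j} L[i][j] * <y_i, y_j>."""
    n = len(L)
    return sum(L[i][j] * sum(a * b for a, b in zip(Y[i], Y[j]))
               for i in range(n) for j in range(n))

# Hypothetical symmetric similarity graph over three samples.
S = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
# Hypothetical mapped samples (rows of Y).
Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

penalty = pairwise_penalty(S, Y)
trace_value = trace_form(laplacian(S), Y)
```

In this example the pairwise penalty is 6 and the trace form is 3, in agreement with the identity.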

Joint Estimation of Transport Map and Transformation Function
The transport map is responsible for transporting the empirical probability mass from the source domain to the target domain, or vice versa. The OT algorithm yields a transport map between probability masses rather than an explicit transformation function. Jointly learning the transport map and the transformation function lets the learner generalize better to unseen samples, which is known as the out-of-sample case [36]. The cost function of joint transport and transformation estimation is

$$\min_{T \in \mathcal{H}, \ \gamma \in B} \ \left\| T(X_s) - n_s \gamma X_t \right\|^2_F + \lambda_\gamma \langle \gamma, C \rangle_F + \lambda_T R(T), \tag{20}$$

where $R(\cdot)$ is the regularization term related to the transformation $T$; $\lambda_\gamma$ and $\lambda_T$ are regularization coefficients. The hypothesis space $\mathcal{H}$ of the transformation $T$ can be either a linear or a nonlinear function space; a linear function space is adopted in this paper.

Data Preprocessing and Optimization Algorithm
In our experiments, the source domain and target domain data are both normalized by subtracting the mean value and dividing by the variance of the data. Data preprocessing is rarely mentioned in previous transfer learning studies, but it significantly impacts performance. This is because data preprocessing reduces the difference in mean and variance between the two domains to a certain extent, which is similar to the effect of "transfer".
For the optimization of the Laplacian regularization version in Formula (19), we adopt the Generalized Conditional Gradient (GCG) algorithm [37]. The regularization in (19) is strictly convex, so the objective function attains its minimum on $B$. Specifically, let $f(\gamma)$ denote the objective function in formula (19). At iteration $l$, the method linearizes the objective around the current iterate $\gamma^l$, solves the resulting problem over $B$ to obtain a search direction, and updates the iterate along this direction with a step size $\tau_l$ obtained by line search; these steps are repeated until convergence. For the optimization of the joint estimation version in Formula (20), the block-coordinate descent (BCD) [38] method is used. The idea is to alternately optimize $\gamma$ and $T$.
These optimization algorithms are available on a public website [39] that readers can refer to.

Wireless Fingerprint Channel Model
In this paper, RSS fingerprints are taken as an example, assuming that there are $m$ APs in the region of interest and that the region is a 2-dimensional Euclidean space. The coordinate of the i-th AP is $a_i$, $i = 1, ..., m$. The APs transmit wireless signals at a certain power. The fingerprint at the coordinate $u$ in the source domain is represented by $x^s(u)$, which is composed of the received power from all APs, namely, $x^s(u) = (x^s_1(u), x^s_2(u), ..., x^s_m(u))$. In the target domain, the channel link parameters change, and the fingerprint at coordinate $u$ is represented by $x^t(u)$. The commonly used models of RSS fingerprints are the free-space model, the multi-wall model, and ray tracing [40]. In this paper, the free-space model and the multi-wall model are used to simulate the distribution changes in fingerprint transfer.

Free Space Loss Model
In the free space loss model, it is assumed that the received power is calculated over a long period of time without considering the small-scale fading of the channel. Assuming that the received power (in mW) follows a lognormal distribution, the fingerprint component of the mobile device located at $u$ from the i-th AP is [41]

$$x_i(u) = P_i - 10 \cdot \alpha_i \log_{10} \frac{\| u - a_i \|}{d} + N_i, \tag{22}$$

where $P_i$ represents the received power of the i-th AP at the reference distance $d$ (usually 1 m), in dBm; $\| u - a_i \|$ is the distance between the mobile device and the i-th AP; $\alpha_i$ is the path loss exponent of the i-th AP; and $N_i$ is the link noise of the i-th AP, modeled as zero-mean Gaussian white noise with variance $\sigma^2_i$. Assuming that the APs are independent of each other, the conditional probability density function of the location fingerprint is

$$P(x \mid u) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2_i}} \exp\left( -\frac{\left( x_i - P_i + 10 \alpha_i \log_{10} \frac{\| u - a_i \|}{d} \right)^2}{2\sigma^2_i} \right).$$

The free space loss model is a signal propagation model for an ideal environment. In this paper, the free space model is used to model the fingerprint changes when the link parameters of the channel change. Suppose the link noise of all APs is equal, that is, $N_1 = N_2 = \cdots = N_m = N$, and remains unchanged. When the link parameters change from $\theta^s = (P^s_1, P^s_2, ..., P^s_m, \alpha^s_1, \alpha^s_2, ..., \alpha^s_m)^T$ to $\theta^t = (P^t_1, P^t_2, ..., P^t_m, \alpha^t_1, \alpha^t_2, ..., \alpha^t_m)^T$, the conditional probability of the location fingerprint changes from $P^s(x \mid u; \theta^s)$ to $P^t(x \mid u; \theta^t)$.
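A minimal sketch of how fingerprints could be sampled from the free space model, and how a change of link parameters produces a different fingerprint distribution at the same position. The reference powers, path loss exponents, AP positions, and function names below are hypothetical values chosen for illustration; they are not the parameters of Table 1.

```python
import math
import random

def free_space_rss(u, ap, p_ref, alpha, sigma, d_ref=1.0, rng=random):
    """Sample one RSS value (dBm) at position u for an AP located at ap.
    p_ref: received power at the reference distance d_ref (dBm).
    alpha: path loss exponent. sigma: noise standard deviation."""
    dist = math.dist(u, ap)
    return p_ref - 10.0 * alpha * math.log10(dist / d_ref) + rng.gauss(0.0, sigma)

def fingerprint(u, aps, p_refs, alphas, sigma, rng=random):
    """Stack the RSS values from all APs into one fingerprint vector."""
    return tuple(free_space_rss(u, a, p, al, sigma, rng=rng)
                 for a, p, al in zip(aps, p_refs, alphas))

rng = random.Random(0)
aps = [(-1.0,), (11.0,)]   # two APs on a line (hypothetical layout)
u = (5.0,)                 # position of the mobile device

# Hypothetical link parameters before (source) and after (target) the change.
x_source = fingerprint(u, aps, [-30.0, -30.0], [2.0, 2.0], sigma=2.0, rng=rng)
x_target = fingerprint(u, aps, [-35.0, -35.0], [3.0, 3.0], sigma=2.0, rng=rng)

# Noise-free fingerprint, useful for checking the deterministic part of the model.
x_clean = fingerprint(u, aps, [-30.0, -30.0], [2.0, 2.0], sigma=0.0, rng=rng)
```

Sampling many positions with the source parameters and with the target parameters gives two radio maps whose fingerprint distributions differ even though the positions are the same.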

Multi-Wall Model
The multi-wall model is an extension of the single-slope loss model that includes an additional attenuation term caused by the walls and doors that the direct path between transmitter and receiver passes through [42]. In the multi-wall model, Formula (22) is rewritten as

$$x_i(u) = P_i - 10 \cdot \alpha_i \log_{10} \frac{\| u - a_i \|}{d} - M_w + N_i,$$

where the additional attenuation term can be expressed as

$$M_w = \sum_{j} k_{wj} l_j + \sum_{n=1}^{N_d} \chi_n l_d + \sum_{n=1}^{N_{fd}} \lambda_n l_{fd},$$

where $k_{wj}$ represents the number of penetrated walls of type $j$, and $l_j$ is the corresponding loss of a signal passing through such a wall; $N_d$ and $N_{fd}$ are the numbers of ordinary doors and fire doors on the direct path, respectively; $l_d$ and $l_{fd}$ are the losses of signals passing through ordinary doors and fire doors, respectively; and $\chi_n$ / $\lambda_n$ is a binary variable indicating the state of the n-th ordinary door / fire door.
The multi-wall model considers the attenuation of the signal after passing through the wall, which can be used to simulate the RSS fingerprint under different indoor structures. In the experiment part, the multi-wall model is used to model the transfer learning when the local area's channel parameters change. Compared with the source domain, some wall structures of the target domain are changed.
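The additional attenuation term can be sketched as a sum of per-type wall losses plus the losses of the doors on the direct path. In the sketch below, a door state of 1 is assumed to mean that the door contributes its loss and 0 that it does not; this convention, the function name, and all loss values in dB are assumptions for illustration rather than parameters of the simulations in this paper.

```python
def multi_wall_attenuation(wall_counts, wall_losses,
                           door_states, door_loss,
                           fire_door_states, fire_door_loss):
    """Additional attenuation (dB) along one direct path.

    wall_counts[j]: number of penetrated walls of type j.
    wall_losses[j]: loss (dB) of one wall of type j.
    door_states / fire_door_states: binary state of each door on the path
    (assumed convention: 1 = the door attenuates the signal, 0 = it does not).
    """
    walls = sum(k * l for k, l in zip(wall_counts, wall_losses))
    doors = door_loss * sum(door_states)
    fire_doors = fire_door_loss * sum(fire_door_states)
    return walls + doors + fire_doors

# Hypothetical path: two light walls (3 dB each), one heavy wall (10 dB),
# two ordinary doors of which one attenuates (2 dB), one fire door (5 dB).
extra_loss = multi_wall_attenuation(
    wall_counts=[2, 1], wall_losses=[3.0, 10.0],
    door_states=[1, 0], door_loss=2.0,
    fire_door_states=[1], fire_door_loss=5.0)
```

Adding walls between an AP and a region, as done for the target environment, increases this term only for the paths that cross the new walls, which is why the resulting change in the radio map is local.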

Experiments
To verify the performance of OT-based transfer learning in FL, the two models described in Section 5 are used for numerical simulation, and the performance of the algorithm is also verified with a public data set. First, we use the free space channel model to simulate the transfer scenario of RSS fingerprints when the radio link parameters change. Second, the indoor multi-wall model is used to simulate the transfer learning scenario of RSS fingerprints when the environmental parameters of a local area change. Finally, the performance of the algorithm is further verified by using the public measured data set. One of the performance evaluation indicators used in this paper is the average positioning error (AE), calculated as

$$AE = \frac{1}{n_t} \sum_{j=1}^{n_t} \left\| \hat{u}_j - u_j \right\|_2,$$

where $\hat{u}_j$ and $u_j$, respectively, represent the estimated and real coordinates of the j-th sample in the target domain test set, and $n_t$ represents the total number of test samples. The error values at the 50% and 80% points of the cumulative error distribution are used as additional evaluation indicators.
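These indicators are straightforward to compute from the per-sample positioning errors. The sketch below assumes the empirical cumulative distribution is read as "the smallest error such that at least a fraction q of the test samples have an error no larger than it"; this reading, the function names, and the toy coordinates are assumptions for illustration.

```python
import math

def positioning_errors(estimates, truths):
    """Euclidean distance between each estimated and real coordinate."""
    return [math.dist(e, t) for e, t in zip(estimates, truths)]

def average_error(errors):
    """Average positioning error (AE)."""
    return sum(errors) / len(errors)

def error_at_cdf(errors, q):
    """Smallest error e such that at least a fraction q of samples have error <= e."""
    ordered = sorted(errors)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical test set with errors of 0, 1, 2, 3 and 4 meters.
truths = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
estimates = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (6.0, 3.0), (4.0, 8.0)]

errors = positioning_errors(estimates, truths)
ae = average_error(errors)
e50 = error_at_cdf(errors, 0.5)
e80 = error_at_cdf(errors, 0.8)
```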

Free Space Channel Model RSS FL Transfer Learning Simulation
The simulation was set as a one-dimensional positioning scene with range [0, 10] and 2 APs located at −1 and 11, respectively. Using the model described in Section 5.1, the channel parameters $P_i$ and $\alpha_i$ are changed to simulate the change in the radio link parameters. The parameters of the source domain and target domain are shown in Table 1. In the source domain, the positioning area is divided into 10 grids, and the center of each grid serves as the real label of the position. Ten samples are randomly generated in each grid as the training set. In the target domain, 1000 samples were randomly collected as the test set. The standard deviation of the noise was set to 2 dB. The TCA method and the OT linear map [43] were used to transfer the samples of the above simulation scenarios. The left side of Figure 2 shows the normalized fingerprint samples in the source domain and the target domain. The two axes represent the two features, and the size of each sample point represents the relative value of its real coordinate. Accordingly, the right side of Figure 2 shows the results after the source domain and the target domain are mapped to the latent domain through the TCA transformation. Figure 3 shows the fitted distributions of the two features on the left and the fitted distributions after mapping to the latent domain through the TCA transformation on the right.
It can be observed from Figures 2 and 3 that the TCA transformation fails to match the distributions of the two domains. The variance of the samples is large in the source domain and small in the target domain, and this is still the case after mapping to the latent domain. The objective of TCA is to minimize the mean difference of the mapped samples. As the samples have been normalized, the mean difference between the two domains is already relatively small. Therefore, TCA does not significantly improve it.
As the cost function does not constrain the variance, the difference in variance between the two distributions remains large after TCA transfer. Figure 4 shows the changes in the fingerprint samples in the data domain before and after the OT-based transfer learning. The left part of Figure 4 shows the samples in the fingerprint space from the source domain, the target domain, and the source domain after mapping to the target domain through OT. The middle and the right parts are the fitted distributions of the two features in the different domains.
It can be observed from Figure 4 that the samples mapped from the source domain to the target domain with OT match the sample distribution in the target domain more closely in terms of mean, variance, and contour. OT transports the distribution of the source domain to the distribution of the target domain under the principle of minimum cost, so the two distributions are matched better. According to transfer learning theory, the generalization error bound of transfer learning is related to the distance between the distributions of the two domains: the smaller the distance, the lower the bound [44]. The above simulation shows that the distributions obtained by OT match better than those obtained by TCA. Moreover, Figure 5 shows the average test error as a function of the noise level without transfer learning, with TCA, and with OT. It can be observed from Figure 5 that the transfer performance using OT is the best, and that its advantage gradually decreases as the noise level increases. In contrast, TCA does not improve the positioning performance in the target domain but degrades it. From the fitted distributions in Figures 3 and 4, it is clear that the positioning performance is strongly correlated with the degree of distribution overlap.

Multi-Wall Model of RSS FL Transfer Learning Simulation
We use the indoor multi-wall model to simulate the performance of the transfer learning algorithms when the local propagation environment changes. Figure 6 shows the heat maps of the power obtained by simulating two environments with 6 APs placed at the same relative locations. The two environments have roughly the same layout, with an area of about 50 m × 20 m, but Environment 1 has fewer walls. The simulation data of Environment 1 are taken as the samples in the source domain. Some walls are added in Environment 2 to simulate local changes relative to Environment 1, and its simulation data are taken as the samples in the target domain. In both environments, 7200 fingerprint samples were collected at the same uniformly spaced locations, without noise. It can be observed from Figure 6 that there are some local differences in the received power due to the local layout changes.
Figure 7 shows a heat map of the power mapped from the source domain to the latent domain by the TCA transformation on the left, and from the target domain to the latent domain on the right. The data are normalized before the transformation. It can be observed that the heat map on the right is still locally different from that on the left. Figure 8 shows the power heat map of the source domain on the left, the target domain in the middle, and the source domain after mapping to the target domain using OT on the right. It can be seen that OT makes the heat map mapped from the source domain more similar to that of the target domain. Note, however, that the heat map is the superposition of all APs, so it is not possible to determine conclusively from this figure alone whether the two are similar.
Finally, the relationship between the average test error of the algorithms and the noise level is shown in Figure 9. The curves in Figure 9 follow a trend similar to those in Figure 5. This indicates that OT achieves positive transfer in both models, while the TCA method exhibits negative transfer [40] in both cases.
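To make the source and target construction above more concrete, the following sketch generates RSS maps for one AP under an assumed, simplified multi-wall model and compares two layouts that differ by one added wall. The wall geometry (vertical partitions spanning the full room depth), the values of p0, alpha, and wall_loss, and the function name are all hypothetical; they are not the layouts or parameters behind Figure 6.

```python
import numpy as np

def multiwall_rss(rx, ap, wall_x, p0=-30.0, alpha=2.0, wall_loss=5.0):
    """RSS (dBm) at receiver points rx from one AP under an assumed
    multi-wall model: log-distance path loss plus a fixed loss per wall.

    Walls are idealized as vertical partitions at the x-coordinates in
    wall_x, so a link crosses a wall when the wall lies strictly between
    the AP and the receiver along the x-axis.
    """
    d = np.maximum(np.linalg.norm(rx - ap, axis=1), 1.0)  # clamp at 1 m
    lo = np.minimum(rx[:, 0], ap[0])
    hi = np.maximum(rx[:, 0], ap[0])
    crossed = (wall_x[None, :] > lo[:, None]) & (wall_x[None, :] < hi[:, None])
    return p0 - 10.0 * alpha * np.log10(d) - wall_loss * crossed.sum(axis=1)

# Sampling grid over a 50 m x 20 m area (1 m spacing, cell centers).
gx, gy = np.meshgrid(np.linspace(0.5, 49.5, 50), np.linspace(0.5, 19.5, 20))
rx = np.column_stack([gx.ravel(), gy.ravel()])
ap = np.array([5.0, 10.0])

# Environment 1 (source) has one wall; Environment 2 (target) adds another.
rss_env1 = multiwall_rss(rx, ap, wall_x=np.array([20.0]))
rss_env2 = multiwall_rss(rx, ap, wall_x=np.array([20.0, 35.0]))

extra_loss = rss_env1 - rss_env2   # nonzero only behind the added wall
print("points affected by the added wall:", int((extra_loss > 0).sum()))
```

In this toy layout the two fingerprint maps agree everywhere except behind the added wall, which mirrors the kind of local change that the transfer step has to absorb.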

Measured Data Experiment
In this section, we use a recent publicly available measured data set [45] to test the performance of OT-based transfer learning and typical feature-based transfer learning algorithms in FL. The data set was collected at the same locations in a library over 15 months, using the same device each month. A total of 448 APs were detected in the whole data set. However, because of the long collection period, the number of APs detected differed from month to month, particularly from the 12th month, when a significant change occurred. In this experiment, the first month's data are taken as the sample set of the source domain, with a total of 8640 samples. Every 6 samples collected consecutively at the same location are averaged, which yields 1440 training samples. The data of each of the remaining months are taken as the target domain sample set, which contains 6 data sets per month: one is used as the validation set, containing 576 samples; three are used as test set 1, containing 1728 samples; and the remaining two are used as test set 2, containing 1392 samples. This experiment focuses on the transfer of location fingerprints over time, namely, training the model with the first month's data and then using the transfer learning algorithm to transfer the model to the unlabeled samples of the remaining months.
The validation set is used to choose the hyperparameters. For each transfer learning algorithm, a grid search is used to select the hyperparameters that give the best performance on the validation set. In the transductive setting, test set 1 serves as both the target domain samples and the test samples. In the out-of-sample setting, the transfer learning algorithm uses test set 1 as the target domain samples, but test set 2 is used for testing. In this paper, the classical TCA transfer learning algorithm and its extension BDA are selected as the comparison algorithms. In addition, the performance without transfer learning is included in the comparison (denoted by Without). The other three methods are based on OT: EMDL is the OT method with Laplacian regularization (described in Section 4.2.2), Sinkhorn is the OT method with entropy regularization (described in Section 4.2.1), and Map. OT is the joint estimation method (described in Section 4.2.3).
The average positioning error of months 2 to 15 for the different algorithms is shown in Figure 10, under the transductive setting on the left and the out-of-sample setting on the right. It can be observed from the figure that the traditional transfer learning methods (TCA and BDA) show no significant performance improvement during months 2 to 10 under either setting, whereas under the transductive setting the average error of the OT-based transfer learning is about 10% lower than without any transfer algorithm. All transfer learning methods improve the accuracy after month 10, and the OT-based methods perform even better under both settings. The error values at which the cumulative error distribution function of each algorithm reaches 50% and 80% (denoted by C.5 and C.8, respectively) are shown in Tables 2 and 3. Similar conclusions can be drawn from these results.
The results on the real data set differ slightly from those on the simulated data. Nevertheless, both show the superior performance of the OT-based methods.
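The entropy-regularized OT step behind the Sinkhorn variant compared above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the uniform sample weights, the squared Euclidean cost normalized by its maximum, the parameter name reg (whose scaling need not coincide with the regularization coefficient of Section 4.2.1), and the synthetic data are all assumptions.

```python
import numpy as np

def sinkhorn_plan(xs, xt, reg, n_iter=500):
    """Entropy-regularized OT coupling between two sample sets, computed
    with Sinkhorn's alternating scaling iterations (uniform weights)."""
    n, m = len(xs), len(xt)
    a = np.full(n, 1.0 / n)                      # source sample weights
    b = np.full(m, 1.0 / m)                      # target sample weights
    cost = ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(axis=2)
    cost = cost / cost.max()                     # normalize for stability
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)                   # match column marginals
        u = a / (kernel @ v)                     # match row marginals
    return u[:, None] * kernel * v[None, :]

def barycentric_map(plan, xt):
    """Move each source sample to the plan-weighted mean of the targets."""
    return (plan @ xt) / plan.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
xs = rng.normal(size=(60, 2))                               # labeled source
xt = 0.5 * rng.normal(size=(80, 2)) + np.array([2.0, -1.0])  # unlabeled target

plan = sinkhorn_plan(xs, xt, reg=0.1)
xs_mapped = barycentric_map(plan, xt)
print("column-marginal error:", np.abs(plan.sum(axis=0) - 1.0 / len(xt)).max())
```

In a transductive use of this sketch, a positioning model would be trained on xs_mapped with the source labels and then applied to the target samples themselves; handling unseen target samples needs an additional out-of-sample extension that is not shown here.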

Hyperparameters
In this section, we explain how the hyperparameters of the algorithms are selected and how they affect the average positioning error. The Sinkhorn algorithm has one hyperparameter, λ_e, the entropy regularization coefficient. We empirically select the λ_e that minimizes the validation error, as shown in Figure 11. For space reasons, we only show the result for the 12th month, still using the 1st month's data as the source domain data. It can be seen from the figure that the error is large when λ_e is set to 0 and reaches its minimum when λ_e is set to 20, indicating that moderate entropy regularization improves the performance. The EMDL algorithm has two hyperparameters, λ_l and α, which represent the Laplacian regularization coefficient and the relative importance of the source domain data, respectively. The Map. OT algorithm also has two hyperparameters, λ_γ and λ_T, which represent the regularization coefficient of the transport cost and the regularization coefficient of the mapping function, respectively. Similarly, Figures 12 and 13 show the variation of the validation error for different hyperparameter values in the 12th month.
As can be seen from Figure 12, in the EMDL algorithm both λ_l and α need to be set to larger values to achieve better performance. This indicates that Laplacian regularization improves the performance and that the source domain data are of greater importance. As can be seen from Figure 13, the choice of λ_T in the Map. OT algorithm matters more to the performance than that of λ_γ, but, in general, both have little impact on the mean positioning error.
For these three algorithms, we observe that the optimal hyperparameter values are almost unchanged from month to month; these results are not shown due to limited space. This suggests that the algorithms are robust with respect to hyperparameter selection.
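The selection procedure described in this section amounts to an exhaustive grid search over candidate values, scored on the validation set. A minimal sketch is given below; the function names, the candidate grids, and in particular the toy error surface are placeholders introduced for illustration. In practice, the callback would run the transfer algorithm with the given values and return the validation positioning error.

```python
from itertools import product

def grid_search(param_grid, validation_error):
    """Return the parameter combination with the smallest validation error.

    param_grid maps each parameter name to a list of candidate values;
    validation_error is called with one value per name as keyword arguments.
    """
    names = list(param_grid)
    best_params, best_err = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        err = validation_error(**params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

def toy_validation_error(lam_l, alpha):
    """Placeholder error surface, NOT a result from the paper."""
    return (lam_l - 10.0) ** 2 + (alpha - 0.8) ** 2

# Hypothetical candidate values for two hyperparameters.
grid = {"lam_l": [0.1, 1.0, 10.0, 100.0], "alpha": [0.2, 0.5, 0.8]}
best_params, best_err = grid_search(grid, toy_validation_error)
print(best_params, best_err)
```

Repeating the search for each target month and comparing the selected values is one simple way to check the month-to-month stability noted above.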

Discussion
The problem of transfer learning is closely related to the distance metric between data distributions, and OT is an important tool for studying such distances. The distance derived from OT has many good properties. In this paper, the OT method is applied to transfer learning in FL. The simulated and measured fingerprint data from different fingerprint models show that the OT-based transfer learning method has better transfer performance than the traditional TCA method. We find that this is related to the way the two methods define the distance between empirical distributions. The MMD distance used by TCA is the most commonly used distance metric in transfer learning, but it only describes the mean of the distribution, while the Wasserstein distance used by OT also captures the details of the distribution. The latest transfer learning review article does not cover OT-based methods, which we believe should receive wider attention. This paper analyzes the transfer learning problem in FL and conducts experiments under two different channel models. We find that the traditional method exhibits negative transfer in the experiments, while the OT method achieves positive transfer, indicating that the occurrence of negative transfer depends on the algorithm. The experimental study in this paper raises the following theoretical questions.
1. Under what conditions can location fingerprints be positively transferred?
2. How tight a generalization bound can be reached in the transfer learning of FL?
3. What causes the difference between the results on the simulation model and on the real data?
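The contrast drawn in this discussion between a mean-based discrepancy and the Wasserstein distance can be illustrated with a small numerical sketch. It assumes a linear kernel, for which the squared MMD reduces to the squared distance between sample means, and uses the sorting-based formula for the 1-D Wasserstein-1 distance between equal-size samples. The synthetic data and function names are hypothetical.

```python
import numpy as np

def linear_mmd2(x, y):
    """Squared MMD with a linear kernel: squared distance between means."""
    return float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))

def wasserstein1_1d(x, y):
    """1-D Wasserstein-1 distance between equal-size samples, obtained by
    pairing the sorted values of the two samples."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
source = rng.normal(0.0, 3.0, size=2000)   # same mean, wide spread
target = rng.normal(0.0, 0.5, size=2000)   # same mean, narrow spread

print("linear-kernel MMD^2:", linear_mmd2(source, target))
print("Wasserstein-1      :", wasserstein1_1d(source, target))
```

Under these assumptions the mean-based quantity stays close to zero although the two samples have very different spreads, while the Wasserstein distance remains clearly positive.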