Teacher-Assistant Knowledge Distillation Based Indoor Positioning System

Indoor positioning systems have been of great importance, especially for applications that require the precise location of objects and users. Convolutional neural network-based indoor positioning systems (IPS) have garnered much interest in recent years due to their ability to achieve high positioning accuracy and low positioning error, regardless of signal fluctuation. Nevertheless, a powerful CNN framework comes with a high computational cost. Hence, it is difficult to deploy such a system on a computationally restricted device. Knowledge distillation has been an excellent solution which allows smaller networks to imitate the performance of larger networks. However, problems such as degradation in the student's positioning performance occur when a far more complex CNN is used to train a small CNN, because the small CNN does not have the ability to fully capture the knowledge that has been passed down. In this paper, we implemented the teacher-assistant framework to allow a simple CNN indoor positioning system to closely imitate a superior indoor positioning scheme. The framework involves transferring knowledge from a large pre-trained network to a small network by passing through an intermediate network. Based on our observation, the positioning error of a small network can be reduced by up to 38.79% by implementing the teacher-assistant knowledge distillation framework, while a typical knowledge distillation framework can only reduce the error by 30.18%.


Introduction
Over the last decade, mobile technologies, such as smartphones, have become a necessity of everyday life, considering their applications are getting progressively diverse. Following the evolution of mobile devices, various software and programs are made available on these devices. Location-based services (LBS) are the center of much research because plenty of applications, including navigation, traffic updates, asset tracking and safety-related services, require the availability of geographical positioning data to add value to their services [1]. According to Statista [2], the number of mobile users using LBS in the United States is around 150 million. Existing LBS are mainly centered on the outdoor environment. However, indoor LBS have shown immense business potential, since the technology has been adapted to numerous systems, for example, a real-time locating system to guide visually impaired people [3] and a monitoring system to keep track of an isolated COVID-19 patient [4]. Based on a report available at the Research and Market [5] site, it is stated that the value of the global indoor positioning and navigation market is $6.92 billion in 2020, and by 2025, its worth is expected to shoot up to $23.6 billion.
For outdoor environments, Global Navigation Satellite Systems (GNSS), such as the Global Positioning System (GPS), have long been relied upon and widely used thanks to their ability to maintain excellent performance [6]. Unfortunately, it is difficult to implement such systems indoors because they demand line-of-sight (LoS) between the satellites and the receiver [7]. Since an indoor space contains obstacles and external walls, which lead to poor coverage of satellite signals [6], GPS is unsuitable for indoor location-based services [7]. Therefore, other techniques are preferred for indoor settings.
Earlier on, indoor positioning systems (IPS) typically adopted geometry-based algorithms, such as triangulation [8] and trilateration [9], for which the location of the base station and LoS wireless signals are required for the system to perform effectively. Unfortunately, issues with this technique emerged, since information on the location of the base station was unavailable in some instances, and it is challenging to conduct LoS transmission in an indoor environment due to the presence of objects in that indoor environment. Hence, a different scheme is preferable in complex indoor environments, and in recent years, the fingerprint approach managed to bring about satisfactory localization accuracy, because even though the number of reference points in unit space affects this technique [10], it does not require the knowledge of the exact access point (AP) locations or the angle and distance measurements [11].
A prevalent technology used for indoor positioning is Wi-Fi fingerprinting. The main reason Wi-Fi fingerprinting has become a great candidate for indoor positioning is the ubiquity of Wi-Fi access points to support wireless Internet connectivity [12]. In general, fingerprint-based IPS operates in two phases. The first phase, known as the offline phase, is performed by measuring the received signal strength indicator (RSSI) of the APs in the environment at known locations, which are regarded as the reference points (RPs). Signal collection from various RPs is conducted to build a fingerprint database. Subsequently, in the online phase, real-time RSSIs are measured by mobile devices at an unknown position. The device's location can be predicted by computing the similarity between the real-time RSSI and the set of fingerprints stored in the database by applying statistical modeling or machine learning algorithms [10].
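The two phases can be illustrated with a minimal sketch. It assumes a plain nearest-neighbour match in signal space as the similarity measure, which is only one of the statistical or machine learning options mentioned above; the function name, the reference-point labels and the RSSI values are made up for illustration.

```python
import numpy as np

def nearest_fingerprint(db_rssi, db_labels, query_rssi):
    """Return the label of the stored fingerprint closest to the query (1-NN)."""
    db_rssi = np.asarray(db_rssi, dtype=float)
    query_rssi = np.asarray(query_rssi, dtype=float)
    # Euclidean distance in signal space between the query and every fingerprint.
    distances = np.linalg.norm(db_rssi - query_rssi, axis=1)
    return db_labels[int(np.argmin(distances))]

# Offline phase: a toy database of 3 reference points x 4 APs (RSSI in dBm, made up).
fingerprints = [[-40, -70, -80, -90],
                [-75, -45, -60, -85],
                [-90, -80, -50, -42]]
rp_labels = ["RP1", "RP2", "RP3"]

# Online phase: a real-time measurement taken at an unknown position.
print(nearest_fingerprint(fingerprints, rp_labels, [-73, -48, -63, -88]))
```

The CNN-based schemes discussed later replace this distance computation with a learned classifier, but the offline/online structure stays the same.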
It was observed that there had been a tremendous leap in performance for natural language processing [13], speech recognition [14] and image processing [15] using deep learning, and this has initiated the implementation of deep learning algorithms in an indoor positioning system. The convolutional neural network (CNN) has recently been a favourite choice because the convolution operation allows the positioning system to study the entire topology of an RSSI image to provide a high-accuracy positioning [16]. The hurdle of a CNN-based IPS is its high computational requirement, which causes difficulties in establishing the IPS model on inexpensive devices with limited resources. The architecture of the model affects the computational requirements. By substantially reducing the size of the CNN-IPS model, the computational need can be minimized at the expense of positioning accuracy.
The model compression can be applied to a CNN-based IPS as a strategy to overcome this issue. A common way to apply model compression is by using knowledge distillation (KD). However, it can be seen that with an increase in the complexity of the teacher model, a lightweight student model cannot perfectly imitate the teacher model's performance because it does not have sufficient capacity to mimic the high-complexity teacher model [17]. Additionally, the logits generated by the cumbersome teacher model are less soft since the certainty of associating the RSSI fingerprint images to their location classes is greater [17]. To overcome this issue, the teacher model may transfer its knowledge to another model with closer complexity. Following that, the model introduced can transfer the informative knowledge it had gained to the simple student model, as presented in [17]. The authors emphasized that adding an additional distillation step can bring finer students. By establishing multiple teacher assistants, they have also demonstrated that the best distillation path to enhance the learning experience occurs when the process passes through all intermediate teacher-assistants. Since such benefits can be observed from the teacher-assistant knowledge distillation (TAKD), the framework may boost the positioning accuracy of a simple CNN-IPS. However, there are still some unanswered problems concerning the possibility of TAKD successfully assisting the transfer of knowledge, mainly the relationship between the RSSI and location classes from the teacher model to the student model, in order to reduce the positioning error of the system. This work presents two CNN-based IPS with different complexity, taking RSSIs and converting them into a 2D fingerprint image. The more complex model has a higher number of convolutional layers, filters and hidden layers. Subsequently, several intermediate CNN algorithms are developed to introduce a TAKD framework in this work. 
The TAKD framework is able to distill the useful information from a teacher model to the student model through an intermediary model known as the teacher-assistant model. The main contributions are as follows:
1. Two CNN-based IPS with different complexities are generated to prove that the model's complexity will affect the performance of the positioning system. Then, knowledge distillation is performed so that the simple model can mimic the performance of the larger model.
2. Two CNN-based algorithms are developed as the teacher-assistant models. Then, we propose a TAKD framework to distill knowledge from the pre-trained teacher model to the teacher-assistant model and lastly, to the student model. Ultimately, we investigate the benefit of employing the proposed technique by comparing the performance of IPSs that utilize the TAKD framework, the baseline KD framework and the performance of CNN-based IPS.
The remainder of the paper is structured as follows. Section 2 reviews the related work done on the indoor positioning system. Section 3 provides details regarding the existing work that will be used as a benchmark for this work. The proposed TAKD CNN-based IPS is presented in Section 4. In Section 5, a thorough comparison of the proposed and existing techniques has been made to emphasize the benefit of the proposed framework. Lastly, we conclude this work in Section 6.

Related Works
Numerous works have been done on the fingerprint-based indoor positioning system, and a few methods described in this paper are summarized in Table 1. The classical approach to estimating a user's location is by implementing the K-nearest neighbors (k-NN). The first k-NN-based IPS was introduced through the development of RADAR [18]. The system utilizes radio-frequency wireless networks to locate and track users inside of a building. Since then, several other works [19,20] have appeared to develop IPS based on k-NN. The IPS in [19] implements 1-NN, whereby the number of nearest neighbors is set to 1 using the UJIIndoorLoc dataset, which covers a real multi-building and multi-floor scenario. From there, they obtained an average positioning error of 7.9 m with 89.92% success rate, which has become the standard for other IPS. Other machine learning algorithms, such as decision tree [21] and random forest [22], were introduced to enhance the performance of IPS. Through [21], it is found that the computational efficiency of the decision tree-based IPS is superior to that of a simple 1-NN algorithm. However, in terms of accuracy, the results of 1-NN were similar to or better than that of the decision tree-based IPS.
Although machine learning provides satisfactory localization output, it requires time-consuming parameter tuning, especially for larger buildings which contain numerous data. Additionally, it lacks the ability to thoroughly learn reliable features from the training data used to map the RSSIs to their locations. If a higher training accuracy is needed, the algorithm must be able to process the training data extensively. Deep learning algorithms are known to be able to extract complex structure from data, as they are designed based on the hierarchical structure of the human brain [23]. Thus, many works have adopted deep learning algorithms for location prediction. The works in [24][25][26] incorporated deep neural networks (DNN) to counter feature extraction issues. In [24], the authors designed a DNN structure pre-trained by Stacked Denoising Autoencoder (SDA) for indoor positioning with Wi-Fi fingerprinting to tackle the fluctuating wireless signals. Hidden Markov Model (HMM) was used to improve the accuracy of the DNN model. The work in [25] also devised a DNN-based positioning scheme with Wi-Fi fingerprints. The main aim of the scheme was to achieve accurate indoor positioning with a reduced workforce. [...] shadowing effect.
Due to the fast pace of technological advancement, a more accurate IPS is desired; hence, several works [27,28] have started employing convolutional neural network (CNN) algorithms in IPS. The empirical results acquired by [16] proved that CNN has superior positioning capability, as their proposed CNN framework shows a lower positioning error than DNN-based IPS. In order to enable CNN-based positioning to be smoothly deployed in mobile applications, ref. [29] introduced knowledge distillation into the CNN-based positioning, resulting in KD-CNN-IPS, and demonstrated the relevance of the proposed framework. When a lightweight student model learns from a larger model, it can improve its positioning performance and provide a better execution time than the larger model. Despite the adequate performance, the architectures of the complex teacher and student models do not differ significantly. Therefore, it is understandable why the student model was able to train well under the supervision of the teacher model. However, in cases where the positioning system takes in readings from a much higher number of APs, the complexity of the CNN algorithm would have to rise. As a result, a larger complexity gap between the teacher and the student model is introduced. When this happens, the student model might not be able to learn competently from the teacher model, and we could anticipate a possibly inferior positioning performance. Hence, a more suitable framework must be studied to address this.
Table 1 (fragment). KD-CNN-IPS achieves better accuracy and average error than CNN-IPS. Positioning error as low as 1.5 m is achieved. Limitation: poor positioning performance when the complexity gap between the teacher and student models is large.

Existing Methods
For this work, we benchmarked our proposed method against a basic CNN-based IPS and the recent (KD-CNN-IPS) [29] framework. Thus, a detailed explanation regarding the architecture of the CNN-based IPS is presented in Section 3.1 and the working principle of the KD-CNN-IPS is provided in Section 3.2.

CNN-Based IPS
The core idea of CNN-based IPS is to map $N$ samples of the input RSSIs $\{r^n \mid n = 1, 2, \cdots, N\}$ to their corresponding location labels $\{y^n \mid n = 1, 2, \cdots, N\}$. Consider an IPS with $M$ location classes and $g_m$ samples in the $m$th location class. The total number of samples in the dataset can be expressed as $N = \sum_{m=1}^{M} g_m$. The CNN-IPS architecture used in this work mainly comprises an input layer and several convolutional layers, followed by a max pooling layer, a flatten layer and a dense layer. Figure 1 illustrates the architecture of the complex model. Initially, the CNN-IPS takes in a series of one-dimensional (1D) RSSI vectors $r^n = [r^n_1, r^n_2, \cdots, r^n_K]$, where $r^n_k$ is the element of the vector for $k \in [1, K]$ and $K$ is the total number of APs available. After obtaining the 1D RSSI vectors, the vectors are transformed into a square fingerprint image $X^n$ of size $Q_1 \times Q_1$. To ensure that it is possible to reshape the vector into a square fingerprint image, the number of elements in $r^n$ needs to be the square of an integer. Therefore, if $K \neq c^2$, where $c$ is an integer, $r^n$ is zero padded before the image is processed by the convolutional layers and, subsequently, the dense layer. At the final layer of the dense network, a softmax activation function is implemented to calculate the probability for each of the location classes. The activation function is represented by the following equation:

$$\sigma(x_j) = \frac{\exp(x_j)}{\sum_{l=1}^{L} \exp(x_l)}, \qquad (1)$$

where $x_j$ is the logit of the $j$th neuron, $l = 1, 2, \ldots, L$ and $L$ is the total number of neurons in the dense layer. In this case, $L$ equals $M$, since the number of neurons at the last fully connected layer is identical to the number of location classes. Hence, a predicted output vector $\hat{y}^n = [\hat{y}^n_1, \hat{y}^n_2, \cdots, \hat{y}^n_M]$ of size $1 \times M$ is generated.
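The reshaping and softmax steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' Keras implementation; the function names are hypothetical, and appending the zero padding at the end of the vector is an assumption, since the text only states that the vector is zero padded.

```python
import math
import numpy as np

def rssi_to_image(r):
    """Zero pad a 1D RSSI vector to the next perfect square and reshape it to c x c."""
    r = np.asarray(r, dtype=float)
    k = r.size
    c = math.isqrt(k)
    if c * c < k:          # K is not a perfect square, so one more row/column is needed
        c += 1
    padded = np.zeros(c * c)
    padded[:k] = r         # assumption: padding zeros are appended at the end
    return padded.reshape(c, c)

def softmax(x):
    """Softmax over a vector of logits (shifted by the maximum for numerical stability)."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

image = rssi_to_image(np.ones(162))   # 162 APs, as in the dataset used later
print(image.shape)                    # (13, 13)
print(softmax([1.0, 2.0, 3.0]).sum())
```

With 162 access points the nearest perfect square is 169, so seven zero entries are added before the reshape.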

During training, loss functions are crucial for model optimization, and in theory, a perfect model would have a loss of 0. For classification models, the cross-entropy cost function is typically used for training, and it can be computed using the following:

$$L_{CE}(z^n, y^n) = H(y^n, \sigma(z^n)), \qquad (2)$$

where $H(\psi, \xi) = -\sum_{k=1}^{M} \psi_k \log(\xi_k)$ represents the cross-entropy loss function, $\sigma(\cdot)$ denotes the softmax activation of the final dense layer and $z^n = [z^n_1, z^n_2, \cdots, z^n_M]$ is the logits vector.
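As a small worked example of the cross-entropy $H(\psi, \xi)$, the sketch below evaluates it for a one-hot label and a predicted probability vector. The function name and the small constant `eps` (added to avoid taking the logarithm of zero) are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(psi, xi, eps=1e-12):
    """H(psi, xi) = -sum_k psi_k * log(xi_k) for two probability vectors."""
    psi = np.asarray(psi, dtype=float)
    xi = np.asarray(xi, dtype=float)
    return float(-np.sum(psi * np.log(xi + eps)))

y_true = [0.0, 1.0, 0.0]      # one-hot label: the sample belongs to class 2 of 3
y_pred = [0.2, 0.7, 0.1]      # predicted class probabilities (made up)
print(cross_entropy(y_true, y_pred))   # equals -log(0.7) up to eps
```

Only the probability assigned to the true class contributes, so the loss approaches 0 as that probability approaches 1.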

Knowledge Distillation CNN-IPS (KD-CNN-IPS)
In order to comprehend the working principle of our proposed teacher-assistant knowledge distillation-based indoor positioning system (TAKD-CNN-IPS), it is crucial to understand the concept and operational steps of knowledge distillation, particularly response-based knowledge distillation. As mentioned in Section 2, knowledge distillation is implemented to enable a lightweight model to achieve an almost identical output to a cumbersome model by training the smaller model (student model) under the supervision of a larger, pre-trained model (teacher model) [30,31]. Soft labels generated by the teacher model are used in the training of the student model. During the training of the teacher model, the fingerprint images are mapped to a vector of logits $z = [z_1, z_2, \cdots, z_M]$ produced by the final dense layer. For the teacher model to generate soft labels, the temperature-scaled softmax activation function is applied on the logits:

$$\rho_j = \frac{\exp(z_j / T)}{\sum_{m=1}^{M} \exp(z_m / T)}, \qquad (3)$$

where $\rho_j$ is the soft label for the $j$th location class and $T \geq 1$ denotes the temperature parameter.
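The effect of the temperature can be seen in a short sketch. The function name and the logit values are illustrative assumptions; $T = 2$ is used for the softened case.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Temperature-scaled softmax: larger T gives a softer distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())    # shift by the maximum for numerical stability
    return e / e.sum()

logits = [4.0, 1.0, 0.0]       # made-up logits for three location classes
hard = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=2.0)
print(hard)
print(soft)
```

Both outputs sum to one, but the $T = 2$ distribution assigns more probability to the non-maximal classes, which is the extra information a student can learn from.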
For knowledge distillation to accomplish its task during the training of the student model, the loss function of the student model $L_{S\text{-}KD}$ can be calculated using the following equation [30], so that the outputs of the student model match the soft targets of the teacher model:

$$L_{S\text{-}KD} = \alpha L_{CE}(z^n_s, y^n) + \beta L_{KD}(z^n_s, z^n_t). \qquad (4)$$

Unlike the CNN-based IPS, the loss function used to train the student model consists of both the student loss function $L_{CE}(z^n_s, y^n)$ and the distillation loss function $L_{KD}(z^n_s, z^n_t)$, where $z^n_s$ and $z^n_t$ represent the output logit vectors of the student and teacher model for the $n$th input sample, respectively. The student loss is the cross-entropy loss between the student predictions and the ground truth. The weight of the student loss function is $\alpha$, where $\alpha \in [0, 1]$. On the other hand, $\beta = 1 - \alpha$ signifies the weight of the distillation loss. Mathematically, the distillation loss function can be formulated as:

$$L_{KD}(z^n_s, z^n_t) = D_{KL}(\rho^n_t, \rho^n_s), \qquad (5)$$

where $\rho^n_t$ and $\rho^n_s$ are the soft labels obtained by applying the temperature-scaled softmax to $z^n_t$ and $z^n_s$, and $D_{KL}(\psi, \xi)$ is the Kullback-Leibler divergence, which can be expressed as:

$$D_{KL}(\psi, \xi) = \sum_{k=1}^{M} \psi_k \log\frac{\psi_k}{\xi_k}. \qquad (6)$$
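A minimal NumPy sketch of the combined student loss is given below. It assumes the teacher's softened output is the first argument of the KL divergence and omits the $T^2$ scaling factor that some knowledge distillation formulations apply to the distillation term; all function names and the example logits are hypothetical.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(psi, xi, eps=1e-12):
    return float(-np.sum(np.asarray(psi, dtype=float) * np.log(np.asarray(xi) + eps)))

def kl_divergence(psi, xi, eps=1e-12):
    """D_KL(psi, xi) = sum_k psi_k * log(psi_k / xi_k)."""
    psi = np.asarray(psi, dtype=float)
    xi = np.asarray(xi, dtype=float)
    return float(np.sum(psi * np.log((psi + eps) / (xi + eps))))

def student_kd_loss(z_s, z_t, y, alpha=0.1, T=2.0):
    """alpha * student loss + (1 - alpha) * distillation loss for one sample."""
    beta = 1.0 - alpha
    student_loss = cross_entropy(y, softmax(z_s))                      # hard label, T = 1
    distill_loss = kl_divergence(softmax(z_t, T), softmax(z_s, T))     # soft labels
    return alpha * student_loss + beta * distill_loss

z_student = [1.0, 0.5, 0.2]    # made-up student logits
z_teacher = [3.0, 0.5, -1.0]   # made-up teacher logits
y_onehot = [1.0, 0.0, 0.0]
print(student_kd_loss(z_student, z_teacher, y_onehot, alpha=0.1, T=2.0))
```

Setting `alpha=1.0` recovers plain cross-entropy training, while `alpha=0.0` trains the student on the teacher's soft labels alone.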

Teacher-Assistant Knowledge Distillation Based Indoor Positioning System
A lightweight CNN-IPS model is not able to provide adequate positioning performance. To improve the accuracy of a model, more layers and parameters are often introduced, which results in a more complex model. Unfortunately, a complex model is unsuitable for deployment on computationally restricted devices. It has been established that knowledge distillation is able to improve the accuracy of the lightweight model by allowing the model to train using both the ground truth label and the complex model's softened output. However, due to the low capacity of the student model, only a limited performance gain can be observed. Based on what has been presented, the introduction of a teacher-assistant model might be able to rectify this problem. It is essential to ensure that the size and capacity of the teacher-assistant model lie between those of the teacher and student models, so that the introduced model has enough capacity to learn from the teacher while not being too big for the student model to learn from. In order to evaluate the effectiveness of the proposed scheme, this work adopted the Alcala Tutorial 2017 dataset [32], which contains RSSI information collected at the corridor of the School of Engineering of the University of Alcala. The dataset allows a comprehensive investigation to be conducted, as the performance of the proposed scheme is tested against the performance of the simple CNN-IPS model, the CNN-IPS model that only uses the baseline knowledge distillation scheme, and the complex CNN-IPS model. The simulation results should show that the proposed scheme's execution time is similar to that of the simple CNN-IPS model while also having a better positioning performance than the simple CNN-IPS and the KD-CNN-IPS.
Understanding the KD-CNN-IPS framework in Section 3.2 makes it easier to grasp the fundamental concept of the proposed TAKD-CNN-IPS since the training process is similar to a conventional knowledge distillation. The development of the proposed indoor positioning system was actually inspired by the work presented in [17]. Figure 2 depicts the general block diagram for the proposed TAKD-CNN-IPS. As can be seen from Figure 2, extra intermediate networks are designed to act as teacher-assistant networks. The role of the teacher-assistant networks is to bridge the gap between the teacher network and the student network since the complexity level of the introduced network is in between the complexity of the teacher and the student network.
The initial training step of the TAKD-CNN-IPS is similar to those of the CNN-IPS and KD-CNN-IPS, where the 1D RSSI input vector must first be transformed into a 2D fingerprint image. Then, the teacher network will produce soft labels by using Equation (3). Unlike the KD-CNN-IPS, knowledge from the teacher model will first be transferred to the teacher-assistant model. Therefore, the teacher-assistant model will be trained using the following loss function:

$$L_{TA\text{-}KD} = \alpha L_{CE}(z^n_{ta}, y^n) + \beta L_{KD}(z^n_{ta}, z^n_t), \qquad (7)$$

where $L_{CE}(z^n_{ta}, y^n)$ represents the cross-entropy loss between the teacher-assistant prediction and the ground truth, while $L_{KD}(z^n_{ta}, z^n_t)$ is the distillation loss.
Note that the distillation loss is computed using the logits of the teacher-assistant model $z^n_{ta}$ and the logits of the teacher model $z^n_t$. In the case of TAKD-CNN-IPS with $U$ teacher-assistant networks, (7) will be applied to train the first teacher-assistant model and then the subsequent teacher-assistant models will adopt the following loss function:

$$L_{TA_u\text{-}KD} = \alpha L_{CE}(z^n_{ta_u}, y^n) + \beta L_{KD}(z^n_{ta_u}, z^n_{ta_{u-1}}), \qquad (8)$$

where $u = 2, \ldots, U$, and $L_{CE}(z^n_{ta_u}, y^n)$ and $L_{KD}(z^n_{ta_u}, z^n_{ta_{u-1}})$ represent the $u$th teacher-assistant loss and distillation loss, respectively. Unlike the first teacher-assistant model, the distillation loss for an intermediate teacher assistant is calculated using the logits of the $u$th teacher-assistant model $z^n_{ta_u}$ and the logits of the $(u-1)$th teacher-assistant model $z^n_{ta_{u-1}}$. Once the training process for the last teacher-assistant model (the $U$th teacher-assistant model) completes, the student model will finally be trained by the $U$th teacher-assistant model using the following loss function:

$$L_{S\text{-}TAKD} = \alpha L_{CE}(z^n_s, y^n) + \beta L_{KD}(z^n_s, z^n_{ta_U}), \qquad (9)$$

where $z^n_s$ and $z^n_{ta_U}$ signify the logits of the student and the $U$th teacher-assistant model, respectively. Algorithm 1 summarizes the overall localization process of the proposed TAKD-CNN-IPS.

Algorithm 1. Localization process of the proposed TAKD-CNN-IPS.
[...]
for n ≤ N do
    Train the teacher model
    Employ X^n as the input of the teacher model.
    Apply (3) to calculate the soft labels ρ_T for the teacher network.
    Train the teacher-assistant model
    for u ≤ U do
        Employ X^n as the input of the teacher-assistant model.
        Apply (1) to generate the uth teacher assistant's hard predictions.
        Apply (3) to generate the uth teacher assistant's soft prediction ρ_TA_u.
        Execute (2) to calculate the teacher-assistant cross-entropy loss.
        if u = 1 then
            Execute (5) to calculate the distillation loss.
            Apply the loss function from (7) [...]
    [...]
    Apply X^n as the input of the student model.
    [...]
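The order in which the models supervise one another can be sketched independently of any deep learning framework. The helper below only builds the distillation path (teacher, then each teacher assistant in turn, then student) and hands each pair to a user-supplied `distill` callback; the function names and the model labels "TA1" and "TA2" are hypothetical.

```python
from typing import Callable, List, Tuple

def takd_path(teacher: str, assistants: List[str], student: str) -> List[Tuple[str, str]]:
    """Return (supervisor, learner) pairs in the order in which they are trained."""
    chain = [teacher, *assistants, student]
    return list(zip(chain[:-1], chain[1:]))

def run_takd(teacher: str, assistants: List[str], student: str,
             distill: Callable[[str, str], None]) -> None:
    """Call distill(supervisor, learner) once for every step of the path."""
    for supervisor, learner in takd_path(teacher, assistants, student):
        distill(supervisor, learner)

# Two teacher assistants: the knowledge passes through both before reaching the student.
run_takd("TM", ["TA1", "TA2"], "SM",
         distill=lambda sup, lrn: print(f"distill {sup} -> {lrn}"))
```

With an empty list of assistants the path collapses to a single teacher-to-student step, which corresponds to the baseline knowledge distillation scheme.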

Results and Analysis
In this section, the proposed scheme's performance is thoroughly assessed and compared with the performance of the teacher, student and the baseline knowledge distillation scheme. To examine the proposed technique's applicability, this work utilizes the publicly available Alcala Tutorial 2017 dataset [32], which was collected in the School of Engineering of the University of Alcala when they organized the 2017 Fingerprinting-based Indoor Positioning tutorial. This dataset comprises RSSI readings from a single floor in one building. The number of reference points available in the Alcala Tutorial 2017 dataset is 110, which we labeled as coordinate, while RSSI measurements of every reference point in the database were taken from 162 access points.

Simulation Setting
The simulations were carried out using Python 3.7.12 and we adopted the Keras framework to model the various deep learning-based IPS schemes considered. During the simulations, the 1512 sample points present in the database were split into 90% training data and 10% testing data. As for the networks, four CNN-IPS models with different numbers of convolutional layers, ranging from one to eight, were created. The configurations of all the models considered in this work are shown in Table 2. Referring to Table 2, the size of the model indicates the number of convolutional layers in that particular model. The model with the highest complexity and most convolutional layers is appointed as the teacher model and is known as CNN-IPS (TM) in this work. Additionally, a student model was designed with the fewest convolutional layers, termed CNN-IPS (SM). Other than that, a few model paths were generated so that the performance of the proposed technique, as well as the baseline technique, could be analyzed. We included the KD-CNN-IPS scheme to verify that the proposed technique performs better than the baseline technique. Since the proposed approach incorporates teacher-assistant networks, we also consider three different models based on the proposed scheme, which are abbreviated as TAKD-CNN-IPS (M1), TAKD-CNN-IPS (M2) and TAKD-CNN-IPS (M3). Table 3 describes the model paths and the hyperparameters of each technique. It is important to note that, unlike the other two TAKD-CNN-IPS techniques, TAKD-CNN-IPS (M3) has two teacher assistants, and it is trained from the model with the highest complexity to the lowest complexity.
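A 90/10 split of the 1512 samples can be sketched as below. The sketch treats the smaller portion as the held-out set, rounds its size up and uses a fixed random seed; all three choices, as well as the function name, are assumptions rather than details given in the text.

```python
import math
import numpy as np

def split_indices(n_samples, held_out_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into a large and a small (held-out) part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_held_out = math.ceil(held_out_fraction * n_samples)   # assumption: round up
    return order[n_held_out:], order[:n_held_out]

large_idx, small_idx = split_indices(1512)
print(len(large_idx), len(small_idx))   # 1360 152
```

Under these assumptions the smaller portion holds 152 samples, so each correctly classified sample in it changes the accuracy by about 0.66 percentage points (for example, 112/152 is approximately 73.68%).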

Results and Discussion
Before the performance of the proposed schemes can be discussed, it is crucial to understand the impacts of the α and T hyperparameters. Generally, knowledge distillation trains the student model by utilizing both the true label and studying how the bigger network represents and manipulates data. In order to allow the student network to gain more information from the larger network, the temperature scaling hyperparameter T is introduced to soften the peaky probability distribution of the trained larger network. It is done mostly because soft probabilities from trained networks disclose more information about the data, mainly on the relationship between the RSSI and location classes than the actual label. A higher value of T will generate a softer distribution. Hence, T larger than one is able to improve the performance of the student model. In this work, we have selected T = 2. As mentioned in Section 4, α is the weight of the student loss, while β is the weight of distillation loss used in the objective function to train the distilled model. Thus, by varying α, users can control the importance of the student loss and the amount of information being distilled from the larger model since β will automatically change as a result. More explicitly, a higher value of α will result in a higher contribution from student loss and a lower contribution from distillation loss, while the opposite will happen with a lower value of α. In order to illustrate how the α hyperparameter affects the positioning performance of the proposed IPS, the performance gain index has been introduced. The performance gain index P is calculated using the following equation to demonstrate the amount of improvement attained by the knowledge-distilled schemes against the student model CNN-IPS (SM).
$$P = \frac{E_{SM} - E_{KD}}{E_{SM}} \times 100\%, \qquad (10)$$

where $E_{SM}$ and $E_{KD}$ represent the average positioning errors of the CNN-IPS (SM) and knowledge-distilled schemes, respectively. The graph of the performance gain index against α with T = 2 is displayed in Figure 3. From the figure, it is observed that all the positioning schemes considered exhibit the best performance at α = 0.1. This result indicates that the positioning schemes attain better improvement when the contribution from distillation loss is larger. Besides that, it is also noteworthy that the performance gain index at α = 1 is zero. This is because at α = 1, the models are trained solely by the student loss. Since all the student models in the knowledge-distilled schemes have the same architecture as CNN-IPS (SM), their average positioning errors will be identical to that of the CNN-IPS (SM).
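The performance gain index can be computed with a one-line helper. The sketch assumes that the index is the relative reduction in average positioning error with respect to CNN-IPS (SM), expressed in percent, which is consistent with the index being zero when both errors are equal; the function name and the error values are made up.

```python
def performance_gain(e_sm, e_kd):
    """Relative reduction (in percent) of the average positioning error versus CNN-IPS (SM)."""
    if e_sm <= 0:
        raise ValueError("the baseline error must be positive")
    return (e_sm - e_kd) / e_sm * 100.0

print(performance_gain(2.0, 2.0))   # 0.0: same error as the baseline, no gain
print(performance_gain(2.0, 1.0))   # 50.0: the error is halved
```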
The results of CNN-IPS (TM) and CNN-IPS (SM) are presented in Table 4. The average positioning error for $D$ test samples is measured using the Euclidean distance between the predicted location $(\hat{x}_d, \hat{y}_d)$ and the ground truth $(x_d, y_d)$, which can be calculated using the following formula:

$$E = \frac{1}{D} \sum_{d=1}^{D} \sqrt{(\hat{x}_d - x_d)^2 + (\hat{y}_d - y_d)^2}.$$

The results are displayed as a benchmark for the proposed scheme. From the table, CNN-IPS (TM) outperforms CNN-IPS (SM) as expected in training and testing accuracy. More specifically, the training accuracy and testing accuracy of CNN-IPS (TM) exceed those of CNN-IPS (SM) by 0.37% and 25%, respectively. Additionally, a similar trend is also observed for the loss, where there is an improvement of 58.36% and 29.02% when we compare CNN-IPS (TM) to CNN-IPS (SM) in the training and testing phases, respectively. It is also noteworthy that the average positioning error of CNN-IPS (TM) is only 35.9% of that of CNN-IPS (SM). More specifically, there is a performance gap of 1.4288 m between the average positioning error of CNN-IPS (TM) and CNN-IPS (SM). Comparing the performance of CNN-IPS (TM) and CNN-IPS (SM) gives a rough idea of how the size of a network affects the localization performance of CNN-IPS.
Next, we will investigate how these knowledge distillation schemes improve the estimation of the CNN-IPS. The training accuracies of all techniques are quite close to each other. In fact, all the methods can achieve a training accuracy of approximately 98%. However, the testing accuracy of the models ranges from 48.68% to 73.68%. Therefore, the testing accuracy clearly indicates which techniques give the greatest and the worst positioning performance. The training accuracy, as well as the average positioning error, of each technique are presented in [...].
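The average positioning error used throughout this section, the mean Euclidean distance between predicted and true coordinates, can be sketched as follows. The function name and the coordinates are made up for illustration.

```python
import numpy as np

def average_positioning_error(pred_xy, true_xy):
    """Mean Euclidean distance between predicted and true 2D locations."""
    pred = np.asarray(pred_xy, dtype=float)
    true = np.asarray(true_xy, dtype=float)
    return float(np.mean(np.linalg.norm(pred - true, axis=1)))

predicted = [[0.0, 0.0], [3.0, 4.0]]      # predicted (x, y) per test sample, in metres
ground_truth = [[0.0, 0.0], [0.0, 0.0]]   # true (x, y) per test sample, in metres
print(average_positioning_error(predicted, ground_truth))   # 2.5
```

A correctly classified sample contributes zero distance, so two schemes with the same testing accuracy can still differ in this metric, depending on how far their misclassified samples land from the true location.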
Additionally, the proposed TAKD-CNN-IPS scheme with multiple teacher assistants attains a higher testing accuracy than the scheme with a single teacher assistant, because multiple teacher assistants are better able to pass helpful knowledge from CNN-IPS (TM) to the student model. Interestingly, although the testing accuracies of TAKD-CNN-IPS (M1) and KD-CNN-IPS are identical, as indicated in Figure 5, TAKD-CNN-IPS (M1) exhibits a performance gain of 9.82% over KD-CNN-IPS in terms of average positioning error. This is because, for the test samples misclassified by TAKD-CNN-IPS (M1), the Euclidean distances between the actual and predicted locations are much smaller than those for the test samples that KD-CNN-IPS predicts incorrectly.
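The observation that two schemes can share the same testing accuracy yet differ in average positioning error can be reproduced with a toy example. The sketch below assumes that each class corresponds to a reference point with known coordinates; the reference-point layout, the labels and the names `ref_points`, `accuracy_and_error`, `model_a` and `model_b` are all hypothetical and do not reproduce the experimental data.

```python
import numpy as np

# Hypothetical reference points: class index -> (x, y) coordinates in metres.
ref_points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])

def accuracy_and_error(pred_labels, true_labels):
    """Return (classification accuracy, average positioning error in metres)."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    accuracy = np.mean(pred_labels == true_labels)
    # Map class labels to coordinates, then measure the distance per sample.
    distances = np.linalg.norm(ref_points[pred_labels] - ref_points[true_labels], axis=1)
    return accuracy, distances.mean()

true_labels = [0, 1, 2, 3]
model_a = [0, 1, 2, 4]  # one miss, landing on the adjacent reference point (1 m away)
model_b = [0, 1, 2, 0]  # one miss, landing on a distant reference point (3 m away)

acc_a, err_a = accuracy_and_error(model_a, true_labels)
acc_b, err_b = accuracy_and_error(model_b, true_labels)
print(acc_a, err_a)  # 0.75 0.25
print(acc_b, err_b)  # 0.75 0.75
```

Both hypothetical models misclassify exactly one sample, so their accuracies match, but the model whose error lands on a nearby reference point has a lower average positioning error.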
[...] performance of the CNN-IPS, KD-CNN-IPS and all of the TAKD-CNN-IPS schemes, the difference in terms of average positioning error is around 0.025 m. For an indoor positioning system, a difference of 0.025 m can be regarded as insignificant.

Figure 5 presents the improvement in positioning accuracy of the proposed technique over the baseline CNN-IPS (SM) and the KD-CNN-IPS technique. The figure shows that TAKD-CNN-IPS (M3) improves the testing accuracy by up to 10.53% compared with CNN-IPS (SM). In addition, the testing accuracies of TAKD-CNN-IPS (M3) and TAKD-CNN-IPS (M2) are superior to that of KD-CNN-IPS, while TAKD-CNN-IPS (M1) has the same accuracy as KD-CNN-IPS. Hence, this provides evidence that when an intermediate network is introduced, the student network can capture more knowledge because it is trained using the soft distribution produced by a network with a closer capacity.

As shown in Figures 6 and 7, CNN-IPS (SM) is the worst performer in terms of average positioning error. It can be seen that all of the models based on the TAKD-CNN-IPS [...] CNN-IPS and CNN-IPS (SM), which is 0.17 s. This observation can be explained as follows. During the testing phase, the KD-CNN-IPS and CNN-IPS only need to execute their student model for location prediction. Since the architectures of CNN-IPS (SM) and the student models of TAKD-CNN-IPS (M1), TAKD-CNN-IPS (M2), TAKD-CNN-IPS (M3) and KD-CNN-IPS are identical, all these techniques result in the same testing time.

Conclusions
Despite the great success of CNN-IPS in achieving impressive positioning performance, it is impractical to implement such cumbersome deep models on resource-constrained devices because of their prohibitively high computational complexity and massive storage requirements. One possible solution is to apply KD-CNN-IPS to distill the essential information acquired by a complex pre-trained CNN-IPS (teacher model) into a lightweight CNN-IPS (student model). However, if the teacher model is far more complex than the student model, the performance of the KD-CNN-IPS will degrade, as the logits produced by the teacher network will be less soft and the lightweight student model has insufficient capacity to imitate the behavior of the teacher network. To circumvent these issues, this paper proposes a novel TAKD-CNN-IPS, whereby teacher-assistant models are employed as intermediate networks and multi-step knowledge distillation is performed. Extensive simulations have been conducted to evaluate the performance of the proposed technique, and the effects of the hyperparameters have been analyzed. The results demonstrate that the proposed TAKD-CNN-IPS techniques achieve larger reductions in average positioning error over the baseline CNN-IPS (SM) than their KD-CNN-IPS counterpart. Quantitatively, in terms of average positioning error, the improvement of the proposed TAKD-CNN-IPS models over the baseline CNN-IPS (SM) ranges from 33.81% to 38.79%, while KD-CNN-IPS only attains a gain of 29.73%. In addition, the testing time incurred by the proposed models is substantially shorter than that of CNN-IPS (TM), since the testing time of TAKD-CNN-IPS is only 14.91% of that associated with CNN-IPS (TM).
The combined features of excellent localization performance, simple architecture and shorter execution time make the TAKD-CNN-IPS an appealing option for deployment on edge devices, such as smartphones or embedded sensor nodes, to support various real-time indoor applications, which include, but are not limited to, positioning, tracking and navigation.