A Novel Sensor Data Pre-Processing Methodology for the Internet of Things Using Anomaly Detection and Transfer-By-Subspace-Similarity Transformation

The Internet of Things (IoT) and sensors are becoming increasingly popular, especially for monitoring large ambient environments. Applications that embrace IoT and sensors often require mining the data feeds, which are collected at frequent intervals, for intelligence. Although such sensor data are massive, most of the data contents are identical and repetitive; an example is human traffic in a park at night. Most traditional classification algorithms were formulated decades ago and were not designed to handle such sensor data effectively. Hence, the performance of the learned model is often poor because of the small granularity in classification and the sporadic patterns in the data. To improve the quality of data mining from IoT data, a new pre-processing methodology based on subspace similarity detection is proposed. Our method integrates well with traditional data mining algorithms and anomaly detection methods. The pre-processing method is flexible enough to handle the similar kinds of sporadic sensor data that exist in many ambient sensing applications. The proposed methodology is evaluated through extensive experiments with a collection of classical data mining models. The results show an improvement in precision when the proposed method is used.


Introduction
The infrastructure of the Internet of Things (IoT) is being established rapidly, driven by the hype around smart cities all over the world. Many ambient-sensing applications that tap into the maturity of IoT technology have subsequently been developed [1]. Through these applications and the proliferation of sensor equipment and ubiquitous communication technologies, we are able to collect a huge amount of useful data, which was not possible before [2,3].
As Figure 1 shows, there are several problems in the Internet of Things. For example, human activity recognition (HAR), a typical and challenging ambient sensing application, collects data about human activities, such as walking, running and standing, in a confined space [4]. HAR tries to make sense of a massive amount of continuously collected data in order to analyze what a person is doing over a long period of time. Another typical application of ambient sensing focuses on detecting unusual environmental occurrences by using many outdoor sensors such as atmospheric sensors [5]. Although sensors have become ubiquitous and their applications cover every aspect of our lives, the data mining component of an IoT system has to remain effective in the face of the sheer volume of dynamic data. Sensor data are data feeds that exhibit patterns when they are zoomed out and viewed longitudinally. These patterns are somewhat different from the structured datasets traditionally used for supervised learning.
Some researchers have argued that this kind of sensor data has unique characteristics that make it unsuitable for direct use in data analysis [6]. First, the data collected by sensors are usually numerical values. Second, the frequency of collection is relatively high, and a lot of data can be collected within a few seconds, depending on the sampling rate. Third, adjacent data cells along a sensor data sequence may be very similar to each other, without any significant change in value over a period of time. Consider, for instance, a sensor that is tasked to monitor the activity of a sleeping baby or a sitting phone operator: their body postures change relatively little, in slow or still motions. The same goes for monitoring the humidity of a forest or farmland in fine and stable weather [7]. Fourth, sensor data may be corrupted by noise and may lose useful information due to transmission errors or malfunctioning sensors [8]. It is not uncommon for sensors to collect bogus data because of an unreliable medium or external interference in a large-scale IoT network deployed in a harsh environment.
From the perspective of data mining, the process may not know whether a piece of data is noise or a plain outlier until the fault is pinpointed. For example, when a sensor malfunctions, it sends incorrect readings to the server. The wrong data will not be discovered, although they are worse than useless for training a data mining model. As a result, the induced model is degraded by junk data. The wrong predictions output by the data mining model would propagate throughout the IoT network and eventually to the final [9]. Basically, the IoT data collected by sensors are sequential and huge. The data values are largely repetitive, and noise with irregular values may easily go unnoticed. Many users aggregate the data to derive descriptive statistics such as means and distributions, and pay relatively little attention to individual pieces of data at the micro-scale. Sensor data are often generated at narrow intervals with a high sampling rate. Problematic data that have gone unnoticed will cause the performance of the classifiers to drop during operation. At the data level, it is difficult for the model induction process to know whether an incoming training instance will be counter-productive to the supervised learning. Handling this may require post-processing or feedback learning. A simpler approach is pre-processing, which is usually fast and lightweight and is therefore suitable for dynamic IoT scenarios.
Some popular pre-processing methods, such as Principal Component Analysis (PCA) [10], only change the original data by reducing the attribute dimension. Dimension reduction may not be very effective here because sensor data may have only a few attributes describing the sensor readings. Sensor data are time series that arrive as a training data feed in a sequential manner. Some regularization methods [11] only change the scope of the data, and they cannot effectively isolate outliers from a data feed of mediocre values. Changing the time interval of the instances by using a simple sliding window and statistical methods is possible and straightforward, but it could be further improved. An efficient and effective pre-processing mechanism is required for upholding the performance of a trained model. To solve the problem of "repetitive and redundant data" in an IoT situation, a novel pre-processing methodology is proposed in this paper. The proposed pre-processing part of a classification model calculates the probability between the subspace of the source sequential data and the target data. The model transforms the subsequent input data into probability data over a period whose length is set by the sliding window, which controls the time interval and hence the resolution. It focuses on the data features while maintaining the original structure of the data.
The contributions of this paper are as follows: (i) A pre-processing method suitable for sensor data that may have persistent and redundant data values is proposed. The method converts the original data values into probabilities that are computed based on the similarity of subspaces. (ii) The pre-processing method allows the size of the subspace and the length of the sliding window to be set, so it can be matched to the time segment analysis required by the real task. (iii) The proposed pre-processing mechanism can be combined with different models, reducing their sensitivity to noisy data and redundancy problems. The precision of the classification model is improved by using this pre-processing method rather than using the classification algorithms alone. The source code of this new pre-processing methodology can be downloaded for testing and verification at https://github.com/Ayo616/TBSS.
In Section 2, related work on pre-processing methods that are applicable to sensor data is introduced. In Section 3, the proposed pre-processing methodology is explained in detail. In Section 4, validation experiments are designed to evaluate the performance of our method, and empirical datasets are used in the comparison with other methods. In Section 5, the work is concluded.

Literature Review
With the advancement of communication technologies and the growing number of electronic sensing devices, sensor-centric IoT technologies and applications have experienced rapid growth. According to recent statistics [12], by 2017 the total value of the IoT industry had reached 29 billion U.S. dollars. This huge market has attracted attention from both practitioners and academic researchers.
Over the years, some companies have been developing and building IoT smart systems such as smart homes, smart transportation, and smart security [13][14][15]. These technologies are aimed at solving specific problems for a relatively pure dataset in the Internet of Things.
The massive deployment of IoT devices helps people to obtain large amounts of sensory data. How to tap valuable information from this vast amount of data and turn it into knowledge that serves life more effectively is an important issue [16]. Some researchers have tried to apply current data mining technology to the development of the Internet of Things in order to make it more intelligent [17][18][19][20]. These methods under the IoT framework can usefully solve some parts of the RFID mining task problem.
Clustering is commonly used in data mining of the Internet of Things. The most common clustering method is K-means [21], which is very mature in traditional data mining. It divides a dataset into several clusters and has been widely used in unsupervised classification tasks. The distribution of Internet of Things data is in some cases a clustering problem [22,23], but clustering only groups similar data and cannot judge what each group means. If people are unfamiliar with the collected data, they cannot rely on the clustering results to dig out effective knowledge precisely.
In supervised learning, decision tree algorithms are often used for data mining of the Internet of Things [24,25]. Probabilistic models, such as the Naive Bayesian model, are also widely used [26,27]. Another simple and efficient classification method in machine learning is the SVM [28], which, combined with a kernel function, can linearly separate data in a high-dimensional space. SVM methods are suitable for small datasets because of their long learning time. Finding a hyperplane in SVM is time-consuming because the method iteratively searches for the most appropriate non-linear division among the data.
However, traditional classification models show unstable performance on actual sensor data because of the unique nature of Internet of Things data. Analysis and observation show that collected sensor data are highly repetitive and often contain noise, which may originate from sensor detection errors. This leads to too many negative instances in the classification process, so the accuracy of the trained model decreases. In addition, in real life people pay more attention to the results over a period of time, which is inconsistent with the practice of collecting data every second.
Some research [10] pre-processes the dimensions to deal with the sensor data problem, which is advantageous for tackling the dimension explosion problem. However, useless features increase the amount of calculation and do not benefit the model's precision. The performance of these methods in sensor data mining is therefore not very good, because this type of method only affects the features of an instance [11]. Sensor data have a time-series nature, so some research uses sliding windows as a pre-processing methodology to adjust the time interval and perform statistical computation within the windows. This approach is straightforward and easy to implement, but it is still affected by noisy data. Its contribution to the performance of the model is limited [29,30] because it depends on the purity of the data and the special requirements of the task.
Our intuition for this method is that a subspace corresponds to a behavior state. No matter how redundant the data are, the corresponding space is limited, so we can detect the similarity of spaces to identify the status. Therefore, we propose a new type of data pre-processing methodology called Transfer by Subspace Similarity (TBSS). TBSS constructs subspaces from the initial dataset, applies an anomaly detection algorithm to derive a probability table, and finally uses a traditional classification method for classification. The TBSS pre-processing method can significantly improve the precision of classification. Moreover, this pre-processing methodology can be combined with various classification methods and anomaly detection methods, and it is flexible enough for real-time activity detection.

Problem Definitions
Suppose we have a training dataset $D_{train}$ with instances $d_i^j = (x_i^j, c_n)$, $d \in D_{train}$, which contains the original sequential training data. These data are all collected from sensors.
The training dataset includes different labels. The collection of these labels is $C = \{c_1, c_2, \ldots, c_n\}$, $c_n \in C$. These labels indicate the status of the object at a given moment.
Each instance in the dataset has a set of features.
The collection of these features, without the label, is $X = \{x_1, x_2, \ldots, x_i\}$.
Our task is to design an algorithm that handles these sequential data in order to improve the quality of data mining in the Internet of Things. We therefore need to transfer the dataset $D$ to $\tilde{D}$, with $\tilde{d} = (x_e^n, c_n)$, $\tilde{d} \in \tilde{D}$.

Reconstruct Training Data Table
In this step, we deal with the original training data table. We divide the original dataset into groups according to their labels, which gives a collection $\tau$ for each class.
Next, we construct a sub-dataset $\Delta$ from $\tau$; the sample size of $\Delta$ is $z_\Delta$. $\tau$ represents the datum of a space.
We also construct a sub-dataset $\Gamma$ from $\tau$; the sample size of $\Gamma$ is $z_\Gamma$. $\Gamma$ is used for comparison with $\tau$.
We further construct a noisy dataset $\varphi = \{Nos^{C_m}\}_{m=1}^{n}$ and add $\varphi$ to $\Gamma$.
All of the sampling in the above operations selects data at random. Once these sub-datasets have been constructed, we can compute information between them to serve as new features. We now give some definitions, starting with the TBSS (transfer by subspace similarity) function.
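The sampling steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the helper names `split_by_class` and `sample_subspaces` are hypothetical, instances are assumed to be `(features, label)` pairs, and the noise $\varphi$ is assumed to be drawn uniformly from a user-supplied value range, which the text does not specify.

```python
import random
from collections import defaultdict

def split_by_class(rows):
    """Group feature vectors by label: one collection (tau) per class."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[label].append(features)
    return dict(groups)

def sample_subspaces(tau, z_delta, z_gamma, n_noise, noise_range, rng):
    """Randomly draw Delta (size z_delta) and Gamma (size z_gamma) from tau.

    n_noise synthetic points (phi) are appended to Gamma. Drawing them
    uniformly from noise_range is an assumption made for illustration.
    """
    delta = rng.sample(tau, z_delta)
    gamma = rng.sample(tau, z_gamma)
    n_features = len(tau[0])
    lo, hi = noise_range
    phi = [[rng.uniform(lo, hi) for _ in range(n_features)]
           for _ in range(n_noise)]
    return delta, gamma + phi

# Toy data: two activities with two numeric features each.
rng = random.Random(42)
rows = [([rng.gauss(0, 1), rng.gauss(0, 1)], "walking") for _ in range(100)]
rows += [([rng.gauss(5, 1), rng.gauss(5, 1)], "sitting") for _ in range(100)]

groups = split_by_class(rows)
delta, gamma = sample_subspaces(groups["walking"], z_delta=20, z_gamma=10,
                                n_noise=2, noise_range=(-10.0, 10.0), rng=rng)
print(len(delta), len(gamma))  # 20 12
```

Passing the random generator explicitly keeps the sampling reproducible, which helps when the same subspaces have to be rebuilt across repeated runs.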

Definition 1.
Function $TBSS_{train}$ is defined to generate a new training dataset $T_1$ from $T_0$, given the settings $z_\Gamma$ (the length of $L_0^{C_m}$) and $z_\Delta$ (the length of $ST_0^{C_m}$).
The new table is combined as follows, where $x_e$ is a new instance of the new table.
The features of each instance in the new table are computed by the function TDT.
$x_e = (\{[TDT]_{e,m}\}_{m=1}^{n}, c_i)$
Here $x_e$ is a new training instance in $\tilde{D}_{train}$. Its attributes are a group of non-isolation rates, and its class is $c_i$. The non-isolation rates are computed by the function ITR(), whose input is a combination $(L_0^{C_i}, ST_0^{C_j})$ with $1 \le i \le n$, $1 \le j \le n$ and $z_\Gamma < z_\Delta$. In the resulting table, the element in the $k$-th row and the $(n+1)$-th column is $C_k$. $\Delta$ and $\Gamma$ are sub-datasets of $\tau$, so they are subspaces of $\tau$. We use the non-anomaly attribute to represent the similarity of two spaces.

Definition 3.
Function ITR() uses $X$ as the standard case to train an isolation forest, an algorithm proposed by Liu, Ting and Zhou [31] for detecting isolated points. The sample set $Y$ (the detection case) is then fed to the isolation forest model to determine whether it contains isolated points. Finally, the function computes the rate $P$ of the data in $Y$ that follow the distribution of $X$. In other words, this function computes the non-isolation rate of $Y$.

Reconstruct Test Data Table
Suppose we have a test dataset. The labels of the test dataset are used only in the evaluation process. We extract the features from each instance. The test dataset is the collection $d = (y_i^j, c_n)$, $d \in D_{test}$. We apply the same kind of transfer process to $D_{test}$, which yields $\tilde{d} = (w_r^j, c_n)$, $\tilde{d} \in \tilde{D}_{test}$. A sliding window is used to construct the sub-datasets. The length of the sliding window is $p$, which is the same as $z_\Gamma$. The original dataset can thus be transferred to windows $w_z = (y_{1+z}, y_{2+z}, \ldots, y_{p+z})$, where $r$ is determined by the sliding window length $p$ and the length of the test dataset.
In the definition of $TDT_1$, $ST_0^{C_j} = Sam_{md}(T_0^{C_j}, z_\Delta)$, $1 \le i \le n$ and $1 \le t \le T$. $w_t$ is the $t$-th sliding window, and the element in the $t$-th row and $j$-th column of $TDT_1$ is computed by the function $ITR(w_t, ST_0^{C_j})$.
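To make the non-isolation rate computed by ITR() more concrete, the following is a minimal sketch that uses only the Python standard library. It is not the authors' implementation: the small isolation forest is a simplified re-implementation written for illustration, the function and parameter names (`itr`, `n_trees`, `sample_size`, `threshold`) are hypothetical, and the cut-off of 0.5 on the anomaly score is an assumption. In practice a library implementation such as scikit-learn's `IsolationForest` could take its place.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Isolate points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(point, tree):
    """Depth at which the point lands, plus the leaf-size correction."""
    depth = 0
    while tree[0] == "node":
        _, dim, split, left, right = tree
        tree = left if point[dim] < split else right
        depth += 1
    return depth + c_factor(tree[1])

def itr(standard, detection, n_trees=50, sample_size=64, threshold=0.5, seed=0):
    """Non-isolation rate: share of `detection` points that a forest trained
    on `standard` does not flag as isolated (score below `threshold`)."""
    if len(standard) < 2 or not detection:
        raise ValueError("need at least two standard points and one detection point")
    rng = random.Random(seed)
    psi = min(sample_size, len(standard))
    max_depth = math.ceil(math.log2(psi))
    forest = [build_tree(rng.sample(standard, psi), 0, max_depth, rng)
              for _ in range(n_trees)]
    normal = 0
    for y in detection:
        mean_depth = sum(path_length(y, t) for t in forest) / n_trees
        score = 2.0 ** (-mean_depth / c_factor(psi))  # anomaly score in (0, 1]
        if score < threshold:
            normal += 1
    return normal / len(detection)

# Toy check: points from the same distribution versus points far away from it.
rng = random.Random(1)
standard = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
near = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
far = [[rng.gauss(50, 1), rng.gauss(50, 1)] for _ in range(50)]
rate_near = itr(standard, near)
rate_far = itr(standard, far)
print(rate_near, rate_far)
```

A window drawn from the same activity as the reference sample should yield a noticeably higher rate than a window drawn from a different activity, which is the property the transformed attributes rely on.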
With the help of $TDT_1$, the final high-level function of step two can be defined as follows.

Definition 5.
Function $TBSS_{test}$ is defined for transforming the testing dataset $D_{test}$ into a new testing dataset $\tilde{D}_{test}$, which has the same categories of attributes as $T_1$. Part of the computing input changes, however: we no longer use $\tau$ to obtain the sample set $L_0^{C_i}$ of size $z_\Gamma$, but use the sliding windows of $D_{test}$ instead, and the class of each new testing instance is replaced by $Maj_{w_t}$, the majority class of the sliding window.
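The two transformations can be sketched together as follows. This is a hedged illustration rather than the published code: `tbss_train`, `tbss_test` and `range_rate` are hypothetical names, the windows are assumed not to overlap unless a `step` is given, the reference samples are assumed to be redrawn in every epoch, and `range_rate` is only a simple placeholder for ITR() so that the example stays self-contained. The argument order `(reference, sample)` is also an assumption.

```python
import random
from collections import Counter

def tbss_train(class_data, z_gamma, z_delta, epochs, rate_fn, rng):
    """Build the transformed training table.

    class_data maps each label to its list of feature vectors. In every
    epoch, one sample of size z_gamma per class is scored against a
    reference sample of size z_delta from every class.
    """
    labels = sorted(class_data)
    table = []
    for _ in range(epochs):
        refs = {c: rng.sample(class_data[c], z_delta) for c in labels}
        for c_i in labels:
            sample = rng.sample(class_data[c_i], z_gamma)
            row = [rate_fn(refs[c_j], sample) for c_j in labels]
            table.append((row, c_i))
    return table

def tbss_test(test_rows, class_data, p, z_delta, rate_fn, rng, step=None):
    """Transform a labelled test sequence window by window.

    Each window of length p becomes one instance whose class is the
    majority label inside the window.
    """
    step = step or p
    labels = sorted(class_data)
    refs = {c: rng.sample(class_data[c], z_delta) for c in labels}
    table = []
    for start in range(0, len(test_rows) - p + 1, step):
        window = test_rows[start:start + p]
        points = [features for features, _ in window]
        majority = Counter(label for _, label in window).most_common(1)[0][0]
        row = [rate_fn(refs[c_j], points) for c_j in labels]
        table.append((row, majority))
    return table

def range_rate(reference, points):
    """Placeholder for ITR(): share of points inside the reference bounding box."""
    dims = range(len(reference[0]))
    lo = [min(r[d] for r in reference) for d in dims]
    hi = [max(r[d] for r in reference) for d in dims]
    inside = sum(all(lo[d] <= pt[d] <= hi[d] for d in dims) for pt in points)
    return inside / len(points)

rng = random.Random(0)
class_data = {
    "sitting": [[rng.gauss(0, 1)] for _ in range(200)],
    "walking": [[rng.gauss(10, 1)] for _ in range(200)],
}
train = tbss_train(class_data, z_gamma=10, z_delta=20, epochs=5,
                   rate_fn=range_rate, rng=rng)
test_rows = [([rng.gauss(10, 1)], "walking") for _ in range(30)]
test = tbss_test(test_rows, class_data, p=10, z_delta=20,
                 rate_fn=range_rate, rng=rng)
print(len(train), len(test))  # 10 3
```

Each transformed row has one attribute per class, so the new tables have the same shape for training and testing and can be handed to any ordinary classifier.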

Step 3: Model Learning
After using the pre-processing method to generate the new training dataset $\tilde{D}_{train}$ and the new testing dataset $\tilde{D}_{test}$, the user can apply different algorithms to make predictions with their help. A group of high-level equations is defined to represent these processes: it uses the training dataset (such as $\tilde{D}_{train}$), the testing dataset (such as $\tilde{D}_{test}$) and a group of parameters for training and testing the model. Finally, the performance evaluation index $pf$ is obtained.
where $Pre$ is the collection of all possible specific parameters with respect to the demands of the user and $P \in Pre$, and $pf$ is a performance evaluation index, which is a combination of several statistical measures of the model such as accuracy, recall and F1-score. In this paper, five classical classification algorithms are tested: SVM, logistic regression, C4.5, the Bayes classifier and KNN.
In TBSS, the probability associated with the similarity between each pair of subspaces is calculated. If the subject is walking, the data collected in the walking state naturally belong to the statistical distribution of the walking training dataset. Walking is known to be a homogeneous activity without any strange behaviour, so the walking test data should yield the same data space. Time consumption depends on the time spent on constructing the subspace and on the distribution of the original space. If the original space is large (for example, there may be different styles of lying down), many sampling rounds are needed to ensure coverage of the original space. If the original space is small (for example, there is only one simple kind of walking behaviour), few sampling rounds are needed and the time cost is low. Figures 2 and 3 show the details of the TBSS process, and the pseudo code of TBSS is given in Algorithm 1.
After comparing the performance before and after pre-processing, we found that the new method does improve the accuracy of the prediction. The following section presents the experimental results in detail.

Experiment
In this section, the performance of the proposed method is evaluated through extensive experiments. We chose three datasets generated from IoT applications, all of which are related to sensors and ambient assisted environments. The performance of the five chosen classification algorithms, run without any pre-processing, serves as a baseline. The baseline is then compared with three existing pre-processing methods as well as our new method.

Datasets
Three datasets are prepared for this experiment. All of them are related to sensor applications that monitor human activities and the environment. The description of each dataset is tabulated in Table 1. These datasets inherently suffer from the redundancy problem but lack error information. To simulate the real situation in IoT, where faulty sensors can exist, we inject a controlled level of noisy data when constructing the training and testing datasets. As Figure 4 shows, we first analyze some features of the datasets. The source datasets are all from the UCI datasets website.
Heterogeneity Activity Recognition Data Set: The Heterogeneity Human Activity Recognition (HHAR) dataset from smartphones and smartwatches is a dataset devised to benchmark human activity recognition algorithms (classification, automatic data segmentation, sensor fusion, feature extraction, etc.) in real-world contexts. Specifically, the dataset is gathered with a variety of different device models and use scenarios in order to reflect the sensing heterogeneities to be expected in real deployments.
Localization Data for Person Activity Data Set: Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.
Activity Recognition system based on Multisensor data fusion (AReM) Data Set: This dataset contains temporal data from a Wireless Sensor Network (WSN) worn by an actor performing the following activities: bending, cycling, lying down, sitting, standing and walking. The dataset represents a real-life benchmark in the area of activity recognition applications. The classification task consists of predicting the activity performed by the user from the time series generated by the WSN, according to the EvAAL competition technical annex.

Comparison of Pre-Processing Methods
This section compares the performance of TBSS in two ways. First, five traditional classification algorithms were selected as baselines:
• K-Neighbors Classifier
• Logistic Regression
• Gaussian Naive Bayes
• Decision Tree
• Support vector machine
We apply these algorithms directly to the data and compare the results with those obtained after TBSS. Second, three well-known pre-processing methods were selected for comparison: PCA, Incremental PCA and the Normalize Method. Two of them (PCA and Incremental PCA) reduce the dimension, and one (the Normalize Method) changes the range of the data values. These methods are useful in traditional data mining because they can deal with the redundancy problem and dimension explosion.

Evaluation Criteria
Three evaluation indicators are used for evaluating the pre-processing methods: precision, recall and F1-score. These criteria are very commonly used in classification tasks, so we chose them as the baseline evaluation. In the experimental results, we focus on precision, the indicator on which our proposed methodology performs well.
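For reference, the three indicators can be computed from true and predicted labels as in the sketch below. The text does not state how the per-class scores are averaged, so macro-averaging is an assumption here, and `precision_recall_f1` is a hypothetical helper name rather than a function from the authors' code.

```python
def precision_recall_f1(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels, chosen only to exercise the function.
y_true = ["walk", "walk", "sit", "sit"]
y_pred = ["walk", "sit", "sit", "sit"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8333 0.75 0.7333
```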

Parameters Setting
For the parameters of the compared methods, the default values given by the Sklearn package are used. For a fair evaluation, the program code is run ten times and the average precision, recall, etc. are reported. The length of the sliding window can be changed to suit different tasks. Further experimentation to explore the optimal sliding window size is planned as future work. The amount of noise injected into the original dataset is proportional to the size of the original dataset.
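The repeated-run protocol and the proportional noise level can be expressed as two small helpers. This is a sketch under assumptions: `average_over_runs`, `noise_count` and `toy_run` are hypothetical names, the noise ratio is not given in the text, and `toy_run` returns arbitrary numbers purely so that the example executes; it does not reproduce any result from the paper.

```python
import statistics

def average_over_runs(run_once, n_runs=10, base_seed=0):
    """Call run_once(seed) n_runs times and average each reported metric."""
    results = [run_once(base_seed + i) for i in range(n_runs)]
    return {name: statistics.mean(r[name] for r in results)
            for name in results[0]}

def noise_count(n_original, ratio):
    """Number of noisy instances to inject, proportional to the dataset size."""
    return round(n_original * ratio)

def toy_run(seed):
    """Stand-in for one train-and-evaluate cycle (arbitrary values)."""
    return {"precision": 0.5 + 0.01 * seed, "recall": 0.4}

averaged = average_over_runs(toy_run, n_runs=10)
n_noise = noise_count(1000, ratio=0.05)  # the 5% ratio is a hypothetical choice
print(averaged, n_noise)
```

Deriving each run's seed from a base seed keeps the ten runs different from one another while leaving the whole experiment repeatable.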

Result and Analysis
First, we use the original classification algorithms as a baseline on our datasets. As the results in Tables 2-4 show, the performance of the models trained on raw data is not excellent because the raw data contain a great deal of noise. We also found that the running time of the whole process is long because the model repeatedly processes redundant data.
Table 2 shows the performance of TBSS and the other pre-processing methods on the localization recognition dataset. The TBSS pre-processing method helps to improve the performance of the model. The parameters are set as $z_\Gamma = 10$, $z_\Delta = 20$ and epoch = 50. The KNN method does not perform very well on this dataset. Compared with the other methods, TBSS achieves high precision. In particular, the combination of TBSS and the decision tree achieves a precision of about 0.3264 on this dataset.
Table 3 shows the performance of TBSS and the other pre-processing methods on the AReM dataset. For this dataset, the parameters are set as $z_\Gamma = 10$, $z_\Delta = 20$ and epoch = 50. As we can see, TBSS pre-processing successfully improves the performance of the model when combined with a classification algorithm. The precision of the Bayesian algorithm with TBSS reaches about 0.3477, which is higher than the original precision.
Table 4 shows the performance of TBSS and the other pre-processing methods on the activity recognition dataset. For this dataset, the parameters are set as $z_\Gamma = 20$, $z_\Delta = 30$ and epoch = 50, and we adjust the interval of the data in order to obtain higher precision. TBSS still performs well compared with the other methods.
TBSS obtains good results on these datasets despite their noise and redundancy problems, because it focuses on the similarity between a space and a subspace, which is not easily affected by single erroneous instances. We use non-anomaly detection and statistics to represent the similarity of each instance. These mechanisms strengthen the model's robustness and precision. For a realistic simulation, we design further experiments in which the lengths $(z_\Gamma, z_\Delta)$ are variable. When $z_\Gamma$ and $z_\Delta$ change, the whole model yields different performance. As Table 5 shows, we fix the value of epoch, set $(z_\Gamma, z_\Delta)$ to (10, 20), (20, 30) and (20, 40), and test TBSS on the localization dataset. The setting $z_\Gamma = 20$, $z_\Delta = 40$ is clearly better. We analyzed this dataset and found that its suitable classification interval is twenty to thirty seconds (which corresponds approximately to $z_\Gamma = 20$). This result indicates that the TBSS methodology is robust in pre-processing sensor data according to the time interval of interest.
The final experiment tests the influence of epoch. TBSS receives more instances when the epoch is large, because the number of sampling rounds decides how well the subspace covers the original space. As Table 6 shows, we set the parameters $(z_\Gamma, z_\Delta)$ to (10, 20), (20, 30) and (10, 30). We try to use 10 to 3 epochs to run the model. Within a certain range, more epochs yield a greater improvement in the precision of the model. Beyond that range, the number of epochs no longer affects the model's performance because the sampled subspaces overlap.

Conclusions
This paper reports a subspace similarity detection model based on subspace-attribute probability calculation, in which the computational process uses an anomaly detection method. The proposed methodology is intended as a pre-processing method that transforms a finely granular, frequently sampled dataset into a probability dataset over a period, and hence leads to better training of the classification model. The TBSS pre-processing method can effectively mitigate the problems of repetition and noise that exist in sensor data. TBSS combines smoothly with anomaly detection and traditional classification algorithms, so it should be flexible enough for most real-life IoT machine learning applications. Through different experiments, we observe that this model can effectively improve the performance of traditional machine learning classification algorithms in data mining of the Internet of Things. Although TBSS improves the precision in the sensor data mining task, it has a shortcoming: its time consumption is relatively long, because the sampling incurs a certain overhead for covering the whole search space in each step of window sliding. In future work, we will look into formulating a low-complexity design suitable for fast sampling in TBSS. The code of TBSS should also be optimized to leverage GPU computation for faster running.