A Novel Sensor Data Pre-Processing Methodology for the Internet of Things Using Anomaly Detection and Transfer-By-Subspace-Similarity Transformation
Abstract
1. Introduction
2. Literature Review
3. Methodology
3.1. Problem Definitions
3.2. Reconstruct Training Data Table
3.3. Reconstruct Test Data Table
3.4. Step 3: Model Learning
4. Experiment
4.1. Datasets
4.2. Comparison of Pre-Processing Methods
- K-Neighbors Classifier
- Logistic Regression
- Gaussian Naive Bayes
- Decision Tree
- Support Vector Machine
- PCA
- Incremental PCA
- Normalize Method
4.3. Evaluation Criteria
- Precision
- Recall
- F1 score
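The three criteria listed above can be illustrated with a minimal sketch. Because the datasets are multi-class, some averaging scheme across labels is required; this excerpt does not say which one the authors used, so macro-averaging is an assumption here, and the function name `macro_precision_recall_f1` and the sample labels are hypothetical.

```python
from collections import Counter


def macro_precision_recall_f1(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) for multi-class labels.

    Macro-averaging is an assumption of this sketch, not a detail
    confirmed by the source.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count per-label true positives, false positives and false negatives.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    # Unweighted mean over labels (macro average).
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical activity labels, for illustration only.
y_true = ["walk", "walk", "sit", "sit"]
y_pred = ["walk", "sit", "sit", "sit"]
precision, recall, f1 = macro_precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With macro-averaging every label contributes equally regardless of how many instances it has, so a poorly predicted minority class lowers the score as much as a majority class would.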
4.4. Parameters Setting
4.5. Result and Analysis
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Dataset | Instances | Features | Labels | Noisy |
---|---|---|---|---|
LD | 164860 | 8 | 8 | 10000 |
HHAR | 43930257 | 16 | 6 | 100000 |
AReM | 42240 | 6 | 4 | 1000 |
Model | Precision | Recall | F1 Score |
---|---|---|---|
Localization Dataset () | |||
KNN | 0.1721 ± 0.018 | 0.3015 ± 0.011 | 0.2134 ± 0.019 |
KNN+PCA | 0.1154 ± 0.016 | 0.2125 ± 0.013 | 0.1363 ± 0.014 |
KNN+Normalize | 0.1422 ± 0.012 | 0.2614 ± 0.015 | 0.1832 ± 0.009 |
KNN+IPCA | 0.0134 ± 0.017 | 0.1683 ± 0.029 | 0.1035 ± 0.027 |
KNN+TBSS | 0.0467 ± 0.028 | 0.0454 ± 0.014 | 0.0335 ± 0.018 |
DT | 0.1811 ± 0.015 | 0.2436 ± 0.025 | 0.2021 ± 0.019 |
DT+PCA | 0.1043 ± 0.018 | 0.1653 ± 0.029 | 0.1137 ± 0.025 |
DT+Normalize | 0.1457 ± 0.028 | 0.2134 ± 0.014 | 0.1612 ± 0.018 |
DT+IPCA | 0.0854 ± 0.035 | 0.1210 ± 0.023 | 0.0814 ± 0.037 |
DT+TBSS | 0.3264 ± 0.016 | 0.0978 ± 0.015 | 0.1248 ± 0.026 |
SVM | 0.1321 ± 0.035 | 0.3014 ± 0.029 | 0.1731 ± 0.013 |
SVM+PCA | 0.0934 ± 0.019 | 0.2426 ± 0.028 | 0.1367 ± 0.018 |
SVM+Normalize | 0.0823 ± 0.031 | 0.2464 ± 0.012 | 0.1134 ± 0.019 |
SVM+IPCA | 0.1012 ± 0.027 | 0.2810 ± 0.036 | 0.1521 ± 0.021 |
SVM+TBSS | 0.1634 ± 0.022 | 0.0767 ± 0.010 | 0.0936 ± 0.019 |
LG | 0.1311 ± 0.029 | 0.3023 ± 0.019 | 0.1763 ± 0.016 |
LG+PCA | 0.0325 ± 0.026 | 0.0312 ± 0.028 | 0.0853 ± 0.019 |
LG+Normalize | 0.0821 ± 0.009 | 0.2516 ± 0.008 | 0.1274 ± 0.007 |
LG+IPCA | 0.1234 ± 0.020 | 0.2346 ± 0.010 | 0.1474 ± 0.012 |
LG+TBSS | 0.2668 ± 0.025 | 0.1121 ± 0.019 | 0.1453 ± 0.016 |
NB | 0.1333 ± 0.022 | 0.2745 ± 0.018 | 0.1624 ± 0.017 |
NB+PCA | 0.0929 ± 0.014 | 0.2535 ± 0.018 | 0.1374 ± 0.019 |
NB+Normalize | 0.1153 ± 0.013 | 0.2074 ± 0.018 | 0.1136 ± 0.018 |
NB+IPCA | 0.1167 ± 0.012 | 0.2963 ± 0.011 | 0.1526 ± 0.019 |
NB+TBSS | 0.2442 ± 0.011 | 0.0947 ± 0.013 | 0.1163 ± 0.029 |
Model | Precision | Recall | F1 Score |
---|---|---|---|
AReM Dataset () | |||
KNN | 0.1743 ± 0.016 | 0.1945 ± 0.014 | 0.1324 ± 0.012 |
KNN+PCA | 0.3053 ± 0.016 | 0.2631 ± 0.018 | 0.2734 ± 0.016 |
KNN+Normalize | 0.1784 ± 0.014 | 0.1982 ± 0.028 | 0.1335 ± 0.014 |
KNN+IPCA | 0.3033 ± 0.101 | 0.2613 ± 0.018 | 0.2743 ± 0.026 |
KNN+TBSS | 0.3042 ± 0.026 | 0.2634 ± 0.025 | 0.2733 ± 0.018 |
DT | 0.1832 ± 0.010 | 0.2421 ± 0.015 | 0.2053 ± 0.015 |
DT+PCA | 0.3453 ± 0.013 | 0.2442 ± 0.018 | 0.2042 ± 0.016 |
DT+Normalize | 0.4935 ± 0.024 | 0.2674 ± 0.032 | 0.2434 ± 0.018 |
DT+IPCA | 0.1845 ± 0.012 | 0.1534 ± 0.012 | 0.1057 ± 0.014 |
DT+TBSS | 0.3037 ± 0.019 | 0.3353 ± 0.018 | 0.3073 ± 0.015 |
SVM | 0.1335 ± 0.012 | 0.3073 ± 0.022 | 0.1753 ± 0.017 |
SVM+PCA | 0.2675 ± 0.010 | 0.2443 ± 0.018 | 0.2123 ± 0.016 |
SVM+Normalize | 0.3554 ± 0.017 | 0.3943 ± 0.012 | 0.2532 ± 0.012 |
SVM+IPCA | 0.2642 ± 0.010 | 0.2421 ± 0.018 | 0.2122 ± 0.016 |
SVM+TBSS | 0.2844 ± 0.013 | 0.2142 ± 0.013 | 0.2242 ± 0.009 |
LG | 0.1334 ± 0.013 | 0.3034 ± 0.009 | 0.1732 ± 0.016 |
LG+PCA | 0.2586 ± 0.010 | 0.2394 ± 0.028 | 0.1975 ± 0.018 |
LG+Normalize | 0.2832 ± 0.020 | 0.2212 ± 0.042 | 0.2523 ± 0.032 |
LG+IPCA | 0.2212 ± 0.010 | 0.1863 ± 0.038 | 0.1524 ± 0.036 |
LG+TBSS | 0.3234 ± 0.029 | 0.2413 ± 0.014 | 0.2523 ± 0.021 |
NB | 0.1313 ± 0.022 | 0.2726 ± 0.025 | 0.1613 ± 0.027 |
NB+PCA | 0.3086 ± 0.010 | 0.2426 ± 0.018 | 0.2111 ± 0.016 |
NB+Normalize | 0.3132 ± 0.023 | 0.1713 ± 0.009 | 0.0239 ± 0.010 |
NB+IPCA | 0.2845 ± 0.015 | 0.2234 ± 0.027 | 0.2132 ± 0.023 |
NB+TBSS | 0.3477 ± 0.020 | 0.2663 ± 0.017 | 0.1856 ± 0.012 |
Model | Precision | Recall | F1 Score |
---|---|---|---|
Activity Recognition () | |||
KNN | 0.1734 ± 0.027 | 0.3023 ± 0.011 | 0.2113 ± 0.019 |
KNN+PCA | 0.1543 ± 0.016 | 0.1434 ± 0.014 | 0.1665 ± 0.025 |
KNN+Normalize | 0.1932 ± 0.028 | 0.2075 ± 0.024 | 0.2157 ± 0.023 |
KNN+IPCA | 0.1374 ± 0.016 | 0.1834 ± 0.014 | 0.1656 ± 0.015 |
KNN+TBSS | 0.0934 ± 0.015 | 0.1074 ± 0.013 | 0.0455 ± 0.013 |
DT | 0.2664 ± 0.014 | 0.2767 ± 0.015 | 0.2568 ± 0.014 |
DT+PCA | 0.2964 ± 0.016 | 0.2221 ± 0.018 | 0.2468 ± 0.025 |
DT+Normalize | 0.2635 ± 0.013 | 0.2867 ± 0.013 | 0.2434 ± 0.024 |
DT+IPCA | 0.3168 ± 0.024 | 0.2878 ± 0.023 | 0.2767 ± 0.027 |
DT+TBSS | 0.3441 ± 0.027 | 0.2342 ± 0.039 | 0.3084 ± 0.024 |
SVM | 0.1345 ± 0.022 | 0.3066 ± 0.012 | 0.1753 ± 0.037 |
SVM+PCA | 0.3105 ± 0.026 | 0.2368 ± 0.046 | 0.2323 ± 0.013 |
SVM+Normalize | 0.3243 ± 0.023 | 0.2163 ± 0.035 | 0.2435 ± 0.033 |
SVM+IPCA | 0.2532 ± 0.016 | 0.2463 ± 0.046 | 0.2513 ± 0.014 |
SVM+TBSS | 0.3021 ± 0.035 | 0.2147 ± 0.033 | 0.2373 ± 0.013 |
LG | 0.1924 ± 0.013 | 0.2432 ± 0.010 | 0.1703 ± 0.010 |
LG+PCA | 0.2523 ± 0.029 | 0.2154 ± 0.024 | 0.2163 ± 0.014 |
LG+Normalize | 0.2935 ± 0.026 | 0.1964 ± 0.016 | 0.2243 ± 0.035 |
LG+IPCA | 0.2626 ± 0.037 | 0.2172 ± 0.013 | 0.2025 ± 0.015 |
LG+TBSS | 0.2834 ± 0.016 | 0.2053 ± 0.014 | 0.2153 ± 0.015 |
NB | 0.2653 ± 0.024 | 0.2753 ± 0.025 | 0.3026 ± 0.014 |
NB+PCA | 0.2923 ± 0.014 | 0.2845 ± 0.015 | 0.2624 ± 0.053 |
NB+Normalize | 0.3129 ± 0.012 | 0.2965 ± 0.023 | 0.3024 ± 0.012 |
NB+IPCA | 0.3123 ± 0.023 | 0.2754 ± 0.023 | 0.2636 ± 0.014 |
NB+TBSS | 0.3534 ± 0.017 | 0.2623 ± 0.034 | 0.2753 ± 0.027 |
Model (TBSS) | Precision | Recall | F1 Score |
---|---|---|---|
Localization Dataset | |||
= 20 | |||
KNN | 0.0612 ± 0.012 | 0.0723 ± 0.011 | 0.0353 ± 0.012 |
DT | 0.3134 ± 0.032 | 0.0713 ± 0.017 | 0.1462 ± 0.016 |
SVM | 0.2052 ± 0.013 | 0.0746 ± 0.013 | 0.1134 ± 0.032 |
LG | 0.2852 ± 0.038 | 0.1153 ± 0.013 | 0.1524 ± 0.013 |
NB | 0.2626 ± 0.011 | 0.1163 ± 0.013 | 0.1323 ± 0.012 |
= 30 | |||
KNN | 0.0253 ± 0.010 | 0.0623 ± 0.011 | 0.0352 ± 0.011 |
DT | 0.2353 ± 0.022 | 0.1275 ± 0.041 | 0.1253 ± 0.023 |
SVM | 0.3125 ± 0.012 | 0.1035 ± 0.012 | 0.1153 ± 0.012 |
LG | 0.1834 ± 0.042 | 0.0734 ± 0.014 | 0.0835 ± 0.014 |
NB | 0.1953 ± 0.031 | 0.0922 ± 0.021 | 0.1073 ± 0.011 |
= 40 | |||
KNN | 0.0134 ± 0.007 | 0.0324 ± 0.008 | 0.0132 ± 0.005 |
DT | 0.2035 ± 0.056 | 0.1652 ± 0.022 | 0.1764 ± 0.012 |
SVM | 0.4212 ± 0.010 | 0.1662 ± 0.024 | 0.1823 ± 0.032 |
LG | 0.1626 ± 0.033 | 0.1023 ± 0.012 | 0.1043 ± 0.042 |
NB | 0.3017 ± 0.028 | 0.0845 ± 0.012 | 0.1115 ± 0.051 |
Parameters | Epoch | Precision | Recall | F1 Score |
---|---|---|---|---|
| | 10 | 0.2952 ± 0.025 | 0.0942 ± 0.012 | 0.1343 ± 0.022 |
| | 20 | 0.3045 ± 0.032 | 0.1084 ± 0.032 | 0.1253 ± 0.022 |
| | 30 | 0.3264 ± 0.023 | 0.1311 ± 0.023 | 0.1442 ± 0.014 |
| | 40 | 0.3323 ± 0.022 | 0.1434 ± 0.013 | 0.1503 ± 0.035 |
| | 50 | 0.3353 ± 0.015 | 0.1334 ± 0.024 | 0.1452 ± 0.024 |
| | 10 | 0.3035 ± 0.010 | 0.0823 ± 0.013 | 0.1224 ± 0.022 |
| | 20 | 0.3130 ± 0.012 | 0.1026 ± 0.012 | 0.1164 ± 0.009 |
| | 30 | 0.3364 ± 0.024 | 0.1282 ± 0.034 | 0.1033 ± 0.013 |
| | 40 | 0.3464 ± 0.013 | 0.1374 ± 0.024 | 0.1079 ± 0.031 |
| | 50 | 0.3457 ± 0.022 | 0.1275 ± 0.015 | 0.1135 ± 0.023 |
| | 10 | 0.4223 ± 0.010 | 0.1452 ± 0.025 | 0.1653 ± 0.024 |
| | 20 | 0.4232 ± 0.030 | 0.1674 ± 0.024 | 0.1854 ± 0.022 |
| | 30 | 0.4423 ± 0.012 | 0.1534 ± 0.016 | 0.1734 ± 0.033 |
| | 40 | 0.4564 ± 0.020 | 0.1241 ± 0.015 | 0.1477 ± 0.032 |
| | 50 | 0.4542 ± 0.028 | 0.1537 ± 0.025 | 0.1652 ± 0.025 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Zhong, Y.; Fong, S.; Hu, S.; Wong, R.; Lin, W. A Novel Sensor Data Pre-Processing Methodology for the Internet of Things Using Anomaly Detection and Transfer-By-Subspace-Similarity Transformation. Sensors 2019, 19, 4536. https://doi.org/10.3390/s19204536