# Self-Adaptive Pre-Processing Methodology for Big Data Stream Mining in Internet of Things Environmental Sensor Monitoring


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Materials and Methods

#### 3.1. Variable Sub-Window Division

A sliding window is partitioned as $W = \sum_{i=1}^{n} {sw}_{i}$, where ${sw}_{i}$ is the width of the $i$th sub-window in the sliding window. Note that the ${sw}_{i}$ are not necessarily equal because of variable division; $n$ is the number of divided sub-windows, with the range $1 \le n \le W$, and $W$ is the length of the sliding window.

Rather than equal division, which may break an underlying pattern at fixed boundaries (e.g., the peak split by sub-window ${sw}_{3}$ of the first sliding window ${W}_{1}$ in the equal division of Figure 1), our method prefers a variable sort of sub-window division over the whole sliding window (lower half of Figure 1, titled 'variable interval sub-window division'). It can generate potentially efficient decompositions of the expected patterns by randomly choosing both the length and the number of sub-windows within a sliding window. Additionally, a pool with sufficient candidates of various sub-window combinations (i.e., every segmentation ${seg}_{i}$ in the variable division of Figure 1) is produced through multiple random segmentations. These candidates make up the population of the search space that serves as the input for later processing. The advantage of variable sub-windows is that they provide more inclusive and diverse sub-window combinations. From these combinations, we try to find sub-window lengths and numbers that match the underlying pattern distribution (e.g., in Figure 1, the peak represented by the last sub-window ${sw}_{4}$ in segmentation ${seg}_{2}$ of the first sliding window ${W}_{1}$ under variable division). This should work well compared with non-self-adaptive (fixed) subdivisions of equal size, which may miss significant pattern information.
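As an illustration, the random variable-width segmentation described above can be sketched as follows. This is a minimal sketch, not the authors' implementation; the names `random_segmentation` and `candidate_pool` are our own:

```python
import random

def random_segmentation(window_len, rng=None):
    """Randomly split a sliding window of length `window_len` into n
    variable-width sub-windows (1 <= n <= window_len) whose widths
    sum exactly to the window length."""
    rng = rng or random.Random()
    widths = []
    remaining = window_len
    while remaining > 0:
        w = rng.randint(1, remaining)  # random sub-window width
        widths.append(w)
        remaining -= w
    return widths

def candidate_pool(window_len, n_candidates, seed=0):
    """Build a pool of candidate segmentations: the initial search
    population consumed by the later optimization step."""
    rng = random.Random(seed)
    return [random_segmentation(window_len, rng) for _ in range(n_candidates)]

pool = candidate_pool(window_len=100, n_candidates=50)
```

Each candidate in `pool` is one segmentation $seg_i$: a list of sub-window widths that exactly covers a sliding window of length 100.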

#### 3.2. Statistical Feature Extraction

Statistical feature extraction (SFX) operates on each sub-window ${sw}_{i}$ taken from a particular sliding window ${W}_{i}$. Different sub-window sizes will naturally lead to different extracted attributes, so the boundary of SFX (the dashed rectangle area in Figure 2) is the sub-window length ${sw}_{i}$. Four extra features are involved in our algorithm, as shown in Figure 2: volatility, the Hurst exponent, moving average convergence/divergence (MACD), and distance.

Volatility is modeled by a generalized autoregressive conditional heteroskedasticity (GARCH) process, where $p$ is the order of the GARCH terms ${\sigma}^{2}$ and $q$ is the order of the ARCH terms ${\epsilon}^{2}$:

$$ {\sigma}_{t}^{2} = {\alpha}_{0} + \sum_{i=1}^{q}{\alpha}_{i}{\epsilon}_{t-i}^{2} + \sum_{j=1}^{p}{\beta}_{j}{\sigma}_{t-j}^{2} $$
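To make the per-sub-window feature computation concrete, the sketch below computes two of the four features for one sub-window. It is a simplification under stated assumptions: volatility is taken as the plain standard deviation of first differences (the paper fits a GARCH model), and MACD uses the conventional 12/26 EMA spans, which the source does not confirm; the function names are our own:

```python
import numpy as np

def ema(x, span):
    """Exponential moving average with smoothing factor 2/(span+1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

def sub_window_features(x, fast=12, slow=26):
    """Return (volatility, macd) for one sub-window.
    Volatility: std of first differences (simplified stand-in for GARCH).
    MACD: final difference between a fast EMA and a slow EMA."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)
    volatility = float(diffs.std()) if len(diffs) else 0.0
    macd = float(ema(x, fast)[-1] - ema(x, slow)[-1])
    return volatility, macd

vol, macd = sub_window_features(np.sin(np.linspace(0, 6, 40)))
```

Applied to every sub-window of a candidate segmentation, this yields the expanded attribute vector that feeds the later feature-selection stage.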

The distance feature is derived from k-means clustering, which partitions the observations into $k$ clusters $Y = [{y}_{1}, {y}_{2}, \ldots, {y}_{k}]$ so that the sum of the intra-cluster squared distances of every observation ${X}^{t} = [{x}_{1}^{t}, {x}_{2}^{t}, \ldots, {x}_{n}^{t}]$ of ${D}_{t}$ within cluster ${y}_{i}$ to the $i$th centroid is a minimum:

$$ \underset{Y}{\mathrm{arg\,min}} \sum_{i=1}^{k} \sum_{{X}^{t} \in {y}_{i}} {\left\| {X}^{t} - {\mu}_{i} \right\|}^{2} $$

where ${\mu}_{i}$ is the center of cluster ${y}_{i}$. Since there are only two clusters ($k = 2$) in the test experiment, feature ${d}_{1}$ is the intra-cluster distance of an observation ${X}^{t}$ to its own cluster, and feature ${d}_{2}$ is the inter-cluster distance to the other, exclusive cluster.
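A minimal sketch of the two distance features, assuming a plain Lloyd's-algorithm k-means with $k = 2$; the helper names `kmeans` and `distance_features` are ours, not the authors':

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Minimal Lloyd's k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each observation to its nearest centroid
        labels = np.argmin(
            ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def distance_features(x_t, centroids, own_label):
    """d1: distance of observation x_t to its own cluster center;
    d2: distance to the other (exclusive) cluster center (k = 2)."""
    d = np.linalg.norm(centroids - x_t, axis=1)
    return d[own_label], d[1 - own_label]

# two well-separated groups of identical points
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
centroids, labels = kmeans(X)
d1, d2 = distance_features(X[0], centroids, labels[0])
```

For an observation sitting in a tight cluster, `d1` is near zero while `d2` measures the separation from the opposite cluster.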

#### 3.3. Alternative Feature Selection

#### 3.4. Clustering-Based PSO

In particle swarm optimization, the personal best position (${pbest}_{i}$) found by particle $i$ after the $i$th iteration and the global best position ($gbest$) affect the velocity update to a degree decided by random weights:

$$ {v}_{i}(t+1) = {v}_{i}(t) + {c}_{1}{r}_{1}\left({pbest}_{i} - {x}_{i}(t)\right) + {c}_{2}{r}_{2}\left(gbest - {x}_{i}(t)\right) $$

$$ {x}_{i}(t+1) = {x}_{i}(t) + {v}_{i}(t+1) $$

where ${c}_{1}$ and ${c}_{2}$ are learning factors with typical values ${c}_{1} = {c}_{2} = 2$, and ${r}_{1}$ and ${r}_{2}$ are uniformly distributed random values within $[0, 1]$.
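One iteration of the basic update rule above can be sketched as follows. This is an illustrative sketch of plain PSO without the clustering-based fitness of the full CPSO model; the function name `pso_step` is our own:

```python
import random

def pso_step(positions, velocities, pbest, gbest,
             c1=2.0, c2=2.0, rng=None):
    """One velocity + position update of basic PSO (no inertia weight):
    v <- v + c1*r1*(pbest - x) + c2*r2*(gbest - x); x <- x + v."""
    rng = rng or random.Random(0)
    for i, x in enumerate(positions):
        for d in range(len(x)):
            r1, r2 = rng.random(), rng.random()
            velocities[i][d] += (c1 * r1 * (pbest[i][d] - x[d])
                                 + c2 * r2 * (gbest[d] - x[d]))
            x[d] += velocities[i][d]
    return positions, velocities

positions = [[0.0, 0.0], [1.0, 1.0]]
velocities = [[0.0, 0.0], [0.0, 0.0]]
pbest = [[0.5, 0.5], [1.0, 1.0]]
gbest = [0.5, 0.5]
pso_step(positions, velocities, pbest, gbest)
```

In the sub-window search, each particle's position would encode one candidate segmentation, and the fitness guiding `pbest`/`gbest` comes from the clustering agreement score described in the workflow of Figure 3.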

## 4. Experiments

## 5. Results

## 6. Discussion

## 7. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References


**Figure 1.** Illustration of equal and variable size sub-window segmentation under each sliding window of a big data stream. The equal method divides the sliding window into many sub-windows of a fixed size, while the variable method randomly divides the sliding window into sub-windows of different widths.

**Figure 2.** Demonstration of statistical feature extraction (SFX) on a windowed time series with sub-window segmentation. Within each variable-length sub-window, additional features, including volatility, moving average convergence/divergence (MACD), Hurst exponent, and distance, are extracted from the original data to provide more informative attributes and expand the primitive data.

**Figure 3.** Self-adaptive workflow of a wrapped clustering-based particle swarm optimization (CPSO) model for sub-window optimization. The wrapped loop starts from the variable sub-window division; feature extraction and selection are then applied to generate the corresponding search candidates with subset features. After that, a clustering technique is used, and the agreement between the clustered and existing labels replaces the traditional classifier accuracy as the fitness score of the PSO search. The workflow ends when a combination of variable-size sub-windows is found that yields the best fitness.

**Figure 4.** The overall averaged accuracy comparison of different data stream pre-processing methods. Differently colored columns show the classification accuracies on the datasets to which each pre-processing method is applied. The black average curve shows that our proposed pre-processing method (the rightmost group) ultimately yields the best classification accuracy.

**Figure 5.** An overall averaged accuracy comparison across the testing datasets. Different colors represent the classification accuracies of the pre-processing methods applied to each dataset. The grey column in each dataset group also shows that our proposed pre-processing method performs excellently.

**Figure 6.** Visualization within one sliding window of the variable optimal sub-window division found by the proposed self-adaptive method on various Internet of Things (IoT) datasets: (**a**) Home: the distribution of almost all wave peaks with large amplitudes is detected; (**b**) Gas: a separation gap is found between the particular shapes of two waves; (**c**) Ocean: instead of amplitude, the division follows wave density, distinguishing intensive (compressed) from extensive (loose) regions; (**d**) Electricity: the distribution of peaks with obvious amplitude differentials (yellow color) is identified.

**Figure 7.** The overall performance results on each IoT dataset with different pre-processing methods: (**a**) Home; (**b**) Gas; (**c**) Ocean; (**d**) Electricity. Similar to Figure 5, differently colored lines represent the various classification evaluation metrics for the pre-processing methods applied to those datasets. The grey line in each dataset plot also indicates that our proposed pre-processing method produces the best overall performance.

| Dataset | Acc (%) | Kappa | TPR | FPR | Precision | Recall | F1 | MCC | ROC | Time (min) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Home** | | | | | | | | | | |
| original | 80.813 | 0.613 | 0.808 | 0.199 | 0.812 | 0.808 | 0.807 | 0.618 | 0.832 | 0.02 |
| sliding window | 79.260 | 0.583 | 0.793 | 0.212 | 0.794 | 0.793 | 0.792 | 0.584 | 0.826 | 12.57 |
| mnl-subw-eql | 81.279 | 0.622 | 0.813 | 0.194 | 0.816 | 0.813 | 0.812 | 0.627 | 0.844 | 16.26 |
| pso-subw-eql | 83.464 | 0.666 | 0.835 | 0.173 | 0.837 | 0.835 | 0.834 | 0.670 | 0.873 | 4535.58 |
| pso-subw-var | 84.013 | 0.678 | 0.840 | 0.163 | 0.840 | 0.840 | 0.840 | 0.678 | 0.847 | 5311.12 |
| cpso-subw-var | 84.728 | 0.686 | 0.843 | 0.159 | 0.850 | 0.843 | 0.843 | 0.693 | 0.887 | 1055.07 |
| **Gas** | | | | | | | | | | |
| original | 89.971 | 0.799 | 0.900 | 0.101 | 0.901 | 0.900 | 0.900 | 0.800 | 0.964 | 0.11 |
| sliding window | 90.856 | 0.817 | 0.909 | 0.090 | 0.911 | 0.909 | 0.908 | 0.819 | 0.969 | 51.23 |
| mnl-subw-eql | 92.045 | 0.841 | 0.920 | 0.079 | 0.921 | 0.920 | 0.920 | 0.842 | 0.974 | 55.40 |
| pso-subw-eql | 93.126 | 0.862 | 0.931 | 0.069 | 0.931 | 0.931 | 0.931 | 0.863 | 0.978 | 85770.57 |
| pso-subw-var | 93.735 | 0.875 | 0.937 | 0.061 | 0.939 | 0.937 | 0.937 | 0.876 | 0.980 | 90952.55 |
| cpso-subw-var | 95.493 | 0.910 | 0.955 | 0.045 | 0.956 | 0.955 | 0.955 | 0.910 | 0.987 | 6025.34 |
| **Ocean** | | | | | | | | | | |
| original | 78.495 | 0.570 | 0.785 | 0.215 | 0.793 | 0.785 | 0.784 | 0.577 | 0.876 | 0.01 |
| sliding window | 85.704 | 0.714 | 0.857 | 0.143 | 0.862 | 0.857 | 0.857 | 0.719 | 0.908 | 5.53 |
| mnl-subw-eql | 85.171 | 0.703 | 0.852 | 0.149 | 0.856 | 0.852 | 0.851 | 0.707 | 0.890 | 6.40 |
| pso-subw-eql | 87.109 | 0.742 | 0.871 | 0.129 | 0.872 | 0.871 | 0.871 | 0.743 | 0.913 | 1748.18 |
| pso-subw-var | 87.953 | 0.759 | 0.880 | 0.121 | 0.881 | 0.880 | 0.880 | 0.760 | 0.921 | 2291.27 |
| cpso-subw-var | 88.535 | 0.771 | 0.886 | 0.115 | 0.886 | 0.886 | 0.886 | 0.772 | 0.921 | 498.76 |
| **Electricity** | | | | | | | | | | |
| original | 75.626 | 0.486 | 0.756 | 0.286 | 0.762 | 0.756 | 0.749 | 0.499 | 0.792 | 0.03 |
| sliding window | 77.310 | 0.532 | 0.773 | 0.247 | 0.775 | 0.773 | 0.771 | 0.536 | 0.825 | 3.87 |
| mnl-subw-eql | 77.746 | 0.541 | 0.778 | 0.242 | 0.779 | 0.778 | 0.776 | 0.544 | 0.838 | 4.30 |
| pso-subw-eql | 77.456 | 0.533 | 0.775 | 0.249 | 0.777 | 0.775 | 0.771 | 0.538 | 0.845 | 2452.05 |
| pso-subw-var | 77.895 | 0.551 | 0.779 | 0.225 | 0.783 | 0.779 | 0.779 | 0.553 | 0.842 | 2424.08 |
| cpso-subw-var | 78.980 | 0.566 | 0.790 | 0.230 | 0.789 | 0.790 | 0.788 | 0.567 | 0.868 | 523.51 |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lan, K.; Fong, S.; Song, W.; Vasilakos, A.V.; Millham, R.C.
Self-Adaptive Pre-Processing Methodology for Big Data Stream Mining in Internet of Things Environmental Sensor Monitoring. *Symmetry* **2017**, *9*, 244.
https://doi.org/10.3390/sym9100244
