# Embedded Data Imputation for Environmental Intelligent Sensing: A Case Study

## Abstract

## 1. Introduction

## 2. Related Work

## 3. Methodology

#### 3.1. Dataset Choice and Description

#### 3.2. Dataset Impairment

#### 3.3. Methods Chosen for the Edge Data Imputation

**Mean imputation** is one of the most common and straightforward approaches: missing values are replaced with the mean of the variable under consideration. In our case, the missing ozone measurements are replaced with the mean of the observed ozone values. However, this method often produces poor results, because it distorts the standard deviation (every imputed value sits exactly at the mean, which reduces the spread) and does not account for the relationships among the variables.
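The effect on the standard deviation can be seen in a minimal sketch. The readings below are hypothetical and the helper name `mean_impute` is ours; this is an illustration, not the code used in the experiments.

```python
import math
import statistics


def mean_impute(values):
    """Replace every NaN with the mean of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    mean = statistics.fmean(observed)
    return [mean if math.isnan(v) else v for v in values]


# Hypothetical ozone readings with three missing entries (NaN).
ozone = [54.0, math.nan, 87.0, 127.0, math.nan, 215.0, 15.0, math.nan]
observed = [v for v in ozone if not math.isnan(v)]
imputed = mean_impute(ozone)

# The mean is preserved, but the spread shrinks because the filled-in
# points contribute zero deviation from the mean.
std_observed = statistics.pstdev(observed)
std_imputed = statistics.pstdev(imputed)
print(f"std of observed values: {std_observed:.2f}")
print(f"std after mean imputation: {std_imputed:.2f}")
```

The more values are imputed, the stronger the shrinkage, which is one reason mean imputation degrades at high impairment rates.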

**Multiple imputation by chained equations (MICE)** is a robust, statistically principled multiple imputation technique. It works by making multiple predictions for each missing value, filling in the missing data through an iterative series of predictive models, as explained in [39]. Azur et al. [40] provide a comprehensive description of the chained-equation approach to multiple imputation, together with an overview of the steps the MICE algorithm takes to reach convergence. In this work, we use the Python library function impyute.imputation.cs.mice, which differs from the implementation proposed by Buuren et al. [41] in two aspects, namely the stopping criterion and the variable to regress on (https://impyute.readthedocs.io/en/latest/_modules/impyute/imputation/cs/mice.html, accessed on 25 March 2021). We apply the technique to the whole dataset (the five columns described in Table 2).
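The iterative idea behind chained equations can be sketched with NumPy alone. The sketch below is a simplified, deterministic variant (plain least-squares regression, a fixed number of passes, a single imputed dataset); it is not the impyute implementation and not full MICE, which draws several imputations. The function name `chained_regression_impute` and the synthetic five-column data are assumptions made for illustration.

```python
import numpy as np


def chained_regression_impute(X, n_iter=10):
    """Simplified chained-equations imputation (illustrative only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialise every gap with the mean of its column.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            # Step 2: regress column j on all other columns (plus intercept),
            # fitting only on rows where column j was actually observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            obs = ~missing[:, j]
            coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            # Step 3: overwrite the gaps in column j with the predictions.
            filled[missing[:, j], j] = design[missing[:, j]] @ coef
    return filled


# Synthetic, strongly correlated five-column data (hypothetical values).
rng = np.random.default_rng(0)
shared = rng.normal(100.0, 40.0, size=200)
data = np.column_stack([shared + rng.normal(0.0, 5.0, size=200) for _ in range(5)])

# Invalidate roughly 20% of the first two columns.
holes = rng.random(data.shape) < 0.2
holes[:, 2:] = False
impaired = np.where(holes, np.nan, data)

imputed = chained_regression_impute(impaired)
mean_filled = np.where(holes, np.nanmean(impaired, axis=0), data)

rmse_chain = np.sqrt(np.mean((imputed[holes] - data[holes]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[holes] - data[holes]) ** 2))
print(f"RMSE chained regression: {rmse_chain:.2f}, RMSE mean imputation: {rmse_mean:.2f}")
```

Because each column is predicted from the others, the approach exploits inter-variable relationships that mean imputation ignores.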

**missForest data imputation** is an iterative imputation method based on random forests, introduced in [42]. It works by averaging over a number of unpruned classification or regression trees. In this work, we use the missForest method from the missingpy Python library. We apply the technique to the whole dataset (the five columns described in Table 2).
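As a rough illustration of forest-based iterative imputation, the sketch below uses scikit-learn's IterativeImputer with a RandomForestRegressor as the per-column estimator. This is a stand-in chosen for illustration, not the missingpy implementation used in this work, and its stopping rule and defaults may differ from missForest as defined in [42]. The synthetic data, the 10% impairment rate, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Synthetic, correlated five-column data (hypothetical values).
rng = np.random.default_rng(42)
shared = rng.normal(100.0, 40.0, size=300)
data = np.column_stack([shared + rng.normal(0.0, 5.0, size=300) for _ in range(5)])

# Invalidate roughly 10% of all entries at random.
holes = rng.random(data.shape) < 0.10
impaired = np.where(holes, np.nan, data)

# Each column with gaps is predicted in turn by a random forest trained on
# the other columns; the passes repeat up to max_iter times.
forest_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
imputed = forest_imputer.fit_transform(impaired)
print(imputed.shape, int(np.isnan(imputed).sum()))
```

Training a forest per column and per pass is what makes this family of methods comparatively expensive in time and memory on constrained devices.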

**kNN data imputation** fills in each missing data point based on the values of its k closest neighbours, identified using the Euclidean distance [43]. In this work, we use the KNNImputer method from the sklearn.impute Python library. We apply the technique to the whole dataset (the five columns described in Table 2). We chose k = 3 in order to keep the neighbour search to a minimum. This choice can be further optimized by analysing the impact of k on performance in relation to time and space complexity.
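A small worked example with KNNImputer and k = 3 is sketched below. The readings are hypothetical and use three columns instead of five for brevity; the imputer is otherwise left at its defaults (uniform weights, NaN-aware Euclidean distance).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical readings; the third value of the first row is missing.
readings = np.array([
    [10.0, 20.0, np.nan],
    [11.0, 21.0, 30.0],
    [12.0, 22.0, 33.0],
    [9.0, 19.0, 27.0],
    [50.0, 60.0, 90.0],
])

# k = 3: the gap is filled with the average of the three nearest rows that
# have the third column observed. The distant last row is not among them.
imputer = KNNImputer(n_neighbors=3)
imputed = imputer.fit_transform(readings)
print(imputed[0])  # third entry is the mean of 30, 33 and 27, i.e. 30.0
```

Here the three nearest neighbours of the first row are rows two, three and four, so the outlying fifth row does not influence the imputed value.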

#### 3.4. Experiment Design

In the **non-bursty case**, we compare the performance of the data imputation methods for impairment rates ranging from 1% to 99%. A step of 5% is used between the impairment rates of 5% and 95%; from 95% to 99%, a step of 1% is used.
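Read literally, the non-bursty sweep described above yields the following grid of impairment rates. The variable name is illustrative.

```python
# Impairment rates (in %) for the non-bursty sweep:
# 1%, then 5%..95% in steps of 5%, then 96%..99% in steps of 1%.
impairment_rates = [1] + list(range(5, 96, 5)) + list(range(96, 100))

print(impairment_rates)
print(len(impairment_rates))  # 24 rates in total
```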

In the **bursty case**, the methods are evaluated with burst sizes ranging from 5 to 200 in steps of 5. We also include the non-bursty scenario (burst size 1) for comparison. The impairment rate is kept within the 1% to 25% range.
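The burst sizes of the bursty sweep can be enumerated in the same way. The variable name is illustrative, and the impairment-rate step within the 1% to 25% range is not stated here, so it is left out.

```python
# Burst sizes for the bursty sweep: the non-bursty baseline (size 1),
# followed by 5..200 in steps of 5.
burst_sizes = [1] + list(range(5, 201, 5))

print(burst_sizes[:5], "...", burst_sizes[-1])
print(len(burst_sizes))  # 41 burst sizes in total
```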

## 4. Result Analysis

#### 4.1. Non-Bursty Case

#### 4.2. Bursty Case

#### 4.3. Time and Space Complexity

## 5. Discussion

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Ahmed, E.; Yaqoob, I.; Gani, A.; Imran, M.; Guizani, M. Internet-of-things-based smart environments: State of the art, taxonomy, and open research challenges. IEEE Wirel. Commun. **2016**, 23, 10–16.
2. Ge, M.; Bangui, H.; Buhnova, B. Big Data for Internet of Things: A Survey. Future Gener. Comput. Syst. **2018**, 87, 601–614.
3. Chen, S.; Zheng, Y.; Lu, W.; Varadarajan, V.; Wang, K. Energy-Optimal Dynamic Computation Offloading for Industrial IoT in Fog Computing. IEEE Trans. Green Commun. Netw. **2020**, 4, 566–576.
4. Xiang, X.; Gui, J.; Xiong, N.N. An integral data gathering framework for supervisory control and data acquisition systems in green IoT. IEEE Trans. Green Commun. Netw. **2021**, 5, 714–726.
5. Tariq, U.U.; Ali, H.; Liu, L.; Hardy, J.; Kazim, M.; Ahmed, W. Energy-aware scheduling of streaming applications on edge-devices in IoT-based healthcare. IEEE Trans. Green Commun. Netw. **2021**, 5, 803–815.
6. Pace, P.; Aloi, G.; Gravina, R.; Caliciuri, G.; Fortino, G.; Liotta, A. An Edge-Based Architecture to Support Efficient Applications for Healthcare Industry 4.0. IEEE Trans. Ind. Inform. **2019**, 15, 481–489.
7. Erhan, L.; Ndubuaku, M.; Di Mauro, M.; Song, W.; Chen, M.; Fortino, G.; Bagdasar, O.; Liotta, A. Smart anomaly detection in sensor systems: A multi-perspective review. Inf. Fusion **2021**, 67, 64–79.
8. Yu, W.; Liang, F.; He, X.; Hatcher, W.G.; Lu, C.; Lin, J.; Yang, X. A Survey on the Edge Computing for the Internet of Things. IEEE Access **2018**, 6, 6900–6919.
9. Savaglio, C.; Fortino, G. A Simulation-Driven Methodology for IoT Data Mining Based on Edge Computing. ACM Trans. Internet Technol. **2021**, 21, 1–22.
10. Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet Things J. **2020**, 7, 7457–7469.
11. Guo, Y.; Liu, F.; Xiao, N.; Chen, Z. Task-based resource allocation bid in edge computing micro datacenter. Comput. Mater. Contin. **2019**, 61, 777–792.
12. Liu, Z.; Qiu, X.; Zhang, S.; Deng, S.; Liu, G. Service scheduling based on edge computing for power distribution IoT. Comput. Mater. Contin. **2020**, 62, 1351–1364.
13. Wang, J.; Wu, W.; Liao, Z.; Jung, Y.W.; Kim, J.U. An Enhanced PROMOT Algorithm with D2D and Robust for Mobile Edge Computing. J. Internet Technol. **2020**, 21, 1437–1445.
14. Park, S.M.; Kim, Y.G. User profile system based on sentiment analysis for mobile edge computing. Comput. Mater. Contin. **2020**, 62, 569–590.
15. Tang, Q.; Wang, K.; Song, Y.; Li, F.; Park, J.H. Waiting time minimized charging and discharging strategy based on mobile edge computing supported by software-defined network. IEEE Internet Things J. **2020**, 7, 6088–6101.
16. Garcia-Laencina, P.; Sancho-Gomez, J.; Figueiras-Vidal, A. Pattern classification with missing data: A review. Neural Comput. Appl. **2010**, 19, 263–282.
17. Akouemo, H.N.; Povinelli, R.J. Data Improving in Time Series Using ARX and ANN Models. IEEE Trans. Power Syst. **2017**, 32, 3352–3359.
18. Rockel, T.; Joenssen, D.W.; Bankhofer, U. Decision Trees for the Imputation of Categorical Data. Arch. Data Sci. **2017**, 2, 1–15.
19. Li, F.; Zhang, X.; Du, C.; Huang, L. A hybrid NRS-CART algorithm and its application on coal mine floor water-inrush prediction. In Proceedings of the TENCON 2015-2015 IEEE Region 10 Conference, Macao, China, 1–4 November 2015; pp. 1–4.
20. Wang, G.; Deng, Z.; Choi, K.S. Tackling Missing Data in Community Health Studies Using Additive LS-SVM Classifier. IEEE J. Biomed. Health Inform. **2018**, 22, 579–587.
21. Arima, K.; Okada, N.; Tsuji, Y.; Kiguchi, K. Evaluations of a multiple SOMs method for estimating missing values. In Proceedings of the 2014 IEEE/SICE International Symposium on System Integration, Tokyo, Japan, 13–15 December 2014; pp. 796–801.
22. McMahan, B.; Ramage, D. Federated Learning: Collaborative Machine Learning without Centralized Training Data. 2017. Available online: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html (accessed on 11 November 2021).
23. Kolomvatsos, K.; Papadopoulou, P.; Anagnostopoulos, C.; Hadjiefthymiades, S. A Spatio-Temporal Data Imputation Model for Supporting Analytics at the Edge. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; Volume 11701, pp. 138–150.
24. Mary, I.P.S.; Arockiam, L. Imputing the missing data in IoT based on the spatial and temporal correlation. In Proceedings of the 2017 IEEE International Conference on Current Trends in Advanced Computing (ICCTAC), Bangalore, India, 2–3 March 2017; pp. 1–4.
25. Fountas, P.; Kolomvatsos, K. Ensemble based Data Imputation at the Edge. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; pp. 961–968.
26. Fountas, P.; Kolomvatsos, K. A Continuous Data Imputation Mechanism based on Streams Correlation. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 7–10 July 2020; pp. 1–6.
27. Pan, L.; Li, J. K-Nearest Neighbor Based Missing Data Estimation Algorithm in Wireless Sensor Networks. Wirel. Sens. Netw. **2010**, 2, 115–122.
28. Guastella, D.A.; Marcillaud, G.; Valenti, C. Edge-Based Missing Data Imputation in Large-Scale Environments. Information **2021**, 12, 195.
29. Fekade, B.; Maksymyuk, T.; Kyryk, M.; Jo, M. Probabilistic Recovery of Incomplete Sensed Data in IoT. IEEE Internet Things J. **2018**, 5, 2282–2292.
30. Zhang, L.; Bai, L.; Zhang, X.; Zhang, Y.; Sun, F.; Chen, C. Comparative variance and multiple imputation used for missing values in land price DataSet. Comput. Mater. Contin. **2019**, 61, 1175–1187.
31. González-Vidal, A.; Rathore, P.; Rao, A.S.; Mendoza-Bernal, J.; Palaniswami, M.; Skarmeta-Gómez, A.F. Missing Data Imputation With Bayesian Maximum Entropy for Internet of Things Applications. IEEE Internet Things J. **2021**, 8, 16108–16120.
32. Liu, Y.; Dillon, T.; Yu, W.; Rahayu, W.; Mostafa, F. Missing Value Imputation for Industrial IoT Sensor Data with Large Gaps. IEEE Internet Things J. **2020**, 7, 6855–6867.
33. Yan, X.; Xiong, W.; Hu, L.; Wang, F.; Zhao, K. Missing value imputation based on Gaussian mixture model for the Internet of Things. Math. Probl. Eng. **2015**, 2015, 548605.
34. Tkachenko, R.; Izonin, I.; Kryvinska, N.; Dronyuk, I.; Zub, K. An Approach towards Increasing Prediction Accuracy for the Recovery of Missing IoT Data based on the GRNN-SGTM Ensemble. Sensors **2020**, 20, 2625.
35. Kong, L.; Xia, M.; Liu, X.; Wu, M.; Liu, X. Data loss and reconstruction in sensor networks. In Proceedings of the 2013 Proceedings IEEE INFOCOM, Turin, Italy, 14–19 April 2013; pp. 1654–1662.
36. Peixoto, M.L.M.; Souza, I.; Barbosa, M.; Lecomte, G.; Batista, B.G.; Kuehne, B.T.; Filho, D.M.L. Data Missing Problem in Smart Surveillance Environment. In Proceedings of the 2018 International Conference on High Performance Computing & Simulation (HPCS), Orleans, France, 16–20 July 2018; pp. 962–969.
37. Xue, H.; Huang, B.; Qin, M.; Zhou, H.; Yang, H. Edge Computing for Internet of Things: A Survey. In Proceedings of the 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), Rhodes, Greece, 2–6 November 2020; pp. 755–760.
38. Ali, M.I.; Gao, F.; Mileo, A. CityBench: A Configurable Benchmark to Evaluate RSP Engines Using Smart City Datasets. In International Semantic Web Conference (ISWC); Springer: Bethlehem, PA, USA, 2015; pp. 374–389.
39. Raghunathan, T.; Lepkowski, J.; Van Hoewyk, J.; Solenberger, P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. **2001**, 27, 85–95.
40. Azur, M.J.; Stuart, E.A.; Frangakis, C.; Leaf, P.J. Multiple imputation by chained equations: What is it and how does it work? Int. J. Methods Psychiatr. Res. **2011**, 20, 40–49.
41. Buuren, S.V.; Groothuis-Oudshoorn, K. MICE: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. **2011**, 45, 1–67.
42. Stekhoven, D.J.; Bühlmann, P. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics **2012**, 28, 112–118.
43. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics **2001**, 17, 520–525.
44. Foundation, T.R.P. Raspberry Pi 4 Model B. 2020. Available online: https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications/ (accessed on 1 May 2021).

**Figure 7.** Execution times (s) on the laptop and the RPI 4B (4 GB) for the non-bursty case and varying impairment rates.

**Figure 8.** Colormap showing the execution time (s) in relation to the impairment rate and burst size for kNN, missForest and MICE data imputation.

**Figure 9.** Snapshot of the CPU and RAM usage for the non-bursty case with a 50% impairment rate on the RPI 4B (4 GB of RAM) for each algorithm (1: mean imputation, 2: MICE imputation, 3: missForest imputation, 4: kNN imputation).

References | Main Limitation | Our Approach |
---|---|---|
[25,26,27,28,30,31,32,33,35,36] | Data imputation techniques are applied within the IoT realm, where the experiments run on fixed platforms (sometimes through expensive tools such as Matlab or SPSS). | Imputation algorithms run directly on board the sensors, which are the main source of data. |
[24,25,26,32] | Classic missing error models are used, with the problem of bursty missing values not addressed. | The problem of bursty missing values is explicitly addressed, since it represents a real-world scenario in which a sensor could be unavailable for a certain period of time. |
[23,24,34,35] | The data imputation techniques are evaluated through performance indices (e.g., accuracy), but a time assessment is missing. | Performance analysis is complemented with a time assessment. |

Statistic | Ozone | Particulate Matter | Carbon Monoxide | Sulphur Dioxide | Nitrogen Dioxide |
---|---|---|---|---|---|
count | 17,568 | 17,568 | 17,568 | 17,568 | 17,568 |
mean | 92.42 | 106.12 | 100.54 | 131.66 | 159.18 |
std | 46.18 | 52.01 | 49.66 | 50.51 | 43.43 |
min | 15 | 15 | 15 | 15 | 18 |
25% | 54 | 60 | 56 | 99 | 134 |
50% | 87 | 107 | 99 | 131 | 173 |
75% | 127 | 146 | 138 | 177 | 193 |
max | 215 | 215 | 215 | 215 | 215 |

Type of Introduced Error (Random) | Description | Example |
---|---|---|
Non-bursty case | We randomly select the individual data points to be invalidated from the dataset, in order to reach the desired dataset impairment level. | 20 data points, desired impairment rate of 20%: 4 individual points are invalidated (N/A). |
Bursty case | We randomly select a corresponding number of bursts of a given size (number of data points) to be invalidated from the dataset, in order to reach the desired dataset impairment level. | 20 data points, desired impairment rate of 30% with burst size = 3: 6 points are invalidated (N/A) as part of 2 bursts of 3 data points each. |
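The two impairment modes can be sketched as index selection over a series of n data points. This is a minimal illustration using only the standard library, not the code used in the experiments. The function names are hypothetical, and placing bursts so that they never overlap (although two bursts may end up adjacent) is our assumption.

```python
import random


def impair_non_bursty(n_points, rate, rng):
    """Pick individual indices to invalidate until the rate is reached."""
    n_invalid = round(n_points * rate)
    return sorted(rng.sample(range(n_points), n_invalid))


def impair_bursty(n_points, rate, burst_size, rng):
    """Pick non-overlapping bursts of consecutive indices to invalidate."""
    n_bursts = round(n_points * rate) // burst_size
    free = n_points - n_bursts * burst_size
    # Choose burst positions in a compressed series where each burst counts
    # as one slot, then expand every slot back to burst_size points.
    slots = sorted(rng.sample(range(free + n_bursts), n_bursts))
    starts = [slot + i * (burst_size - 1) for i, slot in enumerate(slots)]
    return [s + k for s in starts for k in range(burst_size)]


rng = random.Random(7)
# Same settings as the two examples in the table.
single = impair_non_bursty(20, 0.20, rng)   # 4 isolated points
bursts = impair_bursty(20, 0.30, 3, rng)    # 2 bursts of 3 points each
print(single)
print(bursts)
```

The selected indices would then be set to N/A in the target column before the imputation methods are run.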

Hardware Specifications | RPI 4B | Laptop |
---|---|---|
RAM | 4 GB | 16 GB |
CPU | Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz | Intel(R) Core(TM) i7-8850H CPU @ 2.60 GHz |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Erhan, L.; Di Mauro, M.; Anjum, A.; Bagdasar, O.; Song, W.; Liotta, A.
Embedded Data Imputation for Environmental Intelligent Sensing: A Case Study. *Sensors* **2021**, *21*, 7774.
https://doi.org/10.3390/s21237774
