An Ensemble Method for Missing Data of Environmental Sensor Considering Univariate and Multivariate Characteristics

With rapid urbanization, awareness of environmental pollution is growing rapidly and, accordingly, interest in environmental sensors that measure atmospheric and indoor air quality is increasing. Since these IoT-based environmental sensors are sensitive and value reliability, it is essential to deal with missing values, which are one of the causes of reliability problems. Characteristics that can be used to impute missing values in environmental sensors are the time dependency of single variables and the correlation between multivariate variables. However, in the existing method of imputing missing values, only one characteristic has been used and there has been no case where both characteristics were used. In this work, we introduced a new ensemble imputation method reflecting this. First, the cases in which missing values occur frequently were divided into four cases and were generated into the experimental data: communication error (aperiodic, periodic), sensor error (rapid change, measurement range). To compare the existing method with the proposed method, five methods of univariate imputation and five methods of multivariate imputation—both of which are widely used—were used as a single model to predict missing values for the four cases. The values predicted by a single model were applied to the ensemble method. Among the ensemble methods, the weighted average and stacking methods were used to derive the final predicted values and replace the missing values. Finally, the predicted values, substituted with the original data, were evaluated by a comparison between the mean absolute error (MAE) and the root mean square error (RMSE). The proposed ensemble method generally performed better than the single method. In addition, this method simultaneously considers the correlation between variables and time dependence, which are characteristics that must be considered in the environmental sensor. As a result, our proposed ensemble technique can contribute to the replacement of the missing values generated by environmental sensors, which can help to increase the reliability of environmental sensor data.


Introduction
The concept of a smart city has become a trend, with rapid urbanization occurring worldwide. Accordingly, various technologies that are necessary for smart cities, such as the Internet of Things, machine learning, and big data applications have been developed. Among the various smart city technologies, interest in the deployment of applications for environmental pollution monitoring is increasing [1,2]. In addition, the environment is deteriorating due to economic activity, rapid urbanization, and increased energy consumption [3]. The World Health Organization (WHO) announced that air pollution, soil quality, and water quality are the biggest environmental risk factors for health. Air pollutants that penetrate through the respiratory tract and blood vessels adversely affect the lungs, heart, and brain [4,5]. As a result, people are increasingly interested in how the environment affects their health. Accordingly, interest in and demand for environmental sensors that may be advantageous to use only one of the two methods, or it may be more appropriate to substitute one, considering both methods at the same time. Therefore, the authors attempted to create various cases in which missing values can occur, and suggest which technique is appropriate for each case.
Therefore, we divided the cases where missing values occur into several categories. We have measured these using environmental sensors since 2020 and identified the types of missing values that occur frequently, using more than 22 devices. Considering this, the case was first divided into two types: errors in communication and errors in the sensor itself. Communication errors are divided into two types-periodic and aperiodic-considering the period. In the case of sensor error, a case was added assuming a rapid change in data and when the measurement range of the sensor itself was exceeded. We aimed to discover and suggest an appropriate imputation method, according to these four situations.
In cases of missing values, the univariate imputation and multivariate imputation techniques were applied, respectively. In univariate imputation, the existing univariate imputation technique was applied as is. For univariate imputation, linear interpolation (LI), spline interpolation (SI), last observation carried forward (LOCF), Kalman, and moving average (MA) methods were used. In multivariate imputation, five machine learning techniques were used, as follows: K-nearest neighbor (KNN), random forest (RF), linear regression (Reg), support vector machine (SVM), and miss forest (MF).
In order to consider both the time dependence of the univariate and the dependence of the multivariate on variables, an ensemble method was introduced, based on the predicted values from the univariate and multivariate. Among the ensemble methods, the weighted average and the stacking method were used. In the weighted average, different weights were set for univariate and multivariate data to obtain a weighted average. In addition, in the stacking, based on the values predicted by each model, the final meta-learner model was predicted to replace the missing values, again.
The rest of the paper is organized as follows: Section 2 describes the experimental environment, existing imputation model, proposed ensemble method, and evaluation method. In Section 3, the authors aim to check the differences between models, according to the evaluation method, and compare the models that were finally used. In Section 4, the problem that occurred during the experiment and the points that were supplemented, confirmed through the results, are described. The final section addresses the paper's conclusion.

Experimental Setup and Dataset
We formed an experimental group in Soongsil University to measure environmental sensors, as shown in Figure 1. This is because the indoor environment changes depending on the measurement location. When measuring multiple places, there is a disadvantage, in that the measuring range becomes excessively wide if it is measured from too far away [22]. Therefore, the environmental devices were placed in a line that could be controlled to some extent. Then, we checked that the linearity of the sensor was maintained, even if these two environments were slightly different. In Groups 1 and 2, places where people come in and out and places where people do not enter were set to reflect the effects on ventilation and movement. Sensors 2021, 21, x FOR PEER REVIEW 4 of 22 A total of 22 environmental sensor devices were used in the experiment. Twelve devices were set in Group 1, and 10 devices were set in Group 2. Group 1 was on the 5th floor and had a well-ventilated environment, while Group 2 was on the 1st basement floor and had a humid and low-temperature environment. For the devices used in the experiment that imputed missing values, 2 devices out of 12 were selected in Group 1. Two devices with linearity were selected: one device set for reference and the other device directly put into the situation of missing values. Experimental settings were created, as shown in Figure 2. The room size was 16 m 2 and had an air conditioning system on the ceiling. This room was a meeting room, where people come and go. The obtained data were transmitted to the server using long-range (LoRa) communication, and the transmitted data were used for analysis. Our dataset contains environmental data collected from 10 gas sensors deployed in Soongsil University. As shown in Figure 3, the sensor device was equipped with a total of 10 environmental sensor modules, including those for temperature, humidity, CO, CO2, TVOC, PM2.5, PM10, NO2, NH3, and H2S. As a communication method, STM32F429ZIT MCU was used as a LoRa environmental sensor to collect information through a universal asynchronous receiver transmitter (UART), an inter-integrated circuit (I 2 C), and an analog-digital converter (ADC) for various environmental sensors, and an external LoRa modem, which also communicated using a UART. The control unit used a remote calibration protocol and performed functions such as resetting the device and changing the cycle of the sensor. A total of 22 environmental sensor devices were used in the experiment. Twelve devices were set in Group 1, and 10 devices were set in Group 2. Group 1 was on the 5th floor and had a well-ventilated environment, while Group 2 was on the 1st basement floor and had a humid and low-temperature environment. For the devices used in the experiment that imputed missing values, 2 devices out of 12 were selected in Group 1. Two devices with linearity were selected: one device set for reference and the other device directly put into the situation of missing values. Experimental settings were created, as shown in Figure 2. The room size was 16 m 2 and had an air conditioning system on the ceiling. This room was a meeting room, where people come and go. The obtained data were transmitted to the server using long-range (LoRa) communication, and the transmitted data were used for analysis. A total of 22 environmental sensor devices were used in the experiment. Twelve devices were set in Group 1, and 10 devices were set in Group 2. Group 1 was on the 5th floor and had a well-ventilated environment, while Group 2 was on the 1st basement floor and had a humid and low-temperature environment. For the devices used in the experiment that imputed missing values, 2 devices out of 12 were selected in Group 1. Two devices with linearity were selected: one device set for reference and the other device directly put into the situation of missing values. Experimental settings were created, as shown in Figure 2. The room size was 16 m 2 and had an air conditioning system on the ceiling. This room was a meeting room, where people come and go. The obtained data were transmitted to the server using long-range (LoRa) communication, and the transmitted data were used for analysis. Our dataset contains environmental data collected from 10 gas sensors deployed in Soongsil University. As shown in Figure 3, the sensor device was equipped with a total of 10 environmental sensor modules, including those for temperature, humidity, CO, CO2, TVOC, PM2.5, PM10, NO2, NH3, and H2S. As a communication method, STM32F429ZIT MCU was used as a LoRa environmental sensor to collect information through a universal asynchronous receiver transmitter (UART), an inter-integrated circuit (I 2 C), and an analog-digital converter (ADC) for various environmental sensors, and an external LoRa modem, which also communicated using a UART. The control unit used a remote calibration protocol and performed functions such as resetting the device and changing the cycle of the sensor. Our dataset contains environmental data collected from 10 gas sensors deployed in Soongsil University. As shown in Figure 3, the sensor device was equipped with a total of 10 environmental sensor modules, including those for temperature, humidity, CO, CO 2 , TVOC, PM2.5, PM10, NO 2 , NH 3 , and H 2 S. As a communication method, STM32F429ZIT MCU was used as a LoRa environmental sensor to collect information through a universal asynchronous receiver transmitter (UART), an inter-integrated circuit (I 2 C), and an analogdigital converter (ADC) for various environmental sensors, and an external LoRa modem, which also communicated using a UART. The control unit used a remote calibration protocol and performed functions such as resetting the device and changing the cycle of the sensor.  The data-collection period was measured from October 2020 to September 2021, and the data used for the experiment were from 5 March 2021 to 5 April 2021. Regarding the data interval, it was possible to secure about 16,000 pieces of data in 10 min intervals. Using real time series data makes more sense than using simulation data. There is a clear difference between real and simulated data [23]. This is because the data formation using simulation data and the technique used to fill it can lead to a different result when empirical data are received. The outline of specifications, according to the sensor type, are shown in Table 1.

Missing Data Imputation Methodology
In many papers, when dealing with missing values, a dataset is obtained first, and then the missing values are generated. Missing values are randomly generated based on the missing completely at random (MCAR) process; however, this paper differs from previous papers, in that it considers the types of missing values separately and uses univariate and multivariate imputation at the same time, as shown in Figure 4.
In this experiment, the missing data ratio was set to several levels [24,25]. In this case, the reason for setting the missing data ratio differently was that each ratio had a different degree of influence on the data. When the missing rate was less than 1%, the effect was known to be negligible [26]. In addition, when the missing rate was between 1% and 5%, the data corresponded to manageable or flexible sample data. From the moment the missing rate reached 5% or more, a suitable solution was needed to handle missing values in the data [26]. From a missing rate of 15% or more, the missing value clearly affected the predictive model [27,28]. After generating missing values, various techniques were used to process them. Univariate imputation and multivariate imputation methods are usually The data-collection period was measured from October 2020 to September 2021, and the data used for the experiment were from 5 March 2021 to 5 April 2021. Regarding the data interval, it was possible to secure about 16,000 pieces of data in 10 min intervals. Using real time series data makes more sense than using simulation data. There is a clear difference between real and simulated data [23]. This is because the data formation using simulation data and the technique used to fill it can lead to a different result when empirical data are received. The outline of specifications, according to the sensor type, are shown in Table 1.

Missing Data Imputation Methodology
In many papers, when dealing with missing values, a dataset is obtained first, and then the missing values are generated. Missing values are randomly generated based on the missing completely at random (MCAR) process; however, this paper differs from previous papers, in that it considers the types of missing values separately and uses univariate and multivariate imputation at the same time, as shown in Figure 4.
In this experiment, the missing data ratio was set to several levels [24,25]. In this case, the reason for setting the missing data ratio differently was that each ratio had a different degree of influence on the data. When the missing rate was less than 1%, the effect was known to be negligible [26]. In addition, when the missing rate was between 1% and 5%, the data corresponded to manageable or flexible sample data. From the moment the missing rate reached 5% or more, a suitable solution was needed to handle missing values in the data [26]. From a missing rate of 15% or more, the missing value clearly affected the predictive model [27,28]. After generating missing values, various techniques were used to process them. Univariate imputation and multivariate imputation methods are usually used, depending on the data type. In univariate imputation, mean, mode, LI, SI, LOCF, Kalman, and MA methods are traditionally used. In multivariate imputation, KNN [29], RF [30], regression [31], SVM, and SVD [32] are traditionally used. Finally, using the above techniques, the missing data are usually processed by evaluation (MAE, RMSE, etc.) through a comparison between the predicted missing value and the actual value [33,34]. Since the data collected by the smart environmental sensors were time series data, which were sequentially collected, and various environmental factors must be considered together, we propose an algorithm that uses both methods together. The predictions from each imputation were collected and the weighted average and stacking algorithms were used to lower the evaluation values of the missing values.  [29], RF [30], regression [31], SVM, and SVD [32] are traditionally used. Finally, using the above techniques, the missing data are usually processed by evaluation (MAE, RMSE, etc.) through a comparison between the predicted missing value and the actual value [33,34]. Since the data collected by the smart environmental sensors were time series data, which were sequentially collected, and various environmental factors must be considered together, we propose an algorithm that uses both methods together. The predictions from each imputation were collected and the weighted average and stacking algorithms were used to lower the evaluation values of the missing values.

Missing Data Type
The types of missing values are usually classified into three mechanisms, defined by Little and Rubin in 1987. These mechanisms are missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) [9]. In addition to the typical missing value types, defined by Little and Rubin, missing value cases were defined considering the characteristics of each sensor type, identified in Section 2.1. As can be seen in Table 2, we first divided the missing values into two cases: communication error and sensor error. In the communication error cases, the types of missing value were divided into two types: aperiodic and periodic. In the sensor error cases, the missing types were divided into rapid change and measurement range.

Missing Data Type
The types of missing values are usually classified into three mechanisms, defined by Little and Rubin in 1987. These mechanisms are missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) [9]. In addition to the typical missing value types, defined by Little and Rubin, missing value cases were defined considering the characteristics of each sensor type, identified in Section 2.1. As can be seen in Table 2, we first divided the missing values into two cases: communication error and sensor error. In the communication error cases, the types of missing value were divided into two types: aperiodic and periodic. In the sensor error cases, the missing types were divided into rapid change and measurement range.

. Communication Error Cases
Communication instability was the most common case of missing values. This is the unavoidable task of IoT sensors operating in a wireless environment. As the device used in our experiment also used a communication method called long-range communication (LoRa), many errors were made in the communication terminal. LoRa has the advantages of having low power and a wide range, but LoRa with a low-power, wide-area network (LPWAN) has the disadvantage of a low transmission rate. In the experiment measuring the controlling switch in Nur-A-Alam, a signal loss of 9% was produced [35]. In Basford's experiment, over 20 devices sent 135,000 messages, but only 72.4% were received by the server [36]. By checking the transmission rate using the received signal strength indicator (RSSI) in our sensor, it was confirmed that a similar problem occurred. A missing value for communication errors occurred in the LoRa-based environmental sensor device in use, as shown in Figure 5. As can be seen from Figure

. Communication Error Cases
Communication instability was the most common case of missing values. This is the unavoidable task of IoT sensors operating in a wireless environment. As the device used in our experiment also used a communication method called long-range communication (LoRa), many errors were made in the communication terminal. LoRa has the advantages of having low power and a wide range, but LoRa with a low-power, wide-area network (LPWAN) has the disadvantage of a low transmission rate. In the experiment measuring the controlling switch in Nur-A-Alam, a signal loss of 9% was produced [35]. In Basford's experiment, over 20 devices sent 135,000 messages, but only 72.4% were received by the server [36]. By checking the transmission rate using the received signal strength indicator (RSSI) in our sensor, it was confirmed that a similar problem occurred. A missing value for communication errors occurred in the LoRa-based environmental sensor device in use, as shown in Figure 5. As can be seen from Figure   Aperiodic and periodic missing values were classified within the communication errors category. It is common for missing values to occur completely randomly over time. However, since missing values may appear periodically, due to any cause, it was considered meaningful to devise a method to handle missing values in such cases. The periodic signal was generated according to the missing rate. The initial missing points were randomized and periodically generated.
The graph shown in Figure 6 is the result of introducing missing values to the communication error case. This was the situation for CO2, and the missing rate was 10%. Figure 6a,b correspond to communication errors-periodic and aperiodic errors, respectively-therefore, as shown in (b), it can be confirmed that missing values occurred at regular intervals. Aperiodic and periodic missing values were classified within the communication errors category. It is common for missing values to occur completely randomly over time. However, since missing values may appear periodically, due to any cause, it was considered meaningful to devise a method to handle missing values in such cases. The periodic signal was generated according to the missing rate. The initial missing points were randomized and periodically generated.
The graph shown in Figure 6 is the result of introducing missing values to the communication error case. This was the situation for CO 2 , and the missing rate was 10%. Figure 6a,b correspond to communication errors-periodic and aperiodic errors, respectively-therefore, as shown in (b), it can be confirmed that missing values occurred at regular intervals.

Sensor Error Cases
The error types that occurred in the sensor itself are classified. First, missing values occurred when the measured values of the sensor changed rapidly. This often occurs in special circumstances, such as when someone smokes a cigarette. In consideration of situations in which the sensor could not detect a sudden change in data, a missing value was generated when the slope between the data was greater than or equal to a certain value.
In addition, missing values can occur according to the measurement range of the sensor. The most affected factor was CO2, and a problem could be found, where the range of the CO2 sensor usually starts at 400 ppm. Since the lowest CO2 concentration in the atmosphere is 400 ppm, values cannot be measured below 400 ppm. The detection range of SVM 30-the CO2 sensor of the device we used-started at 400 ppm, and the same was true of UA50-VOC, which was a separate measurement module. Figure 7 shows the SVM 30 module measurements from 2 September 2021, and it can be seen that the CO2 value between 02:00 and 08:00 was fixed at 400 ppm.
In Figure 8a,b show when an error occurred in the sensor. This was the same situation as seen above for CO2, among other environmental substances, and the missing rate was 10%. Figure 8a shows a case where a missing value occurred when a rapid change occurred in the sensor, and Figure 8b shows a graph indicating when a certain measurement range in a sensor was exceeded.

Sensor Error Cases
The error types that occurred in the sensor itself are classified. First, missing values occurred when the measured values of the sensor changed rapidly. This often occurs in special circumstances, such as when someone smokes a cigarette. In consideration of situations in which the sensor could not detect a sudden change in data, a missing value was generated when the slope between the data was greater than or equal to a certain value.
In addition, missing values can occur according to the measurement range of the sensor. The most affected factor was CO 2 , and a problem could be found, where the range of the CO 2 sensor usually starts at 400 ppm. Since the lowest CO 2 concentration in the atmosphere is 400 ppm, values cannot be measured below 400 ppm. The detection range of SVM 30-the CO 2 sensor of the device we used-started at 400 ppm, and the same was true of UA50-VOC, which was a separate measurement module. Figure 7 shows the SVM 30 module measurements from 2 September 2021, and it can be seen that the CO 2 value between 02:00 and 08:00 was fixed at 400 ppm.
In Figure 8a,b show when an error occurred in the sensor. This was the same situation as seen above for CO 2, among other environmental substances, and the missing rate was 10%. Figure 8a shows a case where a missing value occurred when a rapid change occurred in the sensor, and Figure 8b shows a graph indicating when a certain measurement range in a sensor was exceeded. The reason for dividing the cases like this is clear. First, it can be used as a background to select an appropriate imputation algorithm. Second, this knowledge helps to build a reasonable simulator that can eliminate missing values [18]. In addition to these cases, there were many cases where missing values occurred, but the four most frequent cases The reason for dividing the cases like this is clear. First, it can be used as a background to select an appropriate imputation algorithm. Second, this knowledge helps to build a reasonable simulator that can eliminate missing values [18]. In addition to these cases, there were many cases where missing values occurred, but the four most frequent cases The reason for dividing the cases like this is clear. First, it can be used as a background to select an appropriate imputation algorithm. Second, this knowledge helps to build a reasonable simulator that can eliminate missing values [18]. In addition to these cases, there were many cases where missing values occurred, but the four most frequent cases were selected. It is also necessary to consider additional situations, such as the occurrence of missing values due to human error or power supply.

Missing Value Imputation by Single Model
The method of imputation was divided into univariate imputation using time dependence, and multivariate imputation using the correlation between variables. In addition, the methods that were mainly used in each method are the traditional methods, because the existing models are reliable, fast, and uncomplicated [37].

Imputation in Univariate Data
Univariate time series data form a sequence of single observations at successive timepoints. Although usually considered a column of observations, time is actually an implicit variable [18]. The methods used in this section were replacement methods using time dependency. Therefore, as shown in Figure 9, the value of the autocorrelation function (ACF) exceeded the upper limit, so there was an autocorrelation. were selected. It is also necessary to consider additional situations, such as the occurrence of missing values due to human error or power supply.

Missing Value Imputation by Single Model
The method of imputation was divided into univariate imputation using time dependence, and multivariate imputation using the correlation between variables. In addition, the methods that were mainly used in each method are the traditional methods, because the existing models are reliable, fast, and uncomplicated [37].

Imputation in Univariate Data
Univariate time series data form a sequence of single observations at successive timepoints. Although usually considered a column of observations, time is actually an implicit variable [18]. The methods used in this section were replacement methods using time dependency. Therefore, as shown in Figure 9, the value of the autocorrelation function (ACF) exceeded the upper limit, so there was an autocorrelation. The method of univariate imputation used here was as follows: 1. Linear interpolation: To estimate the missing value, the value of both endpoints was used to linearly estimate the missing value, according to the linear distance. LI was used to improve missing value replacement performance in the field of genotype replacement and machine translation [38,39]. 2. Spline interpolation: Estimate missing values, using low-order polynomials, by dividing them into subintervals. This is also used to replace solar data and is being developed as a method for a distributed data modeling called thin-plate spline interpolation [40,41]

Imputation in Multivariate Data
Multivariate data are data with multiple independent variables. The methods used in this part were substitution methods, using dependencies between variables. Therefore, The method of univariate imputation used here was as follows: 1.
Linear interpolation: To estimate the missing value, the value of both endpoints was used to linearly estimate the missing value, according to the linear distance. LI was used to improve missing value replacement performance in the field of genotype replacement and machine translation [38,39].

2.
Spline interpolation: Estimate missing values, using low-order polynomials, by dividing them into subintervals. This is also used to replace solar data and is being developed as a method for a distributed data modeling called thin-plate spline interpolation [40,41].

3.
Last observation carried forward imputation (LOCF): Estimate missing values using data gathered just before the occurrence of missing values. This method is often used in longitudinal studies. 4.
Moving average imputation: Estimate the missing value as the average of a window of a certain size around the missing value. This technique is mainly used for time series data. 5.
Kalman imputation: Estimate missing values using Kalman smoothing. There was also a recent study on the treatment of missing values for local climate information [42].

Imputation in Multivariate Data
Multivariate data are data with multiple independent variables. The methods used in this part were substitution methods, using dependencies between variables. Therefore, in our experimental data, we first examined the correlation between the variables using the Pearson correlation coefficient.
As shown in Figure 10, it is possible to identify environmental substances with strong correlations. For example, in environmental sensor data, there are high correlations, such as CO-temperature, NO 2 -temperature, CO 2 -TVOC, and NO 2 -CO. Before multivariate imputation was performed for each variable, feature selection was performed with variables showing a high correlation. This was because imputation with variables with clear correlations would be more effective than including all 10 variables in multivariate imputation. in our experimental data, we first examined the correlation between the variables using the Pearson correlation coefficient. As shown in Figure 10, it is possible to identify environmental substances with strong correlations. For example, in environmental sensor data, there are high correlations, such as CO-temperature, NO2-temperature, CO2-TVOC, and NO2-CO. Before multivariate imputation was performed for each variable, feature selection was performed with variables showing a high correlation. This was because imputation with variables with clear correlations would be more effective than including all 10 variables in multivariate imputation. 3. Random forest regression: Replacing missing values using the average predictions of multiple decision trees. Similar to K-NN, there are many new, modified missing value imputation methods based on random forest regression [45].
4. Support vector regression: A method using a support vector machine, which is used to replace missing values.

Miss forest:
This is a random forest-based model, which is used to replace missing values. It can be used universally, regardless of continuous, categorical, or complex interactions and non-linear relationships [46].

Ensemble Learning Method
In this paper, we propose a statistical technique and a machine learning technique, respectively, as ensemble methods to consider the univariate imputation and multivariate imputation methods at the same time. A weighted average method that is easy to use and has a fast calculation speed was used as a statistical technique. A stacking method that predicts the final result, by building a prediction model using the result predicted by each substitution method, is used as a machine learning technique. Multiple linear regression: Fitting a multiple linear regression model and replacing missing values using this. This is used in the imputation method of missing values to measure pollution concentration and air quality [44].

3.
Random forest regression: Replacing missing values using the average predictions of multiple decision trees. Similar to K-NN, there are many new, modified missing value imputation methods based on random forest regression [45]. 4. Support vector regression: A method using a support vector machine, which is used to replace missing values.

5.
Miss forest: This is a random forest-based model, which is used to replace missing values. It can be used universally, regardless of continuous, categorical, or complex interactions and non-linear relationships [46].

Ensemble Learning Method
In this paper, we propose a statistical technique and a machine learning technique, respectively, as ensemble methods to consider the univariate imputation and multivariate imputation methods at the same time. A weighted average method that is easy to use and has a fast calculation speed was used as a statistical technique. A stacking method that predicts the final result, by building a prediction model using the result predicted by each substitution method, is used as a machine learning technique.

Weighted Average Method
The weighted average is one of the simple combination methods used in the ensemble method [47]. A weighted average was set by setting weights, and a proposal was made as a final result. High weights were given to methods with good performances, and low weights were set for methods with relatively poor performances. Weights were set in inverse proportion to the evaluation methods (MAE, RMSE) obtained from each imputation method. Equation (1) shows the result (ŷ) that was obtained after introducing the weighted average algorithm.ŷ = e 2 e 1 + e 2ŷ 1 + e 1 e 1 + e 2ŷ 2 (1) In this case,ŷ is the final result prediction vector, andŷ 1 andŷ 2 are the predicted result vectors in univariate imputation and multivariate imputation. e 1 is the evaluation result value derived from univariate imputation and e 2 indicates the evaluation result value, derived from multivariate imputation.

Stacking Method
The stacking method is a machine learning technique of the ensemble techniques used, along with bagging and boosting techniques, to make another prediction based on the data predicted by individual algorithms. This model considers the predictions of the base learner as new data, and trains them as a meta-learner, which helps to obtain more accurate predictions of the dataset [48]. A variety of base learner models can be applied to form a stacking model, and we chose the univariate imputation technique and the multivariate imputation technique for the base learner model. The basic performance of our stacking algorithm can be seen in Figure 11.

Weighted Average Method
The weighted average is one of the simple combination methods used in the ensemble method [47]. A weighted average was set by setting weights, and a proposal was made as a final result. High weights were given to methods with good performances, and low weights were set for methods with relatively poor performances. Weights were set in inverse proportion to the evaluation methods (MAE, RMSE) obtained from each imputation method. Equation (1) shows the result ( ) that was obtained after introducing the weighted average algorithm.

=
(1) In this case, is the final result prediction vector, and and are the predicted result vectors in univariate imputation and multivariate imputation.
is the evaluation result value derived from univariate imputation and indicates the evaluation result value, derived from multivariate imputation.

Stacking Method
The stacking method is a machine learning technique of the ensemble techniques used, along with bagging and boosting techniques, to make another prediction based on the data predicted by individual algorithms. This model considers the predictions of the base learner as new data, and trains them as a meta-learner, which helps to obtain more accurate predictions of the dataset [48]. A variety of base learner models can be applied to form a stacking model, and we chose the univariate imputation technique and the multivariate imputation technique for the base learner model. The basic performance of our stacking algorithm can be seen in Figure 11. As can be seen in Figure 11, the original data for 10 types of environmental substances entered the base learner's input variable. The base-learner consisted of the models used in Sections 2.4.1 and 2.4.2. Thereafter, one of the five techniques of univariate imputation was selected to collect the missing value replacement values predicted by this technique. As can be seen in Figure 11, the original data for 10 types of environmental substances entered the base learner's input variable. The base-learner consisted of the models used in Sections 2.4.1 and 2.4.2. Thereafter, one of the five techniques of univariate imputation was selected to collect the missing value replacement values predicted by this technique. Multivariate imputation similarly collects the substitution value of one of the five techniques. When collecting environmental sensor data, data on the reference device was also collected, which was used as label data in the process of training the meta-learner. In conclusion, the substitution value for each technique in univariate and multivariate imputation and the sensor value in the reference device were integrated to enter the input variable (Mx3) of the meta-learner. In this case, the size of the input variable varied according to the missing value ratio. Then, the meta-learner model and the linear regression model were selected to predict the imputation value of the missing values. The predicted value was finally compared with the original data.
Algorithm 1 was followed as the stacking algorithm. In this algorithm, D 1 and D 2 are univariate and multivariate data, respectively. First, D 1 and D 2 are trained using U, which is a univariate imputer model used as a base learner, and M, which is a multivariate imputer model. Re-training is performed using the stacking imputer S, using P 1 and P 2 , which are the predicted values of the learned data. In this case, R is used, which is a label in the reference device. In conclusion, the final predicted value, P 3 , is obtained.

Evaluation Method
To prove the effect of missing data imputation when applied to environmental sensor data, the evaluation method was measured with the mean absolute error (MAE) and the root mean square error (RMSE). MAE and RMSE are the most widely used evaluation methods for the imputation of missing values [49][50][51][52]. The formulas for these methods are shown in Table 3. Table 3. Evaluation method.

Evaluation Method Equation Perfect Score Data Distribution
Mean Absolute Error (MAE) In this case, x i is the actual value of the environmental sensor data,x i is the imputed value of the environmental sensor data, and n is the number of samples. When using RMSE, missing values are not biased and are used when the distribution is normal. On the other hand, MAE is suitable for evaluating uniformly distributed missing values [24]. Unlike MAE, RMSE gives a large penalty for values with a large error. These two methods performed different evaluations according to the distribution of and errors in the data. In this experiment, four cases of missing values were set. At this time, missing values were distributed differently for each case. Therefore, the distribution of errors was also expected to be different for each case. Therefore, by checking the MAE and RMSE at the same time, we could compare the performance regardless of the distribution of various errors by case.

Results
When using the ensemble model, a total of 25 cases were confirmed by introducing one technique each from 5 univariate techniques and 5 multivariate techniques. Among them, when comparing the existing model and the proposed model, the model with the best performance in the existing model was selected and compared.

Differences between Models According to Evaluation Method
Among the 10 environmental substances measured by the environmental sensor device, CO 2 was mainly used in the results. Other environmental substances showed similar results, and CO 2 , which showed the clearest result, was selected. As mentioned in Section 2.6, the RMSE evaluation method showed a greater penalty for errors that deviated significantly from the MAE method. Through this, we tried to judge the characteristics of the model considering both MAE and RMSE. If the RMSE value was higher than the MAE value, this suggested that a large error has occurred for a specific missing value. Figure 12 is a graph showing the MAE and RMSE values of each technique, with a missing rate of 15%, and each of the four CO 2 situations. As shown in Figure 12, univariate imputation, multivariate imputation, and weighted average methods show that the RMSE value tended to rise compared with the MAE value. On the other hand, the stacking method does not show a tendency to increase the RMSE value compared with the MAE value. It can be seen that stacking does not cause a large error. Looking at Figure 12a, when un ivariate imputation was applied, the MAE value was 31.59 and the RMSE was measured to be 71.51. When multivariate imputation was applied, the MAE was 48.36 and RMSE was 81.46, and when a weighted average was used, MAE was 27.70 and RMSE was 58.78. On the other hand, when stacking was used, the MAE was 31.92 and RMSE was 31.31; therefore, it can be confirmed that RMSE derives similar values to MAE, unlike the above three methods. methods performed different evaluations according to the distribution of and errors in the data. In this experiment, four cases of missing values were set. At this time, missing values were distributed differently for each case. Therefore, the distribution of errors was also expected to be different for each case. Therefore, by checking the MAE and RMSE at the same time, we could compare the performance regardless of the distribution of various errors by case.

Results
When using the ensemble model, a total of 25 cases were confirmed by introducing one technique each from 5 univariate techniques and 5 multivariate techniques. Among them, when comparing the existing model and the proposed model, the model with the best performance in the existing model was selected and compared.

Differences between Models According to Evaluation Method
Among the 10 environmental substances measured by the environmental sensor device, CO2 was mainly used in the results. Other environmental substances showed similar results, and CO2, which showed the clearest result, was selected. As mentioned in Section 2.6, the RMSE evaluation method showed a greater penalty for errors that deviated significantly from the MAE method. Through this, we tried to judge the characteristics of the model considering both MAE and RMSE. If the RMSE value was higher than the MAE value, this suggested that a large error has occurred for a specific missing value. Figure 12 is a graph showing the MAE and RMSE values of each technique, with a missing rate of 15%, and each of the four CO2 situations. As shown in Figure 12, univariate imputation, multivariate imputation, and weighted average methods show that the RMSE value tended to rise compared with the MAE value. On the other hand, the stacking method does not show a tendency to increase the RMSE value compared with the MAE value. It can be seen that stacking does not cause a large error. Looking at Figure 12a , when un ivariate imputation was applied, the MAE value was 31.59 and the RMSE was measured to be 71.51. When multivariate imputation was applied, the MAE was 48.36 and RMSE was 81.46, and when a weighted average was used, MAE was 27.70 and RMSE was 58.78. On the other hand, when stacking was used, the MAE was 31.92 and RMSE was 31.31; therefore, it can be confirmed that RMSE derives similar values to MAE, unlike the above three methods. The error distribution for the four cases can be checked in Figure 13. There was a large error in models, except for the stacking, and the distribution of errors in stacking was more stable than in other models. This means that the stacking method showed no  The error distribution for the four cases can be checked in Figure 13. There was a large error in models, except for the stacking, and the distribution of errors in stacking was more stable than in other models. This means that the stacking method showed no significant deviations from the existing value, and it can be expected that the RMSE of the stacking method will not soon increase with respect to MAE. This can be confirmed from the stacking distribution of (a), (b), (c), and (d) of Figure 13. The error distribution for the four cases can be checked in Figure 13. There was a large error in models, except for the stacking, and the distribution of errors in stacking was more stable than in other models. This means that the stacking method showed no significant deviations from the existing value, and it can be expected that the RMSE of the stacking method will not soon increase with respect to MAE. This can be confirmed from the stacking distribution of (a), (b), (c), and (d) of Figure 13.

Performance Comparison between Models
First, the target variable was set as CO 2 from 10 types of environmental substances, and RMSE was set as the evaluation method. We aimed to compare the performance of different models according to the occurrence of missing values. In addition, the model's performance was checked by varying the missing rate in to see the numerical values that affected the model, according to the missing rate. Assuming that the missing rates were 5,10,15,20,25, and 30%, we checked whether our ensemble method was suitable for use in diverse missing situations. As can be seen in Figure 14, it was confirmed that the missing rate in the four cases did not significantly affect the performance between models. In other words, it can be seen that the ensemble model performs better than the univariate imputation model and the multivariate imputation model, which are existing models, even if the missing rate changes. Looking at (a), (b), (c), and (d) in Figure 14, stacking performed the best regardless of the missing value case. The model using the weighted average performed better than the conventional method in Figure 14a, but slightly better than the multivariate imputation model in Figure 14b,c, and slightly worse than the multivariate imputation model in Figure 14d.
ing rate in the four cases did not significantly affect the performance between models. In other words, it can be seen that the ensemble model performs better than the univariate imputation model and the multivariate imputation model, which are existing models, even if the missing rate changes. Looking at (a), (b), (c), and (d) in Figure 14, stacking performed the best regardless of the missing value case. The model using the weighted average performed better than the conventional method in Figure 14a, but slightly better than the multivariate imputation model in Figure 14b,c, and slightly worse than the multivariate imputation model in Figure 14d.  Figure 15 shows the imputation figure for the occurrence of four cases. Figure 15 shows the 10% missing rate for CO2, and shows a graph connected by a dotted line, based on the missing values imputed by each technique. In the sensor error cases shown in Figure 15c,d, it is easier to see that the weighted average and stacking imputation follow the existing graph well. As shown in Figure 15c, it can be seen that univariate and multivariate imputation replaced the outliers from the existing graph, while weighted average and stacking follow the existing graph. In particular, in Figure 15d, it can be seen that the  Figure 15 shows the imputation figure for the occurrence of four cases. Figure 15 shows the 10% missing rate for CO 2 , and shows a graph connected by a dotted line, based on the missing values imputed by each technique. In the sensor error cases shown in Figure 15c,d, it is easier to see that the weighted average and stacking imputation follow the existing graph well. As shown in Figure 15c, it can be seen that univariate and multivariate imputation replaced the outliers from the existing graph, while weighted average and stacking follow the existing graph. In particular, in Figure 15d, it can be seen that the stacking technique learns using the numerical values of the reference device, so it can be seen that the missing values are better predicted for the existing data. Table 4 shows the RMSE evaluation result for CO 2 , and the missing rate was set as 10%. The final prediction, derived from a weighted average chosen from the ensemble methods, was better than or similar to the two methods of univariate and multivariate imputation, and it was confirmed that the performance was somewhat lower in the measurement range case compared with the sensor errors. On the other hand, when the stacking method was chosen from the ensemble methods, it can be seen that, in all four cases, the RMSE performance was better than the rest of the models.    Table 5 shows the model execution time for CO 2 , and the missing rate was set to 10%. The execution time of the existing model is the average execution time for 5 techniques, and the proposed model is the average execution time for 25 combinations. There seems to be no significant difference in execution time for each case. Comparing the existing model with the proposed model, since the proposed model performs additional work after executing two existing models, a longer execution time is required compared with the existing model. However, the complexity of the model does not seem to be a big problem as the time does not differ significantly compared to the existing model. In the proposed model, the time in parentheses means the time it takes to do additional work.

Discussion
When a missing value occurred in the environmental sensor, an ensemble imputation method was conducted according to the appropriate case. As mentioned in Section 2.3, we assumed the existence of four cases. This was derived from last year's environmental sensor data measurement, and the four most frequently occurring cases were selected. In addition to this, several cases can be added for cases where missing values occur. Examples include limitations in data collection and human error in the storage process [5,10]. In this technique, errors in communication were divided only into errors in period. However, not only periodicity, but also various errors, were detected for communication error situations. For example, if one sensor causes a communication error on a device, other sensors are affected, or once a communication error occurs, successive transmission failure leads to a burst of losses. In order to develop such a more advanced technique, it is necessary to add and subdivide cases that actually occur for communication errors.
In the sensor error (measurement range) case of Table 4, the RMSE performance of the weighted average tended to be poorer than that of the other three cases. In the measurement range case, since a certain sensor range is set and values that surpass this were judged as missing, both univariate and multivariate imputation models tended to underpredict compared with the original missing value. However, since the weighted average was an ensemble technique that averaged the univariate and multivariate models by weighting them without a separate training process, it was difficult to derive a value close to the actual value. Therefore, as shown in Section 3.2, it can be seen that the weighted average model performed poorly for multivariate imputation in the measurement range case.
When retraining with the meta-learner, while performing the stacking method, we also considered which value should be set as a label. Unlike this study, if missing values occur in real devices, there is no label value. Therefore, there is a problem in training the stacking model at this time. We solved this problem through two devices, whose linearity was confirmed when setting the sensor. The time series data of the corresponding variable were obtained from a device with no missing values, and the ensemble method was introduced in the device with missing values. If a sufficient number of missing sample values are learned in the device setting process, it is expected that missing values will be properly replaced, even when missing values actually occur. This also solves the universal problem of not being able to evaluate the replacement technique when applied to a real device.
When operating an actual sensor, it is also necessary to consider whether the proposed technique will be effective even in dynamically changing situations. In the actual environment, unexpected problems occur, such as continuous missing values for a certain period of time, as shown in Figure 5. In order to introduce this technique in actual sensors, such cases should be further subdivided and added to further strengthen the natural induction of the correct replacement technique. In addition, if data is accumulated and learned for sufficient time in a situation where missing is minimized, missing values can be replaced well, even in situations that become dynamic in the future.
The ensemble method involves the application of the model based on the predicted or evaluated values of the existing univariate and multivariate imputation models. Therefore, we have no choice but to rely on the performance of univariate and multivariate imputation, which means that the performance of a single model should support the method. In other words, in order to increase the performance of the stacking algorithm, it is necessary to improve the performance of univariate and multivariate imputation first. This problem can be solved by boosting performance in our systematic confrontation process using the latest high-performance techniques, rather than universal techniques.

Conclusions
Interest in the environment is growing and the reliability of environmental sensors that can measure it has been emphasized. In the process of collecting sensor data, some data may be lost, and it is important to deal with these missing values accordingly. Various methods of handling missing values are being studied, but a new method is needed for the more accurate replacement of missing values that can be applied to environmental sensors. In the experiment, a new ensemble method that considers time dependence and correlation with other environmental substances was proposed.
In this study, we first created four cases in which missing values can occur in environmental sensors. For each situation, five traditional univariate imputation techniques and five multivariate imputation techniques were applied. Then, weighted average and stacking models were applied to the ensemble methods, based on the missing values were predicted by each model. After that, we checked the difference between the actual value and our predicted missing value, shown through MAE and RMSE. In this process, the missing rates (5,10,15,20,25, and 30%) were changed to determine whether our ensemble method was effective in various situations. The experiment was conducted based on CO 2 , chosen from 10 environmental substances. As shown in Section 3.2, when the missing rate is 10%, it could be seen that the stacking performance of the ensemble method was measured more accurately than the other three models. It showed a good performance in all four cases. In addition to this, it was confirmed that the stacking method had the best performance among the ensemble methods, and the weighted average showed a good performance, even when the missing rate was changed. As well as CO 2 , the ensemble method was used for 10 types of sensor data to determine whether a good performance could be derived for other environmental materials.
The most significant element to emphasize concerning the proposed technique is its usability. This technique can be applied to all sensors using multivariate among time series data. We tried to implement a lightweight, yet easy-to-implement, technique by using the most common techniques in replacing missing values as base learners of ensemble techniques. In addition, the base learner does not influence which technique is included; therefore, it is a simplified algorithm that does not have a problem using the latest technique for the base learner.
In addition, existing papers have not divided the situation in which missing values occur when implementing an algorithm for missing value replacement. Usually, the existing papers were conducted only by changing the missing rate. However, we have divided the situation in which missing values occur into four cases and established a countermeasure against errors that actually occur frequently. These cases can be added at any time, and by establishing a countermeasure against these cases, there is a process of recommending and introducing appropriate confrontation techniques when missing occurs.
The imputation of such missing values is required not only in environmental sensors, but also in various fields such as the smart city. When missing values occur in the environmental sensor, our new ensemble method that considers the time dependence and the correlation between variables can be significantly contributed.