Detection of Smoking in Indoor Environment Using Machine Learning

Abstract: As the effects of indoor pollutants on the human body have been revealed, interest in indoor air quality management is increasing. In particular, indoor smoking is one of the common sources of indoor air pollution, and its harmfulness has been well studied. Accordingly, the regulation of indoor smoking is emerging all over the world. Technical approaches to regulating indoor smoking are also being pursued, but research is focused on detection hardware. This study presents an analytical and machine learning approach to cigarette detection based on typical gases (total volatile organic compounds, CO2, etc.) collected from IoT sensors. In detail, a data set for machine learning was built using IoT sensors, including a training data set securely collected from a rotary smoking machine and a test data set gained from an actual indoor environment with spontaneous smokers. Prediction performance was evaluated with accuracy, precision, and recall. As a result, the non-linear support vector machine (SVM) model showed the best performance, with 93% accuracy and an F1 score of 88%. The supervised k-nearest neighbors (KNN) and multilayer perceptron (MLP) models also showed relatively good results, and simplifying the prediction to binary classification proved effective for improving accuracy and speed. [...] learning technique discovered this time, in the future, we intend to develop a complex system that enables the operation of an air [...]


Introduction
In modern society, most people live mainly indoors, so the importance of indoor air pollution has been raised continuously [1]. According to World Health Organization (WHO) research, 3.8 million people worldwide die prematurely each year from illnesses attributable to indoor air quality [2]. Moreover, in the wake of COVID-19, interest in and the importance of indoor living environments are expected to increase as living patterns change. Indoor air pollutants arise from various sources such as outdoor air inflow, cooking, smoking, and building materials. Among them, heating, cooking, and smoking have been reported as typical sources of indoor air pollution produced by purposeful human activities [3]. However, cooking and heating are essential parts of our lives, whereas indoor smoking is only a personal preference that adversely affects our health. According to WHO, indoor smoking emits fine particles at levels 100 times higher than acceptable [4]. Additionally, the US Environmental Health Administration cites a study which found that when nicotine and a toxic tobacco substance combine with dust in the air, 40% remained even after 21 days. This long-term effect of indoor smoke is evidence that even third-hand smoke is possible. Nevertheless, cigarette sales in the first quarter of 2020 increased by 4.1% compared with the previous year. Over the same period, sales of e-cigarettes decreased while sales of tobacco increased. Thus, tobacco cigarettes account for 80% of total tobacco sales [5].
To define the effect of tobacco emissions on indoor air quality, many studies have been conducted, while air quality monitoring regulations are being framed in each country to protect humans and the environment from harmful gases in the air [6]. Besides the air contamination issues of smoking, other studies have sought to identify and detect pollutants such as Total Volatile Organic Compounds (TVOC), Particulate Matter (PM), Carbon Dioxide (CO2), and Nitrogen Oxides (NOx) [7][8][9][10]. When smoking occurs indoors, these pollutants are produced and can be detected by various sensors. As reported in many studies, environmental sensors have been developed or utilized for accurate measurement of pollutants. However, measuring air quality is very difficult because the atmospheric environment changes in real time. In addition, it is difficult to predict and prepare for the future atmospheric environment using outdated data or measurement methods. Nowadays, cities around the world are integrating Information and Communication Technology (ICT) and Internet of Things (IoT) technology to promote urban development. IoT technology is used to manage and improve a city's assets and services across the overall system. This paper aims to detect pollution from indoor smoking in real time and to accurately predict the pollution caused by indoor smoking and by various general indoor air pollution sources using ICT-based sensors, which are relatively inexpensive compared with existing real-time detection methods such as cameras and motion detectors. In this study, we were able to predict the smoking situation in an indoor environment by using non-intrusive environmental sensors. We experimented to predict the presence or absence of smoking in various general situations (cooking, heating, smoking).
The problem with vision-based technology is that it is relatively expensive, and the installation of surveillance devices such as cameras, infrared sensors, and motion sensors is rather difficult. Smoking detection using commercially available sensors is relatively easy and inexpensive to deploy, and sensors that are already installed can be utilized. Therefore, we propose a system that monitors the changing atmospheric environment in real time by combining ICT and pre-installed sensors.

Recent Studies on Cigarette Detection Sensing System
Various studies have been conducted to identify and detect pollutants from indoor smoking. First, by connecting the MQ-7 carbon monoxide gas sensor to the Raspberry Pi camera through Serial Peripheral Interface (SPI) communication, smoking in an indoor smoking area can be monitored in real time by measuring the concentration of Carbon Monoxide (CO) [11]. Additionally, using the same Raspberry Pi camera, a smoke sensor and a gas sensor were connected through serial communication to determine whether smoking occurred indoors in public places, while a separate study addressed a real-time smoking detection system. This made it possible to search for images using 3D printing after detecting indoor smoking in public places [12]. In addition, one study introduces a new system for detecting human smoking behavior or small amounts of smoke using image techniques. Smoking and non-smoking were detected using a wearable sensor system (Personal Automatic Cigarette Tracker 2.0, PACT 2.0), which proposed a method of integrating information from the lighter and from an Inertial Measurement Unit (IMU) on the wrist [13]. A similar study developed a new system that uses images to detect human smoking or small amounts of smoke, in which motion detection and background removal are combined with motion area saving, skin-based image segmentation, and smoke-based image segmentation to provide potential smoke areas. Selecting a smoke area made it possible to determine the occurrence of smoking [14]. Recently, a smoking image detection model based on a convolutional Neural Network (NN), called Smoking Net, has been suggested; it automatically detects smoking behavior in video content through images. This method has shown high accuracy and excellent performance for real-time monitoring, as it can detect smoking images using only smoking gesture information and cigarette image characteristics, without detecting cigarette smoke [15].
Using a smoking gesture detection model trained to detect smoking gestures as described above, it was confirmed that a gesture classification indicating whether a smoking gesture occurred can be obtained from sensor data [16].

Sensing System for Indoor Air Quality Detection
Recently, systems that monitor the environment in real time by combining ICT with sensor technology have been used to complement existing methods. Many different technologies have been studied to identify robust and practical ways to monitor air quality in daily life. Research has mainly aimed to minimize human intervention and to monitor indoor air quality automatically. One system included a physical circuit and an interactive platform; the main components of the physical circuit were an Arduino Leonardo, a dust sensor, a temperature and humidity sensor, and a Liquid Crystal Display (LCD). These were connected to a sensor network or to Ubidots (Boston, MA, USA) under specific conditions, and the air quality was then analyzed [17]. In general, indoor concentrations of PM 2.5 and CO were measured using an IQ Air Visual Pro sensor (PM 2.5) and an electrochemical COA1 (CO, UEI, Gardena, CA, USA) detector adapter device [18]. Personal indoor and outdoor PM exposure was evaluated using an inexpensive measuring device called DustDuino (Dominican Republic), which consists of a dust sensor (Plantower#3003, Plantpower, Beijing, China) that counts particles using infrared photodiodes [19] and tracks changes in air conditions to check indoor air pollution. In addition, a similar study developed an IoT sensing system that monitors and analyzes the indoor CO2 concentration in real time and is also intended to predict CO2 accumulation. It uses a hybrid prediction approach that combines ARIMA and SVR models, and the prediction performance was measured using Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and R Square (R²) [20]. Because efficient data reception is possible regardless of Internet connection, indoor air pollution can be measured and analyzed using IoT. The air quality sensor used is an MQ135 (RJ TECH, Hong Kong, China), which can be paired with a microcontroller to monitor indoor pollution [21].

Artificial Intelligence Technology Used for IoT Monitoring
In previous studies, supervised and unsupervised modeling techniques have been studied in various fields and disciplines. Related to this paper, indoor air quality (IAQ), which is determined by the concentration of indoor air pollutants, can be predicted using physically based mechanical models or statistical models based on measured data. Statistical models have great potential for exploring IAQ captured in large-scale measurement campaigns or in real-world occupancy environments. The most popular statistical models applied to IAQ are artificial NNs, multiple linear regression, partial least squares, and decision trees [22]. There are two classes of studies implementing machine learning approaches for air quality prediction. Studies in the first class mostly focus on estimating the concentration of pollutants. They mainly use Ensemble Learning or Regression algorithms, as these provide an excellent compromise between interpretability and model performance. The second class consists of manuscripts dealing with forecasting problems. These questions are handled explicitly, using NN and SVM techniques [23]. In modeling the spatial variability of PM, a non-linear approach can be used to accurately predict the limits of excess PM concentration [24].

Training Set
To increase the accuracy of predictions for various environments in real situations, it is essential to obtain a feature model from a limited experimental setting [25]. To create feature models of smoking for supervised machine learning, as shown in Figure 1, we used BORGWALDT's smoking machine RM 200A2 (from Germany) to test cigarette smoke generation. The RM200A2 is a fully automatic 20-port rotary smoking machine that meets the requirements of ISO 3308. It is not only suitable for routine smoking according to ISO 4387 and ISO 8454 but also allows smoking of Fine Cut Smoking Articles (FCSAs) according to ISO 15592-3 [26]. The atmosphere monitoring sensors are installed in the vaping chamber while the machine generates smoke vapor from 20 cigarettes per experiment. During the experiment, gases generated by the cigarettes are discharged through the ventilation system at the top of the chamber. After each smoking run, the chamber was ventilated for 5 min to reset the atmosphere between experiments. As shown in Figure 1, this machine contains a separate vaping chamber, which minimizes changes caused by the external environment. The tobaccos used in this experiment are described in Table 1. The experiment was performed per cigarette type as a set and repeated 10 times per set. During smoking, TVOC, CO2, PM 2.5, PM 10, temperature, and humidity were measured by 10 sets of sensors (SVM30 and SPS 30) in the smoking chamber. During the experiment, the temperature ranged from 26 to 34 °C and the humidity from 33 to 51%, as shown in Figure 2.

Indoor Smoking Environment-Testing Set
To imitate real-life indoor air quality, a small room (16 m²) was set up and used as the testing set for supervised learning. We targeted the changes in TVOC, Humidity (H), Temperature (T), CO2, PM 2.5, and PM 10 emitted to the surrounding atmosphere before and after the smoking of conventional tobacco. We ventilated the room through the window before and after smoking to reset the indoor air conditions. All tests were conducted with the initial air conditions of TVOC, H, T, CO2, PM 2.5, and PM 10 listed in Table 2. During the experiment, T and H were controlled separately by the air conditioner. Additionally, the experiment shows that the temperature and humidity of the room are not related to smoking but are strongly affected by opening and closing the window. The cigarettes used are not specified, since the smoking was done by volunteer smokers. The room conditions were controlled only for the quantities we focus on, as in Table 2, while other factors were not considered. Because the controlled factors are limited, this testing set is considered a reflection of a real situation. Heating and cooking were also performed in the same room under the same conditions, in a non-smoking situation, to create comparison cases. The heating source was an electric heater that consumes 400 W and does not change the humidity. For the cooking environment, a portable butane gas stove was used with 600 g of meat (pork, beef) in the pan. Table 3 presents the specifications of the sensors. We installed 2 sets of sensors in the small room, each including an SVM 30 (TVOC, H, T, CO2) multi-pixel MOX gas sensor and an SPS 30 (PM 2.5, PM 10), a particulate matter sensor for air quality monitoring and control from Sensirion Inc. The SVM 30 is a semiconductor gas sensor. Figure 3 shows the features of the SVM30. Semiconductor sensors are defined as devices that consist of a metal oxide gas-sensing material such as SnO2, WO3, or In2O3, a heater that maintains the proper operating temperature of the metal oxide, and a sensing electrode that measures the resistance of the sensing material.
Appl. Sci. 2020, 10, 8912

Measurement Devices
Figure 4 shows that the SPS 30 is a typical optical sensor. It relies on photoelectric properties to detect fine dust such as smoke, aerosol, PM 2.5, and PM 10. The light scattering method measures the diameter and number of particles by irradiating the suspended particles with a light source and then detecting the scattered light. During smoking, temperature and humidity are mainly influenced by the air conditioner and the outdoor environment, not by human activity, so temperature and humidity are only measured, not analyzed. In contrast, the concentrations of TVOC and CO2 are highly related to smoking activity and occupancy, while PM 2.5 and PM 10 are produced only by smoking. The SVM 30 and SPS 30 can be used by following the URL given in the supplementary materials.
In addition, if a value suddenly jumps to an abnormal level because of the low reliability and high sensitivity of the IoT sensor, it is treated as an outlier and eliminated. A Min-Max Scaler was used as a preprocessing step to extract valuable data from the raw data, showing data trends and patterns and matching data units. In the normalized data set, the maximum value is 1, the minimum value is 0, and the remaining values lie between 0 and 1. The data are scaled by applying the following operation to each value: the difference between the value (Data[i]) and the minimum value (Data.min) is divided by the difference between the maximum value (Data.max) and the minimum value (Data.min). Formula (1) is as below.
$$\mathrm{Data}_{\mathrm{scaled}}[i] = \frac{\mathrm{Data}[i] - \mathrm{Data.min}}{\mathrm{Data.max} - \mathrm{Data.min}} \quad (1)$$
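The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the sample readings, and in particular the jump threshold used to discard spikes are assumptions, since the text does not state the outlier rule that was applied.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] following Formula (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def drop_spikes(values, max_jump):
    """Drop readings that jump more than max_jump from the last kept reading.

    Hypothetical outlier rule; the source does not specify one.
    """
    kept = [values[0]]
    for v in values[1:]:
        if abs(v - kept[-1]) <= max_jump:
            kept.append(v)
    return kept


# Made-up CO2 readings in ppm; 5000 stands in for a sensor spike.
co2_ppm = [400, 410, 405, 5000, 420, 600]
cleaned = drop_spikes(co2_ppm, max_jump=500)
scaled = min_max_scale(cleaned)
```

Removing spikes before scaling matters here: a single abnormal reading would otherwise become Data.max and compress every genuine value toward 0.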

Characteristics of Collecting Data
There are four data attributes for sensing and classifying cigarette smoke: CO2, TVOC, PM 2.5, and PM 10. The experiment showed that each attribute has a decisive influence on classifying smoking. The amount of data and the number of attributes in this study are comparatively small, so the feature extraction process can be minimized. In addition, true values are available for machine learning because the experiment records when smoking starts and when it stops. Therefore, both supervised and unsupervised learning methods can be applied.

Problem Definition and Application Method
In Figure 6, the first selection is made according to the presence or absence of labeled data. A labeled data set can provide a training set with true values, which is called supervised learning. If the data are unlabeled, the problem belongs to the unsupervised learning group represented by clustering. The next choice depends on whether there is a hierarchy. If there is a hierarchy, the path branches into a hierarchical model. If there is no hierarchy, the choice depends on the presence or absence of K (the number of clusters): without K, a model such as DBSCAN can be chosen. If K is present and categorical variables are present, K-modes can be selected; otherwise, depending on whether a probabilistic result is preferred, the Gaussian mixture model or the K-means model can be selected.
In the case of labeled data, the problem falls into either the regression group or the classification group of supervised learning, according to whether a numeric value is being predicted. If no numeric value is predicted, the models are divided according to speed or accuracy. Models with high accuracy include Kernel SVM/Non-linear SVM, NN, MLP, and Random Forest, and models with good speed include linear SVM, K-Nearest Neighbors (KNN), and Decision Trees. Models in the regression group, which predicts numeric values, can be divided into GRU, Linear Regression, and Decision Trees models with high speed, and Neural Network, RNN, LSTM, Random Forest, and Gradient Boosting Trees models with high accuracy. The Possibilistic Fuzzy C-means (PFCM) algorithm, which combines Fuzzy C-means (FCM) and Possibilistic C-means (PCM) as a weighted sum, was used as a fuzzy algorithm for performance comparison. We selected a representative model from among those shown in this figure; the regression model did not fit the experiment and was not used.
The machine learning techniques that can be tried in consideration of the features described in Sections 2.2.1 and 2.2.2 were applied in the experiment mainly as a classification model of a supervised learning method and a classification model of an unsupervised learning method.
Also, FCM and PCM have typically been used in many pattern recognition problems in the fuzzy clustering domain. However, these methods are highly sensitive to noise and to overlapping clusters. To overcome these problems, PFCM mitigates the weaknesses of both methods by integrating FCM and PCM at the objective function level.
Among supervised learning models, the regression method is suitable for continuous values, not for classification. Since the target data of this study are categorical, the regression method is excluded and we select a representative algorithm from each of the other categories. The supervised learning models used are Linear SVM, Non-linear SVM (Support Vector Machine with Radial Basis Function, SVM RBF), KNN, and MLP.
The SVM is a well-developed supervised learning model for pattern recognition and data analysis in machine learning, and is mainly used for classification and regression analysis.
The Linear SVM model finds a straight line that separates the classes (groups) while adjusting the degree of misclassification allowed. One way to classify data that are not linearly separable with a linear SVM is to add polynomial features. This method is simple and mostly works well, but higher-order terms add a large number of polynomial features, which makes the computation very complicated. As the order of the polynomial increases, the amount of computation increases, so it takes longer to find the optimal hyperplane. The kernel trick can be used instead: it achieves the same result as adding many polynomial features without actually adding them. In the Gaussian Radial Basis Function, a and b are vectors describing the centers, γ is the value of the parameter, and ||a − b|| is the Euclidean distance. The Gaussian Radial Basis Function, Formula (2), is as below.
$$K(a, b) = \exp\left(-\gamma \lVert a - b \rVert^{2}\right) \quad (2)$$
In the R-order polynomial kernel, Formula (3), γ is the polynomial order.
In this experiment, we first selected the maximum-margin boundary (hyperplane) linearly to maximize the SVM classification margin. However, with a linear SVM it is difficult to separate the data only linearly in a high-dimensional space with many predictors. Therefore, a nonlinear SVM was used to create a hyperplane that classifies the four situations: heating, cooking, smoking, and non-smoking.
For the nonlinear SVM, the Gaussian RBF kernel is used because it yields a smoother curved decision boundary than the polynomial kernel, which enables more accurate classification in high-dimensional spaces.
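As a small illustration of the two ideas in this passage, the sketch below computes the Gaussian RBF kernel and checks the kernel trick on a degree-2 case. It is an assumption-laden example, not the study's implementation: the vectors, the value of gamma, and the helper names are invented, and the degree-2 kernel (a · b)² is used only because its explicit feature map is short enough to write out.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def poly2_features(v):
    """Explicit degree-2 feature map for a 2-D vector (x1, x2)."""
    x1, x2 = v
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


a, b = [1.0, 2.0], [3.0, 0.5]

# Kernel trick: (a . b)^2 equals the inner product of the explicit features,
# so the expanded feature vectors never have to be built.
kernel_value = dot(a, b) ** 2
explicit_value = dot(poly2_features(a), poly2_features(b))

same_point = rbf_kernel(a, a, gamma=0.5)  # identical vectors give 1.0
far_points = rbf_kernel(a, b, gamma=0.5)  # decays with squared distance
```

For higher orders and more input dimensions the explicit feature vector grows quickly, while the kernel evaluation stays a single inner product, which is the computational saving the text refers to.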
In the KNN model, the 4 attributes (CO2, TVOC, PM 2.5, PM 10) were used as one vector and k was set to 3. The distance between vectors was obtained using the Euclidean distance, and clusters were formed in a 4D space using supervised learning. For example, when new data need to be classified, if most of the three closest neighbors belong to cluster 0, the data are classified as 0, and if most belong to cluster 1, the data are classified as 1.
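The voting rule just described can be sketched as follows. The training vectors and labels are made up for illustration and are assumed to be already scaled to [0, 1]; `knn_predict` is a hypothetical helper name, not a function from the study.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training vectors."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up scaled vectors in the order (CO2, TVOC, PM 2.5, PM 10).
train_x = [
    (0.10, 0.05, 0.02, 0.02),
    (0.15, 0.10, 0.03, 0.03),
    (0.20, 0.08, 0.01, 0.02),
    (0.70, 0.80, 0.90, 0.90),
    (0.65, 0.75, 0.85, 0.88),
    (0.80, 0.90, 0.95, 0.96),
]
train_y = [0, 0, 0, 1, 1, 1]  # 0 = non-smoking, 1 = smoking

pred_smoking = knn_predict(train_x, train_y, (0.72, 0.82, 0.88, 0.90))
pred_clean = knn_predict(train_x, train_y, (0.12, 0.06, 0.02, 0.02))
```

An odd k such as 3 avoids tied votes in a two-class problem.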
MLP (multilayer perceptron), also called a feed-forward neural network or simply a neural network, is a generalized form of the linear model that makes decisions through several steps [27]. Normally, in a multi-layered learning method, a single hidden layer with enough neurons can form any mapping needed. In practice, two hidden layers are often used to speed up convergence. Problems that require more than two hidden layers were rare prior to deep learning. Two or fewer layers will often suffice for simple data sets. However, complex data sets involving time series or computer vision require additional layers [28,29]. Based on this conventional hidden-layer practice, we constructed an MLP with 1 input layer with 4 properties, 2 hidden layers, and 1 output layer.
The Rectified Linear Unit (ReLU) was used as the activation function of all layers except the output layer, and the output layer used a sigmoid for binary classification.
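The layer layout above corresponds to the following forward pass. This is only a sketch: the text does not give the number of units per hidden layer or any trained weights, so the 3-unit hidden layers and every weight and bias below are hypothetical values chosen to make the arithmetic easy to follow.

```python
import math


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, params):
    """4 inputs -> hidden 1 (ReLU) -> hidden 2 (ReLU) -> 1 sigmoid output."""
    h1 = relu(dense(x, params["w1"], params["b1"]))
    h2 = relu(dense(h1, params["w2"], params["b2"]))
    logit = dense(h2, params["w3"], params["b3"])[0]
    return sigmoid(logit)


# Hypothetical, untrained parameters with 3 units per hidden layer.
params = {
    "w1": [[0.5, 0.5, 0.5, 0.5], [1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    "b1": [0.0, 0.0, -0.5],
    "w2": [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
    "b2": [0.0, 0.0, 0.0],
    "w3": [[1.0, -1.0, 1.0]],
    "b3": [-2.0],
}

# Inputs in the order (CO2, TVOC, PM 2.5, PM 10), assumed scaled to [0, 1].
p_high = mlp_forward([0.9, 0.9, 0.9, 0.9], params)
p_low = mlp_forward([0.0, 0.0, 0.0, 0.0], params)
label_high = int(p_high >= 0.5)  # threshold the sigmoid for a binary label
```

The sigmoid output can be read as a probability of the smoking class, and the 0.5 threshold used here is a common default rather than a value stated in the source.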
Besides supervised learning, K-means clustering, the best-known unsupervised learning method, was used. The K-means algorithm is a partitioning clustering algorithm that groups the given data into k clusters and operates by minimizing the variance of the distances between each cluster center and its members. Since the K-means algorithm determines the initial value of each cluster randomly, the desired result may not be obtained, depending on the location of the initial values. Clusters may also form poorly when the cluster size, cluster density, or data distribution is unusual. Because of this limitation, a majority voting method, which performs clustering several times and assigns each point to the cluster that appears most frequently, is often used [27].
K-means was calculated using the Euclidean distance between vectors, and the number of clusters was set to 2 (k = 2) to separate the smoking group from the non-smoking group. First, the K-means clustering is computed in advance using the training data, and the test data are then assigned to the nearest cluster by Euclidean distance. In addition, this model calculates the cluster center for each data point and predicts the cluster index (0 or 1). The smoking case is automatically classified as 1, and the non-smoking case is automatically classified as 0.
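The two-step procedure above (fit the centers on training data, then assign test data to the nearest center) can be sketched as below. Two things are assumptions of this sketch rather than details from the source: the initial centers are passed in explicitly instead of being drawn at random, so that the result is reproducible, and index 1 ends up meaning smoking only because the second initial center is placed in the smoking group. The data are made up.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, init_centers, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centers."""
    centers = list(init_centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break  # assignments are stable, so the centers have converged
        centers = updated
    return centers


# Made-up scaled vectors in the order (CO2, TVOC, PM 2.5, PM 10).
train = [
    (0.10, 0.05, 0.02, 0.02),
    (0.15, 0.10, 0.03, 0.03),
    (0.20, 0.08, 0.01, 0.02),
    (0.70, 0.80, 0.90, 0.90),
    (0.65, 0.75, 0.85, 0.88),
    (0.80, 0.90, 0.95, 0.96),
]
centers = kmeans(train, init_centers=[train[0], train[-1]])

# Test points are assigned to the closest trained center.
label_smoking = nearest((0.72, 0.82, 0.88, 0.90), centers)
label_clean = nearest((0.12, 0.06, 0.02, 0.02), centers)
```

With random initialization, the cluster indices can swap between runs, so in practice the index-to-class mapping would need to be checked against a few labeled samples.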

Evaluation Metrics
We describe the conceptual metrics used to evaluate the classification performance of a model or pattern in machine learning. The factors that evaluate the model can ultimately be defined as the relationship between the predicted answer and the actual correct answer. The correct answer is divided into True and False, and the prediction of the model is divided into Positive and Negative. In this way, the cases can be divided into a 2 × 2 matrix, as shown in Figure 7. Actual smoking is the true value, and the actual non-smoking situation is the false value. If the predictive model predicts smoking, the prediction is positive, and if it predicts non-smoking, the prediction is negative. In this matrix, we need to be careful about false positives and false negatives. A false positive judges a non-smoking situation to be smoking, and a false negative judges actual smoking to be non-smoking. Which viewpoint matters more depends on the situation.
The denominator of accuracy covers all everyday situations (smoking, non-smoking, heating, and cooking), and the numerator is the sum of the true positives (smoking predicted in a smoking situation) and the true negatives (non-smoking predicted in a non-smoking situation). The accuracy, precision, and recall formulas, Formula (4), are as below.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} \quad (4)$$
Precision measures the proportion of actual smoking among all situations predicted as smoking. The denominator is the total number of smoking predictions, that is, the sum of the true positives (smoking predicted when smoking actually occurred) and the false positives (smoking predicted when no smoking occurred). The numerator is the number of true positives.
Recall is the proportion of actual smoking situations that are predicted as smoking. It is a measure of how sensitively the smoking situation is detected. The denominator of recall is the number of actual smoking situations, that is, the sum of the true positives (actual smoking predicted as smoking) and the false negatives (actual smoking predicted as non-smoking). The numerator is the number of true positives.
The F1 score is the harmonic mean of precision and recall. It evaluates the model's performance more faithfully than accuracy when the data labels are unbalanced, and it expresses the performance as a single number. The F1 score Formula (5) is as below.
\[ F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5) \]
F1 is used because model performance is valued differently from different standpoints, such as those of the user and the supplier, so it should not be judged on accuracy, precision, or recall alone. The F1 score is also used because of the nature of the IoT sensors we used. The time-series sensor data are sensitive, so the results must be verified for reliability. Evaluating the accuracy alone is therefore limited, and the F1 score was added as an evaluation scale. The meaning of each evaluation index should be used to identify where a model's strengths and weaknesses lie and to select a model accordingly.
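The four metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `classification_metrics` and the confusion-matrix counts in the example are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp: smoking predicted, smoking occurred
    fp: smoking predicted, no smoking occurred
    fn: non-smoking predicted, smoking occurred
    tn: non-smoking predicted, no smoking occurred
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not taken from the experiment).
example = classification_metrics(tp=40, fp=10, fn=10, tn=140)
print(example)
```

With these made-up counts, accuracy is 0.9 while precision, recall, and F1 are all about 0.8, which shows how accuracy can look better than the smoking-specific metrics when non-smoking samples dominate.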

Feature Extraction and Correlation-Based Feature Selection
After performing EDA on the sensing data transmitted from the CO2, PM 2.5, PM 10, and TVOC sensors used for indoor air quality monitoring, we derived the correlations between these values. Basic data analysis was done using Python's Pandas library. Table 4 shows statistical information for CO2, TVOC, PM 2.5, and PM 10 among the sensor data used in the experiment. The data set contains 3516 records in total; the sensor data are measured every second and transmitted to the collection server.
The correlation between the measured data was derived for two situations. First, we looked at the correlation in the overall situation (smoking, cooking, heating). In Figure 8, the correlation between TVOC and CO2 was 0.915857, showing a high correlation. There was also a high correlation (0.999868) between PM 2.5 and PM 10, while the correlation between TVOC and PM 10 was 0.197758, indicating a low correlation. Second, we analyzed the correlation in smoking situations. As Figure 9 shows, the correlation between TVOC and CO2 was 0.738492, and the correlation between PM 2.5 and PM 10 was 0.997929. The correlation between PM 10 and TVOC was 0.230974, the lowest of all.
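A correlation matrix of this kind can be sketched as follows. The text does not state which correlation coefficient was used, so this example assumes the Pearson coefficient, which is also the default of `DataFrame.corr()` in Pandas. The sketch uses only the Python standard library, and the sensor readings in it are invented for illustration; they are not the experimental data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Invented readings for illustration only (one value per second).
readings = {
    "CO2": [410, 420, 455, 500, 530],
    "TVOC": [12, 15, 30, 48, 60],
    "PM2.5": [8, 9, 9, 30, 12],
    "PM10": [10, 11, 11, 36, 15],
}

matrix = correlation_matrix(readings)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

A pair of sensors with a coefficient close to 1, as reported for PM 2.5 and PM 10, carries largely redundant information, which is the basis of correlation-based feature selection.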

Data Characteristics
Looking at the characteristics of the training data, the values of CO2 and TVOC show a symmetrical linearity. The values of PM 2.5 and PM 10 are also symmetrical, as shown in Figure 10. When a cigarette is smoked, each sensor reading is plotted as a graph, and the four types of information can be seen together.

Appl. Sci. 2020, 10, x FOR PEER REVIEW 13 of 19
Figure 10. Pre-processed training data.
Figure 11 shows the changes in CO2, TVOC, PM 2.5, and PM 10 of the training data set.

• Smoking
During smoking, CO2, TVOC, PM 2.5, and PM 10 increase. TVOC rises rapidly as soon as smoking starts, while fine dust, which is affected by inhalation and exhalation, rises slightly later. CO2 and TVOC decrease on inhalation and increase on exhalation.

• Cooking
The cooking data were collected while grilling meat such as beef, pork, and chicken on a gas stove. CO2, TVOC, PM 2.5, and PM 10 all rose. As soon as cooking started, the amount of fine dust rose about three times as much as with cigarettes. CO2 and TVOC from the cooking utensils were generated continuously, showing an increasing trend.

• Heating
Heat was generated using an electric heater. As the temperature increased, CO2 and TVOC gradually increased, but significantly less than with cigarettes and cooking. There was little change in PM 2.5 and PM 10.

Figure 12 shows the data for the five models as a 3D graph. Smoking data are indicated by blue dots and non-smoking data by red dots. The degree of variance differs from model to model according to the algorithm, and the smoking predictions for each sensor value are displayed in the coordinates.

The model with the highest values on the four performance metrics is the non-linear SVM RBF. Next, the three models KNN, MLP, and linear SVM show similar values. Except for K-means, the models reach a relatively high accuracy of 83.9~93.1%. This is a result of using binary classification in the experimental method. The analysis results are given in Table 5. K-means showed the worst performance. The binary classification method we used is specific to supervised learning, so its performance is relatively low with unsupervised K-means. This result arises because the choice of the k value is ambiguous and it is difficult to find centroids for scattered data.
Because our evaluation values the perspective of users who do not smoke, the number of false negatives must be considered. For example, when the model judges that no one has smoked even though someone has, the damage to non-smokers is great. We therefore considered recall more important than the other metrics. In terms of accuracy, SVM RBF, KNN, and MLP, in that order, performed well with values above 90%. In terms of recall, SVM RBF, MLP, KNN, PFCM, and K-means, in that order, also performed well, while linear SVM was the worst at only 51%.
By extracting CO2, TVOC, PM 2.5, and PM 10 for each data set and comparing the graphs of smoking, cooking, and heating, we confirmed that each data set has distinct characteristics. As smoking detection algorithms, we used K-means as an unsupervised method, PFCM as a fuzzy method, and linear SVM, non-linear SVM (SVM RBF), KNN, and MLP as supervised models, chosen based on the selection cheat sheet.
As shown in Figure 13, rather than judging SVM RBF to be a good model simply because of its high accuracy and recall, it is necessary to decide where SVM RBF and other machine learning methods such as KNN and MLP are best applied. For example, in a system that detects smokers, it is more important to classify a case as smoking only when it is certain that smoking is taking place. In this case precision is important, and the linear SVM model with 99% precision is the most effective. On the other hand, SVM RBF would be effective in a system that detects air quality degraded by smoking and warns people about it, because it is sensitive to the deterioration of air quality and classifies even slight air pollution effectively. In future research, it is necessary to apply several algorithms derived from the above types.
Receiver Operating Characteristic (ROC) curves are effective in evaluating a model on a binary target. Because clustering is unsupervised learning, it has no target against which sensitivity and 1 - specificity can be assessed, so it was excluded. ROC curves for the SVM linear, SVM RBF, and MLP methods with binary targets are presented in Figure 14. The SVM linear and SVM RBF methods have a large Area under the Curve (AUC) and show good performance considering the relationship between sensitivity and specificity.
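The construction of an ROC curve and its AUC can be sketched as below. This is an illustrative standard-library implementation, not the code used to produce Figure 14; the function names and the scores in the example are hypothetical. Each distinct score is used as a decision threshold, giving one (1 - specificity, sensitivity) point, and the area is accumulated with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels: 1 for smoking, 0 for non-smoking
    scores: classifier scores, higher means "more likely smoking"
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; every sample scoring at or above
    # the threshold is predicted as smoking.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and scores for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(round(area_under_curve(curve), 4))
```

An AUC of 1.0 means every smoking sample is scored above every non-smoking sample, while 0.5 corresponds to random ranking, so a larger area indicates a better trade-off between sensitivity and specificity.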

Discussion
In the case of ventilation, the internal air can be discharged through the ventilation facility, but the inflow of external air is inevitable. The prediction error range should therefore take the external environment into account. In future experiments, we will collect more training and test data that reflect the change of seasons, combine different sensor types, and compare and select different prediction methods to improve accuracy. With improved accuracy, tobacco detection can be monitored in real time under optimized conditions by fusing multiple sensors for pollution data (cigarettes), and it can be used as an integrated detection system operating in an IoT environment. The system developed in this study performs analysis on previously acquired data. Ultimately, however, real-time prediction of smoking conditions and alarm generation are needed. This study is significant as a first attempt to develop a machine learning solution system to prevent smoking and minimize the damage it causes. Using the optimized machine learning technique identified here, we intend to develop a complex system that operates an air purifier to improve indoor air quality and responds actively through linkage with the building air conditioning system.

Conclusions
Our experiment generated training data from indoor air monitoring sensors under limited circumstances to detect smoking activity, and predicted the smoking environment on test data that included false (non-smoking) cases such as cooking and heating. Through this, we are collecting detailed information optimized for the situation in order to detect smoking status.
To make the data suitable for the algorithms, Min-Max scaling was used as a preprocessing method. In the experiment, the non-linear SVM (SVM RBF) showed the best performance over the entire test data, with 93% accuracy and an F1 score, the harmonic mean of precision and recall, of 88%. Supervised learning methods generally showed better results than unsupervised ones. This follows from the data set: because smoking status is binary, supervised learning yields higher precision and recall values. The SVM results differ in how the hyperplane is formed. Linear SVM separates the three-dimensional data with a linear hyperplane and shows the lowest recall and the highest precision, whereas SVM RBF overcomes the errors of linear SVM by using Euclidean distance. MLP, a multi-layered learning system, did not show better results because it was not run as deep learning. However, the data are too simple to justify more than two hidden layers, which would not be efficient in time and resources. Overall, KNN and MLP also showed good performance.
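The Min-Max scaling mentioned above can be sketched as follows. The paper does not say how the scaling parameters were obtained, so this example assumes that the minimum and maximum are taken from the training data and reused for the test data; the function names and the CO2 values are hypothetical.

```python
def fit_min_max(train_values):
    """Return the (minimum, maximum) of one training feature."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Scale values to [0, 1] relative to the training range [lo, hi].

    Test values outside the training range fall outside [0, 1].
    """
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical CO2 readings in ppm, for illustration only.
train_co2 = [400, 500, 600]
test_co2 = [450, 650]

lo, hi = fit_min_max(train_co2)
print(apply_min_max(train_co2, lo, hi))
print(apply_min_max(test_co2, lo, hi))
```

Scaling matters for the distance-based models compared here (KNN and SVM RBF), because otherwise a feature with a large numeric range, such as CO2, would dominate the Euclidean distance.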
In this paper, we demonstrated that our environmental sensor fusion approach successfully detected indoor smoking in the office. We introduced a number of features that can be extracted from the environmental sensors, and showed that the highest predictive power is attained with sensor fusion. We used established evaluation and prediction methods; in the face of information overload, however, collecting and applying suitable state-of-the-art techniques for analysis is itself a kind of science. The approach offers a number of advantages, such as being nonintrusive, which is an important aspect in smart homes, and being aimed at users who only have access to open sources.