Indoor Emission Sources Detection by Pollutants Interaction Analysis

Abstract: This study employs the correlation coefficient technique to support emission source detection in indoor environments. Unlike existing methods, which analyse only primary pollution, we also consider secondary pollution (i.e., chemical reactions between pollutants, in addition to pollutant levels) and calculate the correlation coefficients between pollutants to characterize and distinguish emission events. Extensive experiments show that the proposed method identifies seven major indoor emission sources: (1) frying canola oil on an electric hob, (2) frying olive oil on an electric hob, (3) frying olive oil on a gas hob, (4) spraying household pesticide, (5) lighting a cigarette and allowing it to smoulder, (6) no activities, and (7) venting session. Furthermore, with a support vector machine classifier, our method achieves higher detection accuracy than classification without data filtering and than typical feature extraction methods such as PCA and LDA.


Introduction
Some studies on indoor air quality were motivated by the desire to understand the origins of the risks to the health of householders and the contribution of indoor emission sources relative to outdoor sources, as both imply quite different intervention strategies. Common indoor sources of airborne particles include combustion sources (primarily heating and cooking) and tobacco smoke. Other sources include combustion (candles, incense, etc.), household products (e.g., solvents, pesticides) and activities (e.g., dusting). Identifying the contribution of each source, and the exposure to it, is central to the effort to understand health effects and manage the risks. The magnitude, frequency and prevalence of these sources are strongly related to individual lifestyles and behaviours. Thus, there is huge potential for large variations in indoor emissions, air quality and exposures among homes, as well as among occupants. For this reason, a technique was sought to identify and quantify indoor emission sources in a form that could be deployed rapidly with ease in multiple homes at low cost.
In general, indoor air quality analysis distinguishes two major types of pollution [1]. Primary pollution is emitted directly into the atmosphere, such as carbon monoxide (CO) and carbon dioxide (CO2) gas from burning or particulate matter (PM10) released from household products. The levels of these pollutants can be easily detected by sensors, and existing measurement studies have focused on analysing the relationship between pollutant levels and human health, particularly for people who suffer from chronic conditions [2][3][4]. Secondary pollution results from chemical reactions among pollutants in the atmosphere. However, sensors cannot detect these reactions directly, because continuous reactions are invisible, such as the carbonation reaction at different temperatures (the continuous change between CO2 and CO) [5]. Capturing the reactions among pollutants is therefore necessary for emission source detection.

Related Work
So far, many pattern-recognition models have been used to detect emission sources. Linear discriminant analysis (LDA) [6][7][8], principal component analysis (PCA) [9][10][11] and genetic algorithms (GA) [12] have been used as feature-extraction methods, to magnify the main orthogonal contributions that explain most of the pollutants of an emission source. LDA, a supervised method, finds a linear combination of features that characterizes or separates two or more classes of objects or events [13], which benefits data classification. Joly and Peuch [6] used LDA to analyse eight indicators (pollutants) to separate rural and urban sites. A preliminary study by Marié et al. [14] focused on analysing magnetic carriers to detect emission sources from primary sources (vehicles) and on roads (paved areas), road borders and surrounding areas. They found LDA to be helpful at magnifying magnetic carriers in pollution sources. LDA has also been used as a pollutant behaviour analysis method to distinguish two mountain valleys in the Central Pyrenees [15]. To determine local and regional sources of PM10 and its geochemical composition in a coastal area, LDA was employed in [16] to test the extent to which differences in the PM10 levels and chemical profiles with varying atmospheric circulation and long-range transport were significant.
PCA is an unsupervised method, which is often used to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables by an orthogonal transformation [17]. Kong et al. [9] applied PCA to analyse particulate matter emitted from coal combustion, marine aerosol, vehicular emission and soil dust to identify the influence of sea salt in Tianjin, China. PCA was also applied to identify the atmospheric emission sources of soluble compounds in rainwater samples from crustal particles, marine aerosols, urban traffic and a fertilizer factory [18]. Further research on PCA was conducted in [19] to monitor snow melting by identifying six major factors for the observation of melting snow.

Motivation
To better identify emission sources in indoor environments, we argue that the relationship among pollutants may offer useful knowledge as an important component of emission source detection. The ability to capture complex patterns that occur as interactions among pollutants can support emission source detection systems. In computational chemistry studies, Pearson correlation analysis has been used as a tool to assist immunochemistry to better understand the process of the antibody recognition of hapten molecules in a competitive immunoassay [20]. The Pearson correlation has also been used to analyse the influence of streams on nearshore water chemistry [21]. Rasmussen's study [22] proposed a global chemistry-climate model based on correlation coefficients to characterize the surface O3 response to the year-to-year fluctuations in weather. In food chemistry, the correlation coefficient has also been used to analyse the relationship between ORAC and Maillard reaction-like products [23]. The main defect of existing feature-extraction methods for emission source detection is that they rely on internal data to capture the unique attributes of each entity; thus, they are not effective at discovering the interaction among pollutants. For a general emission source, the consistency ratio could be lower, but the correlation coefficient captures the relationship among the levels of the various pollutants, as discussed above.
Starting from the observation of the emission sources, we developed a novel correlation coefficient-based approach to support effective emission source detection. This approach captured the invisible interaction among pollutants during emission events, which is important information for emission sources' identification. This empowered the proposed approach to outperform recent feature extraction methods applied in other emission source detection studies.

Paper Organisation
The paper is organized as follows. In Section 2, we derive the proposed approach to dynamic pollutant interaction analysis. In Section 3, we demonstrate the merits of our approach by a comparison with existing methods. Finally, we conclude the paper and discuss future work in Section 4.

Description of the Sampling Data
In this study, we used the NIWA-developed PACMAN (Particles and Context Measurement Autonomous Node) device. The PACMAN instrument records air quality, as well as context information, at 1 Hz resolution. More detailed information can be found in [24]. Table 1 shows the details of the parameters and sensors used in this study.

Table 1 entry: dust sensor baseline offset, 1500 mV.

Data were collected in a set of semi-controlled tests over several days in October 2012, in which a single PACMAN unit was placed in the lounge of an otherwise unoccupied house, as shown in Figure 1. Known particle emission activities were conducted and logged manually. The emission activities included:
1. Frying canola oil on an electric hob;
2. Frying olive oil on an electric hob;
3. Frying olive oil on a gas hob;
4. Spraying household pesticide;
5. Lighting a cigarette and allowing it to smoulder.
The experimental protocol generally involved four stages: (i) pre-activity sampling (baseline measurements); (ii) emission activity; (iii) emission activity halted and pollution allowed to mix in the indoor air; and (iv) venting of the house by opening external doors and windows and using a fan to aid indoor-outdoor air exchange.
For labeling purposes, an event was taken to span the period from when the first flag was set (i.e., the pre-activity sampling stage) until the emitting activity ended (i.e., emission activity halted). In our experiment, we considered the venting session and the normal session (i.e., 10 min before each event and 10 min after each event) as two additional reference events.

Data Quality Control
PACMAN was operated continuously during the tests, logging data at 1 Hz. The data were checked for invalid instrument readings, but were not calibrated; therefore, the data presented here do not correspond to actual pollutant concentrations. The data were analysed using a moving window approach. This posed a problem for defining the labels associated with the events/emission sources, because a window can straddle two activities. Using the manual experiment logbook as a basis, the different emission activities were labeled on the records, but for the sliding window analysis, only windows in which more than 50% of the data points carried a given label were considered part of that event.
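The majority rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the stride of one sample and the toy labels are assumptions, and windows without a majority label are simply marked `None`.

```python
from collections import Counter

def label_window(point_labels, threshold=0.5):
    """Return the label held by more than `threshold` of the points, else None."""
    if not point_labels:
        return None
    label, count = Counter(point_labels).most_common(1)[0]
    return label if count / len(point_labels) > threshold else None

def label_windows(labels, w):
    """Label every length-w window of a per-sample label sequence (stride 1, assumed)."""
    return [label_window(labels[i:i + w]) for i in range(len(labels) - w + 1)]

# Hypothetical per-second logbook labels (made up for illustration).
labels = ["normal"] * 4 + ["smoking"] * 6
print(label_windows(labels, w=4))
# ['normal', 'normal', None, 'smoking', 'smoking', 'smoking', 'smoking']
```

The third window is split evenly between the two labels, so it does not exceed the 50% threshold and is assigned to neither event.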

Correlation Coefficient-Based Emission Sources' Detection
The proposed method to define representative indoor events was based on the processing of air quality time series and consisted of three steps: (i) selecting an appropriate continuously sliding window and fitting the range of the continuous sequence of air quality network data, (ii) removing short-term fluctuations associated with the influence of local emission sources from the original measurements, taking into account the correlation determined from the correlation coefficient analysis, and (iii) mapping the correlation factors into a nonlinear space for emission source recognition. For clarity of presentation, we summarize most notations used in the paper in Table 2.

Table 2. A summary of the notation used in this paper.

Notation                        Description
$X \in R^{T \times M}$          The basic data structure, which can cover a majority of the indoor conditions
$t$                             The index of the time point
$T$                             Total elapsed time
$x(t)$                          Samples at time $t$
$w$                             The size of the sliding window
$M$                             The total number of observed pollutants
$m$                             The index of the pollutant
$n$                             The total number of subsequences
$i$                             The index of the sliding window
$C$                             A set of matrices extracted by slicing the time series with sliding window $w$
$c_i \in R^{w \times M}$        An extracted subsequence from the $i$th sliding window
$r_{i,j,k}$                     The correlation coefficient of two pollutants (i.e., the $j$th and $k$th) for the $i$th sliding window

Time Series Analysis
Consider a time series $X = \{x(1), x(2), \ldots, x(t), \ldots, x(T)\}$ involving $L$ emission events, in which $x(t) = \{a_1(t), a_2(t), \ldots, a_M(t)\}$ denotes a data sample consisting of $M$ chemical contaminants and $T$ represents the elapsed time. A single point is insufficient for emission source detection, but a time series can be very long, sometimes containing millions of observations. To analyse the relationship between pollutants, it is desirable to apply a sliding window of size $w$ to $X$, which produces a sequence of shorter time series. Figure 2 shows the procedure of sliding-window subsequence extraction with any of the real-valued representations. As a result, we stored all extracted subsequences as $C = \{c_1, c_2, \ldots, c_n\}$, where $n = T - w + 1$ is the number of subsequences and $c_i = \{x(t), \ldots, x(t + w - 1)\}$ represents the subsequence starting at time point $t$ (with $t = i$ when the window advances one sample at a time, which is what $n = T - w + 1$ implies). Note that the corresponding label $y_i$ is included in the calibration data according to the events of emission sources for each subsequence.
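As a rough illustration of the subsequence extraction, the sketch below slices a series stored as a list of T rows (one row of M pollutant readings per time point) into the n = T − w + 1 windows. The function name `sliding_windows` and the toy values are assumptions made for this example; a stride of one sample is assumed because it is what the count n = T − w + 1 implies.

```python
def sliding_windows(X, w):
    """Return all n = T - w + 1 subsequences of length w (stride 1)."""
    T = len(X)
    if not 1 <= w <= T:
        raise ValueError("window size w must satisfy 1 <= w <= T")
    return [X[i:i + w] for i in range(T - w + 1)]

# Toy series: T = 5 time points, M = 2 pollutants (values are made up).
X = [[1.0, 10.0], [2.0, 9.0], [3.0, 8.0], [4.0, 7.0], [5.0, 6.0]]
C = sliding_windows(X, w=3)
print(len(C))  # 3, i.e. T - w + 1
print(C[0])    # [[1.0, 10.0], [2.0, 9.0], [3.0, 8.0]]
```

Each element of `C` plays the role of one w × M subsequence in the notation above.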

Data Filtering
To remove short-term fluctuations associated with the noise of the measurements from observed pollutants, the correlation coefficient was used in this work to establish the relationships among pollutants in each time window. The correlation is a measure of the strength of the relationship among pollutants in a subsequence $c_i$.
As $c_i$ approximates the original time series with a combination of $M$ pollutants in the $i$th sliding window, we represent $c_i$ as a combination of $M$ column vectors,

\[ c_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,M}], \]

where $a_{i,m} = \{a_{i,m}(1), a_{i,m}(2), \ldots, a_{i,m}(w)\}$ represents the $m$th pollutant vector for the $i$th sliding window. For each sliding window $i$, we define the correlation model as the correlation of every two pollutants, $r_{i,j,k} = r(a_{i,j}, a_{i,k})$, $j \neq k$, $j, k = 1, 2, \ldots, M$, in which the population correlation between two pollutants is calculated as:

\[ r(a_{i,j}, a_{i,k}) = \frac{\mathrm{Cov}(a_{i,j}, a_{i,k})}{\sqrt{\mathrm{Var}(a_{i,j})\,\mathrm{Var}(a_{i,k})}} = \frac{E\left[(a_{i,j} - E[a_{i,j}])(a_{i,k} - E[a_{i,k}])\right]}{\sqrt{\mathrm{Var}(a_{i,j})\,\mathrm{Var}(a_{i,k})}}, \]

where Var is the variance function, Cov is the covariance function and E is the mathematical expectation. The correlation coefficient $r(a_{i,j}, a_{i,k})$ is a number between −1 and 1, which expresses the degree to which, on average, two pollutants change correspondingly. If one pollutant increases when the second increases, then there is a positive correlation, and the correlation coefficient will be closer to 1. If one decreases while the other increases, then there is a negative correlation, and the correlation coefficient will be closer to −1. Because $r(a_{i,j}, a_{i,k}) = r(a_{i,k}, a_{i,j})$, each unordered pair of pollutants contributes a single coefficient, giving $M(M-1)/2$ coefficients per window. As a result, we obtain a correlation coefficient matrix $D \in R^{n \times M(M-1)/2}$,

\[ D = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix}, \qquad r_i = (r_{i,1,2}, r_{i,1,3}, \ldots, r_{i,M-1,M}), \tag{4} \]

in which $r_i \in R^{1 \times M(M-1)/2}$ represents one data instance. The class label of the instance, $y_i$, is given according to the events of emission sources for each sliding window. The problem of emission event detection is now to seek an optimal solution to $f^* : r_i \rightarrow y_i$, $i = 1, \ldots, n$.
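To make the mapping from one window to one data instance concrete, the following sketch computes the pairwise Pearson coefficients for the M pollutant columns of a single w × M window. It is an illustration under stated assumptions rather than the authors' implementation: the names `pearson` and `correlation_features` are hypothetical, the toy readings are made up, and returning 0.0 for a constant column (where the coefficient is undefined) is a choice made here that the text does not specify.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # The 1/n factors of Cov and Var cancel, so plain sums are enough.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat an undefined correlation as 0
    return cov / math.sqrt(var_a * var_b)

def correlation_features(window):
    """Map a w x M window to its M(M-1)/2 pairwise correlations (pairs j < k)."""
    columns = list(zip(*window))  # one tuple of w readings per pollutant
    return [pearson(columns[j], columns[k])
            for j, k in combinations(range(len(columns)), 2)]

# Toy window: w = 4 samples, M = 3 pollutants (values are made up).
window = [[1.0, 10.0, 2.0], [2.0, 9.0, 2.5], [3.0, 8.0, 1.0], [4.0, 7.0, 3.0]]
r = correlation_features(window)
print(len(r))  # 3, i.e. M(M-1)/2 for M = 3
print(r[0])    # -1.0: the first two columns move in exactly opposite directions
```

Stacking the vector `r` for every window row by row would give a matrix with the shape of D.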

Support Vector Machine Classification
SVM [25] performs structural risk minimization in the framework of regularization theory. Given a training set $D = \{x_i, y_i\}_{i=1}^{N}$, with the label $y_i \in \{-1, +1\}$ indicating the class to which the vector $x_i \in R^d$ belongs, the SVM seeks a linear separating hyperplane with maximum margin in a higher-dimensional feature space. For the linearly inseparable case, a nonlinear kernel function $k(x_i, x_j)$, $i \neq j$, $i, j = 1, 2, \ldots, N$, is applied to transform the input space to a higher-dimensional feature space, so that the classes may be linearly separable prior to calculating the separating hyperplane. In this work, air quality data were treated as a linearly inseparable case, and only a Gaussian RBF kernel function was tried for emission source detection, because it generalizes well without requiring guidance from prior experience.
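The normal form of the SVM classifier is defined as follows:

\[ \min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i\left(\mathbf{w} \cdot \phi(x_i) + b\right) \geq 1, \quad i = 1, \ldots, N, \]

where "·" means a dot product and $\phi(x_i)$ is the feature map associated with the kernel function $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, which enables performing a linear classification in a higher-dimensional feature space. The Gaussian RBF kernel function can be represented as:

\[ k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \]

where $\sigma^2$ is the parameter that determines the bandwidth of the Gaussian RBF kernel. The decision function can then be expressed as:

\[ f(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b\right), \]

where $\alpha$ and $b$ are the optimal decision parameters that are tuned through cross-validation tests.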
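The way a trained kernel SVM turns a correlation vector into a class decision can be sketched as follows. This is only an illustration of the evaluation step, not a training routine and not the authors' code. It assumes the common Gaussian RBF form exp(−‖x − z‖² / (2σ²)) and the standard kernel expansion sign(Σ αᵢ yᵢ k(xᵢ, x) + b). The support vectors, coefficients `alphas`, offset `b` and bandwidth `sigma2` below are made-up values, and the handling of seven classes (for example, by combining binary classifiers) is not shown because the text does not describe it.

```python
import math

def rbf_kernel(x, z, sigma2):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 * sigma2)) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma2))

def decision(x, support_vectors, labels, alphas, b, sigma2):
    """Return sign(sum_i alpha_i * y_i * k(x_i, x) + b) as +1 or -1."""
    score = sum(a * y * rbf_kernel(sv, x, sigma2)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1

# Hypothetical trained model on 2-dimensional correlation vectors.
support_vectors = [[0.9, -0.8], [-0.7, 0.6]]
labels = [+1, -1]
alphas = [1.0, 1.0]
b, sigma2 = 0.0, 0.5

print(decision([0.8, -0.9], support_vectors, labels, alphas, b, sigma2))  # 1
print(decision([-0.8, 0.7], support_vectors, labels, alphas, b, sigma2))  # -1
```

Each query is assigned the label of the support vector it lies closest to, because the kernel value decays quickly with distance for this bandwidth.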

Results
In this section, we report experiments in which we applied the proposed correlation coefficient method to emission source detection and compared it with the conventional PCA and LDA methods in terms of classification performance. A support vector machine was used with all feature-extraction methods to find the class label of a test vector. Figure 3 visualizes the pollutant interaction knowledge in terms of its effectiveness for emission source classification. The left-column panels plot the original pollutant levels, and the right-column panels present the distribution of correlation coefficients for a pair of pollutants. As seen in Figure 3a,c, before data filtering it is very hard to distinguish the two events "Frying canola oil, electric hob" and "Smoking". In contrast, Figure 3b,d in the right column present the data distributions of pollutant interaction represented by correlation coefficients, where the differences between the two emission events are magnified in the pollutant interaction space. In addition, 3D data distributions of pollutant levels and of pollutant interaction for both observed events, together with the sessions "Venting" and "Normal", are presented in Figure 3e,f, respectively. It is evident that the discriminant capability of pollutant interaction knowledge is much better than that of the original pollutant levels.

Pollutants Interaction Knowledge
Figure 3. (a) Levels of contaminants obtained from PACMAN for event "frying canola oil, electric hob"; (b) deterministic components obtained after data filtering for event "frying canola oil, electric hob"; (c) levels of contaminants obtained from PACMAN for event "smoking"; (d) deterministic components obtained after data filtering for event "smoking"; (e) the native organization of contaminants for the two events with sessions "venting" and "normal" distributed in 3D space; (f) the filtered data for the two events with sessions "venting" and "normal" distributed in 3D space.

Accuracy
To conduct the experiments, the original feature vectors were transformed by the proposed correlation method from an $M$-dimensional to an $M(M-1)/2$-dimensional space, and the classification results are reported in Table 3. We also report in the table the results when no feature extraction is conducted (denoted as "without data filtering"), with the same SVM classifier applied to obtain the classification performance. We present the generalization accuracy averaged over 100 trials (denoted as average accuracy) and calculate the standard deviation (denoted as stdev) and the accuracy change (denoted as growth) when two different sliding windows are applied. As seen from the table, the correlation coefficient method performs much better than the LDA and PCA techniques, as well as the case without data filtering, in detecting all seven emission events: (1) frying canola oil on an electric hob, (2) frying olive oil on an electric hob, (3) frying olive oil on a gas hob, (4) spraying household pesticide, (5) lighting a cigarette and allowing it to smoulder, (6) no activities, and (7) venting session.
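The summary statistics named above could be computed along the following lines. This is a small sketch under assumptions that the text leaves open: "stdev" is taken to be the sample standard deviation over trials, "growth" is taken to be the difference in average accuracy (in percentage points) between two window sizes, and all numbers are made up rather than taken from Table 3.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return (average accuracy, sample standard deviation) over trials."""
    return mean(accuracies), stdev(accuracies)

def growth(avg_small_window, avg_large_window):
    """Accuracy change between two window sizes (assumed: simple difference)."""
    return avg_large_window - avg_small_window

# Made-up per-trial accuracies (%) for one method at one window size.
trials = [60.0, 62.0, 61.0]
avg, sd = summarize(trials)
print(avg, sd)            # 61.0 1.0
print(growth(55.0, avg))  # 6.0
```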

Sensitivity and Robustness
To verify the sensitivity and performance of the three feature-extraction methods, the accuracy and standard deviations under different sliding window sizes were analysed. In the following, the results of each performance index are demonstrated and discussed in detail. Figure 4 shows how the average accuracy and standard deviation vary with the sliding window size. As seen from the figure, the lowest performance of the proposed method appeared when the sliding window was 30 s. As the sliding window size increased, the performance of the proposed method rose consistently and finally reached the highest accuracy of 80.35%. In comparison, PCA did not perform as well as the proposed method for any sliding window size greater than 40 s. LDA and the case without data filtering gave better performance than the proposed method when the sliding window size was smaller than 60 s. However, all accuracies were less than 60%. The proposed method started to surpass all three other methods when the sliding window rose to 60 s. When the sliding window increased further to 90 s, the proposed method's accuracy grew to over 60%, while the remaining methods were still below 60%.
Table 4 reports the numerical testing results of the above experiments. As seen from the table, the accuracy grew in general for all methods, and the proposed method received the highest positive growth for most cases. Furthermore, the standard deviations of the proposed method were steady for different sliding window sizes. The largest standard deviation of the proposed method was merely 0.7064%, which was even smaller than the smallest standard deviations from LDA (i.e., 2.9167%), PCA (i.e., 4.1153%) and without data filtering (i.e., 1.0551%). This indicated that the proposed method had outstanding robustness properties, as measured by the standard deviation.

Table 3. A comparison of the algorithms using classification accuracy in percentage for the events (1) frying canola oil on an electric hob, (2) frying olive oil on an electric hob, (3) frying olive oil on a gas hob, (4) spraying of household pesticide, (5) lighting a cigarette and allowing it to smoulder, (6) no activities and (7) venting session.

Conclusions
In this study, the correlation of pollutants is mathematically calculated for the detection of emission sources in indoor environments. Extensive experiments have confirmed the effectiveness and efficiency of our correlation calculation for the real-time detection of emission sources under various experimental settings, covering seven emission events: (1) frying canola oil on an electric hob, (2) frying olive oil on an electric hob, (3) frying olive oil on a gas hob, (4) spraying household pesticide, (5) lighting a cigarette and allowing it to smoulder, (6) no activities, and (7) venting session.
It is worth noting that, compared to the case without capturing pollutant interaction data, the detection accuracy of the proposed correlation calculation increases by over 20%. It follows that pollutant interaction indicates a predictive relationship that can be exploited to identify an emission event. The additional information (pollutant correlation) related to the emission sources is necessary and useful for the identification of emission sources in indoor environments.
Funding: This research received no external funding.