A Local Weighted Nearest Neighbor Algorithm and a Weighted and Constrained Least-Squared Method for Mixed Odor Analysis by Electronic Nose Systems

A great deal of work has been done to develop techniques for odor analysis by electronic nose systems. These analyses mostly focus on identifying a particular odor by comparing it with a known odor dataset. However, in many situations, it would be more practical if each individual odorant could be determined directly. This paper proposes two methods for such odor component analysis for electronic nose systems. First, a K-nearest neighbor (KNN)-based local weighted nearest neighbor (LWNN) algorithm is proposed to determine the components of an odor. According to the component analysis, the odor training data are first categorized into several groups, each of which is represented by its centroid. The examined odor is then assigned to the class of the nearest centroid. The distance between the examined odor and the centroid is calculated based on a weighting scheme, which captures the local structure of each predefined group. To further determine the concentration of each component, odor models are built by regression. Then, a weighted and constrained least-squares (WCLS) method is proposed to estimate the component concentrations. Experiments were carried out to assess the effectiveness of the proposed methods. The LWNN algorithm is able to classify mixed odors with different mixing ratios, while the WCLS method can provide good estimates of component concentrations.


Introduction
An electronic nose is a biomimetic olfactory system developed based on chemical sensor principles, electronic system design and data analysis techniques. In the biological olfactory system, there are about 350 different odorant receptors in humans and about 1,000 in mice. Different odors are recognized by different combinations of odorant receptors [1,2]. Learning from this mechanism, an array of different chemical sensors is used in the design of an electronic nose. An odor can be identified by classifying its response pattern generated by the sensor array in the electronic nose [3][4][5].
The state-of-the-art techniques for sensor array data analysis and the applicability of each technique have been discussed by Jurs [6]. One type of data analysis method is classification, which aims to assign an object to one of the predefined classes. The K-Nearest Neighbor (KNN) classifier is one of the most widely applied classification methods; it classifies an item according to the majority vote of its K nearest items. Instead of setting a global value for K, Locally Adaptive Nearest Neighbor (Local KNN) computes a locally varying K value for each query point by using the information from the neighbors of the query point [7]. On the other hand, since features may not be equally effective for classification, Discriminant Adaptive Nearest Neighbor (DANN) uses a locally weighted distance measurement scheme to compute the distance between two points [8]. The accuracy of KNN and its two variants, Local KNN and DANN, was examined by Bicego [9]. These three KNN-based methods were comparable in accuracy on the examined data sets when computational cost is not taken into account.
Dimensionality reduction methods, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), seek to reduce the data size required for classification. PCA is an unsupervised method, which finds a set of orthogonal projection directions that capture the largest amount of variation in the data without using the class information of the data. On the other hand, LDA makes use of the class labels to find a lower-dimensional vector space for best class separation. For example, a 100% classification rate was achieved by LDA for classification of different tomato maturity states and different qualities of green tea samples [10,11]. The study in [12] indicates that PCA could yield superior classification results when a small training set is used. However, traditional classification methods would require significant computational cost if the number of sensors is large.
Regression analysis is a statistical data analysis approach which seeks a continuous fitting function of independent variables to model the dependent variables. The least-squares method can be used to find such a fitting function by minimizing the sum of squared differences between each known data point and the fitting function. NASA's Jet Propulsion Laboratory (JPL) used a set of self-developed polymer composite sensors to quantify single and mixed contaminants [13,14]. A second-order polynomial regression based on the assumption of additive linearity was used to model the relationship between the gas concentration vector and the sensor responses. Carmel et al. [15] made the same assumption and further considered the relative influence of each component on the total mixture response. The modified model provided a promising result when more than two components were present in the examined mixture.
Although classification methods represent a promising technology for analyzing electronic nose data, their applications are mainly focused on discrimination between different odors. Moreover, odors containing the same components but with different mixing ratios are generally perceived as different smells. For this reason, a traditional classification method is not applicable for differentiating such smells. A more practical solution is to partition the odor space into subspaces and classify an odor into one of the subspaces. This paper adopts a supervised strategy to categorize the mixed odor dataset into several groups according to the components. The nearest neighbor method is then used to classify the response pattern into one of the predefined groups. A weighting scheme is proposed to re-scale the distance between two data points so that the classification accuracy can be improved. Another solution for the analysis of odor mixtures is to directly determine the concentration of each component present in the examined mixture by analyzing the response pattern. Regression methods are applied in this paper to build odor models. The component concentrations are estimated by solving a weighted and constrained least-squares problem, in which each squared error term is weighted to reflect the reliability of the corresponding estimated sensor response.
The rest of this paper is organized as follows: Firstly, the proposed methods for analyzing mixed odors will be described in Section 2. Then, the data collection methods and experimental results will be provided to evaluate and support the proposed methods in Section 3. Finally, Section 4 will conclude the contribution of this work.

The Proposed Analysis Methods
Traditionally, an electronic nose is not designed to analyze mixed odors but merely to differentiate between different smells. This paper proposes to determine the components that are most significant in a mixture by analyzing the sensor response pattern of the odor mixture. This work is based on the following two assumptions [13][14][15]:
- Homogeneity: The sensor response to an odor is proportional to the odor concentration.
- Linear Additive: The sensor response to a mixture is equal to the linear summation of the sensor response to each of its components.
Based on the assumption of homogeneity, the normalized mixed odor dataset can be categorized according to the contained components without considering the concentration of each component. For example, the categorization results for odors of three components would be like the one shown in Figure 1. The response pattern of the examined odor would then be classified into one of the predefined classes by using a classification method. However, the sensors may not provide enough useful information to classify an odor. A dimensionality reduction method, such as PCA or LDA, can then be applied to select the significant features and achieve a better data partition. However, both PCA and LDA require solving a complex matrix eigenvalue problem in order to find the projection directions. In this paper, a simple local weighting scheme is proposed to properly weight each feature for a class.

Locally Weighted Nearest Neighbor (LWNN)
Assume that there are $n$ predefined classes. In the nearest neighbor classification method, which is actually a KNN method with $K = 1$, a predefined class $C_i$ is represented by its centroid $\mu_i \in \mathbb{R}^m$, where $1 \le i \le n$ and $m$ is the number of sensors. Here, we define the class centroid as the mean point of the class:

$$\mu_{i,j} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j, \tag{1}$$

where $\mu_{i,j}$ is the $j$th component of $\mu_i$. A testing point $x$ is assigned to the class of the nearest centroid.

In the proposed weighting scheme, instead of directly computing the Euclidean distance between the testing point $x$ and the examined centroid $\mu_i$, an independent weighting vector $w_i$ is associated with each class to re-scale the Euclidean distance, i.e.:

$$d_w(x, \mu_i) = \sqrt{\sum_{j=1}^{m} w_{i,j} \, (x_j - \mu_{i,j})^2}, \tag{2}$$

where $m$ is the number of sensors. For each class, the weighting vector is determined by minimizing the sum of squared weighted distances from each training data point to the centroid of the class it belongs to, that is:

$$\min_{w_i} \sum_{x \in C_i} \sum_{j=1}^{m} w_{i,j} \, (x_j - \mu_{i,j})^2 \quad \text{subject to} \quad \prod_{j=1}^{m} w_{i,j} = 1, \tag{3}$$

for $1 \le i \le n$. The optimization problem in Equation (3) can be solved by Lagrange multipliers. The optimal weighting vector associated with each class is computed as:

$$w_{i,j} = \frac{\left( \prod_{k=1}^{m} \sigma_{i,k} \right)^{1/m}}{\sigma_{i,j}}, \tag{4}$$

where:

$$\sigma_{i,j} = \sum_{x \in C_i} (x_j - \mu_{i,j})^2.$$

Note that Equation (4) indicates that if the points belonging to the same class exhibit a strong correlation in the $j$th feature, that is, a small scatter $\sigma_{i,j}$ around the centroid, a large weight is assigned to this feature for the class. In addition, the optimal weighting vectors can be computed with little effort since each weighting term has a closed-form expression.
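The closed form of the weights can be checked with a short Lagrangian argument. The sketch below assumes that the constraint in Equation (3) fixes the product of the weights of a class to one, as is done for the sensor weights in the concentration estimation step; the symbols $\sigma_{i,j}$ (per-feature scatter of class $i$) and $\lambda_i$ (the multiplier) are notation chosen here for illustration.

```latex
\begin{align*}
L(w_i, \lambda_i) &= \sum_{j=1}^{m} w_{i,j}\,\sigma_{i,j}
  - \lambda_i \Bigl( \prod_{j=1}^{m} w_{i,j} - 1 \Bigr),
  \qquad \sigma_{i,j} = \sum_{x \in C_i} (x_j - \mu_{i,j})^2, \\
\frac{\partial L}{\partial w_{i,j}} &= \sigma_{i,j} - \lambda_i \prod_{k \neq j} w_{i,k}
  = \sigma_{i,j} - \frac{\lambda_i}{w_{i,j}} = 0
  \;\Longrightarrow\; w_{i,j} = \frac{\lambda_i}{\sigma_{i,j}}, \\
\prod_{j=1}^{m} w_{i,j} = 1
  &\;\Longrightarrow\; \lambda_i^{m} = \prod_{k=1}^{m} \sigma_{i,k}
  \;\Longrightarrow\; w_{i,j} = \frac{\bigl( \prod_{k=1}^{m} \sigma_{i,k} \bigr)^{1/m}}{\sigma_{i,j}}.
\end{align*}
```

The second line uses the constraint itself to replace the product over $k \neq j$ by $1 / w_{i,j}$. The result is the geometric mean of the per-feature scatters divided by the scatter of feature $j$, so a feature with small within-class scatter receives a large weight.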
As mentioned above, the proposed Locally Weighted Nearest Neighbor algorithm (LWNN) uses the weighting scheme to re-scale the Euclidean distance between two data points when finding the nearest neighbor. Unlike the original KNN algorithm, the proposed LWNN algorithm has a training stage, which computes both the centroid and the associated independent weighting vector of each predefined class (Table 1). Then, as shown in Table 2, LWNN classifies a testing data point as the class of the centroid that has the minimum weighted distance to the examined point. Note that no additional step is needed to determine the best K value in order to increase the classification accuracy. In practice, as will be shown in Section 3, the experimental results for determining the component set demonstrate that the accuracy of the proposed LWNN classifier is comparable to that of the commonly used KNN-based methodologies.
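Table 1. Training stage of the LWNN algorithm.
For each predefined class:
(1) Compute its centroid and its associated weighting vector by Equations (1) and (4).

Table 2. Testing stage of the LWNN algorithm.
For each testing data point:
(1) Compute the weighted distances between the testing point and each of the class centroids by Equation (2).
(2) Classify the testing point as the class whose centroid has the minimum weighted distance to it.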
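The two stages above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `train_lwnn` and `classify_lwnn` are hypothetical, the weights are assumed to be the geometric mean of the per-feature scatters divided by each scatter (product of weights equal to one), and the small floor on the scatter is an added safeguard against division by zero.

```python
import math


def train_lwnn(samples_by_class):
    """Training stage: return {label: (centroid, weights)} for every class.

    samples_by_class maps a class label to a list of equal-length feature vectors.
    """
    model = {}
    for label, samples in samples_by_class.items():
        m = len(samples[0])
        # Class centroid: mean of the training points of the class.
        centroid = [sum(x[j] for x in samples) / len(samples) for j in range(m)]
        # Per-feature scatter of the class around its centroid.
        scatter = [sum((x[j] - centroid[j]) ** 2 for x in samples) for j in range(m)]
        # Assumption: floor the scatter so a perfectly constant feature does not divide by zero.
        scatter = [max(s, 1e-12) for s in scatter]
        # Weight = geometric mean of scatters / scatter, computed in log space.
        log_gm = sum(math.log(s) for s in scatter) / m
        weights = [math.exp(log_gm - math.log(s)) for s in scatter]
        model[label] = (centroid, weights)
    return model


def classify_lwnn(model, x):
    """Testing stage: return the label whose centroid has the smallest weighted distance to x."""
    def squared_weighted_distance(centroid, weights):
        return sum(w * (xj - cj) ** 2 for w, xj, cj in zip(weights, x, centroid))

    # The square root is monotonic, so comparing squared distances gives the same class.
    return min(model, key=lambda label: squared_weighted_distance(*model[label]))


# Toy data (made up for illustration): class "A" is tight in feature 0, class "B" in feature 1.
toy = {
    "A": [(1.0, 0.0), (1.0, 2.0), (1.1, 4.0), (0.9, 2.0)],
    "B": [(0.0, 3.0), (2.0, 3.0), (4.0, 3.1), (2.0, 2.9)],
}
toy_model = train_lwnn(toy)
toy_label = classify_lwnn(toy_model, (1.1, 3.4))
```

In this toy example the point (1.1, 3.4) is closer to the centroid of "B" in plain Euclidean distance, but the class-specific weights emphasize the feature in which each class is consistent, so the weighted rule assigns it to "A".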

Odor Concentration Estimation by Weighted Least-Squares Method
Although the proposed LWNN method can efficiently determine the set of components present in an odor mixture, the concentration of each component is still unknown. A regression method can be used to estimate the component concentrations. According to the assumption of homogeneity, the response generated by the $i$th sensor to the $j$th odor component at a concentration $c_j$ can be formulated as:

$$r_{i,j}(c_j) = a_{i,j} \, c_j, \tag{5}$$

where $a_{i,j}$ is the proportionality coefficient obtained by regression. Based on the linear additive assumption, the response of the $i$th sensor when exposed to an odor mixture consisting of $p$ components with concentrations $c_1, c_2, \ldots, c_p$, respectively, can be formulated as:

$$R_i(c_1, \ldots, c_p) = \sum_{j=1}^{p} \beta_{i,j} \, r_{i,j}(c_j) + \beta_{i,\mathrm{offset}}. \tag{6}$$

Note that the response of each component is weighted with a weighting term $\beta_{i,j}$, and an offset term $\beta_{i,\mathrm{offset}}$ is introduced in Equation (6) to get a better fit for the sensor responses. According to [15], this weighting scheme on the response of each component can be seen as a reflection of the relative influence of each component on the total response.
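To make the two odor models concrete, the following sketch fits a pure odor response for one sensor and one component, and then evaluates a mixed odor response. It assumes the pure model is a line through the origin (response proportional to concentration) and the mixed model is a weighted sum of pure responses plus an offset; the function names and all numbers are illustrative, not values from the experiments.

```python
def fit_pure_slope(concentrations, responses):
    """Least-squares slope a for the model r = a * c (line through the origin)."""
    numerator = sum(c * r for c, r in zip(concentrations, responses))
    denominator = sum(c * c for c in concentrations)
    return numerator / denominator


def predict_mixture(slopes, betas, beta_offset, concentrations):
    """Mixed odor response of one sensor: offset + sum_j beta_j * a_j * c_j."""
    return beta_offset + sum(
        b * a * c for b, a, c in zip(betas, slopes, concentrations)
    )


# Hypothetical calibration data for one sensor and one pure component.
slope = fit_pure_slope([10.0, 20.0, 30.0], [0.5, 1.0, 1.5])

# Hypothetical two-component mixture seen by the same sensor.
mixture_response = predict_mixture(
    slopes=[0.05, 0.02], betas=[1.0, 0.5], beta_offset=0.1, concentrations=[10.0, 20.0]
)
```

Fitting the weighting and offset terms of the mixed model is an ordinary multiple linear regression on mixture measurements and is omitted here for brevity.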
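The parameters in Equation (6) can be obtained by applying a linear least-squares method. Then, the concentration of each mixture component can be estimated by solving the following least-squares formulation:

$$\min_{c_1, \ldots, c_p} \sum_{i=1}^{m} \bigl( R_i(c_1, \ldots, c_p) - y_i \bigr)^2 \tag{7}$$

subject to:

$$c_j \ge 0, \quad j = 1, \ldots, p,$$

where $R_i(c_1, \ldots, c_p)$ is the response of the $i$th sensor predicted by Equation (6), $p$ is the number of components, $m$ is the number of sensors, and $y_i$ is the $i$th sensor response of the examined odor mixture. The nonnegativity constraints are introduced in Equation (7) in order to get a feasible solution. Moreover, to reflect the reliability of each estimated sensor response, a weighting scheme on the sensor responses is proposed to properly weight each squared error term in Equation (7), and thus the following formulation is to be solved:

$$\min_{c_1, \ldots, c_p} \sum_{i=1}^{m} w_i \bigl( R_i(c_1, \ldots, c_p) - y_i \bigr)^2 \tag{8}$$

subject to:

$$c_j \ge 0, \quad j = 1, \ldots, p.$$

In order to get a closed-form expression for each of the weighting terms, the product of the weighting terms is set to one:

$$\prod_{i=1}^{m} w_i = 1.$$

According to Equation (4), the weighting term of the $i$th sensor is defined as:

$$w_i = \frac{\left( \prod_{k=1}^{m} e_k \right)^{1/m}}{e_i} \tag{9}$$

and:

$$e_i = \sum_{t=1}^{N} \bigl( R_i^{(t)} - y_i^{(t)} \bigr)^2,$$

where $N$ is the number of training data, $m$ is the number of sensors, $R_i^{(t)}$ is the $i$th sensor response predicted by Equation (6) for the $t$th training data point, and $y_i^{(t)}$ is the $i$th observed sensor response of the $t$th training data point. Equation (9) indicates that if the predicted sensor response is close to the observed sensor response, a higher weight will be assigned to that response.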
The proposed methodology that uses a weighted and constrained least-squares method (WCLS) to estimate the component concentrations of a mixed odor is presented in Table 3 and Table 4. In the training stage, a set of odor models for both pure and mixed odors is built by using the least-squares method. Moreover, a set of weighting terms is computed and then used in the testing stage to estimate the concentration of each component present in an odor mixture.
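A compact sketch of the two WCLS stages is given below. It is not the authors' solver: the nonnegative weighted least-squares problem is solved here with a simple projected coordinate descent, chosen only because it needs no external library. The sketch assumes the mixed odor model is linear in the concentrations, so that each sensor has one gain per component (the product of its weighting term and pure odor slope) and one offset; `sensor_weights`, `estimate_concentrations`, `gains` and `offsets` are hypothetical names, and the floor on the training error is an added safeguard.

```python
import math


def sensor_weights(predicted, observed):
    """Training stage: one weight per sensor from the squared training residuals.

    predicted and observed are lists of per-sample response vectors of equal shape.
    The weights are the geometric mean of the per-sensor errors divided by each
    error, so their product is one.
    """
    m = len(predicted[0])
    errors = [
        max(sum((p[i] - o[i]) ** 2 for p, o in zip(predicted, observed)), 1e-12)
        for i in range(m)
    ]
    log_gm = sum(math.log(e) for e in errors) / m
    return [math.exp(log_gm - math.log(e)) for e in errors]


def estimate_concentrations(gains, offsets, weights, y, sweeps=500):
    """Testing stage: minimize sum_i w_i * (sum_j gains[i][j] * c_j + offsets[i] - y[i])**2
    subject to c_j >= 0, by projected coordinate descent starting from c = 0."""
    m, p = len(gains), len(gains[0])
    c = [0.0] * p
    residual = [offsets[i] - y[i] for i in range(m)]  # model minus observation at c = 0
    for _ in range(sweeps):
        for j in range(p):
            curvature = sum(weights[i] * gains[i][j] ** 2 for i in range(m))
            if curvature == 0.0:
                continue  # no sensor responds to component j
            gradient = sum(weights[i] * gains[i][j] * residual[i] for i in range(m))
            # Exact minimizer along coordinate j, clipped at zero.
            new_cj = max(0.0, c[j] - gradient / curvature)
            delta = new_cj - c[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] += gains[i][j] * delta
                c[j] = new_cj
    return c


# Hypothetical model with three sensors and two components.
example_gains = [[0.05, 0.01], [0.02, 0.04], [0.03, 0.03]]
example_offsets = [0.1, 0.0, 0.05]
example_estimate = estimate_concentrations(
    example_gains, example_offsets, [1.0, 1.0, 1.0], [0.8, 1.0, 0.95]
)
```

With the made-up numbers above, the observed responses were generated from concentrations of 10 and 20, and the estimate converges to those values. When the unconstrained optimum would have a negative concentration, the clipping step keeps that component at zero.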
Table 3. Training stage of the WCLS method.
(1) Build the pure odor model according to Equation (5).
(2) Build the mixed odor model according to Equation (6).
(3) Compute each of the weighting terms by Equation (9).

Table 4. Testing stage of the WCLS method.
For each testing odor data point:
(1) Estimate the component concentrations by solving the weighted least-squares problem in Equation (8).

Although mixing of odors can yield a linear additive trend, this is not necessarily common. The effect of mixing can often lead to (1) masking or dominance by a stronger component [16], (2) hypoadditivity (lower than the sum or average) [17,18], and (3) synergistic effects [19,20].

Figure 2 shows the experimental setup used to collect the volatile organic compound (VOC) data for analysis. The target gas for the test was produced by a standard air generator (AID360). The solvent of the testing gas sat inside the diffusion tube of the standard air generator at room temperature. A constant heater was used to increase the temperature in the tube to cause the organic solvent to evaporate. Once the whole system reached a steady temperature and flow rate, a testing gas with a stable concentration was obtained. The diffusion rate can be theoretically controlled by the temperature setting, and the air concentration can be accurately calculated by measuring the weight loss of the organic solvent. The testing gas was carried by a steady air stream from the air compressor. The gas flow rate was controlled by the mass flow controller (MFC). The testing air was then infused into the glass chamber connected to a commercial Cyranose 320 electronic nose, which consists of 32 carbon black composite sensors. After completing the experiment, the testing air was pumped out to a Fourier transform infrared spectrophotometer (FTIR) with a built-in database for cross-validation, and dry air was again used to purge the chamber. A collection of 133 mixed odor samples collected by the Cyranose 320 was uploaded to a personal computer after the experiment for further analysis.
Three highly volatile solvents (methanol, ethanol and acetone) were mixed with different mixing ratios by using multiple air generators and mass flow controllers. The collected data were randomly divided into a training set and a testing set, containing 67 and 66 odor samples, respectively. Since there are eight different types of sensors in the Cyranose 320, eight response features are derived by averaging the responses generated by the four sensors of the same type in order to get a more stable sensor response. That is to say, an odor is represented by the odor pattern formed from eight averaged sensor responses.
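The feature extraction step can be illustrated with a short sketch. The ordering of the 32 channels is not given in the text, so the code below assumes, purely for illustration, that the four sensors of each type occupy consecutive positions in the response vector; `average_by_type` is a hypothetical helper name.

```python
def average_by_type(responses, group_size=4):
    """Average consecutive groups of sensor responses into one feature per group.

    Assumption: sensors of the same type are adjacent in the input vector.
    """
    if len(responses) % group_size != 0:
        raise ValueError("number of responses must be a multiple of group_size")
    return [
        sum(responses[k:k + group_size]) / group_size
        for k in range(0, len(responses), group_size)
    ]


# A made-up 32-channel reading reduced to an 8-feature odor pattern.
raw_reading = [float(v) for v in range(32)]
odor_pattern = average_by_type(raw_reading)
```

If the actual channel layout interleaves the sensor types, the grouping would need an explicit index map instead of consecutive slices.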

Odor Component Determination Results
This section presents the performance of the examined KNN-based methodologies in determining the component set of an odor. Each odor is assigned to one of the following component sets:
- M: methanol.
- E: ethanol.
- A: acetone.
- ME: mixture of methanol and ethanol.
- EA: mixture of ethanol and acetone.
- AM: mixture of acetone and methanol.
- MEA: mixture of methanol, ethanol and acetone.
The results are summarized in Table 5. For each method, the K value that provided the best performance on the testing set is marked. As shown, the LDA + KNN strategy outperforms the other methods on the collected odor dataset, while PCA has the worst performance. The reason is that PCA seeks to separate all the data points as widely as possible, so the local correlation structure of each component set may be distorted. As shown in Figure 5, PCA distributes the data points widely, yet points of different classes remain mixed together. In contrast, LDA can discriminate between different classes and keep the data points of the same class as compact as possible. Note that the projections of LDA over the testing dataset in Figure 5 match the seven partitions in Figure 1. Although KNN applied with LDA outperforms the proposed LWNN method, LWNN is the most efficient among the examined KNN-based methods since no additional computation is needed to determine the best K value. Moreover, LWNN does not require solving any costly eigenvalue problem, which is necessary for both PCA and LDA. Overall, the proposed LWNN method yields an acceptable accuracy in classifying and identifying the component set.

Estimation Results for Mixed Odors
This section reports the performance of the proposed methodology that uses a weighted and constrained least-squares method (WCLS) to estimate the component concentrations. The estimation error of each testing odor is computed from the errors of its $p$ estimated component concentrations, where $p$ is the number of components. Figure 6 shows the estimation errors of the regular constrained least-squares method (CLS) and the proposed weighted and constrained least-squares method (WCLS) over the testing odor dataset. The error presented is the averaged error for each concentration combination. As shown, the proposed WCLS methodology generally produces much better estimates than CLS: the error curves of WCLS are almost always lower than those of CLS, especially for mixed odors. As presented in Table 6, the maximum error for the estimates of mixtures containing all three components is no more than 6 ppm. However, when the number of components decreases, the estimates become worse (Table 7 and Table 8).

Conclusion
This study aimed to determine the mixture components and estimate the concentration of each contained component, assuming homogeneity and linear additivity. A KNN-based method, LWNN, is proposed to determine the components present in a mixed odor by classifying its sensor responses into the closest of the previously partitioned component sets. Furthermore, a local weighting scheme, which associates each component set with an independent weighting vector, is proposed to re-scale the distance between a testing data point and the centroid of a component set. For each component set, a higher weight is assigned to a sensor response when the sensor yields a very consistent response to that class.
To further estimate the component concentrations, odor models have been built by regression. Based on these odor models, a weighted and constrained least-squares problem is solved to estimate the concentration of each component present in the examined mixture. A weighting scheme is adopted to reflect the reliability of each estimated sensor response. If the estimated response value of a sensor is close to the observed response, a large weight is assigned to the squared error between the estimated and observed sensor responses.
To evaluate the effectiveness of the proposed methods, a set of odor data has been collected by mixing three highly volatile solvents with different mixing ratios. LDA has been noted for its ability to discriminate between different component sets despite its high computational cost. Furthermore, the proposed LWNN method is shown to be comparable to the commonly applied KNN-based methodologies but with lower computational cost, since no additional computation is needed to determine the best K value. However, LWNN is not suitable for estimating component concentrations and becomes complex when the number of components increases. The proposed weighted and constrained least-squares method (WCLS) is also shown to provide good estimates of component concentrations, especially for odor mixtures, yet WCLS may provide erroneous concentration estimates for pure odors.