1. Introduction
Various gases usually exist as mixtures in industrial and living environments, and identifying the composition and analyzing the concentration of a gas mixture is particularly important, so the detection of gas mixture components has become a hot research field. There are three methods for gas detection. The first is sensory evaluation, an evaluation method based on the reaction of human sensory organs to gas; however, sensory evaluation is limited by the subjectivity of the human olfactory system, and some gases can also harm the human body. The second is chemical analysis, such as spectroscopy, gas chromatography and mass spectrometry, which uses advanced analytical equipment to measure the composition and concentration of gases; however, the difficulty of sampling, the complexity of operation, the high cost of equipment and the poor real-time performance limit the application of these methods to some extent. The last method is to use gas sensors to detect gases; this method has a low cost and is easy to operate, but in complex environments or when there are many types of gases, it is difficult to detect the gases using a single sensor. In addition, other influencing factors, such as the sensitivity of sensing materials to target gases, the reproducibility of sensor arrays and limitations in the number of discriminable gases, all restrict the application of sensors [1,2,3].
Due to the limitations of the above methods, Gardner published a review article on the electronic nose, formally proposing the concept of the “electronic nose” in 1994 [4]. The electronic nose, also known as an artificial olfactory system, is a bionic detection technology that simulates the working mechanism of biological olfaction. It uses sensor array technology, which has the advantages of fast response, high sensitivity, low cost and easy processing. Electronic nose technology not only solves the subjectivity problem of sensory analysis but also overcomes the complexity and expense of chemical analysis methods. It is widely applied to the analysis and detection of gases in various fields, such as pollution control [5,6], medical technology [7,8], oil exploration [9], food safety [10,11], agricultural science [12,13] and environmental science [14,15], among many other applications [16,17,18]. Therefore, electronic nose technology has become an important research direction in the field of gas detection. The electronic nose system is mainly composed of a gas sensor array, a signal preprocessing module and pattern recognition, as shown in Figure 1.
The sensor array is composed of various types of sensors and converts chemical signals into electrical signals via A/D conversion. Signal preprocessing mainly removes noise from, extracts features from and otherwise processes the sensor response signals. Pattern recognition performs the classification and concentration estimation of the measured gases using machine learning algorithms. At present, the key research directions of the electronic nose are mainly the performance improvement of gas sensors and the study of various machine learning methods for gas sensing [19]. Due to the nonlinear response of MOS gas sensors, it is difficult to improve mixed gas classification and concentration prediction simply by relying on the selection of gas-sensitive materials; therefore, appropriate machine learning algorithms are needed to solve the current problems [20]. The machine learning algorithm plays an extremely important role, and its accuracy, time efficiency and anti-interference ability all affect the decision result. Among many research results, the literature [21] argues that more intelligent pattern recognition technology is needed to realize the potential of electronic nose technology, and the literature [22] proposes that reasonable improvement of algorithms is an important support for the development of machine olfaction.
Before pattern recognition, the data sets need to be preprocessed and their features extracted, which provides reasonable data sets for the subsequent recognition models. The main steps of data preprocessing are data cleaning, data reduction and data transformation. Data cleaning “cleans” the data by filling in missing values, smoothing noisy data, smoothing or removing outliers, and resolving data inconsistencies. The averaging method can solve the problem of missing data by filling the mean of all data into the position to be compensated [23]. Based on the nearest-neighbor principle, the data at the nearest location can be selected as the compensation data [24]. The regression-model compensation method builds a regression model and fills the compensation position with the model’s predicted value [25]. Data reduction represents the data set by a much smaller one that still largely maintains data integrity; mining on the reduced data set is more efficient and produces nearly identical analysis results. Common strategies are dimensionality reduction and dimension transformation, and common lossy dimension-transformation methods include principal component analysis (PCA), linear discriminant analysis (LDA) and singular value decomposition (SVD) [26,27]. Data transformation includes normalization, discretization and sparse processing of the data to fit the purpose of mining. In the literature [28], data standardization and baseline processing were used as data processing methods to realize the identification of nitrogen dioxide and sulfur dioxide in air pollution. Feature extraction methods include PCA and LDA: when the data set is small, the feature extraction effect of PCA is better than that of LDA, but when the data set is large, LDA performs better than PCA [29].
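As a minimal illustration of the averaging compensation method described above (the data and function names are ours, not from the cited works), missing sensor readings can be filled with the per-channel mean:

```python
import numpy as np

def mean_impute(X):
    """Fill missing entries (NaN) of each sensor channel with the
    channel mean, as in the averaging compensation method."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)      # per-channel mean, ignoring NaN
    idx = np.where(np.isnan(X))            # positions to compensate
    X[idx] = np.take(col_means, idx[1])    # fill with the column mean
    return X

# toy response matrix: rows = samples, columns = sensor channels
X = np.array([[0.8, np.nan, 1.2],
              [0.9, 0.4,    np.nan],
              [1.0, 0.6,    1.1]])
print(mean_impute(X))
```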
In the classification study of mixed gases, researchers mostly adopt machine learning methods [30,31,32,33,34], and the classification algorithm needs to be selected according to the characteristics of the samples to find the most suitable scheme [35,36,37,38,39,40,41]. Sunny used four thin-film sensors to form a sensor array to identify a gas mixture and estimate its concentration; PCA was used to extract the features of the response signals, and ANN and SVM were used for category recognition, achieving good recognition results [42]. Zhao studied the recognition of the components of mixtures of organic volatiles: an array composed of four sensors was used to identify formaldehyde when the background gas was acetone, ethanol or toluene; PCA was used for dimensionality reduction, and MLP, SVM and ELM were used as classifiers, among which SVM achieved the best accuracy and ELM required the shortest training time [43]. Jang used SVM and a paired-graph scheme combined with a sensor array composed of semiconductor sensors to classify CH4 and CO, and obtained a high recognition accuracy [44]. Jung used a sensor array to collect gas data and then compared SVM with a fuzzy ARTMAP network experimentally; the recognition time of SVM was shorter than that of the fuzzy ARTMAP network [45]. Zhao adopted a weighted discriminant extreme learning machine (WDELM) as a classification method; WDELM assigns a different weight to each specific sample by using a flexible weighting strategy, which enables it to perform classification tasks under an unbalanced class distribution [46].
For gas concentration analysis, there are multiple regression [47], neural network [48,49], SVR [50] and other methods [51,52,53,54,55,56]. Zhang used the WCCNN-BiLSTM model to automatically extract time–frequency domain features of the dynamic response signals from the original signals to identify unknown gases, while the time-domain characteristics of the steady-state response signal were automatically extracted by a many-to-many GRU model to accurately estimate the gas concentration [57]. Piotr proposed an improved cluster-based ensemble model to predict ozone; each improved spiking neural network was trained on a separate set of time series, and the ensemble model provided better prediction results [58]. Liang used the AdaBoost classification algorithm to classify the local features of infrared spectra and carried out PLS local modeling according to the different features to predict the concentration of a gas mixture component; this method solves the problems of difficult identification and inaccurate quantitative analysis of alkane mixture components in traditional methods [59]. Adak used the multivariate linear regression (MVLR) algorithm to predict the concentrations of a two-gas mixture of acetone and methanol; the relative errors for acetone and methanol were lower than 6% and 17%, respectively [60].
Based on the above literature, most of the methods are suitable for data sets with relatively balanced categories. When the number of samples is extremely unbalanced, a traditional method that takes the overall classification accuracy as its learning objective pays too much attention to the majority categories; as a result, the classification or regression performance on the minority class samples is degraded, which means traditional machine learning does not work well on extremely unbalanced data sets. Secondly, the PCA method used in the literature addresses linear problems, while most problems in real environments are nonlinear. In addition, ML algorithms often involve multiple parameters that are difficult to determine; in the literature, the parameters of learning algorithms such as neural networks or ELM are often obtained by trial and error or by experience, yet the selection of parameters plays a crucial role in the performance of an algorithm. When a learning model without optimal parameters is used to detect the mixed gas, it cannot be reasonably compared with other algorithms under the same evaluation criteria. To solve the above problems, this paper presents a gas mixture component detection method suitable for the electronic nose under unbalanced conditions. For the problem of extremely unbalanced sample numbers and too few samples, the sample expansion methods SMOTE, ADASYN, B-SMOTE, S-SMOTE and CSL-SMOTE are employed to synthesize new samples artificially, so as to alleviate the problem. For nonlinear problems, the Kernel Principal Component Analysis (KPCA) method is used for feature extraction; the kernel technique extends PCA to nonlinear problems. To solve the problem of multiple, hard-to-determine parameters, the PSO and GA optimization methods are used to optimize the parameters of the classification and regression models, which makes it convenient for the classification and regression methods to identify and classify the mixed gas and estimate its concentration.
The rest of the paper is structured as follows. In Part II, the methods of feature extraction, sample expansion, classification recognition and concentration detection are briefly introduced. In Part III, a new method for detecting mixed gas based on the electronic nose is introduced in detail. The verification experiments are carried out in Part IV, and the experimental results are analyzed and discussed. Part V is the summary and outlook.
2. Methods
2.1. Kernel Principal Component Analysis
The KPCA transforms the linearly inseparable sample input space into a separable high-dimensional feature space through a kernel function and performs PCA in this high-dimensional space. Compared with PCA, which solves linear problems, KPCA uses the kernel technique to extend the linear method to nonlinear problems [12].
Set $X = \{x_1, x_2, \ldots, x_n\}$ as the observation samples after pretreatment, containing $n$ samples in $\mathbb{R}^d$, where $x_i$ represents the $i$-th observed sample of dimension $d$. The covariance matrix of the samples mapped to a high-dimensional feature space $F$ is expressed as

$C = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\phi(x_i)^{T}$ (1)

where $\phi(\cdot): \mathbb{R}^d \rightarrow F$ is a nonlinear mapping.

Eigenvalue decomposition of covariance matrix $C$:

$Cv = \lambda v$ (2)

where $\lambda$ and $v$, respectively, represent the eigenvalues and eigenvectors of covariance matrix $C$, and $v$ is the eigenvector in the feature space, that is, the direction of the principal component. There is a coefficient vector $\alpha = (\alpha_1, \ldots, \alpha_n)^{T}$ for a linear representation of the eigenvector $v$:

$v = \sum_{i=1}^{n} \alpha_i \phi(x_i)$ (3)

Substitute (3) into (2) and multiply both ends by $\phi(x_j)^{T}$ to obtain the following equation:

$\frac{1}{n}\sum_{i=1}^{n}\alpha_i\,\phi(x_j)^{T}\Big(\sum_{l=1}^{n}\phi(x_l)\phi(x_l)^{T}\Big)\phi(x_i) = \lambda\sum_{i=1}^{n}\alpha_i\,\phi(x_j)^{T}\phi(x_i), \quad j = 1, \ldots, n$ (4)

Define $K_{ij} = \phi(x_i)^{T}\phi(x_j)$; then $K$ is the symmetric positive semidefinite matrix of $n \times n$:

$K = (K_{ij})_{n \times n}$ (5)

where $K_{ij}$ represents the element in row $i$ and column $j$ of matrix $K$, and the eigenvalue solution problem combined with Equations (3)–(5) is converted to

$n\lambda\alpha = K\alpha$ (6)

where $\alpha$ is the eigenvector of $K$. Principal component analysis (PCA) is performed in the feature space by solving the eigenproblem of Formula (6), and the eigenvalue $\lambda_k$ corresponding to each eigenvector $\alpha^{(k)}$ is obtained; the retained components are chosen by the cumulative contribution rate

$\frac{\sum_{k=1}^{p}\lambda_k}{\sum_{k=1}^{n}\lambda_k} \geq \theta$ (7)

where $p$ is the number of principal components and $\theta$ is the set threshold.

The $k$-th feature of a newly observed sample $x$ is mapped by $\phi$ to its projection onto $v_k$, where $v_k$ is the feature vector of the $k$-th feature in the feature space, i.e., the direction of the $k$-th principal component:

$y_k(x) = v_k^{T}\phi(x) = \sum_{i=1}^{n}\alpha_i^{(k)} K(x_i, x)$ (8)

where $y_k(x)$ is the projection of $\phi(x)$ onto $v_k$. Where the centering condition $\sum_{i=1}^{n}\phi(x_i) = 0$ is not satisfied, $K$ is replaced by the centered matrix $\tilde{K}$:

$\tilde{K} = K - 1_n K - K 1_n + 1_n K 1_n$ (9)

where $\tilde{K}$ is the kernel matrix after centralization, and $1_n$ is the matrix of $n \times n$ in which each element is $1/n$.
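To make Equations (5)–(9) concrete, the following is a minimal numpy sketch of KPCA; the RBF kernel and its width parameter gamma are our assumptions for illustration, since the section does not fix the KPCA kernel:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2); kernel choice and gamma are
    illustrative assumptions, not values from the paper."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_fit_transform(X, X_new, p, gamma=1.0):
    """KPCA per Equations (5)-(9): build K, center it, solve the
    eigenproblem and project new samples onto the first p components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)                                 # Eq. (5)
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n          # Eq. (9)
    eigvals, eigvecs = np.linalg.eigh(Kc)                       # Eq. (6)
    top = np.argsort(eigvals)[::-1][:p]                         # leading components
    alphas = eigvecs[:, top] / np.sqrt(np.abs(eigvals[top]) + 1e-12)  # ||v_k|| = 1
    # center the test kernel consistently, then project (Eq. 8)
    K_new = rbf_kernel(X_new, X, gamma)
    one_m = np.full((X_new.shape[0], n), 1.0 / n)
    K_new_c = K_new - one_m @ K - K_new @ one_n + one_m @ K @ one_n
    return K_new_c @ alphas
```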
2.2. The Safe-Level-SMOTE Method
The SMOTE method can alleviate the over-fitting problem caused by random oversampling, but it only considers the minority class samples and does not consider the overlap between the synthesized samples and the majority class samples; therefore, many researchers adopt the improved Safe-Level-SMOTE method [61,62,63,64]. The Safe-Level-SMOTE method assigns a safe level to each minority sample before synthesizing new samples and positions each synthesized sample closer to the sample with the higher safe level. This method alleviates the sample quality problem of SMOTE and the problem of fuzzy class boundaries. The schematic diagram of the minority class samples synthesized by the Safe-Level-SMOTE algorithm is shown in Figure 2.
The process of the Safe-Level-SMOTE method is as follows:
- (1) Find the $k$ nearest neighbors of $p$, denoting the number of minority samples among them (the safe level of $p$) as $sl_p$, and denoting a selected neighbor as $n$.
- (2) Find the $k$ nearest neighbors of $n$, and denote the number of minority samples among them (the safe level of $n$) as $sl_n$.
- (3) Set the ratio $sl\_ratio = sl_p / sl_n$.

where $P$ is the sample set of the minority class, and $p$ is a sample in $P$. A new sample is synthesized as $s = p + gap \times (n - p)$, where the range of the random number $gap$ depends on the following cases.

Case 1: $sl\_ratio = \infty$ and $sl_p = 0$, that is, the neighbors of the minority sample $p$ are all majority samples; $p$ is regarded as noise, and no synthetic data is generated in this case.

Case 2: $sl\_ratio = \infty$ and $sl_p \neq 0$, that is, $sl_n = 0$: the neighbor $n$ is located among the majority class samples, so $gap = 0$ and point $p$ is simply copied.

Case 3: $sl\_ratio = 1$, that is, $sl_p = sl_n$: a new sample is synthesized between $p$ and $n$ with $gap \in [0, 1]$, the same synthesis method as SMOTE.

Case 4: $sl\_ratio > 1$, that is, $sl_p > sl_n$: the number of minority samples around point $p$ is greater than that around point $n$, so $p$ is considered the safer point; a new sample is synthesized between $p$ and $n$ with $gap \in [0, 1/sl\_ratio]$, and the synthesized sample is biased toward point $p$.

Case 5: $sl\_ratio < 1$, that is, $sl_p < sl_n$: the number of minority samples around point $p$ is less than that around point $n$, so $n$ is considered the safer point; a new sample is synthesized between $p$ and $n$ with $gap \in [1 - sl\_ratio, 1]$, and the synthesized sample is biased toward point $n$.
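The five cases reduce to choosing the range of the random number $gap$; a compact sketch (function and variable names are ours, not from the paper) might look like this:

```python
import numpy as np

def safe_level_gap(sl_p, sl_n, rng):
    """Gap selection for Safe-Level-SMOTE, following the five cases above.
    Returns None when no sample should be generated (case 1)."""
    if sl_n == 0 and sl_p == 0:        # case 1: p looks like noise
        return None
    if sl_n == 0:                      # case 2: ratio = inf, duplicate p
        return 0.0
    ratio = sl_p / sl_n
    if ratio == 1:                     # case 3: plain SMOTE
        return rng.uniform(0, 1)
    if ratio > 1:                      # case 4: biased toward p
        return rng.uniform(0, 1 / ratio)
    return rng.uniform(1 - ratio, 1)   # case 5: biased toward n

# synthesis between minority sample p and neighbor n: s = p + gap * (n - p)
rng = np.random.default_rng(0)
p, n = np.array([1.0, 2.0]), np.array([2.0, 1.0])
gap = safe_level_gap(sl_p=4, sl_n=2, rng=rng)
if gap is not None:
    print(p + gap * (n - p))
```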
2.3. Adaptive Synthetic Sampling Approach (ADASYN)
Many classification problems face sample imbalance, and most algorithms perform poorly in this situation. Researchers usually adopt the SMOTE method to address the issue of sample imbalance. Although the SMOTE algorithm is better than random oversampling, it still has some problems: generating the same number of new samples for every minority class sample may increase the overlap between classes and create valueless samples. Therefore, the improved ADASYN variant of SMOTE [65] is adopted. The basic idea of this algorithm is to adaptively generate minority class samples according to their distribution. This method can not only reduce the learning bias caused by class imbalance but also adaptively shift the decision boundary toward the difficult-to-learn samples. New samples are then artificially synthesized based on the minority class samples and added to the data set.
The process of the ADASYN method is as follows:
Input: training data set $D_{tr}$ with $m$ samples $\{(x_i, y_i)\}, i = 1, \ldots, m$, where $x_i$ is a sample of the $n$-dimensional feature space $X$ and $y_i \in Y = \{1, -1\}$ is the corresponding class label. $m_s$ and $m_l$ are defined as the minority sample size and the majority sample size, respectively, so $m_s \leq m_l$ and $m_s + m_l = m$.
The algorithm process:
- (1) Calculate the unbalance degree:

$d = m_s / m_l$ (10)

- (2) If $d < d_{th}$ ($d_{th}$ is the default threshold of the maximum allowable unbalance rate):
- (a) Calculate the amount of composite samples that need to be generated for the minority class:

$G = (m_l - m_s) \times \beta$ (11)

where $\beta \in [0, 1]$ is a parameter that specifies the level of balance required after the synthetic data is generated, and $\beta = 1$ indicates that the new data set is completely balanced after synthesis.
- (b) For each $x_i$ belonging to the minority class, find the $K$ nearest neighbors based on Euclidean distance in the $n$-dimensional space, and compute the ratio $r_i$, which is defined as

$r_i = \Delta_i / K, \quad i = 1, \ldots, m_s$ (12)

where $\Delta_i$ is the number of majority class samples among the $K$ neighbors of $x_i$, so $r_i \in [0, 1]$.
- (c) Normalize $r_i$ according to $\hat{r}_i = r_i / \sum_{i=1}^{m_s} r_i$, so that $\hat{r}_i$ is a density distribution ($\sum_i \hat{r}_i = 1$).
- (d) Calculate the number of samples $g_i$ that need to be synthesized for each minority sample:

$g_i = \hat{r}_i \times G$ (13)

where $G$ is the total number of artificial minority samples to be synthesized according to Formula (11).
- (e) For each minority sample $x_i$, synthesize $g_i$ samples by the following steps:
Do the Loop from 1 to $g_i$:
(i) Randomly select a minority sample $x_{zi}$ from the $K$ neighbors of $x_i$;
(ii) Synthesize the sample $s_i = x_i + (x_{zi} - x_i) \times \lambda$, where $(x_{zi} - x_i)$ is the difference vector in the $n$-dimensional space and $\lambda \in [0, 1]$ is a random number.
End Loop
As can be seen from the above steps, the key idea of the ADASYN method is to use the density distribution $\hat{r}_i$ as a criterion to adaptively decide the number of artificial samples synthesized for each minority class sample. From a physical perspective, the weight distribution is measured according to the learning difficulty of the different minority class samples. The data set obtained by the ADASYN method not only alleviates the imbalanced data distribution (according to the expected balance level defined by the coefficient $\beta$) but also forces the learning method to focus on the difficult-to-learn samples.
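Under those definitions, a minimal sketch of the ADASYN loop could look as follows (the helper names and the use of scikit-learn's NearestNeighbors are our assumptions; the imbalanced-learn library also provides a ready-made ADASYN):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn(X_min, X_maj, beta=1.0, K=5, seed=0):
    """Minimal ADASYN sketch following steps (a)-(e) above; assumes at
    least one minority point has majority neighbors (so r.sum() > 0)."""
    rng = np.random.default_rng(seed)
    G = int((len(X_maj) - len(X_min)) * beta)        # step (a): total to create
    X_all = np.vstack([X_min, X_maj])
    is_min = np.arange(len(X_all)) < len(X_min)
    # step (b): fraction of majority samples among the K neighbors of each x_i
    nn_all = NearestNeighbors(n_neighbors=K + 1).fit(X_all)
    _, idx = nn_all.kneighbors(X_min)                # first neighbor is x_i itself
    r = (~is_min[idx[:, 1:]]).sum(axis=1) / K
    r_hat = r / r.sum()                              # step (c): density distribution
    g = np.rint(r_hat * G).astype(int)               # step (d): per-sample quota
    # step (e): synthesize g_i samples around each minority x_i
    nn_min = NearestNeighbors(n_neighbors=min(K + 1, len(X_min))).fit(X_min)
    _, idx_min = nn_min.kneighbors(X_min)
    synthetic = []
    for i, gi in enumerate(g):
        for _ in range(gi):
            z = rng.choice(idx_min[i, 1:])           # random minority neighbor
            lam = rng.uniform()                      # lambda in [0, 1]
            synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.array(synthetic)
```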
2.4. The Multi-Output Least Squares Support Vector Regression Machine (MLSSVR)
The support vector regression machine (SVR) is a traditional machine learning method that solves a convex quadratic programming problem. Its basic idea is to map the input vector to a high-dimensional feature space through a predetermined nonlinear mapping and then perform linear regression in this space, thus obtaining the effect of nonlinear regression in the original space [19]. The least squares support vector regression machine (LSSVR) is an improved version that replaces the inequality constraints in SVR with equality constraints. MLSSVR is a generalization of LSSVR to the case of multiple outputs.
Suppose a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the input vector and $y_i \in \mathbb{R}$ is the output value. A nonlinear mapping $\phi(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^{n_h}$ is introduced to map the input to the $n_h$-dimensional feature space, and the regression function is constructed as

$f(x) = w^{T}\phi(x) + b$ (14)

where $w$ is the weight vector and $b$ is the offset coefficient.

In order to find the best regression function, the minimum norm $\|w\|^{2}$ is needed. The problem can be boiled down to the following constrained optimization problem:

$\min_{w, b}\ \frac{1}{2}\|w\|^{2}, \quad \text{s.t.}\ y = Z^{T}w + b\mathbf{1}_n$ (15)

where $Z = (\phi(x_1), \phi(x_2), \ldots, \phi(x_n))$ is a block matrix composed of $\phi(x_1), \ldots, \phi(x_n)$, and $y = (y_1, \ldots, y_n)^{T}$. By introducing the relaxation variables $\xi_i$, the minimization problem of Equation (15) can be transformed into

$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^{2} + \frac{\gamma}{2}\xi^{T}\xi, \quad \text{s.t.}\ y = Z^{T}w + b\mathbf{1}_n + \xi$ (16)

where $\xi = (\xi_1, \ldots, \xi_n)^{T}$ is a vector composed of relaxation variables and $\gamma$ is a regularization parameter.
In the multiple-output case, for a given training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$ is the input vector, $y_i \in \mathbb{R}^m$ is the output vector, and $X$ and $Y$ are composed of the block matrices of $x_i$ and $y_i$, respectively. The purpose of MLSSVR is to map from the $d$-dimensional input $x$ to the $m$-dimensional output $y$. As in the case of single output, the regression function is

$f(x) = W^{T}\phi(x) + b^{T}$ (17)

where $W = (w_1, \ldots, w_m)$ is a matrix composed of weight vectors and $b = (b_1, \ldots, b_m)^{T}$ is a vector composed of offset coefficients. Minimize the following constrained objective function by finding $W$ and $b$:

$\min_{W, b}\ \frac{1}{2}\operatorname{tr}(W^{T}W) + \frac{\gamma}{2}\operatorname{tr}(\Xi^{T}\Xi), \quad \text{s.t.}\ Y = Z^{T}W + \mathbf{1}_n b^{T} + \Xi$ (18)

where $\Xi = (\xi_1, \ldots, \xi_m) \in \mathbb{R}^{n \times m}$ is a matrix of relaxation vectors. By solving this problem, $W$ and $b$ are obtained, and the nonlinear mapping is determined. According to hierarchical Bayes, the weight vectors $w_i$ can be decomposed into the following two parts:

$w_i = w_0 + v_i, \quad i = 1, \ldots, m$ (19)

where $w_0$ is the mean vector, $v_i$ is a difference vector, and $w_0$ and $v_i$ reflect the commonality and the difference between the outputs. That is, $w_0$ contains the general characteristics of the output, and $v_i$ contains the special information of the $i$-th component of the output. Equation (18) is equivalent to the following problem:

$\min_{w_0, V, b}\ \frac{1}{2}w_0^{T}w_0 + \frac{\lambda}{2m}\operatorname{tr}(V^{T}V) + \frac{\gamma}{2}\operatorname{tr}(\Xi^{T}\Xi), \quad \text{s.t.}\ Y = Z^{T}W + \mathbf{1}_n b^{T} + \Xi$ (20)

where $V = (v_1, \ldots, v_m)$, $W = (w_0 + v_1, \ldots, w_0 + v_m)$, and $\lambda$ and $\gamma$ are two regularization parameters.
The Lagrange function corresponding to Equation (20) is

$L(w_0, V, b, \Xi, A) = \frac{1}{2}w_0^{T}w_0 + \frac{\lambda}{2m}\operatorname{tr}(V^{T}V) + \frac{\gamma}{2}\operatorname{tr}(\Xi^{T}\Xi) - \operatorname{tr}\big(A^{T}(Z^{T}W + \mathbf{1}_n b^{T} + \Xi - Y)\big)$ (21)

where $A = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^{n \times m}$ is a matrix consisting of the Lagrange multiplier vectors.

According to the optimization theory of Karush–Kuhn–Tucker (KKT) conditions, the following linear equations are obtained:

$w_0 = ZA\mathbf{1}_m, \quad V = \frac{m}{\lambda}ZA, \quad A^{T}\mathbf{1}_n = 0, \quad A = \gamma\,\Xi, \quad Z^{T}W + \mathbf{1}_n b^{T} + \Xi - Y = 0$ (22)
By canceling $w_0$, $V$ and $\Xi$ in Equation (22), the linear matrix equation can be obtained as follows:

$\begin{bmatrix} 0 & P^{T} \\ P & H \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$ (23)

where $H = \Omega + \frac{1}{\gamma}I_{mn}$, $\Omega = \mathbf{1}_{m \times m} \otimes K + \frac{m}{\lambda}(I_m \otimes K)$, $K = (K(x_i, x_j))_{n \times n}$ is the kernel matrix, $P = I_m \otimes \mathbf{1}_n \in \mathbb{R}^{mn \times m}$, $\alpha = \operatorname{vec}(A)$ and $y = \operatorname{vec}(Y)$. Since the coefficient matrix of Equation (23) is not positive definite, Equation (23) can be changed to the following form:

$\begin{bmatrix} S & 0 \\ 0 & H \end{bmatrix}\begin{bmatrix} b \\ \alpha + H^{-1}Pb \end{bmatrix} = \begin{bmatrix} P^{T}H^{-1}y \\ y \end{bmatrix}$ (24)

where $S = P^{T}H^{-1}P$ is a positive definite matrix. It is not difficult to see that the coefficient matrix of Formula (24) is positive definite. The solution $\alpha$ and $b$ of Equation (23) are obtained in three steps:
(1) Solve $\eta$ and $\nu$ from $H\eta = P$ and $H\nu = y$; (2) calculate $S = P^{T}\eta$; (3) solve $b = S^{-1}\eta^{T}y$ and $\alpha = \nu - \eta b$.
The corresponding regression function can be obtained as follows:

$f_j(x) = \sum_{i=1}^{n} K(x, x_i)\Big(\sum_{l=1}^{m}\alpha_{il} + \frac{m}{\lambda}\alpha_{ij}\Big) + b_j, \quad j = 1, \ldots, m$ (25)

This article uses the most common RBF kernel function, as follows:

$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^{2}}{2p^{2}}\right)$ (26)

where $x_i, x_j \in \mathbb{R}^d$ and $p$ is the kernel width.
The MIMO algorithm differs from the MISO algorithm in its input–output mapping and in the types of its parameters. When this method is used, an optimization algorithm is needed to optimize the parameters of the model.
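As a concrete illustration of the three-step solution, here is a sketch of the single-output LSSVR special case (Equation (16)) with the RBF kernel of Equation (26); MLSSVR solves the same kind of system with the block-structured $H$ and $P$, and the fixed gamma and p values below stand in for the PSO/GA-optimized parameters:

```python
import numpy as np

def lssvr_fit(X, y, gamma=1.0, p=1.0):
    """Single-output LS-SVR solved by the three-step scheme above
    (a simplified sketch; gamma and p are assumed, untuned values)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * p ** 2))                   # RBF kernel, Eq. (26)
    H = K + np.eye(n) / gamma                        # positive definite system matrix
    ones = np.ones(n)
    eta = np.linalg.solve(H, ones)                   # step (1): H eta = 1
    nu = np.linalg.solve(H, y)                       # step (1): H nu  = y
    s = ones @ eta                                   # step (2): s = 1^T eta
    b = (eta @ y) / s                                # step (3): offset
    alpha = nu - eta * b                             #           dual coefficients
    return alpha, b

def lssvr_predict(X_train, alpha, b, X_new, p=1.0):
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * p ** 2)) @ alpha + b    # regression function, Eq. (25)
```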
3. The Improvement Method
This paper proposes a method of mixture gas identification and concentration detection based on sample expansion. The flow chart of gas identification and concentration detection using the proposed method is shown in
Figure 3.
The qualitative analysis of the gas mixture is divided into five steps: data preprocessing, feature extraction, stratified cross-validation, sample expansion, and parameter optimization with qualitative identification.
Step 1: The raw signal is preprocessed to eliminate the differences that the baseline introduces into the raw data.
Step 2: The KPCA is used to extract the features of the preprocessed signal. When the cumulative contribution rate of the eigenvalues reaches the set threshold, the first $p$ features are selected to represent the original features.
Step 3: After KPCA feature extraction, hierarchical five-fold cross-validation is used to evenly divide the data into five mutually exclusive subsets. In each experiment, one subset is selected as the test set and the other four are combined as the training set, and the average of the five results is used as the estimate of the algorithm’s accuracy.
Step 4: In the training set, the ADASYN method is used to artificially synthesize minority class samples for the imbalanced classes, and the generated new samples are added to the training set to form a new training set.
Step 5: After sample expansion on the class-imbalanced data set, the ELM method is adopted as the classification method, PSO and GA are used to optimize the parameters of the classification method, and the classification model is obtained. The test set is input into the classification model to identify the gas mixture.
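A hedged sketch of Steps 2–5 in Python (the hyperparameters are fixed placeholder values standing in for the PSO/GA-optimized ones, a minimal random-hidden-layer ELM replaces a tuned implementation, and scikit-learn plus imbalanced-learn are assumed to be available):

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import ADASYN

def elm_train(X, Y_onehot, L=100, seed=0):
    """Minimal ELM: random hidden layer, least-squares output weights.
    L would be tuned by PSO/GA in the proposed method."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], L))
    b = rng.normal(size=L)
    H = np.tanh(X @ W + b)                   # hidden-layer response
    beta = np.linalg.pinv(H) @ Y_onehot      # Moore-Penrose least squares
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

def qualitative_analysis(X, y, n_components=10):
    """Steps 2-5 above; X, y are preprocessed signals and integer labels."""
    X_feat = KernelPCA(n_components=n_components, kernel="rbf").fit_transform(X)  # step 2
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)               # step 3
    accs = []
    for tr, te in skf.split(X_feat, y):
        X_tr, y_tr = ADASYN(random_state=0).fit_resample(X_feat[tr], y[tr])       # step 4
        Y_tr = np.eye(int(y.max()) + 1)[y_tr]                                     # one-hot labels
        W, b, beta = elm_train(X_tr, Y_tr)                                        # step 5
        accs.append(np.mean(elm_predict(X_feat[te], W, b, beta) == y[te]))
    return float(np.mean(accs))
```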
The quantitative analysis of a gas mixture component is divided into four steps: data preprocessing, sample expansion, parameter optimization and quantitative estimation.
Step 1: The original signal is preprocessed to eliminate the influence of the baseline.
Step 2: Arrange the samples of the pretreated sample set in ascending order of concentration, and cross-select the samples as the training set and the test set. In the training set, the S-SMOTE method is used to synthesize artificial samples, and the generated samples are added to the training set to form a new training set.
Step 3: After sample expansion, the MLSSVR method is used as the regression method, and the PSO and GA methods are used to optimize its parameters to obtain the regression model with the optimal parameters.
Step 4: Input the test set into the regression model to obtain the estimate of the mixed gas concentration, and use the mean absolute percentage error (MAPE) and root mean square error (RMSE) as the evaluation criteria.
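The two evaluation criteria are straightforward to compute; for example (toy values, assuming nonzero true concentrations for MAPE):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes y_true != 0)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the concentration."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([10.0, 20.0, 30.0])   # toy concentrations (ppm)
y_pred = np.array([11.0, 19.0, 33.0])
print(mape(y_true, y_pred), rmse(y_true, y_pred))
```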