Sensor Drift Compensation Based on the Improved LSTM and SVM Multi-Class Ensemble Learning Models

Drift is an important issue that impairs the reliability of sensors, especially gas sensors. The conventional method usually adopts a reference gas to compensate for the drift. However, its classification accuracy is not high. In this paper, we propose a supervised learning algorithm based on multi-classifier integration for drift compensation. Because the goal of drift compensation is to improve classification performance, the algorithm incorporates drift compensation into the classification process. In our method, drawing on the obtained characteristics of the sensors and the advantage of the Support Vector Machine (SVM) in few-shot classification, an improved Long Short-Term Memory (LSTM) network is integrated to build the multi-class classifier model. We tested the proposed approach on a publicly available time series dataset that was collected over three years by metal-oxide gas sensors. The results indicate that the multiple classifier approach achieves higher classification accuracy than the compared approaches during the testing period in the presence of sensor drift over time.


Introduction
In recent years, with the rapid development of machine olfaction technology, gas identification systems have been widely applied in many fields, such as food testing, medical diagnosis, and environmental monitoring [1][2][3]. In gas identification systems, gas sensors are the core components for sensing, identifying, and measuring different gases. The key requirement is that the sensors reproduce the function of human smell with high accuracy [4,5]. However, drift is inevitable and cannot be ignored as a sensor is used over time [1]. There are several forms of drift, such as zero drift, span drift, and concept drift. Zero drift means that the reference deviates from a fixed value due to the influence of the external environment when the input signal of the amplifying circuit is zero. Span drift refers to a change in the coefficient and conversion factor of the amplifier with changes in time and temperature. Sensor drift arises from interfering factors, such as the temperature, humidity, and pressure of the surrounding environment, as well as the aging and poisoning of the sensor material (including external pollution and irreversible binding), which mix interference signals into the sensor input signal. The external environment makes the interference signal grow continuously, which results in a gradual decline in data quality and acquisition accuracy. The difference from the true value drift and it has reduced the number of sensor calibrations [21].
Long Short-Term Memory (LSTM) compensates for several issues of the recurrent neural network (RNN), including vanishing and exploding gradients and the lack of long-term memory, which enables the network to effectively utilize long-range timing information. Wang et al. have proposed an LSTM prediction model parameter optimization algorithm based on multi-layer grid search, which has relatively strong applicability and relatively high accuracy in predictive analysis [22]. However, as time goes by, the quality of the data collected by the sensor decreases. When cross entropy is used as the loss function on a small amount of data, the softmax output of the LSTM model can lead to over-fitting. Moreover, the confidence range and threshold cannot be practically determined.
Multi-classifier ensemble learning combines a variety of learning algorithms, so the corresponding hypothesis space can be expanded, which reduces the drawbacks of a single learning algorithm [23]. Building on this, this paper proposes a supervised learning algorithm based on multi-classifier ensemble learning. To improve accuracy, we have developed a multi-classifier that integrates the base classifier LSTM, trained with a new loss function, with SVM, which combines the advantages of SVM for small samples with the advantages of LSTM for time series. The classifiers are integrated by a voting strategy with normalized weighting. To verify the method, we select the 'Gas Sensor Array Drift Dataset', a benchmark dataset available online at the UCI machine learning repository. The proposed method is capable of compensating for drift in gas sensors and does not require system re-calibration or background information, which makes it feasible for use in real-time applications.
The rest of this article is organized as follows. Section 2 describes the data processing. Section 3 describes the entire flow of the sensor drift model. Section 4 presents the analysis and results, and Section 5 draws the conclusions.

Data Processing
The data collected by the sensors have a high dimension, and the overall processing load is large. If sensor drift compensation is performed directly on the original data, it is difficult to achieve the desired effect. For this reason, Vergara et al. first extracted features from the original data when creating the dataset [7]. The method in this paper performs a correlation analysis on the selected dataset. The dataset is then reduced based on the correlation coefficients and, finally, normalized.

Data Acquisition
The dataset used in this study was collected in a controlled laboratory setup using an array of sixteen metal-oxide gas sensors manufactured by Figaro Inc. [24]. The sensor array consists of 16 Figaro commercial gas sensors of four types with different sensitivities, with four sensors of each type. Table 1 shows the detailed information of the sensor array. The dataset was measured according to the following procedure. First, a constant flow of zero-grade dry air circulates through the sensing chamber, while the gas sensor array remains at a stable operating temperature (400 °C). This step measures the baseline steady-state sensor responses (the responses of the sensors in the absence of chemical analytes). Second, the desired odorant concentration is injected into the sensing chamber by a continuous flow system. Finally, in the third step (cleaning phase), the vapor is evacuated from the sensor array and the test chamber is cleaned with dry air before the next concentration is measured. Each measurement takes at least 300 s, including 100 s of gas injection and at least 200 s of recovery (cleaning). For processing, we consider the entire sensor response after subtracting the baseline from each record. The sampling rate is set to 100 Hz. The measurement process described here is repeated for the subsequent measurements. The complete experimental setup of the data acquisition process is given in [7]. Figure 1 shows a typical sensor response.
Figure 1. Typical information of sensor response. Typical response of a metal-oxide based chemical sensor to 30 ppmv of Acetaldehyde. The curve shows the three phases of a measurement: baseline measurement (made with pure air), test gas measurement (when the chemical analyte is injected, in gas form, into the test chamber), and recovery phase (during which the sensor is again exposed to pure air; the recovery time is usually much longer than the gas injection phase).

Feature Extraction
Feature extraction is essential in every chemo-sensory application [24]. It can be described as a projection of the sensor response onto a lower-dimensional space that preserves the most meaningful portion of the information contained in the original sensor signals. Vergara et al. considered two distinct types of features that exploit the whole dynamic process occurring at the sensor surface, including those that reflect the adsorption, desorption, and steady-state (or final) response of the sensor element [7]. The extracted features reflect the transient response (desorption, adsorption) and the steady-state response of the sensors [6]. The steady-state features are computed as:

$$\Delta R = \max_k r[k] - \min_k r[k], \qquad \|\Delta R\| = \frac{\max_k r[k] - \min_k r[k]}{\min_k r[k]} \tag{1}$$

where $r[k]$ is the time curve of the sensor resistance, $\Delta R$ is the difference between the maximal resistance and the baseline, $\|\Delta R\|$ is that difference expressed as a ratio to the baseline value, and k is the discrete time indexing the recording interval [0, L] when the chemical vapor is present in the test chamber. The transient features, which reflect the rising and decaying sensor response, are evaluated by the exponential moving average $ema_a$. The value of $ema_a$ is the maximum of y[k] for the rising portion and the minimum of y[k] for the decaying portion [25], where y[k] is calculated by the following formula:

$$y[k] = (1 - a)\, y[k-1] + a\,(r[k] - r[k-1])$$

where k = 1, 2, ..., L is the discrete time defined above, y[k] is a real scalar whose initial state is set to zero, and the scalar a (0 ≤ a ≤ 1) is the smoothing parameter of the operator, which determines the quality of the resulting feature and its position in the time sequence [9]. Setting a to three different values, 0.1, 0.01, and 0.001, yields three rising and three decaying transient features for each sensor, as shown in Figure 2. For each gas sensor in the array, two steady-state and six transient features are computed, and a feature vector of 128 features (16 sensors × 8 features) is recorded.
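The eight per-sensor features described above can be illustrated with a short sketch. It is not the dataset authors' code: it assumes the exponential moving average takes the form y[k] = (1 - a) y[k-1] + a (r[k] - r[k-1]) with y[0] = 0, treats the minimum of the trace as the baseline, and takes the maximum and minimum over the whole trace rather than over separately segmented injection and recovery phases. The function names and the toy trace are hypothetical.

```python
def steady_state_features(r):
    """Return (delta_R, normalized delta_R) for a resistance trace r."""
    r_max, r_min = max(r), min(r)      # r_min plays the role of the baseline (assumption)
    delta = r_max - r_min
    return delta, delta / r_min


def ema_trace(r, a):
    """Exponential moving average of the increments of r, with y[0] = 0."""
    y = [0.0]
    for k in range(1, len(r)):
        y.append((1.0 - a) * y[-1] + a * (r[k] - r[k - 1]))
    return y


def sensor_features(r, alphas=(0.1, 0.01, 0.001)):
    """Eight features per sensor: 2 steady-state, 3 rising, 3 decaying."""
    delta, delta_norm = steady_state_features(r)
    rising = [max(ema_trace(r, a)) for a in alphas]     # peak during the rise
    decaying = [min(ema_trace(r, a)) for a in alphas]   # trough during the decay
    return [delta, delta_norm] + rising + decaying


# Toy trace: baseline, rise during gas injection, decay during recovery.
trace = [10.0, 10.0, 12.0, 15.0, 18.0, 18.0, 16.0, 13.0, 11.0, 10.0]
features = sensor_features(trace)
```

Concatenating the eight values of all 16 sensors would give the 128-dimensional feature vector mentioned in the text.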
The order in which the features are placed in the feature vector is shown in Table 2, where S1 denotes Sensor 1, S2 denotes Sensor 2, and so on up to Sensor 16 (S16), and $ema_a$ (a ∈ {0.1, 0.01, 0.001}) denotes the exponential moving average.

Correlation Analysis
Table 2 shows the results of correlation analysis of the dataset. When the correlation coefficient between two variables is over 0.9, one of the two variables is removed, which reduces the dimension of the data so that the data can be fully utilized in a lower dimension. The main methods of correlation analysis are the Pearson Product Moment Correlation Coefficient (PPMCC), the Kendall rank correlation coefficient (Kendall), and Spearman's correlation coefficient for ranked data (Spearman). All three reflect the direction and extent of the trend between two variables. Compared with PPMCC, Spearman is less accurate and is insensitive to data errors and extreme values. Kendall is a rank correlation coefficient that applies to categorical variables.
In conclusion, PPMCC is the most suitable for evaluating sensor drift data. Let $X_i$ denote the i-th variable and $\rho_{X_i, X_j}$ the Pearson correlation coefficient between the i-th and the j-th variables. The coefficient can be expressed as:

$$\rho_{X_i, X_j} = \frac{\mathrm{cov}(X_i, X_j)}{\sigma_{X_i}\,\sigma_{X_j}} = \frac{E\left[(X_i - \mu_{X_i})(X_j - \mu_{X_j})\right]}{\sigma_{X_i}\,\sigma_{X_j}}$$

where $X_i$ and $X_j$ are two of the 128 features in Table 2, $\mathrm{cov}(X_i, X_j)$ is the covariance between the i-th variable and the j-th variable, $\sigma_{X_i}$ is the population standard deviation of the i-th variable, and $\mu_{X_i}$ is the population mean of the i-th variable. Estimating the covariance and standard deviation from the dataset gives the sample Pearson correlation coefficient:

$$g = \frac{1}{n - 1} \sum_{k=1}^{n} \left(\frac{X_{i,k} - \bar{X}_i}{\sigma_{X_i}}\right) \left(\frac{X_{j,k} - \bar{X}_j}{\sigma_{X_j}}\right)$$

where $(X_{i,k} - \bar{X}_i)/\sigma_{X_i}$ is the standard score of the k-th sample of $X_i$; $\bar{X}_i$ is the sample mean; $\sigma_{X_i}$ here is the sample standard deviation; g is the Pearson correlation coefficient; and n is the sample size. The range of g is [-1, 1] [26]. If g > 0, the two variables are positively correlated; if g = 0, there is no linear correlation between them; and if g < 0, they are negatively correlated. Correlation analysis is performed on the entire dataset. For each pair of variables with g > 0.9, one variable is removed and the other is retained in the processed dataset. Table 3 shows partial results of the correlation processing, where null marks data deleted from its original location.
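The pruning rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' procedure: it assumes a greedy scan in which a feature column is kept only if its sample Pearson coefficient with every previously kept column does not exceed 0.9, it follows the text in thresholding g itself (not |g|), and it does not guard against constant columns, whose standard deviation is zero. The names `pearson` and `drop_correlated` and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Sample Pearson coefficient: mean product of standard scores (n - 1 form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    std_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    return sum(((a - mean_x) / std_x) * ((b - mean_y) / std_y)
               for a, b in zip(x, y)) / (n - 1)


def drop_correlated(columns, threshold=0.9):
    """Return indices of columns kept after removing one of each pair with g > threshold."""
    kept = []
    for j, column in enumerate(columns):
        if all(pearson(columns[i], column) <= threshold for i in kept):
            kept.append(j)
    return kept


# Toy feature columns: the second is almost a multiple of the first.
cols = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.1],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
kept = drop_correlated(cols)
```

The third column is negatively correlated with the first, so under the g > 0.9 rule it is retained; a variant that thresholds the absolute value would treat strong negative correlation as redundancy as well.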

Sensor Drift Model Based on Ensemble Learning
The dynamic behavior of sensor drift cannot be calibrated. Machine learning methods, which let a model adapt to sensor drift, are therefore attractive. The proposed method combines the SVM with the improved LSTM to achieve precise gas classification at any concentration. The proposed Improved LSTM and SVM (ILS) model mainly comprises three parts, namely data processing, the integration of the ILS model classifier, and model evaluation, as shown in Figure 3. The steps are implemented as follows.
1. The dataset is cleaned to eliminate noisy data and then processed under four settings, which yields four different datasets, numbered Dataset 1, Dataset 2, Dataset 3, and Dataset 4. The four settings are explained in the experiments that follow. Settings 3 and 4 additionally apply the correlation analysis.
2. Dataset 1, Dataset 2, Dataset 3, and Dataset 4 are each used as input to the improved LSTM and SVM models, so that eight independent classifiers (four improved LSTM classifiers and four SVM classifiers) are trained. For the same test samples, the eight classifiers output different classification results, which are combined into the final prediction by the normalization and weighted voting strategy.
3. Classification accuracy is adopted to evaluate the performance of the ILS model.
The input and output of the two types of classifiers are shown in Figure 4. The input is the features of each dataset, and the output layer has one output for each of the six gases. Finally, the ensemble outputs one of the six gases through the voting strategy.
We use an ensemble of classifiers to detect and cope with the sensor drift. Each classifier takes a set of features x as input and produces a class label y (a gas/analyte in our problem) as output. For every dataset t, we receive a batch $S_t = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and train a classifier on it. A simple and intuitive way to combine the classifiers is to assign them weights according to their prediction performance on batch $S_i$. We trained an ensemble of multi-class classifiers using the method in Algorithm 1, which is given below.

Algorithm 1. Training the ensemble of multi-class classifiers.
1: for t = 1, . . . , N do
2:     Receive S_t = {(x_1, y_1), . . . , (x_n, y_n)}
3:     Train a classifier (SVM) on S_t
4:     Estimate the weights φ_1, . . . , φ_t by the techniques described in the text
5: end for
6: for t = 1, . . . , N do
7:     Receive S_t = {(x_1, y_1), . . . , (x_n, y_n)}
8:     Train a classifier (LSTM) on S_t
9:     Estimate the weights η_1, . . . , η_t by the techniques described in the text
10: end for
We can assign weights according to the prediction performance of the classifiers on the most recent batches. Alternatively, we can simply estimate a single set of weights φ_1, . . . , φ_t and η_1, . . . , η_t by using the prediction accuracies of the multi-class classifiers on every batch. The predictions are then combined by the improved voting strategy.
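The training loop of Algorithm 1 can be sketched as follows. This is only an illustration of the control flow under stated assumptions: the nearest-centroid trainer is a hypothetical stand-in for the SVM and LSTM trainers, the weight of each classifier is taken to be its accuracy on the most recent batch (one of the options mentioned in the text), and the batches are toy data.

```python
def train_centroid_classifier(batch):
    """Stand-in trainer (not SVM or LSTM): nearest-centroid model for (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in batch:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Choose the class whose centroid is closest in squared Euclidean distance.
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return predict


def accuracy(predict, batch):
    """Fraction of samples in the batch that the classifier labels correctly."""
    return sum(predict(x) == y for x, y in batch) / len(batch)


def train_ensemble(batches, trainers):
    """One classifier per (trainer, batch); weight = accuracy on the most recent batch."""
    recent = batches[-1]
    ensemble = []
    for train in trainers:            # in the paper: an SVM loop, then an LSTM loop
        for batch in batches:         # for t = 1, ..., N
            classifier = train(batch)
            ensemble.append((classifier, accuracy(classifier, recent)))
    return ensemble


# Two toy batches of two gases; the second batch is slightly shifted (drifted).
batch_1 = [([0.0, 0.0], 1), ([0.2, 0.1], 1), ([1.0, 1.0], 2), ([1.1, 0.9], 2)]
batch_2 = [([0.3, 0.3], 1), ([0.4, 0.2], 1), ([1.3, 1.2], 2), ([1.4, 1.3], 2)]
ensemble = train_ensemble([batch_1, batch_2], [train_centroid_classifier])
weights = [w for _, w in ensemble]
```

In this sketch the classifier trained on the most recent batch is also scored on it, so its weight is an in-sample estimate; a held-out portion of the batch would give a less optimistic weight.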

SVM Base Classifier
The ILS model uses the SVM classifier as a base classifier. When dealing with linearly inseparable samples, SVM maps them from the low-dimensional feature space to a high-dimensional space by a non-linear mapping and constructs the optimal hyperplane in the high-dimensional space, using different kernel functions to make the samples linearly separable. It turns out that a set of classifiers combining LSTM and SVM performs better than a single classifier model.
The learning method of LSTM is based on the principle of empirical risk minimization. When the number of training samples is large enough, this method can provide good compensation for sensor drift, but it usually overfits when the training samples are insufficient. The principle of structural risk minimization addresses this defect: it reduces the generalization error, limits the complexity of the learning machine, and controls the expected risk on the entire sample set while ensuring classification accuracy. SVM is based on this principle, which is described by Equation (6). In that equation, $\|w\|$ is the two-norm of the vector w, and $\frac{1}{2}\|w\|^2$ can be expressed as $\frac{1}{2}\sum_{i=1}^{m} w_i^2$, namely the L2 regularization term. The second term of Equation (6) represents the empirical risk. x represents the input features of a training sample and y represents the output of the training sample set. $w = (w_1, w_2, \ldots, w_m)$ is the normal vector of the hyperplane, T denotes the transpose, and b is the offset, which determines the distance between the hyperplane and the origin. m is the number of input sample instances. For these reasons, the SVM is selected as a base classifier in this paper.
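To make the two terms of structural risk minimization concrete, the sketch below evaluates a common soft-margin SVM objective. It is an assumption that the empirical risk term is the hinge loss with a trade-off constant `c`; the exact form of the paper's Equation (6) is not recoverable from the text, and the function name and toy samples are hypothetical. Labels are taken as +1 or -1 for a single binary decision.

```python
def svm_objective(w, b, samples, c=1.0):
    """L2 regularization term plus hinge-loss empirical risk (assumed form)."""
    regularization = 0.5 * sum(wi * wi for wi in w)          # (1/2) * ||w||^2
    empirical_risk = 0.0
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)   # y * (w^T x + b)
        empirical_risk += max(0.0, 1.0 - margin)             # zero once margin >= 1
    return regularization + c * empirical_risk


samples = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
wide_margin = svm_objective([1.0, 0.0], 0.0, samples)    # both margins are 2
small_weights = svm_objective([0.25, 0.0], 0.0, samples) # both margins are 0.5
```

With the larger weight vector every sample clears the margin and only the regularization term remains; shrinking the weights lowers that term but incurs empirical risk, which is the trade-off the principle controls.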

Improved Base Classifier of LSTM
LSTM is a special form of RNN. Among the many RNN variants, the LSTM model compensates for the vanishing and exploding gradients of RNN and its lack of long-term memory capacity, which enables LSTM to effectively utilize long-range timing information [22,27]. Generally, the model is divided into three levels: the input layer, the hidden layer, and the output layer. Given an input sequence $X = (x_1, x_2, \ldots, x_n)$, the hidden layer sequence $H = (h_1, h_2, \ldots, h_n)$ and the output layer sequence $Y = (y_1, y_2, \ldots, y_n)$ are obtained by iterating Equations (7) to (9).
The forward calculation of an LSTM cell can be expressed as:

$$\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}$$

where $f_t$, $i_t$, and $o_t$ are, respectively, the outputs of the forget gate, the input gate, and the output gate; $W_f$, $W_i$, and $W_o$ are, respectively, the weight matrices of the forget gate, the input gate, and the output gate; $b_f$, $b_i$, and $b_o$ are, respectively, the bias terms of the forget gate, the input gate, and the output gate; $h_t$ is the final output of the current cell; $C_t$ is the cell state, which is passed on as input at time t + 1; $\tilde{C}_t$ is the candidate cell state; $W_c$ is the weight matrix of the cell state; and $b_c$ is the bias of the cell state input. $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions, respectively, and $\odot$ denotes the element-wise product. The new cell state is computed from the previous state, scaled by the forget gate, and the current input information, scaled by the input gate. Figure 5 shows the cell structure of the LSTM model.
Training the network includes the following steps: calculate the output value and the error function value (step 2); update the weights and thresholds of the output neurons (step 3); calculate the error of the hidden layer neurons and update their weights and thresholds (step 4); and repeat steps 2 to 4 until the model converges or the maximum number of training iterations is reached (step 5).
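The forward pass of a single LSTM cell can be written out directly. The sketch below assumes the standard LSTM formulation, in which each gate applies its weight matrix to the concatenation of the previous output and the current input; the dictionary keys mirror the symbols W_f, W_i, W_o, W_c and their biases, while the function names and the all-zero toy parameters are hypothetical. It uses plain Python lists, so it illustrates the arithmetic rather than an efficient implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W . v + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell; returns (h_t, C_t)."""
    v = h_prev + x_t                                   # concatenation [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], v)]        # forget gate
    i = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], v)]        # input gate
    o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], v)]        # output gate
    c_hat = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], v)]  # candidate state
    c_t = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t


# One hidden unit and one input feature; all parameters zero, so every gate is 0.5.
zero_params = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_o", "W_c")}
zero_params.update({name: [0.0] for name in ("b_f", "b_i", "b_o", "b_c")})
h_t, c_t = lstm_step([3.0], [0.0], [1.0], zero_params)
```

With zero parameters the forget gate halves the previous cell state and the candidate state contributes nothing, which makes the step easy to check by hand.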
The overall framework of the LSTM model constructed in this paper is shown in Figure 6. It includes three functional modules: the input layer, the hidden layer, and the output layer. The input layer processes the data in the dataset to meet the network input requirements. The LSTM cells shown in Figure 5 are used to construct a single-layer recurrent neural network in the hidden layer, which is trained with the Adam optimizer. The output layer classifies the results of LSTM training, from which the accuracy is analyzed. The input X of the hidden layer is processed in each cell in turn, with forward propagation for training on X and back propagation for optimizing the loss function with the Adam optimizer. The LSTM model outputs $h_i$, where $C_n$ and $H_n$ represent the state and output of the previous LSTM cell, respectively. We choose cross entropy as the basis of the loss:

$$H(q, p) = -\sum_{x} q(x) \log p(x)$$

where p(x) is the distribution of predictions and q(x) is the distribution of the dataset. The softmax output is employed in this study. If cross entropy is chosen directly as the loss function, the model approximates the max operation closely, because the exp function in softmax is monotonically increasing and assigns a high value to one node and low values to the remaining nodes. This polarizes the results, which weakens the ability to correct errors in practical applications. As the experimental batches accumulate, the interference signals in the data collected by the sensor gradually increase and can even dominate the main signals, so the max operation must be softened to weaken the influence of such signals on the results and improve the accuracy of the model. More importantly, it is difficult to determine a confidence interval and set the threshold in practice. Therefore, it is necessary to improve the loss function. The improved loss function is presented in the Improved Loss section.
There are many gradient-based optimization algorithms, such as Stochastic Gradient Descent (SGD) and Root Mean Square Propagation (RMSProp). This paper adopts Adaptive Moment Estimation (Adam), which combines the advantages of the AdaGrad and RMSProp algorithms. In each parameter update, Adam first computes the first and second moment estimates of the gradient and then corrects their bias. The bias-corrected estimates are then used to compute the update step, which is applied to the parameters. Adam performs well in practical applications when compared with other stochastic optimization methods.
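The update described above (moment estimates, bias correction, parameter step) can be sketched as a single function. This follows the commonly published form of Adam; the default hyperparameters are the usual ones and are assumptions here, since the paper's settings are not stated in this section. The function name and the quadratic toy objective are hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (theta, m, v)."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]        # first moment
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]    # second moment
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]                         # bias correction
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# Toy problem: minimize (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = [0.0], [0.0], [0.0]
first_theta = None
for t in range(1, 501):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
    if t == 1:
        first_theta = theta[0]
```

After bias correction the very first step has magnitude close to the learning rate regardless of the gradient scale, which is one reason the method is fairly insensitive to how the loss is scaled.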

Multiple Classifiers Strategy
The multi-classifier combination makes every classifier solve the same original task and combines the results of the models by a specific voting strategy to obtain a better global model. This paper gives a theoretical analysis of ensemble learning for the six-class problem y ∈ {1, 2, 3, 4, 5, 6} with true function f. Assume that the error rate of each individual classifier $h_i$ is δ:

$$\Pr(h_i(x) \neq f(x)) = \delta$$

where $h_i(x)$ is the predicted result and $f(x)$ is the real result. Suppose that the ensemble H combines P individual classifiers by simple voting. If more than half of the individual classifiers are correct, the final decision is correct. If the errors of the classifiers are independent of each other and δ < 1/2, the Hoeffding inequality shows that the ensemble error rate satisfies

$$\Pr(H(x) \neq f(x)) \le \sum_{k=0}^{\lfloor P/2 \rfloor} \binom{P}{k} (1 - \delta)^{k}\, \delta^{P-k} \le \exp\left(-\frac{1}{2} P (1 - 2\delta)^2\right) \tag{16}$$

where P is the number of individual classifiers. Equation (16) shows that the ensemble error rate declines exponentially as the number of individual classifiers in the ensemble increases. However, in practical experiments, the errors of the individual classifiers are found not to be independent, so the key to ensemble learning is the way the individual classifiers are combined. Simple voting struggles to improve accuracy effectively in six-class classification. Therefore, we propose a normalized voting strategy for multi-class situations, which uses a weighted sum of the classifier outputs and takes the maximum output as the final result. In this strategy, $y_i$ represents the result predicted by the classifiers; x is the input; $\theta_i$ is the accuracy of the i-th classifier; $f_i(x)$ is the output of the i-th classifier; and min and max indicate the minimum and maximum accuracy, respectively.
The accuracies of all classifiers are then normalized, because normalization removes the model with the lowest accuracy. The weights of the remaining models are then used to obtain a relatively accurate result.
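One way to read the normalized voting strategy is sketched below. It assumes min-max normalization of the classifier accuracies, so that the least accurate classifier receives weight zero, followed by a weighted sum of the per-class outputs and an argmax; this reading is inferred from the description of min, max, and the removal of the weakest model, and is not a confirmed reproduction of the paper's formula. The function names and the three toy classifiers are hypothetical.

```python
def normalized_weights(accuracies):
    """Min-max normalize accuracies; the least accurate classifier gets weight 0."""
    lowest, highest = min(accuracies), max(accuracies)
    if highest == lowest:                      # all equally accurate: plain voting
        return [1.0] * len(accuracies)
    return [(a - lowest) / (highest - lowest) for a in accuracies]


def weighted_vote(outputs, accuracies):
    """outputs[i][c] is the score of classifier i for class c + 1.

    Returns the winning class label (1-based) and the per-class totals.
    """
    weights = normalized_weights(accuracies)
    n_classes = len(outputs[0])
    totals = [sum(w * out[c] for w, out in zip(weights, outputs))
              for c in range(n_classes)]
    return totals.index(max(totals)) + 1, totals


def one_hot(label, n_classes=6):
    return [1.0 if c + 1 == label else 0.0 for c in range(n_classes)]


# Three classifiers: the most accurate votes for gas 2, the other two for gas 5.
outputs = [one_hot(2), one_hot(5), one_hot(5)]
accuracies = [0.95, 0.80, 0.60]
label, totals = weighted_vote(outputs, accuracies)
weights = normalized_weights(accuracies)
```

Simple majority voting would pick gas 5 here, while the normalized weights let the most accurate classifier prevail because the weakest vote is discarded and the middle one is discounted.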

Improved Loss
The polarized classification produced by the softmax output of the LSTM model may cause over-fitting. In practice, the confidence interval cannot be well determined and the threshold is hard to set. Therefore, we have developed a new loss function based on cross entropy. In the designed loss function, ϕ is the minimum error that is tolerated; ε is the threshold; $\beta_j f_j(x_i)$ is the predicted result; and $y_i$ is the real distribution of the data. The 6 in the denominator reflects the six classes. Its purpose is to fit a uniform distribution and reduce overfitting. The role of the max operator can be shown by the inequality

$$-(1 - \varepsilon)\, y_i \log\left(\beta_j f_j(x_i)\right) - \frac{\varepsilon}{6}\, y_i \log\left(\beta_j f_j(x_i)\right) \ge \phi$$

Whenever the loss is greater than ϕ, namely when the difference between the predicted and the actual category is greater than ϕ, the loss takes its computed value. Otherwise, it is zero. At the same time, regularization is introduced into the loss function to deal with the problem of sensor drift identification and correction by the multi-classification model, so as to classify more accurately under sensor drift.
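The behavior of the thresholded loss can be illustrated for a single sample. The sketch below is an interpretation, not the paper's definition: it assumes a one-hot target, treats the predicted probability of the true class as the forecast term, evaluates the left-hand side of the inequality in the text, and returns that value when it reaches the tolerance and zero otherwise. The function name and the values of `eps` and `phi` are hypothetical.

```python
import math


def improved_loss(probs, target, eps=0.1, phi=0.05, n_classes=6):
    """Smoothed cross entropy for one sample, set to zero below the tolerance phi.

    probs  : predicted class probabilities (softmax output), length n_classes
    target : index of the true class (one-hot target assumed)
    """
    log_p = math.log(probs[target])
    raw = -(1.0 - eps) * log_p - (eps / n_classes) * log_p
    return raw if raw >= phi else 0.0


confident = [0.99, 0.002, 0.002, 0.002, 0.002, 0.002]
uncertain = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
loss_confident = improved_loss(confident, target=0)   # below phi, so no penalty
loss_uncertain = improved_loss(uncertain, target=0)   # above phi, penalty kept
```

Under this reading, predictions that are already close enough to the target contribute no gradient, so the model is not pushed toward ever more polarized softmax outputs.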

Regularization
Complex neural network models are prone to over-fitting. Regularization techniques are widely applied in machine learning to prevent over-fitting and improve the generalization ability of models. The regularization method actually eliminates the singularity by separating the curves with different tangent lines at the singular points of the irreducible plane algebraic surface [28]. Regularization adds constraints, for example the L0, L1, or L2 norm, to the minimization of the empirical error function. The constraint acts as a guide: during gradient descent, the optimization tends to move in the direction of the constraint, so the final result tends toward the constrained solution. The L0 norm is $\|w\|_0 = \#\{i : w_i \neq 0\}$, the number of non-zero elements. Regularizing the parameter matrix with the L0 norm is suitable for the feature matrix, but it is difficult to optimize. The L1 norm is $\|w\|_1 = \sum_i |w_i|$, the sum of the absolute values of the elements of the vector. The L2 norm is $\|w\|_2 = \sqrt{\sum_i w_i^2}$. L2 regularization minimizes the regularization term $\|w\|_2$, which makes every element of $w$ small and close to zero. Unlike with the L1 norm, however, the elements do not become exactly zero; they only approach zero. Figure 7 uses the L1 norm as an example of regularization. We randomly generate points of two colors in a two-dimensional space and separate them with a simple logistic regression. Figure 7b distinguishes the points better. However, when some extra data are added, the curve changes significantly. Thus, a more stable and generalized curve is essential to prevent over-fitting. As a result, we introduce regularization to enhance the stability and robustness of the model.
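The three norms, and the way a norm penalty enters the objective, can be illustrated with a short sketch. The penalty used here is the squared L2 norm scaled by a coefficient `lam`; both that choice and the value of `lam` are assumptions made for illustration, since the text does not give them.

```python
import math


def l0_norm(w):
    """Number of non-zero elements."""
    return sum(1 for v in w if v != 0)


def l1_norm(w):
    """Sum of absolute values."""
    return sum(abs(v) for v in w)


def l2_norm(w):
    """Square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in w))


def l2_regularized_loss(empirical_loss, w, lam=0.01):
    """Empirical loss plus an assumed squared-L2 penalty weighted by lam."""
    return empirical_loss + lam * sum(v * v for v in w)


weights = [3.0, 0.0, -4.0]
norms = (l0_norm(weights), l1_norm(weights), l2_norm(weights))
penalized = l2_regularized_loss(1.0, weights)
```

For the weight vector `[3.0, 0.0, -4.0]` the L0, L1, and L2 norms are 2, 7, and 5, so the penalty grows with the magnitude of the weights and discourages the large coefficients that typically accompany an over-fitted decision curve.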

Experimental Datasets and Environment
The dataset adopted in this study is the 'Gas Sensor Array Drift Dataset' from the UCI machine learning repository. The primary purpose of making this dataset freely accessible online is to provide the sensor and artificial intelligence research communities with an extensive dataset for developing and testing strategies to solve a wide variety of tasks, including sensor drift, classification, and regression, among others. The dataset consists of 13,910 measurements from 16 chemical sensors, collected from January 2008 to February 2011 (36 months). The sensors were exposed to six different gases at different concentration levels. The resulting dataset includes recordings of six pure gaseous substances, namely Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, and Toluene, dosed at a wide variety of concentration levels in the intervals (50, 1000), (5, 500), (12, 1000), (10, 300), (10, 600), and (10, 100) ppmv, respectively [9]. To handle the dataset conveniently, it is organized into 10 batches; the tables below show the gas combinations in each batch and the number of measurements per month. This reorganization ensures that each batch contains sufficient experimental data and that the number of experiments is distributed as evenly as possible. The goal of this study is to distinguish among the six gases regardless of their concentration levels.
The labels in the tables indicate that gas type 1 is ethanol; type 2 is ethylene; type 3 is ammonia; type 4 is acetaldehyde; type 5 is acetone; and type 6 is toluene. Each of the possible gas type-concentration pairs was sampled in no particular order. The resulting dataset consists of 13,910 recordings (time series sequences) collected over 36 months [7,29]. Moreover, as observed in Table 4, the last batch, which contains 3600 measurements of the same analytes, was purposely collected five months after the sensors had been powered off. This five-month gap plays an important role in this paper, not only because it allows us to validate the proposed method on an annotated set of measurements collected five months later, but also because, during this time, the sensors are prone to severe contamination, since external interferents can easily become irreversibly attached to the sensing layer. Using batch 10 as a test set can therefore effectively validate our method. Principal component analysis (PCA) [30] is carried out on the 10 batches in order to observe and analyze their distribution intuitively. Figure 8 shows the influence of drift on the data distribution. As time goes by, the two-dimensional subspace distribution of the other batches deviates significantly from that of batch 1 because of the drift. It is worth noting that the dynamic behavior of sensors after drift cannot be calibrated, so it is more valuable to use machine learning and data-adaptive methods to compensate for the sensor drift.
The experiments were run on a computer with an Intel Core i5-7300HQ processor (2.5 GHz base frequency, 3.5 GHz maximum frequency) and 8 GB of RAM. The operating system is Windows 10 (64-bit) and the programming language is Python 3.5.2. The integrated development environment is PyCharm 2017.1.2. The LSTM model is implemented with the TensorFlow 1.12.0 Python package.

Drift Experiment
In the experiment, the SVM kernel function is selected from [sigmoid, linear, poly], and the penalty factor C is selected from the range [$2^{-5}$, $2^{-4}$, ..., $2^{9}$, $2^{10}$]. The optimal kernel function and penalty factor are chosen according to the classification accuracy. Feature values in the training and testing sets are normalized into the range [−1, 1]. When testing on batch 1, 1/5 of batch 1 is held out as the test set and the remaining data form the training set; we then verify the classification accuracy on this test set. When batches 2-10 are used as test sets, batch 1 is used as the training set to train the SVM classifier. According to Table 5, the performance of the classifier changes with the experiment batch, that is, with time (Table 4 shows when the data in batches 1-10 were collected). The performance of the classifier decreases over time, which is an indicator of sensor drift: the lower the classification rate, the more severe the drift. This experiment verifies that the data collected by the sensors drift and that the drift reduces the performance of the classifiers.
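The search space and the feature scaling described above can be sketched as follows. This is an illustration rather than the authors' code: the helper names are hypothetical, and the scaling ranges are assumed to be fitted on the training set and reused for the test set, which the text does not state.

```python
KERNELS = ["sigmoid", "linear", "poly"]
C_GRID = [2.0 ** k for k in range(-5, 11)]  # 2^-5, 2^-4, ..., 2^10


def fit_ranges(train_rows):
    """Per-feature (min, max) computed on the training rows."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def scale_rows(rows, ranges):
    """Map each feature linearly so the training range becomes [-1, 1]."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0
            for v, (lo, hi) in zip(row, ranges)
        ])
    return scaled


def parameter_grid():
    """All (kernel, C) combinations to evaluate."""
    return [(kernel, c) for kernel in KERNELS for c in C_GRID]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
ranges = fit_ranges(train)
train_scaled = scale_rows(train, ranges)
grid = parameter_grid()
```

Each (kernel, C) pair in `grid` would then be trained and scored, and the pair with the highest classification accuracy kept. Under the assumption above, drifted test samples can fall outside [−1, 1] because their ranges differ from those of the training batch.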

Base Classifier
We consider the following four sets of data, named dataset1, dataset2, dataset3, and dataset4 according to the setting that produces each of them. The four settings are as follows:
• In Setting 1, the latest batch is selected as the training set to reduce the differences between the test set and the training set that are due to sensor drift.
• In Setting 2, all of the data before the latest batch are used as the training set, because the batches differ in size and the latest batch alone is not varied enough to cover all types of data.
• Setting 3 and Setting 4 reduce the dimension of the two datasets formed by Setting 1 and Setting 2, respectively, through the Pearson product-moment correlation coefficient (PPMCC).
The SVM trained in Setting 2 is a strong baseline, because it sees the most recent batch of examples that is not corrupted by drifted data from the past, and the data in Setting 2 were not analyzed for correlation. Therefore, we use the classification accuracy of the SVM in Setting 2 as the uncompensated comparison.
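One possible reading of the four settings is sketched below. Several points are assumptions: the "latest batch" is taken to be the batch immediately before the test batch, Setting 2 is taken to use every batch before the test batch, and the PPMCC-based reduction is illustrated by dropping any feature that is strongly correlated with a feature already kept. The function names and the 0.95 threshold are hypothetical, and the actual selection rule used in the paper may differ.

```python
import math


def training_batches(test_batch, setting):
    """1-based batch indices used for training under each setting (assumed)."""
    if test_batch < 2:
        raise ValueError("a test batch needs at least one earlier batch")
    if setting in (1, 3):
        return [test_batch - 1]            # only the latest batch
    if setting in (2, 4):
        return list(range(1, test_batch))  # every earlier batch
    raise ValueError("setting must be 1, 2, 3, or 4")


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature carries no correlation information
    return cov / (sx * sy)


def select_uncorrelated(columns, threshold=0.95):
    """Indices of features kept after dropping highly correlated ones."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


features = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
kept = select_uncorrelated(features)
```

Here the second feature is an exact multiple of the first and is dropped, while the third is only weakly correlated with the first and is kept.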

A. SVM model
The dataset is divided into 10 batches. The SVM model is applied to the four datasets to compensate for the sensor drift. The kernel function is chosen from [sigmoid, linear, poly], and the penalty factor C is selected from the range [$2^{-5}$, $2^{-4}$, ..., $2^{9}$, $2^{10}$]. We then compare the classification accuracies and choose the optimal kernel function and penalty factor. Table 6 shows that the highest classification accuracy of the SVM model reaches 99% over the test sets of batches 2-10. Figure 9 shows the SVM model tested on the four datasets. The highest classification accuracy on dataset 2 and on dataset 4 is 99% in both cases, and the average classification accuracy on dataset 4 is 83.1%. Across the four datasets, the lowest classification accuracy, 34.5%, occurs when batch 10 is the test set. We attribute this to the six-month gap between batch 9 and batch 10: a large number of interference signals appear in the data collected by the sensors in batch 10, which lowers the accuracy of the classifier. With batch 10 of dataset 4 as the test set, the classification accuracy reaches 70.6%. The SVM trained on dataset 2 is the baseline, with an average classification accuracy of 81.4% (Table 6). Compared with the baseline, the compensation improves the average accuracy by 1.7% on dataset 4. Because dataset 4 yields the highest average classification accuracy, its results are used for comparison with our proposed method, ILS.

B. LSTM model
We use the four datasets described above as the input of the LSTM model. We choose ReLU as the activation function and try several learning rates (learning_rate = 0.001, 0.0015, 0.0025, 0.005, 0.0075). The learning_rate that yields the highest test accuracy is used as the current learning_rate. Because the size of the training set grows dynamically with the setting, the number of iterations is set to 50 times the number of rows in the current training set. The LSTM model has four layers: an input layer fed by the different datasets, two hidden layers, and an output layer that distinguishes the six gases. We choose cross-entropy as the loss function and Adam as the optimizer.
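The learning-rate selection and the iteration rule can be sketched as follows. The `evaluate` callback is hypothetical: it stands for training the LSTM with a given learning rate and returning its test accuracy, which is not reproduced here.

```python
LEARNING_RATES = [0.001, 0.0015, 0.0025, 0.005, 0.0075]


def num_iterations(num_training_rows):
    """Iterations are 50 times the number of rows in the current training set."""
    return 50 * num_training_rows


def select_learning_rate(evaluate, candidates=LEARNING_RATES):
    """Return the candidate with the highest accuracy reported by evaluate."""
    return max(candidates, key=evaluate)


def fake_evaluate(learning_rate):
    # Stand-in for "train and test the model"; peaks at 0.0025 by construction.
    return 1.0 - abs(learning_rate - 0.0025)


best_rate = select_learning_rate(fake_evaluate)
iterations = num_iterations(100)
```

Because the training set changes from one test batch to the next, this selection would be repeated for every batch, which is why the text speaks of a "current" learning rate.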
As shown in Table 7, when the LSTM model uses batch 10 (collected six months after batch 9) as the test data, the accuracy reaches 78.6%. Compared with the SVM model under the same data configuration, the LSTM is 5.8% more accurate. Thus, for data separated from the training data by a long time gap, the LSTM classifier performs better. Figure 10 shows the LSTM model tested on batches 2-10. The highest classification accuracy among the four datasets is 85.7%, whereas that of the SVM model is 99.0%. Hence, the SVM model is more suitable for datasets that are small and whose time span is not too long. The average classification accuracy on dataset 4, the highest among the four datasets, is 76.4%, and the results on dataset 4 will be compared with our proposed method below.

C. Improved LSTM model
The same four datasets are used as the input of the improved LSTM model. In the improved LSTM model, the loss function of the LSTM model is replaced by the one shown in Equation (17). The rest of the improved LSTM model is configured in the same way as the LSTM model above.
The improved LSTM achieves a higher classification accuracy than the LSTM when batch 10 is the test set (batch 10 was collected six months after batch 9). On dataset 4 in Table 8, the classification accuracy reaches 83.3%, an increase of 4.7%. This verifies that the improved loss function adapts better to drifting data, which is why the improved LSTM is used as a base classifier of the multi-classifier. Figure 11 shows the improved LSTM model tested on the four datasets. The highest accuracy over all tests is 97.2%, on dataset 3. Among the four datasets, the highest average classification accuracy, 78.0%, is obtained on dataset 4. Therefore, the results of the improved LSTM on dataset 4 are used for comparison with the proposed method, ILS, below.

Ensemble Multi-Class Classifier
The ILS model adopts the improved LSTM model and the SVM model as its base classifiers. The multi-class integration uses the normalized weighted voting described above. The datasets under the four settings are the inputs of the base classifiers. The parameter settings of the improved LSTM model and the SVM model are the same as those selected above.
In Table 9, the SVM, LSTM, and Improved LSTM columns give the 'previous results' of the independent classifiers, and the ILS column gives the improved results of the proposed ensemble classifier. We compare the SVM, LSTM, and improved LSTM models, each at its highest average accuracy over the four datasets (Figures 9-12 above), with the ILS model in Figure 12. The average accuracies of the LSTM, the improved LSTM, the SVM, and the ILS model are 76.4%, 78.0%, 83.1%, and 89.3%, respectively. The classifier ensemble performs better than the trained SVM on batches 2, 3, 4, 5, 9, and 10. As mentioned above, this SVM is a very strong baseline, so performing better than or as well as it is a good result. The ILS model is better than the LSTM classifier on all batches from batch 3 to batch 10. With batch 9 and batch 10 for testing, the accuracy of the ILS model is 83.4%. ILS also performs better in terms of the average accuracy and the accuracy with batch 10 for testing. The highest classification accuracy of the ILS model is 99.0%. Note that, although the improved LSTM slightly improves on the trained LSTM, it performs worse on some of the batches.
We compare the ILS model with the independent classifiers above, which shows the superiority of the ILS. Below, the model is compared with different voting methods of ensemble learning. Figure 13 shows how the classifier weights used in the ILS model change with the batch.
Figure 13. Classifier weights used in the ensembles (SVM1 to SVM4 and LSTM1 to LSTM4). At every point on the x-axis (batch), the corresponding points on the y-axis are the weights of the individual classifiers used in the ensemble. Note that these weights have a maximum of 1. A dotted line indicates a classifier whose input dataset has not undergone correlation analysis, and a solid line indicates a classifier whose input dataset has undergone correlation analysis.

Comparison of Ensemble Voting Methods
In Section 4.2, we compared the integrated multi-classifier ILS with its base classifiers, LSTM and SVM, and found that ILS performs better than the base classifiers. In ensemble learning, however, the way the classifiers are combined is equally important, because some base classifiers in the multi-classifier have a negative influence on the prediction and thereby degrade the predictive ability of the ensemble. There are three ways to vote: majority voting, plurality voting, and weighted voting. In majority voting, the final result is the class that receives more than half of the votes: if more than half of the base learners predict category c, the ensemble predicts c; otherwise, the prediction is rejected. In this part, we therefore compare the proposed classifier combination method with the classical majority voting method and the weighted voting method.
In plurality voting, the prediction is the label with the highest number of votes. If several labels tie for the highest number of votes, one of them is selected at random. We regard the predicted output of $h_i$ on sample $x$ as an $N$-dimensional vector $(h_i^1(x); h_i^2(x); \ldots; h_i^N(x))$, where $h_i^j(x)$ is the output of $h_i$ on the category label $c_j$. The plurality voting method is:
$$H(x) = c_{\arg\max_j \sum_{i=1}^{T} h_i^j(x)}.$$
The weighted voting method is similar to a weighted average and can be expressed as:
$$H(x) = c_{\arg\max_j \sum_{i=1}^{T} w_i h_i^j(x)},$$
where $w_i$ is the weight of $h_i$, generally with $w_i \geq 0$ and $\sum_{i=1}^{T} w_i = 1$. In this part of the experiment, we compare the plurality voting method and the weighted voting method with the normalized weighted voting method proposed in this paper. The SVM and the improved LSTM described above are the base classifiers of the plurality voting method, and the four datasets generated by the same settings as above are used as their inputs. In the weighted voting method, the accuracy of each base classifier, expressed as a decimal, is reused as its weight.
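The three classical voting rules described above can be sketched with the standard library. The function names are hypothetical; predictions are given as class labels, one per base classifier, and the weights passed to `weighted_vote` are rescaled so that they sum to 1, as required by the constraint on $w_i$.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the label with more than half of the votes, else None (rejected)."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None


def plurality_vote(predictions, rng=random):
    """Return the label with the most votes; ties are broken at random."""
    counts = Counter(predictions)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    return rng.choice(winners)


def weighted_vote(predictions, weights):
    """Return the label with the largest sum of normalized weights."""
    total = sum(weights)
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w / total
    return max(scores, key=scores.get)


votes = [1, 1, 2, 3]
majority = majority_vote(votes)    # no label has more than half of the votes
plurality = plurality_vote(votes)  # label 1 has the most votes
weighted = weighted_vote([1, 1, 2], [0.2, 0.2, 0.9])
```

The example shows how the rules can disagree: with votes `[1, 1, 2, 3]` majority voting rejects the sample while plurality voting returns label 1, and in the weighted case a single accurate classifier outweighs two weak ones.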
In Table 10, Majority voting denotes the multi-classifier that uses majority voting as its combination strategy, and Plurality voting denotes the multi-classifier that uses plurality voting as its combination strategy. With batches 2, 3, 4, and 5 as test sets, the classification accuracy of Majority voting is slightly lower than that of Plurality voting. The main reason is that the base classifier with the lowest accuracy carries the same weight as the others in majority voting, which has a negative effect. When batch 10 is used as the test set, the classification accuracy of Plurality voting is 70.7% and that of Majority voting is 73.4%. On batch 2 and batch 3, the classification accuracy of the ILS model and that of Plurality voting are basically the same. In terms of average classification accuracy, the ILS model reaches 89.2%, whereas Majority voting and Plurality voting reach 84.2% and 83.9%, respectively. In an ensemble classifier, if a base classifier that has a negative influence on the prediction has high accuracy, it exerts a large negative influence on the voting prediction and easily causes mistakes, which degrades the predictive performance of the ensemble. Likewise, if the base classifiers with low accuracy carry the same weight as the others in majority voting, the predictive performance of the ensemble is easily affected. In this paper, the normalized weighted vote is used to cut out the base classifiers with low accuracy. In the prediction stage, the final result is obtained by the weighted combination of the remaining base classifiers. Normalized weighting thus improves the predictive performance of the ensemble classifier by eliminating the base classifier with the lowest prediction accuracy.
As shown in Figure 14, the classifier ensembles perform better than or as well as the trained classifier on most of the batches, with significant improvements in accuracy on several batches. Except on batch 3, batch 5, batch 7, and batch 8, ILS greatly improves the performance and achieves the highest accuracy. The results again show that the proposed method can effectively improve the classification and handle sensor drift by merging drift compensation into the classification task. This result clearly demonstrates the effectiveness of the proposed method for automatically detecting and coping with concept drift.
In terms of cost, the proposed method can train the gas classifiers on the features extracted from the datasets. The abstract features extracted by our method can cope with complex data and non-linear changes, so our method is not only robust but also universal with respect to gas sensor drift. The cost of this method is relatively low, because it only requires appropriately labeled data and no additional reference gas.
The ILS model is composed of the LSTM and SVM classifiers. Compared with the SVM, the LSTM has a higher time complexity. However, the proposed model does not increase the time complexity. The LSTM model has four sets of parameters, which correspond to the input gate, the forget gate, the output gate, and the candidate state. In each set, the parameters can be reduced to two matrices, U and V, which map the input and the output, respectively. The dimension of U is hidden × input, and the dimension of V is hidden × hidden. The network therefore learns these two matrices for each set, so the total number of parameters of the LSTM is $4(nm + n^2 + n)$, where $n$ is hidden_size and $m$ is input_size. Table 11 gives the average time consumed by each algorithm in order to compare the algorithms in terms of time complexity.
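The parameter count can be checked with a short calculation. The sizes used below are arbitrary illustrative values, not the configuration used in the experiments.

```python
def lstm_parameter_count(hidden_size, input_size):
    """Total LSTM parameters: 4 * (n*m + n^2 + n) for n hidden units, m inputs."""
    n, m = hidden_size, input_size
    input_weights = n * m   # matrix U (hidden x input)
    hidden_weights = n * n  # matrix V (hidden x hidden)
    biases = n              # one bias per hidden unit
    # The same three blocks exist for the input gate, forget gate,
    # output gate, and candidate state, hence the factor of 4.
    return 4 * (input_weights + hidden_weights + biases)


example_count = lstm_parameter_count(hidden_size=32, input_size=16)
```

With 32 hidden units and 16 inputs this gives 4 × (512 + 1024 + 32) = 6272 parameters. The quadratic term in the hidden size dominates as the layer grows.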

Conclusions
Supervised learning can effectively manage and compensate for sensor drift. In this study, we propose a supervised learning method based on multi-classifier integration to compensate for drift in gas sensors. The model takes advantage of the few-shot classification capacity of the SVM and the long-term memory characteristics of the LSTM. In addition, the improved loss function eliminates the polarization caused by using softmax in the LSTM model. The model then combines the SVM with the improved LSTM. Through the normalized weighted voting strategy, the base classifier with the lowest accuracy is removed in every voting process so that the proposed model, ILS, adapts to the sensor drift, which effectively improves the performance of the sensor drift classifier. The model makes no assumptions about the nature of the drift, so it generalizes better. In addition, the data used were collected over a long period of time and exhibit drift, which makes the dataset relatively comprehensive for exploration. On datasets under four kinds of settings, we conduct a correlation analysis and show that better approximation results can be obtained with an enlarged hypothesis space. Compared with the SVM, the LSTM, and the improved LSTM model, the proposed method achieves the highest accuracy of 99.0% and an average accuracy of 83.4%.
Our model has achieved good experimental results on the current dataset. However, supervised learning requires considerable manpower and resources for dynamically labeling and training data to compensate for the sensor drift. Moreover, the classifier model has a longer training time than a Random Forest. In the future, we will further investigate the classification of unlabeled drifting sensor data in a semi-supervised manner and attempt to optimize the model [31].

Conflicts of Interest:
The authors declare no conflict of interest.