Drift Compensation on Massive Online Electronic-Nose Responses

Abstract: Gas sensor drift is an important issue in electronic nose (E-nose) systems. This study addresses the problem under conditions that require instant drift compensation with massive online E-nose responses. Recently, an active learning paradigm was introduced for such conditions. However, it does not consider the "noisy label" problem caused by the unreliability of its labeling process in real applications. We therefore propose a class-label appraisal methodology and an associated active learning framework to assess and correct noisy labels. To evaluate the performance of the proposed methodology, we used datasets from two E-nose systems. The experimental results show that the proposed methodology helps the E-noses achieve higher accuracy with lower computation than the reference methods. We conclude that the proposed class-label appraisal mechanism is an effective means of enhancing the robustness of active learning-based E-nose drift compensation.


Introduction
An electronic nose (E-nose) is a kind of odor-sensing device containing a gas sensor array and appropriate recognition algorithms [1,2]. The gas sensor is a fundamental part of an E-nose, and gas sensor drift heavily impedes the performance stability of E-noses. To address this problem, users are often required to perform a series of drift calibration experiments to retrain the recognition algorithms, which leads to compulsory pauses during routine work. This is clearly unsuitable for online tasks requiring continuous gas sensing, such as toxic gas alarms [3,4], air pollution monitoring [5,6], gas source tracking [7,8], and intensity measurement of gas mixtures [9][10][11].
Regarding studies on E-noses, drift compensation remains appealing to researchers, who focus on two directions: signal preprocessing approaches [12][13][14][15][16][17][18] and machine learning models [19][20][21][22][23][24][25]. For signal preprocessing, classical signal decomposition approaches (e.g., principal component analysis (PCA), orthogonal signal correction, independent component analysis, and wavelet analysis) have been used to filter out drift-like signals. On the other hand, machine learning methods have tried to obtain a proper data space projection for drift data via a multiobjective model and an associated solution process. Generally, both types more or less require a number of drift calibration samples with class labels provided from extra drift calibration experiments or algorithmic inferences. However, this seems unrealistic in online odor monitoring of extensive and successive E-nose responses, and it is weak in robustness due to uncertain class labels.
To gain valuable drift calibration samples and associated class labels without any time-consuming experiments, an active learning (AL) paradigm was introduced in a recent publication [26]. As Figure 1 shows, AL allows drift compensation (classification model updating) without any pauses during odor recognition. It selects a small number of the most informative drift calibration samples from incoming massive drift gas sensor array responses. Meanwhile, the selected samples are labeled by a human expert (odor discriminator) immediately. Then, both the selected samples and the provided labels are added into a drift calibration set to update the classification models of E-noses. However, the AL paradigm fully trusts the expert's annotation, which may deteriorate the recognition performance when the expert is affected by a series of considerable factors (e.g., inattentive errors, lack of experience, and environmental disturbance). Here, we refer to this issue as the "noisy label" problem.
In this study, we aim to detect suspect class labels of drift calibration samples and ask the expert to relabel them. The proposed methodology, named mislabel probability estimation based on a Gaussian mixture model (MPEGMM), performs class-label appraisal by indicating the potential mislabel probability of each drift calibration sample. Under the assumption that drift responses vary slowly with time, the mislabel probability is calculated according to the degree of label disagreement between a Gaussian model and the human expert. The labeled samples with high mislabel probability are then relabeled to obtain correct class labels from the human expert. Finally, the renewed drift calibration set can be used for classification model updating.
Two E-nose drift datasets, a public benchmark and one collected from an E-nose we designed, were used for the compensation performance assessment. The experimental results show that the proposed method can satisfactorily identify the mislabeled drift calibration samples in the presented data. In the meantime, the recognition results after relabeling with the proposed methodology reach higher accuracy than those of the reference methods. Finally, the novelty of the MPEGMM is threefold: (1) supporting online drift calibration under suspect class labels, (2) adopting a Gaussian mixture model to endure slow data distortion caused by gas sensor drift, and (3) determining the relabeling budget adaptively to avoid unnecessary computation.
The rest of the paper is organized as follows: Section 2 describes the related works on noisy-label detection of AL. In Section 3, we illustrate the proposed method and associated steps. Then, the experimental results and discussions are presented in Section 4. Finally, Section 5 concludes this paper.

Related Works
The class label queried from an expert is conventionally assumed as an oracle in AL methods. That is to say, common AL methods do not contain an appraisal mechanism for obtained class labels. Thus, typical active learning methods are incapable of resisting the negative effect caused by incorrect class labels. These incorrect class labels are seen as noisy labels in the drift calibration set for E-nose drift compensation.
As far as we know, the "noisy label" problem in AL can be treated by class-label appraisal methods in two manners. The first is to generate a reliable label from multiple experts [27][28][29]. Although this manner can quickly provide a denoised label, the cost of employing multiple experts may become a serious concern in practical use. The second strategy therefore depends on a single expert to save labor costs: a mislabeled instance is decided based on the label information of tested instances [30]. Considering that a k-NN classifier is sensitive to label noise, Wilson et al. adopted 3-NN to remove instances whose labels differed from the classifier output [31]. Further, Bouguelia et al. measured the disagreement level of classifier outputs from one expert, judging incorrect labels by their likelihood [32,33]. Additionally, a bidirectional AL method picked the mislabeled sample with the minimum expected entropy under different label assumptions [34]. However, the above solutions were designed for data from a single distribution, which makes them unsuitable for drifted data whose distribution moves gradually. Thus, it is necessary to propose a one-expert methodology for the noisy-label problem on slowly varying data. As shown in Table 1, we compare our proposed MPEGMM with the other methods mentioned in three aspects (accuracy, adaptation, cost). Our method not only obtains higher accuracy and adaptation but also incurs a lower cost.

Table 1. Comparison between the mislabel probability estimation method based on a Gaussian mixture model (MPEGMM) and other methods.
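As a concrete illustration of the single-expert filtering idea behind [31], the following sketch (on toy data, not the paper's datasets) flags samples whose label disagrees with the majority vote of their 3 nearest neighbors:

```python
import numpy as np

def knn_label_filter(X, y, k=3):
    """Flag samples whose label disagrees with the majority of their
    k nearest neighbors (a sketch of Wilson-style 3-NN editing)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    suspect = np.zeros(n, dtype=bool)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        nn = np.argsort(d)[:k]             # indices of the k nearest neighbors
        labels, counts = np.unique(y[nn], return_counts=True)
        if labels[np.argmax(counts)] != y[i]:
            suspect[i] = True              # neighborhood majority disagrees
    return suspect

# Two well-separated clusters; one point of cluster 0 is mislabeled as 1.
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
y = [0, 0, 0, 1, 1, 1, 1]                  # index 3 carries a noisy label
print(knn_label_filter(X, y))              # only index 3 is flagged
```

Such a filter works well on a fixed distribution, but, as noted above, a slowly drifting distribution breaks the neighborhood assumption, which motivates the GMM-based approach.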

Improved Active Learning Framework for E-Nose Drift Compensation
AL attempts to select a limited number of drift calibration samples from historical instances for classifier updating. The common steps of AL-based drift calibration are summarized in Algorithm 1. In particular, we adopted "uncertainty sampling (US)" as the sample selection strategy F(x) in the following sections due to its popularity. Among the various measuring metrics of US, we chose the "posterior probability margin" (margin_u) [35] to represent the uncertainty of an instance x_u as follows:

margin_u = f_h(ŷ_c1 | x_u) − f_h(ŷ_c2 | x_u),  (1)

x* = arg min_{x_u ∈ U} margin_u,  (2)

where f_h(·) represents the posterior probability computed by a single classifier h, and ŷ_c1 and ŷ_c2 represent the categories predicted with the maximum and second maximum posterior probability, respectively. Therefore, a smaller margin_u means greater uncertainty. As Formula (2) describes, the selected sample x* is the one with the minimum margin_u in the unlabeled historical sample set U.
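The margin-based selection rule of Formulas (1) and (2) can be sketched as follows; the posterior values are hypothetical and stand in for the outputs f_h(·|x_u) of the trained classifier:

```python
import numpy as np

def select_most_uncertain(posteriors):
    """Uncertainty sampling by posterior probability margin: for each
    unlabeled sample, margin_u = p(top-1 class) - p(top-2 class); the
    sample with the smallest margin is queried (Formula (2))."""
    P = np.sort(np.asarray(posteriors, dtype=float), axis=1)  # ascending per row
    margins = P[:, -1] - P[:, -2]          # top-1 minus top-2 posterior
    return int(np.argmin(margins)), margins

# Hypothetical posteriors f_h(y|x_u) for three unlabeled samples, three classes.
posteriors = [[0.8, 0.1, 0.1],    # confident -> large margin
              [0.45, 0.40, 0.15], # ambiguous -> small margin
              [0.6, 0.3, 0.1]]
idx, margins = select_most_uncertain(posteriors)
print(idx)  # 1: the most ambiguous sample is selected for labeling
```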
Considering that the expert might provide noisy (error) labels, it is necessary to detect the mislabeled instances and deliver them to the expert for relabeling. Hence, we modified the traditional framework by injecting a "class-label appraisal" mechanism (as shown in Figure 2). After the classifier is updated by first-round labeling, the added part detects mislabeled instances, queries the class labels of mislabeled instances from the expert again, and reupdates the drift calibration set with refreshed labels. As a result, a corrected drift calibration set can be formed for classifier updating without any interruption to online recognition.
where   h f  represents the posterior probability computed by a single classifier h, and ˆ1 c y and 2 c y represent the categories predicted by the maximum and second maximum posterior probability, respectively. Therefore, smaller marginu means greater uncertainty. As Formula (2) describes, the selected sample x* should be the one with the minimum marginu in the unlabeled historical sample set U.
Considering that the expert might provide noisy (error) labels, it is necessary to detect the mislabeled instances and deliver them to the expert for relabeling. Hence, we modified the traditional framework by injecting a "class-label appraisal" mechanism (as shown in Figure 2). After the classifier is updated by first-round labeling, the added part detects mislabeled instances, queries the class labels of mislabeled instances from the expert again, and reupdates the drift calibration set with refreshed labels. As a result, a corrected drift calibration set can be formed for classifier updating without any interruption to online recognition.

Class-Label Appraisal
The goal of class-label appraisal is to evaluate the correctness of the expert-given class labels based on the historical drift calibration samples. Considering that drift data are slowly varying, we suppose that most of the drift calibration samples approximately follow the same data distribution. Accordingly, we adopted the Gaussian distribution as the assumed distribution for each class of drift calibration samples, because it can fit a newly drifted data distribution even from a small number of previous samples. Thus, we named our proposed class-label appraisal methodology mislabel probability estimation based on a Gaussian mixture model (MPEGMM). In addition, the MPEGMM can automatically determine the optimal relabeling budget (the number of drift calibration samples to be relabeled) to avoid over-relabeling.
Considering that a sample of E-noses is always a multidimensional vector, we compute the category possibility of a sample x according to multivariable Gaussian distribution as follows:

p(x | c_i) = (2π)^(−D/2) |Σ_i|^(−1/2) exp(−(1/2)(x − u_i)^T Σ_i^(−1) (x − u_i)),  (3)

where u_i and Σ_i denote the mean vector and covariance matrix, respectively, of the drift calibration samples belonging to category c_i, and D represents the dimension of the sample x. As a result, the whole drift calibration set L can be summarized by a Gaussian mixture model (GMM) with K components:

p_M(x) = Σ_{i=1}^{K} α_i p(x | c_i),  (4)

where K represents the total number of categories, and α_i is the mixture coefficient of c_i. To measure the reliability of the expert's labeling, we calculate the posterior probability of each given label y by the Bayes theorem:

p_M(y = c_i | x) = α_i p(x | c_i) / Σ_{j=1}^{K} α_j p(x | c_j),  (5)

where p_M(y = c_i | x) is the posterior probability that sample x lies on the i-th distribution component of the GMM. For sample x, we define the maximum posterior probability as the label reliability (LR):

LR(x) = max_i p_M(y = c_i | x).  (6)

A higher LR implies that the current label y = c_i is more reliable; in other words, c_i is more likely than the other categories to be the true category of x. Thus, we estimate the mislabel probability p_err(x) of each instance with Formula (7), where y_mp and y_g denote the labels obtained from Formula (6) and from the expert, respectively, and f(·) is a decreasing function measuring the mislabel probability; in this study, we select a typical nonlinear decreasing function as f(·) (Formula (8)). If y_mp ≠ y_g, p_err(x) depends on the difference between LR and p_M(y = y_g | x): the expert may still have annotated the label correctly when the posterior of the annotated label is close to the maximum output of the GMM, in which case a small p_err(x) is obtained, and vice versa. If y_mp = y_g, f(·) makes p_err(x) inversely related to LR, so a larger LR means a lower probability that sample x was labeled incorrectly. We then calculate the expected entropy I_G over a sample set G (Formula (9)) and the expected entropy increment of x:

∆I_x = I_{G=(U/x)} − I_{G=U},  (10)

where I_{G=(U/x)} and I_{G=U} represent the expected entropy of the unlabeled historical sample set U excluding and including x, respectively.
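A minimal numerical sketch of the appraisal quantities in Formulas (3)-(6), assuming hand-set component parameters rather than ones fitted from a real calibration set:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density as in Formula (3)."""
    d = len(mean)
    diff = x - mean
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
           np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def label_reliability(x, means, covs, alphas):
    """Posterior p_M(y=c_i|x) over the GMM components by Bayes' theorem
    (Formula (5)); LR is the maximum posterior (Formula (6))."""
    joint = np.array([a * gaussian_pdf(x, m, c)
                      for a, m, c in zip(alphas, means, covs)])
    post = joint / joint.sum()
    return post, int(np.argmax(post)), float(post.max())  # posteriors, y_mp, LR

# Toy calibration set: two classes summarized as two Gaussian components.
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
alphas = [0.5, 0.5]

x = np.array([0.2, -0.1])                  # clearly near class 0
post, y_mp, lr = label_reliability(x, means, covs, alphas)
y_expert = 1                               # suppose the expert mislabeled x
# Disagreement between y_mp and y_expert combined with a high LR signals
# a likely noisy label, so MPEGMM would assign x a high p_err(x).
print(y_mp, y_mp != y_expert, lr > 0.9)
```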
A larger ∆I_x means that sample x is more significant for reducing the uncertainty of the drift calibration set L. Further, we combine the two quantities into an indicator δ_x (Formula (11)): a greater δ_x denotes that sample x has both a higher mislabel probability and greater uncertainty. Accordingly, the sample with the greatest δ_x is the one most in need of relabeling. To control the relabeling budget, we estimate the number of correctly labeled samples (Formula (12)), where N represents the capacity of the current selected sample set S, and then determine the relabeling budget θ (Formula (13)).

3: Calculate u_i and Σ_i for each class of instances in L', and generate the GMM as in Formulas (3) and (4).
4: Calculate the mislabel probability p_err(x) as in Formula (7).
5: Calculate the expected entropy increment ∆I_x of x as in Formulas (9) and (10).
6: Calculate the indicator δ_x as in Formula (11).
7: end for
8: Estimate the relabeling budget θ as in Formulas (12) and (13).
9: Sort all selected instances in descending order of δ_x.
10: Relabel the θ instances with the highest δ_x, and denote the corrected S as S'.
11: Update the calibration set L: L ← L ∪ S'.
12: Return the updated drift calibration set L.
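The ranking and relabeling steps (6-10) can be sketched as below; since the exact form of Formula (11) is not reproduced here, the sketch assumes a simple multiplicative combination of p_err(x) and ∆I_x, and the input values are hypothetical:

```python
import numpy as np

def relabel_order(p_err, delta_I, theta):
    """Rank first-round labeled instances by delta_x, assumed here to be
    the product of mislabel probability and entropy increment (a stand-in
    for Formula (11)), and return the theta instances to send back to the
    expert for relabeling (steps 6-10 of the appraisal algorithm)."""
    delta = np.asarray(p_err) * np.asarray(delta_I)   # assumed combination
    order = np.argsort(delta)[::-1]                   # descending delta_x
    return order[:theta].tolist()

p_err = [0.9, 0.1, 0.6, 0.2]       # hypothetical mislabel probabilities
dI    = [0.5, 0.8, 0.7, 0.1]       # hypothetical entropy increments
print(relabel_order(p_err, dI, theta=2))  # -> [0, 2]
```

With θ estimated from Formulas (12) and (13), only the θ most suspect instances are relabeled, which is how the method avoids over-relabeling.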

Datasets
We use two datasets to evaluate the performance of the proposed method. One (dataset A) is a public benchmark from the UC Irvine Machine Learning Repository [22], while the other (dataset B) is collected from an E-nose system designed by us.

Dataset A
Dataset A was collected by an E-nose with an array of 16 gas sensors (four commercial series: TGS2600, TGS2602, TGS2610, and TGS2620) over 36 months. Since eight features were extracted from each gas sensor response, one experiment can be denoted as a vector with 128 (16 × 8) dimensions. An intact experiment took at least 300 s to complete, divided into 100 s for the gas injection phase and at least 200 s for the cleaning phase. Meanwhile, the experimental environment was controlled at a stable level (10% R.H., 25 ± 1 °C). In total, 13,910 samples were collected through the detection of six kinds of pure gaseous substances in the concentration range of 10-1000 ppmv (acetone, ammonia, acetaldehyde, ethylene, ethanol, and toluene). Notably, dataset A was divided into 10 batches by the authors according to the acquisition time series. To accommodate E-nose drift compensation scenarios based on active learning, we integrated the small batches (batches 4 and 5) into a bigger one: batch 4&5. Figure 3 shows the sample distribution of the resulting nine batches. We can observe an obvious difference between two adjacent batches caused by gas sensor drift effects.
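The batch integration described above can be sketched as follows; the `batches` dictionary is a hypothetical in-memory layout, not the repository's file format:

```python
import numpy as np

def merge_batches(batches, merge_ids=(4, 5)):
    """Integrate the small batches 4 and 5 into a single batch '4&5',
    keeping time order, as done for dataset A."""
    merged = {}
    buf_X, buf_y = [], []
    for k in sorted(batches):
        X, y = batches[k]
        if k in merge_ids:
            buf_X.append(X)
            buf_y.append(y)
            if k == max(merge_ids):
                merged["4&5"] = (np.vstack(buf_X), np.concatenate(buf_y))
        else:
            merged[str(k)] = (X, y)
    return merged

# Dummy batches: batch k holds 10*k samples of 128 features each.
batches = {k: (np.zeros((10 * k, 128)), np.zeros(10 * k)) for k in range(1, 11)}
merged = merge_batches(batches)
print(len(merged), merged["4&5"][0].shape)  # 9 batches; batch 4&5 has 90 rows
```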

Dataset B
Dataset B was generated from our E-nose system over 4 months. As Figure 4 shows, the designed E-nose system consists of three parts: a gas sensor array, a sample injection system, and a control module. In an intact experiment, both the baseline and test stages lasted 3 min, maintaining a flow rate of 100 mL/min. Additionally, the cleaning stage lasted 6 min, maintaining a flow rate of 200 mL/min. For feature extraction, we used H_0 and H to represent the steady-state voltage values of the baseline and test stages, respectively, and computed the extracted feature of each gas sensor response from them. Considering the 32 gas sensors (listed in Table 2) in our E-nose system, each experiment can be represented as a 32-dimensional sample vector. We performed 441 experiments (30% R.H., 20 ± 1 °C) over 4 months on seven objects: beer, wine, liquor, black tea, green tea, pu'er tea, and oolong tea. First, we mixed the original solution and distilled water at a volume ratio of 1:4; the original solution of the tea samples was obtained by steeping 2 g of solid tea leaves in 200 mL of distilled water for 5 min. Then, the mixed liquid was injected into a closed container and sealed for 10 min. Finally, the headspace gases were used as experimental samples. We collected these 441 samples as dataset B and divided them into three batches (63, 189, and 189 samples) in time order. As with dataset A, we plotted the PCA scatter points in Figure 5 to show the sample distribution of dataset B. We noticed that the distributions of batches 2 and 3 were similar owing to their close acquisition times, while batch 1 showed a significant variation in data distribution. Therefore, we can infer that drift calibration is needed for recognizing samples from varied distributions.
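A sketch of the per-sensor feature computation from H_0 and H; the exact feature formula did not survive in this copy, so the code assumes the common fractional-change feature (H − H_0)/H_0 used for MOS gas sensors:

```python
import numpy as np

def steady_state_feature(H0, H):
    """Per-sensor feature from the baseline (H0) and test (H) steady-state
    voltages. The source formula is not reproduced here; the relative
    change is assumed as a typical choice for MOS sensor responses."""
    H0 = np.asarray(H0, dtype=float)
    H = np.asarray(H, dtype=float)
    return (H - H0) / H0                     # assumed: fractional response

# One experiment -> a 32-dimensional sample vector (32 sensors).
rng = np.random.default_rng(0)
H0 = rng.uniform(1.0, 2.0, size=32)          # baseline voltages (V)
H = H0 * rng.uniform(1.1, 3.0, size=32)      # voltages during the test stage
sample = steady_state_feature(H0, H)
print(sample.shape)                          # (32,)
```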


Experimental Setup
In order to simulate online drift scenarios of an E-nose, two experimental settings described in [22] were used:
Setting 1 (long-term drift): Batch 1 was regarded as the training set, while the other batches were assumed to be drift data for successive testing.
Setting 2 (short-term drift): Batch K was regarded as the training set, while batch (K + 1) was assumed to be drift data for successive testing.
To validate the effectiveness of resisting the noisy-label problem, we compared the MPEGMM with other methods, including k-NN (k = 3, 3-NN) [31], classifiers vote (Vote) [32], disagreement measure (Disagree) [33], and bidirectional AL (BDAL) [34]. We assumed that the class labels from the expert were not completely correct during the AL process. Thus, we defined the label-noise ratio LNR = N_err / N, where N_err and N denote the numbers of mislabeled instances and drift calibration samples, respectively. We set three different LNRs (10%, 20%, and 30%) for both datasets A and B. Additionally, we selected about 5% of the samples from each testing batch (seen as the unlabeled sample set) for labeling, while the remaining 95% of the samples were used for odor recognition. In terms of the classifier, we adopted a support vector machine (SVM), a popular and excellent classifier, for E-nose drift data classification.
We chose the linear function as the kernel function of the SVM as a trade-off between higher performance and lower computational load. The penalty factor C was adjusted in the range of 10^−3 to 10^3 with a two-phase grid optimization. In the first phase, we tested C at the points 10^−3, 10^−2, 10^−1, 1, 10, 10^2, and 10^3 and chose the two candidate intervals around the best point. Then, one tenth of the length of the chosen interval was used as the step length to explore the optimized C. After this two-phase optimization, we set C = 0.6 and 2 for datasets A and B, respectively. In addition, the parameter w of f(·) was determined through the Monte Carlo method, and the optimized values of w are presented in Table 3.
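The two-phase grid optimization of C can be sketched independently of the SVM itself; `score(C)` stands in for a hypothetical validation accuracy:

```python
import numpy as np

def two_phase_grid(score, coarse=(-3, -2, -1, 0, 1, 2, 3)):
    """Two-phase grid search as described: test C = 10^k on the coarse
    grid, then refine around the best point with a step of one tenth of
    the surrounding interval. `score(C)` is any validation metric."""
    Cs = [10.0 ** k for k in coarse]
    best = int(np.argmax([score(C) for C in Cs]))
    lo = Cs[max(best - 1, 0)]                # interval around the best point
    hi = Cs[min(best + 1, len(Cs) - 1)]
    step = (hi - lo) / 10.0                  # one tenth of the interval
    fine = np.arange(lo, hi + step / 2, step)
    return float(fine[np.argmax([score(C) for C in fine])])

# Hypothetical unimodal validation score peaking near C = 0.6.
score = lambda C: -abs(np.log10(C) - np.log10(0.6))
C_opt = two_phase_grid(score)
print(C_opt)  # the best point on the refined grid
```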

Table 3. Parameter values of the MPEGMM.


Recognition Comparison
In this subsection, we aim to (1) demonstrate the superiority of the improved AL framework and (2) illustrate the effectiveness of the proposed MPEGMM in mislabeled-instance selection. Since the error-annotated labels were randomly set, we used the average value and standard deviation (Mean ± STD) over 10 repetitions to show the recognition performance of each method. Figure 6 presents the accuracies of the different methods under setting 1. We adopted blue bars and a red line with asterisks to represent the mean and standard deviation, respectively. It is clear that no matter which LNR and dataset were adopted, the proposed MPEGMM achieved the highest accuracy among all the tested methodologies. In Figure 6a-c, the accuracies of the MPEGMM are obviously higher than those of the other reference methods on dataset A. The accuracy of the MPEGMM is always around 90% in all cases, while the accuracies of the other paradigms are mostly less than 80%. In Figure 6d-f, we plot the accuracies on dataset B. The proposed MPEGMM was still the one with the highest accuracy among all the adopted methods. Furthermore, compared with the "NoProcess" strategy, the other methods demonstrated their effectiveness on recognition performance in most cases. Thus, we find that dealing with noisy labels in an AL procedure has a great impact on drift compensation.
As Tables 4 and 5 show, we reported all recognition accuracies on datasets A and B under setting 2. The best one in each case is marked in bold. Obviously, the proposed MPEGMM is more efficient and robust than the other methodologies on both datasets A and B. In Table 4, the MPEGMM achieves the highest recognition accuracy of 97.90% in batch 6→7 with LNR = 10%. In Table 5, the MPEGMM still reaches the highest accuracy of 86.84% in batch 2→3 with LNR = 20%, which is 8.55% higher than the second-best one, 3-NN. As a result, we believe that the MPEGMM is an effective strategy for E-nose drift compensation in the AL-based calibration framework.
From the above results, the accuracy difference between the MPEGMM and the other reference methods under setting 1 is significantly greater than that under setting 2. This is because the number of mislabeled instances gradually increases in the long-term scenario, which causes a faster degradation of the classifier's performance when noisy labels are left uncorrected. To explain why the MPEGMM achieves excellent recognition rates, we examined the noisy-label detection accuracies under setting 1. We plot the average accuracy calculated over batches 2-10 of dataset A in Figure 7a. The accuracies of the MPEGMM are apparently higher than those of the other reference methods, except at the point LNR = 20%. Although the Disagree strategy achieves the highest detection accuracy of around 83% at this point, the MPEGMM performs more stably under various LNRs. Meanwhile, in Figure 7b, we present the average detection accuracy of batches 2-3 from dataset B. The bars of the MPEGMM are on top compared with all the other methods. Thus, we conclude that the proposed MPEGMM method can successfully identify more mislabeled instances. This is the key reason the updated classifier performs well under long-term drift.

Parameter Sensitivity
The purpose of the MPEGMM is to detect mislabeled instances and deliver them to the expert for relabeling. Compared with the other reference methods, the MPEGMM not only measures the mislabel probability of each selected instance but also estimates the total number of noisy labels in a drift calibration set. In particular, this estimate is used in all tested methodologies to control the number of relabeled instances. If the estimate exceeds the actual number, additional labeling costs are incurred; conversely, if the estimate is smaller, some erroneous labels remain in the drift calibration set. It is therefore necessary to observe how the parameter w, which controls the estimate θ, affects the result. Thus, we vary w over the set w = 10^λ, λ ∈ {−3, −2, −1, 0, 1, 2, 3, 4}. Figures 8 and 9 show the parameter adjustment results on datasets A and B under setting 2. We use red, magenta, and blue to indicate LNRs of 10%, 20%, and 30%, respectively. For each color, a solid line and a dashed line represent the estimated and actual numbers of noisy labels, respectively, and the intersection of the two lines corresponds to the range of optimal values. In Figure 8a–h, the intersection points mostly fall in the range 1–10 when LNR equals 20% or 30%. When LNR = 10%, the estimated quantity is slightly larger than the actual number, so we choose the closest interval (10^2, 10^4) as the optimized range. Accordingly, in Figure 9a,b, we observe that 1–10 is the optimal parameter range when LNR equals 10% or 20%, and the case of LNR = 30% behaves similarly. Overall, it is clear that the proposed MPEGMM can accurately estimate the total number of noisy labels within a proper interval of the parameter w.
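The sweep over w = 10^λ can be sketched as follows. We fit a two-component 1-D Gaussian mixture to per-instance mislabel scores and count instances whose posterior for the high-score component exceeds a w-dependent threshold, so larger w yields a larger estimate θ̂. The score distribution, the tiny EM routine, and the mapping 1/(1 + w) from w to a threshold are all our illustrative assumptions, not the paper's exact MPEGMM formulation.

```python
import numpy as np

def fit_two_gmm(x, n_iter=200):
    """Tiny EM for a 1-D, two-component Gaussian mixture (illustrative only)."""
    mu = np.array([x.min(), x.max()], dtype=float)  # spread-out initialization
    sigma = np.full(2, x.std() + 1e-6)
    pi = np.array([0.5, 0.5])
    r = np.full((len(x), 2), 0.5)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each score
        dens = pi / (sigma * np.sqrt(2 * np.pi)) * np.exp(
            -0.5 * ((x[:, None] - mu) / sigma) ** 2)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return mu, r

def estimate_noisy_count(scores, w):
    """Estimate the number theta of noisy labels from per-instance mislabel scores.

    Instances whose posterior for the high-score component exceeds 1 / (1 + w)
    are counted, so a larger w lowers the threshold and enlarges the estimate.
    This mapping from w to theta is our assumption, not the paper's exact rule.
    """
    x = np.asarray(scores, dtype=float)
    mu, r = fit_two_gmm(x)
    noisy_comp = int(np.argmax(mu))  # treat the high-score mode as "mislabeled"
    return int(np.sum(r[:, noisy_comp] > 1.0 / (1.0 + w)))

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.3, 0.15, 160),   # plausible clean scores
                         rng.normal(0.7, 0.15, 40)])   # plausible mislabeled scores
for lam in (-3, 0, 4):                                 # sweep w = 10^lambda
    print("lambda =", lam, "-> theta_hat =", estimate_noisy_count(scores, 10.0 ** lam))
```

Plotting θ̂ against λ alongside the true count (40 here) reproduces the kind of solid-versus-dashed-line comparison shown in Figures 8 and 9: the estimate grows monotonically with w and crosses the true count in some interval of w.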

Computational Complexity
Table 6 reports the average execution time of all tested methodologies over 10 repetitions on dataset A under setting 2. Vote and Disagree take similar times to identify one possible mislabeled instance, because both need to train an SVM classifier on the same dataset. 3-NN takes longer, since the drift calibration set is larger and the distance between the test instance and every drift calibration instance must be computed. BDAL consumes the longest time, which is reasonable: each time the label of a selected instance changes, BDAL must retrain and retest the classifier. In contrast, the MPEGMM takes the least time among all tested methodologies, because it only computes the increment of the expected entropy from a probability model instead of running a full classifier training. In summary, we conclude that the computational complexity of the MPEGMM is superior to those of the other reference methods.
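The cost contrast above can be illustrated with a toy timing experiment: retraining a model once per candidate (the BDAL-style pattern) versus fitting a probability model once and then scoring each candidate cheaply (the MPEGMM-style pattern). The "training" here is a deliberately heavy dummy computation (a least-squares solve), not a real SVM or the paper's actual models.

```python
import time
import numpy as np

def expensive_train(X, y):
    """Stand-in for retraining a full classifier (as Vote/Disagree/BDAL must)."""
    # heavy dummy computation: normal-equations least-squares fit
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def cheap_model_eval(model_params, x):
    """Stand-in for MPEGMM's expected-entropy increment from a probability model."""
    return float(x @ model_params)

rng = np.random.default_rng(2)
X, y = rng.normal(size=(500, 64)), rng.normal(size=500)
candidates = rng.normal(size=(100, 64))

t0 = time.perf_counter()
for x in candidates:              # retrain once per candidate label change
    expensive_train(X, y)
t_retrain = time.perf_counter() - t0

params = expensive_train(X, y)    # fit the probability model once
t0 = time.perf_counter()
for x in candidates:              # then only cheap per-instance scoring
    cheap_model_eval(params, x)
t_model = time.perf_counter() - t0

print(f"retrain-per-candidate: {t_retrain:.4f}s, model-based: {t_model:.4f}s")
```

The exact times depend on the machine, but the model-based loop is consistently orders of magnitude faster, mirroring the ranking reported in Table 6.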


Conclusions
In this paper, we proposed a class-label appraisal methodology, MPEGMM, to improve the active learning-based drift compensation framework under massive online data. The main idea of the MPEGMM is to measure the mislabel probability of each selected instance with a Gaussian mixture model. Furthermore, the MPEGMM estimates the labeling budget for noisy labels in a dataset and delivers the most valuable instances to the expert for relabeling. In the experiments, we simulated two representative scenarios, long-term and short-term drift, on two datasets. The MPEGMM achieves the highest recognition accuracy in most cases: 97.90% and 86.84% are the best recognition scores on datasets A and B, which are 0.42 and 8.55 percentage points ahead of the best reference method, respectively. The key reason is that the MPEGMM correctly detects most of the mislabeled instances, thereby improving the recognition performance of the classifier. Moreover, the accuracy of the relabeling budget estimation is mainly affected by the parameter w, and the results show that 1–10 is a generally favorable range for it. Because the MPEGMM uses a probability model instead of a complicated classifier to estimate the expected entropy increment, its computation time is dramatically lower than those of the other reference methods; accordingly, the shortest average execution time, 0.299 s per instance identification on dataset A, is obtained by the MPEGMM. Overall, it is a suitable choice for handling the noisy-label problem arising in online drift compensation of E-noses.
Reliable class labels are an important issue in gas sensor drift compensation under massive online data. Beyond our present concern, multi-gas mixtures, unknown interferents, and temperature and humidity effects are other challenging points in E-nose studies. A comprehensive methodological framework should be established to address these problems in the future.