Online Sensor Drift Compensation for E-Nose Systems Using Domain Adaptation and Extreme Learning Machine

Sensor drift is a common issue in E-Nose systems and various drift compensation methods have received fruitful results in recent years. Although the accuracy for recognizing diverse gases under drift conditions has been largely enhanced, few of these methods considered online processing scenarios. In this paper, we focus on building online drift compensation model by transforming two domain adaptation based methods into their online learning versions, which allow the recognition models to adapt to the changes of sensor responses in a time-efficient manner without losing the high accuracy. Experimental results using three different settings confirm that the proposed methods save large processing time when compared with their offline versions, and outperform other drift compensation methods in recognition accuracy.


Introduction
Sensor technologies aim at providing a convenient and intelligent life for human beings and have been largely enhanced in recent years. For example, wearable sensors make health monitoring and early disease classification with minimum discomfort possible [1,2], optical sensors used in clinical diagnosis have made the detection of specific compound such as calcium more easy-to-use [3], and motion sensors have made smart phones not only tools for communications, but also means to provide personalized services [4].
By using specific sensors, identification of chemical gases becomes possible. Electronic-Nose (E-Nose) systems, also known as machine olfaction, is one of such units. Combining with pattern recognition techniques, E-nose systems can classify multiple gases mixed together in different concentrations [5], which leads to a wide application in airport and train station checkpoints [6], food security [7], environmental monitoring [8], clinical diagnosis [9] and so on. Despite the fascinating applications that make our daily lives more intelligent, there has been a major problem in E-Nose systems that makes the recognition capability of sensors degrade after some time. In E-Nose systems, the readings X rely on the chemical reactions between gas compounds and the sensor materials, together with some recognition mechanism such as machine learning to create connections between diverse gas types and their corresponding readings. Mathematically, we can use a function y = f (X) to denote such connections. Given a proper trained f (X), the outputs should match all designated gas compounds. However, in practice, when a f (X) is trained on the collected data perfectly, the outputs gradually fail to match the right gases, and the phenomenon is called sensor drift.
Currently, it is commonly accepted that the drift problem in sensors is due to two causes. One is the chemical process that happens between sensor materials and the environment, also called the first-order drift, and the other is the system noise, namely the second-order drift [10][11][12]. Researchers have been trying to solve the problem in material science, sensor selection strategies and post processing mechanisms. In material science, durable materials were invented to prolong the life of sensors [13][14][15][16]. Meanwhile, proper selection of more resilient sensors to drift is another way to achieve the goal [17,18]. In the perspective of post-processing, the drift problem can be taken as the changes of distributions of gas labels over time. To maintain the stable recognition capability of the sensors, classifier ensemble techniques have relived the problem to some extent [19][20][21]. However, the learning process of the methods is supervised and requires human effort to label the training set beforehand. Furthermore, the methods assume that the data distributions remain the same for different gases, which is not always true. Component Correction (CC)-based methods [22][23][24] and Sequential Minimal Optimization (SMO)-based [25,26] are the most effective supervised ways to adjust the model to the drift. Nevertheless, CC-based methods assume the drift acts in the same way for diverse gases, which is sometimes not the case, and SMO-based methods sometimes update the model by following the wrong reference label. Another effective and promising approach is to use the transfer learning technique, namely domain adaptation. Zhang et al. [27] achieved one of the highest accuracies using semi-supervised methods, but the learning mechanism follows an offline training scheme, which makes the data generated in real time hard to be processed in time. Together with the fast development in big data, an enormous amount of sensor data are generated per second, which has made timely processing a great challenge.
To effectively discover information collected from the data, an online learning mechanism that can train and update the model in a time-efficient manner without losing the classification accuracy is required. An online learning method in machine learning uses the current model and newly received data to update the analytical model. In this way, the updates of the model do not require training from scratch and the capability of drift compensation can keep up with the data generated in massive amounts. In this paper, we combine the theory of online sequential extreme learning machine and Domain Adapation Extreme Learning Machine (DAELM) [27] for it has appealing performance and has achieved almost 100% for specific datasets. We wish to transform the offline learning version in the work into an online learning version without losing the performance for drift compensation. The contributions are three-fold: • The selection of representative samples plays an essential part in semi-supervised methods such as DAELM-S and DAELM-T in [27]. Therefore, we analyzed the characteristics of sample selection and provided two online sampling strategies regarding whether testing error can be used as feedback. To leverage between the effects of labeled and unlabeled samples, we transformed DAELM-T into its online learning version and proposed Online Target Domain Adaptation Extreme Learning Machine (ODAELM-T). Unlike ODAELM-S, which only relies on the source domain and the labeled set, ODAELM-T leverages the effects of labeled and unlabeled set to the model. Based on the changes of the two sets, the update phase is divided into three learning process, namely unlabeled incremental learning, unlabeled decremental learning and labeled incremental learning.
The remainder of the paper is organized as follows. Section 2 introduces some preliminaries on online processing in sensors, domain adaptation and extreme learning machine to help understand the methods in the paper. Methodologies of ODAELM-S and ODAELM-T are detailed in Section 3. Section 4 describes the dataset used in the experiments and the experimental set-up, followed by a detailed analysis on the results. Conclusions are drawn in Section 5. For illustration purposes, abbreviations of the frequently used terms are listed after Section 5.

Online Processing in Sensors
In 2013, Paniagua et al. focused on delivering time efficient drift countermeasures using pattern recognition methods, which is one of the earliest research articles in online drift compensation [28]. However, the methods used in the paper were limited and not much work was done to change the algorithms into online learning versions. Ghafarinia et al. combined pattern recognition with the transient features of Capillary-attached Gas Sensor and proposed an online gas diagnosis algorithm that could determine unknown gases and their concentrations [29]. Nevertheless, the neural network model used in their paper was simple, which consisted of less than 10 nodes, and the features are no more than 40. Moreover, the method was designed for gas diagnosis, instead of drift compensation. With the developments in machine learning, new models and theories have been used for drift problem. However, most of the methods focus on an offline learning manner which leaves online detection and recognition a void. Due to the learning mechanism, some of the methods cannot be simply applied to the scenario. Meanwhile, the time cost for training the model is not trivial due to the growing number of data.
As early as in 2004, Ma et al. introduced the challenges and problems in building online recognition model, although the discussion focused more on the nature of the data rather than the learning algorithms and specific problem domain as sensor drift [30]. In recent years, online processing have been a common issue in sensor related domains. In 2012, Munir et al. proposed an online optimization method to help sensor parameter adjust to the environment changes [31]. In 2013, Zhang et al. proposed online detection for outliers in wireless sensor network to ensure the high quality of the data [32]. Between 2015 to 2017, researches related to online processing in sensors has expanded from decentralized model for resource limited environment [33] and time-efficient monitoring and detection [34,35] to more complicated tasks such as gesture recognition [36], source location [37] and fault diagnosis [38] in specific applications. Therefore, it is also imperative to combine advanced algorithms with online processing for gas sensor drift compensation.

Domain Adaptation
In domain adaptation, the distributions of samples in the data are referred to as domains [39]. In practice, a pre-trained model on a given training set (source domain) usually fails to perform well in testing sets (target domains), due to the distribution differences between training and testing sets. The phenomenon is called domain shift. To deal with the problem, some related approaches include transfer learning, semi-supervised learning, self-taught learning and multiview analysis. In the applications such as computer vision, sentimental analysis, natural language processing, video concept and wifi location detections, domain adaptation techniques have received fruitful results [40][41][42].
In the perspective of transfer learning, domain adaptation is also taken as a special case, in which labeled samples are only available in source domain and both target and source domains share the same single task [43]. Currently, there are four types of methods in domain adaptation, namely instance weighting, self labeling, feature representation and latent feature learning. Due to the expensive efforts for manually labeling the samples in the target domain, feasible and common approach is to perform semi-supervised learning which leverages the effects of limited labeled and unlabeled samples [39,43].

Extreme Learning Machine
As a special type of random neural network, Extreme Learning Machine (ELM) can be represented by a three layer feed forward neural network whose parameters between input and hidden layers are randomized [44]: Specifically, ELM can be written as Equation (1), where β 1 , β 2 , . . . , β L are the output weights between the hidden layer with L neurons and the output layer, and g i is a nonlinear piecewise continuous function, namely activation function [45]. The output weights of ELM can be acquired by minimizing the approximation error, i.e., Equation (2), where H, denoted by Equation (3), is the output of hidden layer with L neurons, n is the number of neurons in the output layer, and T is the target vector of training data: min Unlike other neural network approaches, ELM directly calculates the output weight matrix β using Equation (4), where H † is the Moore-Penrose pseudoinverse of H: The learning mechanism of classic ELM is tagged as batch learning, which requires a full set of training data. In contrast to batch training, Online Sequential Extreme Learning Machine (OSELM) was proposed to learn from data in a one-by-one or block-by-block way [46]. Instead of using all the samples in the training stage, which is also known as offline learning, OSELM sequentially learns from the data to capture the varying patterns hidden beneath them. In OSELM, the calculation of output weight matrix, denoted by β k+1 , relies only on the new sample x k+1 and the previous output weight matrix β k : The training process can be formulated as Equation (5), where P k+1 = P k − P k H T k+1 (I + H k+1 P k H T k+1 ) −1 H k+1 P k , H k+1 is the hidden layer output corresponded to the new sample x k+1 , and T k+1 is the corresponding target vector.

Online Domain Adaptation Extreme Learning Machines
To achieve timely processing without losing the recognition accuracies for sensor drift, we intend to transform current state-of-art batch learning methods into their corresponding online versions. Domain adaptation-based drift compensation has been proved to possess high accuracies in [27]. The two algorithms, namely DAELM-S and DAELM-T, are based on batch learning and require selecting a group of representative samples beforehand. The selection algorithm is based on the distribution of the entire dataset. However, in an online processing scenario, samples in the target domains arrive in sequence as a data flow. Therefore, the original sample selection method is not applicable. Supposing we can determine when to select the representative samples, there is another problem of how much time and human effort it costs. Additionally, the batch learning mechanisms of DAELM-S and DAELM-T require calculating the classifier based on the full dataset. Since the data arrives in sequence, there is no doubt that the data used in the next update will overlap with the ones in previous updates. In this case, there will be repeated calculation of the same data over time, which costs more time and resources as the size of target domain increases.
In an online learning model using either DAELM-S and DAELM-T, we wish to maintain the performances of the two methods while solving the obstacles described in the previous paragraph. Figure 1 shows the sample changes in a target domain when a new sample X i arrives. The labeled set refers to the manually labeled samples and the unlabeled set is the rest of the samples in a target domain. When X i arrives in the target domain, it triggers two possible cases regarding whether to select and label a group of samples in the unlabeled set. When no selection and labeling happens, X i is added to the unlabeled set and the labeled set remains unchanged. In this case, the changes of samples, denoted by δSet, has only one sample, namely X i . When the selection and labeling happens, the situation is more complicated. Given some sample selection algorithm, a group of representative samples, denoted by SSet, is selected from the target samples received to date. Note that, in this case, the unlabeled set includes the new sample X i . Considering the fact that the selected samples may overlap with current labeled ones, δSet is the difference set between SSet and LSet k−1 , where LSet k−1 is the latest selected sample set. The blue dashed circle in the figure represents the δSet. When labeling happens, δSet is removed from USet and added into LSet. The X k in δSet refers to any possible samples selected, including the new sample X i . In classic ELM, a variant called OSELM can be applied to reduce the calculation when new sample arrives in sequence. The similar process can be applied to both DAELM-S and DAELM-T by using incremental learning. The idea is to transform original batch learning process into a recursive process. Taking the arrival sequence in Figure 1, for example, when X i−1 arrives, the classifier trained on current received data is written as f (β k−1 T ), where β k−1 T is the parameter or parameter vector that requires training. The output of f can be the gas labels or the probabilities of the sample belonging to certain gases. In incremental learning, we wish to derive a recursive form of the training process where β k T can be written as a new formula denoted as g(β k−1 T , K), where K is some intermediate results. In this way, instead of training the new β T with all the data received to date, we can use only the previous result and the current increment for updates. By doing so, we can save the repeated calculation and achieve a time-efficient algorithm for generating new classifiers. In OSELM, the parameter P k+1 in Equation (5) is the intermediate result. In DAELM-S and DAELM-T, we can find similar terms to achieve the goal. Once the update algorithm is produced, the only challenge becomes selecting the representative samples in an online manner.
In the following part of the section, we first analyze the sampling strategies regarding online learning. Then, corresponding online learning versions of DAELM-S and DAELM-T are described in detail.

Online Sampling Strategies
Assuming we can determine the time for updates, the only problem is to determine δSet. We view the classification model as an intelligent agent and the arrival sequence of the target samples like perceptions of the environment. Similar to the description on intelligent agents by Stuart and Peter in [47], the actions of the model, including whether to select and label samples, depend on the entire sequence of samples received to date, not on anything that has not yet appeared. Therefore, we can apply Kennard-Stone (KS) algorithm [48,49] on the current received samples to determine δSet. Ideally, if we can perform KS whenever the distribution changes, the problem is solved. However, the changes in distribution may be a slow process, which brings another problem of defining the changes of distribution. Even though a periodical collection of the samples may be a trade-off plan, the circle for collection may vary, which brings up another problem of how to decide the circle beforehand.
Normally, we wish to label enough samples so that the model can be more precise. However, manually labeling in semi-supervised methods is time-consuming and labeling more samples contradicts with our goal of saving time. Eventually, the problem becomes a trade-off between high classification accuracy and low human efforts.
Selecting and labeling samples in a predefined circle is the easiest to apply, although not very applicable since proper circles for different datasets may vary. The reason for labeling more samples is to provide more information to track the changes of data. In this sense, the labeling should be more likely to happen when the performance degrades and vice versa. In this case, we consider the labeling process as a probability event, which possesses the following characteristics: • The chance of labeling is inversely proportional to number of labeled samples; • The chance of labeling is proportional to number of samples in total; • The chance of labeling is proportional to the classification error.
The performance of the model is based on the classification accuracy of current model. In this case, the system would require receiving the accuracy. When the performance of the classification model is assumed to degrade, the labeling should be more frequent, and vice versa. Meanwhile, when the number of labeled samples is small, the labeling should happen more frequently, and vice versa. However, when the accuracy is not accessible, the system would have no idea whether it performs well or not. In this case, the system can only decide for selection and labeling based on the number of samples received so far.
Ideally, if we can determine the representative samples for each target domain as [27] does, we can achieve equivalent high classification accuracies in an online manner. However, the method used to determine the representative samples in [27] is based on the distribution of the entire target domain, which is unable to be acquired in an online scenario. A more feasible way is to label more samples instead of labeling specific ones. Therefore, we use Equation (6) to depict the probability of labeling where is current classification error, and y and x are the numbers of labeled samples and entire samples, respectively. In the experiments, the equation managed to possess the aforementioned features while maintaining the number of the labeled samples in a relatively small value. Note that Equation (6) is an attempt to depict the probabilities instead of ideal calculation of the probability. Therefore, more sophisticated methods can be used to replace it: In the experiments, the equation is not good enough. Another possible issue may be that the labeled samples keep growing even when the accuracy is high, say over 90%. In an online learning scenario, if the manually labeling process labels too many samples, the method is not applicable since too much resource time would be spent in the process. Therefore, we set another criteria for the process, namely the minimum accepted accuracy. In practice, when the accuracy is stable and very high, it does not require extra labeling. In this paper, the maximum number of KS is set to 50 so as to limit the growth of labeled samples. The minimum accepted accuracy is set to 90%. When the residual error is larger than 10%, the labeled process happens.

Online Source Domain Adaptation Extreme Learning Machine
Similar to OSELM, which uses incoming data to update the model in an online manner, we transformed the training of DAELM-S into an incremental learning procedure and proposed Online Source Domain Adaptive Extreme Learning Machine (ODAELM-S). Figure 2 shows the framework of ODAELM-S. In DAELM-S, the learning only involves the source domain and the labeled samples from target domain. Since the source domain remains unchanged during the learning phase, the update of the model only happens when new samples are labeled. Initially, the target classifier is the source classifier for there is no labeled sample in the target domain. After the classifier is initialized using source domain samples, it can receive and learn the patterns from target domain in an online manner. The left rectangle in the figure represents the arrival sequence of the samples in the target domain. Let X i be the current arrival sample that belongs to either unlabeled or labeled samples. If no selection happens, X i is added into the set of unlabeled samples and there is no update in the classifier. When the online sample selection happens, the sample(s) for labeling can only be chosen from the unlabeled samples and X i . Note that X i may not be chosen when the selection happens at the arrival of the sample, but it may be selected later by another selection process. Whenever labeled samples are selected, the target classifier updates itself based on an online learning mechanism described in the following paragraphs. In DAELM-S, the model is taken as an extension of the classifier trained on the source domain. The objective function is written as Equation (7) are the corresponding hidden layer outputs of source domain and labeled samples from target domain, respectively. To obtain an proper value of β S , it requires minimizing the objective function: By calculating the gradient of L with respect to β S as Equation (8), we can calculate the optimal value of β S by setting the gradient to 0: Note that, regarding whether there are more rows or columns in H S , solving β S can be either an overdetermined or under-determined problem. When it is an overdetermined problem, we assume β S is a linear combination of the columns of H S , i.e., β S = H T S α. Therefore, by setting Equation (8) to 0, the hidden layer output β S can be formulated as Equation (9), where P = H S H T S and Q = H S H T T : Let the case where H S has more rows be case 1 and the other be case 2. When new samples are labeled, the hidden layer output becomes H k+1 where δh is the corresponding hidden layer output of the newly labeled samples. In order to save the calculation, ODAELM-S uses some intermediate result as For case 1, K k+1 can be defined as Equation (10): Hence, K −1 k+1 can be updated using Equation (11): Subsequently, β S can be updated using Equation (12): For case 2, let K k+1 be defined as Equation (13): Similarly, the update of the intermediate result and output weight can be written as Equations (14) and (15), respectively, where δk = δhH T S : The pseudo code for ODAELM-S is shown in Algorithm 1. Before the updates begin, ODAELM-S first initializes a base classifier using source domain data (lines 1-3). When a base classifier has been created, the classification of gases can be available. When a new sample in the target domain arrives, ODAELM-S calculates the possibility for samples in the target domain to be selected and labeled (lines 5-6). When the process is determined (line 7), a group of unlabeled sample will be selected as δSet (lines [8][9]. Based on whether H S has more rows or columns, ODAELM-S updates the corresponding hidden layer output and some intermediate result (lines [10][11][12][13][14][15][16]. The process continues when no more samples arrive.

Input:
L := the number of hidden layer neurons; Act := the activation function type; SD := the source domain data; 1: Initialize two empty sets, i.e., LSet and USet, as labeled and unlabeled sets, respectively; 2: Set activation function as Act and initialize an ELM with L hidden nodes with SD; 3: Let Hs be defined as in Equation (9); 4: while new sample x in the target domain arrives do 5: Calculate the probability P for labeling; 6: Generate random value between 0 and 1 as p; 7: if p < P then 8: Select a group of samples from USet as δSet for labeling; 10: if H S has more rows then 11: Update the classifier using Equations (11) and (12); 12: else 13: Update the classifier using Equation (14) and (15); 14: end if 15: Set LSet = LSet − δSet and USet = USet − δSet; 16: else 17: Add x to USet; 18: end if 19: end while

Online Target Domain Adpatation Extreme Learning Machine
To transform DAELM-T into its online learning version, we proposed Online Target Domain Adaptation Extreme Learning Machine (ODAELM-T). Different from ODAELM-S, ODAELM-T leverages both labeled and unlabeled samples in the target domain by using Equation (16), in which β T is the output weight matrix, C T , H T , t T are the same as in DAELM-S, and C Tu and H Tu are the corresponding regularization parameter and the hidden layer output of the unlabeled samples in the target domain. Obviously, the update of the model is more complicated than ODAELM-S: Figure 3 shows the procedure of ODAELM-T. Initially, a source classifier is trained on a source domain. Unlike ODAELM-S in which the classifier is built upon both source domain and labeled samples in the target domain, in DAELM-T, only the output weight matrix β S contributes to the initialization and updates of the target classifier. In ODAELM-T, solving Equation (16) results in two different cases depending on the numbers of rows and columns. In [27], when the number of rows in H T is smaller than that of columns, the Lagrange multiplier method was applied by using It is equal to assuming that the output weight is a linear combination of H T and H Tu . However, due to the fact that the two cases are based on the rows and columns of H T , it is reasonable to just assume β T = H T T α T . Therefore, we can rewrite the output weight matrix of DAELM-T as Equation (17), where P = H T H T T and Q = H T H T Tu .
, H T has more columns. Based on the appendix of [50], we know that the Gaussian kernel is of full rank in any case. In ELM, HH T and H T H are "ELM kernel" matrices [51]. Noting that Gaussian kernel is a special Radial Basis Function (RBF) kernel, we can ensure that H T H T T and H T T H T is of full rank if we use Gaussian function as the activation function. Moreover, we can further induce that Lemma A1 stands (see Appendix A), so Equation (17) can be transformed into online learning versions.
When new sample X i arrives, the algorithm determines whether to select and label a group of samples in the target domain. When no such process happens, X i is added into the unlabeled set. However, when the process takes place, δSet, which is described earlier, may consist of one or more samples and X i may or may not be in it. Due to the fact that manually labeling is time consuming, when X i arrives at first, it will be put into an unlabeled set. When δSet is determined, ODAELM-S will first take out the effects of these samples by using decremental learning in unlabeled sets. Subsequently, when the labeling is finished, ODAELM-T will perform incremental learning in the target domain. Therefore, there are three different learning mechanisms in ODAELM-T, which ensure the classifier is up-to-date during its lifetime.
The pseudo code for ODAELM-T is described as Algorithm 2. Initially, ODAELM-T generates a source classifier as ODAELM-S does (lines 2-3) and sets the labeled and unlabeled sets as LSet and USet, respectively. When new sample x in the target domain arrives, ODAELM-T calculates the probability of selecting and labeling samples in the target domain as ODAELM-S does (lines 5-6). When the selection and labeling happens, ODAELM-T firstly adds x into USet and selects the group of samples from target domain for labeling (lines [8][9]. Note that, initially, there is no target classifier. Therefore, ODAELM-T will initialize an ELM with L hidden nodes when the first group of samples are labeled (lines [11][12]. Once the target classifier is initialized, the target classifier will update itself based on the changes between LSet and USet. When x is added into USet, ODAELM-T follows unlabeled incremental learning. After a group of samples, i.e., δSet, are chosen, ODAELM-T will perform unlabeled decremental learning (lines [15][16]. Subsequently, when δSet was manually labeled, ODAELM-T will perform incremental learning (lines [17][18]. In the circumstance that no labeling happens, there is only unlabeled incremental learning (lines [21][22].

Input:
L := the number of hidden layer neurons; Act := the activation function type; SD := the source domain data; 1: Initialize labeled and unlabeled set as LSet and USet, respectively. 2: Initialize the source classifier of L hidden nodes using ActType with SD; 3: Let H S and β S be defined as Equation (17); 4: while new sample x in the target domain arrives do 5: Calculate the probability P for labeling; 6: Generate random value between 0 and 1 as p; 7: if p < P then 8: Add x into USet;

Unlabeled Decremental Learning
When a group of samples (δSet) are selected for labeling, ODAELM-T updates the model first by eliminating the effects of samples in δSet. The process is called unlabeled decremental learning.
As shown in Figure 5, X k is selected from an unlabeled set for labeling process. Note that k can be any arbitrary index from 1 to i, and there can be more than one sample for labeling. Let the corresponding hidden layer output of δSet be δh. For the case where there are more rows than columns, let K and Right be written as Equation (27): For current update procedure, the intermediate result K −1 k+1 can be formulated as Equation (28): Correspondingly, the output weight β k+1 T can be formulated as Equation (29): For the case where there are more columns than rows, let P = H T H T T and Q = H T H T Tu . Since H Tu has changed, the corresponding results regarding P and Q can be written as Equation (30): Note that K k+1 and Right k+1 can be written as Equation (31): Subsequently, we can write the intermediate result and the output weight matrix as Equations (32) and (33), respectively, where δk is defined as δk = δhH T T :

Labeled Incremental Learning
After new samples are manually labeled, the incremental learning ensures that the model does not need to be recomputed from scratch. As shown in Figure 6, the unlabeled samples remain unchanged in this case. Therefore, the changes happens in H T . The decremental part X k in this process is added into a labeled set. Note that X k in the figure is just an example and there can be more than one sample added into the labeled set. Let the increment part be δSet with its label be δt, and the corresponding hidden layer output be δh. For the case where H T has more rows, let the current intermediate result K k be Equation (34):  (35): By using Woodbury formula, the inverse of K k+1 can be formulated as Equation (36): Let Right be C T H T T T T + C Tu H T Tu H Tu β S , and then Right k+1 can be formulated as Equation (37): Substitute Equations (36) and (37) into Equation (17), and the current output weight can be formulated as Equation (38): For the case where H T has more columns, let K k be Equation (39), where P k = H T H T T and Q k = H T H T Tu : The inverse of K k+1 involves updates of P k+1 and P −1 k+1 Q k+1 Q T k+1 . Meanwhile, the inverse of current P k+1 becomes Equation (40) where C = H T δh T , d = δhδh T and F 2 = (d − CP −1 k C) −1 : Note that Q k+1 Q T k+1 and P −1 k+1 Q k+1 Q T k+1 can be formulated as Equations (41) and (42) To further compute the result, the formula becomes too complex, which increases the computation cost. Therefore, in this case, we simply update the output weight matrix beta k+1 based on the batch learning version. However, in order for the two cases to combine together, we still use Equation (43) where Right k+1 is formulated as Equation (44): Considering that H T changes with the arrival of X i , the relation between the numbers of rows and columns in H T may not be static over time. To be specific, transitions may happen when the numbers of rows and columns are the same. Initially, the labeled set has few samples and increases as the labeling happens. Given enough time and samples, eventually, the size of the labeled set will match the size of the hidden neurons, i.e., the numbers of rows and columns in H T are the same. At this time stamp, case 1 and 2 coincide with each other. In order to continue performing incremental learning, the transition between the intermediate results of the two cases happens.
For the case where there are more rows than columns, β T = K −1 1 Right 1 where K 1 and Right 1 can be formulated as Equation (45), respectively: For the case where there are more columns than rows, β T = K −1 2 Right 2 where K 2 and Right 2 can be formulated as Equation (46): When the rows are equal to the columns, the two expressions should be equal.  (47) and (48):

Experimental Data Description
In order to verify the effectiveness of the proposed methods, the chemical gas sensor dataset published on University of California Irvine (UCI) machine learning repository was used in the paper. Table 1 shows the details of the dataset. The data comprise readings from a sensor array of 16 metal gas sensors for continuous 36 months and one label field. For each gas sensor, the reading contains two steady status and six dynamic measurements. For details of the dataset, see [52].

Experimental Setup
All experiments in this paper were conducted in Matlab (2015b, MathWorks, Natick, MA, USA) on a Linux Workstation (Shanghai, China) with an E5 2.6-GHz CPU and 32 GB RAM. The settings for ELM network followed the work in [27], which used up to 1000 hidden layer nodes with RBF activation function. The parameters, tagged C s , C T and C Tu , have the same meaning and value as in [27].
The batches are organized in sequence based on their batch number. The current batch was used as a target domain, in which the samples were randomized and organized in sequence to simulate the arrival of samples in online scenarios. Meanwhile, the previous batch was used as the source domain. The performance of the target classifier was recoded at each arrival of samples and formulated as Equation (49), in which Num + was the number of correctly classified sample and Total referred to the number of the samples received so far: To show the effectiveness of the proposed methods, we compared them with original batch learning versions, i.e., DAELM-T and DAELM-S. In addition, we chose an additional four commonly used machine learning-based methods. Firstly, ELM, SVM and Random Forest (RF) are chosen for their frequent uses in constructing classifiers, among which ELM shares a similar structure with the proposed methods while SVM and RF are two commonly used methods in classification. Secondly, ensemble based methods have been widely used in gas sensor domains. Hence, we also included Ensemble-SVM and Ensemble-ELM in which the sub-classifiers were built based on SVM and ELM, respectively. To sum up, we compared both ODAELM-S and ODAELM-T with seven machine learning-based methods. In the experiments, we did not use models such as Recurrent Neural Network (RNN) to learn the temporal behavior of the sensor errors. One reason is that the dataset we have at hand has been preprocessed by the publishers. The temporal information has been transformed into a steady and dynamic state of Exponential Moving Average (EMA) model, which does not include time information any more. Additionally, RNN and similar models such as Long Short-Term Memory (LSTM) are trained using Backpropagation through time (BPTT) or its variants. If the models are unfolded over time, they each can be viewed as a deep neural network. Training such deep neural networks can be very time-consuming, and so is updated. Hence, we do not compare the proposed methods with these models for the training time does not fit for online scenario. Accelerating the learning process of RNN or LSTM is another research area that is outside the scope of this paper.
In the experiments, we used the following experimental settings to evaluate the performances of the proposed methods regarding both classification accuracy and processing time.
• Setting 1: The labeled sets were selected from each target domain beforehand using the KS algorithm. In the arrival of the target samples, the methods treated the sample as labeled if it was in the pre-selected labeled sets, and vice versa. In this way, all the methods used in the experiment shared the same sample labeling process in the same arrival sequence of samples in the target domain. Although this setting does not work in practice, it can quantitatively evaluate the performance between offline and online learning, especially in processing time. • Setting 2: The samples in the target domain was treated as a data flow and fed to the classification model in a one-by-one manner. The labeling process was treated as a random event in which the probability of labeling was related to current received and labeled samples only. To stop the the number of labeled samples from becoming too many, we set a threshold as 0.9. When the classification rate reaches the threshold, the probability of labeling is set to 0. This setting stimulates real application scenarios to provide qualitative comparisons for the proposed methods. • Setting 3: The samples in the target domain was treated in the same way as setting 2. However, in this setting, we assume the testing accuracies on target domain would act as feedback and the probability of labeling was related to it as well. This setting stimulates another application scenario to show the effectiveness of the proposed methods.

Results Using Setting 1
To model the sequence of the sample, we randomized the data and used KS on them to select 50 representative samples for every target domain. Both recognition accuracies and the processing time were recorded each time when a new sample arrived.
Note that DAELM-S and DAELM-T do not work in online learning scenarios directly. In order to show the improvements of online learning versions, we used DAELM-S and DAELM-T to update the model and recorded the classification errors and processing time whenever a new sample arrived. As shown in Figure 7, the errors of four methods decrease as the number of samples increases. Meanwhile, the offline learning versions (DAELM-S and DAELM-T) have slightly better performances than their online learning versions (ODAELM-S and ODAELM-T), respectively. It is due to the computation errors accumulated whenever the updating happens. However, the difference between DAELM-S and ODAELM-S is minor and the same observation can be obtained between DAELM-T and ODAELM-T. In the meantime, the difference between DAELM-S and DAELM-T is notable. When the number of samples are limited, the error of DAELM-T increases by more than 10% when compared with DAELM-S, and similar results can be seen between ODAELM-T and ODAELM-S. When the number of samples increases, the performances of DAELM-S and DAELM-T draw close to each other and the difference is not large for observation except in the target domain 6 where ODAELM-T has around 10% less accuracy than ODAELM-S is. The same results apply to ODAELM-S and ODAELM-T. To sum up, the online learning versions have approximately equivalent performance compared with their offline learning versions regarding classification accuracy. In addition, ODAELM-S can work with limited samples while ODAELM-T reaches its maximum performance when the number is large.  To quantitatively show their computing complexities, the processing time of DAELM-S, DAELM-T, ODAELM-S and ODAELM-T were recorded and shown in Figure 8. In Figure 8a, the processing time of DAELM-T is remarkably larger than the other three methods. The reason is due to the fact that it updates whenever a new sample arrives regardless of labeling or not. Note that it follows a batch learning mechanism, which means that the computing complexity increases as the size of data increases. DAELM-S is also a batch learning method. The difference is that the updates of DAELM-S happen when a sample in the target domain is labeled. Therefore, much fewer updates happen compared with DAELM-T in general. In the experiment, there are 50 updates in total for DAELM-S. In fact, when the samples in the target domain are limited, such as target domains 7 and 8, the performance between DAELM-S and DAELM-T is almost indifferent. However, unlike DAELM-T whose computing complexity is closely tied to the size of target domain, the processing time of DAELM-S owes largely to the source domain, which is used to train the classifier. For example, in target domain 7, when the source domain has over 3000 samples, the processing time is almost two to three times that of target domain 8, which only uses less than 1000 samples as the source domain for training. As for ODAELM-S and ODAELM-T, it can be noted that both of the methods are, in general, less time-consuming than their batch learning versions. In some batches, the differences between ODAELM-S and DAELM-S are not large because the updates only happens 50 times. Nevertheless, for those target domain whose source domain is large, for example target domain 7, the difference between ODAELM-S and DAELM-S is obvious. As for ODAELM-T, it saves an enormous amount of time due to the online learning mechanism. Specifically, both ODAELM-T and DAELM-T have updates when a new sample arrives. However, ODAELM-T maintained the time for updates at a certain level regardless of the size of target domain while the updates of DAELM-T increase as the size of the received samples grows.
To better show the performance of the proposed methods, the average accuracies of the aforementioned methods and other commonly used classification algorithms are listed in Table 2 in which the letter D in the header represents the target domain. In the table, ELM, SVM and RF all include current labeled samples in the training process, i.e., the training set includes both source domain and the labeled sample in the target domain. En-ELM and En-SVM use the previous batches to train sub-classifiers and combine them together using their training accuracies as the weight. For example, in the target domain 3 (batch 4), batches 1 to 3 are used to train three sub-classifiers separately. As shown in the table, domain adaptation based methods are more accurate for all target domains in general. ELM, SVM and RF have over 90% and only DAELM-S and ODAELM-S exceed them by around 1%. However, ELM and SVM have relatively low accuracies in other target domains-for example in D8. RF has the second average accuracy, which is only less than 2% lower that of DAELM-S. However, the training part is even more time-consuming than DAELM-T. For example, in target domains 6 and 8, it took over 6000 s in total. For ensemble-based methods, i.e., En-ELM and En-SVM, their accuracies tend to be better than ELM and SVM when the target domain number is large. This is partly due to the fact that large target domain number means more source domains to generate sub-classifiers. In target domains 4, 5, 8 and 9, the ensemble-based methods outperform ELM and SVM. Between the online learning version, i.e., ODAELM-S and ODAELM-T, the performance of ODAELM-S is almost the same as DAELM-S, which has the highest average accuracy. ODAELM-T is around 8% lower that of DAELM-S and ODAELM-S in this setting. Considering the fact that, in setting 1, 50 representative samples are scattered in the arrival sequence, the reason for low accuracy may be the lack of enough labeled samples for DAELM-T and ODAELM-T.   As shown in Table 2, all methods experienced a large decrease of accuracy in D 9, given the fact that the data in all target domains except the 9th are collected in continuous months. For example, the data in the 1st target domain are collected from the 3rd to 10th months while the ones in the 2nd target domain are gathered from 11th to 13th. However, between target domains 8 and 9, there is a five-month vacuum. The limited number of samples in source domain may be another factor affecting the performance. However, given the classification accuracy in Figure 7, target domain 5, which has limited source domain samples, does not have such degradation in accuracy as target domain 9 does. Therefore, the number of samples in the target domain is not the reason, which leaves the long delay between target domains 8 and 9 being the main factor. Similar results can also be found in Tables 3-6. The time of delay may have caused large distribution changes between target domains 8 and 9, which directly leads to the degradation of accuracies. However, we cannot be certain unless we can retrieve the data of the missing months. Note that, in practice, setting 1 does not exist since we cannot select the representative samples based on the distribution of the target domain beforehand. Therefore, to prove the effectiveness of the proposed methods, we use settings 2 and 3 to demonstrate possible labeling strategies in application scenarios.  In setting 2, we choose the time of labeling and the number of labels based on the numbers of the current arrival and selected samples so far. Figure 9 shows the classification errors of nine target domains. As shown in the figure, all domain adaptation based methods quickly reach minimum errors for all target domains except for target domains 7 and 9. In target domain 7, both DAELM-T and ODAELM-T experience high classification errors before 1/3 of the samples arrives. This is partly due to the fact the number of the samples in this domain is less than that of other batches. Similar to the results in setting 1, ODAELM-S tends to work under limited labeled samples. In some circumstances, such as Figure 9h, patterns in the labeled set may contradict that in the source domain, which causes the error to increase between 150 to 400. Meanwhile, ODAELM-T does not have the problem and its errors keep decreasing as the number of labeled samples increases. Figure 10 shows the corresponding numbers of labeled samples in nine target domains. As shown in the figure, the numbers of labeled samples for both offline and online learning methods are maintained in less than 65 except target domains 7 and 9 in which the total numbers of samples are over 3000. The reason is that the classification accuracy does not quickly reach the threshold. Hence, although the strategy ensures that the chance of labeling decreases as the number of labeled samples increases, the chance of labeling still exists and the value increases when new samples keep arriving and no labeling happens. However, even in these two domains, the labeled samples is less than 1/20 of the total samples.     Similar to setting 1, we further compared the performance of nine methods. Table 3 shows the average classification accuracies. Note that, unlike that in Table 2, RF in Table 3 is a batch learning that is trained like En-ELM and En-SVM. Even though the processing time can be omitted since RF is trained on a source domain, it cannot reach as high an accuracy as in setting 1. This is due to the drift phenomenon between domains. ODAELM-S and ODAELM-T in setting 2 have approximately the same performance on average. However, they are 3 to 4% less accurate than their batch learning versions. This is due to the accumulated computing errors over time. For specific batches, such as D2 and D3, the difference is around 0.2% to 0.8%. Note that, in setting 2, both DAELM-S and DAELM-T label more than 50 samples, which may cause even more time for updates. Considering the processing time saved by ODAELM-S and ODAELM-T, the drops in accuracies for the proposed online learning methods are still acceptable.
In Table 4, we further recorded the final classification error. Note that both ELM and SVM have large increments on average. This indicates that, although ELM and SVM reach their maximum classification accuracy, which is around 83%, the performance of the two methods over time is inferior to domain adaptation based methods. It can also be noted that the difference between offline and online learning versions of domain adaptation based methods are reduced to 2% to 3%. It indicates that the final classification accuracies between offline and online learning methods are close to each other when compared with the results in Table 3.

Results Using Setting 3
In setting 3, the online sampling strategy uses the testing accuracies of the recognition model to help tune the selection of samples. To be more specific, the strategy tends to reduce the number of selected samples when the accuracy is maintained at a certain level. Figure 11 shows the performances of DAELM-S, DAELM-T, ODAELM-S and ODAELM-T in nine target domains. Similar to Figure 9, the accuracies of all four methods increase as the number of arrival sample increases. For most of the target domains, the four methods acted in a similar way as in setting 2, and reached their minimum error after around 1/3 of the samples arrived, which indicates that the two labeling strategies are both suitable for online learning. Compared with Figure 9, the accuracies of the proposed methods are maintained, which indicates that the online sampling strategy in setting 3 ensures the performance of ODAELM-S and ODAELM-T.       To compare the number of selected samples in setting 3, we recorded the value in each update. Figure 12 shows the numbers of labeled samples. It can be noted that, compared with setting 2, all four methods use less samples in each target domain in general. The final numbers decrease by more than 10. In practice, the labeling costs not only time but also amount of human effort. With the testing error feedback, although the exact labels of the samples are still unknown, the four domain adaptation methods can save time by using less labeled sets. Between ODAELM-S and ODAELM-T, it can be seen that the former works with fewer selected samples while the other requires over 30 selected samples in general. If human resources for labeling the samples are limited, it is reasonable to choose ODAELM-S to work in practice. Nevertheless, there is still a chance that the increase of the labeled samples may decrease the accuracies of ODAELM-S, such as in Figure 11h. Therefore, ODAELM-T may be more suitable when there is a large number of samples labeled. Table 5 shows the average classification accuracies in nine batches. Similar to setting 2, we compared the proposed methods with seven other algorithms. The ensemble-based methods and RF share the same procedure as in setting 2. Therefore, the results are the same. It can be noted that, in this setting, online learning versions still performed well and the classification accuracies were around 3% lower than their batch learning versions, which is the same as in setting 2. However, the differences are also acceptable considering the fact that online learning versions is less time-consuming.     The final accuracies shown in Table 6 also confirm the results in Table 5. Even though for some batches such as target domain 8 where DAELM-T labeled less samples than ODAELM-T, the difference value is not large, e.g., in the target domain 8 ODAELM-T labeled only eight more samples than DAELM-T. Even for target domain 5 where ODAELM-T labeled around 20 more samples, ODAELM-T is still more time-saving considering that the computing complexity of DAELM-T increases drastically as the size of the arrival sample increases.

Conclusions
In this paper, we proposed ODAELM-S and ODAELM-T for online sensor drift compensation in E-Nose systems. The proposed methods can update the model as new samples arrive, which is more time-saving compared with their batch learning versions. Meanwhile, we proposed two online labeling strategies to couple with the proposed methods.
Experiments on sensor drift dataset of six diverse compounds from 36 months demonstrate the effectiveness of the proposed methods regarding both classification accuracy and processing time. The results show that, under the same sampling and arrival sequences of the target domain, the proposed methods save more time than their batch learning versions do without losing the classification accuracy. In the meantime, the results under two online sampling strategies confirm the effectiveness of the proposed methods, which outperform the other classification algorithms. Between the two proposed methods, their capacities of identifying diverse gases draw close to each other eventually. However, ODAELM-S is more suitable to apply when the target domain is small and limit samples are labeled. ODAELM-T achieves its maximum capacity when the number of labeled samples is large, and outperforms ODAELM-S in specific target domains. In general, ODAELM-S is more feasible when the labeled samples are limited, while ODAELM-T can be used to replace ODAELM-S for better accuracies when the number of sample increases.
The online sampling strategies including the formula to calculate the probabilities of selecting and labeling samples in the target domain are the only two cases used in the paper. More sophisticated and accurate sampling models may be considered to improve the selection of representative samples. Meanwhile, human labor is a key factor in semi-supervised methods and the selection of representative samples may also be constrained by the factor, which is not included in the discussion of the paper. Future works may be extended to improve the sampling strategies under more restricted scenarios and parallel computing may be included to further reduce the processing time. the paper. Nan Wang was responsible for providing the experimental environment and conducting experiments. Weina Niu was responsible for gathering the materials and preparing the dataset.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are frequently used in this manuscript: (A8) For the case of A k+1 = A k + C Tu P −1 H T h T hH T T where h is an arbitrary row of H Tu , we can combine Equations (A7) and (A8) to write d T k+1 A −1 k c k+1 as Equation (A9), where c k+1 = C Tu P −1 H T h T ∈ R n 1 * 1 and d T k+1 = hH T T ∈ R 1 * n 1 . Note that h is an arbitrary row of H Tu . Similarly, we have d T k+1 A −1 k c k+1 ≥ 0 stands: By using mathematical induction, we can prove that d T k A −1 k−1 c k ≥ 0 stands for any k. Subsequently, A k has inverse for any k. To sum up, Lemma A1 has been proved.