Instrumental Odour Monitoring System Classification Performance Optimization by Analysis of Different Pattern-Recognition and Feature Extraction Techniques

Instrumental odour monitoring systems (IOMS) are intelligent electronic sensing tools whose primary application is the generation of odour metrics that indicate odour as perceived by human observers. The quality of the odour sensor signal, the mathematical treatment of the acquired data, and the validation of the correlation of the odour metric are key topics to control in order to ensure a robust and reliable measurement. The research presents and discusses the use of different pattern-recognition and feature extraction techniques in the elaboration and effectiveness of the odour classification monitoring model (OCMM). The effect of the rise, intermediate, and peak periods of the original response curve, in combination with Linear Discriminant Analysis (LDA) and Artificial Neural Networks (ANN) as pattern-recognition algorithms, was investigated. Laboratory analyses were performed with real odour samples collected in a complex industrial plant, using an advanced smart IOMS. The results demonstrate the influence of the choice of method on the quality of the OCMM produced. The peak period in combination with the ANN proved to be the best combination on the basis of high classification rates. The paper provides information to develop a solution to optimize the performance of IOMS.


Introduction
Instrumental odour monitoring systems (IOMS) are devices that function as an artificial paradigm of the olfactory system to detect environmental odours. Their general architecture is composed of a sampling system and a detection unit, in which the array of gas sensors and the signal processing system are located [1][2][3][4]. A wide range of gas sensor technologies is currently available [1,3]. In 2015, within the framework of CEN/TC 264-Air Quality, a new working group (WG41) was started with the aim of proposing a new European standard for IOMS environmental odour monitoring applications [5]. IOMS have gained a great deal of popularity and applicability over the last few years in the field of air quality and, in particular, for the monitoring of odours, due to the annoyance and impact induced by the growing number of emissions into the environment from industrial activities [5,6]. Furthermore, IOMS possess numerous advantages over sensorial techniques (e.g., dynamic olfactometry) and analytical instruments (e.g., gas chromatography-mass spectrometry, colorimetric methods, catalytic, infrared and electrochemical sensors, photoionization detectors, differential optical absorption spectroscopy) because they are applicable for in situ and real-time measurements [5][6][7][8].
Extraction from curve fitting parameters (polynomial, exponential, and fractional functions) [19,26,27]: the discrete data are approximated using analytical expressions; being nonlinear in nature, the fitting process is complicated and long. Polynomial: $y = A_0 + A_1 x + A_2 x^2 + A_3 x^3 + \ldots + A_n x^n$.
The above-mentioned techniques have been used in recent years, while new methods are starting to be recognized, such as phase space (PS), dynamic moments (DM), parallel factor analysis (PARAFAC), energy vector (EV), power density spectrum (PSD), surface electromyography (sEMG), windowed time slicing (WTS), etc. [20,21].
In practical applications, extraction from the original response curves is one of the most used techniques, due to its intuitive nature and fast calculations [19,20]. Selecting the useful data can improve the discrimination function and exclude values that can cause noise and uncertainty in the measurement. Moreover, to maximize the potential of IOMS, the extracted data and the pattern-recognition technique must work together. Pattern-recognition techniques are mathematical models (i.e., statistical and biological) that are used to establish a relationship between the input variables (independent variables) and the target output (dependent variable) in the dataset. The mathematical treatment of the correlation of the odour metric with human odour perception is particularly important in odour monitoring applications, due to the large number of odorants that cause the odour [13,34]. Table 2 reports an overview of the principal pattern-recognition techniques applied to IOMS technologies.
The research presents and discusses the influence of the application of different extracted signals and pattern-recognition methods on the elaboration of the environmental odour classification monitoring model (OCMM) with IOMS. The paper aims to optimize the performance and robustness of an IOMS. The piecemeal signals (i.e., rise, intermediate, and peak state) obtained from the original response curves, in combination with Linear Discriminant Analysis (LDA) or Artificial Neural Networks (ANN) as pattern-recognition techniques, are investigated and discussed. Laboratory experimental analyses with real samples were carried out to analyze and compare the results.

Experimental Setup
Research studies were carried out by collecting real odour samples at a complex industrial petrochemical plant. Two odour classes were sampled directly at the emission of two different sources in floating roof storage tanks in accordance with the EN 13725 (2003), by using a static lung effect device.
Ten samples of Class A ("petrol", CAS 86290-81-5) and 13 samples of Class B ("diesel", CAS 68476-34-6) were collected, with a weekly frequency, in nalophan bags of 7-L volume. Moreover, 28 ambient air samples (Class C) were collected in the field surrounding the plant, to distinguish between the odours from the investigated sources and the ambient air (no annoyance). A total of 51 samples from three odour classes were used for the research.

IOMS Technology and Data Acquisition
The seedOA IOMS technology, developed by the Sanitary Environmental Engineering Division (SEED) of the University of Salerno, Italy, was used for the experiments. The functional architecture of the seedOA consists of a sampling system, a detection unit, a signal processing system, and a control and management system [41][42][43][44]. The sampling system contains a specific unit that standardizes the temperature and humidity conditions of the analyzed gaseous sample. The air from the sample is drawn by a pump located downstream of the measuring chamber at a constant flow rate of 300 mL min⁻¹. The detection system is composed of the code ® measurement chamber [44], which contains a total of 16 sensors distributed on two different levels. For this research, thirteen of the installed sensors are of the metal-oxide semiconductor type (MOS, Figaro) and were adopted for the measurement (Table 3), while the other three sensors are used to monitor the environment and process parameters (temperature, humidity and flow rate). All the collected gas samples were individually acquired by the seedOA IOMS adopting an odour-odourless air cycle [13,34]. An acquisition time and a recovery time of 2 min each were set for each sample, with a data detection time step of 2 s. A total of 60 data points were recorded for each sample. The seedOA IOMS measured the resistance of the sensors by means of a voltage divider. Odourless air was used to recover the base resistance of the sensors before each subsequent measurement.
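The acquisition settings above (2 min at a 2 s step) give the 60 data points per sample. The sketch below checks that arithmetic and then splits a response curve into three contiguous parts. The equal-thirds split and the names `points_per_sample` and `split_response` are assumptions made only for illustration: the text does not state where the rise, intermediate, and peak periods begin or end.

```python
ACQUISITION_S = 2 * 60  # acquisition time stated in the text: 2 min
STEP_S = 2              # detection time step stated in the text: 2 s


def points_per_sample(acquisition_s=ACQUISITION_S, step_s=STEP_S):
    """Number of data points recorded during one acquisition."""
    return acquisition_s // step_s


def split_response(curve, n_segments=3):
    """Split a response curve into contiguous segments of equal length.

    ASSUMPTION: equal-length segments are a placeholder; the actual
    boundaries of the rise, intermediate, and peak periods are not
    given in the text.
    """
    size = len(curve) // n_segments
    segments = [curve[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(curve[(n_segments - 1) * size:])  # last one takes any remainder
    return segments


# Placeholder curve: one value per recorded data point.
curve = list(range(points_per_sample()))
rise, intermediate, peak = split_response(curve)
```

With real data, `curve` would hold the fractional resistance change of one sensor, and the segment boundaries would be replaced by the ones actually used for the piecemeal signals.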

Data Reduction
The signal responses provided by the sensors are pre-elaborated and expressed as the fractional change in resistance, $R_S = (R - R_O)/R_O$, where $R$ is the resistance value after the reaction with a gaseous compound and $R_O$ is the default resistance value, with resistances registered in kΩ [20,21,45]. For the MOS sensors, the resistance is inversely related to the gas concentration, according to a relationship of the type:
$$R = A \cdot C^{-\alpha} \quad (1)$$
where: $R$ is the electrical resistance supplied by the sensor; $A$ is a constant defined by the material (e.g., TiO2, ZnO, SnO2, etc.); $C$ is the concentration of the analyzed gas; and $\alpha$ is the slope (e.g., an experimental quantity of the gas).
Figure 1 reports the general trend of the output signal response provided by the sensors, expressed in terms of electrical resistance (e.g., kΩ) with respect to exposure time (e.g., minutes) and the presence of odour and odourless events. As shown, when the sensor is exposed to odour, its resistance decreases, while, when it is exposed to odourless air, the resistance returns to the initial reference base value.
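The two relationships in this section can be sketched numerically as follows. The fractional change follows the definition given in the text; the concentration model assumes the usual MOS power-law form R = A·C^(−α). The function names and the numerical values are illustrative assumptions, not values from the experiments.

```python
def fractional_change(r, r0):
    """Fractional change in resistance, R_S = (R - R_O) / R_O (dimensionless)."""
    if r0 == 0:
        raise ValueError("baseline resistance must be non-zero")
    return (r - r0) / r0


def mos_resistance(concentration, a, alpha):
    """Sensor resistance under the ASSUMED power law R = A * C**(-alpha)."""
    if concentration <= 0:
        raise ValueError("concentration must be positive")
    return a * concentration ** (-alpha)


# Illustrative values only: baseline 10 kOhm, reading 4 kOhm under odour.
rs = fractional_change(4.0, 10.0)  # negative: resistance drops under odour

# With A = 8 and alpha = 1, doubling the concentration halves the resistance.
r_low = mos_resistance(1.0, a=8.0, alpha=1.0)
r_high = mos_resistance(2.0, a=8.0, alpha=1.0)
```

The negative sign of `rs` matches the behaviour described for Figure 1, where exposure to odour lowers the resistance relative to the odourless baseline.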

Pattern-Recognition Algorithms
Linear discriminant analysis (LDA), as a traditional statistical method, and artificial neural network (ANN), as a biological method, were used to investigate the influence and effect of the application of different categories of pattern recognition algorithms.
Linear discriminant analysis (LDA) adopts linear combinations of variables to distinguish between classes, which results in linear decision boundaries. The method searches for a linear transformation that maximizes class separability in a reduced dimensional space [32,38,46]. LDA is a popular classification technique, commonly used in IOMS technologies for environmental odour monitoring and assessment [34,35]. During LDA training, the coefficients (i.e., k, a, b, ..., α) of a discriminant function (γ) are calculated for each representative group (i.e., λ, β and ω). To predict the category of new data, the input values are substituted for the variables (i.e., $x_1, x_2, \ldots, x_n$) of the equations reported below (Equation (2)) to compute the scores:
$$\gamma_g = k_g + a_g x_1 + b_g x_2 + \ldots + \alpha_g x_n, \qquad g \in \{\lambda, \beta, \omega\} \quad (2)$$
The highest score indicates the group to which those values belong.
Meanwhile, the artificial neural network (ANN) is a biologically inspired paradigm that serves as a mathematical model for simulating complex systems and is considered a black box [47][48][49][50][51][52].
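The LDA prediction step described above (one linear discriminant function per group, with the highest score deciding the class) can be sketched as follows. The coefficients are hypothetical and use only two input variables for brevity; in the study they would come from the training stage and cover the 13 sensor inputs.

```python
def discriminant_score(x, intercept, coeffs):
    """Score of one group: intercept + sum of coefficient * input value."""
    if len(x) != len(coeffs):
        raise ValueError("x and coeffs must have the same length")
    return intercept + sum(c * xi for c, xi in zip(coeffs, x))


def classify(x, functions):
    """Return the group with the highest discriminant score, plus all scores.

    `functions` maps a group name to a pair (intercept, coefficients).
    """
    scores = {g: discriminant_score(x, k, w) for g, (k, w) in functions.items()}
    return max(scores, key=scores.get), scores


# HYPOTHETICAL coefficients, not taken from the paper.
functions = {
    "Class A": (-1.0, [2.0, 0.5]),
    "Class B": (-0.5, [0.5, 2.0]),
    "Class C": (0.0, [0.1, 0.1]),
}
label, scores = classify([1.0, 0.2], functions)
```

For this input the scores are about 1.1 (Class A), 0.4 (Class B), and 0.12 (Class C), so the sample is assigned to Class A.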

A general ANN consists of input neurons, hidden neurons, and output neurons, connected via synapses, each of which carries a specific weight value [49,50]. For the experimental activities, a 3-layer feed-forward neural network was designed. The 13 electrical resistance profiles from the seedOA IOMS were used as input data, while the three investigated odour classes were used as the target output (Figure 3). The ideal number of neurons was identified by trial and error on the basis of high correlation values (R²) and classification rates (%) between the measured and predicted output. During training, the weight values are optimized by a learning algorithm until the loss function is minimized [49,50]. The Bayesian Regularization algorithm was applied to reduce the possibility of over-fitting, since it treats the network weights probabilistically, and non-linearity was introduced by using a tan-sigmoid transfer function [53].
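A forward pass through a 3-layer feed-forward network of this kind can be sketched as below, using 13 inputs, a tan-sigmoid (tanh) hidden layer, and 3 outputs. This is only an illustration under stated assumptions: the weights are random and untrained, the output layer is taken to be linear, the hidden size of 7 follows the "13-7-3" topology reported in the results, and Bayesian Regularization training is not implemented here.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (UNTRAINED, for illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, hidden, output):
    """Forward pass: tanh hidden layer, then an ASSUMED linear output layer."""
    w1, b1 = hidden
    w2, b2 = output
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(w2, b2)]
    return h, y


rng = random.Random(0)            # fixed seed so the sketch is repeatable
hidden = init_layer(13, 7, rng)   # 13 sensor inputs -> 7 hidden neurons
output = init_layer(7, 3, rng)    # 7 hidden neurons -> 3 odour classes

x = [0.1] * 13                    # placeholder fractional resistance changes
h, y = forward(x, hidden, output)
predicted = max(range(3), key=lambda i: y[i])  # index of the largest output
```

In a trained network the weights and biases would be those found by the learning algorithm, and `predicted` would index the odour class with the strongest output.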

Training and Validation datasets
The overall acquired dataset, expressed as the fractional change in resistance for each of the 13 odour measurement sensors at a given time of the overall sample acquisition (one data point every 2 s), was divided into a "training" and a "validation" dataset. The training dataset was used to determine the coefficients of the two investigated mathematical models, which were subsequently considered in the validation stage. The validation dataset, consisting of six separate sets of samples and applied according to the "leave-one-group-out" method [24], was adopted as test samples to verify the model accuracies.
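The "leave-one-group-out" scheme mentioned above holds out one group of samples at a time and trains on the rest. A minimal sketch follows; the assignment of samples to six groups is hypothetical, since the text does not list which samples form each set.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held]
        test = [i for i, g in enumerate(groups) if g == held]
        yield held, train, test


# HYPOTHETICAL assignment of 12 samples to six groups (0..5).
groups = [i % 6 for i in range(12)]
splits = list(leave_one_group_out(groups))
```

Each of the six splits keeps every sample of the held-out group out of training, so the accuracy measured on that group reflects data the model has not seen.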
For the LDA training, the datasets were organized and labeled according to the three investigated odour classes (G1 = Class A, G2 = Class B, and G3 = Class C) (Table 4). For the ANN training, supervised learning was adopted. Binary codes, "1" and "0", were assigned in the output to group the data, where "1" indicates membership of the group, while "0" indicates no membership (Table 5). To assess the reliability of the trained models, a validation test was conducted, using the data excluded from the training dataset and treating the source as known for the comparison test (Table 6). The accuracy rates are, therefore, defined by comparing the known values with the predicted ones.
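The binary "1"/"0" target coding described above for the ANN can be sketched as a one-hot encoding, with the predicted class recovered from the largest output. The column order (Class A, Class B, Class C) and the function names are assumptions made for illustration.

```python
CLASSES = ["Class A", "Class B", "Class C"]  # ASSUMED column order


def one_hot(label, classes=CLASSES):
    """Target vector: 1 for the group the sample belongs to, 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown class: {label}")
    return [1 if c == label else 0 for c in classes]


def decode(outputs, classes=CLASSES):
    """Map a vector of network outputs back to the class with the largest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return classes[best]


target_b = one_hot("Class B")
predicted = decode([0.1, 0.2, 0.9])
```

Here `target_b` is `[0, 1, 0]`, and the example output vector decodes to Class C because its third value is the largest.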

Statistical Analysis
Excel 2010 software (Microsoft, Redmond, WA, USA) was used for the preprocessing data extraction. Meanwhile, Statistica 10 (StatSoft, Tulsa, OK, USA) and MATLAB R2017a (MathWorks, Natick, MA, USA) were used as the computational software for the elaboration of the LDA and ANN pattern-recognition algorithms, respectively.
For the LDA applications, a scatterplot diagram and a confusion matrix were used to analyze the behavior of the detected data points per investigated odour class and to evaluate the performance of the classification algorithm.
For the ANN methodology, the coefficient of determination (R²) and the mean square error (MSE) were calculated to investigate the relation between the predicted and measured data, with the MSE tracked against the number of epochs (i.e., the number of times all of the training vectors are used to update the weights).
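The two ANN metrics named above can be computed as in the following sketch. It assumes the common definition R² = 1 − SS_res/SS_tot; the software used in the study may compute the correlation differently, and the example values are illustrative only.

```python
def mse(measured, predicted):
    """Mean square error between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)


def r_squared(measured, predicted):
    """Coefficient of determination, ASSUMING R^2 = 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values have zero variance")
    return 1.0 - ss_res / ss_tot


# Illustrative binary targets and network outputs (not experimental data).
measured = [1.0, 0.0, 0.0, 1.0]
predicted = [0.9, 0.1, 0.0, 1.0]
error = mse(measured, predicted)
r2 = r_squared(measured, predicted)
```

For these values the MSE is about 0.005 and R² about 0.98; tracking the MSE after each epoch gives a training curve of the kind discussed for the ANN results.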

Comparison Studies
Comparative analyses of the different Odour Classification Monitoring Models (OCMMs), elaborated by using the different feature extraction techniques and pattern-recognition methods, were performed by calculating the classification accuracy rate during the training and validation tests, per investigated odour class ($\alpha_i$; i = class) and for all the detected data (ϕ):
$$\alpha_i\% = \frac{\text{correctly classified data of class } i}{\text{total data of class } i} \times 100, \qquad \phi\% = \frac{1}{\eta}\sum_{i=1}^{\eta} \alpha_i\%$$
where α% is the individual accuracy rate per class, and ϕ% is the overall accuracy rate (i.e., the summation of the individual accuracy rates (α%) divided by the total number of classes (η)). A total of eight (8) OCMMs were elaborated and compared (Table 7).

OCMMs Using Different Extracted Signals and LDA Application
Table 8 summarizes the classification accuracy rates obtained by applying the LDA model to the training dataset. Each column shows the classification rate per investigated piecemeal signal. The values of the Wilks' lambda are also highlighted to analyze the degree of the discriminatory power of the model. For all the investigated features, despite the adjustment made to some parameters during the training (i.e., the tolerance value), the Wilks' lambda values are close to 0, thus demonstrating generally good discrimination properties for all three classes.
The results clearly highlight an influence in the classification accuracy rate determination, in relation to the choice of the piecemeal signal. Considering the analysis per investigated odour class, a maximum variation of 21.61% of the classification accuracy rate was detected for Class A by using, respectively, the "peak" or the "rise" signal, whereas a minimum variation of 0.23% was observed for the "Ambient Air" odours class (Class C).
When the analysis is performed with all the detected data (ϕ), the variation in discrimination of the investigated samples obtained by adopting different extracted signals is equal to 4.49%. Figure 4 shows the scatter plots produced from the linear discriminant analysis (LDA) of the "training" dataset, with all the detected data, showing the distinction among Class A, Class B, and Class C obtained by using (a) the complete response curve data and the extracted data for (b) the rise, (c) the intermediate, and (d) the peak periods.
The results also graphically confirm that the peak analysis (Figure 4d) shows better cluster formation of the classes, and that Class C (ambient air) is the most recognizable class among the investigated classes. Discrimination is more difficult, in particular, among some elements of Classes A and B for all of the investigated piecemeal signal features. The cause may be related to the relatively small magnitude and difference of the resistance values detected by the stimulated sensors and, probably, to the similar composition, in terms of predominant odorous substances and/or odour concentration, of the investigated samples.
Table 9 summarizes the classification metrics during the LDA validation test, determined by using the discriminant function equations developed in the training phase. Excellent classification accuracy rates are highlighted only for the Class B samples, while a much lower recognition percentage was detected for the Class A samples. No Class C samples were correctly identified. Once again, the analyses show a better response when the peak or the intermediate data are used.
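The accuracy rates reported in these tables can be computed as in the sketch below. The overall rate ϕ is the mean of the per-class rates α over the η classes, as defined in the Comparison Studies section; taking α as the share of correctly classified data within each class is an assumption, and the labels are made up for illustration.

```python
def class_accuracy(true_labels, predicted_labels, cls):
    """Per-class accuracy rate alpha (%): correct / total for one class."""
    indices = [i for i, t in enumerate(true_labels) if t == cls]
    if not indices:
        raise ValueError(f"no samples of class {cls}")
    correct = sum(1 for i in indices if predicted_labels[i] == cls)
    return 100.0 * correct / len(indices)


def overall_accuracy(true_labels, predicted_labels, classes):
    """Overall accuracy rate phi (%): mean of the per-class rates."""
    rates = [class_accuracy(true_labels, predicted_labels, c) for c in classes]
    return sum(rates) / len(rates)


# Made-up labels: Classes A and B fully recognized, Class C never recognized.
true_labels = ["A", "A", "B", "B", "C", "C"]
predicted_labels = ["A", "A", "B", "B", "A", "B"]
alpha = {c: class_accuracy(true_labels, predicted_labels, c) for c in "ABC"}
phi = overall_accuracy(true_labels, predicted_labels, "ABC")
```

Because ϕ averages the per-class rates rather than pooling all samples, one class that is never recognized caps the overall rate at about 66.67% when the other two are perfect.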

OCMMs Using Different Extracted Signals and ANN Application
The MATLAB environment has a default setting that automatically partitions the input dataset into 70-30% (i.e., train-test sets) during training. The purpose of this configuration is to reduce the possibility of over-fitting. Each piecemeal signal was tested with different numbers of neurons in the hidden layer. The ideal ANN topology was found to be "13-7-3". Table 10 summarizes the coefficient of determination (R²) obtained by applying the ANN during the training stage. Considering the overall R² to assess the ANN accuracy, the results show that all the correlations (R²) were >0.998 for all the subsets of the extracted signals. This indicates that the ANN was able to capture the interactions in the dataset. Figure 5 gives the graphical representation of the results summarized in Table 10, to evaluate the R² trend through the mean square error (MSE) versus the number of epochs, by using the different sets of extracted data.
The ANN was able to map good patterns, especially when the data in the rise and peak parts were used, on the basis of a small MSE at a low number of epochs. The best training performance was found to be 1.00 × 10⁻⁹ at epoch 51 for the complete response curve data (Figure 5a) and 2.60 × 10⁻⁹ at epoch 83 for the rise part data (Figure 5b). Table 11 summarizes the classification metrics during the ANN validation test, determined using the values of the weights and biases, encoded as coefficients to satisfy the "13-7-3" topology generated during the training. The results show that the ANN misclassified the Class C (ambient air) data. However, a perfect classification (100%) was achieved for Class A and Class B using the intermediate and peak data points. This might be because the gas sensors are more sensitive to the molecules of Class A and Class B, for which an observable reaction is recognized when compared to Class C. The highest overall recorded accuracy was 66.67%, for the intermediate and peak data points.

Comparison Studies
Table 12 presents and compares the classification accuracy rates obtained in the training and validation stages through the application of the LDA and ANN, along with the different extracted signal points, by performing the analysis with all the detected data (ϕ). The peak steady part is confirmed as the piecemeal signal that provides the highest discriminatory value for all the investigated cases and contains the most useful information for both pattern-recognition techniques. Although the complete response signals contain the complete information, this condition appears to slow down the performance of the pattern-recognition algorithm. A good classification accuracy was also obtained by using the intermediate period of the overall acquired data.
In the LDA classification, the technique was able to discriminate the groups with a good rate (>89.71%); however, when applied to unknown data during validation, the model could not classify them at a rate higher than 50%. This phenomenon might be due to the reliance of the technique on normally distributed data, a condition that some variables do not satisfy. Meanwhile, with the ANN technique, the results are relatively higher than with LDA. The model reached a high learning condition, which is shown by the classification rates for all the piecemeal signals and, principally, by those obtained with the intermediate and peak signals during the validation stage. The ANN demonstrates a better pattern-recognition potential than the LDA for almost all the experiments carried out (e.g., +8.22% and even +12.71% during the training phase, considering, respectively, the peak or rise periods). The cause may be related to the higher ability of the ANN technique to deal with the noise in the dataset. This characteristic is an asset of the ANN, because gas movements are dynamic. Only during the validation phase with the rise signal does the LDA show a better classification accuracy.

Conclusions
The analysis of the adoption of different fragmented signals from the overall acquired data, and of their responses with different pattern-recognition algorithms such as LDA and ANN in the elaboration of OCMMs with IOMS, highlights their influence on the final classification accuracy. For the investigated analyses, during the LDA training, the intermediate and peak periods had the highest discrimination rates. On the other hand, during the ANN training, all the fragmented signals performed well in terms of a high R², low MSE, and high classification metrics. ANN proves to have a higher learning capability than LDA, while, during the test set validation of the two models, the intermediate and peak parts confirm the highest accuracy, and ANN outperforms LDA in almost all the investigated cases.
The selection of the feature extraction technique can optimize the IOMS performance by capturing the most important signals, which relieves a system burdened by a large dataset and by memory storage requirements. In this way, the redundant signals that may contribute to the uncertainty in the measurement can be eliminated, which increases the robustness of the odour monitoring model. Furthermore, the selection of the most appropriate pattern-recognition technique can improve the overall algorithm of the IOMS, as shown by the odour classification metrics.
In LDA, no matter how the parameters were adjusted, such as by lowering the tolerance value, the Wilks' lambda remains steady, unlike in ANN, where more configurations are still available to explore. Based also on the signal response, the intermediate and peak periods carried the most useful information that can be applied in odour monitoring.
The research can serve as a guideline for further research on selecting the proper combination of extracted signals and pattern-recognition algorithm. The paper provides useful information for the selection of the most appropriate mathematical data treatment techniques in environmental odour monitoring with IOMS, and promotes the development of more flexible systems that minimize redundancy and increase the overall quality and reliability of the system.