Environmental Odour Quantification by IOMS: Parametric vs. Non-Parametric Prediction Techniques

Odour emissions are a global issue that needs to be controlled to prevent negative impacts. Instrumental odour monitoring systems (IOMS) are an intelligent technology that can be applied to continuously assess annoyance and thus avoid complaints. However, their accuracy in interpreting sensor information, especially in the implementation of the mathematical model, is still being researched, particularly in environmental odour monitoring applications. This research presents and discusses the implementation of traditional and innovative parametric and non-parametric prediction techniques for the elaboration of an effective odour quantification monitoring model (OQMM), with the aim of optimizing the accuracy of the measurements. Artificial neural network (ANN), multivariate adaptive regression splines (MARSpline), partial least square (PLS), multiple linear regression (MLR) and response surface regression (RSR) are implemented and compared for the prediction of odour concentrations using an advanced IOMS. Experimental analyses are carried out using real environmental odour samples collected from a municipal solid waste treatment plant. Results highlight the strengths and weaknesses of the analysed models and their accuracy in terms of environmental odour concentration prediction. The ANN application yields the most accurate results among the investigated techniques. This paper provides useful information for selecting the appropriate computational tool to process the signals from sensors, in order to improve the reliability and stability of the measurements and create a robust prediction model.


Introduction
Odour emissions from industrial sources in ambient air can be a significant problem for the exposed populations because they create an unpleasant environment and can cause physical and psychological disorders [1-4], which lead to public nuisance and complaints [5-8]. The presence of unpleasant odours is also associated with the perception of a health risk [9,10]. Therefore, the control of odour emissions, starting from their characterisation, is a key issue in order to minimize their presence in ambient air and thus guarantee a suitable environmental quality [11-14]. Nowadays, environmental odour characterisation is conducted by analytical, sensorial and combined analytical-sensorial methods [4,15,16]. Analytical techniques mainly employ laboratory equipment (e.g., GC-MS, colorimetric methods, catalytic, infrared and electrochemical sensors, differential optical absorption spectroscopy, fluorescence spectrometry) [1,4]. These techniques are able to detect single or multiple gaseous compounds that can be presumed to be odour tracers, and quantify them in terms of ppm or mg m⁻³. Sensorial techniques, on the other hand, are methods in which odours are identified using the human nose [4,17]. Among the sensorial methods, Dynamic Olfactometry (DO), regulated by EN13725:2003 (currently under review by the WG2 of the CEN/TC264 "Air quality"), is the most widely used in Europe. DO measures the odour concentration in terms of European Odour Units per cubic metre (OU_E m⁻³) [18,19]. Meanwhile, combined methods (e.g., sensorial-analytical) are those with the highest potential for future development in the field of environmental odour measurement [15,20-24], especially with the application of instrumental odour monitoring systems (IOMS) [15,25-27].
According to the definition given by the CEN/TC264/WG41 [15,28,29], IOMSs are intelligent tools that allow continuous monitoring and provide results in terms of odour classes and predicted odour concentration, with units of measurement that can be correlated with those of dynamic olfactometry [4,29-31]. IOMS belong to a specific category of systems generally known as electronic noses (E.Noses), which have been successfully applied in various fields, such as rapid sensing, real-time characterisation and quick evaluation of product quality (e.g., aroma compounds monitoring, food contamination, etc.) [13,32-34]. Other novel applications in the field of medicine include VOC-biomarker detection and disease diagnosis through exhaled breath analysis [14,35-38]. In IOMS, the compounds detected by gas sensors [39-42] are transduced into electrical signals (e.g., kΩ, volts, etc.) and statistically analysed using mathematical software, in order to elaborate specific Odour Monitoring Models (OMM) for both classification and quantification applications [29,43].
In both applications, the mathematical models can be developed by partitioning the entire collected sampling dataset into disjoint subsets, such as training and preliminary internal validation. Generally, k-fold cross-validation is used, by which the dataset is divided into k parts (commonly k = 10, i.e., 90% training and 10% validation in each fold) [44-46]. In the OMM classification, the observed data are grouped into a set of sub-populations which represent the investigated categories [29,47], and the model is validated by checking that a new set of data falls in the correct group/category. Meanwhile, in the OMM quantification, the odour concentration values measured with dynamic olfactometry are attributed to the values of the various acquired samples, and divided into the different investigated odour classes [18]. For each investigated odour class, a quantification OMM is developed by applying parametric and non-parametric pattern-recognition techniques [15,48,49]. Parametric methods presume that the data behave linearly and/or are normally distributed, so that when plotted they show a symmetrical bell-shaped curve ("Gaussian distribution"), while non-parametric methods do not rely on assumptions about the distribution of the data [29,44,48].
No single, universally accepted selection among the various existing pattern recognition techniques is present in the technical scientific literature for the different fields of application. Szulczynski et al. (2017) [50] and Lokuge et al. (2018) [51] demonstrated the capability of artificial neural network (ANN) and multivariate adaptive regression splines (MARSpline) to implicitly detect a non-linear relationship between dependent and independent variables. These studies revealed that such non-parametric techniques are useful in mapping relationships in large datasets when there is no prior knowledge of the data behaviour. Meanwhile, Ojha et al. (2018) [52] and Dahlen et al. (2000) [53] highlight that partial least square (PLS) is efficient both as a regression tool and as a dimension reduction technique (e.g., compression of the number of input variables). On the other hand, some studies performed feature extraction (i.e., extraction from the original response curve, curve fitting, transform domains, etc.), by which redundant signals can be removed, thus increasing the speed of calculation [29,45,48,54]. Several papers have used traditional parametric statistical models (e.g., multiple linear regression (MLR), partial least square (PLS), principal component analysis (PCA), etc.) because they are convenient to implement, but without considering whether the characteristics of the dataset make these models appropriate or whether other techniques are better suited to the application [25,46,55-59]. These considerations highlight how easily a distorted conclusion can be reached if the wrong technique is used. Many efforts have been devoted to the analysis of specific prediction techniques [52,53,60]; however, after a thorough search and evaluation of the literature, no papers were found that performed actual experiments to investigate and compare different forecasting algorithms with respect to environmental odour concentration measurements.
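The k-fold partitioning mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function name `k_fold_indices` and the fixed `seed` are hypothetical, and the sample count of 92 is taken from the dataset size reported later in the paper.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of samples to folds
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 92 samples split into 10 folds: each sample is used for validation exactly once.
splits = list(k_fold_indices(92, k=10))
for train, validation in splits:
    print(len(train), len(validation))
```

With 92 samples and k = 10, each validation fold holds 9 or 10 samples, and every sample is held out exactly once across the ten folds.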
This research presents and discusses the application of different prediction techniques in an advanced IOMS for environmental odour quantification monitoring, with the aim of investigating and comparing their prediction performance and optimizing the accuracy and reliability of the measurements. Traditional and alternative non-parametric (i.e., artificial neural network (ANN), multivariate adaptive regression splines (MARSpline)) and parametric (i.e., partial least square (PLS), multiple linear regression (MLR) and response surface regression (RSR)) techniques are investigated and discussed. Laboratory experimental analyses are carried out using real environmental odour samples.

Odour Samples
Environmental odorous air samples were collected from the delivery phase of the municipal organic fraction at a real municipal solid waste treatment plant (Salerno, Italy). The "lung technique" was used for the sampling, withdrawing the odorous air into a 15-L Nalophan bag by means of a vacuum pump. A daily sampling campaign was conducted for seven days to represent the weekly trend of the emissions. Four different weeks, one in each season, were investigated in order to consider the seasonal variation over an annual period. A total of 28 on-site samples were thus collected. Each sample representative of the emission was then diluted to depict the environmental conditions at the receptor level. Dilutions were prepared from all real samples collected at the emission source to obtain known target concentrations at varying levels, yielding an additional 62 samples. Ambient air samples, taken as the 0.00 OU_E m⁻³ odour threshold, were also collected in an area around the plant at a distance sufficient to be odourless, and were considered as the baseline reference (i.e., the lowest possible detection limit). A total of 92 samples thus constituted the entire sampling dataset considered.

Dynamic Olfactometry
All collected samples were analysed by dynamic olfactometry, according to EN13725:2003, at the Olfactometric Laboratory of the Sanitary Environmental Engineering Division (SEED) of the University of Salerno, to determine the odour concentrations in terms of OU_E m⁻³. A TO8 olfactometer (ECOMA GMBH-D) was employed. The odour concentrations of the diluted odour samples were calculated analytically from the dilution factor and the odour concentration measured for the corresponding on-site collected samples.
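The analytical calculation for the diluted samples can be illustrated with a short sketch. It assumes that the dilution factor is defined as the ratio of total volume to odorous sample volume, which the text does not state explicitly; the function name and the numerical values are hypothetical.

```python
def diluted_concentration(source_ou, dilution_factor):
    """Odour concentration (OU_E/m3) of a diluted sample.

    Assumes dilution_factor = total volume / odorous sample volume (>= 1).
    """
    if dilution_factor < 1:
        raise ValueError("dilution factor must be at least 1")
    return source_ou / dilution_factor

# Hypothetical example: a 4000 OU_E/m3 emission sample diluted tenfold.
print(diluted_concentration(4000.0, 10))
```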

SeedOA Instrumental Odour Monitoring System
All samples, collected and diluted, were also analysed by an advanced and innovative Instrumental Odour Monitoring System (IOMS), designed by the Sanitary Environmental Engineering Division (SEED) Group of the Department of Civil Engineering of the University of Salerno. The adopted IOMS functional architecture is composed of four principal units: (a) the sampling unit, (b) the detection system, (c) the data storage and processing system and (d) the management unit (Figure 1) [61]. The sampling unit includes a pump for the raw air collection at a flow rate of 300 mL min⁻¹. The detection system incorporates the innovative code (Chamber for Odour Detection) measurement chamber, patented by the SEED research group, which houses 13 metal oxide semiconductor (MOS) measurement sensors and 3 control sensors [62,63]. The code chamber is cylindrical in shape, with one internal diffuser and the sensors arranged on two levels, each hosting 8 sensors, so as to reproduce the human nasal apparatus consisting of two nostrils and a central nasal septum [63]. Other subparts include a temperature and humidity control and thermoregulation system, a CPU board (Raspberry Pi), an I2C hub board and the main board, particulate and activated carbon filters, an ozone cleaning system and permeation tube, pumps and a manifold with internal solenoid valves for managing the different gaseous flows, and flexible pipes connecting the different components. The samples were analysed in an "odour-odourless-odour" cycle, with an acquisition and recovery time of 2 min and a time step of 2 s per value. Following Zarra et al. (2021), the peak points (last 1-min interval) were used for feature extraction, as they proved to be the most stable and representative [29].
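The selection of the peak points can be sketched as below, assuming that each acquisition is stored as a plain list of readings taken every 2 s; the function name `peak_window` and the synthetic trace are hypothetical.

```python
def peak_window(readings, time_step_s=2, window_s=60):
    """Return the readings of the last `window_s` seconds of an acquisition."""
    n_values = window_s // time_step_s  # 60 s / 2 s = 30 values
    if len(readings) < n_values:
        raise ValueError("acquisition shorter than the requested window")
    return readings[-n_values:]

# Synthetic 2-min acquisition: 60 readings at one value every 2 s.
acquisition = [float(i) for i in range(60)]
window = peak_window(acquisition)
print(len(window))
```

With a 2-s time step, the last 1-min interval corresponds to 30 values per sensor and sample.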

Odour Quantification Monitoring Models (OQMM) Development
Two non-parametric (artificial neural network, ANN; multivariate adaptive regression splines, MARSpline) and three parametric (partial least square, PLS; multiple linear regression, MLR; response surface regression, RSR) prediction techniques were investigated and subsequently compared. Each technique has its own unique set of properties. Table 1 presents the indicator used as the index for OQMM elaboration with each investigated technique [18,51,64-71]. Supervised learning techniques were applied to train the OQMMs and assess their individual prediction accuracy. Applying a k-fold cross-validation procedure (k = 10) to the 92 acquired samples, 82 samples were used as the training dataset, while 10 samples were used as the validation dataset. To ensure the reliability of the results, the validation data were selected randomly using Microsoft Excel software. Thirty (30) values from each MOS measurement sensor per sample were recorded in the acquisition time, with a frequency of one value every two seconds. The matrix composed of the values detected by the 13 measurement sensors for each sample was then associated with the respective value of the odour concentration measured by dynamic olfactometry. Table 2 summarizes the experimental dataset for the training and validation phases.
For the ANN experiment, a feed-forward neural network (FFNN) consisting of three layers (input, hidden and output layer) was applied. Tan-sigmoid was used as the neural transfer function, while the network was trained using a Bayesian Regularisation algorithm. The number of neurons in the hidden layer was evaluated in the range from 5 to 15, in order to avoid underfitting and/or overfitting [15]. In an ANN, the input neurons (x) are connected to the neurons of the next layer through corresponding weight (w) values. The hidden layer (ϕ) connects the input variables (x) to the target output variable (y) (Figure 2) [64].
The incoming signal to each neuron was given by the value of each connected neuron multiplied by its weight, summed over all connected neurons (Equation (1)), and then converted using the activation function given in Equations (2) and (3). Meanwhile, the target output variable was computed in the same manner as the neurons in the hidden layer (Equation (4)).
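The forward pass described above can be sketched for a single hidden layer as follows. This is an illustrative sketch under stated assumptions, not the trained network of the study: bias terms are included and the output neuron is taken as a plain weighted sum, neither of which is specified in the text, and all weights are hypothetical. The tan-sigmoid transfer function is mathematically equivalent to the hyperbolic tangent.

```python
import math

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a three-layer feed-forward network.

    hidden_weights: one list of input weights per hidden neuron (assumed layout).
    """
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        # Weighted sum of the incoming signals to this hidden neuron.
        net = sum(w * xi for w, xi in zip(weights, x)) + bias
        # Tan-sigmoid activation (equivalent to tanh).
        hidden.append(math.tanh(net))
    # Output neuron: weighted sum of hidden activations (linear output assumed).
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical network with two inputs and two hidden neurons.
y = forward([0.5, -1.0],
            hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
            hidden_biases=[0.0, 0.0],
            output_weights=[1.0, 1.0],
            output_bias=0.0)
print(y)
```

In the study the input vector would hold the 13 MOS sensor responses and the hidden layer between 5 and 15 neurons; the two-neuron case is used here only to keep the arithmetic easy to follow.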

MARSpline
In the MARSpline experiment, the relationship between dependent and independent variables was established by evaluating the ideal number of basis functions (piecewise linear segments). The "divide and conquer" strategy was applied [72]. The training data (space) were divided into segments, and for each of these a specific regression equation was established with its own gradient (slope). The sum of these individual segments, according to Equation (5), represents the MARSpline model [60]. In the study, each MARSpline model was evaluated starting from 10 basis functions, extracting the coefficients and investigating the most accurate one [51,66].
$$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X) \qquad (5)$$
where: β_0 = intercept parameter; β_m = weight of the m-th basis function h_m(X); M = number of non-constant terms (basis functions) over which the sum is taken.
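A MARSpline prediction as a weighted sum of piecewise linear basis functions can be sketched as below. The sketch assumes the usual hinge form max(0, ±(x − knot)) for the basis functions and a single predictor for readability; the function names, knots and coefficients are hypothetical and are not fitted to the study data.

```python
def hinge(x, knot, direction=+1):
    """Piecewise linear basis function: max(0, direction * (x - knot))."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Intercept plus weighted sum of hinge basis functions.

    terms: list of (coefficient, knot, direction) tuples (assumed layout).
    """
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)

# Two mirrored hinges at the same knot give different slopes on each side of it.
terms = [(2.0, 1.0, +1), (-0.5, 1.0, -1)]
print(mars_predict(3.0, 1.0, terms))  # right of the knot
print(mars_predict(0.0, 1.0, terms))  # left of the knot
```

Each added basis function introduces a new segment with its own slope, which is why the number of basis functions acts as the main tuning parameter.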

MLR
In the MLR experiment, the relationship between the dependent and independent variables was elaborated by examining scatter plots of the multi-dimensional data points in order to assess the collinearity of the variables. The linear equation with the highest correlation (R²) was considered as the optimum MLR model [70]. Equation (6) gives an example of a multiple linear regression model with n predictor variables (x_1, x_2, ..., x_n) and a response y:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + C \qquad (6)$$
where: C = residual term of the model, subject to the distribution assumption; β_0, β_1, β_2, ..., β_n = regression coefficients.
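A least-squares fit of such a linear model can be sketched in plain Python as below. This is an illustration under stated assumptions: the coefficients are obtained by solving the normal equations with Gauss-Jordan elimination, which is one standard approach and not necessarily the routine used by the statistical software in the study; the function name and the data are hypothetical.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (A^T A) b = A^T y.

    Returns [b0, b1, ..., bn], with b0 the intercept.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear predictors)")
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
coefficients = fit_mlr(X, y)
print(coefficients)
```

The explicit check for a singular matrix reflects why collinearity among the predictors matters for MLR: strongly correlated sensor signals make the normal equations ill-conditioned.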

PLS
PLS was implemented as an algorithm that combines principal component analysis (PCA) with Ordinary Least Square (OLS) regression. In this experiment, the ideal number of components was searched for in the range from 5 to 15. PLS reduces the number of input variables by formulating a new set of components that describe the maximum correlation between independent and dependent variables. Its prediction ability relies on a set of orthogonal factors, namely component scores or latent variables [44,53]. After selecting the optimal number of components, the loading values and intercepts were extracted to obtain the prediction equation. Equation (7) gives a PLS regression model with n predictor variables (X_1, X_2, ..., X_n) and a target output variable y:
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + C \qquad (7)$$
where: C = residual term of the model, subject to the distribution assumption; β_0, β_1, β_2, ..., β_n = weights/coefficients.
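How a latent component is turned into regression coefficients can be sketched for the simplest case of a single component and a single response (PLS1). This is a deliberately reduced illustration: the study evaluated 5 to 15 components, whereas the sketch below extracts only one, and the function name and data are hypothetical.

```python
def pls1_one_component(X, y):
    """One-component PLS1 regression; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred inputs
    yc = [yi - y_mean for yi in y]                              # centred response
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0:
        raise ValueError("predictors are uncorrelated with the response")
    w = [v / norm for v in w]
    # Component scores (latent variable) and their regression on y.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    coefficients = [q * wj for wj in w]
    intercept = y_mean - sum(c * m for c, m in zip(coefficients, x_mean))
    return intercept, coefficients

# Hypothetical data with two perfectly collinear predictors and y = x1 + x2.
X = [[1, 1], [2, 2], [3, 3], [4, 4]]
y = [2, 4, 6, 8]
intercept, coefficients = pls1_one_component(X, y)
print(intercept, coefficients)
```

The example uses collinear predictors on purpose: a direct least-squares solution would be singular here, while the compression into one latent component still yields a usable prediction equation.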

RSR
RSR experiments were carried out by investigating the ideal combinations of the independent variables in relation to the target output. In this technique, each of the 13 MOS measurement sensors (as an independent variable) was removed one by one, and the RSR equation for the given number of independent variables was tested. The variable/s with the least influence on the accuracy of the prediction were then determined and eliminated in order to obtain the optimal RSR model. Linear or quadratic polynomial functions were used to define the system, exploring the experimental conditions leading to the optimal situation [67]. The goal was to find a polynomial approximation of the true non-linear model, analogous to a truncated Taylor series expansion. A response surface regression design contains the effects of a polynomial regression design up to the 2nd degree together with the 2-way interactions of the variables; it therefore has the characteristics of both polynomial regression designs and fractional factorial regression designs. Equation (8) reports the regression equation for a quadratic response surface regression design, in which w, x and z are the predictor variables and β_0 ... β_9 are the coefficients.
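The terms of a quadratic response surface design can be enumerated with a short sketch. For three predictors it produces ten terms (constant, three linear, three squared and three two-way interactions), which matches the ten coefficients β_0 ... β_9 mentioned above. The ordering of the terms is an assumption made for illustration and may differ from the ordering in Equation (8); the function name is hypothetical.

```python
from itertools import combinations

def quadratic_design_row(values):
    """Expand predictor values into constant, linear, squared and 2-way terms."""
    linear = list(values)
    squares = [v * v for v in values]
    interactions = [a * b for a, b in combinations(values, 2)]
    return [1.0] + linear + squares + interactions

# Hypothetical predictor values w = 2, x = 3, z = 5.
row = quadratic_design_row([2.0, 3.0, 5.0])
print(row)
```

Once every sample is expanded in this way, the coefficients can be estimated by ordinary least squares, so removing a sensor simply removes its linear, squared and interaction columns.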

Statistical Analysis
Statistica 10 (Statsoft, Tulsa, OK, USA), MATLAB R2017a (MathWorks, Natick, MA, USA) and Excel 2013 software (Microsoft, Redmond, WA, USA) were applied for data analysis. The individual accuracy of the investigated OQMMs was evaluated based on the coefficient of determination (R²) (Equation (9)) and the root mean square error (RMSE) (Equation (10)).
$$R^2 = 1 - \frac{SSR}{SST} \qquad (9)$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{x}_i - x_i\right)^2} \qquad (10)$$
where: $SSR = \sum_{i=1}^{n}(x_i - \hat{x}_i)^2$ is the sum of squared residuals; $SST = \sum_{i=1}^{n}(x_i - \bar{x})^2$ is the total sum of squares; $\hat{x}$ represents the predicted data; $x$ is the actual data; $\bar{x}$ is the mean value of the dataset; n is the total number of observations.
The R² value was used to describe how well the model predictions fit the measured data. The higher the R², the more confidence can be placed in the accuracy of the prediction model. Meanwhile, the RMSE value was determined in order to analyse the standard deviation of the residuals (prediction errors).
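Both metrics can be computed directly from their usual definitions, with R² taken as one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE as the square root of the mean squared residual. The sketch below uses hypothetical function names and hypothetical concentration values.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSR / SST."""
    mean = sum(actual) / len(actual)
    ssr = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    sst = sum((a - mean) ** 2 for a in actual)                  # total sum of squares
    return 1 - ssr / sst

def rmse(actual, predicted):
    """Root mean square error of the predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical measured and predicted odour concentrations (OU_E/m3).
measured = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(r_squared(measured, predicted), rmse(measured, predicted))
```

R² is dimensionless, whereas RMSE is expressed in the same unit as the odour concentration, so the two indicators complement each other when models are compared.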

Comparison Analysis
The matrix of the different investigated OQMMs is shown in Table 3. The accuracy of each OQMM has been evaluated by comparing the R² and RMSE for the training and validation datasets.
Figure 3 highlights the statistical characteristics of all of the investigated real odour samples measured by dynamic olfactometry in terms of odour concentration (OU_E m⁻³), using a box and whisker diagram for the overall monitored period. As shown, the odour concentration of the investigated samples ranges between 81 OU_E m⁻³ (min) and 5793 OU_E m⁻³ (max), covering a wide variability of values representative of both the emission source and the receptor-level scenario. The real odorous samples collected on site are principally located in the 4th quartile, while the diluted samples are in the 1st to 3rd quartiles. The overall mean, variance and standard deviation are equal to 1695 OU_E m⁻³, 9.70 × 10⁵ and 879 OU_E m⁻³, respectively.
Figure 4 shows the scatter plots of the odour concentration measured with DO against the values predicted by the OQMMs elaborated with the different parametric and non-parametric prediction techniques, during the training stage. Meanwhile, Figure 5 depicts the same relationship for the validation dataset. In order to express the certainty of the prediction, a 95% confidence level is applied and shown in all figures. Applying ANN, a strong relationship between input and output, with an R² of 0.97, is obtained for the training dataset (Figure 4a); this is reflected in the small number of outliers outside the confidence interval. Similarly, a high correlation (R²: 0.95) is obtained in the validation stage (Figure 5a). For both stages, around 80% of the data points are located within the confidence interval. Therefore, the results predicted using ANN are in good agreement with the reference values.

Odour Quantification Monitoring Models (OQMMs)
In the MARSpline application, the data are more spread out around the regression line than with ANN. MARSpline is ideal when the predictor variables do not exhibit a monotone relationship to the dependent variable of interest [51,72]. According to the "divide and conquer" strategy, the curse of dimensionality is mitigated by partitioning the input space. The optimum model in the training stage was found at 25 basis functions (R²: 0.83) (Figure 4b). Beyond this point, further increases in the number of basis functions give a negligible increase in the correlation value; thus, the optimum MARSpline model is taken at this limit. A similar spread in the data is also observed during the validation of the model (R²: 0.87) (Figure 5b).
For the OQMM elaboration by PLS application, results show that the ideal number of components is 10. The accuracy in the training stage, in terms of R², is equal to 0.92 (Figure 4c). PLS shows better efficiency when the dataset does not exhibit large variances or noisy information [69]. Furthermore, in the investigated case, a strong correlation at high odour concentrations (>4.00 × 10³ OU_E m⁻³) proves to be difficult to establish. Meanwhile, during the validation stage, PLS accuracy in terms of R² assumes values comparable to those of ANN (R²: 0.92) (Figure 5c). However, on the basis of the confidence interval, the ANN application seems to be more reliable than PLS.
RSR applications were started by testing all of the input/independent variables (i.e., the 13 MOS sensors) to ascertain which variable/s may have the least impact on the output. This step provided prior knowledge of the variables' interactions. Subsequently, one, or at most two, randomly selected variables were removed, and the resulting accuracy was calculated. For the analysed samples, the highest accuracy was obtained by removing one variable (i.e., one sensor), giving an R² of at least 0.90 in the training stage (Figure 4d). Meanwhile, for the validation dataset, a lower correlation (R²: 0.82) was determined, making the technique comparable to MARSpline (Figure 5d).
In the MLR application, the weakest relationship between input and output in terms of R² was found for both the training (R²: 0.72) (Figure 4e) and validation (R²: 0.53) (Figure 5e) datasets. The scattered data points in the figures depict an unstable system. Table 4 summarises and compares the performance of the investigated OQMMs during the training and validation stages, considering the sensors' responses as input/independent variables and the odour concentration as output/dependent variable. In particular, the comparison between the performances was carried out by analysing the relative values, given by the difference between the R² and/or RMSE obtained with the different prediction techniques. In fact, it should be noted that the calculated R² data, in absolute value, could be influenced by the potential reaction between the constituent substances of the samples [69].

Comparative Analysis
Among the parametric prediction techniques, the data reported in Table 4 show that the application of PLS leads to the highest performance. This result is probably due to the ability of PLS to reduce dimensions by transforming and compressing the independent variables into a few components used as new inputs, which also makes the computation faster. With RSR, the obtained performance was lower than that of PLS, especially on the validation dataset. RSR can nevertheless be a tool to analyse the sensitivity of the input variables with respect to their relevance to the target output. MLR, on the other hand, has been shown to be the most sensitive to outliers and does not appear to achieve satisfactory levels of performance.
Among the non-parametric prediction techniques, ANN showed the best performance. ANN shows a stronger capability to map non-linear relationships between the input variables and the target output. ANN is more accurate because MARSpline still assembles linear functions in a piecewise manner. However, designing an ANN architecture is a challenging task, because numerous configuration choices require thorough investigation and knowledge (e.g., number of neurons in the hidden layer, appropriate training algorithm, proper activation function, etc.). Overall, from a prediction accuracy point of view, ANN provides the best results in terms of high R² and low RMSE.
As reported, the results refer to odour concentrations determined using dynamic olfactometry as the reference method, in accordance with EN13725:2003, for the samples collected at the emission source, and using the dilution method for the samples representative of the conditions detectable at the receptor level.

Conclusions
The Instrumental Odour Monitoring System (IOMS) is a useful tool to continuously monitor odour emissions from a municipal solid waste treatment plant, and to control their potential negative impacts on the surrounding area in terms of the predicted OU_E m⁻³. The elaboration of an OQMM, using parametric or non-parametric mathematical prediction techniques, is therefore necessary. Among the investigated techniques, ANN (a non-parametric technique) showed the best performance on the basis of R² and RMSE, while MLR (a parametric technique) showed the lowest.
This study shows how the choice of the prediction technique affects the accuracy of an IOMS in terms of environmental odour concentration prediction. The production of flexible IOMSs that can easily implement different statistical models is therefore suggested, in order to analyse and identify the best performing model for each specific case.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.