3.1. Feature Selection
In order to select the independent variables that best reflect the input–output relationship and to remove redundant ones, the Recursive Feature Elimination algorithm with cross-validation (RFECV) was adopted to select a subset of the 32 fractional-difference values xi used as input to the models. To balance parsimony against information content, the 12 features ranked between rank 1 and rank 8 by the RFECV algorithm were retained. The features selected for classification, based on the RFECV ranking, were the following sensors: S13 (MOS), S21 (NCA), S24 (NCA), S44 (MOS), S46 (MOS), S7 (EC), S57 (MOS), S54 (NCA), S11 (MOS), S42 (MOS), S23 (NCA), and S9 (EC). The features selected for regression were the following sensors: S12 (MOS), S5 (AS), S55 (NCA), S11 (MOS), S49 (MOS), S47 (MOS), S7 (EC), S10 (EC), S4 (AS), S51 (NCA), S13 (MOS), and S23 (NCA). Given the nature of the RFECV algorithm, the differences between the two lists are not straightforward to explain. It can be noted, however, that the environmental sensors S5 and S4 were relevant when the regression task was to be carried out by the IOMS, while classification appears unaffected by such conditions. This observation is consistent with the literature [13], wherein classification was carried out using 13 MOS chemical sensors without taking the three environmental sensors into account; more recent findings [47] suggest, however, that the temperature of the gaseous flux in the measurement chamber can be a relevant feature for increasing the overall classification accuracy (from 96% to 98%).
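The paper does not report its software stack; as a minimal sketch, the RFECV step could look as follows with scikit-learn, using synthetic placeholder data in place of the 32 fractional-difference features (the estimator choice, the synthetic dataset, and the rank cut-off implementation are all assumptions, not the authors' actual pipeline):

```python
# Hypothetical sketch of RFECV feature selection (scikit-learn assumed);
# the data are synthetic stand-ins for the 32 fractional-difference features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=600, n_features=32, n_informative=12,
                           n_classes=4, random_state=0)

selector = RFECV(RandomForestClassifier(n_estimators=20, random_state=0),
                 step=1, cv=5)
selector.fit(X, y)

# Retain features ranked between 1 and 8, mirroring the cut-off in the text.
kept = np.where(selector.ranking_ <= 8)[0]
print(f"{len(kept)} features kept: {kept}")
```

Features eliminated last receive the best (lowest) ranks, so thresholding `ranking_` recovers the most informative subset.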
The hyperparameters of the ML models were then selected for the classification task: the number of hidden layers and the number of nodes per layer for MLP, and the number of decision trees (estimators) for Random Forest. The number of hidden layers was varied between 1 and 7, with an initial number of 10 neurons per layer. As highlighted in Figure 4, the maximum accuracy for MLP was achieved with five hidden layers.
After fixing the number of hidden layers, a trial-and-error approach was used to select the number of neurons per layer, starting from 10. The best results in terms of accuracy were obtained with 100 neurons.
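A trial-and-error depth sweep of this kind can be sketched as below, assuming scikit-learn; the synthetic data are placeholders for the 12 selected features and 4 odor classes, and the cross-validated scoring is one plausible way to compare depths, not necessarily the authors' procedure:

```python
# Illustrative depth sweep for the MLP (scikit-learn assumed); synthetic
# placeholder data stand in for the 12 selected features and 4 odor classes.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=8,
                           n_classes=4, random_state=0)

scores = {}
for depth in range(1, 8):                                  # 1 to 7 hidden layers
    mlp = MLPClassifier(hidden_layer_sizes=(10,) * depth,  # 10 neurons per layer
                        max_iter=300, random_state=0)
    scores[depth] = cross_val_score(mlp, X, y, cv=3).mean()

best_depth = max(scores, key=scores.get)
print(best_depth, scores)
```

The same loop, with the depth fixed at the winner, can then be repeated over the neuron count (10, 50, 100, ...) to complete the selection.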
The classification task with RF was first carried out by selecting the hyperparameter of the model, namely, the number of estimators. As can be seen from Figure 5
, the accuracy value for RF rapidly increased and stabilized when more than five estimators were used; thus, 20 estimators were chosen.
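The estimator-count selection can be sketched analogously (scikit-learn assumed, synthetic placeholder data); the grid of candidate counts below is illustrative:

```python
# Sketch of the estimator-count selection for Random Forest (scikit-learn
# assumed), again on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, n_informative=8,
                           n_classes=4, random_state=0)

accuracy = {n: cross_val_score(RandomForestClassifier(n_estimators=n,
                                                      random_state=0),
                               X, y, cv=3).mean()
            for n in (1, 2, 5, 10, 20, 50)}
print(accuracy)   # accuracy typically plateaus after the first few estimators
```

Once the accuracy curve flattens, adding estimators only increases computational cost, which motivates stopping at a modest count such as 20.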
Once the models (MLP and RF) were selected, the classification accuracy rates for each class and the overall accuracy rate were calculated for the best models. The results for the training set are shown in Table 2.
Such high values are not unusual when a model is tested on its own training set. To check for overfitting, and the consequent poor predictive performance on new data, 5-fold cross-validation (CV) was adopted, which has not been applied thus far in papers dealing with IOMS classification or regression applications in complex field situations. In detail, the overall accuracy rate of MLP and RF was calculated five consecutive times by splitting the training dataset (600 data points) into internal training data (480 data points) and validation data (120 data points), with a different split each time. The models were fitted on the internal training data, and the scores were computed on the validation data.
Table 3 presents the overall classification accuracies as the scores of the models for each split, together with the mean cross-validation (CV) score and the associated standard deviation. The low standard deviations indicate that the choice of training data did not affect the overall classification accuracy, thus indicating that overfitting was avoided.
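This splitting protocol (600 samples, folds of 480 internal-training and 120 validation points) can be reproduced as a minimal sketch, assuming scikit-learn and synthetic placeholder data:

```python
# Minimal sketch of the 5-fold CV protocol: 600 samples are split into 480
# internal-training and 120 validation points per fold (scikit-learn assumed;
# the data are synthetic placeholders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=12, n_informative=8,
                           n_classes=4, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=20,
                                                random_state=0),
                         X, y, cv=cv)
print(scores, scores.mean(), scores.std())
```

The standard deviation of `scores` is the quantity whose smallness is taken above as evidence against overfitting.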
Then, the prediction capability of the selected models was evaluated on the test dataset. The goodness of the classification can be examined through a confusion matrix, which shows which odor classes were correctly classified and in what quantity. Each row of the matrix represents the instances of an actual class, while each column represents the instances assigned by the classifier.
Figure 6 shows the confusion matrices obtained by MLP and Random Forest for the test dataset.
As regards the overall result, only three elements out of 150 were mismatched by both MLP and RF. From the confusion matrix, the per-class classification accuracy rates, the overall accuracy rate, and Cohen’s kappa can be easily calculated: the overall accuracy is given by the sum of the diagonal elements of the confusion matrix divided by the total, whereas the definition of Cohen’s kappa is more complex, although it too can be recovered from the components of the confusion matrix.
In Table 4
, the aforementioned classification metrics are shown for the test dataset. The Cohen’s kappa coefficient, used in the present paper as a suitable score parameter for multiclass classification, is 97% for both RF and MLP.
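Both metrics follow directly from the confusion matrix. The 4×4 matrix below is an illustrative placeholder constructed to match the figures in the text (three mismatches out of 150), not the matrix actually reported:

```python
import numpy as np

# Illustrative 4x4 confusion matrix: rows = actual class, columns = predicted.
# Placeholder values chosen to give 3 mismatches out of 150, as in the text.
cm = np.array([[37, 0, 0, 0],
               [1, 36, 1, 0],
               [0, 0, 38, 0],
               [0, 1, 0, 36]])

total = cm.sum()
overall_accuracy = np.trace(cm) / total  # diagonal sum divided by the total

# Cohen's kappa corrects the observed agreement p_o for chance agreement p_e,
# estimated from the marginal row and column totals.
p_o = overall_accuracy
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
kappa = (p_o - p_e) / (1 - p_e)
print(overall_accuracy, kappa)  # 0.98 and ~0.97
```

With balanced classes, kappa tracks the overall accuracy closely, which is why both models report 97% here despite the 98% raw accuracy.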
Apart from the overall accuracy rates, it is interesting to highlight the classification rate for Class 0, which represents ambient air without the influence of odor emissions from the three sources considered in the study, i.e., pretreatments (Class 1), sludge conditioning (Class 2), and biogas from anaerobic digestion (Class 3). No difference was observed in the classification accuracy rate between Class 0 and the other classes.
In a recent IOMS field application, misclassification of ambient air was found for both algorithms employed (Linear Discriminant Analysis and ANN), neither of which was able to identify this class correctly in any instance. It was suggested that this was because the sensors were more sensitive to the odorous classes [13
]. In the present case, ANN performed reasonably well for Class 0, giving a specific classification accuracy of 98% for this class, as compared to the poor values obtained in [13
], in which one hidden layer was chosen, as opposed to five in the present work. The divergence can be partially attributed to the different models: in recent years it has been recognized that a deeper model provides a hierarchy of layers that builds up increasing levels of abstraction from the input variables to the output variables, suggesting that deep architectures express a useful prior over the space of functions the model learns [48
]. Furthermore, RF also performed well for Class 0, so the significant difference in the classification accuracy rate might be related to the different characteristics of the ambient air, which, in our case, was not totally odorless owing to a nearby wastewater treatment plant emitting odorous compounds very different from the investigated classes. It is probable that the presence of these external sources in the ambient air helped to better discriminate the WWTP sources. This hypothesis could have been tested by sampling the odor emissions of the external sources, but unfortunately this was not possible. Access to each emission source is, in fact, essential to ascertain whether the Class 0 detections (not recognized) refer to emissions from another source or to odorless ambient air. This must be taken into account when IOMS are to be installed outside the fenceline of industrial emission sources that are not available for sampling. In other words, the current state of knowledge suggests that it is not meaningful to use an IOMS for classification when not all relevant odor emission sources have been sampled and used for training.
First, the results obtained on the training set are discussed; then, the results from the 5-fold cross-validation and the performance of the models on the test set are shown. For odor concentration regression, we chose the same MLP structure as for classification, and the associated results turned out to be satisfactory. Table 5 shows the coefficient of determination (R2) and the root mean squared error (RMSE) for both models on the training dataset. We also attempted to reduce the number of hidden layers and neurons to obtain a simpler network, but this only led to a loss of performance, similar to that observed for classification (Figure 4), with no significant decrease in computational time; thus, there was no advantage in changing the network hyperparameters.
Both algorithms were found to be very precise on the training set, with equal R2 values. In a similar case concerning odor emissions from a WWTP [30], a comparable R2 (0.996) was found for the training set with an ANN model (13 input nodes and one hidden layer with eight neurons) developed on data provided by the seedOA IOMS [13]. The RMSE was equal to 36.9 ouE/m3 for MLP and 6.8 ouE/m3 for RF, so RF may be considered more accurate than MLP on the training dataset.
Keeping in mind the results of [30], a straightforward comparison of the RMSE is not possible, as the dependent variable (odor concentration) ranged from 20 to 2435 ouE/m3 in our case, while it varied from 20 to 50,000 ouE/m3 in [30]. By normalizing the RMSE using the difference between the maximum (ymax) and minimum (ymin) of the training dataset (NRMSE = RMSE/(ymax − ymin)), we obtained similar results for MLP and the ANN in [30], with an NRMSE between 1.05% and 1.52%, while RF demonstrated an NRMSE of 0.28%.
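With the figures quoted above for the present dataset (training-set RMSE in ouE/m3, concentration range 20–2435 ouE/m3), the normalization works out as follows; the helper function is introduced here only for illustration:

```python
def nrmse(rmse, y_max, y_min):
    """Normalised RMSE: RMSE / (y_max - y_min)."""
    return rmse / (y_max - y_min)

# Training-set values quoted in the text for the present dataset.
print(f"MLP: {nrmse(36.9, 2435, 20):.2%}")  # ~1.5%
print(f"RF:  {nrmse(6.8, 2435, 20):.2%}")   # ~0.28%
```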
For the regression task, 5-fold cross-validation was also adopted to check whether the proposed models were overfitted and could therefore have poor predictive performance on new data. The mean CV score (R2
) and standard deviation CV score for the models were 0.9 and 0.008 for MLP and 0.95 and 0.029 for RF, respectively. Low standard deviations for the cross-validation scores for both models suggest that overfitting was avoided and the models could be used for prediction on the test set. In Figure 7
, the correlation between the odor concentrations measured using dynamic olfactometry for the test set (150 data points) and the values predicted by MLP is shown.
The coefficient of determination on the test set was 0.9, while the RMSE was 130 ouE/m3
, which is greater than the RMSE calculated on the training set (36.9 ouE/m3
). The NRMSE for MLP on the test data was 5.37%, which can be considered satisfactory given the uncertainty associated with measurements by dynamic olfactometry [49
]. The Random Forest model demonstrated comparable or slightly better results than MLP on the training set, and this was confirmed for the test set, as indicated in Figure 8.
The coefficient of determination on the test set was 0.92, while the RMSE was 97 ouE/m3: roughly fourteen times greater than the RMSE calculated on the training set. Unfortunately, no performance indicators on the test set are available from [30
], so that in Table 6
, a straightforward comparison between the models is only possible for those proposed in the present work.
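The test-set indicators compared here (R2, RMSE, NRMSE) can be computed as sketched below, assuming scikit-learn metrics; the true and predicted values are synthetic stand-ins, not the paper's data:

```python
# Sketch of the test-set evaluation (scikit-learn metrics assumed); the
# true/predicted values are synthetic placeholders, not the paper's data.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
y_true = rng.uniform(20, 2435, size=150)        # odor concentrations, ouE/m3
y_pred = y_true + rng.normal(0, 100, size=150)  # placeholder predictions

r2 = r2_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
nrmse = rmse / (y_true.max() - y_true.min())
print(r2, rmse, nrmse)
```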
While the two algorithms, i.e., MLP and RF, exhibited almost the same performance for classification, for regression we obtained slightly different values, which may be due to several factors. These include the difference between the RF classifier and the RF regressor, the different features selected by the feature-selection algorithm for classification and regression, and the different types of target for regression (a continuous variable) and classification (a discrete variable). Although the NRMSE for RF on the test data was 4%, indicating a better overall performance than MLP, it must be pointed out that the ratio between the test-set and training-set RMSE was 3.5 for MLP, whereas a 14-fold increase was observed for RF when comparing the RMSE for the training set (6.8 ouE/m3) and the test set (97 ouE/m3). Collecting more samples from different WWTPs [29] in order to increase the size of the dataset will be a valuable approach to investigate the discrepancy between the training-set and test-set RMSE and to verify whether the NRMSE on the test set remains below 10%, which can be considered a good result for reliable continuous fenceline monitoring of odor emissions.
The results show that it is possible, even for complex situations, to develop field instrumental odor monitoring applications enhanced by ML algorithms, which are capable of simultaneously performing classification and regression, with interesting practical applications. In fact, it is thought that the dissemination of detailed knowledge concerning algorithms and their performance in the environmental monitoring of odors, with transparent protocols useful for verifying their performance, will make policy makers and environmental protection agencies increasingly inclined to set odor thresholds at plant fencelines [33
] and evaluate these values in a monitoring campaign. This has already happened in recent years [21
] due to growing pressure from the public and citizens’ complaints. In such a rapidly evolving context, it may be useful to carry out classification and regression simultaneously, from both the plant operator’s and the Environmental Protection Agency’s points of view. When fenceline monitoring with IOMS is mandatory or strongly encouraged, plant operators will be interested in establishing which odor emission source within a plant is responsible for the highest concentrations detected at the fenceline, especially when the sources cannot be directly monitored, as in the case of fugitive emissions. If, on the other hand, the Environmental Protection Agency is requested to identify the most critical emission sources from one or more plants located nearby, odor concentration alone or odor class alone is insufficient to carry out this task; both odor concentration and class are essential to tackle such complex problems.
In the present case study, it may be necessary to establish the distribution of odor classes (0, 1, 2, 3) within a specific concentration range. If odor Class 0 (ambient air) is more frequent at low odor concentrations, this class can be associated with clean background air, while if it is more frequent at high concentrations, another source is responsible for the high odor concentrations. This source may come from outside the plant, having not been identified during IOMS training. The monitoring campaign to be carried out in the WWTP of Monopoli will address such issues. Although it is not possible to carry out classification when not all odor emission sources are used for training (Section 3.2
), the joint analysis of concentrations and odor classes could provide useful information on complex multisource odor emission problems.