Efficient Monitoring of Microbial Communities and Chemical Characteristics in Incineration Leachate with Electronic Nose and Data Mining Techniques

Incineration leachate is a hazardous liquid waste that requires careful management because of its high levels of organic and inorganic pollutants; without proper treatment and monitoring, it can have serious environmental and health implications. This study applied a novel electronic nose (e-nose) to monitor the microbial communities and chemical characteristics of incineration leachate. The dimensionality of the e-nose data was reduced using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (TSNE). Random forest (RF) and gradient-boosted decision tree (GBDT) algorithms were employed to establish relationships between the e-nose signals and the chemical characteristics (such as pH, chemical oxygen demand, and ammonia nitrogen) and microbial communities (including Proteobacteria, Firmicutes, and Bacteroidetes) of the incineration leachate. The PCA-GBDT models performed well in recognizing leachate samples, achieving 100% accuracy for the training set and 98.92% accuracy for the testing set without overfitting. The GBDT models based on the original data performed well in predicting changes in chemical parameters, with R² values exceeding 0.99 for the training set and 0.86 for the testing set. The PCA-GBDT models also performed well in predicting microbial community composition, achieving R² values above 0.99 and MSE values below 0.0003 for the training set and R² values exceeding 0.86 and MSE values below 0.015 for the testing set. By combining e-noses with data mining, this research provides an efficient monitoring method that supports the enforcement and implementation of monitoring programs and offers more insight than traditional instrumental measurements.


Introduction
Incineration leachate is a complex organic wastewater generated during the treatment of municipal solid waste; it contains proteins, volatile fatty acids, and refractory organics [1]. Treating incineration leachate is challenging because of its intricate composition and potentially harmful contents, such as carcinogens and toxins. Proper monitoring and treatment are crucial to prevent these pollutants from contaminating the surrounding environment [2,3].
Research on incineration leachate has mainly investigated the properties of the concentrated leachate [4], the molecular changes that occur in organic matter during treatment [5,6], and the alterations in microorganisms [7,8] that occur during different processes. These studies indicate that the headspace gas above leachate may contain valuable information for monitoring or processing the leachate. So far, only a small number of comprehensive studies have extracted information from large quantities of raw data on the types, concentrations, and changes in these materials.
An electronic nose (e-nose) is a promising candidate for this task. It is designed to mimic the human sense of smell by detecting and analyzing volatile organic compounds (VOCs) in headspace gas [9]. With a combination of sensors, such as metal oxide sensors, conducting polymers, and quartz crystal microbalance sensors, an e-nose measures the changes in electrical resistance or impedance that result from interaction with the headspace gas of samples [10]. The sensor data are then processed and analyzed using machine learning to identify specific compounds and determine their concentrations [11]. Because they provide rapid and non-invasive analyses, e-noses are suitable for a wide range of industries and applications [12,13]. However, to the best of our knowledge, studies on leachate detection based on e-nose technology are rare.
For incineration leachate, the variety and quantity of microorganisms in each process are important, as they lead to different treatment effects in waste incineration plants [14]. The relationship between headspace gas and microorganisms in leachate is noteworthy and complicated [15]. Microorganisms consume oxygen and carbon dioxide in headspace gas through respiration and metabolism, and they produce nitrogen, nitrous oxide, and other organic substances [16]. Therefore, microorganisms play important regulatory roles in the composition of headspace gas. In addition, the growth of microorganisms is influenced by the oxygen content and pH value of the leachate [17], which consequently have feedback effects on the headspace gas and leachate. Overall, the relationship between headspace gas and microorganisms is a complex system of mutual influence and regulation [18]. Studying this relationship is essential for understanding the biological processes of leachate and optimizing leachate treatment technology. However, molecular biology techniques have low specificity and require a significant amount of time to perform. In this study, e-nose technology was applied to mine headspace gas information to study microorganism changes.
Our main objectives were: (1) to monitor the changes in leachate headspace gas based on e-nose technology; (2) to process sensor signals based on data reduction and machine learning (random forest and gradient-boosted decision tree); and (3) to mine information on the relationship between headspace gas and microorganisms in leachate based on e-nose data. By combining e-nose technology with machine learning to analyze the relationships among leachate gas emissions, chemical parameters, and microorganisms, this research offers a more efficient monitoring method that supports the enforcement and implementation of monitoring programs and provides more insight than traditional instrumental measurements.

Sample Collection
Incineration leachate samples were obtained from a local waste incineration power plant (Xiaoshan Jinjiang Green Energy Co., Ltd., Hangzhou, China), a subsidiary of Zheneng Jinjiang Environment Holding Co., Ltd. (Hangzhou, China). This company is a pioneer and leader in China's WTE (waste-to-energy) industry. The incineration power plant is located in the southeast part of Hangzhou next to the East China Sea and has a treatment scale of 1900 tons per day of WTE.
On 15 August 2022, leachate samples were collected from six water outlets. As shown in Figure 1, the samples were labeled as LRW (leachate raw water), LE (leachate effluent), ICRE (internal circulation reactor effluent), AeroE (aerobic effluent), ANE (anaerobic effluent), and MBRE (MBR effluent). The samples were stored in a fridge at below 4 °C and transported to a lab for analysis.

E-Nose Diagnose for Leachate Characteristic in the Headspace Gas
The headspace gas of the leachate samples was analyzed with a commercial PEN2 e-nose (Airsense Analytics GmbH, Schwerin, Germany). The core components of this device are MOS sensors, as described in Table 1. The MOS sensors transform the gas types and concentrations into electrical signals (R/R0, where R is the sensor resistance in the sample headspace gas and R0 is the sensor resistance in clean air), presenting complementary information on the whole headspace gas rather than on specific compounds. To protect the e-nose sensors, the leachate samples were diluted with Wahaha purified water with a conductivity of ≤5 µS/cm (Hangzhou Wahaha Group Co., Ltd., Hangzhou, China). The ratio of purified water to leachate was 4:1. First, 5 mL of the diluted leachate sample was placed into a 500 mL beaker sealed with plastic wrap, and the beaker was kept still for 30 min. The gas flow rate was set to 200 mL/min, and each e-nose measurement took 80 s. After each measurement, the sensor chamber was cleaned with clean air. In total, 144 samples (24 samples for each of the six water outlets) were selected.

Chemical Parameters Detection for Incinerator Leachate
Chemical parameters (pH, chemical oxygen demand (COD), and ammonia nitrogen (NH4+-N)) were measured on site according to the national standards. The electrode method [20] was applied to measure pH values. The chlorine emendation method [21] was used instead of the dichromate method to determine COD. The concentration of ammonia nitrogen was measured using Nessler's reagent spectrophotometry [22].

Microbial Community and Functional Potential
Genomic DNA was isolated from the sediment of the leachate deposit and quantified using a NanoDrop 2000 Spectrophotometer (Thermo Fisher Technology, Waltham, MA, USA). The quality of the DNA was further confirmed with gel electrophoresis. Six samples (LRW, LE, ICRE, AeroE, ANE, and MBRE) were analyzed. To amplify the target genomic 16S rRNA (V3-V4 region), we used the PCR primer sets 338F (5'-ACTCCTACGGGAGGCAGCA-3') and 806R (5'-TCGGACTACHVGGGTWTCTAAT-3') in conjunction with an Applied Biosystems 2720 thermal cycler. The amplification program consisted of an initial denaturation step at 98 °C for 2 min, followed by 30 cycles of denaturation at 98 °C for 15 s, annealing at 55 °C for 30 s, and extension at 72 °C for 30 s. A final extension step was performed at 72 °C for 5 min. After amplification, the products were purified using the Axygen gel recovery kit and quantified with a microplate reader (BioTek, FLx800). The sequencing results were clustered into OTUs at a 97% similarity level using the QIIME software. Comparisons of bacterial richness and diversity were performed using the Chao1, ACE, Shannon-Wiener, and Simpson indices. Analyses were performed using the Personalbio online analysis platform.

Data Reduction for E-Nose Sensor Signals

Principal Component Analysis
Principal component analysis (PCA) is a mathematical technique used to decrease the dimensionality of a dataset by mapping the data onto a space with fewer dimensions. PCA works by finding the directions in the data that have the highest variance (i.e., the directions that contain the most information) and projecting the data onto these directions. This results in a new set of variables, called principal components (PCs), that are orthogonal to each other and capture the most important information in the data. This method can be used to visualize high-dimensional data in lower dimensions. The steps of PCA are as follows:
(1) Normalize the range of the continuous input data.
(2) Calculate the covariance matrix to detect associations.
(3) Compute the eigenvalues and eigenvectors of the covariance matrix to discover the dominant factors.
(4) Generate a feature vector to determine which principal components should be retained.
(5) Transform the data onto the principal component axes.
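The five steps above can be sketched from scratch in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`standardize`, `covariance`, `top_eigenpairs`, `pca`) and the sample readings are hypothetical, and the eigenpairs are obtained by simple power iteration with deflation, whereas a numerical library would normally be used in practice.

```python
import math
import random


def standardize(rows):
    """Step 1: rescale every column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(z):
    """Step 2: sample covariance matrix of already-centred data."""
    n, d = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpairs(cov, k, iters=500, seed=0):
    """Step 3: leading k eigenpairs via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    mat = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(
            v[i] * sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)
        )
        pairs.append((eigval, v))
        # Remove the component just found so the next pass finds the next one.
        for i in range(d):
            for j in range(d):
                mat[i][j] -= eigval * v[i] * v[j]
    return pairs


def pca(rows, k):
    """Steps 4-5: keep k components and project the data onto them."""
    z = standardize(rows)
    cov = covariance(z)
    pairs = top_eigenpairs(cov, k)
    total = sum(cov[i][i] for i in range(len(cov)))
    ratios = [eigval / total for eigval, _ in pairs]  # explained variance ratio
    scores = [[sum(a * b for a, b in zip(r, vec)) for _, vec in pairs] for r in z]
    return ratios, scores


# Hypothetical two-sensor readings in which the second column doubles the first.
readings = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
ratios, scores = pca(readings, k=1)
print(ratios)
```

Summing the returned ratios over the retained components gives the cumulative explained variance, which is the quantity typically used to decide how many PCs to keep.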

T-Distributed Stochastic Neighbor Embedding
T-distributed stochastic neighbor embedding (TSNE) is also a dimensionality reduction technique that is often used to visualize high-dimensional data in lower dimensions by preserving the similarities between neighboring data points. TSNE allows data to be visualized on a two- or three-dimensional scatter plot where similar data points are clustered together and dissimilar data points are separated from each other. It is effective at visualizing data with complex, non-linear structures, such as clusters of different shapes and sizes. The steps of TSNE are as follows:
(1) Find the pairwise similarity between nearby points in the high-dimensional space.
(2) Map the points in the high-dimensional space to a low-dimensional map according to their pairwise similarity.
(3) Calculate the similarity between points in the low-dimensional space using a Student t-distribution.
(4) Use gradient descent on the Kullback-Leibler divergence to minimize the difference between the high-dimensional and low-dimensional similarity distributions and find a low-dimensional representation of the data.
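For reference, the quantities named in these steps are usually written as follows in the standard formulation of the method. The source does not give these formulas, so the symbols are assumptions: x_i are the original sensor vectors, y_i their low-dimensional images, n the number of samples, and σ_i a per-point bandwidth set by the perplexity.

```latex
% High-dimensional similarity of point j to point i (Gaussian kernel, bandwidth \sigma_i)
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarity (Student t-distribution with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Loss minimized by gradient descent over the positions y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The loss value tracked per iteration in a TSNE run is this divergence C.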

Data Treatment

Random Forest
A random forest (RF) is an ensemble learning algorithm used for classification and regression tasks. An ensemble is a collection of individual models that are combined to make a single, more powerful model. An RF consists of individual decision trees that are trained on different subsets of data and then combined to perform a prediction. An RF is easily implemented and can handle both continuous and categorical data; additionally, its resistance to overfitting means that an RF generalizes well to new data.

Gradient-Boosted Decision Tree
A gradient-boosted decision tree (GBDT) is also an ensemble learning algorithm, similar to an RF. However, a GBDT works by sequentially training decision trees on the residuals (errors) of previous trees. This means that each tree is trained to correct the mistakes of previous trees, and the final model is a combination of all trees. A GBDT is also flexible and can be customized using different loss functions and regularization techniques. However, a GBDT can be computationally intensive and can overfit training data if not properly regularized, so it is important to carefully tune the model's hyperparameters.
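The residual-fitting idea can be illustrated with a deliberately small sketch. It assumes a single input feature, squared-error loss, and depth-one trees (stumps); the function names, the learning rate, and the toy data are hypothetical and are not taken from the study, which does not report its implementation.

```python
def fit_stump(x, residuals):
    """Fit a depth-one regression tree: one threshold, one mean per side."""
    order = sorted(set(x))
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), order[0], mean_all, mean_all)
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left)
        sse += sum((r - rmean) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1:]


def predict_stump(stump, xi):
    thr, lmean, rmean = stump
    return lmean if xi <= thr else rmean


def fit_gbdt(x, y, n_trees=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial prediction: the mean target
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate (a form of regularization).
        pred = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, pred)
        ]
    return base, stumps, learning_rate


def predict_gbdt(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


# Hypothetical toy data: a step from 1.0 to 5.0 between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_gbdt(x, y)
print(predict_gbdt(model, 2), predict_gbdt(model, 5))
```

The number of trees and the learning rate trade off against each other: fewer, larger corrections fit faster but are more prone to overfitting, which is why these hyperparameters need tuning.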

Model Evaluation
In total, 144 samples were collected; 100 samples were set as the training data, and the rest were set as the testing data. A receiver operating characteristic (ROC) curve was used to display the performance of each classifier (RF and GBDT). An ROC curve shows the trade-off between the true positive rate (sensitivity) and the false positive rate across different thresholds. The area under an ROC curve (AUC) is a common metric used to summarize the overall performance of a classifier, with values closer to 1 indicating better performance [23]. Each model was run 20 times, and the results are given as the average value of those 20 model runs.
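As an illustration of how an ROC curve and its AUC are obtained from classifier scores, the following sketch handles the binary case. For the six leachate classes, one curve per class could be built in a one-vs-rest fashion; that setup, the function names, and the example scores are assumptions rather than details reported in the study. The sketch also assumes that both classes are present in the labels.

```python
def roc_curve(scores, labels):
    """Return (FPR, TPR) points by sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores for one class versus the rest (1 = target class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_curve(scores, labels)))
```

The AUC equals the fraction of positive-negative pairs in which the positive sample receives the higher score, which is why a value of 1 corresponds to perfect separation.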
For prediction models, the coefficient of determination (R²) and the mean square error (MSE) were selected as the evaluation parameters. The higher the R² and the lower the MSE, the more accurate the prediction model.
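Both metrics follow directly from their usual definitions, sketched below. The function names and the example values are hypothetical, and the R² form shown (one minus the ratio of residual to total sum of squares) is the common definition, which is assumed here because the source does not state its formula.

```python
def mse(y_true, y_pred):
    """Mean square error: average squared difference between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted values.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(mse(measured, predicted), r2(measured, predicted))
```

Note that MSE depends on the scale of the target, so values for relative abundances and for concentrations in mg/L are not directly comparable, whereas R² is scale-free.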

E-Nose Sensor Signals
The response values of the e-nose sensors are presented as R/R0, where R and R0 are the sensor responses to the sample gas and the zero gas, respectively. Figure 2 shows the means and standard deviations of the e-nose sensor signals for each leachate sample; the signal characteristics differed considerably between samples. The sensor that showed the strongest responses to volatile compounds was S2. According to Table 1, S2 is very sensitive, gives negative signals, and reacts with nitrogen oxides, which might mean that the leachate samples had high abundances of nitrogen compounds. Sensors S4, S6, S7, S8, and S9 all exhibited strong responses to the samples, suggesting that the leachate's headspace gas contained relatively high levels of methane and sulfur compounds. The signals provided by sensors S1, S3, and S10 indicated that there were no significant differences among procedures.
The Pearson correlations between e-nose sensor signals are displayed in Figure 3. The 10 sensors showed different correlations. S1 had high correlations (positive or negative) with S2, S3, S5, S6, and S8. S1 had high correlations with S1, S2, S3, S5, S6, S7, S8, and S9. These correlations were observed frequently among the e-nose sensors, suggesting that the headspace gas information could be detected by all sensors but may have overlapped. It is important to have varied cross-sensitivity within a sensor array, and these findings indicate that e-nose technology is capable of discriminating leachate samples. To make better use of e-nose data, signals should be reduced to extract valid information.
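A pairwise correlation matrix of this kind can be computed as sketched below. The function names and the three short signal columns are hypothetical; each column is assumed to hold one sensor's R/R0 values across the same samples and to be non-constant.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Correlate every sensor column with every other column."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Hypothetical signals from three sensors over four samples.
sensors = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # rises with the first sensor
    [4.0, 3.0, 2.0, 1.0],   # falls as the first sensor rises
]
for row in correlation_matrix(sensors):
    print([round(value, 2) for value in row])
```

Entries close to 1 or −1 flag sensor pairs that carry largely redundant information, which is the motivation for the dimensionality reduction that follows.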

Data Reduction Based on PCA and TSNE
Data reduction can be used to aggregate original data into a representative subset of data or transform them into a more compact representation [24]. Here, PCA was applied to reduce the size and the complexity of the original e-nose dataset while preserving as much information as possible [25]. By converting the original e-nose data into new linear combinations of variables, the principal components (PCs), we used PCA to extract a new dataset with variables orthogonal to each other. To assess the performance of the PCA, the cumulative variance of the variables was used. The variance of each PC was taken as the feature importance, as displayed in Figure 4a. The first three PCs together accounted for more than 85% of the total variance. Figure 4b shows the distribution of the 144 samples in three dimensions. The LRW, LE, ICRE, and ANE clusters were clearly separated from each other. The borders between AeroE and MBRE were not well-defined, with some samples completely overlapping. This might imply that the headspace gases in the AeroE and MBRE samples were very similar.

As a dimensionality reduction technique used to visualize high-dimensional data, TSNE has been successfully applied to e-nose data. By reducing the high-dimensional e-nose data (10 dimensions) into a lower-dimensional space, the samples in this study could be easily visualized in three-dimensional space. Here, TSNE ran for a fixed number of iterations chosen according to the loss value, with each iteration improving the alignment between the high-dimensional and low-dimensional probability distributions. When the iteration number reached 120, the loss value no longer improved; see Figure 5a. Therefore, the iteration number was set to 120 for the e-nose data. As seen in Figure 5b, the LRW, LE, ICRE, and ANE clusters were clearly separated from each other and more compact than in the PCA results shown in Figure 4b. As with PCA, the borders between AeroE and MBRE in Figure 5b were not well-defined, with some samples totally overlapping.

Leachate Chemical Characterization
Leachate characteristics are highly variable and heterogeneous. In this study, the chemical characteristics of incineration leachate, including pH, COD, and ammonia nitrogen, were measured. Table 2 shows the chemical parameter results of the six procedures, with statistically significant differences (Tukey HSD, p < 0.05). The pH value ranged from 6.45 to 8.29, and the changes were not very regular. The changes in COD and ammonia nitrogen were very noticeable, with LE showing the highest values (33,860 mg/L for COD and 2472 mg/L for ammonia nitrogen) and MBRE showing the lowest values (361.2 mg/L for COD and 7.44 mg/L for ammonia nitrogen). The conversion of LRW to MBRE resulted in a COD removal efficiency of 97.71%, which was higher than the maximum removal efficiency (63.59%) achieved with the contaminant coagulation treatment process [4]. The procedure used in this study achieved a high ammonia nitrogen removal efficiency of 99.34%, which was higher than the 98.98% removal efficiency previously obtained with a spacer tube reverse osmosis membrane [26]. Notably, the chemical parameters of the LE reached their highest (COD and ammonia nitrogen) or lowest (pH) values because the incineration leachate was concentrated during this procedure. The processed leachate was discharged into a municipal pipe network with chemical parameters that met the applicable standards.
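The removal efficiencies quoted above presumably follow the usual definition, written out below. The source does not state the formula, so the symbols are assumptions: C_in is the concentration in the raw leachate (LRW) and C_out the concentration in the final effluent (MBRE), both in mg/L, as listed in Table 2.

```latex
\eta = \frac{C_{\mathrm{in}} - C_{\mathrm{out}}}{C_{\mathrm{in}}} \times 100\%
```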

Microbial Community Composition and Functional Potential Prediction
The microbial communities in the waste incineration leachate were assessed in terms of amplified 16S rDNA fragments. This type of data is commonly generated through the DNA sequencing of bacterial communities, where the relative abundance of different bacterial taxa can be inferred based on the number of sequencing reads corresponding to each taxon. The profiles of the bacterial communities were complex, and the data revealed that there was a high degree of variation between samples. Table 3 displays the respective phylum- and genus-level abundances of microbial communities. In the leachate samples, Proteobacteria, Firmicutes, and Bacteroidetes were the top three phyla, accounting for more than 90% of the total bacterial community. These findings were similar to those described in previous investigations of fresh incineration leachate [27]. Notably, the relative content (but not absolute content) of Proteobacteria increased with the changing processing procedures. On the contrary, the relative contents of Firmicutes and Bacteroidetes decreased with the changing processing procedures. The microbial communities in the processed leachate were established as meeting the required standards before being released into the municipal pipe network.
Using PICRUSt 2 and the KEGG database (https://www.arb-silva.de/, version: silva_132), metabolic pathways were predicted to determine the functional composition associated with the leachate samples, as shown in Figure 6. The functional gene families were categorized into groups that included metabolism, genetic information processing, cellular processes, environmental information processing, organismal systems, and human diseases. Metabolism was the dominant category, accounting for more than 85% of the total abundance. The dominant level 2 metabolism pathways were the metabolism of cofactors and vitamins (13.5-15.3%), carbohydrate metabolism (13.2-14.1%), amino acid metabolism (13.7-14.9%), the metabolism of terpenoids and polyketides (11.3-12.5%), and the metabolism of other amino acids (7.8-10.2%). These results indicated high bacterial activity. Human disease-related pathways were uncommon. Environmental information processing pathways included signal transduction (0.8-1.9%) and membrane transport (1.1-2.3%).

In this study, the area under the curve (AUC) and receiver operating characteristic (ROC) curves were applied to evaluate the performance of the classification models. A high AUC score indicated that the model had a good balance between the true positive rate (TPR) and the false positive rate (FPR), meaning that it could accurately distinguish samples and be useful for the monitoring task. The closer the AUC score was to 1, the closer the model was to perfect classification. As shown in Figure 7(a1,b1,c1), the AUC scores for the RF models based on the original e-nose data, PCA-processed e-nose data, and TSNE-processed data were 0.9926, 0.998, and 0.998, respectively. Thus, the RF models could successfully classify the six leachate samples.

To further analyze the classification results, ROC curves were used to organize classifiers and visualize the results. In the ROC graphs, the closer the curve is to the (0, 1) point, the better the performance of the classifier. As seen in Figure 7(a2,b2,c2), the classification accuracies for each leachate sample were very different for the training set. Regarding the RF models, the classification model based on TSNE showed a higher accuracy than the models based on the original data and the PCA-processed data. The models based on the original data and the PCA-processed data misclassified samples in every class, whereas the model based on TSNE only misclassified ANE and MBRE samples, possibly because the ANE and MBRE classes overlapped (as seen in Figure 4a) and the headspace gases of the ANE and MBRE samples were very similar, making these samples difficult to classify.

To ensure accurate classification performance, testing datasets were used, and each model was run 100 times to reduce the impact of volatility. The average results are displayed in Table 4. The TSNE-RF classification model had the best performance, with 99.49% accuracy for the training set and 97.36% accuracy for the testing set, suggesting that the TSNE-RF model was more robust than the original-RF and PCA-RF models.

According to the AUC graphs shown in Figure 8(a1,b1,c1), the best classification result was achieved by the PCA-GBDT model, with an AUC value of 0.9995. The original-GBDT and TSNE-GBDT models did not reach performance levels comparable to those of the original-RF and TSNE-RF models shown in Figure 7(a1,c1). As shown in Figure 8(a2,b2,c2), the classification accuracy rates of the GBDT models were very different. The PCA-GBDT model showed the best accuracy among all models (original-RF, PCA-RF, TSNE-RF, original-GBDT, and TSNE-GBDT), with no samples misclassified. The original-GBDT and TSNE-GBDT models misclassified samples to different degrees. The GBDT models were run 100 times to decrease their instability, and the classification results of the training and testing data are displayed in Table 5. The results suggest that the PCA-GBDT model had excellent classification performance, achieving 100% accuracy for the training set and 98.92% accuracy for the testing set. As summarized in Tables 4 and 5, the PCA-GBDT models showed satisfactory performance for both the training and the testing datasets, with no overfitting in the modeling.

An RF is considered a powerful and flexible tool for predicting continuous numerical values. During modeling, multiple classification and regression trees (CARTs) are trained on different randomly selected subsets of the training data (bagging), which helps to reduce model variance and overfitting while making the model more robust to noise in the data. In this study, the number of CARTs was set to 35 according to the R² and MSE values. As with the classification procedure, the prediction models were run 100 times to reduce volatility. The average R² and MSE values for the RF prediction models based on the original e-nose dataset, the PCA-processed dataset, and the TSNE dataset are displayed in Tables 6-8, respectively. The e-nose signals provided complete information on leachate headspace gas, which predominantly contained volatile compounds such as hydrogen sulfide, methyl mercaptan, acetylene, and other similar compounds. The results for the testing data were not as good as those for the training data because the model was applied to new, unseen data that may have had different characteristics or distributions. Another possible reason for the gap between the training and testing results is overfitting, which could have led to poorer generalization performance. The overall performance on the training dataset was better than that on the testing dataset, but the testing results were still acceptable, with R² > 0.80.
Regarding microbial community composition, the relative contents of Proteobacteria, Firmicutes, and Bacteroidetes were predicted by the RF models. For the training dataset,

Figure 2. The means and standard deviations of the e-nose sensor signals for each leachate sample.

Figure 3. Sensor signal correlations based on Spearman's correlations. The color scale denotes the correlations, with 1 indicating a positive correlation (red) and −1 indicating a negative correlation (blue).

Figure 4. Visualization of e-nose data dimensionality reduction based on PCA: (a) feature importance according to variance; (b) sample distribution based on the first three PCs.

Chemosensors 2023, 11

Figure 5. Visualization of e-nose data dimensionality reduction based on TSNE: (a) the loss value according to iteration; (b) sample distribution based on the first three TPs.

Figure 6. Prediction of community functional potential (percentage per million functional units) for six leachate samples (LRW, LE, ICRE, ANE, AeroE, and MBRE) based on the KEGG database.

Figure 7. The evaluation of RF classification based on different datasets: (a1) AUC based on the original data, (b1) AUC based on the PCA data, and (c1) AUC based on the TSNE data; (a2) ROC curve based on the original data, (b2) ROC curve based on the PCA data, and (c2) ROC curve based on the TSNE data. Class S1 refers to LRW, class S2 refers to LE, class S3 refers to ICRE, class S4 refers to AeroE, class S5 refers to ANE, and class S6 refers to MBRE.

Figure 8. Evaluation of GBDT classification based on different datasets: (a1) AUC based on the original data, (b1) AUC based on the PCA data, and (c1) AUC based on the TSNE data; (a2) ROC curve based on the original data, (b2) ROC curve based on the PCA data, and (c2) ROC curve based on the TSNE data. Class S1 refers to LRW, class S2 refers to LE, class S3 refers to ICRE, class S4 refers to AeroE, class S5 refers to ANE, and class S6 refers to MBRE.

Table 2. Average values of leachate chemical parameters.
a The values are the average of three leachate sample replications. A mean in the same row followed by different inline letters (a, b, c, d, e) is statistically different, as confirmed with Tukey's HSD test (p < 0.05).

Table 3. Bacterial taxonomic identification and relative abundances at the phylum level in each leachate sample at different water outlets.

sample were very different for the training set. Regarding the RF models, the classification model based on TSNE showed a higher accuracy than the models based on the original data and the data processed with PCA. Models based on the original data and the data processed with PCA misclassified samples for each class, whereas models based on TSNE only misclassified ANE and MBRE samples, possibly because the ANE and MBRE classes overlapped (as seen in Figure

Table 4. The classification results for the training and testing sets based on RF models (100 runs).

Table 5. The classification results for the training and testing sets based on GBDT models.

3.6. Prediction Results of Chemical Parameters and Microbial Community Contents Based on E-Nose Data
3.6.1. Prediction Results of Chemical Parameters and Microbial Community Contents Based on RF

Table 6. Comparison of the RF prediction models based on the original e-nose dataset.

Table 7. Comparison of the RF prediction models based on PCA.

Table 8. Comparison of the RF prediction models based on TSNE.