Groundwater LNAPL Contamination Source Identification Based on Stacking Ensemble Surrogate Model
Abstract
1. Introduction
2. Methods
2.1. Simulation Model for Multiphase Flow
2.2. Quasi-Monte Carlo Sampling
- (1) Generate a Sobol sequence that fills the unit hypercube whose dimension equals the number of unknown variables, with one point for each sample to be drawn.
- (2) Apply a random shift to the Sobol sequence. Let the shift be a random vector of the same dimension, uniformly distributed over the unit hypercube. The new sequence, obtained by adding this vector to the Sobol sequence, is randomized and therefore possesses statistical properties.
- (3) Transform the new sequence by applying the prior distribution function of the unknown variables, resulting in the required number of sample sets representing the unknown variables.
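The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes SciPy's `scipy.stats.qmc` module, wraps the shifted points back into the unit hypercube with a modulo operation, and uses only the eight uniform priors listed for the non-permeability variables. The function name `shifted_sobol_samples` and the seed are hypothetical. SciPy may warn that 800 is not a power of two, which affects the balance properties of the sequence but not the validity of the sketch.

```python
import numpy as np
from scipy.stats import qmc


def shifted_sobol_samples(lower, upper, n_samples, seed=0):
    """Draw randomly shifted Sobol points and map them onto uniform priors."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    rng = np.random.default_rng(seed)

    # Step 1: deterministic (unscrambled) Sobol points in [0, 1)^dim.
    points = qmc.Sobol(d=dim, scramble=False).random(n_samples)

    # Step 2: add one uniform random shift vector to every point.
    # Wrapping modulo 1 (an assumption here) keeps the points in [0, 1).
    shift = rng.uniform(size=dim)
    shifted = (points + shift) % 1.0

    # Step 3: apply the inverse prior CDF. For uniform priors this is a
    # linear rescaling of each coordinate to [lower, upper).
    return qmc.scale(shifted, lower, upper)


# Initial estimation ranges of n, alpha_wL, alpha_wT, Lx, Ly, t_on, t_off, Q.
lower = [0.25, 40.0, 9.0, 60.0, 200.0, 0.0, 3240.0, 1.0]
upper = [0.35, 60.0, 15.0, 390.0, 470.0, 1080.0, 4320.0, 10.0]
samples = shifted_sobol_samples(lower, upper, n_samples=800)
```

Each row of `samples` is one candidate input vector for the simulation model. A non-uniform prior would replace the final rescaling with that prior's inverse cumulative distribution function.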
2.3. Surrogate Model
2.3.1. Stacking Ensemble Method
2.3.2. MLP
2.3.3. RF
2.3.4. DBN
2.4. Inversion Method
2.4.1. Overview of Bayesian Theory
2.4.2. Differential Evolutionary Markov Chain
- (1) Generate an initial population of parallel Markov chains. The population is characterized by the dimension of the unknown variables, the number of parallel Markov chains, and the number of iterations.
- (2) Employ the DE algorithm to generate a proposal chain from the set of chains.
- (3) Determine whether to accept the proposed chain based on the Metropolis probability rule.
- (4) Finally, Geweke’s convergence diagnostic is used to judge whether the Markov chain is stable [49].
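A minimal sketch of steps (1) to (4) is given below. It assumes the commonly used DE-MC proposal (the current state plus a scaled difference of two other randomly chosen chains plus a small perturbation) with the jump rate 2.38/sqrt(2d), and a toy standard-normal posterior in place of the surrogate-based likelihood. The names `de_mc` and `geweke_z` are hypothetical. The Geweke score here is simplified: it uses plain sample variances rather than spectral density estimates, so it is only indicative for autocorrelated chains.

```python
import numpy as np


def de_mc(log_post, init, n_iter, seed=0, eps_scale=1e-6):
    """DE-MC sketch. init: (n_chains, dim); returns (n_iter, n_chains, dim)."""
    rng = np.random.default_rng(seed)
    chains = np.array(init, dtype=float)
    n_chains, dim = chains.shape
    gamma = 2.38 / np.sqrt(2.0 * dim)  # assumed jump rate
    logp = np.array([log_post(x) for x in chains])
    history = np.empty((n_iter, n_chains, dim))
    for t in range(n_iter):
        for i in range(n_chains):
            # Step 2: difference of two other, distinct chains.
            others = [j for j in range(n_chains) if j != i]
            r1, r2 = rng.choice(others, size=2, replace=False)
            proposal = (chains[i]
                        + gamma * (chains[r1] - chains[r2])
                        + eps_scale * rng.standard_normal(dim))
            logp_new = log_post(proposal)
            # Step 3: Metropolis rule, accept with probability
            # min(1, p(proposal) / p(current)).
            if np.log(rng.uniform()) < logp_new - logp[i]:
                chains[i] = proposal
                logp[i] = logp_new
        history[t] = chains
    return history


def geweke_z(trace, first=0.1, last=0.5):
    """Simplified Geweke score comparing early and late chain segments."""
    trace = np.asarray(trace, dtype=float)
    a = trace[: int(first * trace.size)]
    b = trace[int((1.0 - last) * trace.size):]
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)


def toy_log_post(x):
    # Toy target: unnormalized log-density of a standard normal.
    return -0.5 * float(np.sum(x ** 2))


# Step 1: initial population of 6 chains in 2 dimensions.
init = np.random.default_rng(1).normal(size=(6, 2))
history = de_mc(toy_log_post, init, n_iter=2000)
z_score = geweke_z(history[1000:, 0, 0])  # Step 4 on one chain, one variable
```

In the inversion setting, `toy_log_post` would be replaced by the log-posterior evaluated through the surrogate model. A Geweke score of small magnitude suggests that the early and late parts of the chain agree.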
3. Case Study
3.1. Case Description and Generalization
3.2. Variable to Be Identified
3.3. Establish Surrogate Model
- (1) Acquiring training and testing datasets. Based on the prior probability distributions and initial estimation ranges in Table 2, the QMC method introduced in Section 2.2 was used to draw 800 samples of the 19 unknown variables. These input samples were fed into the numerical simulation model of multiphase flow to obtain the corresponding output samples at wells W1~W6. Together, the input and output samples formed 800 input–output pairs, which were randomly split at a ratio of 8:2 into 640 training samples and 160 testing samples.
- (2) Construction of individual surrogate models. Individual surrogate models for each of the output locations, W1~W6, were developed using the KRG, KELM, MLP, RF, and DBN algorithms, as detailed in Section 2.3. The correlation function of KRG and the kernel function of KELM were both chosen as Gaussian functions. The MLP had five hidden layers with 128, 256, 256, 128, and 128 neurons, respectively. The RF consisted of 150 decision trees. The DBN had three RBMs with 300, 350, and 200 neurons, respectively.
- (3) Establishment of the Stacking ensemble surrogate model. Of the five individual surrogate models, the MLP, RF, and DBN models, which exhibited higher accuracy, were selected as BLMs. Meta-data were generated from the BLMs’ outputs. Subsequently, a three-layer MLP model with 128 neurons in each layer served as the MLM to complete the construction of the Stacking ensemble surrogate model.
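The workflow in steps (1) to (3) can be illustrated with scikit-learn. This is a sketch under several assumptions rather than the authors' code. The data are synthetic stand-ins for the simulation-model outputs at a single well. The base MLP is made smaller than the one described above so that the example runs quickly. scikit-learn does not ship a DBN regressor, so only the MLP and RF base learners are stacked. "Three-layer" is read as three hidden layers for the meta-learner.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for 800 input-output pairs with 19 inputs (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(size=(800, 19))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]

# Step 1: random 8:2 split into 640 training and 160 testing samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: base-level models (BLMs). Layer sizes are reduced for speed.
base_learners = [
    ("mlp", MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                         random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=150, random_state=0)),
]

# Step 3: meta-level model (MLM) trained on cross-validated BLM predictions.
meta_learner = MLPRegressor(hidden_layer_sizes=(128, 128, 128), max_iter=500,
                            random_state=0)
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=meta_learner, cv=5)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
test_r2 = stack.score(X_test, y_test)  # coefficient of determination
```

`StackingRegressor` builds the meta-data from out-of-fold predictions, which keeps the meta-learner from being trained on outputs that the base learners produced for their own training samples. One such ensemble would be fitted per observation well.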
3.4. Application of the DE-MC Method
4. Results
4.1. Performance of the Stacking Ensemble Surrogate Model
4.2. DE-MC Inversion Results
5. Conclusions and Discussion
- (1) The QMC method can be employed to sample from the prior space of the unknown variables, yielding a more uniformly distributed and representative sample. This approach enhanced the quality of the training data, leading to a more accurate and reliable surrogate model.
- (2) Compared to the individual surrogate models, the Stacking ensemble surrogate model approximated the simulation model more accurately, with an average R² value of 0.9950. Moreover, compared to the inversion process based on the simulation model, the computational burden was reduced by 99.56%.
- (3) The DE-MC inversion algorithm was effective in solving the 19-dimensional inversion problem. The average relative error for the eight unknown variables, excluding permeability, was 2.79%. For the 11 unknown variables characterizing the log-permeability field, the average relative error was 4.54%.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Huntley, D.; Beckett, G. Persistence of LNAPL sources: Relationship between risk reduction and LNAPL recovery. J. Contam. Hydrol. 2002, 59, 3–26.
- Johnston, C.D.; Trefry, M.G. Characteristics of light nonaqueous phase liquid recovery in the presence of fine-scale soil layering. Water Resour. Res. 2009, 45, 5412.
- Li, J.; Pang, Z.; Liu, Y.; Hu, S.; Jiang, W.; Tian, L.; Yang, G.; Jiang, Y.; Jiao, X.; Tian, J. Changes in groundwater dynamics and geochemical evolution induced by drainage reorganization: Evidence from ⁸¹Kr and ³⁶Cl dating of geothermal water in the Weihe Basin of China. Earth Planet. Sci. Lett. 2023, 623, 118425.
- Tomlinson, D.W.; Rivett, M.O.; Wealthall, G.P.; Sweeney, R.E. Understanding complex LNAPL sites: Illustrated handbook of LNAPL transport and fate in the subsurface. J. Environ. Manag. 2017, 204, 748–756.
- Moghaddam, M.B.; Mazaheri, M.; Samani, J.M.V. Inverse modeling of contaminant transport for pollution source identification in surface and groundwaters: A review. Groundw. Sustain. Dev. 2021, 15, 100651.
- Li, J.; Lu, W.; Fan, Y. Groundwater Pollution Sources Identification Based on Hybrid Homotopy-Genetic Algorithm and Simulation Optimization. Environ. Eng. Sci. 2021, 38, 777–788.
- Singh, R.M.; Datta, B. Identification of Groundwater Pollution Sources Using GA-based Linked Simulation Optimization Model. J. Hydrol. Eng. 2006, 11, 1216–1227.
- Chang, Z.; Lu, W.; Wang, H.; Li, J.; Luo, J. Simultaneous identification of groundwater contaminant sources and simulation of model parameters based on an improved single-component adaptive Metropolis algorithm. Hydrogeol. J. 2021, 29, 859–873.
- Zanini, A.; Woodbury, A.D. Contaminant source reconstruction by empirical Bayes and Akaike’s Bayesian Information Criterion. J. Contam. Hydrol. 2016, 185–186, 74–86.
- Wang, Z.; Lu, W.; Chang, Z.; Wang, H. Simultaneous identification of groundwater contaminant source and simulation model parameters based on an ensemble Kalman filter—Adaptive step length ant colony optimization algorithm. J. Hydrol. 2022, 605, 127352.
- Zhang, J.; Zheng, Q.; Wu, L.; Zeng, L. Using Deep Learning to Improve Ensemble Smoother: Applications to Subsurface Characterization. Water Resour. Res. 2020, 56, e2020WR027399.
- Forrester, A.I.; Keane, A.J. Recent advances in surrogate-based optimization. Prog. Aerosp. Sci. 2009, 45, 50–79.
- Queipo, N.V.; Haftka, R.T.; Shyy, W.; Goel, T.; Vaidyanathan, R.; Tucker, P.K. Surrogate-based analysis and optimization. Prog. Aerosp. Sci. 2005, 41, 1–28.
- Asher, M.J.; Croke, B.F.W.; Jakeman, A.J.; Peeters, L.J.M. A review of surrogate models and their application to groundwater modeling. Water Resour. Res. 2015, 51, 5957–5973.
- Degen, D.; Voullième, D.C.; Buiter, S.; Franssen, H.-J.H.; Vereecken, H.; González-Nicolás, A.; Wellmann, F. Perspectives of physics-based machine learning strategies for geoscientific applications governed by partial differential equations. Geosci. Model Dev. 2023, 16, 7375–7409.
- Mignot, E.; Dewals, B. Hydraulic modelling of inland urban flooding: Recent advances. J. Hydrol. 2022, 609, 127763.
- Zhao, Y.; Qu, R.; Xing, Z.; Lu, W. Identifying groundwater contaminant sources based on a KELM surrogate model together with four heuristic optimization algorithms. Adv. Water Resour. 2020, 138, 103540.
- Yongkai, A.; Wenxi, L.; Weiguo, C. Surrogate Model Application to the Identification of Optimal Groundwater Exploitation Scheme Based on Regression Kriging Method—A Case Study of Western Jilin Province. Int. J. Environ. Res. Public Health 2015, 12, 8897–8918.
- Pan, F.; Zhu, P.; Zhang, Y. Metamodel-based lightweight design of B-pillar with TWB structure via support vector regression. Comput. Struct. 2010, 88, 36–44.
- Wang, Z.; Lu, W.; Chang, Z.; Luo, J. A combined search method based on a deep learning combined surrogate model for groundwater DNAPL contamination source identification. J. Hydrol. 2023, 616, 128854.
- Laloy, E.; Hérault, R.; Jacques, D.; Linde, N. Training-Image Based Geostatistical Inversion Using a Spatial Generative Adversarial Neural Network. Water Resour. Res. 2018, 54, 381–406.
- Jeong, J.; Park, E. Comparative applications of data-driven models representing water table fluctuations. J. Hydrol. 2019, 572, 261–273.
- Sun, W.; Trevor, B. A stacking ensemble learning framework for annual river ice breakup dates. J. Hydrol. 2018, 561, 636–650.
- Heddam, S.; Ptak, M.; Zhu, S. Modelling of daily lake surface water temperature from air temperature: Extremely randomized trees (ERT) versus Air2Water, MARS, M5Tree, RF and MLPNN. J. Hydrol. 2020, 588, 125130.
- Wu, Y.; Ke, Y.; Chen, Z.; Liang, S.; Zhao, H.; Hong, H. Application of alternating decision tree with AdaBoost and bagging ensembles for landslide susceptibility mapping. Catena 2020, 187, 104396.
- Arsenault, R.; Gatien, P.; Renaud, B.; Brissette, F.; Martel, J.-L. A comparative analysis of 9 multi-model averaging approaches in hydrological continuous streamflow simulation. J. Hydrol. 2015, 529, 754–767.
- Ouyang, Q.; Lu, W.; Lin, J.; Deng, W.; Cheng, W. Conservative strategy-based ensemble surrogate model for optimal groundwater remediation design at DNAPLs-contaminated sites. J. Contam. Hydrol. 2017, 203, 1–8.
- Xing, Z.; Qu, R.; Zhao, Y.; Fu, Q.; Ji, Y.; Lu, W. Identifying the release history of a groundwater contaminant source based on an ensemble surrogate model. J. Hydrol. 2019, 572, 501–516.
- Yin, J.; Tsai, F.T.-C. Bayesian set pair analysis and machine learning based ensemble surrogates for optimal multi-aquifer system remediation design. J. Hydrol. 2020, 580, 124280.
- Xie, Y.; Sun, W.; Ren, M.; Chen, S.; Huang, Z.; Pan, X. Stacking ensemble learning models for daily runoff prediction using 1D and 2D CNNs. Expert Syst. Appl. 2023, 217, 119469.
- Zounemat-Kermani, M.; Batelaan, O.; Fadaee, M.; Hinkelmann, R. Ensemble machine learning paradigms in hydrology: A review. J. Hydrol. 2021, 598, 126266.
- Jiang, X.; Ma, R.; Wang, Y.; Gu, W.; Lu, W.; Na, J. Two-stage surrogate model-assisted Bayesian framework for groundwater contaminant source identification. J. Hydrol. 2021, 594, 125955.
- Mo, S.; Lu, D.; Shi, X.; Zhang, G.; Ye, M.; Wu, J.; Wu, J. A Taylor Expansion-Based Adaptive Design Strategy for Global Surrogate Modeling with Applications in Groundwater Modeling. Water Resour. Res. 2017, 53, 10802–10823.
- Yu, J.; Ai, M.; Ye, Z. A review on design inspired subsampling for big data. Stat. Pap. 2024, 65, 467–510.
- Flowers-Cano, R.S.; Ortiz-Gómez, R.; León-Jiménez, J.E.; Rivera, R.L.; Cruz, L.A.P. Comparison of Bootstrap Confidence Intervals Using Monte Carlo Simulations. Water 2018, 10, 166.
- Davey, K.R. Latin Hypercube Sampling and Pattern Search in Magnetic Field Optimization Problems. IEEE Trans. Magn. 2008, 44, 974–977.
- Delshad, M.; Pope, G.; Sepehrnoori, K. A compositional simulator for modeling surfactant enhanced aquifer remediation, 1 formulation. J. Contam. Hydrol. 1996, 23, 303–327.
- He, L.; Valocchi, A.J.; Duarte, C. An adaptive global–local generalized FEM for multiscale advection–diffusion problems. Comput. Methods Appl. Mech. Eng. 2024, 418, 116548.
- Bratley, P.; Fox, B.L.; Niederreiter, H. Programs to generate Niederreiter’s low-discrepancy sequences. ACM Trans. Math. Softw. 1994, 20, 494–495.
- Vandewoestyne, B.; Cools, R. On the convergence of quasi-random sampling/importance resampling. Math. Comput. Simul. 2010, 81, 490–505.
- Sobol, I. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Comput. Math. Math. Phys. 1969, 7, 784–802.
- Wolpert, D.H. Stacked Generalization. Neural Netw. 1992, 5, 241–259.
- Li, J.; Lu, W.; Wang, H.; Bai, Y.; Fan, Y. Groundwater contamination sources identification based on kernel extreme learning machine and its effect due to wavelet denoising technique. Environ. Sci. Pollut. Res. 2020, 27, 34107–34120.
- Oliver, M.A.; Webster, R. A tutorial guide to geostatistics: Computing and modelling variograms and kriging. Catena 2014, 113, 56–69.
- Gholami, V.; Chau, K.; Fadaee, F.; Torkaman, J.; Ghaffari, A. Modeling of groundwater level fluctuations using dendrochronology in alluvial aquifers. J. Hydrol. 2015, 529, 1060–1069.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Hinton, G.E.; Osindero, S.; Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554.
- ter Braak, C.J.F.; Vrugt, J.A. Differential Evolution Markov Chain with snooker updater and fewer chains. Stat. Comput. 2008, 18, 435–446.
- Brooks, S.P.; Roberts, G.O. Convergence assessment techniques for Markov chain Monte Carlo. Stat. Comput. 1998, 8, 319–335.
- Bai, Y.; Lu, W.; Li, J.; Chang, Z.; Wang, H. Groundwater contamination source identification using improved differential evolution Markov chain algorithm. Environ. Sci. Pollut. Res. 2022, 29, 19679–19692.
- Laloy, E.; Rogiers, B.; Vrugt, J.A.; Mallants, D.; Jacques, D. Efficient posterior exploration of a high-dimensional groundwater model from two-stage Markov chain Monte Carlo simulation and polynomial chaos expansion. Water Resour. Res. 2013, 49, 2664–2682.

Parameter | Value
---|---
Water density (kg·m⁻³) | 1000
Benzene density (kg·m⁻³) | 875
Benzene/water interfacial tension (dyne·cm⁻¹) | 34.21
Solubility of benzene in water at 30 °C (mg·L⁻¹) | 1800
Water viscosity (Pa·s) | 0.001
Benzene viscosity (Pa·s) | 0.000647
Residual water saturation | 0.24
Residual chlorobenzene saturation | 0.18

Variable Type | Unknown Variable | True Value | Prior Distribution | Initial Estimation Range
---|---|---|---|---
Model parameters | Porosity n | 0.27 | Uniform distribution | (0.25, 0.35)
Model parameters | Longitudinal aqueous-phase dispersion αwL (m) | 43.4 | Uniform distribution | (40, 60)
Model parameters | Transverse aqueous-phase dispersion αwT (m) | 11.4 | Uniform distribution | (9, 15)
Model parameters | Permeability ξ1~ξ11 (mD) | Figure 8 | | (2100, 4100)
Source characteristics | Horizontal coordinate Lx (m) | 372 | Uniform distribution | (60, 390)
Source characteristics | Vertical coordinate Ly (m) | 241 | Uniform distribution | (200, 470)
Source characteristics | Initial release moment ton (d) | 599.55 | Uniform distribution | (0, 1080)
Source characteristics | Termination release moment toff (d) | 3980.44 | Uniform distribution | (3240, 4320)
Source characteristics | Release intensity Q (m³/d) | 2.96 | Uniform distribution | (1, 10)

Well | Index | KELM | KRG | RF | MLP | DBN
---|---|---|---|---|---|---
W1 | R² | 0.9380 | 0.8504 | 0.9448 | 0.9532 | 0.9839
W1 | RMSE (μg/L) | 32.82 | 30.06 | 26.41 | 24.27 | 15.19
W2 | R² | 0.8874 | 0.8882 | 0.9456 | 0.9682 | 0.9854
W2 | RMSE (μg/L) | 29.20 | 17.43 | 16.36 | 14.73 | 11.69
W3 | R² | 0.8508 | 0.8568 | 0.9254 | 0.9450 | 0.9670
W3 | RMSE (μg/L) | 177.86 | 88.21 | 60.05 | 58.84 | 57.60
W4 | R² | 0.8785 | 0.8631 | 0.9208 | 0.9752 | 0.9866
W4 | RMSE (μg/L) | 39.20 | 33.80 | 32.30 | 31.83 | 31.14
W5 | R² | 0.8504 | 0.8464 | 0.9227 | 0.9753 | 0.9841
W5 | RMSE (μg/L) | 40.25 | 36.86 | 32.84 | 32.64 | 32.21
W6 | R² | 0.8681 | 0.8120 | 0.9009 | 0.9570 | 0.9841
W6 | RMSE (μg/L) | 19.18 | 18.92 | 18.27 | 15.31 | 12.48

Well | Index | RF (BLM) | MLP (BLM) | DBN (BLM) | Stacking (ensemble model)
---|---|---|---|---|---
W1 | R² | 0.9448 | 0.9532 | 0.9839 | 0.9975
W1 | RMSE (μg/L) | 26.41 | 24.27 | 15.19 | 10.48
W2 | R² | 0.9456 | 0.9682 | 0.9854 | 0.9950
W2 | RMSE (μg/L) | 16.36 | 14.73 | 11.69 | 8.72
W3 | R² | 0.9254 | 0.9450 | 0.9670 | 0.9896
W3 | RMSE (μg/L) | 60.05 | 58.84 | 57.60 | 35.66
W4 | R² | 0.9208 | 0.9752 | 0.9866 | 0.9969
W4 | RMSE (μg/L) | 32.30 | 31.83 | 31.14 | 28.41
W5 | R² | 0.9227 | 0.9753 | 0.9841 | 0.9943
W5 | RMSE (μg/L) | 32.84 | 32.64 | 32.21 | 30.58
W6 | R² | 0.9009 | 0.9570 | 0.9841 | 0.9972
W6 | RMSE (μg/L) | 18.27 | 15.31 | 12.48 | 10.22
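
For reference, the two accuracy indices reported in these tables can be computed as sketched below. The sketch assumes the standard definitions of the coefficient of determination and the root mean square error. The function names and the four-point example are illustrative only and are not taken from the study.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the output variable."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values: simulation-model outputs versus surrogate outputs.
simulated = [1.0, 2.0, 3.0, 4.0]
surrogate = [1.1, 1.9, 3.2, 3.8]
example_r2 = r_squared(simulated, surrogate)
example_rmse = rmse(simulated, surrogate)
```

An R² close to 1 and a small RMSE both indicate that the surrogate reproduces the simulation-model outputs closely. RMSE is scale-dependent, so it is best compared between models at the same well.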

Variable | True Value | MAP Value | Relative Error | Posterior Distribution | Confidence Interval Lower (P0.025) | Confidence Interval Upper (P0.975)
---|---|---|---|---|---|---
n | 0.27 | 0.276 | 2.22% | N(0.286, 0.023²) | 0.25 | 0.332
αwL (m) | 43.4 | 43.15 | 0.58% | N(45.61, 2.44²) | 40.73 | 50.49
αwT (m) | 11.4 | 10.91 | 4.30% | N(11.37, 1.47²) | 9 | 14.31
Lx (m) | 372 | 369.85 | 0.58% | N(335.76, 42.70²) | 250.36 | 390
Ly (m) | 241 | 235.79 | 2.16% | N(237.50, 21.12²) | 200 | 279.74
ton (d) | 599.55 | 623.57 | 4.01% | N(554.68, 102.42²) | 349.84 | 759.52
toff (d) | 3980.44 | 3844.66 | 3.41% | N(3862.47, 109.52²) | 3643.47 | 4050
Q (m³/d) | 2.96 | 2.81 | 5.07% | N(3.05, 1.63²) | 1 | 6.31
Mean relative error (MRE) | | | 2.79% | | |

Variable | True Value | MAP Value | Relative Error
---|---|---|---
ζ1 | −0.13236 | −0.12384 | 6.44%
ζ2 | −0.37932 | −0.39072 | 3.01%
ζ3 | 0.05058 | 0.047864 | 5.37%
ζ4 | 0.13431 | 0.127399 | 5.15%
ζ5 | −1.54757 | −1.65037 | 6.64%
ζ6 | −0.25961 | −0.2593 | 0.12%
ζ7 | 0.38938 | 0.36831 | 5.41%
ζ8 | 0.86368 | 0.846544 | 1.98%
ζ9 | 0.19121 | 0.192888 | 0.88%
ζ10 | −0.29933 | −0.32131 | 7.34%
ζ11 | −1.54274 | −1.65988 | 7.59%
Mean relative error (MRE) | | | 4.54%
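
The relative errors and mean relative errors (MRE) in the two inversion tables can be reproduced from the tabulated true and MAP values. The short check below assumes the relative error is defined as |MAP − true| / |true|, which matches the reported percentages. The variable names are illustrative.

```python
import numpy as np


def relative_errors_percent(true_values, map_values):
    """Relative error of each MAP estimate, in percent."""
    true_values = np.asarray(true_values, dtype=float)
    map_values = np.asarray(map_values, dtype=float)
    return np.abs(map_values - true_values) / np.abs(true_values) * 100.0


# n, alpha_wL, alpha_wT, Lx, Ly, t_on, t_off, Q
param_true = [0.27, 43.4, 11.4, 372.0, 241.0, 599.55, 3980.44, 2.96]
param_map = [0.276, 43.15, 10.91, 369.85, 235.79, 623.57, 3844.66, 2.81]

# zeta_1 ... zeta_11 (log-permeability field variables)
zeta_true = [-0.13236, -0.37932, 0.05058, 0.13431, -1.54757, -0.25961,
             0.38938, 0.86368, 0.19121, -0.29933, -1.54274]
zeta_map = [-0.12384, -0.39072, 0.047864, 0.127399, -1.65037, -0.2593,
            0.36831, 0.846544, 0.192888, -0.32131, -1.65988]

mre_params = relative_errors_percent(param_true, param_map).mean()
mre_zeta = relative_errors_percent(zeta_true, zeta_map).mean()
print(f"MRE, eight non-permeability variables: {mre_params:.2f}%")
print(f"MRE, eleven permeability variables: {mre_zeta:.2f}%")
```

Rounded to two decimals, the two means agree with the 2.79% and 4.54% reported in the tables and in the conclusions.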
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Bai, Y.; Lu, W.; Wang, Z.; Xu, Y. Groundwater LNAPL Contamination Source Identification Based on Stacking Ensemble Surrogate Model. Water 2024, 16, 2274. https://doi.org/10.3390/w16162274