Coupling Bayesian Optimization with Generalized Linear Mixed Models for Managing Spatiotemporal Dynamics of Sediment PFAS

Evrendilek, Fatih; Evrendilek, Gulsun Akdemir

doi:10.3390/pr14030413


by Fatih Evrendilek * and Gulsun Akdemir Evrendilek

University of Maine Cooperative Extension, Orono, ME 04469, USA

* Author to whom correspondence should be addressed.

Processes 2026, 14(3), 413; https://doi.org/10.3390/pr14030413

Submission received: 27 December 2025 / Revised: 19 January 2026 / Accepted: 22 January 2026 / Published: 24 January 2026 / Corrected: 10 February 2026

(This article belongs to the Special Issue Process and Data-Driven Models of Pollutant Transport and Fate in Ecosystems)


Abstract

Conventional descriptive statistical approaches in per- and polyfluoroalkyl substance (PFAS) environmental forensics often fail under small-sample, ecosystem-level complexity, challenging the optimization of sampling, monitoring, and remediation strategies. This study presents an advance from passive description to adaptive decision-support for complex PFAS contamination. By integrating Bayesian optimization (BO) via Gaussian Processes (GP) with a Generalized Linear Mixed Model (GLMM), we developed a signal-extraction framework for both understanding and action from limited data (n = 18). The BO/GP model achieved strong predictive performance (GP leave-one-out R² = 0.807), while the GLMM confirmed significant overdispersion (1.62), indicating a patchy contamination distribution. The integrated analysis suggested a dominant spatiotemporal interaction: a transient, high-intensity perfluorooctane sulfonate (PFOS) plume that peaked at a precise location during early November (the autumn recharge period). Concurrently, the GLMM identified significant intra-sample variance (p = 0.0186), suggesting likely particulate-bound (colloid/sediment) transport, and detected n-ethyl perfluorooctane sulfonamidoacetic acid (NEtFOSAA) as a critical precursor (p < 0.0001), thus providing evidence consistent with the source as historic 3M aqueous film-forming foam. This coupled approach creates a dynamic, iterative decision-support system where signal-based diagnosis informs adaptive optimization, enabling mission-specific actions from targeted remediation to monitoring design.

Keywords: source apportionment; Gaussian process; environmental forensics; adaptive management; remediation

1. Introduction

Globally, contamination by per- and polyfluoroalkyl substances (PFAS) represents a pressing challenge for public health and environmental monitoring due to their persistence, mobility, and complex behavior [1]. Predicting the spatiotemporal transport and fate of PFAS is particularly difficult due to intricate interactions between their physicochemical properties and heterogeneous subsurface environments [2,3]. The high costs of PFAS sampling and analysis often result in sparse datasets, posing challenges for extracting reliable environmental signals and limiting the optimization of sampling, monitoring, and remediation strategies [4,5]. Standard approaches like generalized regression can characterize historical data but are inherently limited in providing actionable insights into where and when to sample or intervene to maximize information gain or remediation effectiveness [6].

To overcome these small-sample limitations, we propose and demonstrate a novel integration of two powerful statistical paradigms: (1) Bayesian optimization (BO) via Gaussian Processes (GP) for spatiotemporal learning and prediction and (2) Generalized Linear Mixed Models (GLMM) for diagnosis of source mechanisms and chemical signatures. BO/GP actively navigates the unsampled parameter space, balancing exploration of high-uncertainty areas with exploitation of suspected high-concentration zones, thus dynamically optimizing sampling strategies [7]. In parallel, the GLMM partitions variance into fixed and random components, identifies key drivers, and quantifies unstructured heterogeneity (e.g., sample-specific variance), thus providing a statistical proxy for signal-based understanding of pollutant source and transport processes [8]. While BO/GP is typically employed for spatial optimization alone, its integration with GLMM, informed by principles from spatiotemporal statistics [9] and model-based geostatistics [10], creates a synergistic loop. GLMM diagnoses the mechanistic why, which informs the BO/GP configuration to accurately predict the where and when; these predictions can then be validated and refined using new GLMM diagnostics from subsequently collected data. This study fills this knowledge gap by coupling mechanistic diagnostics with active-learning optimization for environmental forensics.

This study presents the first application of a coupled BO/GP-GLMM framework for actionable PFAS forensics by transforming limited environmental data into data-informed decision-support. We test the hypothesis that the BO/GP-GLMM coupling can simultaneously achieve robust prediction, optimize adaptive sampling strategies, and deliver inferences about the contamination source. The specific objectives of this study were to: (1) develop a BO framework with a GP surrogate model for predicting PFAS contamination and optimizing its sampling and management; (2) use GLMM to extract key signals from spatiotemporal-chemical interactions and random effects governing contamination patterns; and (3) demonstrate how different BO/GP-driven optimization scenarios support specific, actionable management decisions. Therefore, this integration enables a shift from descriptive analysis to adaptive decision-support for PFAS management.

2. Materials and Methods

2.1. Site Characterization and Analytical Protocol

Sediment samples (n = 18) were collected from three locations (S1–S3) in a PFAS-impacted estuarine sediment system characterized by heterogeneous hydrogeology in Harpswell Cove (ME, USA). Sites were selected to span a hydrodynamic gradient from tidally energetic to quiescent conditions. S1 is located in a constricted channel section likely subject to stronger tidal currents; S2 occupies a broader central channel area with moderate tidal exchange; and S3 is positioned within a protected embayment, suggestive of reduced flushing. Sampling spanned a period encompassing seasonal hydrological variations between 17 October 2024 and 11 June 2025. All samples were analyzed for a suite of 40 PFAS compounds using liquid chromatography with tandem mass spectrometry (LC-MS/MS) (LC-40D XS and LCMS-8050 triple quadrupole; Shimadzu Scientific Instruments; Kyoto, Japan) following EPA Method 1633A [11].

2.2. Predictor Engineering and Data Architecture

For each sample, latitude (Lat) and longitude (Lon) were recorded in decimal degrees via GPS with sub-meter accuracy. The day of year (DOY) was transformed into cyclical coordinates using a harmonic transformation to account for seasonal periodicity without imposing artificial discontinuities (e.g., between 31 December and 1 January) as follows:

DOY_sin = sin(2π × DOY/365), DOY_cos = cos(2π × DOY/365)    (1)

This transformation maximizes seasonal signal extraction from limited temporal sampling points. DOY_sin and DOY_cos are the seasonal harmonic terms capturing seasonality (autumn/winter phase dynamics) and solstice phase offset (early vs. late autumn), respectively, thus helping to distinguish critical transport mechanisms (e.g., snowmelt vs. rainfall). A categorical variable (PFAS) was used to specify the target analyte, such as perfluorooctane sulfonate (PFOS) and perfluorooctanoic acid (PFOA), or their sum (ΣPFAS).
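As a minimal sketch of Equation (1) in Python (the function name `doy_harmonics` is illustrative and is not part of the study's JMP workflow), the harmonic encoding places 31 December and 1 January next to each other on the unit circle even though their DOY values differ by 364:

```python
import math

def doy_harmonics(doy: float, period: float = 365.0) -> tuple[float, float]:
    """Return (DOY_sin, DOY_cos) for a day of year, following Equation (1)."""
    angle = 2.0 * math.pi * doy / period
    return math.sin(angle), math.cos(angle)

# 31 December (DOY 365) and 1 January (DOY 1) are far apart as raw integers,
# yet nearly coincide in harmonic coordinates.
dec31 = doy_harmonics(365)
jan01 = doy_harmonics(1)
gap = math.dist(dec31, jan01)  # chord length between the two points
```

Because every (DOY_sin, DOY_cos) pair lies on the unit circle, the two terms together identify a date uniquely within the year, which a single sine or cosine term alone cannot do.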

PFAS data were transformed from a wide format (18 rows by 40 columns, representing 18 physical samples and 40 PFAS) to a long format (720 rows: 18 Sample ID × 40 PFAS). This signal-enhancing structure preserved Sample ID as a random-effect grouping variable, stacked all PFAS concentrations into a single response column (Conc), and created a corresponding PFAS identifier column. This architecture properly accounted for the non-independence of multiple chemical measurements from a single physical sample.
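The wide-to-long restructuring can be sketched in plain Python as follows. This is an illustration only: the analyte labels (`PFAS_01` to `PFAS_40`) and the empty concentration values are placeholders, not the study's data, and in practice a data-frame operation such as `pandas.melt` would perform the same step.

```python
def wide_to_long(wide_rows, id_key="Sample ID"):
    """Stack every analyte column into (Sample ID, PFAS, Conc) records."""
    long_rows = []
    for row in wide_rows:
        for analyte, conc in row.items():
            if analyte == id_key:
                continue  # the grouping variable is carried along, not stacked
            long_rows.append({id_key: row[id_key], "PFAS": analyte, "Conc": conc})
    return long_rows

# Placeholder table: 18 samples x 40 analytes, concentrations left empty.
analytes = [f"PFAS_{j:02d}" for j in range(1, 41)]
wide = [{"Sample ID": i, **{a: None for a in analytes}} for i in range(1, 19)]
long_table = wide_to_long(wide)
```

Each physical sample contributes 40 rows that share one Sample ID, which is what allows the random intercept in the GLMM to absorb sample-level dependence.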

2.3. Generalized Linear Mixed Models (GLMM)

A GLMM was fitted with a Gamma distribution and log link function. The model structure was thus:

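log(E[Conc | Sample ID]) = β₀ + β₁(Lat) + β₂(Lon) + β₃(DOY_sin) + β₄(DOY_cos) + β₅(PFAS identity) + (1|Sample ID)    (2)

where E[Conc | Sample ID] is the conditional mean concentration modeled through the log link, and (1|Sample ID) is a random intercept for each physical sample.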

The Gamma distribution with log link was selected as optimal for signal extraction from right-skewed concentration data, with model selection confirming its superior fit compared to alternatives like lognormal (Akaike information criterion corrected, AICc: 81.1 for Gamma vs. 85.1 for lognormal). The model included Sample ID as a random effect to quantify small-scale, within-location heterogeneity. Fixed effects included the engineered predictors (Lat, Lon, DOY_sin, and DOY_cos) and PFAS identity. The model’s purpose was diagnostic, to test for overdispersion, assess the significance of random effects, and identify key chemical discriminators (e.g., precursors).
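For readers less familiar with the criterion, the small-sample correction behind AICc and the size of the reported difference can be sketched as below. The helper `aicc` uses the standard formula; the log-likelihoods and parameter counts of the fitted models are not given in the text, so only the two reported AICc values are used.

```python
import math

def aicc(log_likelihood: float, k: int, n: int) -> float:
    """AIC with the small-sample correction, for k parameters and n observations."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Reported values from the model comparison above.
aicc_gamma, aicc_lognormal = 81.1, 85.1
delta = aicc_lognormal - aicc_gamma       # difference in favor of Gamma
relative_likelihood = math.exp(-delta / 2)  # lognormal relative to Gamma
```

A difference of about 4 AICc units corresponds to a relative likelihood of roughly 0.14 for the lognormal model, which is moderate rather than overwhelming support for the Gamma specification.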

Following model estimation, two diagnostics were performed to validate signal integrity, interpret its random structure, and enable the translation of statistically significant random effect variance into quantifiable insights about microscale contamination processes. First, best linear unbiased predictors (BLUPs) for the Sample ID random effect (Table S1) quantified the deviation of each individual sample from the population mean predicted by the fixed effects alone, providing a direct measure of within-site, sample-specific heterogeneity. Second, the fixed-effects correlation matrix was examined to assess potential multicollinearity (Table S2), ensuring the stability of parameter estimates for the spatiotemporal and chemical predictors.

For significant main effects or interactions identified by the model, post hoc pairwise comparisons were conducted using Tukey’s Honestly Significant Difference (HSD) test. The complete results of all pairwise comparisons are presented in Table S3. Model outputs are reported on the natural log scale during model fitting; all point estimates presented in the main text and tables were back-transformed to raw units (ng/g) unless explicitly labeled log-scale. All BO/GP-GLMM analyses were performed in JMP Pro 19.0.2 (JMP Statistical Discovery LLC, Cary, NC, USA).

2.4. Bayesian Optimization (BO)

We implemented a GP surrogate model within the BO framework. In this coupling, the GP provides a probabilistic posterior prediction surface across the spatiotemporal-PFAS domain. The acquisition function, expected improvement (EI), then uses this posterior (both its mean and uncertainty) to evaluate and propose the next optimal sampling location or management action, thus creating a learning loop that balances exploration and exploitation [7]. The GP surrogate model was implemented using a Matern 5/2 correlation function and an estimated nugget parameter (0.115) to prevent overfitting to local noise. The model used a Gamma likelihood with a log link function, appropriate for the positive, right-skewed concentration data. An anisotropic correlation function was employed, allowing length scales to vary independently across spatiotemporal dimensions, reflecting the physical reality of differential transport. A measurement-error term (estimated at 0.023) separated analytical noise from true environmental variance.
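To make the anisotropic Matern 5/2 structure concrete, the sketch below evaluates the textbook form of the correlation function with one length scale per input dimension. It is an assumption that this parameterization matches the one used internally by JMP Pro, and the length scales in the example are arbitrary illustrative numbers, not the fitted values in Table 1.

```python
import math
from typing import Sequence

def matern52_anisotropic(x: Sequence[float], x_prime: Sequence[float],
                         length_scales: Sequence[float]) -> float:
    """Matern 5/2 correlation with an independent length scale per dimension."""
    # Scaled distance: each coordinate difference is divided by its own length scale.
    r = math.sqrt(sum(((a - b) / l) ** 2
                      for a, b, l in zip(x, x_prime, length_scales)))
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

# Illustrative only: dimension 0 varies slowly (long scale), dimension 1 quickly.
scales = (10.0, 1.0)
corr_dim0 = matern52_anisotropic((0.0, 0.0), (1.0, 0.0), scales)
corr_dim1 = matern52_anisotropic((0.0, 0.0), (0.0, 1.0), scales)
```

The same unit offset leaves points strongly correlated along the long-scale dimension and much less so along the short-scale one, which is how anisotropy avoids over-smoothing across dimensions with different variability. A nugget would enter separately, as an addition to the diagonal of the resulting covariance matrix.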

The acquisition function guiding the BO was EI, defined for an input (x) as follows:

EI(x) = E[max(0, f(x) − f(x∗))]    (3)

where f(x∗) is the value of the current best observation. The algorithm proposes the next sampling point by maximizing EI(x), thus balancing high predicted mean and high prediction uncertainty [7,12], a principle directly extended to the management scenarios.
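When the posterior at x is Gaussian with mean μ(x) and standard deviation σ(x), the expectation in Equation (3) has a well-known closed form, EI = (μ − f*)Φ(z) + σφ(z) with z = (μ − f*)/σ. The sketch below implements that form for a maximization problem; applying it on the model's link (log) scale is an assumption made here for illustration, not a description of the software's internals.

```python
import math

def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty left: improvement is deterministic.
        return max(0.0, mu - best)
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (mu - best) * cdf + sigma * pdf

# Two candidates with the same predicted mean as the current best:
ei_certain = expected_improvement(mu=1.0, sigma=0.1, best=1.0)
ei_uncertain = expected_improvement(mu=1.0, sigma=1.0, best=1.0)
```

The first term rewards a high predicted mean (exploitation) and the second rewards high uncertainty (exploration), so the less certain of two otherwise identical candidates receives the larger EI.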

Validation was performed using leave-one-out cross-validation (LOO-CV). The LOO-CV R² and prediction error were used to quantify predictive accuracy and guard against overfitting. Residual analysis was conducted via an actual vs. LOO-predicted plot to verify homoscedasticity and lack of bias. The reliability of the GP model was further confirmed by examining the convergence of hyperparameter estimation and the reasonableness of predicted uncertainty surfaces. Sensitivity (predictor importance) analysis was performed using a functional analysis of variance (ANOVA) approach (dependent resampled inputs via Monte Carlo simulations; n = 5000) to partition variance into main effects (independent contributions) and total effects (interaction-inclusive contributions). Model interpretation and interaction visualization were augmented using SHapley Additive exPlanations (SHAP) values derived from the GP posterior to quantify predictor contributions.
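The LOO-CV R² statistic can be sketched generically as follows. The GP itself is not reproduced here; a simple least-squares line (`linear_fit_predict`, a hypothetical stand-in) plays the role of the model so that the example stays self-contained, and any function with the same signature could be substituted.

```python
from typing import Callable, Sequence

def linear_fit_predict(x_train: Sequence[float], y_train: Sequence[float],
                       x_new: float) -> float:
    """Stand-in model: ordinary least-squares line fitted to the training points."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    return mean_y + (sxy / sxx) * (x_new - mean_x)

def loo_r2(xs: Sequence[float], ys: Sequence[float],
           fit_predict: Callable[[list, list, float], float]) -> float:
    """R^2 computed from predictions made with each point held out in turn."""
    xs, ys = list(xs), list(ys)
    preds = []
    for i in range(len(xs)):
        # Refit without observation i, then predict it.
        preds.append(fit_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i]))
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

r2_exact = loo_r2([0, 1, 2, 3, 4], [1, 3, 5, 7, 9], linear_fit_predict)
```

Because each prediction comes from a model that never saw the held-out point, this statistic is lower than the in-sample R² whenever the model is fitting noise.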

The optimization phase transitioned the workflow from estimation to adaptive learning. We implemented a composite desirability function (D) based on the posterior predictive distribution of the GP as follows:

P(y | x) ~ N(μ(x), σ²(x))    (4)

where μ(x) is the predicted concentration and σ(x) is the prediction uncertainty at an unmeasured location x. We defined four management-oriented scenarios by setting objective functions on the two GP outputs: predicted mean concentration (Conc) and its standard deviation (SD, uncertainty) as follows:

(1) Maximize Conc; minimize SD (hotspot confirmation): Identify high-confidence areas for immediate remediation;

(2) Minimize both Conc and SD (coldspot confirmation): Identify high-confidence clean zones for resource allocation or public assurance;

(3) Maximize both Conc and SD (potential hotspot exploration): Target high-risk, data-poor areas for strategic sampling; and

(4) Ignore Conc; maximize SD (uncertainty reduction): explore to improve overall model fidelity.
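The four scenarios can be expressed as goal pairs on (Conc, SD) and combined into a single score. The sketch below uses linear, Derringer-Suich-style individual desirabilities combined by a geometric mean; this is an assumed form chosen for illustration, and the exact desirability functions used by the software may be shaped differently. All names and ranges are hypothetical.

```python
import math

# Goal for (Conc, SD) in each scenario listed above; None means "ignore".
SCENARIOS = {
    "hotspot_confirmation": ("max", "min"),
    "coldspot_confirmation": ("min", "min"),
    "hotspot_exploration": ("max", "max"),
    "uncertainty_reduction": (None, "max"),
}

def scale01(value: float, low: float, high: float) -> float:
    """Linearly rescale value to [0, 1] over the range [low, high], with clipping."""
    return min(1.0, max(0.0, (value - low) / (high - low)))

def composite_desirability(conc, sd, scenario, conc_range, sd_range) -> float:
    """Geometric mean of the individual desirabilities active in a scenario."""
    parts = []
    for value, goal, (low, high) in zip((conc, sd), SCENARIOS[scenario],
                                        (conc_range, sd_range)):
        if goal is None:
            continue  # this output does not enter the score
        d = scale01(value, low, high)
        parts.append(d if goal == "max" else 1.0 - d)
    return math.prod(parts) ** (1.0 / len(parts))

# A candidate with the highest predicted Conc and the lowest SD in the domain.
d_hot = composite_desirability(10.0, 0.0, "hotspot_confirmation", (0.0, 10.0), (0.0, 1.0))
d_cold = composite_desirability(10.0, 0.0, "coldspot_confirmation", (0.0, 10.0), (0.0, 1.0))
```

The same candidate scores at the top for hotspot confirmation and at the bottom for coldspot confirmation, which illustrates how one GP posterior can serve several management objectives by changing only the goal pair.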

2.5. Integrated Workflow

The coupling of GLMM and BO/GP was designed as a sequential, interpretative workflow to ensure that signal-based understanding informs adaptive prediction and optimization (Figure 1). The workflow proceeded through four phases: (1) GLMM diagnostics; (2) GP configuration; (3) BO; and (4) synthesis. This iterative synthesis provides a signal-grounded, spatiotemporally explicit decision-support tool. Prescriptive scenarios can define a new sampling campaign, and incorporating new data restarts the process, creating an adaptive loop in which understanding and action iteratively refine one another. A detailed, step-by-step application of this workflow to the present study is provided in the Supplementary Materials (S1).

3. Results and Discussion

This section presents and discusses the results of the two integrated modeling approaches applied to PFAS-impacted estuarine sediments, beginning with the targeted diagnostic and source-identification analysis using the GLMM and progressing to the comprehensive spatiotemporal prediction and optimization framework using the BO/GP model.

3.1. Model Performance and Diagnostic Foundation

The GLMM analysis provided the critical diagnostic foundation for signal interpretation from limited samples. The model used all 720 observations and converged without warnings. The generalized χ²/DF ratio was 1.62, indicating overdispersion and a patchy contamination distribution [8]. The GLMM identified a statistically significant random effect for Sample ID (variance = 0.486; p = 0.0186), revealing substantial intra-sample variance (termed a “bottle effect”, representing microscale, within-location heterogeneity), where concentrations could vary by ~50% between samples from the same spatiotemporal zones. The BLUP values (Table S1) revealed sample-specific deviations on the log scale ranging from −0.61 (Sample ID 4) to +1.61 (Sample ID 1), which back-transformed to concentration multipliers of approximately 0.54× to 4.99× relative to the fixed-effect prediction for a given spatiotemporal and chemical condition. For PFOS, this translated to an approximate concentration range of 1090–10,000 ng/g. This order-of-magnitude variance among samples from the same location and time was consistent with particulate-bound (colloid/sediment) transport, in which the stochastic capture of PFAS-laden sediment grains during sampling drives high intra-site heterogeneity [2]. In other words, PFAS were likely associated with suspended particulates or sediment-bound phases rather than being fully dissolved. This finding corroborates the overdispersed (patchy) nature of the plume indicated by the generalized χ²/DF ratio. Complementing this, analysis of the fixed-effects correlation matrix (Table S2) confirmed the absence of detrimental multicollinearity (all |r| < 0.9), supporting the robustness of the fixed-effect estimates. The moderate correlation between PFOS and PFOA (r = 0.36) underscored their independent behavior within the model, reinforcing that the extreme dominance of PFOS was a true environmental signal and not an artifact of multicollinearity.
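The back-transformation of the BLUPs is a direct consequence of the log link: a random-intercept deviation b on the log scale multiplies the fixed-effect prediction by exp(b). A short check using the two rounded extremes quoted above (the variable names are illustrative):

```python
import math

# Rounded log-scale BLUP extremes reported for Sample ID 4 and Sample ID 1.
blup_low, blup_high = -0.61, 1.61

mult_low = math.exp(blup_low)     # multiplier for the lowest sample
mult_high = math.exp(blup_high)   # multiplier for the highest sample
fold_span = mult_high / mult_low  # equals exp(blup_high - blup_low)
```

With these rounded inputs the multipliers come out near 0.54 and 5.0, so the highest sample sits roughly nine times above the lowest for the same fixed-effect prediction; the small difference from the reported 4.99× presumably reflects rounding of the published BLUPs.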

Building on GLMM diagnostics, the BO/GP model demonstrated high predictive performance, with an LOO-CV R² value of 0.807 (Table 1). The GP model comprised a mean function with seasonal harmonic terms and an anisotropic Matern kernel, with the parameters for Lon, Lat, and PFAS identity as the kernel’s characteristic length scales (ℓ), defining the correlation range in each dimension (Table 1). The low estimated measurement error (0.023) suggests that the dominant source of variance captured by the model was environmental heterogeneity rather than analytical noise. This supports the interpretation that the GP’s predictions captured spatiotemporal structure. As evidenced by the low measurement error and interpretable hyperparameters, the robust extraction of spatiotemporally stable PFAS signals from limited sampling (n = 18 spatiotemporal points) was achieved through three strategies to mitigate overfitting. First, the estimated nugget term (τ = 0.115) explicitly accounted for microscale variability and measurement noise, preventing the GP from fitting spurious short-range fluctuations and improving generalization. Second, the long-format data structure (18 samples × 40 PFAS = 720 observations) increased informational input without inflating spatial degrees of freedom. This allowed the model to learn covariance patterns across PFAS, space, and time rather than merely interpolating a sparse spatiotemporal grid. Third, an anisotropic Matern 5/2 kernel with independent length scales for spatiotemporal dimensions respected the differing variability scales, avoiding inappropriate smoothing across the sampled domains. However, the stationarity assumption and limited spatial coverage necessitate caution when the model is extrapolated beyond the sampled domain or under divergent ecosystem regimes.

3.2. Source Identification via GLMM

The GLMM’s fixed effects provided evidence supporting source apportionment (Table 2 and Table S4). The precursor compound NEtFOSAA was highly significant (estimate = 0.808; p < 0.0001), a chemical signature exclusive to legacy 3M aqueous film-forming foam (AFFF) manufactured between 1970 and 2000 [6]. This was accompanied by overwhelmingly dominant PFOS (estimate = 7.60; p < 0.0001) and substantial PFOA (estimate = 4.22; p < 0.0001). The ordering PFOS >> PFOA > NEtFOSAA is characteristic of a historic AFFF release (10–25 yr old) in which the precursor has partially degraded but remains detectable. Conversely, modern replacement compounds, including PFBA, PFBS, and fluorotelomers (e.g., 6:2 FTS and 8:2 FTS), all exhibited significantly negative estimates (estimate ≈ −2.34; p = 0.0006), indicating no detectable ongoing industrial or municipal inputs. This chemical signature is consistent with a legacy release of 3M AFFF as the contamination source and distinguishes it from chrome plating, textile manufacturing, or modern landfill leachate sources [1,13].

The GLMM also quantified the extreme concentration gradient between legacy and modern PFAS. Conversion of the GLMM’s log-scale least squares means to raw concentrations (ng/g) (Table S3) revealed that PFOS dominated the contamination profile at 2.35 ng/g, approximately 30 times higher than PFOA (0.079 ng/g) and over 20,000 times higher than background modern PFBA and 6:2 FTS (≈0.00011 ng/g). As verified by Tukey’s HSD multiple comparison test (Table S3), this extreme gradient underscores the site’s severe and isolated legacy AFFF burden. Derived from the GLMM, these conclusions provided the essential what and why dynamics that contextualized the spatiotemporal (where and when) signals identified by the BO/GP. It should be noted that the least squares means reported here (e.g., PFOS = 2.35 ng/g) represent population-averaged marginal means across all the sampled locations, seasons, and PFAS identities. These values therefore differed substantially from the much higher PFOS concentrations inferred in Section 3.1 using BLUP multipliers, which applied to hotspot-specific spatiotemporal conditions rather than global averages. In other words, Section 3.2 characterizes average chemical dominance across the dataset, whereas Section 3.1 describes localized peak intensities under high-risk conditions.
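The fold differences quoted above can be cross-checked in two ways: from the back-transformed least squares means, and from the difference of the log-scale fixed-effect estimates reported for PFOS and PFOA, since under a log link a difference of estimates corresponds to a ratio of means. The sketch uses only numbers stated in the text.

```python
import math

# Back-transformed least squares means (ng/g) as reported.
lsmean = {"PFOS": 2.35, "PFOA": 0.079, "background": 0.00011}

ratio_pfos_pfoa = lsmean["PFOS"] / lsmean["PFOA"]
ratio_pfos_background = lsmean["PFOS"] / lsmean["background"]

# Log-scale fixed-effect estimates as reported; their difference is a log ratio.
est_pfos, est_pfoa = 7.60, 4.22
ratio_from_estimates = math.exp(est_pfos - est_pfoa)
```

Both routes give a PFOS-to-PFOA ratio between 29 and 30, and the ratio to the background compounds exceeds 20,000, in line with the values stated in the paragraph.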

3.3. Spatiotemporal Dynamics via BO/GP

The BO/GP framework identified a dominant spatiotemporal interaction, characterizing it as a transient, high-intensity PFOS plume (termed a “pulsing bullseye”) that peaked at a precise geographic point exclusively during the early autumn recharge period. This pattern was elucidated through (1) predictor importance analysis and (2) SHAP values from the BO/GP framework (see Section 2.4). Based on Monte Carlo simulations, predictor importance analysis showed PFAS identity as the overwhelming driver of concentration variance (total effect = 0.978), followed by longitude (Lon, total effect = 0.095) (Table 3). This underscores that the extreme contamination was both compound-specific (PFOS) and highly localized. The large coefficient for Lon (407.81) and substantial GP variance (σ² = 0.204) confirmed the existence of a highly localized spatial hotspot, consistent with a point source. The significant coefficients for the seasonal harmonics (DOY_sin = 0.268; DOY_cos = 0.413; Table 1) confirmed a strong non-stationary seasonal driver, implicating hydrologic events in PFAS mobilization. The substantial interaction effects for these predictors (Table 3) demonstrate that PFAS contamination was governed by complex, non-additive relationships; specifically, the influence of location and season was strongly conditional on PFAS identity. For instance, the model showed that PFOS drove the extreme high-end concentrations, while the other compounds exhibited attenuated transport.

To deconstruct these complex interactions, SHAP interaction analysis clarified the spatiotemporal dynamics of the pulsing bullseye (Figure 2), evidenced by three key observations. First, multivariate SHAP analysis showed that high-concentration predictions (positive SHAP values > 0.5) were driven exclusively by PFOS (blue markers in Figure 2a). The remaining 39 analytes clustered near zero, indicating that they were unresponsive to the spatiotemporal conditions that mobilized the PFOS plume. Second, the SHAP dependence plots for location identified the spatial zone of the point-source hotspot, with the highest SHAP values centered near Lon −69.937 and Lat 43.86 (Figure 2b,c). Subsequent BO-derived scenario analyses (Section 3.4) refined this grid resolution-limited estimate to the precise coordinate of Lon −69.93449, Lat 43.849667. Finally, the convergence of negative SHAP values for DOY_sin and positive values for DOY_cos isolated the high-risk window to early November (late October/early November) (Figure 2d,e).

The early November timing of the PFOS pulse is consistent with the regional onset of autumn recharge dynamics. Increased autumn groundwater discharge to estuaries, as summer evapotranspiration decreases before winter, has been shown to remobilize sediment-associated contaminants, including PFAS, via colloid-facilitated transport [12,13]. The SHAP interaction (Figure 2b,c) shows that this seasonal trigger and the spatial hotspot were non-additive; namely, the high-intensity plume only occurred when both conditions coincided. This episodic mobilization helps to explain the pulsing nature of the bullseye and aligns with the GLMM’s diagnosis of particulate-bound transport. Consequently, the pulsing bullseye is mechanistically defined as the episodic mobilization of a particulate-bound PFOS hotspot by autumn recharge dynamics. The exclusivity of this pattern to PFOS further isolates the principal risk to this single, legacy AFFF-derived compound. The prescriptive scenario analyses (Section 3.4) also pinpointed the same coordinate (Lon −69.93449; Lat 43.849667), where PFOS concentrations were predicted to spike exclusively during the early November window (Figure 3).

3.4. Prescriptive Management Scenarios

The integration of BO/GP’s predictive surfaces with defined management objectives transformed the model from a descriptive tool into a signal-informed prescriptive system. This prescriptive power was quantified through two key outputs: (1) the model’s posterior mean prediction of PFAS concentration (Conc, ng/g); and (2) composite desirability function (D) (ranging from 0 to 1) derived from the GP posterior for a specific management scenario, which combines the goals for Conc and its prediction uncertainty (SD) as defined in Section 2.4. A higher D value indicated a location/time/PFAS configuration that better satisfied the combined objectives of the chosen scenario. The four scenario analyses yielded distinct, actionable guidance as follows:

(1) “Immediate remediation (max Conc; min SD)” converged on the confirmed northern coordinate (−69.93449, 43.849667) for PFOS in early November (DOY_sin = −0.96; DOY_cos = 0.53) (Figure 3). The model returned a peak posterior mean concentration of 35.3 ng/g (95% CI: 24.4–50.9 ng/g) (D = 0.999). This provides a precise, high-confidence target for source-zone intervention, aligning with recent reviews of PFAS remediation technologies [14,15,16].

(2) “Coldspot confirmation (min Conc and SD)” identified late January (DOY_sin = 0.36; DOY_cos = 0.96) at the specific location (−69.9371, 43.859592) as the spatiotemporal condition of minimum risk for NEtFOSAA, offering a scientific basis for designating areas of lower concern (Figure 4). The model predicted a mean concentration of 0.96 ng/g (95% CI: 0.58–2.03 ng/g) (D = 0.706). This suggests that cryostatic conditions in the winter effectively immobilized the particulate-bound contaminant.

(3) “Potential hotspot exploration (max Conc and SD)” flagged the same spatiotemporal conditions as Scenario 1 (November 1st: DOY_sin = −0.96; DOY_cos = 0.53), highlighting that the highest predicted risk was also the largest data gap. This corroborated the SHAP findings, pinpointing the maximum predicted concentration to day 304.

(4) “Uncertainty reduction (ignore Conc; max SD)” focused on other spatiotemporal regions of high model uncertainty, guiding the efficient expansion of the monitoring network to improve overall model robustness.

Figure 3 and Figure 4 demonstrate how the BO/GP predictions translate into spatiotemporally explicit guidance for each management objective. As illustrated in Figure 3, the convergence of the three scenarios (1, 3, and 4) on the same critical spatiotemporal conditions demonstrates that the highest known risk was also the greatest knowledge deficiency, thus creating a priority for allocating finite monitoring or remediation resources.

Drivers of Prediction Uncertainty and Implications for Monitoring

Understanding the drivers of prediction uncertainty (SD) provides critical insight for designing efficient monitoring campaigns. Performed via Monte Carlo dependent resampling, the sensitivity analysis indicated that location and PFAS identity exerted equal main effects on SD (0.228 each), substantially higher than the seasonal main effect (0.158) (Table 3). This contrast indicates that uncertainty was more strongly influenced by where we sampled and what we analyzed than by when we sampled. The significant, non-interacting contribution of PFAS identity suggests that each compound possessed intrinsic variability in its detection and distribution. To most efficiently improve overall model fidelity, sampling efforts should prioritize a wide spatial footprint and a comprehensive suite of PFAS compounds. While seasonal timing remains a factor, a spatially and chemically comprehensive baseline survey may reduce global uncertainty more effectively than a temporally intensive survey at a single point.

3.5. Synthesis and Limitations

The coupled BO/GP-GLMM framework integrates evidence toward the following narrative: The site was impacted by a historic release of 3M AFFF, as suggested by the PFOS/NEtFOSAA signature. The PFAS were likely associated with particulates (significant Sample ID variance). The plume was not static but manifested as a pulsing bullseye, where particulate-bound PFOS was mobilized during the early autumn recharge, creating a transient but severe contamination hotspot. This synthesis of signal-based mechanism (GLMM) and spatiotemporal dynamics (BO/GP) constitutes the core contribution of this integrated approach.

This integrated approach addresses key gaps in traditional environmental modeling under data constraints. It merges the diagnostic rigor and variance-partitioning capability of advanced statistical models (GLMM) with the spatiotemporal prediction and signal-aware optimization of machine learning (BO/GP). The GLMM provides diagnostic insight under a complex variance structure, while the BO/GP provides a non-parametric framework for spatiotemporal prediction and active learning. This is particularly useful in environmental forensics, where data are scarce, but the costs of poor decision-making are high. The framework’s primary strength lies in its ability to extract meaningful environmental signals from constrained sampling (n = 18). By converting spatiotemporal samples into a long-format matrix (720 observations), we maximized information extraction across PFAS, spatial, and temporal dimensions. This signal-enhancing architecture, combined with modeling of uncertainty sources (nugget terms and random effects), allows for inference despite limited spatiotemporal coverage.

The generalizability of this approach is subject to specific limitations and data requirements. Its effectiveness hinges on how well the initial dataset captures the system’s spatiotemporal patterns. While the long-format structure leverages multi-analyte covariance to extract signal, a baseline survey across key environmental gradients is required to define the domain’s correlation structure. Future refinements could incorporate site-specific covariates (e.g., biota, sediment organic carbon, and hydraulic conductivity) to improve mechanistic insight and predictive accuracy. Performance may degrade if sampling misses non-stationary events (e.g., storms that alter transport pathways) or omits source-diagnostic PFAS compounds. Predictions are reliable for interpolation within the sampled domain; extrapolation beyond it requires caution. Future work should include systematic benchmarking against standalone models (e.g., ordinary kriging, standard GLMs, and machine learning surrogates) to quantify the added value of integration in terms of prediction accuracy, uncertainty quantification, and cost-effectiveness of sampling designs. Furthermore, the stationary kernel chosen for the GP model assumes that the underlying covariance structure does not change across the domain, an assumption that may not hold over long timescales or under changing hydrological regimes. We therefore recommend that future applications begin with a reconnaissance survey designed to capture environmental extremes, so that the GP kernel can be configured properly and the stationarity assumption is reasonable. An iterative implementation is advised: use initial results to guide the next sampling campaign, incorporate new data to update the models, and refine management prescriptions in an adaptive loop. This approach is not limited to PFAS and is transferable to other contaminant classes (e.g., chlorinated solvents, microplastics, and metals) in complex environmental settings where optimized site characterization and management are required.
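
One step of the adaptive loop recommended above can be illustrated with the expected improvement (EI) acquisition function, which is listed among the abbreviations of this study. The sketch below uses the standard closed-form EI for maximization and assumes that a fitted GP already supplies a posterior mean and standard deviation at each candidate location. The candidate names, their values, the current best value, and the exploration margin `xi` are hypothetical and are not taken from the study.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, sd, best, xi=0.0):
    """Closed-form EI for maximization; xi is an optional exploration margin."""
    gain = mean - best - xi
    if sd <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sd
    return gain * normal_cdf(z) + sd * normal_pdf(z)


def next_sampling_site(candidates, best):
    """Return the name of the candidate with the largest EI."""
    return max(
        candidates,
        key=lambda c: expected_improvement(c["mean"], c["sd"], best),
    )["name"]


# Hypothetical GP posterior summaries (log-scale Conc) at three candidate sites.
candidates = [
    {"name": "A", "mean": 2.0, "sd": 0.1},  # near the known hotspot, well sampled
    {"name": "B", "mean": 1.5, "sd": 0.9},  # lower mean but poorly constrained
    {"name": "C", "mean": 0.2, "sd": 0.3},  # likely coldspot
]
current_best = 2.1  # hypothetical highest observed log-scale Conc so far

chosen = next_sampling_site(candidates, current_best)
print(chosen)
```

In this hypothetical setting the poorly constrained site is preferred over the well-sampled one, which is the trade-off between exploitation and exploration that makes the loop adaptive. After the new sample is collected, the GP would be refitted and the selection repeated.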

4. Conclusions

This study demonstrates that integrating Bayesian optimization with Generalized Linear Mixed Models creates a signal-extraction synergy for environmental monitoring and management under data constraints. The integrated evidence indicates that the site was affected by a particulate-bound, degrading PFOS plume originating from historic 3M AFFF, which was episodically mobilized as a “pulsing bullseye” during autumn hydrologic events. The GLMM identified signals consistent with this source, suggested particulate-bound transport, and found no evidence of modern inputs. These signal-based diagnostic conclusions provided the essential “what” and “why” that contextualized the BO/GP’s “where” and “when” predictions. In an era of increasing regulatory scrutiny on PFAS and constrained budgets, such signal-extraction frameworks are essential for adaptive decision-support and sustainable environmental stewardship.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/pr14030413/s1.

Author Contributions

F.E.: Conceptualization, methodology, data acquisition, predictive modeling, interpretation, writing—original draft, supervision, project administration, funding acquisition. G.A.E.: conceptualization, resources, writing—review and editing, supervision, project administration, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by USDA-NACA (Project no: 5410758-1890). The authors thank Liam McInerney for his assistance with field campaigns and the Brunswick homeowners for providing access to water sources on their properties. The authors also thank the anonymous reviewers for their constructive comments, which significantly improved an earlier version of this manuscript. Any remaining errors or omissions are the sole responsibility of the authors. The findings and conclusions in this publication are those of the authors and should not be construed to represent any official USDA, EPA, or U.S. Government determination or policy.

Data Availability Statement

Further data inquiries should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this study:

6:2 FTS	6:2 Fluorotelomer sulfonate
8:2 FTS	8:2 Fluorotelomer sulfonate
AFFF	Aqueous film-forming foam
AICc	Akaike information criterion corrected
ANOVA	Analysis of variance
BLUP	Best linear unbiased predictor
BO	Bayesian optimization
CI	Confidence interval
Conc	PFAS concentration
D	Composite desirability function
DF	Degrees of freedom
DOY	Day of year
DOY_cos	Cosine-transformed day of year (seasonal harmonic)
DOY_sin	Sine-transformed day of year (seasonal harmonic)
EI	Expected improvement
GLMM	Generalized linear mixed model
GP	Gaussian process
GPS	Global positioning system
HSD	Tukey’s honestly significant difference
Lat	Latitude
LC-MS/MS	Liquid chromatography with tandem mass spectrometry
Lon	Longitude
LOO-CV	Leave-one-out cross-validation
MDL	Minimum detection level
n	Sample size
NEtFOSAA	N-ethyl perfluorooctane sulfonamidoacetic acid
PFAS	Per- and polyfluoroalkyl substances
PFBA	Perfluorobutanoic acid
PFDA	Perfluorodecanoic acid
PFHxA	Perfluorohexanoic acid
PFOA	Perfluorooctanoic acid
PFOS	Perfluorooctane sulfonate
PFUnA	Perfluoroundecanoic acid
Pred Std Dev	Predicted standard deviation
p-value	Statistical significance
r	Pearson’s coefficient of correlation
R²	Coefficient of determination
SD	Standard deviation
SE	Standard error
SHAP	Shapley additive explanations
ΣPFAS	Sum of PFAS concentrations
χ²	Chi-square

References

Buck, R.C.; Franklin, J.; Berger, U.; Conder, J.M.; Cousins, I.T.; de Voogt, P.; Jensen, A.A.; Kannan, K.; Mabury, S.A.; van Leeuwen, S.P.J. Perfluoroalkyl and polyfluoroalkyl substances in the environment: Terminology, classification, and origins. Integr. Environ. Assess. Manag. 2011, 7, 513–541.
Higgins, C.P.; Luthy, R.G. Sorption of perfluorinated surfactants on sediments. Environ. Sci. Technol. 2006, 40, 7251–7256.
Nguyen, T.M.H.; Bräunig, J.; Thompson, K.; Kabiri, S.; Navarro, D.A.; Grimison, C.; Barnes, C.M.; Higgins, C.P.; McLaughlin, M.J.; Kookana, R.S. Influences of chemical properties, soil properties, and solution pH on soil–water partitioning coefficients of per- and polyfluoroalkyl substances (PFAS). Environ. Sci. Technol. 2020, 54, 15883–15892.
Matott, L.S.; Babendreier, J.E.; Purucker, S.T. Evaluating uncertainty in integrated environmental models: A review of concepts and tools. Water Resour. Res. 2009, 45, W06421.
Shastri, Y.; Diwekar, U.; Cabezas, H. Optimal control theory for sustainable environmental management. Environ. Sci. Technol. 2008, 42, 5322–5328.
Interstate Technology & Regulatory Council (ITRC). PFAS Technical and Regulatory Guidance Document (PFAS-1). 2023. Available online: https://pfas-1.itrcweb.org/ (accessed on 1 May 2025).
Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2016, 104, 148–175.
McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1989.
Cressie, N.; Wikle, C.K. Statistics for Spatio-Temporal Data; John Wiley & Sons: Hoboken, NJ, USA, 2011.
Diggle, P.J.; Ribeiro, P.J. Model-Based Geostatistics; Springer: New York, NY, USA, 2007.
United States Environmental Protection Agency (USEPA). Method 1633, Revision A: Analysis of Per- and Polyfluoroalkyl Substances (PFAS) in Aqueous, Solid, Biosolids, and Tissue Samples by LC-MS/MS; EPA-820-R-24-007; Office of Water: Washington, DC, USA, 2024. Available online: https://www.epa.gov/system/files/documents/2024-12/method-1633a-december-5-2024-508-compliant.pdf (accessed on 1 May 2025).
Klein, A.; Falkner, S.; Bartels, S.; Hennig, P.; Hutter, F. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 528–536.
McGuire, M.E.; Schaefer, C.; Richards, T.; Backe, W.J.; Field, J.A.; Houtz, E.; Sedlak, D.L.; Guelfo, J.L.; Wunsch, A.; Higgins, C.P. Evidence of remediation-induced alteration of subsurface poly- and perfluoroalkyl substance distribution at a former firefighter training area. Environ. Sci. Technol. 2014, 48, 6644–6652.
Weber, A.K.; Barber, L.B.; LeBlanc, D.R.; Sunderland, E.M.; Vecitis, C.D. Geochemical and hydrologic factors controlling subsurface transport of poly- and perfluoroalkyl substances, Cape Cod, Massachusetts. Environ. Sci. Technol. 2017, 51, 4269–4279.
Bräunig, J.; Baduel, C.; Heffernan, A.; Rotander, A.; Donaldson, E.; Mueller, J.F. Fate and redistribution of perfluoroalkyl acids through AFFF-impacted groundwater. Sci. Total Environ. 2017, 596, 360–368.
Wanninayake, D.M. Comparison of currently available PFAS remediation technologies in water: A review. J. Environ. Manag. 2021, 283, 111977.

Figure 1. Integrated workflow of GLMM and BO/GP adopted in this study.

Figure 2. Multivariate SHAP interaction analyses based on the BO/GP model of PFAS concentrations (on the log-scale) for: (a) 40 PFAS species: PFOS (blue) was the exclusive driver of high-concentration predictions; (b) Lon (X): values peaked at the hotspot coordinate (−69.937); (c) Lat (Y): values peaked at the hotspot coordinate (43.86); (d) DOY_sin: negative values corresponded to late October/early November; and (e) DOY_cos: positive values corresponded to late October/early November.

Figure 3. Converged results of the “immediate remediation (max-min)”, “potential hotspot exploration (max-max)”, and “uncertainty reduction (none-max)” scenario analyses on the GP posterior mean surface (Conc, log-scale). Values in red and blue indicate scenario-related optimal conditions and 95% confidence intervals, respectively.

Figure 4. Results of the “coldspot confirmation (min-min)” scenario analysis, identifying a low-risk spatiotemporal zone (Conc, log-scale). Values in red and blue indicate scenario-related optimal conditions and 95% confidence intervals, respectively.

Table 1. Gaussian Process (GP) model parameters and performance metrics.

Component	Parameter	Estimate	Interpretation
Mean function	Intercept (β)	0.048	Baseline log-scale
	DOY_sin (β₁)	0.268	Linear effect on the log-scale
	DOY_cos (β₂)	0.413	Linear effect on the log-scale
Kernel hyperparameter (length scale, ℓ)	Lon (X)	407.81	Distance in longitude over which spatial correlation decays
	Lat (Y)	30.120	Distance in latitude over which spatial correlation decays
	PFAS (θ)	0.807	“Distance” across PFAS compounds over which correlation decays
Variance	Nugget (τ)	0.115	Microscale/noise (non-spatial) component
	GP variance (σ²)	0.204	The underlying spatial process (signal)
Validation	Leave-one-out R²	0.807	Predictive power
	Measurement error	0.023	Estimated analytical noise
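
As a reading aid for Table 1, the sketch below evaluates the harmonic mean function from the tabulated coefficients and computes the share of the nugget in the total variance. It rests on assumptions that the table does not confirm: that the mean function is additive in DOY_sin and DOY_cos, that the harmonics use a period of 365.25 days, and that the nugget and the GP variance are expressed on the same variance scale. The function names are hypothetical. The mean function describes only the seasonal trend; the GP component adds the spatially and compound-specific deviations on top of it.

```python
import math

# Estimates copied from Table 1 (log-scale).
INTERCEPT = 0.048
BETA_SIN = 0.268
BETA_COS = 0.413
NUGGET = 0.115
GP_VARIANCE = 0.204


def mean_trend(doy, period=365.25):
    """Seasonal mean function: intercept plus sine and cosine harmonics."""
    angle = 2.0 * math.pi * doy / period
    return INTERCEPT + BETA_SIN * math.sin(angle) + BETA_COS * math.cos(angle)


def nugget_fraction(nugget, gp_variance):
    """Share of total variance attributed to the nugget (assumed same scale)."""
    return nugget / (nugget + gp_variance)


print(round(mean_trend(0.0), 3))                        # intercept + cosine term
print(round(nugget_fraction(NUGGET, GP_VARIANCE), 3))   # roughly 0.36
```

Under these assumptions, roughly one third of the modeled variance would be microscale or noise variation and the remainder structured signal, which is in line with the patchy distribution described in the text.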

Table 2. Key fixed effects from the GLMM for PFAS concentration (on the log scale), with PFUnA as the baseline (full GLMM output in Table S4).

PFAS	Estimate	SE	p-Value	Interpretation
6:2 FTS (modern)	−2.336	0.678	0.0006	Fluorotelomer, absent
NEtFOSAA	0.808	0.115	<0.0001	Precursor specific to legacy 3M AFFF
PFBA (modern)	−2.336	0.678	0.0006	Modern replacement compound, absent
PFDA	−0.707	0.704	0.3156	Not significant
PFHxA	−0.347	0.478	0.4684	Not significant
PFOA	4.219	0.926	<0.0001	Secondary terminal product
PFOS	7.604	0.692	<0.0001	Dominant terminal product
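
The estimates in Table 2 can be read more easily after two simple transformations, sketched below. The first converts each estimate and standard error into a two-sided p-value with a Wald z-test; the source does not state which test was used, so this is an assumption, although the resulting values are close to the tabulated ones. The second exponentiates the estimate to obtain a multiplicative contrast against the PFUnA baseline, which assumes a natural-log scale. The function names are hypothetical.

```python
import math

# (estimate, standard error) pairs copied from Table 2.
FIXED_EFFECTS = {
    "6:2 FTS": (-2.336, 0.678),
    "NEtFOSAA": (0.808, 0.115),
    "PFBA": (-2.336, 0.678),
    "PFDA": (-0.707, 0.704),
    "PFHxA": (-0.347, 0.478),
    "PFOA": (4.219, 0.926),
    "PFOS": (7.604, 0.692),
}


def wald_p_value(estimate, se):
    """Two-sided p-value from a standard normal Wald statistic (assumed test)."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def ratio_to_baseline(estimate):
    """Multiplicative contrast versus PFUnA, assuming a natural-log scale."""
    return math.exp(estimate)


for name, (estimate, se) in FIXED_EFFECTS.items():
    p = wald_p_value(estimate, se)
    ratio = ratio_to_baseline(estimate)
    print(f"{name:10s} p = {p:.4f}  ratio vs PFUnA = {ratio:.3g}")
```

Read this way, the PFOS estimate corresponds to a concentration on the order of two thousand times the PFUnA baseline, while the negative estimates for 6:2 FTS and PFBA correspond to roughly one tenth of it.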

Table 3. Main, interaction, and total effects (relative importance) of the five predictors on concentration (Conc), uncertainty (SD), and overall (Conc + SD).

Response	Input	Main Effect	Interaction Effect	Total Effect
Overall	PFAS	0.127	0.476	0.603
	Lon (X)	0.121	0.041	0.162
	Lat (Y)	0.121	0.021	0.142
	DOY_sin	0.087	0.028	0.115
	DOY_cos	0.087	0.017	0.104
Conc	PFAS	0.025	0.953	0.978
	Lon (X)	0.014	0.081	0.095
	DOY_sin	0.016	0.057	0.073
	Lat (Y)	0.014	0.043	0.057
	DOY_cos	0.016	0.034	0.050
SD *	Lon (X)	0.228	0	0.228
	Lat (Y)	0.228	0	0.228
	PFAS	0.228	0	0.228
	DOY_sin	0.158	0	0.158
	DOY_cos	0.158	0	0.158

* An interaction effect of zero for the SD response indicates that the variance-based sensitivity analysis did not detect non-additive, interactive contributions between the predictors for that specific model output.
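
The structure of Table 3 can be checked with a few lines of arithmetic: each total effect should equal the main effect plus the interaction effect, and when no interactions are present the main effects alone should account for all of the variance. The sketch below copies the tabulated values and tests both properties; the function names and the rounding tolerance are assumptions made for the example.

```python
# (main, interaction, total) triples copied from Table 3.
TABLE3 = {
    "Overall": {
        "PFAS": (0.127, 0.476, 0.603),
        "Lon": (0.121, 0.041, 0.162),
        "Lat": (0.121, 0.021, 0.142),
        "DOY_sin": (0.087, 0.028, 0.115),
        "DOY_cos": (0.087, 0.017, 0.104),
    },
    "Conc": {
        "PFAS": (0.025, 0.953, 0.978),
        "Lon": (0.014, 0.081, 0.095),
        "DOY_sin": (0.016, 0.057, 0.073),
        "Lat": (0.014, 0.043, 0.057),
        "DOY_cos": (0.016, 0.034, 0.050),
    },
    "SD": {
        "Lon": (0.228, 0.0, 0.228),
        "Lat": (0.228, 0.0, 0.228),
        "PFAS": (0.228, 0.0, 0.228),
        "DOY_sin": (0.158, 0.0, 0.158),
        "DOY_cos": (0.158, 0.0, 0.158),
    },
}


def additivity_mismatches(table, tol=5e-4):
    """List (response, input) pairs where main + interaction != total."""
    return [
        (response, name)
        for response, rows in table.items()
        for name, (main, inter, total) in rows.items()
        if abs(main + inter - total) > tol
    ]


def sum_main_effects(table, response):
    """Sum of main effects for one response."""
    return sum(main for main, _, _ in table[response].values())


print(additivity_mismatches(TABLE3))            # empty list if rows are consistent
print(round(sum_main_effects(TABLE3, "SD"), 3))  # 1.0 for a purely additive output
```

For the Conc response the total effects sum to more than one because each interaction is counted once for every predictor that takes part in it, which is expected for variance-based indices and reflects the dominance of PFAS-by-location and PFAS-by-season interactions.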


