1. Introduction
A micro-environment is a physical or functional space where contaminant concentrations are relatively homogeneous and well mixed while a person is present, e.g., an automobile cabin, a garage, a forecourt, a classroom, or a workplace zone [1]. Urban residents spend substantial amounts of time in micro-environments shaped by traffic flow, building geometry, and fuel handling, where pollutant mixtures and exposure intensities can differ markedly from values at regulatory background monitors. Near-road corridors, street canyons, and queue-prone intersections often show sharp spatial concentration gradients over tens of meters and strong diurnal variability tied to congestion and boundary-layer dynamics. These features are recognized in guidance and siting criteria for dedicated near-road monitoring [2]. Beyond the curbside, activities such as vehicle parking and refueling create semi-enclosed or source-proximate spaces that combine primary emissions with ventilation constraints. In many mid-sized cities, permanent monitoring networks are sparse, so short, intensive campaigns are often used to understand exposure patterns in these settings. A key challenge is to summarize a few days of multi-pollutant measurements into clear conditions that allow fair comparison across micro-environments and support exposure interpretation and follow-up decisions such as targeted monitoring, ventilation checks, or operational guidance during high-risk regimes.
Vehicle exhaust emissions are a major contributor to air pollution in semi-enclosed traffic environments such as parking garages, where pollutants accumulate due to limited air exchange. These emissions contain a complex mixture of carbon monoxide (CO), nitrogen oxides (NOx), hydrocarbons (HCs), aldehydes, and particulate matter (PM) [3,4]. In practice, CO is often used as an operational indicator in garages, yet fine particles remain a major health concern, and short exposure peaks can matter [5].
Traffic-related pollutants are linked to respiratory and cardiopulmonary risks, and guideline values emphasize the importance of both short- and long-term exposures [6]. Some gases, such as hydrogen sulfide (H2S), may remain low in ambient urban air but can rise episodically in poorly ventilated settings and cause acute irritation and nuisance odors [7]. When a campaign covers several micro-environments in only a few days, there is a practical need to summarize multi-pollutant conditions in a consistent way. This helps compare sites and time periods, link observed patterns to plausible drivers such as traffic activity and ventilation, and apply simple analyses without mixing very different conditions.
In this work, an operational regime refers to a recurring multi-pollutant condition that reflects the joint behavior of emissions and ventilation, expressed through a characteristic combination of pollutant levels, diurnal timing, and simple meteorological indicators [8,9,10,11]. We define regimes by grouping observations with similar multi-pollutant and meteorological patterns, and we then use the regime label as a compact description of conditions that matter for exposure interpretation and simple analyses. Because regimes are defined from the measured variables rather than fixed labels, a given regime can occur at different sites and at different times when conditions are similar. Practically, the regime approach turns a short multi-pollutant campaign into (i) a time series of regime labels at each site and (ii) per-regime summaries of pollutant levels and co-variation. This supports like-for-like comparison across micro-environments (same regime), identification of regime-specific peak exposure periods, and use of the regime label as an additional categorical predictor for simple analyses and gap-filling.
Compared with common segmentation approaches such as site-based grouping, fixed time-slot grouping, or single-pollutant thresholding, the regime approach aims to identify recurring multi-pollutant situations that can reappear across sites and times [8,10,11]. This is useful in short campaigns because it provides a small set of interpretable conditions that can be summarized and tested while keeping some days and some sites completely separate for evaluation. It also supports simple analyses and gap-filling by using the regime label as a practical indicator of conditions, together with the other pollutants and meteorological variables measured at the same time.
Short multi-site urban campaigns reveal strong diurnal and micro-environmental structure, yet few studies formalize compact and interpretable multi-pollutant regimes, test them using separate days and separate sites, and examine whether patterns identified at one location also appear at another [8,9,10,12,13]. Recent reviews confirm the growing use of machine learning in air quality studies, and they also highlight the importance of interpretability and careful evaluation when data are limited or heterogeneous [14,15]. We address this gap with three aims: (1) to define operational regimes from simple and physically grounded features across five contrasting micro-environments; (2) to examine whether the regime label helps explain key tracers beyond wind, time of day, and activity by testing on separate days and separate sites; and (3) to examine how far results can be used across locations by identifying when cross-site use is reasonable and when local effects dominate. Because the campaign duration is short and sources differ by location, we treat cross-site use cautiously and present it as limited to comparable settings rather than a universal rule.
Our contribution is a practical data-driven framework that links the regime approach to concrete uses in short campaigns. These uses include summarizing multi-pollutant conditions into a small number of interpretable situations, checking whether the regime label improves simple exposure-relevant models when tested on separate days and sites, and evaluating when cross-site use is reasonable and when site dependence dominates.
Hyperlocal monitoring has shown sharp concentration contrasts over short distances, which motivates multi-site designs that compare micro-environments within limited sampling periods [12,13]. Meteorological normalization further emphasizes that diurnal timing and mixing conditions are central for interpreting observed variability in traffic-related pollution [10]. Together, these findings motivate the regime approach as a structured way to summarize short campaign observations and to support cautious interpretation and limited use across comparable settings [9].
2. Materials and Methods
To identify and interpret recurring multi-pollutant conditions that govern exposure in traffic-affected urban micro-environments, we implemented a pipeline that links high-cadence, multi-site measurements to operational regime identification, robustness checks, and task-based evaluation. The pipeline is designed for short campaigns, where the practical goal is to summarize complex pollutant–meteorology mixtures into a small number of interpretable states (regimes) and to test whether the regime label is useful for downstream analyses. We evaluate performance using designs that keep whole days and whole sites separate, so the reported results reflect out-of-sample behavior rather than reuse of information from the held-out blocks.
Figure 1 shows the full analytical pipeline, from measurement to regime discovery, stability testing, and evaluation under day and site-blocked designs.
As shown in Figure 1, we start from quality-controlled multi-site observations and derive physically grounded features that reflect traffic activity, ventilation, and diurnal timing. We then identify operational regimes by grouping similar multivariate feature patterns, and we summarize each regime by its typical pollutant profile and its occurrence across micro-environments and hours. Finally, we evaluate three practical uses under day- and site-blocked designs: compact condition summarization, explanatory modeling for key tracers, and reconstruction when one pollutant stream is fully missing. In each evaluation, every data-driven step (scaling, regime identification, and model fitting) is learned on training blocks only and then applied unchanged to the held-out blocks.
2.1. Study Design, Instrumentation, and Data Quality Control
We conducted a short fixed-schedule campaign that covered five urban micro-environments typical of traffic-affected settings. The five sites were an open garage, a large roundabout with street canyon form, a closed municipal garage, a gasoline forecourt, and a campus entry that served as an urban background. Sampling followed a constant fifteen-minute cadence from 08:00 to 18:45 local time on four consecutive weekdays during August 2024. This window aligns with peak activity and human presence at these locations and is therefore the period of primary interest for exposure and operations.
We recorded PM1, PM2.5, PM10, NO2, CO, H2S, and total VOCs. We also recorded relative humidity, wind speed, geographic coordinates, and a vehicle activity proxy that captures near-field traffic intensity. The proxy was derived from an on-site infrared traffic counter that produced fifteen-minute counts. Site-wise summary statistics appear in Table 1.
For this work, we group the variables into three input sets: (i) measured inputs, pollutant concentrations and contextual variables recorded during the campaign; (ii) regime identification features, engineered variables that reflect emissions, ventilation, and diurnal structure, used to define the regimes; (iii) task predictors, the variables used in each analysis task, such as regression, cross-site testing, and imputation, which may include the regime label when we test its added value. This organization makes it clear what each step uses, and it helps keep test days and test sites separate from model fitting.
Using this organization, we applied a common preprocessing pipeline before any analysis. All channels were synchronized to a fixed 15 min time grid and checked for invalid readings and logging artifacts. Relative humidity was retained and inspected because optical PM can be biased upward during humid periods. Day and site identifiers were kept so that any step that learns from the data, such as scaling, regime identification, and model fitting, was fitted using only the calibration days and sites and then applied unchanged to the test days and sites.
Additionally, instruments were selected to balance robustness, portability, and traceability under field conditions. Particulate matter (PM1, PM2.5, PM10) was monitored with a Casella Dust Detective optical instrument operating on the light-scattering principle, in which a laser beam passes through the sampled air and the scattered light intensity is measured to determine particle concentration. The device includes an internal pump that maintains a stable flow rate, provides real-time output for fine and coarse fractions, and is well suited for urban air quality studies. Trace gases (NO2, CO, H2S, total VOCs) were measured with Aeroqual Series 500 modular heads, and meteorological variables were recorded using a Kestrel 5500 vane-and-cup anemometer. All sensors were co-located on a tripod at breathing height and powered by field batteries. Table 2 summarizes the instruments, principles, and ranges, where PID refers to a Photoionization Detector head used for total VOCs.
Moreover, quality assurance and quality control (QA/QC) procedures were applied to ensure precision, accuracy, and full traceability. Prior to deployment, the Casella Dust Detective and Aeroqual analyzers were calibrated using certified manufacturer standards, with calibration certificates archived. Daily zero and span checks were performed in the field using High-Efficiency Particulate Air (HEPA)-filtered air for the particulate instrument and certified calibration gases for the gas analyzers. Instrument response and flow-rate stability were verified within 2% of nominal values before each sampling session. Device clocks were synchronized to local time within 1 min and logged at fixed 15 min intervals aligned to wall time. Each day’s measurement sequence included start- and end-of-day bump checks, and any deviation beyond acceptance limits triggered corrective action and notation in the field log. Sampling conditions (duration, location, temperature, RH, and airflow) were recorded concurrently and cross-checked with calibration records. Data validation involved cross-comparison between the Casella Dust Detective and a secondary co-located handheld monitor (HAZ-Scanner) used as a consistency check, inspection for outliers, and exclusion of points attributable to instrument faults or procedural anomalies. Optical PM data were screened with a relative humidity sentinel rule to limit hygroscopic bias: intervals with RH ≥ 85% were flagged and removed prior to summary analyses [16]. All serial numbers, maintenance logs, and calibration records were archived for traceability.
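The RH sentinel rule amounts to a one-line mask over the optical PM records; a minimal sketch (the function name and threshold argument are illustrative, with 85% as the default cut):

```python
import numpy as np

def rh_screen(pm, rh, rh_cut=85.0):
    """Apply the RH sentinel rule: drop optical PM records with RH >= rh_cut (%)."""
    pm = np.asarray(pm, dtype=float)
    rh = np.asarray(rh, dtype=float)
    keep = rh < rh_cut  # True for records retained in summary analyses
    return pm[keep], keep
```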
A simple site-level calibration protocol defined threshold criteria to maintain consistency across all locations. Zero and span verifications for each pollutant analyzer and flow-rate stability within acceptance limits were confirmed before and after daily measurements. This standardized procedure ensured that observed spatial differences reflected genuine environmental variability rather than calibration drift or bias.
The campaign emphasizes hours that dominate activity and human presence around the sites, and it spans contrasting micro-environments within a compact window. This supports the discovery of recurring daytime conditions and provides a shared context across locations [9]. Hyperlocal studies that mapped street-scale variability concentrated their measurements in active daytime periods and showed consistent spatial patterns across repeats. This aligns with our focus on practical regime identification under real constraints [12,13,17,18]. Because temporal coverage is short, we explicitly evaluate the robustness of the learned regimes using block-resampling stability and sensitivity checks, as described in Section 2.4. Overall, the combined design delivers the spatial contrasts needed for regime learning while documenting the stability of the resulting structure.
2.2. Physically Grounded Feature Engineering
Regime learning is driven by a feature vector that combines pollutant levels with simple indicators of timing and mixing. The goal is to represent recurring emission and ventilation conditions in a way that remains interpretable and comparable across sites.
Short roadside campaigns show strong diurnal structure driven by activity timing, mixed-layer growth, and ventilation in semi-enclosed spaces. To represent these rhythms smoothly without arbitrary cutoffs, we encode local time of day t (in hours) with two circular terms, sin(2πt/24) and cos(2πt/24).
We include two composition ratios that reflect shifts in the particle-size mixture linked to emission and processing pathways. The ratios increase when the fine mode dominates (e.g., fresh exhaust or confined micro-environments) and decrease when coarse mechanisms (e.g., resuspension or road dust) dominate. These ratios provide a compact, unitless indicator of shifts in size mix that is comparable across sites in short campaigns.
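A minimal sketch of these two feature groups. The circular terms follow the sin/cos encoding of local hour described above; the ratio definitions PM1/PM2.5 and PM2.5/PM10 are an assumption consistent with ratios that rise when the fine mode dominates, and the exact definitions follow the study's equations:

```python
import numpy as np

def diurnal_terms(t_hours):
    """Circular encoding of local time of day (hours in [0, 24))."""
    ang = 2.0 * np.pi * np.asarray(t_hours, dtype=float) / 24.0
    return np.sin(ang), np.cos(ang)

def pm_ratios(pm1, pm25, pm10, eps=1e-9):
    """Fine-mode composition ratios (assumed PM1/PM2.5 and PM2.5/PM10).

    eps guards against division by zero in clean-air intervals.
    """
    pm1, pm25, pm10 = (np.asarray(a, dtype=float) for a in (pm1, pm25, pm10))
    return pm1 / (pm25 + eps), pm25 / (pm10 + eps)
```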
We also include a simple daytime mixing proxy based on the sine of the solar-elevation angle θ, used here as a practical indicator of convective mixing strength. The angle is computed from site latitude φ, solar declination δ from day of year, and hour angle H using the NREL Solar Position Algorithm, via the standard identity sin θ = sin φ sin δ + cos φ cos δ cos H, and the proxy is then truncated at zero, SE = max(0, sin θ). The truncation is appropriate because the window is daytime only; negative solar elevation corresponds to night, when this proxy would not represent convective mixing. More details about this algorithm are found in [19].
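A simplified sketch of the truncated proxy. Note that it substitutes Cooper's declination approximation and a 15-degrees-per-hour hour angle for the full NREL SPA used in the study, so it illustrates the structure of the feature, not the exact computation:

```python
import numpy as np

def mixing_proxy(lat_deg, day_of_year, hour_local):
    """sin(solar elevation), truncated at zero, as a daytime mixing proxy.

    Simplifications relative to the study: Cooper's formula for solar
    declination and a local hour angle of 15 degrees per hour from noon,
    instead of the NREL Solar Position Algorithm.
    """
    phi = np.radians(lat_deg)
    delta = np.radians(23.45) * np.sin(2.0 * np.pi * (284 + day_of_year) / 365.0)
    H = np.radians(15.0 * (hour_local - 12.0))  # hour angle from solar noon
    sin_elev = np.sin(phi) * np.sin(delta) + np.cos(phi) * np.cos(delta) * np.cos(H)
    return max(0.0, sin_elev)  # truncation: night maps to zero
```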
2.3. Learning Recurring Conditions as Regimes
In this work, clustering is used as a regime-construction step. The objective is to obtain a small set of stable and interpretable operational prototypes, “centroids,” that can be summarized physically and used consistently in subsequent regime-conditioned modeling and transport checks across sites.
Operational regimes summarize recurring multivariate conditions that matter for exposure and for simple forecasting. Let x_i denote the feature vector for observation i, containing pollutant levels, basic meteorology, the diurnal terms, and the composition ratios. To place heterogeneous variables on a comparable geometry and to avoid information leakage, each column j is standardized using training-only moments, z_ij = (x_ij − μ_j)/σ_j, where μ_j and σ_j are the mean and standard deviation of column j computed on the training blocks only.
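The training-only standardization can be sketched as follows (function names illustrative); the same fitted moments are later applied unchanged to held-out days and sites:

```python
import numpy as np

def fit_scaler(X_train):
    """Column means and standard deviations learned on the training block only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    return mu, sd

def apply_scaler(X, mu, sd):
    """Apply the training moments unchanged to any block (train or held-out)."""
    return (X - mu) / sd
```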
k-means is used in this study because the regimes are intended to act as operational prototypes: each record must receive a single, reproducible label, and the same regime definition must be assignable to unseen days and unseen sites using training information only. In this setting, centroid-based regimes provide (i) direct regime-wise profiles via cluster centroids/medians and (ii) a deterministic out-of-sample rule (nearest-centroid assignment) that fits naturally with the day- and site-blocked evaluation design. Other clustering families, such as model-based mixtures or density-based methods, are useful when the scientific goal is to represent non-spherical structure or variable-density groups. However, for short campaigns and for the specific downstream tasks studied here (regime recognition from reduced sensors, regression with a regime factor, and fold-safe reconstruction), the key requirement is a stable partition with complete assignment and a transparent out-of-sample mapping under leakage control. We therefore use k-means as a controlled, stability-documented baseline and quantify robustness via block-resampling stability, perturbation checks, and feature-group ablations (Section 2.4), alongside alternative partitions in Appendix A.3 [20,21,22,23,24,25,26,27].
Furthermore, we learn k regimes with k-means by minimizing within-cluster distortion using the Hartigan and Wong algorithm with multiple random starts [22]. The number of regimes k is chosen from a small grid by maximizing the mean silhouette and then confirmed by plateaus in the adjusted Rand index under day-block bootstrap resampling [20,26]. This stability-aware selection reduces sensitivity to unequal densities and supports reproducibility. Let c_1, …, c_k denote the centroids of the k clusters in the standardized feature space, where each c_j is a p-dimensional mean vector representing the typical conditions of regime j. For a new observation x, we assign the regime by nearest centroid in the training standardized space, r(x) = argmin_j ‖z − c_j‖_2 with z = (x − μ)/σ, where μ and σ are learned on training data only and ‖·‖_2 is the Euclidean norm. Alternative views that included Ward linkage on Euclidean distances, k-means in principal component analysis (PCA) space, and adjusted Rand index (ARI) agreement checks were explored. As reported in Appendix A, candidate k was scanned over a small grid using the average silhouette, and stability was evaluated by day-block bootstrap and ARI [27,28,29].
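The centroid and nearest-centroid logic can be sketched as follows. This uses plain Lloyd iterations with a deterministic initialization rather than the Hartigan-Wong algorithm with multiple random starts, so it illustrates the assignment rule, not the exact fitting procedure:

```python
import numpy as np

def kmeans_lloyd(Z, k, n_iter=100):
    """Plain Lloyd's k-means on standardized features Z (n x p).

    Illustration only: the study used Hartigan-Wong with multiple random
    starts; the centroid/assignment core shown here is the same. Evenly
    spaced rows are used as initial centroids for reproducibility.
    """
    C = Z[np.linspace(0, len(Z) - 1, k).astype(int)].copy()
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        d = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)  # nearest-centroid assignment
        newC = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                         else C[j] for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels

def assign_regime(z_new, C):
    """Deterministic out-of-sample rule: nearest training centroid."""
    d = ((np.asarray(z_new)[None, :] - C) ** 2).sum(axis=1)
    return int(d.argmin())
```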
2.4. Stability, Sensitivity, and Interpretability
To evaluate whether the discovered regimes are robust under a short duration and limited sample size, we assessed stability and sensitivity using blocked resampling and controlled input perturbations. Stability was assessed by resampling entire site–day blocks with replacement, refitting k-means with the selected k, and summarizing agreement with the reference partition using the adjusted Rand index (ARI) [20,27]. Sensitivity was assessed in two complementary ways. First, we applied small multiplicative perturbations (jitter) to VOC and RH and refit the clustering, following stability-based clustering validation practice [30]. Second, we performed feature-group ablations by refitting the clustering after removing one feature group at a time (diurnal terms, PM ratios, VOC, or RH) and quantifying partition agreement using ARI. These checks are used to confirm that the main regime structure is not driven by a single channel and that it persists under small changes to the input representation; they are not used to claim causal importance.
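The agreement metric used throughout these checks can be computed from the contingency table of two partitions; a minimal numpy/stdlib implementation (illustrative, equivalent to standard library routines such as scikit-learn's adjusted_rand_score):

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two partitions of the same records.

    1.0 means identical partitions up to label permutation;
    values near 0 indicate chance-level agreement.
    """
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    ua, ub = np.unique(a), np.unique(b)
    # Contingency table of joint label counts.
    M = np.array([[int(np.sum((a == x) & (b == y))) for y in ub] for x in ua])
    sum_ij = sum(comb(int(v), 2) for v in M.ravel())
    sum_a = sum(comb(int(v), 2) for v in M.sum(axis=1))
    sum_b = sum(comb(int(v), 2) for v in M.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```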
For each regime, pollutant and meteorological distributions were summarized using the median and interquartile range, and hourly regime occurrence was tabulated over the daytime window to characterize diurnal representation. Robust summaries, such as median and interquartile range (IQR), are used because short roadside campaigns often produce right-skewed distributions with episodic peaks.
2.5. Imputing One Missing Sensor Stream
One site–day record of H2S was unavailable in the collected dataset. Device logs indicated a transient sensor fault, while all other channels remained stable. We treated this gap as missing at random conditional on observed covariates and reconstructed the missing values using a regime-aware random forest [31,32]. Random forest was selected because it captures nonlinear relationships among co-pollutants and meteorology and is robust to multicollinearity. The operational regime label was included as an auxiliary predictor to encourage coherence with the multivariate pollutant–meteorology state, linking the regime framework to a practical sensor-recovery task. This imputation example is presented as a proof of concept for feasibility in short campaigns; generalization to longer gaps or substantially different source mechanisms is discussed in Section 4.
The input features used as predictors for the imputation task included site/location covariates (site name, latitude, longitude), time-of-day features (hour and sin/cos encoding), meteorology/proxies (wind speed, relative humidity, and the mixing proxy), a traffic proxy (car-parking), co-pollutants (PM1, PM2.5, PM10, NO2, CO, total VOCs), and the regime label. This use of co-pollutants as predictors follows the general idea that co-occurring air-pollution variables can carry strong predictive information, as shown in multi-output VOC prediction settings [33]. Regimes were learned on the training data only; the resulting cluster ID was then assigned to held-out and new records (including the missing day) using the trained centroids and included in the random forest model as a categorical factor. Random forest models were fit without hyperparameter tuning, keeping the default number of candidate predictors considered at each split, because the goal was a robust reconstruction baseline for a short campaign rather than optimized performance. We used ensembles of 600 trees in cross-validation and 1000 trees for the final imputation fit to stabilize the ensemble predictions [31,32]. Because H2S is unobserved on the missing day, performance cannot be computed for that day. We therefore evaluated the reconstruction model on non-missing days using day-blocked splits, then refit the final model on all available non-missing days and applied it to the missing site–day. This avoids using information from the target day during fitting.
Predictive dispersion across trees was used as an exploratory proxy for variability in the imputed estimates. We report the spread of predictions across the random forest ensemble only as a heuristic indicator of prediction stability, not a statistically calibrated predictive uncertainty interval.
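A sketch of the reconstruction step, assuming scikit-learn's RandomForestRegressor as a stand-in for the study's implementation (the function name is illustrative; the predictor matrix is assumed to already contain the co-pollutants, meteorology, time terms, and the training-fold-learned regime label encoded numerically):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_missing_stream(X_train, y_train, X_missing, n_trees=1000, seed=0):
    """Fold-safe reconstruction sketch for a fully missing sensor stream.

    Defaults are kept (no tuning), mirroring the baseline setup in the text.
    Returns point predictions and the per-tree spread, the latter only as a
    heuristic stability indicator, not a calibrated predictive interval.
    """
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(X_train, y_train)
    point = rf.predict(X_missing)
    per_tree = np.stack([t.predict(X_missing) for t in rf.estimators_])
    return point, per_tree.std(axis=0)
```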
2.6. Do Regimes Add Explanatory Value Beyond Basic Covariates?
The analysis tests whether the learned regimes capture exposure-relevant multivariate conditions that are not already explained by common covariates used in air-quality interpretation (wind speed, solar mixing proxy, diurnal timing, and traffic activity). Regime membership represents recurring states in which pollutants and meteorology co-vary (e.g., traffic build-up under weak ventilation versus well-mixed periods). We quantify the added explanatory value of regimes by comparing a baseline model to an augmented model that includes the regime label.
The baseline linear model for pollutant concentration y_i at time i is

y_i = β_0 + β_1 WS_i + β_2 SE_i + β_3 sin(2πh_i/24) + β_4 cos(2πh_i/24) + β_5 Cars_i + ε_i,

where wind speed is denoted WS, the truncated solar-elevation proxy is SE, the cyclic hour terms describe diurnal timing, and the vehicle activity indicator is Cars, recorded by the IR traffic counter (vehicles per 15 min) at all sites.
Let R_i denote the regime label at time i, with regime 1 as the reference level. The augmented model adds the regime label as a categorical fixed effect to estimate the mean pollutant shift associated with each regime:

y_i = β_0 + β_1 WS_i + β_2 SE_i + β_3 sin(2πh_i/24) + β_4 cos(2πh_i/24) + β_5 Cars_i + Σ_{j=2}^{k} γ_j 1{R_i = j} + ε_i.

Treating regime as a fixed effect allows direct estimation of the average pollutant difference across the identified multivariate conditions and enables formal testing of whether regime membership adds independent explanatory value. Models were estimated by ordinary least squares [34].
Residuals were inspected (residuals versus fitted and Q–Q plots). For skewed targets, we repeated the analysis on the log scale as a sensitivity check, as presented in Appendix A.4; results are reported on the natural scale.
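The baseline-versus-augmented comparison reduces to two least-squares fits and a difference in R²; a minimal sketch with illustrative helper names:

```python
import numpy as np

def ols_r2(X, y):
    """Fit OLS with an intercept and return the coefficient of determination."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def regime_dummies(regime, reference=1):
    """Dummy-code the regime factor with `reference` as the baseline level."""
    regime = np.asarray(regime)
    levels = [l for l in np.unique(regime) if l != reference]
    return np.column_stack([(regime == l).astype(float) for l in levels])
```

The added explanatory value of the regime factor is then the difference `ols_r2(np.column_stack([X_base, regime_dummies(regime)]), y) - ols_r2(X_base, y)`, where `X_base` holds the wind, solar-proxy, diurnal, and traffic columns.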
2.7. Generalization Across Sites and Leakage Control
To evaluate whether models trained in some locations can predict pollutant levels in a new but comparable micro-environment (site), we assessed cross-site generalization using a Leave-One-Site-Out (LOSO) design and trained identical models with and without the regime factor. This reflects a common operational setting where monitoring exists at only a subset of locations.
In each LOSO round, one site served as the test set and models were trained on the remaining sites. Any step that learns from the data distribution (standardization and regime identification) was fit on the training sites only and applied unchanged to the held-out site to prevent information leakage. Predictive performance was summarized by R² and RMSE, and we compared models with and without the regime factor.
Within-site analyses used day-blocked cross-validation [35], where entire days were held out together to respect temporal dependence and avoid optimistic splits within the same day. Together, LOSO and day-blocked validation address (i) generalization to a new day at the same site and (ii) generalization to a new site.
These validation schemes provide a rigorous test of robustness under strict leakage control [36] and support the use of the framework for short multivariate campaigns in operational and regulatory contexts.
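The leakage-controlled LOSO loop can be sketched as follows (helper names illustrative; `fit`/`predict` stand in for any model). The held-out R² is computed against the held-out site's own mean, so a negative value flags performance worse than a site-mean baseline:

```python
import numpy as np

def leave_one_site_out_r2(X, y, sites, fit, predict):
    """LOSO evaluation with every learned step (here, scaling and model
    fitting) restricted to the training sites and applied unchanged to
    the held-out site. Returns per-site R^2."""
    scores = {}
    for s in np.unique(sites):
        tr, te = sites != s, sites == s
        mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)
        sd[sd == 0] = 1.0                       # guard constant columns
        model = fit((X[tr] - mu) / sd, y[tr])   # fitted on training sites only
        pred = predict(model, (X[te] - mu) / sd)
        ss_res = ((y[te] - pred) ** 2).sum()
        ss_tot = ((y[te] - y[te].mean()) ** 2).sum()
        scores[str(s)] = 1.0 - ss_res / ss_tot
    return scores
```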
4. Discussion
This section interprets the results in light of the study’s three main objectives: (i) learning operational regimes that summarize daytime conditions across sites, (ii) testing whether these regimes add explanatory value beyond basic covariates, and (iii) assessing cautious model transport across comparable micro-environments. In practical terms, we focus on a short-campaign problem: how to summarize concurrent multi-pollutant measurements across different micro-environments in a way that remains usable for modeling and data-quality tasks.
The operational regimes derived in this study summarize the joint pollutant–meteorology structure observed across the five monitored micro-environments during daytime hours. Compared with conventional segmentation (by site, by time slot, or by single-pollutant thresholds), the regime labels are assigned from multiple variables jointly, so they can represent recurring multi-pollutant states that appear across different sites and hours within the same campaign. At the same time, the results also show site-dominant regimes (e.g., regimes largely associated with one site type), which helps separate site-specific conditions from those shared across similar micro-environments. Because the regime factor remained interpretable and was highly classifiable from reduced sensor subsets, it provides a compact way to report multivariate conditions using a small number of labels.
In a short campaign, a practitioner can use the regime labels to (i) summarize exposure conditions by reporting the fraction of time each micro-environment spends in higher-burden regimes (by hour and by site), (ii) compare locations on a common basis beyond simple averages (e.g., whether a site is dominated by accumulation-type regimes or by well-mixed regimes), and (iii) support operational actions such as prioritizing ventilation or traffic-management measures during the hours when higher-burden regimes occur most frequently. The same labels can also support data-quality workflows by flagging records that are inconsistent with the typical regime profile and by providing context for reconstructing short sensor outages.
We used k-means because the regime labels are intended to function as operational prototypes: they must be (i) easy to summarize via regime-wise profiles and (ii) assignable to new observations using only training information (nearest-centroid assignment) under the day- and site-blocked evaluation design. This supports a clear separation between learning and testing and keeps the regime definition reproducible across resamples. Model-based mixtures and density-based clustering can be valuable when regimes are expected to be strongly non-spherical, overlapping, or multi-density; however, in short campaigns they can require additional tuning choices that affect reproducibility and the stability of downstream comparisons. Further, methods that are more sensitive to model specification or to density and hierarchy hyperparameters may change the effective number and definition of clusters across folds or environments, which complicates the same-regime interpretation required by our framework. Additionally, because our regime label is used as a factor/predictor and as a target in regime recognition, we require a complete partition in which every record is assigned a regime, and a deterministic out-of-sample assignment rule under site/day hold-out. We therefore focus on prototype-based regimes via k-means and outline a leakage-safe comparison protocol for alternative clustering families in Appendix A.3. Accordingly, this study emphasizes a controlled, stability-documented baseline that can be extended using the same leakage-safe validation principles when richer datasets or additional context variables are available [20,24,25,38].
In the NO2 regression, adding the five-level regime factor increased R² from 0.194 to 0.251. This change is reported as an incremental increase in explained variability for a model that remains parsimonious (one additional categorical factor). The result is consistent with near-road work highlighting the role of local dispersion and mixing conditions in shaping concentration variability [2,8]. We report the effect as a model-level change in explained variance and do not translate it into a ppb-scale improvement without additional concentration-error summaries.
Moreover, the regime-aware random forest imputation reconstructed a completely missing H2S site–day with strong cross-validated performance and low predictive dispersion, using a fold-safe design in which regimes are learned in training folds and assigned in held-out folds. In this manuscript, the imputation case study is limited to one pollutant, one site, and one missing day, so it is presented as a proof-of-concept for methodological feasibility under short-campaign constraints rather than a general statement about performance for longer missing periods or for sites with different source mechanisms.
Concerning cross-site transferability and practical boundaries, leave-one-site-out testing revealed substantial heterogeneity in pollutant behavior. NO2 showed the strongest portability among the tested targets in this dataset, while several other channels showed weak or negative R². Here, a negative R² indicates performance worse than a site-mean baseline for the held-out site, which provides a clear marker of limited transportability. Overall, these results highlight that some targets support pooled modeling across sites in this campaign, whereas others remain strongly site-dependent and would require additional contextual descriptors (e.g., geometry, land use, or source mix) and/or site-specific calibration for deployment beyond the training micro-environments.
The present campaign focused on daytime hours over a small number of representative sites, so the learned regimes are specific to the observed daytime conditions and may not represent nocturnal chemistry, other seasons, or more complex urban morphologies. The solar-elevation proxy is a simplified representation of boundary-layer dynamics and does not replace direct PBL or mixing-height measurements. Although portable analyzers were factory-calibrated, instrument bias and channel noise remain possible, particularly for VOC and H2S sensors. Future work should extend regime learning across longer timeframes and incorporate physical and contextual covariates such as mixing height, canopy geometry, and land-use class to strengthen interpretation and transport across dissimilar micro-environments. The same stability and leakage-safe validation design used here is compatible with other clustering families when the data support additional complexity. Combining machine learning with mechanistic dispersion modeling is also a promising direction for improving transport across dissimilar micro-environments.
5. Conclusions
This study develops and tests an operational-regime workflow for short, synchronized multi-site monitoring in traffic-related urban micro-environments (open and semi-enclosed garages, a fuel forecourt, a street setting, and a campus/background location). The regimes are learned from the joint pollutant–meteorology space and therefore provide an alternative to conventional segmentation by site, hour, or single-pollutant thresholds. A five-regime solution was selected by the silhouette criterion and supported by day-block stability diagnostics and sensitivity checks (Appendix A.1 and Appendix A.3), and the regime profiles summarize consistent contrasts in PM loading, gaseous signatures, and ventilation conditions across the monitored network.
Regime labels were recoverable from reduced sensor configurations under day-blocked validation, with accuracy 0.989 using gases and meteorology and 0.993 using all variables, supporting regime recognition when only a subset of channels is available. In the NO2 regression, adding the five-level regime factor increased R² from 0.194 to 0.251, providing a quantified gain in explained variability within the same parsimonious model form. For missing-data reconstruction, the fold-safe, regime-aware random forest approach reproduced a fully missing H2S site–day with strong cross-validated performance in this proof-of-concept case, while explicitly avoiding leakage by learning regimes within training folds.
In leave-one-site-out tests, NO2 showed the strongest cross-site portability in this dataset, whereas several other channels were strongly site-dependent and included negative R² values, indicating performance worse than a site-mean baseline. These results summarize both what transfers across the monitored micro-environments in this campaign and what remains local, and they define the practical operating range of the proposed workflow for short field deployments.