1. Introduction
Volatile organic compounds (VOCs) are a significant group of organic chemicals that vaporize easily and readily enter the environment in gaseous form under normal conditions [1,2,3,4]. Many are synthetic chemicals broadly used in the production of numerous day-to-day products for residential and commercial applications [5]. VOCs can occur naturally in the environment as biogenic compounds emitted by plants [6], or they can be anthropogenic, resulting from human activities [7]. Although biogenic emissions persist, there has been a substantial rise in anthropogenic VOC emissions over recent decades due to increasing industrialization and urbanization [5]. Urban areas are particularly vulnerable to anthropogenic VOC emissions, with major sources including vehicular traffic and industrial operations [8]. Vehicle emissions remain a predominant contributor, with VOC levels influenced by vehicle type, age, fuel composition, engine efficiency, driving patterns, and maintenance [9]. Emissions often include a wide spectrum of compounds such as alkanes, alkenes, aromatics, and halocarbons: for instance, hexene, pentene, butene, butadiene, dodecane, undecane, decane, octane, methyl-cyclohexane, diethylbenzene, propylbenzene, trimethylbenzene, ethylbenzene, styrene, benzene, toluene, and various chlorinated hydrocarbons [10].
In addition to outdoor sources, VOCs are widespread in indoor environments [11,12]. Modern residential, commercial, and institutional buildings often utilize chemical-based products during construction and furnishing such as paints, adhesives, sealants, varnishes, and cleaning agents that emit VOCs [13,14,15]. Common indoor VOCs may originate from carpets (e.g., benzene, styrene) [16], household cleaners (e.g., formaldehyde, xylene) [17], personal care products (e.g., toluene) [18], electronics (e.g., formaldehyde) [19], and plastic materials (e.g., ethylbenzene) [20]. Tobacco smoke is another significant indoor source, frequently containing hazardous compounds like benzene, toluene, ethylbenzene, xylene (BTEX), and formaldehyde [21]. Once released into the atmosphere, VOCs undergo chemical reactions with ambient pollutants and solar radiation, significantly influencing tropospheric chemistry [22]. These reactions alter the concentrations of hydroxyl radicals (OH), contribute to the formation of secondary organic aerosols and organic acids, and facilitate the production of ozone through photochemical processes [23]. In areas with intense sunlight such as the western Mediterranean region, the interplay between biogenic and anthropogenic precursors can intensify ozone formation [24].
VOCs in urban environments originate from a combination of anthropogenic and biogenic sources, each contributing differently to the overall emission profile. Transportation is often the dominant source in high-traffic corridors, accounting for over 50% of total urban VOC emissions in several European cities, primarily due to fuel combustion and evaporative losses from petrol and diesel vehicles [25,26]. Common species emitted include benzene, toluene, ethylbenzene, xylenes (BTEX), ethene, and propene. Industrial sources contribute significantly to localised emissions, especially near refineries, manufacturing hubs, or chemical plants, where chlorinated hydrocarbons, alkanes, and aromatics are released during solvent use, degreasing, and chemical manufacturing [27]. Biogenic sources, particularly vegetation, emit isoprene and monoterpenes, which can be substantial during summer months and are highly reactive in ozone formation processes [28]. Although their contribution to total VOC levels is typically lower in urban cores, they can play a significant role in ozone production under VOC-limited regimes. Understanding the relative contributions from these sources is essential for accurately apportioning VOCs and designing effective emission control strategies. Studies frequently report strong correlations between VOCs such as benzene, ethylbenzene, toluene, ethene, and propene, suggesting common emission sources, especially vehicular exhaust and fossil fuel combustion [29,30].
Numerous studies have applied a range of statistical and analytical methods to understand the behaviour, sources, and impacts of VOCs and their role in ozone formation. Correlation and regression analyses are commonly used to identify interrelationships and common sources among VOC species [31]. Principal Component Analysis (PCA) and Positive Matrix Factorization (PMF) are widely used for source apportionment and dimensionality reduction in VOC datasets [32,33,34]. Diurnal and seasonal pattern analysis is frequently employed to link VOC variability with traffic patterns and boundary layer dynamics [35]. Additionally, clustering techniques such as K-means and hierarchical clustering have gained popularity for identifying pollution regimes and understanding atmospheric processing [36,37]. However, only a limited number of studies have combined long-term high-resolution datasets with unsupervised machine learning and lagged cross-correlation analyses to explore dynamic VOC–ozone interactions under diverse meteorological conditions, particularly in dense urban settings in the UK. This study bridges that gap through a novel integrative framework.
Ozone is a secondary pollutant formed through complex photochemical reactions involving VOCs and NOX in the presence of sunlight [38]. The primary mechanism begins with the photolysis of NO2:
$$\mathrm{NO_2} + h\nu \rightarrow \mathrm{NO} + \mathrm{O(^3P)}$$
$$\mathrm{O(^3P)} + \mathrm{O_2} + \mathrm{M} \rightarrow \mathrm{O_3} + \mathrm{M}$$
VOCs influence ozone formation by producing peroxy radicals (RO2•) during their oxidation, which convert NO to NO2 without consuming ozone:
$$\mathrm{RO_2^{\bullet}} + \mathrm{NO} \rightarrow \mathrm{RO^{\bullet}} + \mathrm{NO_2}$$
This NO2 can then photolyze again, producing more ozone. The efficiency of this cycle depends heavily on the VOC-to-NOX ratio. In VOC-limited regimes (common in urban areas with high NOX emissions), adding more VOCs increases ozone production, whereas reducing NOX can initially lead to more ozone due to decreased titration [39]. In contrast, in NOX-limited regimes (often in rural or downwind regions with low NOX), ozone formation is limited by NOX availability, and VOC reductions have less impact [40]. VOC species also differ in their reactivity; alkenes and aromatics like isoprene and toluene form ozone more efficiently due to their high reactivity and radical generation potential under sunlight.
Environmental factors such as temperature, solar radiation, and boundary layer dynamics further influence these reactions. Elevated temperatures and intense sunlight accelerate VOC oxidation, enhance photolysis rates, and thus increase ozone formation. These mechanisms explain the observed inverse seasonal patterns in VOC and ozone concentrations and underscore the importance of considering chemical regimes when designing mitigation policies.
From a health perspective, long-term exposure to VOCs is increasingly associated with various adverse effects [41,42,43]. Due to their chemical reactivity, VOCs can cause toxic, allergic, mutagenic, and carcinogenic outcomes depending on exposure levels and durations [44,45]. For instance, Wang et al. (2025) analysed data from the U.S. National Health and Nutrition Examination Survey (NHANES 2011–2020) and found that elevated VOC biomarkers were significantly associated with increased cardiovascular risk indicators, including blood pressure and systemic inflammation markers [42]. Health risks also vary based on compound type, exposure environment, and individual susceptibility. Prolonged exposure, especially in indoor settings, has been linked to serious outcomes such as cancer [46]. Tsai (2019) reviewed VOCs regulated as indoor air pollutants and concluded that several, including benzene and trichloroethylene, are linked to leukaemia, liver toxicity, and neurobehavioral effects [46]. Notably, compounds like trichloroethylene, vinyl chloride, benzene, and formaldehyde are recognized for their high toxicity and carcinogenic potential [47,48]. McCarthy et al. (2006) assessed background concentrations of 18 air toxics, including benzene and formaldehyde, and reported elevated cancer risk levels associated with ambient exposure in urban North America [48]. Certain studies have connected domestic exposure, e.g., cooking fuels or poor ventilation, to elevated cancer risk, particularly among women and children [49,50]. Other research has highlighted links between VOCs and asthma exacerbation or cardiovascular issues [51]. In a meta-analysis, Alford and Kumar (2021) found consistent links between indoor VOC exposure and respiratory symptoms, including coughing, wheezing, and asthma onset in children and adults [12]. These findings reinforce the urgency of controlling VOC emissions in urban environments, both to meet air quality standards and to protect long-term public health. Despite their widespread presence, public awareness of VOC exposure remains limited due to their often-hidden presence in consumer products and indoor environments.
Beyond the air, VOCs are also detected in soil and water [52]. Groundwater contamination can occur through industrial spills or improper waste disposal. This contamination may pose health risks when groundwater is used as a drinking source [53]. Detecting VOCs in water bodies poses analytical challenges due to their volatile nature and the sensitivity required in sampling. Analytical methods include gas chromatography (GC), mass spectrometry (MS), and more advanced setups like purge-and-trap GC/MS (e.g., EPA Method 524.2), headspace solid-phase microextraction (HS-SPME), surface acoustic wave sensors (SAW), ion mobility spectrometry (IMS), and photoionization detection (PID) [54]. GC and GC/MS techniques are especially favoured due to their high accuracy and sensitivity. Given their pervasive nature, diverse sources, and complex behaviour, VOCs represent a critical concern for environmental monitoring and public health. Their impact spans atmospheric chemistry, human health, and ecosystem integrity. Accurate estimation and prediction of VOC dispersion are crucial, particularly in urban areas with dense human activity and overlapping sources. Dispersion modelling techniques, including atmospheric models, are vital for simulating how VOCs move through and react within the atmosphere. When combined with extensive datasets and modern tools such as machine learning, these models can yield deeper insights into VOC patterns, enhance forecasting capabilities, and inform policy decisions for effective air quality management.
Despite extensive work on VOC emissions and ozone formation, several critical gaps persist. Most studies focus on short-term monitoring campaigns or isolated pollutants, limiting our understanding of long-term dynamics and inter-species behaviour under varying meteorological conditions. Moreover, few investigations in UK urban environments have employed an integrated framework combining multivariate statistics, unsupervised machine learning (PCA and clustering), and time-lagged cross-correlation to disentangle the complex interactions between VOCs and ozone. This study addresses these gaps by analysing an 8-year high-resolution dataset from a key urban traffic corridor, applying novel analytical tools to uncover emission patterns, temporal behaviours, and photochemical regimes. Such an approach enhances our ability to inform policy and design targeted interventions for effective urban air quality management.
Hence, this research aims to systematically investigate the ambient behaviour, temporal dynamics, and source characteristics of key VOCs in an urban environment by integrating advanced statistical analyses, including correlation matrices, linear regression, and cross-correlation techniques, with meteorological and ozone data. The objectives include (i) identifying major VOC species contributing to ozone formation under VOC-limited regimes, (ii) characterizing temporal (diurnal, weekly, seasonal) variability of VOCs and ozone to infer patterns of emission and transformation, (iii) evaluating the spatial influence of emission sources using wind-sector analyses and polar plots, and (iv) applying principal component analysis and clustering methods to classify pollution regimes and understand underlying atmospheric processes. Through these analyses, the study seeks to provide insights into VOC–ozone interactions, highlight the significance of anthropogenic and meteorological drivers, and inform targeted mitigation strategies for air quality management in densely populated urban areas.
2. Methodology
2.1. Instrumentation and Data Collection
Data for this study were collected from the Marylebone Road supersite in central London, a well-established urban air quality monitoring location characterized by heavy traffic and diverse emission sources. Measurements focused on key VOCs including benzene, toluene, ethylbenzene, ethene, propene, isoprene, propane, and ethyne, alongside ozone and meteorological parameters. VOCs, O3, and meteorological parameters were measured at a temporal resolution of 15 min between 1 January 2015 and 1 January 2023. The primary instrument used for VOC detection was a Hewlett-Packard Gas Chromatograph with Flame Ionisation Detector (GC-FID), operated in compliance with the protocols of the UK Hydrocarbon Monitoring Network. This system provides high temporal resolution, sensitivity, and compound specificity for hydrocarbons. In addition, benzene (C6H6) was measured independently using a Differential Optical Absorption Spectroscopy (DOAS) system for cross-validation and source attribution analysis. O3 concentrations were recorded via a UV photometric O3 monitor, which meets the EU reference method standards for ambient O3.
The meteorological data were concurrently measured to account for the local atmospheric dynamics influencing VOC variability. These parameters included ambient temperature (°C), relative humidity (%), atmospheric pressure (hPa), wind speed (m/s), wind direction (degrees from North), global solar radiation (W/m2), and precipitation (mm), captured via an on-site meteorological station equipped with standard meteorological sensors.
2.2. Supplementary Regional Meteorological Data
To overcome the micro-scale limitations of the street canyon and better reflect synoptic meteorological trends, data were obtained from a regional urban background site (Station: 51.505° N, 0.055° W). Hourly air temperature, wind speed, wind direction, and atmospheric pressure data were retrieved from NOAA’s Integrated Surface Database using the worldmet package in R. These additional datasets were used to support the analysis of large-scale air mass transport and to verify observed patterns at the roadside site.
2.3. Data Preprocessing and Quality Control
All pollutant and meteorological data underwent a rigorous quality control (QC) protocol: instrument error flags, calibration periods, and invalid readings were removed. Data were checked for continuity and synchronized across all time series. Minor missing values (<1%) were interpolated. VOC concentrations were log-transformed (where necessary) to normalize skewed distributions. All measurements were aggregated to hourly means for consistency and computational efficiency. Variables were standardized (z-score normalization) prior to statistical and machine learning analyses to allow fair comparison between differing units and scales.
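As an illustrative sketch only (not the code used in this study), the aggregation, log-transformation, and z-score steps described above could be expressed in Python as follows. The function names and sample readings are hypothetical, and missing readings are simply skipped here rather than interpolated, which is a simplification.

```python
import math
from statistics import mean, pstdev

def hourly_means(samples):
    """Average (hour_index, value) pairs into one mean per hour; None is skipped."""
    buckets = {}
    for hour, value in samples:
        if value is not None:
            buckets.setdefault(hour, []).append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}

def log_transform(values):
    """Natural-log transform for right-skewed, strictly positive concentrations."""
    return [math.log(v) for v in values]

def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical 15-min readings (ug/m3), each tagged with its hour index.
raw = [(0, 0.8), (0, 1.0), (0, None), (0, 1.2),
       (1, 2.0), (1, 2.0), (1, 2.0), (1, 2.0),
       (2, 4.0), (2, 4.0), (2, 4.0), (2, 4.0)]

hourly = hourly_means(raw)
standardized = zscore(log_transform(list(hourly.values())))
```

After standardization every variable has zero mean and unit variance, so species measured on different scales contribute comparably to the later PCA and clustering steps.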
2.4. Temporal and Statistical Analyses
Temporal behaviour of VOCs and O3 was analysed using diurnal, weekly, and seasonal cycle plots, Spearman correlation matrices, and cross-correlation function (CCF) analysis to explore time-lagged interactions between VOCs and O3. Linear regression models were used to assess co-variation among selected VOCs and infer shared source categories. Furthermore, wind sector and polar plot analyses were used to explore spatial influences and directional trends in pollutant concentrations, particularly from key sectors such as SW, SSW, and WSW, which are indicative of traffic and industrial source areas.
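The wind-sector step can be illustrated with a minimal Python sketch. It assumes a 16-point compass with 22.5° sectors centred on each direction; the function names and sample values are hypothetical and do not come from the study.

```python
SECTORS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def wind_sector(direction_deg):
    """Map a wind direction (degrees from north) to a 16-point compass sector."""
    index = int((direction_deg % 360) / 22.5 + 0.5) % 16
    return SECTORS[index]

def mean_by_sector(directions, concentrations):
    """Mean concentration for each wind sector that occurs in the data."""
    totals = {}
    for direction, conc in zip(directions, concentrations):
        sector = wind_sector(direction)
        total, count = totals.get(sector, (0.0, 0))
        totals[sector] = (total + conc, count + 1)
    return {sector: total / count for sector, (total, count) in totals.items()}

# Hypothetical hourly wind directions and benzene concentrations (ug/m3).
sector_means = mean_by_sector([225, 230, 90], [2.0, 4.0, 1.0])
```

Comparing the sector means then indicates which wind directions coincide with elevated concentrations.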
2.5. Principal Component Analysis (PCA)
To reduce dimensionality and uncover latent pollutant patterns, Principal Component Analysis (PCA) was conducted on the standardized dataset. Variables included all eight VOC species, ozone, and meteorological parameters (temperature, wind speed, wind direction). The first two principal components accounted for the majority of the total variance (PC1 ≈ 40%, PC2 ≈ 15%), where PC1 captured vehicular and combustion-related VOCs (e.g., benzene, toluene, ethene, propene) while PC2 reflected biogenic and temperature-driven influences (e.g., isoprene and temperature). Higher-order PCs highlighted meteorological dispersion effects (e.g., wind speed/direction). The PCA output was interpreted through scree plots, loading scores, and biplots to distinguish between source profiles and environmental drivers.
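To make the idea concrete, the sketch below extracts only the first principal component of a small correlation matrix using power iteration. This is a deliberately simplified stand-in for a full PCA and is not the implementation used in this study; the data values and function names are hypothetical.

```python
import math

def standardize(columns):
    """Z-score each variable (population standard deviation)."""
    out = []
    for col in columns:
        mu = sum(col) / len(col)
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col))
        out.append([(v - mu) / sd for v in col])
    return out

def correlation_matrix(zcols):
    """Correlation matrix of already standardized variables."""
    n = len(zcols[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in zcols] for ci in zcols]

def first_component(matrix, iterations=500):
    """Leading eigenvector (loadings) and eigenvalue by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        v * sum(m * w for m, w in zip(row, vec)) for v, row in zip(vec, matrix)
    )
    return vec, eigenvalue

# Hypothetical hourly values: two co-emitted VOCs and anti-correlated ozone.
benzene = [1.0, 2.0, 3.0, 4.0, 5.0]
toluene = [2.1, 3.9, 6.2, 8.0, 9.9]
ozone = [50.0, 40.0, 30.0, 20.0, 10.0]

corr = correlation_matrix(standardize([benzene, toluene, ozone]))
loadings, eigenvalue = first_component(corr)
explained = eigenvalue / len(corr)  # share of total variance carried by PC1
```

Because the trace of a correlation matrix equals the number of variables, dividing the eigenvalue by that number gives the explained-variance share of the kind reported for PC1 and PC2 above.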
2.6. K-Means Clustering Analysis
A K-means clustering algorithm was applied to the PCA-transformed data (first six PCs retained) to categorize distinct atmospheric regimes. Due to computational constraints, a random subset of 500 hourly observations was clustered initially, with cluster labels mapped back to the full dataset using proximity-based classification. The optimal number of clusters (k = 3) was chosen based on the elbow method applied to within-cluster sum of squares (WSS). The resulting clusters were interpreted as follows: Cluster 1: high VOCs, low O3—fresh primary emissions under stagnant, cool conditions. Cluster 2: moderate VOCs and O3—transitional regime indicating partial photochemical processing. Cluster 3: low VOCs, high O3—aged air masses where VOCs have reacted, leading to secondary ozone accumulation under warm, windy conditions. Cluster profiles were visualized using spider (radar) plots, allowing intuitive comparison of pollutant and meteorological fingerprints across regimes.
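The subset-then-assign procedure can be sketched as follows. This is a hedged illustration rather than the study's code: the subset is taken by simple striding instead of random sampling, the deterministic farthest-point seeding is an assumption made so that the example is reproducible, and nearest-centroid assignment stands in for the proximity-based classification. All names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start from the first point, then
    repeatedly add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Hypothetical PC scores: three tight groups standing in for three regimes.
centres = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
scores = [[cx + 0.01 * i, cy - 0.01 * i] for cx, cy in centres for i in range(10)]

subset = scores[::2]                              # stand-in for the random subset
centroids = kmeans(subset, k=3)                   # cluster the subset only
labels = [nearest(p, centroids) for p in scores]  # map labels back to all rows
```

Clustering a subset keeps the iterative step cheap, while the final assignment is a single pass over the full dataset.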
3. Results and Discussion
3.1. Correlation Analysis of VOCs, O3, and Meteorological Variables
The Spearman correlation analysis elucidated the relationships among VOC species and O3 (as shown in Figure 1). Strong positive correlations were observed among benzene, ethylbenzene (EBenzene), toluene, ethene, and propene, with correlation coefficients exceeding 0.6 in many cases, indicating a shared source, likely from vehicular emissions and fossil fuel combustion.
Notably, ethene and propene showed particularly strong associations (r ≈ 0.68), reinforcing their co-emission from anthropogenic activities. In contrast, ozone exhibited moderate negative correlations with most VOCs (e.g., −0.56 with benzene, −0.54 with ethylbenzene, −0.56 with toluene), suggesting that higher VOC concentrations are often associated with lower ozone levels at the measurement timescale, likely due to VOC-limited ozone formation regimes typical in urban environments. These findings imply that while VOCs contribute to ozone production through photochemical reactions, the presence of high VOC levels might simultaneously reflect periods of less-efficient ozone formation, possibly influenced by titration effects with NO. Overall, the correlation patterns highlight the intertwined dynamics of primary emissions and secondary pollutant formation in the studied environment.
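For readers unfamiliar with the statistic, a Spearman coefficient is the Pearson correlation of the ranked data. The minimal Python sketch below shows the calculation on hypothetical values; it is not the code used to produce Figure 1.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical concentrations: monotonic but non-linear relationships.
benzene = [0.5, 0.9, 1.4, 2.0, 3.5]
toluene = [1.0, 2.5, 3.0, 6.0, 20.0]
ozone = [60.0, 41.0, 30.0, 22.0, 5.0]

rho_voc = spearman(benzene, toluene)
rho_o3 = spearman(benzene, ozone)
```

Because only ranks are used, the coefficient is insensitive to the skewed distributions typical of VOC concentrations.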
3.2. Linear Regression Analysis of Concerned VOCs
The linear regression analysis between benzene (C6H6) and ethylbenzene (C6H5C2H5) at Marylebone Road reveals a strong positive relationship (as shown in Figure 2a), with a slope of 1.26, indicating that for every 1 μg/m3 increase in ethylbenzene concentration, benzene concentration increases by approximately 1.26 μg/m3, and an R2 value of 0.75, indicating that 75% of the variability in benzene can be explained by ethylbenzene levels. The positive correlation suggests that both compounds share a common emission source, likely vehicular exhaust and fossil fuel combustion. The observed intercept of 0.24 implies a baseline concentration of benzene even when ethylbenzene is low, possibly from background sources or photochemical reactions. In urban environments, benzene and ethylbenzene are known to undergo photochemical oxidation, leading to the formation of O3 via reactions with OH and NOX. In generic form, the reaction sequence for VOCs like benzene and ethylbenzene is as follows:
$$\mathrm{VOC} + \mathrm{OH} \xrightarrow{\mathrm{O_2}} \mathrm{RO_2^{\bullet}}$$
$$\mathrm{RO_2^{\bullet}} + \mathrm{NO} \rightarrow \mathrm{RO^{\bullet}} + \mathrm{NO_2}$$
These reactions contribute to ozone formation in the presence of abundant VOCs and sunlight, though high NOX concentrations can suppress ozone production via the titration effect, where NO reacts with ozone:
$$\mathrm{NO} + \mathrm{O_3} \rightarrow \mathrm{NO_2} + \mathrm{O_2}$$
The negative correlation between VOCs and ozone observed in some studies reflects this dynamic, where high VOC levels coincide with reduced ozone formation under high NOX conditions. Similar correlations between benzene and ethylbenzene have been observed in other urban studies, such as those conducted in Beijing [55,56], highlighting the significant role of vehicular emissions in contributing to urban air pollution. These findings underline the environmental and health risks posed by these toxic VOCs, which are associated with increased risks of leukaemia (for benzene) and neurotoxicity (for ethylbenzene). Reducing their levels would require addressing emissions from traffic and promoting cleaner, low-emission technologies.
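The slope, intercept, and R2 quoted for these regressions follow from ordinary least squares. The sketch below shows the calculation in Python on synthetic points placed exactly on the reported benzene–ethylbenzene line (slope 1.26, intercept 0.24); the data are illustrative, not measurements, and the function name is hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least squares fit y = intercept + slope * x, with R2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic ethylbenzene values (ug/m3) and benzene values on the reported line.
ethylbenzene = [0.2, 0.5, 1.0, 1.5]
benzene = [0.24 + 1.26 * v for v in ethylbenzene]

slope, intercept, r_squared = linear_fit(ethylbenzene, benzene)
predicted_at_1 = intercept + slope * 1.0  # benzene expected at 1 ug/m3 ethylbenzene
```

With real observations the points scatter around the line, so R2 falls below 1, as in the value of 0.75 reported above.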
The linear regression analysis between toluene and ethylbenzene (as shown in Figure 2b) reveals a moderate positive association, described by the equation Y = 1.06 + 2.8X with an R2 value of 0.418. This indicates that while both pollutants share common sources, primarily vehicular exhaust, fuel evaporation, and industrial solvent use, their emissions and atmospheric behaviour are not entirely synchronized. The relatively steep slope suggests that toluene concentrations rise more rapidly than ethylbenzene, and the non-zero intercept (1.06 μg/m3) implies a persistent background level of toluene, possibly due to additional inputs from commercial and industrial solvent applications or more localized emissions. Though the key photochemical reactions have been previously discussed, it is important to note that toluene, like ethylbenzene, undergoes OH-initiated oxidation, forming peroxy radicals, contributing to ozone formation under suitable sunlight and NOX conditions. The moderate correlation may also reflect differences in atmospheric lifetimes, reactivities, or proximity to emission sources. Monod et al. (2001) found strong toluene–ethylbenzene correlations (R2 ≈ 0.94) in traffic-related samples, but weaker correlations in urban background air, due to additional sources of toluene (e.g., solvents, paint, industrial use) [57].
The current study’s moderate R2 value (0.418) suggests a similar pattern, indicating partially shared sources but also the influence of diverse urban emission sources, which is consistent with their findings in mixed-source environments. Kheirbek et al. (2012) observed that traffic density and industrial activities influenced both toluene and ethylbenzene concentrations, again implying shared but not identical emission origins, supportive of this study’s regression results, where the association is moderate but not strong [58]. Similar source-divergent patterns in aromatic VOCs have been observed in other urban settings. For example, Na et al. (2005) [59] found variable contributions of mobile and evaporative sources to aromatic VOC levels in Seoul, while Mandal et al. (2023) [60] reported distinct diurnal and seasonal VOC trends in Delhi tied to traffic intensity and industrial activities. These studies reinforce the interpretation that aromatic VOCs in urban corridors like Marylebone Road arise from a complex mix of emissions, atmospheric processes, and chemical transformations, necessitating compound-specific mitigation strategies.
The regression analysis between benzene and propene concentrations (as shown in Figure 2c) gave the equation Y = −0.07 + 0.96X with an R2 value of 0.433, indicating a moderate positive correlation. This suggests a partial overlap in their emission sources, predominantly vehicular exhaust and combustion of fossil fuels, both of which are known to emit aromatic hydrocarbons (e.g., benzene) and light alkenes (e.g., propene). The near-unity slope (0.96) indicates a proportional relationship between their concentrations, while the negative intercept may reflect instrument detection limits or background variability at low propene levels. From an atmospheric chemistry perspective, benzene is relatively chemically stable, with an atmospheric lifetime of several days, whereas propene is much more reactive due to its carbon–carbon double bond, undergoing rapid oxidation via hydroxyl radicals (OH) and contributing to tropospheric O3 and peroxyacetyl nitrate (PAN) formation. Despite their co-emission, propene’s faster photochemical degradation compared to benzene may account for the moderate R2, rather than a stronger association. Similar patterns have been observed in other urban environments. For instance, Ait-Helal et al. (2014) conducted a study in suburban Paris and reported that while benzene and propene are both emitted from traffic-related sources, their ambient concentrations and correlations are influenced by seasonal variations and atmospheric processing [61]. The study highlighted that propene levels exhibited significant diurnal and seasonal variability due to its higher reactivity, whereas benzene showed more stable concentrations. This differential behaviour underscores the complexity of VOC dynamics in urban atmospheres and the importance of considering both emission sources and atmospheric chemistry when interpreting pollutant relationships.
The regression analysis between ethene and propene concentrations (as shown in Figure 2d) yielded the equation Y = 0.13 + 2.02X with an R2 value of 0.53, indicating a moderate positive correlation. This suggests that while ethene and propene share common emission sources, such as vehicular exhaust and fossil fuel combustion, their atmospheric behaviours and reactivities differ. Both compounds are reactive alkenes that play significant roles in urban photochemistry, particularly in the formation of tropospheric ozone and secondary organic aerosols. Their atmospheric lifetimes are relatively short due to rapid reactions with hydroxyl radicals, leading to the production of formaldehyde (CH2O) and other photochemical oxidants. The observed moderate correlation may reflect the influence of varying emission strengths, atmospheric processing, and differing reactivities under urban conditions [62].
3.3. Temporal Dynamics of VOC–Ozone Interactions via Cross-Correlation Analysis
The Cross-Correlation Function (CCF) analysis helps identify the time-lagged relationships between VOCs and ozone concentrations, revealing whether changes in VOC levels precede or follow changes in ozone. This is crucial for understanding the temporal dynamics of ozone formation driven by VOC emissions under photochemical conditions. The correlations are computed over lags from −24 to +24 h, where a positive lag indicates that the VOC leads O3.
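The lag convention can be illustrated with a minimal Python sketch in which a positive lag pairs the VOC at time t with ozone at time t + lag, so the VOC leads. As a simplification, each coefficient is a Pearson correlation over the overlapping segment rather than the full-series normalization used by standard CCF routines; the data and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ccf(voc, ozone, max_lag):
    """Correlation of voc[t] with ozone[t + lag] for each lag.

    Positive lag: the VOC leads ozone. Negative lag: ozone leads the VOC.
    """
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = voc[:len(voc) - lag], ozone[lag:]
        else:
            a, b = voc[-lag:], ozone[:len(ozone) + lag]
        out[lag] = pearson(a, b)
    return out

# Hypothetical hourly series: ozone mirrors the VOC two hours later, inverted.
voc = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 9.0, 6.0, 2.0, 5.0, 1.0]
ozone = [0.0, 0.0] + [-v for v in voc[:-2]]

result = ccf(voc, ozone, max_lag=3)
strongest_lag = min(result, key=result.get)  # lag with the most negative value
```

In this constructed example the most negative coefficient appears at lag +2, which is the signature expected when the VOC change precedes the ozone response by two hours.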
The CCF analysis between benzene and ozone (as shown in Figure 3a) reveals a strong and consistent negative relationship across all time lags. The peak negative correlation occurs at lag 0 (−0.3749), suggesting a contemporaneous inverse relationship where higher benzene concentrations are associated with lower ozone levels. This trend extends over a wide temporal window, with notable negative correlations at lag −1 (−0.3540), lag −2 (−0.3261), and lag −3 (−0.2987), indicating that benzene levels following ozone changes (negative lags, ozone leading) are also inversely related. The strength of the negative correlation gradually diminishes at positive lags (VOC leading ozone), but the relationship remains negative throughout, with values like lag 1 (−0.3602), lag 2 (−0.3346), lag 5 (−0.2668), and up to lag 24 (−0.2108). This sustained pattern indicates a strong and persistent inverse association, suggesting that benzene does not play a direct ozone-forming role in this setting and might act more as a sink or reactant that consumes oxidants rather than promoting ozone buildup. For instance, Sharma et al. (2021) reported a moderate negative correlation between benzene and ozone concentrations, with correlation coefficients of r2 = 0.475 at DMS and r2 = 0.356 at NSIT, indicating that higher benzene levels are associated with lower ozone concentrations [63]. The study also highlighted that benzene concentrations are influenced by meteorological parameters, which in turn affect ozone formation.
For isoprene and O3 (as shown in Figure 3b), the CCF analysis revealed a peak positive correlation between isoprene and ozone concentrations at lags +19 to +21 h, with a maximum correlation coefficient of approximately 0.11. This gradual increase from lag −20 to 0, peaking around lag −5 to +5 and stabilizing until lag +21, suggests that isoprene emissions may precede ozone formation, albeit with a weak relationship. Isoprene’s role in ozone formation is likely secondary or dependent on other atmospheric conditions, such as the presence of nitrogen oxides (NOX) and sunlight. Studies have shown that isoprene oxidation contributes to ozone production, particularly under moderate NOX conditions, with the rate of ozone formation being influenced by NOX levels and solar radiation intensity [64].
For propene and O3 (as shown in Figure 3c), the peak negative correlation was at lag 0 (−0.4958), and the values remained strongly negative at lags −1 to −5, with correlations ranging from −0.4655 to −0.3286. The strength of the negative correlation decreased gradually as the lag moved positively but still remained substantial. For instance, at lags 1 to 5, the correlations were −0.4775, −0.4422, −0.4046, −0.3694, and −0.3435, respectively. Even at longer positive lags, such as lag 24, the value was still negative at −0.2586, indicating a sustained inverse association between propene and ozone levels over time. The immediate negative correlation suggests quick reactivity and possibly a precursor role in photochemical ozone production. Propene’s rapid reaction with ozone and its role in forming secondary organic aerosols have been documented, highlighting its significance in atmospheric processes [65].
A strong negative correlation was observed between ethene and ozone concentrations (as shown in Figure 3d) at lag 0, with a correlation coefficient of −0.5067. This indicates that high ethene levels coincide with lower ozone concentrations, and as ethene levels drop, ozone tends to rise. This inverse relationship suggests rapid reactivity, where ethene is consumed in ozone-producing reactions. Ethene reacts readily with ozone, leading to the formation of various products, and this reaction plays a significant role in atmospheric chemistry [66].
The toluene vs. O3 relationship (as shown in Figure 3e) also demonstrated consistent negative correlations across the entire lag period. The most negative value appeared at lag 0 with a correlation of −0.4942. High negative correlations were observed at lag −1 (−0.4614), lag −2 (−0.4194), and lag −3 (−0.3791). Positive lags exhibited slightly reduced but still negative correlations, such as lag 1 (−0.4755), lag 2 (−0.4379), lag 3 (−0.3952), and lag 4 (−0.3574). The correlation gradually weakened over time, with lag 24 showing a value of −0.2342. Although the strength of correlation declined across positive lags, the overall trend remained negative throughout the range. This inverse relationship suggests ozone-forming potential through photochemical oxidation, with toluene being depleted as ozone builds up. Toluene’s photochemical reactions with oxygen atoms lead to the formation of various products, contributing to ozone formation in the atmosphere [67].
3.4. Temporal (Diurnal, Monthly, Weekly) Variability of VOCs and O3
Hourly trends for five VOCs (benzene, Ebenzene, ethane, ethene, ethyne) reflect a pronounced diurnal cycle that aligns closely with anthropogenic activity patterns and atmospheric boundary layer (ABL) dynamics (as shown in
Figure 4). Benzene and Ebenzene concentrations exhibit a clear bimodal distribution. Concentrations rise sharply in the early morning hours, peaking between 07:00 and 10:00, which coincides with morning rush hour traffic and a shallow boundary layer that inhibits vertical dispersion. For instance, benzene levels increase from around 0.74 µg/m3 at 06:00 to over 1.09 µg/m3 by 09:00. After midday, concentrations decline due to enhanced vertical mixing and photochemical degradation under increased solar radiation. A second, smaller peak occurs in the late afternoon to early evening, typically from 17:00 to 21:00, likely reflecting evening vehicular activity and a lowering ABL. Ethane, while still showing a bimodal profile, demonstrates a relatively stable concentration throughout the day, owing to its low reactivity and longer atmospheric lifetime. This suggests a combination of local and regional sources, including fossil fuel combustion and long-range transport. Conversely, ethene and ethyne exhibit sharper morning peaks and steeper declines in the afternoon, attributable to their higher reactivity with hydroxyl radicals and short atmospheric lifetimes. These compounds are strongly associated with fresh vehicular and industrial emissions, and their reduction throughout the day supports their rapid oxidative loss [
68,
69].
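The diurnal and monthly profiles discussed in this section amount to averaging concentrations by hour of day or by calendar month. A minimal sketch of that grouping step is given below, assuming the data are available as (timestamp, concentration) pairs. The function name `mean_by` and the numbers are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict
from datetime import datetime

def mean_by(records, key):
    """Average concentration per group, where `key` maps a timestamp to a group.

    Missing values given as None are skipped.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, value in records:
        if value is None:
            continue
        group = key(ts)
        sums[group] += value
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sorted(sums)}

# Illustrative values only (not the study's measurements)
records = [
    (datetime(2020, 1, 6, 6), 0.5), (datetime(2020, 1, 7, 6), 0.7),
    (datetime(2020, 1, 6, 9), 1.0), (datetime(2020, 7, 7, 9), 1.2),
    (datetime(2020, 7, 6, 14), 0.4), (datetime(2020, 7, 7, 14), None),
]

diurnal = mean_by(records, key=lambda ts: ts.hour)   # hour-of-day profile
monthly = mean_by(records, key=lambda ts: ts.month)  # calendar-month profile
peak_hour = max(diurnal, key=diurnal.get)
```

Passing a different `key` (for example `ts.weekday()`) would give the weekly profile in the same way.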
Monthly variations in VOC concentrations reveal a distinct seasonal cycle. Benzene and Ebenzene concentrations peak in winter (January–February), with benzene reaching as high as 1.34 µg/m3 in January. Levels gradually decline towards the summer months, reaching a minimum between May and July (~0.74–0.83 µg/m3). This trend reflects a combination of factors: in winter, reduced photochemical activity limits the atmospheric degradation of VOCs, and lower mixing heights lead to pollutant accumulation near the ground. Moreover, cold-start vehicle emissions and increased heating-related combustion during winter months further exacerbate ambient concentrations [
70]. In contrast, summer months are characterised by enhanced photochemical activity, which facilitates the oxidation and removal of reactive VOCs. Additionally, greater ABL height and stronger atmospheric mixing reduce surface concentrations. Ethane displays a comparatively flatter seasonal profile, consistent with its chemical stability and partial contribution from background sources. Ethene and ethyne follow a similar pattern to benzene, exhibiting winter maxima and summer minima, again attributable to seasonal differences in atmospheric oxidation capacity and boundary layer conditions [
71].
These findings are consistent with previous studies across urban European environments, which have documented wintertime accumulation of VOCs due to low dispersion and limited photochemical degradation, alongside morning and evening peaks driven by local traffic emissions [
68,
69,
70,
71]. The VOC behaviour observed in the current study is characteristic of heavily trafficked urban areas, reinforcing the significance of vehicular emissions and atmospheric processes in shaping VOC exposure patterns.
Further analysis of the remaining VOCs (propene, propane, isoprene, toluene) and O3 is shown in
Figure 5. Propane levels exhibit a clear morning peak beginning at around 6 AM, reaching maximum concentrations between 8 and 10 AM (~1.18 μg/m3), which coincides with traffic rush hours. These levels decline through the day, hitting their lowest concentrations between 3 and 5 AM (~0.93 μg/m3). Similarly, isoprene shows a sharp mid-morning peak (8–10 AM) of around 1.12 μg/m3, after starting the day at significantly lower levels (~0.53–0.63 μg/m3). Toluene behaves in much the same way, peaking between 8 and 10 AM (1.17–1.20 μg/m3) and reaching its lowest levels from 3 to 5 AM (~0.51 μg/m3). The behaviour of O3 contrasts with that of the VOCs. O3 reaches its maximum values during nighttime, around 2–4 AM (~1.24 μg/m3), but dips sharply between 8 and 9 AM (~0.695–0.735 μg/m3). This pattern aligns with titration of O3 by NO during morning traffic peaks, where freshly emitted NO from vehicle exhaust reacts with ambient ozone, a well-documented mechanism in urban air chemistry [
72].
Seasonally, propane concentrations peak during winter, particularly in January (~1.36 μg/m3), due to reduced atmospheric dispersion and increased heating-related emissions. A marked dip occurs from April to July (0.83–0.85 μg/m3), likely reflecting enhanced photochemical degradation and atmospheric mixing. A similar pattern is seen with propene, which shows highs in January and October (~1.07–1.19 μg/m3) and spring/summer lows (April–July, ~0.84–0.86 μg/m3). While isoprene and toluene follow comparable seasonal cycles, O3 concentrations show a more complex seasonal pattern, influenced by both precursor availability and the solar radiation that drives photochemical ozone formation.
These findings echo previously reported trends in urban atmospheric studies. For example, Monks et al. (2009) highlight how VOCs such as toluene and propane show morning peaks aligned with traffic activity, while ozone exhibits early morning minima due to titration [
73]. Furthermore, von Schneidemesser et al. (2010) observed elevated wintertime VOC concentrations across European cities, attributed to both anthropogenic activity and meteorological stagnation [
74]. Hence, the observations highlight clear diurnal and seasonal recurrence in urban pollutant behaviour. Morning VOC peaks coincide with traffic emissions, while O3 dips during high-NOx periods underscore the importance of titration processes. The seasonal rise of VOCs in winter and photochemical O3 variations in response to solar input and precursor levels are consistent with known atmospheric chemistry patterns. These recurrences are not only expected but have been systematically documented across global urban environments, reaffirming the importance of historical pattern recognition in environmental modelling.
3.5. Wind Sector Analysis of VOCs and Ozone Concentrations
The VOCs, including benzene, ethylbenzene, ethene, ethyne, isoprene, propane, propene, and toluene, exhibit similar wind direction-based concentration patterns, particularly influenced by wind sectors from the southwest (SW), west-southwest (WSW), and south-southwest (SSW), as shown in
Figure 6. These directions consistently correspond to the highest average concentrations of pollutants, suggesting that the major emission sources are localized toward the southwest of the monitoring site, possibly from traffic corridors, industrial operations, and fuel-handling activities typical of urban and semi-industrialized areas.
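The sector-based averaging behind this analysis can be sketched as follows, assuming wind direction is recorded in degrees clockwise from north and binned into sixteen 22.5° compass sectors centred on the cardinal points. The helper names and the observations are hypothetical, and the binning convention is an assumption rather than a documented detail of this study.

```python
SECTORS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def sector_of(direction_deg):
    """Map a wind direction in degrees (0 = north, clockwise) to a 16-point sector."""
    index = int((direction_deg % 360) / 22.5 + 0.5) % 16
    return SECTORS[index]

def sector_means(observations):
    """Mean concentration per sector from (direction_deg, concentration) pairs."""
    totals = {}
    for direction, conc in observations:
        sector = sector_of(direction)
        total, count = totals.get(sector, (0.0, 0))
        totals[sector] = (total + conc, count + 1)
    return {s: total / count for s, (total, count) in totals.items()}

# Illustrative observations only (not the study's measurements)
obs = [(225, 0.9), (230, 0.8), (250, 0.7), (90, 0.3), (5, 0.2), (355, 0.4)]
means = sector_means(obs)
dominant_sector = max(means, key=means.get)
```

Directions just either side of north (5° and 355° here) fall into the same sector because each bin is centred on its compass point.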
Benzene and ethylbenzene both exhibit their highest concentrations in the SSW, SW, and WSW sectors (benzene: SSW: 1.01 μg/m3, SW: 0.850 μg/m3, WSW: 0.841 μg/m3; ethylbenzene: SW: 0.462 μg/m3, WSW: 0.437 μg/m3, SSW: 0.428 μg/m3); the full sector-wise concentration data are provided in Appendix A, Tables A1–A8. This suggests similar source regions related to vehicular emissions and industrial activities, as these compounds are often associated with combustion processes and solvent use [
75]. The proximity of the monitoring site to major transportation routes and industrial zones likely influences these results, reinforcing the hypothesis of localized emission hotspots along these wind paths.
Similarly, ethene and propane show peak concentrations in the SSW, SW, and WSW sectors (ethene: SSW: 2.29 μg/m3, WSW: 2.14 μg/m3, SW: 2.06 μg/m3; propane: SW: 6.57 μg/m3, WSW: 6.52 μg/m3, SSW: 5.59 μg/m3), further corroborating the notion of localized pollution sources in the southwestern direction. Ethene, a byproduct of fossil fuel combustion, and propane, often associated with industrial operations and heating, exhibit a strikingly similar distribution. This similarity can be attributed to the shared emission sources, such as traffic and industrial zones, which are dominant in the southwest direction.
On the other hand, isoprene and toluene show some variations in their directional concentration patterns. Isoprene is predominantly emitted from vegetation and combustion sources, with its highest concentrations recorded in the SSW, S, and SSE wind sectors (isoprene: SSW: 0.0530 μg/m3, S: 0.0528 μg/m3, SSE: 0.0511 μg/m3). This could be indicative of biogenic emissions from nearby green spaces or vegetation in addition to traffic emissions. The observed patterns for isoprene may reflect mixed sources, with a combination of biogenic and anthropogenic contributions, as is common in urban areas with nearby natural green cover [
76]. In contrast, toluene, typically associated with industrial solvents and vehicle emissions, follows a pattern similar to that of benzene and ethylbenzene, with the highest concentrations found in the SW, WSW, and SSW sectors, confirming the dominance of vehicular and industrial sources.
Ethyne (acetylene), however, exhibits a distinct trend, with the highest concentrations observed in the E and ENE wind sectors (ethyne: E: 2.53 μg/m3, ENE: 2.29 μg/m3), pointing to an emission source located to the east or northeast of the monitoring site. This directional anomaly could indicate emissions from nearby industrial zones or regional pollution sources eastward of the site. Ethyne, being a byproduct of incomplete combustion, is also commonly associated with industrial activities and regional transport emissions, and this observation may highlight the influence of larger, more distant sources or regional transport patterns.
Finally, O3 concentrations show a clear dependence on wind direction, with the highest average concentrations observed in the SSE, SE, and S sectors (55.3 μg/m3, 50.4 μg/m3, and 44.7 μg/m3, respectively). Ozone, a secondary pollutant formed by photochemical reactions between VOCs and NOx under sunny conditions, tends to accumulate in regions with higher levels of precursor pollutants, which aligns with the high levels of VOCs in these sectors. The SSE and SE wind patterns may carry precursor pollutants from nearby traffic and industrial zones, promoting ozone formation in these regions. This directional trend supports the hypothesis that O3 formation is influenced by local pollution sources, such as vehicular emissions, and can be enhanced by meteorological factors such as sunlight and temperature.
3.6. Principal Component and Clustering Analysis of VOCs and Meteorological Variables in Urban Air Quality
3.6.1. Principal Component Analysis (PCA)
PCA was performed on the standardized VOC dataset to reduce dimensionality and identify key patterns as shown in the scree plot in
Figure 7. The first principal component (PC1) accounted for 39.79% of the total variance, while PC2 explained an additional 14.83%, leading to a cumulative variance of 54.63% by the second dimension. PC3 to PC6 contributed 10.25%, 8.59%, 6.97%, and 6.11%, respectively, cumulatively capturing 86.54% of the total variance. Beyond PC6, each additional component explained less than 5% of the variance, indicating diminishing returns. Based on the cumulative variance and the observed elbow point in the scree plot, retaining the first six principal components (PC1 to PC6) was considered sufficient for subsequent clustering and analysis.
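The cumulative figures quoted above follow directly from the per-component percentages, as the short sketch below shows. It uses only the six values reported in the text. The helper `components_needed` and the 80% target are hypothetical illustrations of a threshold-based retention rule, not the criterion applied in this study, which relied on the scree-plot elbow together with cumulative variance.

```python
from itertools import accumulate

# Variance explained (%) by PC1..PC6, as reported in the text
explained = [39.79, 14.83, 10.25, 8.59, 6.97, 6.11]

# Running total after each component, rounded to two decimals
cumulative = [round(total, 2) for total in accumulate(explained)]

def components_needed(explained_pct, target_pct):
    """Smallest number of leading components whose cumulative share reaches target_pct."""
    for k, total in enumerate(accumulate(explained_pct), start=1):
        if total >= target_pct:
            return k
    return None  # target not reached by the listed components

for k, (share, total) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{k}: {share:5.2f}%  cumulative {total:5.2f}%")

n_for_80 = components_needed(explained, 80.0)  # hypothetical 80% target
```

Summing the rounded values gives 54.62% after PC2, against the 54.63% quoted above; a difference of this size is what one would expect if the reported percentages were rounded individually.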
PC1 was dominated by contributions from VOC species such as ethene (15.85%), propene (15.72%), benzene (15.68%), Ebenzene (14.54%), and toluene (13.97%). The second principal component (PC2) was driven mainly by isoprene (36.57%) and air temperature (19.73%), indicating a strong influence of biogenic emissions and temperature-driven variability. The third principal component (PC3) was characterized by high contributions from wind direction (49.04%) and wind speed (28.71%), suggesting the importance of atmospheric dispersion processes. Similarly, PC4 was largely influenced by ethyne (43.80%) and air temperature (27.34%). These results indicate that both anthropogenic emissions (VOC-related) and meteorological factors (temperature, wind) play crucial roles in the observed variations in pollutant concentrations.
The combined interpretation of the scree plot and the variables’ contribution to PC1 and PC2 highlights the major environmental processes shaping VOC variability at the study site. The variables contributing to the PC1 and PC2 are shown in
Figure 8. The strong loading of traffic-related VOCs such as ethene, propene, benzene, Ebenzene, and toluene on PC1 reflects the dominant influence of primary anthropogenic emissions, mainly from vehicular and combustion sources. Meanwhile, the high contribution of isoprene and air temperature to PC2 signifies the role of biogenic activities and meteorologically driven processes, where warmer temperatures enhance natural VOC emissions. The third and fourth principal components, shaped by wind-related parameters (wind direction, wind speed) and specific VOCs like ethyne, further emphasize the significance of atmospheric dispersion and transport in modulating local pollutant concentrations. Taken together, these results suggest that the urban air composition is shaped by two main processes: (1) primary anthropogenic emissions (captured in PC1), heavily driven by traffic-related VOCs, and (2) secondary biogenic and photochemical processes (captured in PC2), influenced by natural emissions and temperature. The choice to focus on PC1 and PC2 in the variable contribution plot is justified, as these two dimensions together explain more than half of the total variance, offering the clearest insights into the dominant environmental processes affecting air quality at the study site. By identifying these dominant patterns, PCA not only simplifies the complex dataset but also provides a scientific basis for targeted air pollution control strategies, distinguishing between traffic management interventions and temperature- or wind-related considerations.
3.6.2. K-Means Clustering Analysis
To classify distinct air quality regimes based on VOCs, O3, and meteorological variables, K-means clustering was applied to the normalized (scaled) dataset. The variables considered included benzene, Ebenzene, ethene, ethyne, isoprene, propane, propene, toluene, O3, air temperature, wind speed, and wind direction. Given the computational limitations encountered during clustering of the full dataset, a random sample of 500 points was utilized for cluster assignment. Following this, cluster memberships were mapped back onto the full cleaned dataset for interpretation. The optimal number of clusters (k = 3) was determined based on visual inspection of within-cluster sum of squares (WSS) plots and empirical observations of the data structure.
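The workflow described above (fit on a random subsample, then map memberships back to the full dataset) can be sketched in a few lines. The version below is a plain Lloyd's-algorithm K-means on synthetic two-dimensional data. Assigning every full-dataset point to its nearest fitted centroid is one plausible way to map memberships back and is an assumption here, as are all function names, the deterministic farthest-first seeding (chosen so the sketch is reproducible), and the data themselves.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def farthest_first(points, k):
    """Deterministic seeding: each new seed is the point farthest from the seeds so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns the final centroids."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def wss(points, centroids):
    """Within-cluster sum of squares for a given set of centroids."""
    return sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in points)

# Synthetic, already-scaled data with three well-separated groups (illustrative only)
rng = random.Random(1)
centres = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
data = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        for cx, cy in centres for _ in range(200)]

sample = rng.sample(data, 60)                       # fit on a random subsample
wss_by_k = {k: wss(sample, kmeans(sample, k)) for k in range(1, 6)}  # elbow curve
centroids = kmeans(sample, k=3)
labels = [nearest(p, centroids) for p in data]      # map back to the full dataset
```

Plotting `wss_by_k` against k gives the kind of WSS curve used to choose the number of clusters; on data like these the decrease flattens after k = 3.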
The cluster characteristics were visualized using a spider plot (radar plot) as shown in
Figure 9, where each axis represents one of the standardized variables scaled between 0 and 1. In the spider plot, each coloured polygon corresponds to the average profile of one cluster across all variables, allowing intuitive visual comparison. Cluster 1 exhibited the highest normalized values for benzene, Ebenzene, ethene, ethyne, propane, propene, and toluene (all at or near 1.0 on the standardized scale) but had very low O
3 concentrations (scaled 0.0) and lower air temperatures and wind speeds. This indicates a pollution regime characterized by fresh, primary VOC emissions under relatively stagnant and cooler atmospheric conditions. Cluster 2 showed intermediate VOC concentrations (scaled around 0.3–0.4) and moderately elevated O
3 levels (scaled around 0.17) with slightly higher air temperatures and wind speeds compared to Cluster 1, suggesting transitional or mixed air masses where some photochemical processing of VOCs had occurred. Cluster 3, by contrast, showed the lowest VOC concentrations (scaled near 0.0 for most VOCs) but the highest O
3 levels (scaled 1.0), highest air temperature (1.0), and highest wind speed (1.0). This profile reflects aged air masses where primary VOCs have largely reacted, resulting in elevated secondary pollutants like ozone under warmer, sunnier, and windier conditions.
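Values of exactly 0.0 and 1.0 on the radar axes are what one would obtain if each variable's cluster means were min-max scaled across the three clusters. The sketch below illustrates that rescaling under this assumption. The scaling procedure is inferred from the reported values rather than stated in the text, and the cluster means used are hypothetical.

```python
def minmax_across_clusters(cluster_means):
    """Rescale each variable so the lowest cluster mean maps to 0 and the highest to 1."""
    variables = list(next(iter(cluster_means.values())))
    scaled = {cluster: {} for cluster in cluster_means}
    for var in variables:
        values = [cluster_means[c][var] for c in cluster_means]
        lo, hi = min(values), max(values)
        for c in cluster_means:
            # Guard against a variable that is identical in every cluster
            scaled[c][var] = 0.0 if hi == lo else (cluster_means[c][var] - lo) / (hi - lo)
    return scaled

# Hypothetical cluster means (illustrative numbers, not the study's results)
means = {
    "Cluster 1": {"toluene": 4.0, "O3": 10.0},
    "Cluster 2": {"toluene": 2.0, "O3": 18.0},
    "Cluster 3": {"toluene": 1.0, "O3": 50.0},
}
scaled = minmax_across_clusters(means)
```

Under this scaling the axes show only the relative ordering and spacing of the clusters for each variable, so a value of 0.0 marks the lowest cluster mean, not a zero concentration.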
The K-means clustering identified three distinct pollution regimes, each representing characteristic atmospheric and emission scenarios relevant to urban air quality management.
Cluster 1 was characterised by high concentrations of VOCs and very low ozone levels. This scenario is indicative of fresh primary emissions dominated by traffic and combustion sources, under stagnant meteorological conditions (low temperature and low wind speed). These conditions limit dispersion and inhibit photochemical activity, leading to pollutant accumulation. From a policy perspective, this scenario highlights the need for stricter traffic emission controls during early morning and winter periods when dispersion is weakest.
Cluster 2 represented transitional or mixed scenarios with moderate VOC and ozone levels. This cluster occurred under slightly warmer and windier conditions, suggesting a blend of primary emissions and early-stage photochemical activity. Such conditions are common during late morning and shoulder seasons (spring/autumn). Interventions during these periods should focus on both emission reduction and photochemical monitoring, as this scenario can rapidly evolve toward secondary pollution episodes.
Cluster 3 featured low VOCs but elevated ozone, occurring during the warmest and windiest periods. These conditions favour photochemical processing and reflect aged urban air masses. This scenario exemplifies VOC-limited ozone formation, where even modest VOC levels lead to high ozone production due to abundant NOx and strong solar radiation. Effective mitigation under these conditions requires prioritising VOC reduction, particularly of highly reactive species (e.g., aromatics and alkenes), while also considering regional transport contributions.
These clusters provide a practical framework for dynamic air quality management. Instead of uniform policies, pollution mitigation can be tailored by time of day, season, and prevailing meteorology. For example, targeted restrictions on traffic emissions during Cluster 1 scenarios and VOC-specific industrial controls during Cluster 3 events could significantly reduce health risks and exceedances of regulatory thresholds.
The use of K-means clustering coupled with spider plot visualization allows identification of different atmospheric regimes based on pollutant and meteorological profiles. Scientifically, this aligns with the known behaviour of photochemical pollution: VOCs serve as precursors that, under sufficient solar radiation and in the presence of NOx, lead to secondary ozone formation. The observed inverse relationship between VOC concentrations and ozone levels across the clusters is consistent with classical photochemical smog theories [
77]. Similar multi-cluster patterns have been reported in previous air quality studies where fresh emissions dominated low-ozone clusters, and aged, oxidized air masses showed elevated ozone [
78,
79]. Hence, this analysis reveals a clear separation of pollution regimes, highlighting the transition from fresh emission events (Cluster 1), through mixed conditions (Cluster 2), to photochemically aged air masses rich in secondary pollutants (Cluster 3). The clustering approach thus provides valuable insights into VOC dynamics, atmospheric aging, and secondary pollutant formation, offering a powerful method for source attribution and air quality management strategies.
While the patterns identified in this study are robust for the Marylebone Road corridor, it is important to note that emission profiles, meteorological influences, and chemical regimes may vary across urban settings. Therefore, applying this methodology to other cities would provide valuable comparative insights and test the generalisability of the observed pollution regimes.
4. Conclusions
This study comprehensively examined the behaviour, sources, and atmospheric interactions of key volatile organic compounds (VOCs) and ozone over an 8-year period along Marylebone Road, London, an urban corridor dominated by traffic emissions. The integration of multivariate statistics, machine learning, and meteorological analysis allowed us to identify mechanistic insights into pollutant formation and transformation dynamics, rather than presenting isolated statistical associations.
The key findings aligned with the study’s objectives. Firstly, strong correlations among VOCs such as benzene, toluene, ethylbenzene, and ethene confirmed their primary origin in vehicular and fossil fuel combustion emissions. Secondly, time-lagged inverse correlations with ozone, revealed through cross-correlation analysis, demonstrated the operation of a VOC-limited regime. This regime is characterised by ozone titration due to high NO levels, delaying photochemical ozone build-up until pollutants have aged and dispersed. Thirdly, temporal trends showed that VOC concentrations peaked during winter and morning/evening rush hours, reflecting reduced dispersion and increased emissions, while ozone peaked in summer under conditions of strong solar radiation and greater atmospheric mixing, supporting its secondary formation pathway. Wind-sector analysis further revealed spatial heterogeneity in pollutant sources, with VOCs transported from southwest traffic corridors and ozone peaking under southeast winds carrying photochemically aged air masses.
Principal Component Analysis attributed nearly 40% of VOC variability to traffic emissions (PC1) and ~15% to biogenic and temperature-sensitive emissions (PC2). K-means clustering further identified three pollution regimes: fresh emission events with high VOCs and low ozone, mixed regimes with partial transformation, and aged air masses with low VOCs and elevated ozone, indicating progressive chemical evolution under meteorologically favourable conditions.
These insights advance our understanding of how primary emissions interact with meteorology to shape secondary pollution outcomes in dense urban environments. By revealing when, where, and how VOCs contribute to or inhibit ozone formation, the study supports the development of targeted air quality interventions, particularly under VOC-limited regimes, where reducing NOx alone may not be effective. The clustering results not only revealed the chemical and meteorological characteristics of each pollution scenario but also provided actionable insights for policy design. By recognising when specific emission sources and atmospheric conditions dominate, urban planners and regulators can implement more responsive, scenario-specific strategies, such as temporal traffic restrictions, VOC monitoring during high-ozone periods, and regional coordination for pollution transport.
The limitations of the study include the use of a single roadside monitoring location, which may not capture the full spatial variability of emissions and ozone dynamics across the urban area. Additionally, while the study applied advanced statistical tools, chemical transport modelling was not used to simulate reaction pathways or regional transport explicitly. Future work could build on these findings by incorporating real-time chemical modelling, expanding spatial coverage, and evaluating health exposure impacts to inform policy decisions more comprehensively.