2. Materials and Methods
This study was conducted in Zagreb, the capital of Croatia, with a population exceeding 800,000 residents and a complex urban morphology comprising dense residential zones and major transportation corridors [36]. As one of the most densely populated cities in Croatia, Zagreb experiences elevated concentrations of PM and NO2, especially during winter, largely due to traffic emissions, residential heating, adverse meteorological conditions, and cross-border issues.
To assess the potential of sensors for urban air quality monitoring, a dense network of 35 sensor units was deployed across different city districts, co-financed by the EU regional development, research, and innovation fund (Figure 1).
The sensor system used in this study was the AirQ Outdoor platform, developed by Smart Sense (Zagreb, Croatia), representing the pilot version of the company’s current Sensees Environment Monitoring System solution (available at: https://mysensees.com/, accessed on 8 November 2025). PM10 concentrations were measured with an optical particle counter based on laser light scattering, while electrochemical sensors were used for NO2 detection. Optical particle counters typically maintain stable performance for 24–36 months, whereas electrochemical gas sensors (such as those used for NO2) have a shorter operational lifetime of 12–24 months, depending on environmental exposure and pollutant load. Therefore, routine recalibration and periodic performance checks were implemented to mitigate long-term drift. Sensors showing signal degradation were replaced according to the manufacturer’s and operator’s quality-assurance protocols. The sensors, which measure concentrations at high temporal resolution (hourly), were installed in diverse microenvironments, including streets with heavy traffic load, residential neighborhoods, and suburban areas. Before network deployment, NO2 sensors were individually calibrated under controlled conditions. For PM10 sensors, a subset of representative units was co-located with a type-approved reference particulate analyzer over an extended period to establish method equivalence. Based on this co-location campaign, intra-sensor variability was quantified, and corresponding correction factors were derived and applied to all PM10 sensors to harmonize the measurements. Following deployment, routine operational checks and monthly validation activities were conducted jointly by the sensor network operator and the public health institute expert. These included periodic inspection, data quality control, and recalibration if deviations were detected.
The NO2 sensor is a high-sensitivity electrochemical (EC) sensor with a measurement range of 0–16,000 ppb, precision of ±5 ppb, and a lower detection limit of 1 ppb. It is factory pre-calibrated and individually characterized for zero offset and sensitivity, with an operational lifetime exceeding 24 months. For PM, the system employs optical particle counters based on laser light scattering to quantify PM1, PM2.5, and PM10. The measurement ranges are up to 500 µg/m3 (PM1), 2000 µg/m3 (PM2.5), and 5000 µg/m3 (PM10), with a detection limit of 1 µg/m3. These sensors operate reliably within a temperature range of –10 to +50 °C and 0–90% RH. The indicative air quality system is classified as a mid-range solution designed for professional and support applications rather than as a consumer-grade low-cost device.
Sensor placement was coordinated by public health experts in collaboration with local stakeholders and guided by prior pollution maps, population demographics, health indicators, and urban infrastructure characteristics. When selecting micro-locations, the requirements of the Ordinance on Air Quality Monitoring (Official Gazette 72/20) were taken into consideration. Micro-locations for monitoring traffic-related air pollution were selected in collaboration with the Faculty of Transport and Traffic Sciences at the University of Zagreb.
Microscale criteria for sampling points included: unobstructed airflow around the sampling probe inlet (generally free within an arc of at least 270°, or 180° for edge-of-settlement locations), without physical obstructions that may influence air movement (typically several meters away from buildings, balconies, vegetation, and other barriers, and at least 0.5 m from the nearest structure); a sampling inlet height between 1.5 m (representing the breathing zone) and 4 m above ground level; a sampling probe inlet not located in the immediate proximity of direct emission sources, to avoid sampling undiluted emissions; a sampler exhaust outlet positioned to prevent re-entrainment of exhaust air into the sampling inlet; and, for traffic-oriented monitoring, sampling probes placed at least 25 m away from major intersections and no more than 10 m from the road curb.
Additional siting considerations included: presence of interfering emission sources, safety and security of equipment and personnel, accessibility for maintenance and calibration, availability of electrical power and telecommunications infrastructure, visibility of the monitoring site within its surroundings, public safety and operator safety, feasibility of co-locating instruments for multiple pollutants, and urban planning and land use constraints. Street lighting poles were identified as optimal installation points for low-cost sensors due to convenient access to electrical power, and formal approval was requested from the competent City Office.
The sensor data were compared against measurements from three high-quality reference monitoring stations operated by the Croatian Meteorological and Hydrological Service (DHMZ), referred to as HQ Zagreb 1, HQ Zagreb 2, and HQ Zagreb 3, as well as the local reference-grade station, Štampar (Table 1). These stations represent different pollution contexts: a moderately polluted residential zone, a suburban background site, and an urban center roadside location, respectively. Each sensor unit contained optical PM sensors (based on laser light scattering) and electrochemical NO2 sensors. PM2.5, PM1, SO2, O3, and CO were also monitored. The devices recorded measurements every hour and transmitted data to a central database.
Meteorological parameters, including temperature, relative humidity, wind speed, and surface pressure, were obtained from the ERA5 reanalysis dataset (European Centre for Medium-Range Weather Forecasts, ECMWF). ERA5 provides hourly data at a spatial resolution of 0.1° × 0.1° (~9 km). For each sensor location, meteorological values were derived from the surrounding ERA5 grid cells using bilinear interpolation based on latitude and longitude. This ensured spatial alignment between the reanalysis data and ground measurements. The interpolated hourly values were then temporally synchronized with the corresponding hourly sensor records. This procedure was applied to all meteorological variables used in both the sensitivity analysis and calibration models. In addition to these variables, the following meteorological and surface parameters were also included in the analysis: 2 m dewpoint temperature, skin temperature, leaf area index (high vegetation), leaf area index (low vegetation), surface latent heat flux, surface net long-wave (thermal) radiation, evaporation, and surface runoff.
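As an illustration of the spatial alignment step, the sketch below interpolates a gridded field to a single sensor coordinate using bilinear interpolation. It is a minimal example under stated assumptions, not the code used in the study: the function name `bilinear_interpolate`, the toy grid, and the made-up temperature field are hypothetical, and both coordinate axes are assumed to be sorted in ascending order (reanalysis files often store latitude in descending order, in which case the axis would need to be flipped first).

```python
import numpy as np


def bilinear_interpolate(lats, lons, field, lat, lon):
    """Interpolate a 2-D gridded field to one (lat, lon) point.

    Assumes `lats` and `lons` are 1-D arrays sorted in ascending order and
    `field` has shape (len(lats), len(lons)). Hypothetical helper.
    """
    # Index of the lower-left grid node of the cell that contains the point
    i = int(np.clip(np.searchsorted(lats, lat) - 1, 0, len(lats) - 2))
    j = int(np.clip(np.searchsorted(lons, lon) - 1, 0, len(lons) - 2))

    # Fractional position of the point inside that cell (0..1 on each axis)
    t = (lat - lats[i]) / (lats[i + 1] - lats[i])
    u = (lon - lons[j]) / (lons[j + 1] - lons[j])

    # Weighted average of the four surrounding grid nodes
    return (
        (1 - t) * (1 - u) * field[i, j]
        + (1 - t) * u * field[i, j + 1]
        + t * (1 - u) * field[i + 1, j]
        + t * u * field[i + 1, j + 1]
    )


# Toy 0.1-degree grid around Zagreb with a made-up temperature field (deg C)
grid_lats = np.array([45.7, 45.8, 45.9])
grid_lons = np.array([15.9, 16.0, 16.1])
toy_field = 10.0 + (grid_lats[:, None] - 45.7) + 2.0 * (grid_lons[None, :] - 15.9)

# Value of the field at a hypothetical sensor position
value_at_sensor = bilinear_interpolate(grid_lats, grid_lons, toy_field, 45.81, 15.98)
```

Because the toy field is linear in latitude and longitude, the interpolated value matches the underlying field exactly, which makes the example easy to check by hand.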
2.1. Data Preparation
The dataset, consisting of hourly measurements from 35 sensors, three national reference stations, and one local reference station, underwent several preprocessing steps to enhance model performance and ensure robustness. The data processing followed previously published steps [31]. First, temporal features (based on timestamps) were encoded using sinusoidal transformations [37]. Specifically, the day of the year and month of the year were transformed using sine and cosine functions to capture seasonal differences in air pollution. Additionally, the Julian day and year were included to account for long-term trends, with Julian day representing a continuously increasing number of days since 1 January 1970. A holiday feature was also added as a binary variable, where 0 indicates a regular day and 1 indicates a holiday.
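The temporal encoding described above could be sketched as follows. This is a minimal illustration, not the original preprocessing code: the function name `add_temporal_features`, the column names, and the 365.25-day period are assumptions, and the holiday flag here covers only the 27 December to 3 January window used in this study (other public holidays would require a calendar that is not shown).

```python
import numpy as np
import pandas as pd


def add_temporal_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Build cyclical, trend and holiday features from timestamps (hypothetical helper)."""
    feats = pd.DataFrame(index=index)
    doy = index.dayofyear.to_numpy()
    month = index.month.to_numpy()
    day = index.day.to_numpy()

    # Sine/cosine pairs place 31 December next to 1 January on a circle
    feats["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    feats["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
    feats["month_sin"] = np.sin(2 * np.pi * month / 12)
    feats["month_cos"] = np.cos(2 * np.pi * month / 12)

    # Continuously increasing day count since 1 January 1970, plus the year
    epoch = pd.Timestamp("1970-01-01")
    feats["julian_day"] = (index.normalize() - epoch).days.to_numpy()
    feats["year"] = index.year.to_numpy()

    # Binary holiday flag: 1 from 27 December to 3 January, 0 otherwise
    in_window = ((month == 12) & (day >= 27)) | ((month == 1) & (day <= 3))
    feats["holiday"] = in_window.astype(int)
    return feats


# Example on a few days around New Year
example_features = add_temporal_features(
    pd.date_range("2022-12-26", "2023-01-04", freq="D")
)
```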
The period from 27 December to 3 January was classified as a holiday to account for increased pollution associated with New Year’s Eve celebrations and fireworks [38,39]. To approximate traffic-related pollution, a traffic proxy variable was introduced alongside the temporal features, following a polynomial approach [40]. For each hour of the day, the mean pollutant value was calculated from the training set, normalized to a range of −1 to 1, and fitted with a polynomial function. This function was then applied across the entire dataset to simulate daily traffic patterns, since direct traffic data were unavailable. The same method was also used to capture weekly traffic dynamics by the day of the week [40]. Before modelling or sensor calibration, outliers were removed using a 10-h sliding window approach: values falling outside the 0.01 and 0.99 quantiles were filtered out within each window. Rare missing values were handled using an iterative imputer, which models each feature as a function of the others and imputes missing values iteratively using a round-robin strategy. The dataset was then split chronologically into two subsets: training and testing. The training set spanned from 1 March 2022 to 25 September 2023. The test set started on 26 September 2023 and continued until the end of the dataset on 28 May 2024. This chronological split also made it possible to evaluate the need for recalibration over longer periods.
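The hourly traffic proxy described above can be illustrated with a short sketch. It is not the authors' implementation: the function names `fit_hourly_proxy` and `apply_hourly_proxy`, the polynomial degree of 5, and the synthetic pollutant series are assumptions made for the example. The proxy is fitted on the training portion only and then evaluated for every timestamp.

```python
import numpy as np
import pandas as pd


def fit_hourly_proxy(train: pd.Series, degree: int = 5) -> np.poly1d:
    """Fit a polynomial to the normalized mean diurnal profile of `train`."""
    # Mean pollutant value for each hour of the day (0..23), training data only
    hourly_mean = train.groupby(train.index.hour).mean()

    # Min-max scaling of the 24 hourly means to the range [-1, 1]
    lo, hi = hourly_mean.min(), hourly_mean.max()
    scaled = 2.0 * (hourly_mean - lo) / (hi - lo) - 1.0

    # Least-squares polynomial fit; the degree is an assumption, not from the study
    hours = scaled.index.to_numpy(dtype=float)
    return np.poly1d(np.polyfit(hours, scaled.to_numpy(), degree))


def apply_hourly_proxy(poly: np.poly1d, index: pd.DatetimeIndex) -> np.ndarray:
    """Evaluate the fitted polynomial at the hour of each timestamp."""
    return poly(index.hour.to_numpy(dtype=float))


# Synthetic hourly series with a smooth daily cycle (illustrative data only)
rng = np.random.default_rng(0)
demo_index = pd.date_range("2022-03-01", periods=24 * 60, freq="h")
demo_hours = demo_index.hour.to_numpy()
demo_series = pd.Series(
    30 + 10 * np.sin(2 * np.pi * (demo_hours - 6) / 24) + rng.normal(0, 1, len(demo_index)),
    index=demo_index,
)

proxy_poly = fit_hourly_proxy(demo_series.iloc[: 24 * 45])  # first 45 days as training data
traffic_proxy = apply_hourly_proxy(proxy_poly, demo_series.index)
```

The weekly proxy would follow the same pattern, grouping by `index.dayofweek` instead of `index.hour`.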
2.2. Machine Learning and Statistical Analysis
To analyze pollutant dynamics during rush hours, measuring sites were categorized into three groups: heavy traffic load sites, light traffic load sites, and reference stations. Three representative sites were selected per category. Sites were classified according to the density of surrounding roads and transport lines, along with the overall intensity of traffic activity. Areas experiencing heavy traffic were designated as congestion-prone urban zones, whereas light-traffic sites were found in calmer residential neighborhoods. The heavy traffic load sites included Vrbik, NSB, and Trešnjevka/Vukovarska, while the light traffic load sites were Gajnice, Dubrava/Maksimir, and Čulinec. Additionally, three reference sites, HQ Zagreb 1, 2, and 3, were used.
Hourly measurements of pollutants were extracted for a randomly selected Monday, 12 June 2023. To show temporal variations during the morning rush period, only data from 03:00 until 11:00 were selected. This weekday was chosen as it represents a typical working day within the observation period and is not affected by weekend or public holiday traffic anomalies. However, we acknowledge that analysing a single day does not capture differences between weekdays and weekends, nor does it reflect potential afternoon traffic peaks or seasonal variability. Morning rush hours were selected as they represent the most consistent and pronounced traffic-related pollution peak across weekdays, and atmospheric conditions during this period favour pollutant accumulation.
To investigate how environmental and temporal factors influence air pollutant concentrations, a two-stage modelling framework comprising a feature sensitivity analysis and ML-based calibration was applied. First, a feature sensitivity analysis was conducted using linear regression. Separate models were trained for each sensor location to quantify how different inputs, namely meteorological parameters (e.g., temperature, humidity, wind speed from ERA5), temporal features (e.g., hour of day, day of week, holidays), and traffic proxy indicators, influence local pollutant concentrations. Input variables were standardized before modelling to allow for direct comparison of regression coefficients across features and stations. Standardization was performed by centering each variable to a mean of zero and scaling it to unit variance.
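A minimal sketch of the sensitivity step for one station is given below: inputs are standardized to zero mean and unit variance, and an ordinary least-squares fit returns one coefficient per feature. The function names `standardize` and `standardized_coefficients` and the synthetic data are assumptions for illustration; the study's actual feature set and fitting routine are not reproduced here.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Center each column to mean 0 and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def standardized_coefficients(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit y ~ intercept + Z @ beta by least squares, where Z is standardized X.

    Returns beta only (one value per feature), so coefficients are directly
    comparable across features measured in different units.
    """
    Z = standardize(np.asarray(X, dtype=float))
    design = np.column_stack([np.ones(len(Z)), Z])  # intercept column + features
    solution, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return solution[1:]


# Synthetic example: three features on very different scales (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]  # third feature has no effect
coef_demo = standardized_coefficients(X_demo, y_demo)
```

In this toy case the third coefficient is essentially zero, while the first two equal the raw slopes multiplied by each feature's standard deviation, which is what makes them comparable.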
The aim of this analysis was interpretability rather than predictive performance: understanding spatial variability in pollution drivers and the relative importance of each explanatory factor across the sensor network. In addition to sensitivity analysis, Principal Component Analysis (PCA) [41] was applied to explore underlying structures and potential redundancies in the measurement data for both PM10 and NO2. This approach enabled the identification of potential multicollinearity between stations, the assessment of spatial coherence in pollutant concentrations, and the detection of station-specific behavior. The primary purpose of PCA was to assess the consistency of the sensor network, identify stations with similar or divergent temporal patterns, and detect potential outliers before calibration. In the second stage, ML models were developed to calibrate sensor measurements. The XGBoost (Extreme Gradient Boosting) [33] algorithm was selected due to its ability to model complex, non-linear relationships and interactions among spatiotemporal features. Each model was trained using 5-fold cross-validation, with hyperparameters optimized via Bayesian optimization [42]. The calibration dataset consisted of hourly sensor measurements paired with simultaneously recorded pollutant concentrations from the reference station, which served as the target values in the model. This spatiotemporal calibration strategy allowed the model to correct sensor drift and environmental interference by leveraging patterns learned from paired high-accuracy reference data. To assess calibration performance, we compared pollutant predictions before and after model correction using the Pearson correlation coefficient and the root mean squared error (RMSE). Additionally, we monitored the rolling RMSE over time to detect performance degradation and established thresholds to determine when recalibration was needed.
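The two evaluation metrics named above, together with the relative RMSE reduction used to express the gain from calibration, can be written compactly as shown below. The helper names `rmse`, `pearson_r`, and `rmse_reduction_pct` are illustrative and not taken from the study.

```python
import numpy as np


def rmse(reference, estimate) -> float:
    """Root mean squared error between reference and sensor (or calibrated) values."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def pearson_r(reference, estimate) -> float:
    """Pearson correlation coefficient between the two series."""
    return float(np.corrcoef(reference, estimate)[0, 1])


def rmse_reduction_pct(rmse_before: float, rmse_after: float) -> float:
    """Relative RMSE reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before
```

For instance, a drop in RMSE from 51.91 to 9.14 µg/m3, as reported for one site in the Results, corresponds to `rmse_reduction_pct(51.91, 9.14)`, which is approximately 82%.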
2.3. Machine Learning-Based Calibration
The calibration process followed a structured, multi-step approach. Hourly pollutant data (PM10 and NO2), both from sensors and reference stations (stations 1, 2, and 3, operated by DHMZ, and station 39), formed the core of the training dataset. These data were enriched with engineered temporal features (e.g., sine/cosine encodings of time, holiday markers), meteorological variables from the ERA5 dataset [43,44,45], and proxy indicators of traffic patterns. Sensor-reference pairings were chosen not only based on proximity but also on environmental comparability, considering factors such as traffic exposure, surrounding land use, and microclimatic conditions, including temperature, humidity, and wind direction.
Two sensors were paired with each reference site, resulting in six distinct calibration scenarios. It should be noted that full co-location of sensors directly beside reference-grade monitoring stations was not feasible due to the technical and infrastructural constraints during deployment. Therefore, sensor–reference pairings were established based on spatial proximity and similarity of the local micro-environment rather than identical placement. This approach assumes that nearby sites within comparable micro-environments share similar temporal pollution dynamics, but it cannot eliminate hyperlocal gradients that may cause divergence between sensor and reference values. Consequently, the accuracy metrics presented here should be interpreted as indicators of performance under realistic operational conditions rather than strict equivalence tests. To compensate for the absence of direct co-location, the calibration relied on spatiotemporal patterns by integrating meteorological variables (ERA5), land-use and traffic proxy indicators, and temporal features. This calibration design assumes that sensors located in comparable urban environments will respond similarly to environmental factors. Therefore, even if exact co-location is not possible, valid calibration relationships can still be established by capturing shared spatiotemporal trends.
In the calibration models, the pollutant concentration measured at the reference station was used as the target variable, whereas sensor measurements, meteorological data, temporal features, and land-use variables served as predictors. XGBoost was selected as the calibration algorithm due to its proven ability to capture non-linear and multi-scale relationships, as well as its robustness to noisy input data. Calibration performance was assessed using Pearson correlation coefficients and the root mean squared error (RMSE), both before and after applying the model. In addition, we introduced a rolling RMSE evaluation to monitor performance degradation over time. A reference baseline RMSE was computed on the training set, and recalibration was flagged when the rolling RMSE during the test period exceeded this baseline by more than 50%. This strategy provided a practical mechanism for determining the temporal stability of the model and guiding the schedule for recalibration.
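The rolling RMSE check can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the function names, the rolling window length (not specified in the text; one week of hourly data is used as a default), and the synthetic series are hypothetical. The threshold is taken as 1.5 times the training-period RMSE, i.e. a 50% increase over the baseline.

```python
import numpy as np
import pandas as pd


def rolling_rmse(ref: pd.Series, pred: pd.Series, window: int) -> pd.Series:
    """RMSE over a trailing window of `window` consecutive samples."""
    squared_error = (pred - ref) ** 2
    return np.sqrt(squared_error.rolling(window, min_periods=window).mean())


def recalibration_flag(ref, pred, split, window=24 * 7, factor=1.5):
    """Return (baseline RMSE, threshold, first test timestamp above threshold or None)."""
    in_train = ref.index < split
    # Baseline: RMSE over the whole training period
    baseline = float(np.sqrt(((pred[in_train] - ref[in_train]) ** 2).mean()))
    threshold = factor * baseline

    # Rolling RMSE restricted to the test period; note that the first windows
    # after the split still contain some training-period hours
    test_rmse = rolling_rmse(ref, pred, window)[~in_train]
    exceeded = test_rmse[test_rmse > threshold]
    first = exceeded.index[0] if len(exceeded) > 0 else None
    return baseline, threshold, first


# Synthetic example: the error is 1 unit at first and jumps to 3 units later
demo_idx = pd.date_range("2023-09-01", periods=400, freq="h")
demo_ref = pd.Series(0.0, index=demo_idx)
demo_pred = pd.Series(np.where(np.arange(400) < 300, 1.0, 3.0), index=demo_idx)

demo_baseline, demo_threshold, demo_first = recalibration_flag(
    demo_ref, demo_pred, split=demo_idx[200], window=10
)
```

The number of days until recalibration then follows from the difference between the first flagged timestamp and the start of the test period.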
3. Results
To explore spatial differences in air pollution across Zagreb, average concentrations of NO2 and PM10 were analyzed for the full observation period (March 2022–May 2024). Figure 2 shows the spatial distribution of mean values per station. The NO2 map exhibits pronounced spatial variability, with higher concentrations observed in the central and eastern parts of the city, particularly near major traffic corridors and urbanized areas. Sensor locations numbered 18, 19, and 14 (see Table 1) show the highest yearly mean NO2 concentrations, exceeding 70 µg/m3 and thus the upcoming limit values for the protection of human health (a daily mean of 50 µg/m3 not to be exceeded more than 18 times per calendar year and an annual mean of 20 µg/m3), suggesting a significant influence from traffic emissions and localized traffic congestion. On the other hand, in the southern and western parts of the city, at locations 3, 13, and 28, NO2 levels are notably lower, generally below 30 µg/m3, indicating a reduced anthropogenic influence and the presence of more dispersed residential or green areas.
The distribution of PM10 concentrations is more uniform, with most yearly concentrations falling within the 34–63 µg/m3 range. Nevertheless, a few hotspots, particularly in the eastern urban zones (sites 38 and 32) and one in the center (site 11), show elevated PM10 concentrations. These values, which are mostly above the upcoming limit values for the protection of human health (daily up to 45 µg/m3, not more than 18 times per calendar year, and annual up to 20 µg/m3), may indicate a current combined exposure effect due to vehicular activity, construction, and localized heating sources. Notably, some regions that reported higher NO2 values do not coincide with the PM10 peaks, indicating that different emission sources and atmospheric processes may be influencing the distribution of each pollutant.
Temporal variations in air pollutant concentrations during the morning rush hour (03:00–11:00) on a randomly selected Monday (12 June 2023) are presented in Figure 3.
Morning hours were chosen because they offer the clearest and most consistent traffic-related emission signal in Zagreb and capture the most pronounced contrasts between sites with heavy and light traffic loads, making them suitable for illustrating how sensors capture rapid, locally driven changes in pollutant concentrations. This limited temporal window serves as an illustration of daily variability rather than a complete temporal trend analysis; the full diurnal and seasonal variability is addressed through the long-term and aggregated analyses presented in other sections. As described previously, monitoring locations were grouped into three categories, each comprising three sites. For NO2, sites with heavy traffic loads exhibited pronounced fluctuations, with peak concentrations observed between 09:00 and 10:00. Maximum levels reached approximately 60 µg/m3 at Trešnjevka/Vukovarska and NSB, while Vrbik showed comparatively lower peaks of around 30 µg/m3. Reference sites displayed a moderate morning increase, with HQ Zagreb 2 reaching ~45 µg/m3, whereas HQ Zagreb 3 remained below 15 µg/m3. In contrast, light traffic load sites maintained consistently lower concentrations, gradually decreasing from early morning peaks (e.g., Čulinec ~32 µg/m3) to 8–12 µg/m3 by 11:00. For PM10, heavy traffic load sites showed a gradual increase in concentrations during the rush hour (26 µg/m3). Reference sites exhibited lower levels, particularly HQ Zagreb 2, which approached 19 µg/m3. Light traffic load sites also recorded low concentrations, remaining below 20 µg/m3 with only minor fluctuations. Only the Dubrava/Maksimir site showed unusually high PM10 concentrations, which cannot be attributed to traffic. Overall, the results demonstrate the substantial influence of traffic intensity on pollutant concentrations. Heavy traffic load areas were characterized by sharp NO2 peaks during the morning rush hour, while PM10 concentrations showed less distinct increases but remained elevated at both heavy traffic load and reference sites compared to light traffic load locations. This analysis revealed that sensor data can capture local variations and can therefore be used for local health impact analysis of pollution events and for assessing compliance with the limit values for PM10 and NO2, among others, set for the protection of human health by Directive (EU) 2024/2881 of the European Parliament and of the Council of 23 October 2024 on ambient air quality and cleaner air for Europe [20].
Principal component analysis (PCA) was applied to standardized weekly average concentrations of NO2 and PM10 across all monitoring stations to understand similarities and differences in pollution behaviour among locations. The left subplot in Figure 4 shows the PCA projection of NO2 data, while the right subplot presents the corresponding results for PM10. Each point represents a station, labelled by its station number and color-coded for clarity. For NO2, the stations are more widely dispersed in the PCA plot, indicating higher variability in NO2 pollution profiles across different locations in the city. This spatial spread reflects how local factors such as traffic density and urban structures affect pollution levels at specific stations. In contrast, the PCA of PM10 concentrations revealed a tighter clustering of stations, indicating more uniformity across locations, although some sites (e.g., 1, 5, 11, 31, and 37) still exhibited distinct behavior. Reference-grade DHMZ monitors (stations 1, 2, and 3) appear among the outliers due to their higher precision and calibration standards, while station 39 (ASTIPH reference-grade) was excluded from the PM10 analysis due to insufficient data. The first two principal components (PCs) capture most of the variance for both pollutants, enabling clear visual separation between stations with similar versus distinct pollution profiles.
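A station-level PCA projection of this kind could be computed as in the sketch below. It is an illustration only: the function name `station_pca`, the choice to standardize each station's weekly series before treating stations as observations, and the random input data are assumptions, since the text does not specify these details. The input is assumed to be a table of weekly means with one column per station and no missing values.

```python
import numpy as np
import pandas as pd


def station_pca(weekly: pd.DataFrame, n_components: int = 2):
    """Project stations onto the leading principal components.

    `weekly` has one row per week and one column per station.
    Returns (scores per station, explained variance ratio per component).
    """
    # Standardize each station's weekly series (zero mean, unit variance)
    z = (weekly - weekly.mean()) / weekly.std(ddof=0)

    # Stations become observations (rows); weeks become variables (columns)
    X = z.T.to_numpy()
    X = X - X.mean(axis=0)  # center every variable before the decomposition

    # PCA via singular value decomposition
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)

    names = [f"PC{k + 1}" for k in range(n_components)]
    return pd.DataFrame(scores, index=weekly.columns, columns=names), explained[:n_components]


# Random weekly means for six hypothetical stations (illustrative data only)
rng = np.random.default_rng(2)
weekly_demo = pd.DataFrame(
    rng.normal(size=(52, 6)), columns=[f"station_{k}" for k in range(1, 7)]
)
pca_scores, pca_explained = station_pca(weekly_demo)
```

Plotting `PC1` against `PC2` from `pca_scores` gives one point per station, analogous to the projection described above.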
3.1. Comparison to Reference Measurements
In this section, measurements from the sensors are compared to reference measurements. Figure 5 presents the absolute error of NO2 and PM10 measurements across three temporal resolutions: hourly, daily, and weekly. To compute the absolute error, measurements from all sensor sites were aggregated by calculating the median, and the same procedure was applied to the reference measurements. The absolute error was then derived from the difference between these aggregated values. These three temporal resolutions were chosen to capture different characteristics: hourly sampling reveals high-frequency fluctuations, daily sampling smooths short-term variations, and weekly sampling highlights long-term error trends. This aggregated error analysis was designed to provide a network-level overview of deviations between the ensemble of sensors and reference stations, highlighting temporal and seasonal error patterns rather than quantifying the bias of individual sensors. The upper plot (Figure 5) shows the absolute error of NO2. Hourly resampling reveals high-frequency fluctuations with peaks exceeding 60 µg/m3, with larger error clusters appearing notably in mid-2022 and 2023. In the daily resampling, peaks are smoother and the absolute error typically ranges from 10 to 30 µg/m3. Weekly resampling shows relatively stable error values, fluctuating between 10 and 20 µg/m3, with occasional increases. The lower plot in Figure 5 shows the absolute error of PM10, where hourly resampling exhibits extreme spikes, with errors reaching or exceeding 1200 µg/m3. The distribution is more sporadic than for NO2, with pronounced outliers, especially in late 2022 and early 2023. Daily resampling reveals clearer temporal patterns, with bursts of error visible around late 2022 and early 2024. Despite smoothing, occasional spikes up to 400 µg/m3 for PM10 persist. Weekly resampling drastically reduces variability, with general error levels below 40 µg/m3, indicating that most extreme fluctuations are short-lived. The analysis shows that as the temporal resolution changes from hourly to weekly, extreme error values are significantly smoothed. This suggests that while high errors do occur, they are typically brief in duration, and daily or weekly averaging provides a more accurate estimation of long-term sensor bias. Given that air quality data are reported as daily values, this suggests sufficient quality for reporting even before software calibration. Moreover, PM10 shows larger and more extreme errors compared to NO2, especially in the hourly and daily views. This suggests that PM10 sensor measurements are more sensitive to episodic or extreme pollution events than those of NO2. Both pollutants exhibit seasonal or periodic errors. For instance, NO2 errors tend to decrease during late summer and rise again in winter, potentially due to changes in atmospheric dispersion or emission patterns (e.g., increased heating-related emissions in colder months). There is also a visible downward trend in error variability over time, particularly for PM10 during late 2023 and early 2024. This may indicate improvements in the modeling approach or enhanced calibration of the sensor data.
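The network-level error described in this section could be computed along the lines of the sketch below. It is a hedged illustration: the function name `network_abs_error` and the toy data are hypothetical, and the order of operations (median across sites, then averaging to the target resolution, then absolute difference) is an assumption, as the text does not state whether resampling was applied before or after taking the difference.

```python
import numpy as np
import pandas as pd


def network_abs_error(sensors: pd.DataFrame, refs: pd.DataFrame, freq: str) -> pd.Series:
    """Absolute error between the sensor-network median and the reference median.

    `sensors` and `refs` hold hourly values with one column per site and a
    DatetimeIndex; `freq` is a pandas offset alias such as "D" or "W".
    """
    # Median across sites at each timestamp, then mean over each resampling bin
    sensor_median = sensors.median(axis=1).resample(freq).mean()
    ref_median = refs.median(axis=1).resample(freq).mean()
    return (sensor_median - ref_median).abs()


# Toy data: two days of constant hourly values (illustrative only)
toy_index = pd.date_range("2023-01-02", periods=48, freq="h")
toy_sensors = pd.DataFrame({"s1": 30.0, "s2": 40.0, "s3": 50.0}, index=toy_index)
toy_refs = pd.DataFrame({"r1": 20.0, "r2": 30.0}, index=toy_index)

daily_error = network_abs_error(toy_sensors, toy_refs, "D")
weekly_error = network_abs_error(toy_sensors, toy_refs, "W")
```

Using the median across sites keeps a single faulty sensor from dominating the network-level error, at the cost of hiding the bias of individual units.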
3.2. Sensitivity Assessment
Normalized feature sensitivities for NO2 (left panel) and PM10 (right panel) concentrations across all monitoring stations are presented in Figure 6. Each row represents a station, while each column corresponds to a feature used in the linear regression model. Feature coefficients were normalized by dividing each value by the maximum absolute coefficient for that station, resulting in values scaled between −1 and 1. This approach emphasizes the relative importance of each feature within a station rather than across stations. For NO2, most features show values close to zero, indicating weak or diffuse linear relationships with the predictors. Only a few stations display noticeable sensitivity to surface energy balance variables (e.g., surface latent heat flux and evapotranspiration), while temporal features such as Hour in the day or Day in the week contribute very little. This suggests that for NO2, which is largely traffic-related, the linear model struggles to capture variability using the selected meteorological and temporal predictors. For PM10, sensitivities also remain mostly low across stations. Occasional increases in sensitivity appear for certain stations and meteorological variables, but no consistent feature dominates. Although temporal features have previously been shown to play an important role in PM10 variability [45,46], their contribution here is minimal. Equally important, the overall muted and station-specific sensitivity patterns for both pollutants suggest that neither NO2 nor PM10 is strongly explained by the selected features within a linear framework. This highlights the need for pollutant-specific calibration strategies and potentially more complex, non-linear modelling approaches when applying sensor networks for urban air quality monitoring.
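The per-station normalization of regression coefficients used for the sensitivity panels amounts to a row-wise division by the largest absolute value. A minimal sketch, with the hypothetical function name `normalize_per_station` and made-up coefficients, is:

```python
import numpy as np


def normalize_per_station(coefs) -> np.ndarray:
    """Scale each row (station) by its maximum absolute coefficient.

    Rows are stations, columns are features; results lie in [-1, 1].
    Rows that contain only zeros are returned as zeros.
    """
    coefs = np.asarray(coefs, dtype=float)
    max_abs = np.abs(coefs).max(axis=1, keepdims=True)
    return np.divide(coefs, max_abs, out=np.zeros_like(coefs), where=max_abs > 0)


# Two hypothetical stations with three features each
example_coefs = np.array([[2.0, -4.0, 1.0], [0.5, 0.1, -0.2]])
example_normalized = normalize_per_station(example_coefs)
# Rows become [0.5, -1.0, 0.25] and [1.0, 0.2, -0.4]
```

Because every row is scaled by its own maximum, values are comparable within a station but not between stations, which matches the interpretation given above.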
3.3. Machine Learning Based Improvement of Data Quality Using Local Stations
The XGBoost ML algorithm was used to improve the accuracy of sensor measurements. During the training phase, a Bayesian optimizer was employed to fine-tune the model. The final calibration model used the following hyperparameters: n_estimators = 200, meaning that the model was built from 200 decision trees added sequentially; and max_depth = 6, which limited the depth of each decision tree, allowing the model to capture moderately complex patterns while reducing the risk of overfitting. A learning rate of 0.1 provided the best balance between learning speed and generalization ability. A subsample ratio of 0.8 means that 80% of the training data was randomly sampled for each tree, which helped prevent overfitting by introducing variability. For each of the reference stations (1, 2, and 3), two nearby sensor stations were selected. The selection was based on both geographic proximity and similar surrounding environments:
- (1) HQ Zagreb 1 is located on a busy city street with more than six traffic lanes. Therefore, two sensor sites, Trešnjevka/Vukovarska and NSB, were selected, as both are also located near streets with comparable traffic volume.
- (2) HQ Zagreb 2 is situated in the eastern part of the city, which is still relatively busy but less central than HQ Zagreb 1. The chosen sensor sites, Dubrava Centar and Ravnice, share similar characteristics in terms of the number of surrounding roads and public transport stops.
- (3) HQ Zagreb 3 is in the southeastern suburban part of the city, an area with fewer office buildings and lower traffic density. Therefore, Borovje and Folnegovićevo were selected, as they share comparable suburban characteristics.
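For orientation, the reported hyperparameters and sensor-reference pairings can be written down as a small configuration sketch. The names `CALIBRATION_PARAMS`, `PAIRINGS`, and `build_calibrator` are hypothetical; the parameter names follow the scikit-learn style interface of the `xgboost` package (`XGBRegressor`), which is assumed to be installed when `build_calibrator` is called. The feature matrix, the 5-fold cross-validation, and the Bayesian search that produced these values are not shown.

```python
# Hyperparameters of the final calibration model as reported in the text
CALIBRATION_PARAMS = {
    "n_estimators": 200,   # number of boosted trees
    "max_depth": 6,        # maximum depth of each tree
    "learning_rate": 0.1,  # shrinkage applied to each tree's contribution
    "subsample": 0.8,      # fraction of training rows sampled per tree
}

# Reference station -> the two sensor sites paired with it (six scenarios in total)
PAIRINGS = {
    "HQ Zagreb 1": ["Trešnjevka/Vukovarska", "NSB"],
    "HQ Zagreb 2": ["Dubrava Centar", "Ravnice"],
    "HQ Zagreb 3": ["Borovje", "Folnegovićevo"],
}


def build_calibrator(params=None):
    """Create an untrained XGBoost regressor with the reported settings.

    The import is done lazily so that the configuration above can be
    inspected even where the xgboost package is not available.
    """
    from xgboost import XGBRegressor

    return XGBRegressor(**(params or CALIBRATION_PARAMS))
```

One model per pollutant and sensor site would then be fitted with the reference concentration as the target, for example `build_calibrator().fit(X_train, y_train)`, where `X_train` and `y_train` are placeholders for the feature matrix and reference values.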
Figure 7 shows the results of the calibration model applied to hourly data. The pre-calibration data are represented as orange dots, while the calibrated values obtained after applying the model are shown in blue. The left column presents results for NO2, and the right column shows PM10. Each subplot includes the RMSE for the original and calibrated data, along with the calculated percentage improvement. Across all sensor sites, the original (pre-calibration) data diverge substantially from the reference measurements at higher pollutant concentrations. This divergence between the sensor and reference measurements before calibration primarily reflects differences in hyperlocal context. The indicative air quality stations were installed on lighting poles in close proximity to emission sources, up to 4 m above ground and directly facing the street. In contrast, reference-grade stations are generally positioned more than 10 m away from major roads and in semi-open environments to ensure regulatory representativeness. As a result, the indicative sensors often record higher short-term peaks and stronger spatial gradients. These hyperlocal variations naturally lead to discrepancies when comparing sensor data with reference stations, especially without spatial normalization. After calibration, the data points align much more closely with the reference values, indicating a clear improvement in accuracy. The calibration process led to a substantial reduction in RMSE across all sites and for both pollutants. The improvements ranged from 18% to 82%, demonstrating the effectiveness of the applied correction methodology. The largest improvement occurred at Ravnice for NO2, where the RMSE dropped from 51.91 to 9.14 µg/m³, representing an 82% improvement. For PM10, although the initial error levels were generally higher than for NO2, the calibration still resulted in consistent performance gains, with improvements exceeding 40% in most cases. To better understand the performance of the calibration model over time, rolling RMSE was calculated for both NO2 and PM10 at each location. This site-specific analysis complements the aggregated error assessment by providing a temporally resolved evaluation of calibration performance, capturing sensor drift and model degradation over time.
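As an illustration of the error metrics discussed here, the following minimal Python sketch computes RMSE, the percentage improvement between two RMSE values, and a rolling RMSE over a fixed-length trailing window. The function names and the window length are assumptions made for this example; the window size and implementation used in the study are not specified in this section.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def percent_improvement(rmse_before, rmse_after):
    """Relative RMSE reduction, in percent, after calibration."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before


def rolling_rmse(predicted, reference, window):
    """RMSE over each trailing window of `window` consecutive samples.

    Returns len(predicted) - window + 1 values (empty if the series is
    shorter than the window). `window` is a hypothetical parameter.
    """
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return [
        math.sqrt(sum(squared[i - window + 1 : i + 1]) / window)
        for i in range(window - 1, len(squared))
    ]


# RMSE values reported in the text for NO2 at Ravnice (in µg/m³).
print(round(percent_improvement(51.91, 9.14)))  # prints 82
```

Applied to hourly data, a window of several hundred samples would smooth out individual peaks while still tracking gradual drift; the appropriate length depends on the data and is left open here.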
Figure 8 shows these results, with NO2 on the left and PM10 on the right, across all six sensor locations. Each subplot tracks the model’s performance over the entire time span of the dataset. The blue-shaded area represents the training period, while the red-shaded area corresponds to the test period. These two periods are separated by a black vertical dotted line indicating the train/test split. The green dotted line shows the baseline RMSE, calculated during the training phase. The red dashed line represents a threshold that indicates when model recalibration may be necessary. This threshold is defined as a 50% increase over the baseline RMSE. At all locations, model drift is evident during the test period, particularly in 2024, when RMSE increases for both NO2 and PM10. The increase in RMSE is earlier and more pronounced for PM10, in line with its greater variability and sensitivity. The earliest threshold exceedance is observed at the Folnegovićevo station for PM10, just 36 days into the test period, while the latest occurs at Trešnjevka/Vukovarska for NO2, after 148 days. In some locations, the threshold is not exceeded during the test period, indicating more stable model performance. Notably, the Dubrava Centar, Ravnice, and Folnegovićevo stations show rapid error growth, suggesting that these sites may require more frequent retraining or adaptive calibration strategies. The defined thresholds offer a quantitative and interpretable method for identifying when a deployed model’s performance declines to an unacceptable level, supporting effective model maintenance and monitoring practices.
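The threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function name, its arguments, and the sample values are hypothetical, and the baseline RMSE is taken as a given input because the section does not state exactly how it was derived from the training period.

```python
from datetime import date


def days_to_first_exceedance(dates, rolling_values, baseline_rmse,
                             split_date, factor=1.5):
    """Days from the train/test split to the first rolling-RMSE value
    above factor * baseline_rmse; None if the threshold is never exceeded.

    factor=1.5 encodes the 50% increase over the baseline RMSE.
    Values dated before split_date (training period) are ignored.
    """
    threshold = factor * baseline_rmse
    for day, value in zip(dates, rolling_values):
        if day >= split_date and value > threshold:
            return (day - split_date).days
    return None


# Hypothetical rolling-RMSE series (µg/m³), for illustration only.
split = date(2024, 1, 1)
sample_dates = [date(2023, 12, 31), date(2024, 1, 1),
                date(2024, 1, 10), date(2024, 2, 6)]
sample_rmse = [20.0, 9.0, 14.0, 16.0]

# Baseline 10.0 gives a threshold of 15.0; first exceedance is 6 February.
print(days_to_first_exceedance(sample_dates, sample_rmse, 10.0, split))  # prints 36
```

Returning None when the threshold is never crossed mirrors the locations where no recalibration was flagged during the test period.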
3.4. Strengths and Limitations of the Study
This study presents the first city-wide assessment of next-generation air quality sensors in Zagreb, providing comprehensive spatial and temporal data across 35 sensor locations. A strength of this work lies in the application of ML-based recalibration models (XGBoost), which significantly enhance sensor accuracy, particularly during peak pollution events. Additionally, the introduction of rolling RMSE analysis offers an innovative and practical approach for determining recalibration intervals, contributing to sustainable long-term monitoring practices. The study also integrates extensive meteorological and proxy traffic data, further improving the robustness of sensor calibration.
However, several limitations must be acknowledged. First, full co-location of sensors with reference-grade monitoring stations was not feasible due to infrastructural constraints, leading to reliance on spatiotemporal pairing instead of direct one-to-one calibration. While spatiotemporal pairing provides meaningful correction relationships, it cannot replicate true co-location conditions, in which micro-environmental differences are eliminated. As a result, some portion of the observed discrepancies, particularly short-term peaks, reflects hyperlocal emission variability rather than sensor inaccuracy. Moreover, calibration models may partially learn environmental differences rather than purely correcting sensor bias, which should be considered when interpreting the error reductions. Second, the absence of direct traffic data required the use of proxy variables. Although previous studies have shown these proxies to be effective, they cannot fully substitute for real-time traffic data and may underestimate abrupt changes in mobility patterns or special events. Third, while the rolling RMSE method effectively indicates declining calibration performance, it does not distinguish the influence of extreme meteorological events from that of sensor hardware degradation. Lastly, the study focused on PM10 and NO2 only; additional pollutants such as PM2.5, O3, and CO, though measured, were not included in the calibration analysis.
To address these limitations, future research should aim to increase the number of co-located sensor-reference pairs, integrate real-time traffic and land-use data, and expand calibration efforts to other pollutants and sensor types. Adaptive recalibration frameworks incorporating transfer learning and sensor hardware diagnostics could further improve the scalability and resilience of monitoring networks.
4. Conclusions
This study presents the first city-wide evaluation of air quality sensors in Zagreb, assessing their performance in measuring PM10 and NO2 concentrations across diverse urban environments prior to software calibration cycles. Unlike previous Croatian studies, this work demonstrates the feasibility of an ML-calibrated, city-wide low-cost sensor network and introduces a replicable recalibration framework based on rolling RMSE. By linking recalibrated data to pollution event detection and health impact relevance, the study provides novel methodological and practical insights for urban air quality management in Europe. The results highlight substantial spatial and temporal variability in these pollutants, which is shaped by traffic, land use, and meteorological conditions. While NO2 concentrations closely reflected traffic emission patterns, PM10 exhibited a more heterogeneous distribution, influenced by residential heating, construction activities, and traffic-related factors such as fleet age and road conditions. Sensitivity analysis using linear regression revealed that pollutant concentrations were strongly influenced by temporal and meteorological variables, with NO2 being more dependent on daily traffic patterns and PM10 more responsive to environmental and surface characteristics. Calibration with XGBoost significantly improved sensor accuracy, reducing RMSE by up to 82%, and should be applied in regular cycles. The models corrected for sensor drift and bias, particularly during peak pollution episodes. Rolling RMSE analysis further indicated that model performance declines over time, especially for PM10, requiring recalibration typically within 1 to 6 months after deployment.
While the study demonstrates substantial improvements in sensor performance after ML calibration, these outcomes should be interpreted in the context of several methodological constraints, particularly the reliance on spatiotemporal rather than fully co-located calibration data. Therefore, the results indicate the potential for robust network-wide accuracy rather than establishing strict equivalence to reference-grade measurements. Future work involving expanded co-location campaigns and integration of direct traffic and micro-environmental data will be essential to validate and refine these conclusions. Overall, the findings confirm that an expert-validated sensor system network, when properly calibrated and maintained, serves as a valuable tool for assessing air quality at the city level. This scalable solution for expanding monitoring networks aligns with forthcoming European air quality legislation, enabling more granular data coverage. By combining dense sensor deployments with ML calibration, cities can strengthen evidence-based policy, support targeted public health interventions, and foster citizen engagement in air quality monitoring and management.