Design and Implementation of a Crowdsensing-Based Air Quality Monitoring Open and FAIR Data Infrastructure

Abstract: This work reports on the development of a real-time vehicle sensor network (VSN) system and infrastructure devised to monitor particulate matter (PM) in urban areas within a participatory paradigm. The approach is based on the use of multiple vehicles on which sensors, acquisition and transmission devices are installed. PM values are measured and transmitted using standard mobile phone networks. Given the large number of acquisition platforms needed in crowdsensing, low-cost sensors (LCS) are required. This sets limitations in the precision and accuracy of measurements that can be mitigated using statistical methods on redundant data. Once data are received, they are automatically quality controlled, processed and mapped geographically to produce easy-to-understand visualizations that are made available in almost real time through a dedicated web portal. There, end users can access current and historic data and data products. The system has been operational since 2021 and has collected over 50 billion measurements, highlighting several hotspots and trends of air pollution in the city of Trieste (north-east Italy). The study concludes that (i) this perspective allows for drastically reduced costs and considerably improves the coverage of measurements; (ii) for an urban area of approximately 100,000 square meters and 200,000 inhabitants, a large quantity of measurements can be obtained with a relatively low number (5) of public buses; (iii) a small number of private cars, although less easy to organize, can be very important to provide infills in areas where buses are not available; (iv) appropriate corrections for LCS limitations in accuracy can be calculated and applied using reference measurements taken with high-quality standardized devices and methods; and that (v) by analyzing the dispersion of measurements in the designated area, it is possible to highlight trends of air pollution and possibly associate them with traffic directions.
Crowdsensing and open access to air quality data can provide very useful data to the scientific community but also have great potential in fostering environmental awareness and the adoption of correct practices by the general public.


Introduction
The World Health Organization defines air pollution as the "contamination of the indoor or outdoor environment by any chemical, physical or biological agent that modifies the natural characteristics of the atmosphere" [1]. Epidemiological evidence suggests that polluted air is one of the leading factors associated with the development of respiratory illness, cardiovascular disease and lung cancer [2]. At the same time, air pollution directly and indirectly affects the climate and damages buildings and cultural heritage [3,4]. Many countries have introduced specific legislation setting strict objectives for air quality. In the United States, this was implemented in the 1970s with the Air Quality Act [5] and later in Europe with the 2008/50/EC directive [6]. Although air quality has generally improved considerably since then, several air pollution hotspots remain in most Western countries [7,8]. Recently, several analysts have highlighted the risk that, given the current geopolitical situation and the shortage of natural gas, the resulting increase in the use of solid fuels could worsen the situation [9]. Indeed, while the combustion of

have considered some of these aspects and the scientific results that we were able to obtain by exploiting them. Here, we describe the technological aspects of our work in the hope that our experience could prove beneficial to others intending to replicate or eventually improve what we have been able to build so far.

Materials and Methods
Within this work, we will describe a system and infrastructure we developed that, leveraging the crowdsensing paradigm, allows the monitoring of air quality and represents its geographic distribution on a web portal in real time. The initiative is named "COCAL" after the dialectal term used for seagulls in the city of Trieste (Italy), where it has been developed and first deployed. The reason for using a seabird name comes from the fact that Trieste is a coastal city where applications of the crowdsensing paradigm can be envisaged in multiple environments. As a matter of fact, the first trials of COCAL were focused on monitoring marine parameters such as temperature, pH, salinity and dissolved oxygen. Further information on such trials can be found in [35,36].
Trieste is located at the north-east tip of the Adriatic Sea and occupies an NW-SE trending elongated area of approximately 90 square kilometers between the sea and the Karst plateau, which acts as a barrier to air masses (Figure 1). The city center lies in a restricted area that is characterized by economic activities linked to the tertiary sector and tourism and separated from an industrial area located in an SE sector. The wind regime is characterized in winter by a strong NE wind called Bora, which can effectively move polluted masses from the city to the sea and other nearby regions [13,40]. In summer, the sea breezes are the prevalent factor conditioning the behavior and position of the polluted air masses [41].
Within this work, we focus on monitoring PM only, but we are currently working on extending the method to other pollutants. It is worth highlighting that all the data acquired within this initiative and in other crowdsensing initiatives are gathered in an integrated database and managed within a fully FAIR-compliant perspective following international standards as mandated by ISO and OGC.

PM Sensors
Crowdsensing depends on a large number of acquisition devices, so the overall cost of the initiative grows with their number. Under this approach, in fact, it is not possible to use conventional PM monitoring techniques such as filters and gravimetric mass detection, which are very expensive and based on standardized procedures that can only be performed by trained personnel. Therefore, low-cost sensors (LCS) are needed. New technologies have emerged that use laser scattering, which relates the waveform of the scattered light to the diameter and number of particles, enabling real-time and continuous measurements of particulate matter.
A detailed description of the technologies behind PM sensors is beyond the scope of this work and can be found in other works such as, for example, [42,43]. Suffice it to say that these PM sensors consist of a fan, generally connected to a small tube, that pushes air into the sensing box. Light from a laser diode is scattered by the particles. This scattered light is received by a photodiode, which can estimate the concentration of each type of particle by classifying and counting the number of pulses detected.
Within COCAL, we use the SDS011 PM sensor from Nova Fitness Co., which enables simultaneously measuring both PM 2.5 and PM 10 levels at a very low cost.
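As an illustration of how readings from such a sensor might be decoded on the acquisition side, the following is a minimal sketch. It assumes the 10-byte serial frame layout commonly documented for the SDS011 (header 0xAA, command 0xC0, two little-endian 16-bit values in tenths of µg/m³, two ID bytes, a checksum over the six data bytes, tail 0xAB). The function name and the example frame are hypothetical and are not taken from the COCAL firmware.

```python
def parse_sds011_frame(frame: bytes) -> dict:
    """Decode one 10-byte SDS011 measurement frame into PM2.5 and PM10 (µg/m³)."""
    if len(frame) != 10:
        raise ValueError("expected a 10-byte frame")
    if frame[0] != 0xAA or frame[1] != 0xC0 or frame[9] != 0xAB:
        raise ValueError("bad header, command or tail byte")
    # Assumed checksum rule: low byte of the sum of the six data bytes (indices 2..7).
    if sum(frame[2:8]) & 0xFF != frame[8]:
        raise ValueError("checksum mismatch")
    # Values are little-endian 16-bit integers expressed in tenths of µg/m³.
    pm25 = (frame[2] | (frame[3] << 8)) / 10.0
    pm10 = (frame[4] | (frame[5] << 8)) / 10.0
    return {"pm25": pm25, "pm10": pm10}


# Hypothetical frame: raw PM2.5 = 123 (12.3 µg/m³), raw PM10 = 456 (45.6 µg/m³).
data = bytes([0x7B, 0x00, 0xC8, 0x01, 0x12, 0x34])
example = bytes([0xAA, 0xC0]) + data + bytes([sum(data) & 0xFF, 0xAB])
reading = parse_sds011_frame(example)
```

Rejecting frames with a wrong checksum or delimiter at this stage keeps corrupted serial data out of the later quality-control steps.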

LCS Performances
The major advantages of LCS in terms of price and portability come at the expense of limitations in precision and accuracy [43][44][45].
Many LCS manufacturers and models are available on the market, and detailed comparisons between them can be found, for example, in [12,24,42]. These works highlight that, in addition to the limitations in precision and accuracy, it is very important to consider the environmental conditions in which LCS sensors operate. For example, because these sensors do not have sample conditioning equipment, they are susceptible to drifts due to relative humidity (RH), which can affect the hygroscopic growth of particles and distort measurements [27,46]. The results of these studies demonstrate that among LCS, the issue of quality and precision of the specific brand and model of sensors can be as relevant as the intrinsic limitations of the technology employed, the issues related to the deployment in the designated environment and the environmental conditions.
In this perspective, to monitor the environmental conditions in which the acquisition takes place, together with the PM LCS, we also use a Dallas Semiconductor DS18B20 one-wire communication sensor with waterproof protection outside the acquisition box, while internally, we use a Bosch BME280 temperature, pressure and RH sensor connected to the board via the I2C bus.
In [42], useful references can be found to understand the performance of the SDS011 sensor and other similar sensors under controlled laboratory conditions. The results of that work confirm that the SDS011 sensor is suitable for use within the COCAL project since it performs reasonably well in comparison with similar or even more expensive sensors. At the same time, downsides have been identified such as a general trend to underestimate PM values and the presence of a delay in the timing of measurements.

Statistical Analysis
Following [44], it is particularly important to understand the behavior of the LCS used in this work under real-world conditions. In order to devise possible mitigation strategies for the issues introduced by the use of LCS, following the protocols suggested by US EPA and [43], we designed two experiments, namely: (i) a study of the behavior of LCS in a highly variable PM level environment to evaluate precision and (ii) a co-location-based evaluation of LCS with a reference measurement station managed by the regional environmental agency ARPA-FVG, in order to evaluate accuracy. It should be noted that during these tests, the data completeness of the COCAL system, meaning the ability to avoid gaps in measurements and data transmission, has always been very high, with almost negligible glitches, and well above the 75% threshold recommended by US EPA.

LCS Precision
To assess the precision of LCS, we studied the recordings of three co-located LCS in a highly variable PM concentration environment. Following US EPA standards and procedures mentioned by [43], precision can be estimated using the standard deviation (SD) and the coefficient of variation (CV). The SD is 2.15 µg/m³ for PM10 and 1.17 µg/m³ for PM2.5, while the CV is 24.70% for PM10 and 22.77% for PM2.5.
US EPA recommends a target SD less than or equal to 5 µg/m³ and a CV less than or equal to 30%. The tests we conducted therefore indicate that the LCSs used in COCAL meet the recommendations for precision.
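The two precision metrics can be sketched as follows. This is only an illustration under stated assumptions: the SD is taken as the pooled deviation of each co-located sensor from the mean of all sensors in the same averaging period, and the CV as that SD divided by the overall mean. This is one common formulation and may differ in detail from the procedure actually applied here. The function name and the sample values are hypothetical.

```python
import math


def precision_metrics(readings):
    """Return (SD, CV%) for co-located sensors.

    readings: list of averaging periods, each a list with one value per sensor.
    """
    n_values = sum(len(period) for period in readings)
    squared_dev = 0.0
    total = 0.0
    for period in readings:
        period_mean = sum(period) / len(period)
        # Deviation of each sensor from the mean of all sensors in that period.
        squared_dev += sum((x - period_mean) ** 2 for x in period)
        total += sum(period)
    sd = math.sqrt(squared_dev / (n_values - 1))
    cv = 100.0 * sd / (total / n_values)
    return sd, cv


# Hypothetical data: three sensors over two averaging periods (µg/m³).
sd, cv = precision_metrics([[10.0, 12.0, 11.0], [20.0, 22.0, 21.0]])
meets_targets = sd <= 5.0 and cv <= 30.0
```

Because deviations are measured within each period, real changes in PM concentration between periods do not inflate the SD; only disagreement between sensors does.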

LCS Accuracy
To test the accuracy of the selected sensors, we placed a COCAL box at a short distance from a certified reference air quality station (ARPA-FVG station 'Rosmini'). The reference PM values of this station are made available through an API on the official ARPA-FVG website as daily average values only. The measurements and comparison took place from mid-March 2022 to the end of April 2023.
In Figure 2, one can compare measurements from the reference station and those taken during the same period with the COCAL box located close to the reference station. Figure 2 is divided into two boxes. The upper box shows data and statistics for PM10, while the lower one focuses on PM2.5. In each box, the first graph (n.1) shows in red the time series of the reference daily average values as made available from the ARPA-FVG website, while the daily average values of the COCAL box located close to the reference station are plotted in blue.
High PM values were measured at the end of July 2022. Unfortunately, these were not outliers but the effects of a large forest fire that occurred for several days at a distance of about 20 km from the test site.
It is possible to note that COCAL measurements are generally lower than the reference measurements; however, during the first months of 2023, in three specific events (identified by boxes A, B and C), this behavior reverses. To better understand the performance of LCSs, Figure 2 also shows the difference between the reference and the COCAL time series (graph n.2), the RH time series (graph n.3) and the time series of the standard deviation of all COCAL measurements acquired near the reference station, calculated on a daily basis (graph n.4).
It is interesting to note that during the A, B and C events, the standard deviation of COCAL measurements increases. In two of these cases (A and C), this can be understood as owing to the sensitivity of the LCS to RH. In fact, the RH in those periods exceeded 60%, while elsewhere, when the LCS underestimated the reference measurements, the RH remained below this threshold. In the case of event B, however, where RH is low, a different explanation is needed. This could be found in the different technologies and sampling rates of the LCS and the reference acquisition system. COCAL boxes acquire data every ten seconds, which means that rapid variations in the actual PM concentrations can effectively be captured, increasing at the same time the standard deviation of the set of daily measurements. Reference systems sample much more slowly so that, even if very reliable, they can overlook rapid phenomena; in the comparison, LCS data therefore appear to be dispersed.
Following US EPA standards and procedures mentioned by [32], accuracy can be estimated using the coefficient of determination (R²), slope (m), intercept (b), root mean square error (RMSE) and the normalized root mean square error (NRMSE). Results for the mentioned tests for PM10 and PM2.5 are shown in Table 1, while Figure 3 provides a snapshot of the comparison of the LCS and reference measurements.
Figure 2. In both cases, graph no. (1) shows the daily average COCAL measurements in blue, while the reference daily value is plotted in red. Graph no. (2) shows the difference between the two time series. Graph no. (3) shows the RH time series. Graph no. (4) shows the standard deviation of COCAL measurements calculated day by day. As can be seen, COCAL measurements are generally lower than the reference measurements. During the first months of 2023 in three specific events (identified in the graph by the boxes A, B and C), this behavior reverses. It is interesting to note that during these events, the standard deviation of measurements also becomes high. In two of these cases (A and C), this can be understood as owing to the sensitivity of the LCS to RH, but in the case of event B, the RH is low.
Processes 2023, 11, x FOR PEER REVIEW
Considering US EPA recommendations, these results can be problematic. In fact, R² is recommended to exceed 0.70, while the analysis reveals lower values. The target slope should be approximately 1 ± 0.35, a condition that is instead respected by the LCSs. Similarly, the intercept parameter performs relatively well for PM2.5 sensors, while PM10 sensors do not fall within the recommended range since EPA recommends a value between −5 and +5. Following EPA standards, RMSE should be lower than 7 µg/m³, and here again PM2.5 scores rather well, whereas we measured a value twice the threshold for PM10. In addition, unfortunately, the NRMSE results are also too high, being around 60%, while EPA recommends a value less than 30%.
It can therefore be said that accuracy-wise, the LCSs perform rather poorly, both for PM 10 and PM 2.5 , and that a correction mechanism is needed to obtain results that could be comparable with the official ones. At the same time, it is possible to say that, since the precision is reasonably good, a geographic distribution of LCS measurements built on the integration of multiple COCAL boxes should reasonably be capable of highlighting general trends and pinpointing local anomalies.
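The accuracy metrics listed above can be computed from paired daily averages as in the sketch below. It assumes an ordinary least-squares regression of the sensor values on the reference values, RMSE computed on the sensor-minus-reference differences, and NRMSE obtained by normalizing RMSE by the mean reference value. The exact definitions used for Table 1 may differ, and the function name and numbers are hypothetical.

```python
import math


def accuracy_metrics(reference, sensor):
    """Compare paired sensor and reference values (same length, same days)."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(sensor) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in sensor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, sensor))
    slope = sxy / sxx                      # m in sensor = m * reference + b
    intercept = mean_y - slope * mean_x    # b
    r2 = sxy ** 2 / (sxx * syy)            # squared Pearson correlation
    rmse = math.sqrt(sum((y - x) ** 2 for x, y in zip(reference, sensor)) / n)
    nrmse = 100.0 * rmse / mean_x          # assumed normalization: mean reference
    return {"r2": r2, "slope": slope, "intercept": intercept,
            "rmse": rmse, "nrmse": nrmse}


# Hypothetical daily averages (µg/m³): the sensor reads 80% of the reference.
metrics = accuracy_metrics([10.0, 20.0, 30.0, 40.0], [8.0, 16.0, 24.0, 32.0])
```

The example shows why R² alone is not sufficient: a sensor that is perfectly correlated with the reference (R² = 1) can still have a large RMSE when it systematically underestimates.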

LCS Performances Improvement
The results of the tests are consistent with a wide literature on LCS performance. Several authors [47][48][49] highlighted the impact of the environmental conditions and, in particular, of the RH on the deviation between reference measurements and LCS. To address such problems, reference stations are equipped with a device that, by heating air samples, causes the water vapor condensed onto the particles to evaporate. This, of course, is not possible with LCSs. To compensate for this effect, several RH correction approaches exist such as, for example, κ-Köhler theory-derived factors and various types of regressions or machine learning methods. After an extensive survey of the existing literature, Ref. [44] maintains that such corrections are applied very seldom, with a simple linear regression, in that case, being the most used method. The same authors underline difficulties in accurately defining local parameters and accumulating knowledge from different cases and areas. It is also worth noting that RH itself can be a problematic parameter to measure and that, since COCAL is a VSN system, RH measurements taken with it can have further limitations.
Taking these considerations into account, and since the tests performed in Sections "LCS Precision" and "LCS Accuracy" with the LCSs we use in this work revealed that they perform reasonably well in terms of precision but unfortunately not well enough in terms of accuracy, we devised a specific and pragmatic two-step mitigation strategy to improve their performance.
The first step consists in filtering all measurements made in problematic conditions, for example, when RH is more than 60%. These data are automatically flagged and are not sent to the following processing flow.
The second step consists in calculating an accuracy correction to a reference station using a COCAL box located in its proximity. Since sensors proved to behave consistently among them, following [39], we apply the same accuracy correction to all the other sensors. As abovementioned, given that, in the designated area, only one value per day is currently available from the reference station, we calculate the difference between that reference value and the daily average value of all the measurements taken by an LCS co-located in the proximity of the reference station. Corrections are inter/extrapolated in all designated areas by means of the technique described in [39].
An example of the method's results can be seen in Figure 4, where on the left, the geographic distribution of LCS measurements before corrections is shown, while on the right, corrected values that are more consistent with reference measurements are shown. It is to be noted that this method can be problematic since applying the correction to areas far from where the reference station is located can unpredictably bias the final values. In the case proposed in Figure 4, measurements taken in the village of Opicina (upper part of the map) generally depict rather different conditions from the city center (lower part of the map). Opicina is, in fact, located uphill, is characterized by a different climatic setting and is less subject to vehicular traffic. No reference station is available in that area such that the only reference measurements available are those taken in the city center. In the example of Figure 4 (left), while the un-corrected data report a polluted city center and a much better situation uphill, after the corrections (Figure 4 right), the revised air quality also degrades notably in the hills. This could be an artifact that needs careful consideration when interpreting the data.
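The two steps can be sketched as follows. This is a simplified illustration, not the processing chain itself: it assumes records are dictionaries with hypothetical keys `day`, `pm` and `rh`, it applies the same daily offset to every sensor, and it leaves out the spatial inter/extrapolation of [39].

```python
from collections import defaultdict

RH_LIMIT = 60.0  # percent; measurements above this value are discarded


def filter_by_rh(records, rh_limit=RH_LIMIT):
    """Step 1: keep only measurements taken at or below the RH limit."""
    return [r for r in records if r["rh"] <= rh_limit]


def daily_offsets(colocated, reference_daily):
    """Step 2a: reference daily value minus the co-located LCS daily mean."""
    by_day = defaultdict(list)
    for r in filter_by_rh(colocated):
        by_day[r["day"]].append(r["pm"])
    return {day: reference_daily[day] - sum(values) / len(values)
            for day, values in by_day.items() if day in reference_daily}


def apply_correction(records, offsets):
    """Step 2b: add the offset of the matching day to every retained record."""
    return [{**r, "pm": r["pm"] + offsets[r["day"]]}
            for r in filter_by_rh(records) if r["day"] in offsets]


# Hypothetical values: box next to the reference station, then a mobile box.
colocated = [{"day": "2023-01-10", "pm": 10.0, "rh": 40.0},
             {"day": "2023-01-10", "pm": 14.0, "rh": 50.0},
             {"day": "2023-01-10", "pm": 30.0, "rh": 75.0}]
offsets = daily_offsets(colocated, {"2023-01-10": 18.0})
mobile = [{"day": "2023-01-10", "pm": 9.0, "rh": 45.0},
          {"day": "2023-01-10", "pm": 20.0, "rh": 80.0}]
corrected = apply_correction(mobile, offsets)
```

Since a single offset is applied everywhere in this sketch, it reproduces the limitation discussed above: areas far from the reference station receive the same shift as the city center.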

Deployment on Mobile Platforms
As mentioned above, sensors were installed on two different platform types, namely (i) buses and (ii) cars. In both cases, we developed a tailor-made waterproof box that can easily be installed on the platform and where all the acquisition and transmission electronics can be safely protected while air inlets and outlets could effectively bring air samples to the LCS.
Bus deployment has been developed with key help from the local transportation authority, TPL Trieste Trasporti, which kindly offered to host several COCAL systems. The boxes were installed on the roof of the buses (Figure 5) in a closed compartment with a specific air inlet passing through a syphon in order to prevent rain from entering the box. The power supply is obtained from the bus using a timed relay to minimize the impact of the COCAL box on the normal functioning of the vehicles. Buses are a very convenient acquisition platform because each unit can be redirected to several routes throughout the day, thereby covering a large portion of the urban area. On the other hand, bus routes tend to follow the main directions of the traffic in a city, which may bias the coverage of the designated area.
Cars have several advantages over buses, one being that they generally do not follow predefined routes. This makes cars a good means to provide infills in areas where buses are not available. At the same time, cars introduce other constraints that depend mainly on volunteer drivers. Issues may arise, in fact, in order to motivate them to cover areas that are not within their daily routines. In our experience, this often meant also using vehicles belonging to our institute.
COCAL boxes for cars have been designed entirely by us and 3D-printed autonomously with the help of the ICTP FabLab laboratory (Figure 6). The boxes were conceived to be fully autonomous and as minimally invasive as possible. This forced us to make some design choices; for example, since connecting the boxes to the car's power supply can be problematic, they are powered by batteries only. Autonomy is approximately one full day, although it can be longer depending on the rate of data transmission. Battery recharge can be completed in a few hours. Another choice was to avoid taking up space inside the vehicle or in the trunk, so we decided to position the box on the roof. To affix the box to the roof surface, we added magnetic plates on the bottom and, for further security, we decided to install it only on cars with roof bars, to which COCAL boxes are secured using Velcro strips. The air inlet passes through the curved white roof so that it remains dry in case of rain (Figure 3 lower left).
The air outlet is located on the back of the box. The roof can easily be removed to access the electronics inside.

Limitations of Mobile Platforms
Besides the already mentioned limitations in accuracy, we were also concerned about the possible effects of the deployment of LCSs on moving platforms. While it is known that platform speed influences the measurement, to the best of our knowledge, there is no specific study on this topic since most of the existing literature is based on fixed-position deployments. We therefore set up a test in which, by passing multiple times through the same area at different speeds during a restricted period with stable meteorological conditions, we collected a large dataset of measurements. The results of the experiment can be seen in Figure 7. These show an inverse relationship between PM and platform speed. Considering how the COCAL box is built, this is probably due to a pressure drop induced by the platform movement at the inlet of the box. This increases with speed, reducing the quantity of air that reaches the detection device and therefore reducing the estimates of PM values. The drift is relatively small and below the precision of sensors for speeds lower than 50 km/h, while higher speeds tend to be more problematic. Since the system has been installed mostly in an area where the speed limit is below 50 km/h, we can safely say that data collection is not particularly affected by this issue. As an additional safeguard, during data processing, measurements associated with a speed higher than 50 km/h are automatically filtered out of the calculations.
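A minimal sketch of the speed-related processing is given below. It assumes each sample carries a hypothetical `speed_kmh` field derived from the GPS track. The binning helper only illustrates how the relationship between PM and speed could be inspected and is not the analysis behind Figure 7.

```python
SPEED_LIMIT_KMH = 50.0  # samples recorded above this speed are discarded


def drop_fast_samples(samples, limit=SPEED_LIMIT_KMH):
    """Keep only samples acquired at or below the speed limit."""
    return [s for s in samples if s["speed_kmh"] <= limit]


def mean_pm_by_speed_bin(samples, bin_width=10.0):
    """Average PM per speed bin, keyed by the lower edge of each bin (km/h)."""
    bins = {}
    for s in samples:
        lower_edge = bin_width * (s["speed_kmh"] // bin_width)
        bins.setdefault(lower_edge, []).append(s["pm"])
    return {edge: sum(values) / len(values) for edge, values in sorted(bins.items())}


# Hypothetical passes through the same area at different speeds.
samples = [{"speed_kmh": 5.0, "pm": 20.0}, {"speed_kmh": 8.0, "pm": 22.0},
           {"speed_kmh": 25.0, "pm": 19.0}, {"speed_kmh": 55.0, "pm": 14.0},
           {"speed_kmh": 62.0, "pm": 12.0}]
binned = mean_pm_by_speed_bin(samples)
kept = drop_fast_samples(samples)
```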

Data Acquisition and Transmission
The acquisition system (Figure 8) is based on a low-cost ESP32 microcontroller. We selected a Heltec LoRa 32 v2 board, which has an embedded OLED display and battery charger together with a LoRa chip and Wi-Fi and Bluetooth connectivity. Wi-Fi and Bluetooth are used for testing and short-distance connectivity, while LoRa is used for long-distance connectivity [28]. To this, we added GSM and GPS functionalities using an A9G development board, designed by Ai-Thinker, which, with an active SIM card, allows data transmission over the GSM mobile network where coverage is available. Data transmitted over Wi-Fi and GSM are stored directly in an InfluxDB database, whereas for LoRa we rely on The Things Network (TTN) LoRaWAN infrastructure: Telegraf, a server-based agent, retrieves the data from TTN using the MQTT protocol and stores them in the database.

Figure 9 describes the general architecture of the COCAL system. The flux of incoming data transmitted through LoRaWAN flows into an InfluxDB table filled by the TTN service. A server script reroutes the data into the main InfluxDB time-series tables after proper conversion. The final storage and processing server is based on a Postgres database, with a PostGIS extension for dealing with georeferenced objects, geographic projections and geometric objects such as polylines. This storage/processing server (SPS) is built on an open-source architecture: Linux Ubuntu, Apache, PHP, Python and Postgres. It currently manages the database, as well as several scripts responsible for the processing and the web front-end.
A PHP script periodically synchronizes InfluxDB with the Postgres database, inserting into the latter every valid sensor measurement together with a time marker (UTC), WGS84 coordinates, all other GPS information (such as altitude or speed), the type of transmission (e.g., GSM), a device ID, a sensor ID (e.g., atmospheric pressure) and the measured value (e.g., 1007 mbar).
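To make the record layout concrete, the fields listed above can be sketched as a small Python data structure. All field names, the example identifiers and the validity rule are assumptions made for illustration; the actual Postgres schema and the criteria used by the synchronization script are not given in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """One synchronized record (hypothetical field names)."""
    time_utc: datetime            # time marker, expected in UTC
    lat: float                    # WGS84 latitude, degrees
    lon: float                    # WGS84 longitude, degrees
    altitude_m: Optional[float]   # additional GPS information
    speed_kmh: Optional[float]    # additional GPS information
    transmission: str             # e.g., "GSM", "LoRa", "WiFi"
    device_id: str
    sensor_id: str                # e.g., "pressure", "pm10"
    value: float                  # measured value in the sensor's unit


def is_valid(m: Measurement) -> bool:
    """Assumed validity rule: UTC timestamp and plausible coordinates."""
    in_utc = m.time_utc.utcoffset() == timedelta(0)
    return in_utc and -90.0 <= m.lat <= 90.0 and -180.0 <= m.lon <= 180.0


# Illustrative record; the device ID is invented.
example = Measurement(
    time_utc=datetime(2023, 4, 1, 10, 15, tzinfo=timezone.utc),
    lat=45.65, lon=13.77, altitude_m=12.0, speed_kmh=28.0,
    transmission="GSM", device_id="cocal-01",
    sensor_id="pressure", value=1007.0,
)
```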

Data Processing
The SPS performs different activities by means of PHP scripts, which are scheduled with crontab. The most demanding analyses are implemented in Python using widely adopted libraries such as NumPy, SciPy, Matplotlib or PIL.
All processed products are made available in near real-time and stored permanently for better performance.

Figure 9. General architecture of the COCAL system.

Window Averaging
Window averaging is necessary to assimilate the large amount of data acquired by many devices spread across a wide area. As in [35], we define a geographical grid of 200 m × 200 m cells, based on a local projection. In addition, we subdivide the timeline into 1 h intervals. Every set of data spanning a spatial cell and a time interval is a datacube, including measurements from different devices that mount the same kind of sensor (e.g., PM10). The choice of a local projection provides good accuracy when the area of interest is limited, and, in the case of this work, we used WGS84/UTM zone 33N (EPSG:32633). In order to obtain values that are smoother and more representative of the physical phenomenon, reducing the outliers and providing a uniform subdivision of space and time, and considering the good results obtained in [26], we adopted a similar approach by averaging data (calculating the median) over space and time datacubes. Larger time intervals (for example 2, 3, 4 or 8 h) can be analyzed by selecting the specific datacube. These are processed once a day and made available the next day. A discussion on the advantages of window averaging and the shape of the cells can also be found in [39].
Every averaged datacube is stored into the database, marked with a start time, end time and a polyline describing the square cell.
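The datacube construction can be illustrated with a short Python sketch. It assumes that coordinates have already been projected to EPSG:32633 (easting and northing in meters) and that timestamps are timezone-aware; the function names and the sample values are hypothetical and do not reproduce the production scripts.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

CELL_M = 200.0    # cell side, as stated in the text
WINDOW_S = 3600   # 1 h time interval


def datacube_key(easting, northing, t):
    """Return (column, row, hour slot) identifying a datacube."""
    col = math.floor(easting / CELL_M)
    row = math.floor(northing / CELL_M)
    slot = int(t.timestamp()) // WINDOW_S  # hours since the Unix epoch
    return col, row, slot


def window_average(points):
    """Median per datacube.

    points: iterable of (easting, northing, datetime, value) tuples,
    all from the same kind of sensor (e.g., PM10).
    """
    cubes = defaultdict(list)
    for easting, northing, t, value in points:
        cubes[datacube_key(easting, northing, t)].append(value)
    return {key: median(values) for key, values in cubes.items()}


def at(hour, minute):
    return datetime(2023, 4, 1, hour, minute, tzinfo=timezone.utc)


# Illustrative points only, not real COCAL measurements.
points = [
    (404310.0, 5056120.0, at(10, 5), 18.0),
    (404390.0, 5056180.0, at(10, 40), 22.0),
    (404350.0, 5056010.0, at(10, 50), 95.0),  # outlier in the same cube
    (404610.0, 5056120.0, at(10, 20), 30.0),  # different cell
    (404310.0, 5056120.0, at(11, 5), 12.0),   # same cell, next hour
]
averaged = window_average(points)
```

The first three points fall in the same cell and hour, so the outlier of 95 does not affect the result: the median of that datacube is 22.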

Correction of LCS Data
As mentioned in Section "LCS Accuracy", the accuracy of LCSs can be problematic. The technologies used within these sensors and the rate of sampling, together with the effects of environmental parameters such as relative humidity (RH), often induce drifts in the LCS measurements. In Section 2.1.3, we introduced a pragmatic method that can mitigate such effects. It is based on applying to the LCS measurements a correction value that is calculated daily as the difference between the value provided by a reference station and that of a fixed LCS located in the vicinity of the reference station. The correction is applied server-side one day after the LCS data are actually collected, since the reference value is available only with such a delay. The resulting grid of data is then made available to the web portal.
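A minimal sketch of this daily correction follows. It assumes that the fixed LCS is summarized by its daily mean and that the correction is applied additively to the mobile measurements; both choices, as well as the function names and the numbers, are assumptions made for illustration.

```python
from statistics import mean


def daily_correction(reference_daily_avg, fixed_lcs_readings):
    """Correction = reference value minus the daily mean of the fixed LCS."""
    return reference_daily_avg - mean(fixed_lcs_readings)


def apply_correction(mobile_values, correction):
    """Shift every mobile LCS value by the daily correction (assumed additive)."""
    return [v + correction for v in mobile_values]


# Illustrative numbers only.
correction = daily_correction(25.0, [30.0, 32.0, 34.0])   # fixed LCS reads high
corrected = apply_correction([40.0, 28.5], correction)
```

In this example the fixed LCS overestimates the reference by 7 units, so every mobile value collected on that day is lowered by the same amount.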

Interpolation and Contouring
In order to provide a more intuitive insight into the measured phenomenon, interpolation is a useful tool. Following [39], there are many aspects to take into account when spatial interpolation is applied: (i) The accuracy of the method and how far from the samples the interpolated values are still meaningful, i.e., a consideration on "extrapolation". This issue can be partially solved by the definition of an area of interpolation, such as the bounding box or (better) the convex hull of the samples as a first approximation. (ii) The computational complexity and the relative speed of the interpolation method, which must comply with the near real-time requirement. In our implementation, we chose a very quick and sufficiently accurate method, inverse distance weighting (IDW). IDW interpolation is defined as follows. Assuming that $\{x_1, \ldots, x_N\}$ are the interpolating points (samples) with measured values $u_i = u(x_i)$, and $x$ is a generic point, the interpolant is:
$$u(x) = \frac{\sum_{i=1}^{N} w_i(x)\, u_i}{\sum_{i=1}^{N} w_i(x)},$$
where the weights are:
$$w_i(x) = \frac{1}{d(x, x_i)^{p}},$$
with $d(x, x_i)$ the distance between $x$ and the sample $x_i$, $p$ the power exponent, and $u(x) = u_i$ when $x$ coincides with $x_i$. (iii) In addition, there is an epistemological aspect to be further considered: all processing is automatic and cannot be assisted by human intervention. This fact excludes algorithms such as Kriging, which involve many discretionary models and parameters.
An excellent alternative solution is natural neighbor interpolation (Sibson's method), which is based only on the geometrical properties of the dataset but is approximately ten times slower than IDW [50]. Lastly, it is necessary to define the contouring levels in some adaptive way to improve readability, and to choose a color map for the interpolation that is coherent with all other data visualizations. The interpolation/contouring is implemented on the SPS with a Python script that reads the averaged values, applies the IDW method (with exponent p = 3), generates the contours and produces a transparent PNG image with a small text file for georeferencing. The image is clipped around the convex hull of the dataset, excluding the external area.
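As an illustration of the IDW method with exponent p = 3, the following pure-Python sketch evaluates the interpolant at a single point. It is not the SPS script: contouring, convex-hull clipping and image generation are omitted, and the function name and sample layout are assumptions.

```python
import math


def idw(samples, x, y, p=3.0):
    """Inverse distance weighting at point (x, y).

    samples: non-empty list of (xi, yi, ui) tuples in projected
    coordinates (meters). Returns the weighted mean of the ui with
    weights 1 / d**p; a query point coinciding with a sample returns
    that sample's value, which avoids a division by zero.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    numerator = 0.0
    denominator = 0.0
    for xi, yi, ui in samples:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            return ui
        w = 1.0 / d ** p
        numerator += w * ui
        denominator += w
    return numerator / denominator


# Illustrative cell centers 200 m apart with made-up PM values.
cells = [(0.0, 0.0, 10.0), (200.0, 0.0, 20.0), (0.0, 200.0, 40.0)]
midpoint_value = idw(cells[:2], 100.0, 0.0)   # equidistant from both samples
near_first = idw(cells, 10.0, 10.0)           # dominated by the first sample
```

With a high exponent such as p = 3, the weights decay quickly with distance, so the interpolated value close to a sample stays close to that sample's value.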

Near-Real-Time Web-Based Visualization
The visualization of environmental data is a topic that raises several questions: Are our data time series or spatial distributions? How do we represent time-varying phenomena? Which colors and graphic patterns are more effective and representative? To what extent does the computation have an impact on near-real-time web interaction? There are many different answers, of course, and much research involving mathematical, computational or psychological aspects (see, for example, [39]).
We implemented a set of visualizations in the web front-end, which allows the end user to browse through spatial and temporal coordinates and to select and analyze both single acquisitions and averaged maps (Figure 10). All services are available at the web portal https://cocal.ogs.it (accessed on 1 June 2023).
The web interface allows easy navigation through the datasets by means of a simple window (Figure 10 left) where the user can select a single device, the acquisition sensor, the time interval and many different options. The time selection can be made in local time or in UTC, and a simplified view of the day shows the density of available data as shades of red, providing one-click access to time selection.
A calendar (month view button) shows the data density day by day by using the same principle. The acquisitions of a single device are represented as an interactive chart of the time series (a) or as a collection of connected points on a map (b). In the latter case, an arrow shape can show the GPS direction, and the point colors are mapped to the measure scale and corresponding legend.
The averaged data (on a rectangular grid) are represented as colored and labeled polygons (c). The interpolated data use the same color coding but are represented as continuous within the data convex hull, with superimposed contour lines (d). All graphic elements are responsive, showing all data details.
Additionally, we implemented a functionality that allows the user to follow cumulative data as an animation, cycling from a starting to an ending hour, in order to dynamically represent the temporal evolution of each parameter.

Advanced Analysis
The adoption of UTM33N as the map projection is a disadvantage when the acquisitions fall outside the longitude range from 12° E to 18° E. Moreover, the implemented processing mechanism, which computes the averages periodically (in the background), is very efficient for a quick response and a fluid user experience but, on the downside, is rather rigid. A more flexible interface for data analysis has been tested, based on the global map projection "Web Mercator" (EPSG:3857, Pseudo-Mercator/Spherical Mercator). This spherical projection is used by most web mapping services such as Google Maps, Bing, ESRI, etc.; it has an increasing distortion at high latitudes and is not conformal, but it is the de facto standard for web applications and allows global coverage for processing. The web page shown in Figure 11 provides a wider range of query parameters and builds an analysis "on the fly" (windowed average or IDW interpolation). The computation is restricted to the visible bounding box and requires some computational time, trading response speed for extended query flexibility.

Figure 11. COCAL web "real time analysis" with advanced queries.
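For reference, the forward Web Mercator (EPSG:3857) transformation can be written in a few lines of Python. This is a generic sketch of the standard spherical formulas, not code taken from the COCAL portal; the helper that checks the UTM zone 33N longitude band is a hypothetical convenience.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude (degrees) to EPSG:3857 meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat_deg) / 2.0)
    )
    return x, y


def in_utm33n_band(lon_deg):
    """True if the longitude lies within UTM zone 33N (12° E to 18° E)."""
    return 12.0 <= lon_deg <= 18.0


# Approximate coordinates of Trieste, used only as an example.
x, y = lonlat_to_web_mercator(13.77, 45.65)
```

The logarithmic term is what stretches distances at high latitudes, which is the distortion mentioned above.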

Open and FAIR Data Access
According to the FAIR principles, data must be findable, accessible, interoperable and reusable. COCAL deploys different protocols and implementations aiming to provide Open and FAIR data in accordance with well-established and official standards. In order to achieve discoverability, the initiative handles the standard ISO 19115-3 metadata profile through the GeoNetwork catalog application. To ensure interoperability, such as machine-to-machine data flows, data harvesting or archive federations, the COCAL geospatial database is compliant with OGC (Open Geospatial Consortium) standards, deployed as Web HTTP services: (i) WMS (Web Map Service), which provides georeferenced map images of the requested features, and (ii) WFS (Web Feature Service), which provides detailed and fine-grained information about features or general capabilities of the dataset in a structured text format (XML, JSON, etc.). All OGC services are implemented on a GeoServer platform, an open-source server application running on Apache Tomcat and linked to the main database. In order to achieve accessibility, data products are fully open and accessible, while a download of raw data in CSV format is available on the COCAL web portal, after authentication with a trusted account.
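A machine-to-machine client would typically query such a WFS endpoint with a GetFeature request. The sketch below only builds the request URL using standard WFS 2.0.0 key-value parameters; the base URL and the layer name are placeholders, since the actual COCAL service address and layer names are not given in the text.

```python
from urllib.parse import urlencode


def wfs_getfeature_url(base_url, type_name, max_features=100):
    """Build a WFS 2.0.0 GetFeature URL requesting JSON output."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,                 # layer to query
        "count": max_features,                  # limit on returned features
        "outputFormat": "application/json",     # JSON output, as offered by GeoServer
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and layer name (hypothetical).
url = wfs_getfeature_url("https://example.org/geoserver/wfs", "cocal:cells")
```

The resulting URL can be fetched with any HTTP client; a GetCapabilities request (`request=GetCapabilities`) against the same endpoint lists the layers that are actually published.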

Results
The technology behind COCAL has been under constant development since 2020. During the initial phase, trials took place using multiple simultaneous acquisition and transmission platforms mounted on vehicles operated by our institution. This allowed us to extensively test the system, its scalability, precision and accuracy during specific targeted surveys. Once the system was finalized, we were able to deploy a fully operative system on vehicles of the local transportation authority (Trieste Trasporti) and on some volunteers' cars. COCAL entered into service in February 2021 and has been fully operative ever since.
Up to April 2023, the system acquired and processed a remarkable amount of data, both "points" (single measurements) and "cells" (averaged results). Table 2 reports the approximate number of records per year during the period from January 2020 to April 2023. Data are fully public and can be accessed using standard procedures from the COCAL web portal (https://cocal.ogs.it, accessed on 1 June 2023).
The main results of the work are, on one hand, the COCAL system itself, which has proved to be efficient, robust and easy to install and maintain, allowing a very high throughput of environmental data that strongly support the paradigm of low-cost participative systems in monitoring the environment. On the other hand, a very important result is the availability of a very large quantity of environmental measurements, allowing us to significantly increase the spatial and time coverage of the distribution of air pollutants in the designated area. Initial analysis of the acquired dataset enabled identifying several interesting features of the air quality in the area of Trieste. Some of these observations have already been published in the papers mentioned above.
Figure 12 shows the geographic distribution of the standard deviation of all measurements made by the COCAL systems in the designated area. For every possible grid cell, we consider all PM measurements during the whole considered period (... 2021 ... 2022). Acquisitions within each cell are divided into time windows of 1 h, and some statistical parameters are calculated for every window, such as the number of samples, average, median and standard deviation. Eventually, the maximum standard deviation over the period is calculated for every cell.
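The per-cell statistic shown in Figure 12 can be sketched as follows. The sketch assumes that measurements have already been assigned to a cell and to a 1 h window; the use of the population standard deviation and the exclusion of windows with fewer than two samples are assumptions, as the text does not specify them.

```python
from collections import defaultdict
from statistics import pstdev


def max_hourly_std(records):
    """Maximum hourly standard deviation per cell.

    records: iterable of (cell_id, hour_slot, value) tuples.
    Windows with fewer than two samples are skipped (assumption),
    since a single value carries no information on dispersion.
    """
    windows = defaultdict(list)
    for cell_id, hour_slot, value in records:
        windows[(cell_id, hour_slot)].append(value)

    result = {}
    for (cell_id, _hour_slot), values in windows.items():
        if len(values) < 2:
            continue
        std = pstdev(values)
        result[cell_id] = max(result.get(cell_id, 0.0), std)
    return result


# Illustrative records with made-up cell identifiers and values.
records = [
    ("A", 0, 10.0), ("A", 0, 10.0),   # stable hour
    ("A", 1, 10.0), ("A", 1, 20.0),   # variable hour
    ("B", 0, 7.0),                    # single sample, skipped
]
per_cell = max_hourly_std(records)
```

Cells that are visited rarely contribute few windows, which is one reason why the resulting map has to be read with the coverage bias in mind.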
This map should be interpreted with some caution. On one hand, the standard deviation could provide an idea of the amount of variation in measurements during a specific period of time. Where the standard deviation is high, this could mean that higher levels of PM have been recorded in that area compared to areas where the standard deviation is lower. On the other hand, there is a risk that, if the data coverage is biased, the standard deviation is biased as well. Indeed, in Figure 12, it is easy to identify an NW-SE alignment (highlighted by the light blue line) where high values of the standard deviation seem to gather. This alignment correlates almost perfectly with the main traffic direction of the city. It could be tempting to associate this trend with the traffic, concluding that those areas are the most polluted in the city. We think that this conclusion should be drawn carefully, because this direction also corresponds to the most frequently followed routes of the buses used in the COCAL initiative. Most of the data, in fact, have been acquired in those areas so that, in comparison to other areas, measurements made in the former could have had the opportunity to detect all pollution events, while measurements made in other areas may just have overlooked them.
An interesting and surprising result we obtained is related to the possible deterioration of the sensors after long usage. After approximately 2 years of activity, we decided to substitute the hardware of the deployed systems and discovered that they practically had not accumulated any dirt inside and that their precision remained (considering the intrinsic limitations of the LCS) almost unaltered (Figure 13).

Conclusions and Future Work
This work describes a crowdsensing-based air monitoring system, following all of the technological segments of a path that starts from the actual measurement using LCSs and continues through data transmission and processing to a FAIR-compliant web-based representation of, and access to, the reconstructed data products. The main conclusion of this work is, therefore, that the implementation of all segments of the system can be achieved using low-cost and open-source technology only. At the same time, the acquisition of data does not need trained personnel but can be performed with the help of volunteers and especially of the local transportation authorities. The results of the experience we propose here suggest that such systems can be trustworthy from the point of view of the precision of measurements, while it is necessary to rely on a reference value to correct the deviation of measurements due to the intrinsic limitations of the LCSs. Currently, within the proposed system, this correction is calculated daily, since the reference values are made available by the local environmental agency only as daily averages and one day after the actual measurements took place. We demonstrated that, in some cases, this can be problematic and that, when available, corrections should be calculated with a higher temporal resolution. Other limitations of the system were taken into consideration, such as the speed of the acquisition vehicle. We showed that this speed should not exceed 50 km/h; otherwise, the variation in air pressure at the inlet could bias the measurements. We also demonstrated that after almost two years of continuous field operations, the amount of dirt accumulated within the acquisition box in the case of the designated area was minimal. This, of course, can depend on the levels of pollution in the specific cases where the system is applied.
The amount of data acquired raised important questions from the viewpoint of the ICT system to be used. We understood that, for example, a modular approach that separates each activity onto a different virtual machine helps considerably in monitoring the performance of the system and in understanding where it may be necessary to increase dedicated resources.
After two years of testing the system with five COCAL boxes installed on local transportation authority buses, we understood that, since these vehicles can often be under maintenance or rerouted, five units is the minimum number of installations for a city of approximately 200,000 inhabitants and an area of approximately 100,000 square meters. In this perspective, much depends on the actual urban configuration. In fact, while installations on cars can cover the urban area almost randomly, buses follow the bus line distribution, which is generally concentrated in specific areas while neglecting others. The resulting data can be biased and limit the reconstruction of the distribution of air quality.
In the coming months, the number of COCAL installations on the public bus network of Trieste will be doubled in order to achieve a broader and more homogeneous coverage of the city.
Since the designated area is coastal, the system developed so far allows reconstructing air quality over land only. This does not allow studying in depth the phenomena related to the movements of polluted air masses due to sea breezes. In this perspective, an extension of the system at sea is planned using volunteers' recreational sailing boats, while the system will also be installed on a certain number of sea buoys managed by OGS.