Daily Concentrations of PM2.5 in the Valencian Community Using Random Forest for the Period 2008–2018

Abstract: Fine particulate matter (PM2.5) is a global problem that affects population health and contributes to climate change. Remote sensing provides useful information for the development of air quality models. This work aims to obtain a daily model of PM2.5 levels in the Valencian Community with a resolution of 1 km for the period 2008–2018. MODIS-MAIAC images, meteorological parameters from the MERRA-2 project, land cover information and ground-level measurements of PM2.5 were analysed with Random Forest. The model was verified using cross-validation repeated ten times and evaluated on a test set comprising 20% of the collected data. The final model was used to generate maps of the daily concentrations of PM2.5 for the Valencian Community throughout the study period.


Introduction
Atmospheric aerosols are a critical component of the atmosphere because of their influence on climate change and population health [1]. Remote sensing has contributed significantly to air quality studies by capturing the spatio-temporal variation of pollutants. Previous papers have shown that aerosol optical depth (AOD), a measure of the amount of light absorbed or scattered by suspended particles, is a relevant parameter for estimating PM2.5 at ground level [2,3].
Moderate Resolution Imaging Spectroradiometer (MODIS) products are widely used in atmospheric models, as they offer a daily revisit and a spatial resolution suitable for regional and local studies [2]. The recent Multi-angle Implementation of Atmospheric Correction (MAIAC) algorithm presents new opportunities for the development of atmospheric aerosol models [3]. These images employ the MODIS Aqua and Terra data and improve spatial resolution from 25 to 1 km [3].
Land use regression (LUR) models have been widely applied in different parts of the world. LUR models incorporate satellite images and meteorological and land use information as predictors to model PM10 (particulate matter with diameter < 10 µm) and PM2.5 (particulate matter with diameter < 2.5 µm). Recent work shows positive results from the use of Random Forest (RF) in LUR models [2].
Air quality in the Valencian Community is affected by the growth of the vehicle fleet, industrial production, Saharan dust events and biomass burning smoke. An air quality model would allow the risk of exposure to be estimated. Previous works reported the correlation between daily ground measurements of PM2.5 and MODIS AOD for this region [4,5]. The aim of this work is to apply an RF model, using 15 atmospheric variables and land use characteristics, to estimate the daily ground-level PM2.5 concentration on a 1 km grid for 2008–2018 in the Valencian Community.

Study Area
The Valencian Community lies in the east of the Iberian Peninsula, on the Mediterranean coast (Figure 1). The population is mainly concentrated in urban centres, in particular in the metropolitan areas of Valencia and Alicante. Segura et al. [6] found a significant relationship between black smoke and the number of emergency admissions for heart disease in the city of Valencia.

Data Sets
Hourly ground-level concentrations of PM2.5 were downloaded from the Valencian Network for Monitoring and Control of Atmospheric Pollution of the Generalitat Valenciana for the period from 1 January 2008 to 30 September 2018 (http://www.agroambient.gva.es/va/web/calidad-ambiental). During this time, 24 stations measured PM2.5 continuously, of which only 12 had less than 30% missing values (Figure 1). For this work, we calculated the average PM2.5 concentration between the Aqua and Terra satellite overpass times.
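The station screening and overpass averaging described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow reported for this study; the function names, the sample readings and the 10:00 to 13:00 window are assumptions made for the example, since the exact overpass hours are not stated in the text.

```python
from statistics import mean


def missing_fraction(values):
    """Share of missing (None) entries in a series of hourly readings."""
    return sum(v is None for v in values) / len(values)


def overpass_mean(hourly, start_hour, end_hour):
    """Mean of the non-missing readings with start_hour <= hour <= end_hour.

    `hourly` maps hour of day (0-23) to a PM2.5 reading or None.
    Returns None when every reading in the window is missing.
    """
    window = [v for h, v in hourly.items()
              if start_hour <= h <= end_hour and v is not None]
    return mean(window) if window else None


# Hypothetical readings for one station-day (values are made up).
day = {9: 10.0, 10: 12.0, 11: None, 12: 14.0, 13: 16.0, 14: 30.0}

# Assumed window bracketing the two satellite overpasses.
pm25_overpass = overpass_mean(day, 10, 13)

# Same 30% threshold as in the text, applied here to a single day only
# for illustration; the study applies it to each station's full record.
keep_station = missing_fraction(list(day.values())) < 0.30
```

Missing hours are skipped rather than imputed in this sketch; whether the original analysis imputed them is not stated.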
MODIS-MAIAC products were downloaded from the Level-1 and Atmosphere Archive and Distribution System website (https://ladsweb.modaps.eosdis.nasa.gov/) [2]. AOD measurements were calibrated with Level 2.0 data from the Aerosol Robotic Network (AERONET) (http://aeronet.gsfc.nasa.gov/) [4]. The fraction of artificial surface was estimated for each pixel using the information provided by the Corine Land Cover project for the year 2012 [7]. The terrain elevation was obtained from the Consultative Group for International Agricultural Research Consortium for Spatial Information GEOPortal (http://srtm.csi.cgiar.org) with a resolution of 90 m at the equator [8].
Finally, atmospheric conditions data were downloaded from NASA's Goddard Earth Sciences Data and Information Services Center website (https://disc.gsfc.nasa.gov/). The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) is a global atmospheric reanalysis that uses the Goddard Earth Observing System Model, Version 5 with its Atmospheric Data Assimilation System, at a spatial resolution of 0.5° × 0.625° [9].
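As an illustration of the artificial-surface fraction, the sketch below counts the Corine Land Cover cells inside one 1 km pixel whose level-1 class is 1 (artificial surfaces, level-3 codes 111 to 142). The function name and the sample codes are hypothetical, and the original processing may have used a different (for example, area-weighted) calculation.

```python
def artificial_fraction(clc_codes):
    """Fraction of cells whose CLC level-3 code belongs to level-1 class 1."""
    if not clc_codes:
        raise ValueError("pixel contains no land cover cells")
    # Integer division by 100 keeps the leading digit, i.e. the level-1 class.
    artificial = sum(1 for code in clc_codes if code // 100 == 1)
    return artificial / len(clc_codes)


# Hypothetical pixel: two urban cells, one arable cell, one forest cell.
clc_1 = artificial_fraction([111, 112, 211, 311])
```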

Statistical Analysis
The RF model was trained with 80% of the collected data and evaluated with the remaining 20%. The model was built using PM2.5 observations as the dependent variable. The predictor variables of the model were: (1) atmospheric variables: aerosol optical depth (AOD), surface pressure (PS), relative humidity (RH), surface temperature (T), surface wind component u (U), surface wind component v (V), black carbon surface mass concentration (BCSMASS), dimethyl-sulfide surface mass concentration (DMSSMASS), dust surface mass concentration (DUSMASS), SO4 surface mass concentration (SO4SMASS), sea salt surface mass concentration (SSSMASS25), total precipitation (PRECTOT), high cloud cover (CLDHGH), low cloud cover (CLDLOW); (2) land use: fraction of artificial surface (CLC_1); (3) terrain elevation (DEM). The data were centered and scaled before being incorporated into the model.
Model verification used 10-fold cross-validation (CV). The importance of each variable was calculated after the model was fitted. Analyses were performed using the R language [10] and the "caret" package for the RF model [11]. The maps were made with the QGIS software [12].
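The 80/20 split and the centering and scaling step can be sketched as follows, using only the Python standard library instead of the R tooling reported for this study. The helper names, the seed and the sample values are hypothetical. Estimating the mean and standard deviation from the training portion only is an assumption of this sketch; the text does not say which data were used to compute them.

```python
import random
from statistics import mean, stdev


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(column):
    """Return the mean and sample standard deviation of one predictor."""
    return mean(column), stdev(column)


def apply_scaler(column, mu, sd):
    """Center and scale a predictor with previously fitted parameters."""
    return [(x - mu) / sd for x in column]


# Ten hypothetical observations, identified here by their row index.
train, test = train_test_split(range(10))

# Made-up training values for a single predictor.
mu, sd = fit_scaler([2.0, 4.0, 6.0])
scaled_train = apply_scaler([2.0, 4.0, 6.0], mu, sd)
scaled_test = apply_scaler([8.0], mu, sd)  # reuses the training parameters
```

Tree-based models are largely insensitive to monotonic rescaling of predictors, so this step mainly keeps the variables on comparable scales for inspection.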
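The resampling scheme can be sketched as an index generator for 10-fold cross-validation repeated ten times, as stated in the abstract. This is a plain Python illustration with hypothetical names, not the internal procedure of the "caret" package; each (train, validation) pair would be passed to whatever model-fitting routine is in use.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repeat
        folds = [idx[i::k] for i in range(k)]
        for i, val_idx in enumerate(folds):
            # Training indices are all folds except the held-out one.
            train_idx = [j for f, fold in enumerate(folds) if f != i
                         for j in fold]
            yield train_idx, val_idx


# 50 hypothetical observations -> 10 folds x 10 repeats = 100 splits.
splits = list(repeated_kfold(50))
```

Averaging the validation error over all splits gives a steadier estimate than a single 10-fold run, at ten times the fitting cost.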

Results and Discussion
The ground PM2.5 average for the entire period was 8.  (Figure 1).
With ntree = 10, the RF model reached its best accuracy at mtry = 7. The variables PRECTOT, CLDHGH and AOD were the most important predictors in the model construction process. PRECTOT and CLDHGH have a pronounced influence on the deposition and dispersion of pollutants. Based on these results, we ran the daily model for the rest of the Valencian Community. Figure 2a

Conclusions
This study proposed a daily concentration model of PM2.5 based on RF for the Valencian Community (Spain). The method used MAIAC AOD measurements, MERRA-2 products and land cover information to simulate ground-level PM2.5 values. Based on the 10-fold cross-validation and the test set evaluation, the model performs very well. With these data, we were able to predict ~90% of the temporal and spatial variability of PM2.5. More RF trees and an exhaustive analysis of predictor variables should benefit future PM2.5 simulations. This work may provide support for air quality management and may also provide evidence for epidemiological studies.