Lag Variables in Air Pollution Modeling Based on Traffic Flow and Meteorological Factors

Abstract: To refine research on the impact of environmental factors on air pollutant concentrations, we present a mathematical model that takes into account past values of the factors (explanatory variables) when modeling the current pollutant concentration. We conducted numerical analyses based on hourly data from meteorological, traffic and air quality monitoring stations in Wrocław (Poland, Central Europe) from 2015–2017. To determine the optimal delay of each explanatory variable, we used a multi-objective optimization (MO) model. For the concentration of nitrogen oxides, delayed traffic flow, wind speed and sunshine duration turned out to be more important than their current values. We then built two random forest models: an actual model using current values of the explanatory variables and a lag model using delayed variables determined by the MO method. Taking into account variables with an optimal delay (lag model) increases model accuracy for NO2 from R² = 0.51 to 0.56 and for NOx from 0.46 to 0.52. We conclude that, in modeling pollutant concentrations, the possibility of a greater influence of delayed variables should always be considered, because it can significantly increase the accuracy of the model and reveal additional relationships or dependencies.


Introduction
The relationship between air pollutant concentrations and environmental factors is widely studied. Recognizing the impact of these factors quantitatively and qualitatively makes it possible to undertake actions aimed at preventing, reducing or limiting the spread of pollution. Pollution models can support urban managers in taking actions to improve air quality in the city [1][2][3]. The growing population of cities and increasing motorization lead to a rising number of moving vehicles and, consequently, to rising exhaust gas emissions. The expansion and densification of urban buildings reduce city ventilation, so that low wind speeds become less effective at evacuating pollution. Wrocław currently has 641,600 residents [4]. It is estimated that the number of vehicles moving around the city ranges from about 15,000 in rush hours to below 1000 at night. That means that approximately 40,000 vehicles make journeys in the city during one hour [5]. Among the main air pollutants emitted by car combustion engines are nitrogen oxides: NO2 and NOx = NO + NO2. Many different air pollution concentration models exist in the literature, e.g., multidimensional regression models [6][7][8], polynomial functions [9,10], artificial neural networks [11], single random trees [12], random forest (RF) [13][14][15] and boosted regression trees [16,17]. Models that take into account past values of the explanatory variables, in addition to the current ones, have been used mainly to study the impact of pollution concentration on human health and life. Lag variables are then used to account for the duration of exposure to harmful conditions [18,19].
The intensity of chemical reactions in the atmosphere depends on how long certain favorable conditions persist. Therefore, it can be assumed that the current concentration values are significantly affected not only by the current values of the explanatory variables (t), but also by their values at previous moments (t − 1, t − 2, t − 3, …). Classically, this is handled by adding to the predictor set new variables with a delay (lag variables) of 1, 2, 3, … This method has two main disadvantages: first, it is not known how far back the lag variables should be created, and second, creating a set of variables for each delay multiplies the number of explanatory variables, which extends the computation time and degrades the quality of the model and even the possibility of interpreting it. In [20,21], it was proposed to use a multi-objective optimization (MO) algorithm to determine the delay of each predictor that ensures maximum model fit. To be precise, three quantities were optimized for each variable: the power (maximum 3), the delay and the regression coefficient, all chosen to maximize the fit of the model. To assess the influence of variable delay, we used a random forest (RF) algorithm with the lagged variables designated in the MO process (Lag model) and compared it with an RF developed with the original variables without delay (Actual model).
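The classical approach described above, and the growth in the number of predictors it causes, can be illustrated with a minimal sketch. It assumes hourly records held in a pandas DataFrame with no missing hours; the helper name `add_lags`, the column names and the values are hypothetical and do not come from the study data.

```python
import pandas as pd

def add_lags(df: pd.DataFrame, columns, max_lag: int) -> pd.DataFrame:
    """Return a copy of df with extra columns '<name>_lag<k>' for k = 1..max_lag."""
    out = df.copy()
    for col in columns:
        for k in range(1, max_lag + 1):
            # shift(k) moves values k rows down, so row t holds the value from t - k
            out[f"{col}_lag{k}"] = df[col].shift(k)
    # The first max_lag rows have no complete history and are dropped
    return out.dropna().reset_index(drop=True)

# Hypothetical hourly records (values are made up for illustration)
hourly = pd.DataFrame({
    "traffic": [900, 1500, 1400, 1100, 800, 600],
    "wind_speed": [3.0, 2.5, 3.5, 4.0, 3.1, 2.8],
})
lagged = add_lags(hourly, ["traffic", "wind_speed"], max_lag=3)
print(lagged.shape)  # rows x columns after adding lags 1..3
```

With two predictors and three lags the table already grows from 2 to 8 columns, which is the multiplication of explanatory variables that motivates selecting a single delay per predictor instead.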

Data Source
We performed numerical analyses using data from Wrocław (51.086 N, 17.012 E). The data cover the full three years 2015–2017 at hourly intervals. Traffic data were provided by the Traffic and Public Transport Management Department of the Roads and City Maintenance Board in Wrocław. They contain the number of all vehicles passing through the measurement intersection (51.08637 N, 17.01202 E) during each one-hour period. Traffic flow shows a clear, bimodal daily variability [15] with two peaks: in the morning and in the afternoon. Hourly meteorological data were provided by the Institute of Meteorology and Water Management (IMGW) from its only station in Wrocław, located on the outskirts of the city (51.10319 N, 16.89985 E; 9 km from the intersection in a straight line). Temperature shows a clear seasonal variation, characteristic of a transitional climate subject to both oceanic and continental influences. Air pollution data are collected by the Provincial Environment Protection Inspectorate and measured at hourly intervals. The measuring station is located in the direct vicinity of the intersection where traffic is measured (30 m from the middle of the intersection).

Results and Discussion
Using the MO, we determined the function describing the dependence of NO2 and NOx concentration on meteorological factors and traffic flow. We determined the delay, regression coefficient and power (maximum 3) of each variable so as to maximize the fit of the model to real data. Based on 10-fold cross-validation, and selecting the solution most consistent with the interpretation of phenomena occurring in the atmosphere, we obtained linear functions with the delays given in Table 1. Obtaining a linear function indicates that, among the powers considered (up to 3), the relationship is best described as linear rather than of a higher degree. For both NO2 and NOx, traffic flow delayed by one hour has the major influence on the current concentration. This results from the emission and accumulation of air pollutants.
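The idea of choosing one delay per predictor to maximize model fit can be sketched as follows. This is not the MO algorithm of [20,21]: it is a simplified coordinate-wise search that fixes the power at 1, fits the regression coefficients by ordinary least squares, and scores in-sample R² without cross-validation. All function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def lagged_matrix(X, lags, max_lag):
    """Row i corresponds to time t = max_lag + i and holds X[t - lags[j], j]."""
    n = X.shape[0]
    return np.column_stack(
        [X[max_lag - lag : n - lag, j] for j, lag in enumerate(lags)]
    )

def r2_linear(X, y):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def search_lags(X, y, max_lag):
    """Update one predictor's lag at a time, keeping the lag with the best R^2."""
    lags = [0] * X.shape[1]
    target = y[max_lag:]
    for j in range(X.shape[1]):
        scores = []
        for lag in range(max_lag + 1):
            trial = lags.copy()
            trial[j] = lag
            scores.append(r2_linear(lagged_matrix(X, trial, max_lag), target))
        lags[j] = int(np.argmax(scores))
    return lags

# Synthetic data: y responds to column 0 delayed by 1 step and column 1 by 3 steps
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.zeros(500)
y[3:] = 2.0 * X[2:-1, 0] - 1.5 * X[:-3, 1]
y += rng.normal(scale=0.1, size=500)

best = search_lags(X, y, max_lag=4)
print(best)
```

On this synthetic series the search recovers the delays of 1 and 3 steps that were built into the data. Real hourly predictors are autocorrelated, so neighboring lags score similarly and a cross-validated criterion, as used in the study, would be the safer choice.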
Wind speed affects the evacuation of pollution. The stronger the wind, the more intense the evacuation and the lower the pollution concentration. Due to the distance of the meteorological station (9 km in a straight line), the effect of wind speed is delayed by 2–3 h. This is a consequence of the time needed for the air masses to reach the air quality measurement station. In Wrocław, westerly and north-westerly winds prevail, blowing from the meteorological station toward the city center; at an average wind speed of 3.1 m s⁻¹ over a distance of 9 km, the travel time, taking into account the porosity of urban buildings, ranges from 2 to 3 h.
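The travel-time argument can be checked with a back-of-the-envelope calculation. The distance and mean wind speed are taken from the text; the notion of an "effective" transport speed through the built-up area is an assumption introduced here for illustration, not a quantity reported by the authors.

```python
distance_m = 9_000     # straight-line distance between the stations (from the text)
wind_speed_ms = 3.1    # average wind speed in m/s (from the text)

# Travel time if the air moved unobstructed at the mean wind speed
free_flow_h = distance_m / wind_speed_ms / 3600
print(f"Unobstructed travel time: {free_flow_h:.2f} h")

# Effective transport speed implied by the observed 2-3 h delay
effective_speeds = {}
for delay_h in (2, 3):
    effective_speeds[delay_h] = distance_m / (delay_h * 3600)
    share = effective_speeds[delay_h] / wind_speed_ms
    print(f"{delay_h} h delay -> {effective_speeds[delay_h]:.2f} m/s "
          f"({share:.0%} of the mean wind speed)")
```

Unobstructed transport at the mean wind speed would take under one hour, so the 2–3 h delay corresponds to an effective speed of roughly a quarter to two fifths of the measured wind speed, which is the slowdown the text attributes to urban buildings.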
In the next step, we built two types of random forest model for NO2 and NOx: one using actual (current) predictor values and the other using lagged values (predictor values with delay).
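A comparison of this kind can be sketched with scikit-learn. The example below uses synthetic white-noise predictors and a target that, by construction, responds to the predictors from two steps earlier; the variable names, the lag of 2, the chronological 75/25 hold-out and the forest settings are all assumptions for illustration, not the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n, lag = 2000, 2
traffic = rng.normal(size=n)   # synthetic stand-in for traffic flow
wind = rng.normal(size=n)      # synthetic stand-in for wind speed

# Synthetic target at time t, driven by the predictors at time t - lag
target = (1.5 * traffic[:-lag] - 1.0 * wind[:-lag]
          + rng.normal(scale=0.5, size=n - lag))

# "Actual" features are aligned with time t, "lag" features with time t - lag
X_actual = np.column_stack([traffic[lag:], wind[lag:]])
X_lag = np.column_stack([traffic[:-lag], wind[:-lag]])

def holdout_r2(X, y, train_fraction=0.75):
    """Fit on the earlier part of the series and return R^2 on the later part."""
    split = int(len(y) * train_fraction)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[:split], y[:split])
    return model.score(X[split:], y[split:])

r2_actual = holdout_r2(X_actual, target)
r2_lag = holdout_r2(X_lag, target)
print(f"Actual model R^2: {r2_actual:.2f}")
print(f"Lag model R^2:    {r2_lag:.2f}")
```

Because the synthetic predictors carry no autocorrelation, the gap between the two models is far larger here than the improvements reported in the paper; with real hourly data, current and delayed values are correlated and the gain is correspondingly smaller.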
Due to the greater variation in NOx values (the coefficient of variation is 46% for NO2 and 73% for NOx), NOx is more difficult to predict effectively. This is reflected in goodness-of-fit values that are generally lower than for NO2 (Table 2). However, for both pollutants, including lag variables improved the model fit. In general, it can be concluded that determining the optimal delay of environmental variables and including such lag variables as predictors increases the accuracy of the model. The method of determining the optimal delay for each independent variable and using the resulting lag variable in modeling is general and may be applied in any air pollution modeling, regardless of the factors considered and the type of pollution. Detailed conclusions depend on local meteorological, topographical and traffic conditions.
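For reference, the coefficient of variation used above to compare the variability of the two pollutants is the standard deviation divided by the mean. The sketch below uses the sample standard deviation, which is an assumption since the text does not say which estimator was used, and the concentration values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical hourly concentrations (made-up values, not study data)
no2 = [20.0, 30.0, 25.0, 35.0, 40.0]
nox = [20.0, 80.0, 35.0, 120.0, 45.0]

cv_no2 = coefficient_of_variation(no2)
cv_nox = coefficient_of_variation(nox)
print(f"CV NO2: {cv_no2:.1f}%  CV NOx: {cv_nox:.1f}%")
```

A series with a higher coefficient of variation fluctuates more relative to its mean level, which is why a lower goodness of fit is expected for NOx than for NO2.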