A Proportional Odds Model of Particle Pollution

A linear regression model of particle pollution and an ordered logistic regression model of the relevant index were selected for observations in the US city of Los Angeles, California. The models were used to forecast the Air Quality Index (AQI) from a sample, and were compared and contrasted. The methods are comparable overall but differ markedly in their power to predict certain categories. Linear regression models of AQI through particle pollution are favored for predicting moderate air quality; ordered logistic regression models of AQI directly are favored for predicting good air quality.


Introduction
The availability of air pollution statistics has led to the development of different models and techniques to forecast air quality. For example, the literature on models of ozone levels is relatively well developed [1][2][3][4]. More recently, interest in air quality indices has increased [5,6], and more diverse measures of air pollution have been subject to time series analysis [7,8]. Different kinds of regression analyses, such as multiple linear regression, principal component regression, independent component regression, quantile regression, and partial least squares regression, have been used for forecasting daily air quality levels [9]. Vlachogianni, et al. [10] compared forecasts of multiple linear regression to those of artificial neural networks to investigate air quality. Stadlober, et al. [11] used linear regression models to combine information of the present day with meteorological forecasts of the next to predict daily PM10 concentrations, and showed that PM10 forecasting models based on linear regression give suitable results in three European cities. Silva, et al. [12] applied nonparametric procedures to describe and forecast particulate material concentrations.
The research reported here was motivated by interest in regression models of Air Quality Index (AQI) for particle pollution. The focus was small particles or droplets in the air that are 2.5 micrometers in diameter or smaller (PM2.5), emitted directly from forest fires and dust or indirectly from automobiles, industries and power plants. Excessive exposure to small particles could cause major health effects in humans including heart stroke, cancer, problems in pregnancy and many other short and long term health effects. With nearly 4 million residents, high population density and traffic, Los Angeles has some of the most polluted air in the United States (US). Los Angeles remains the worst city in America for ozone concentration, and one of the worst for particulate matter concentrations. It is reported that PM2.5 is responsible for more than 125,000 cancer cases in the US and 16,250 in Los Angeles alone, and causes over 5000 premature deaths per year in the Los Angeles area (Air Quality Management District).
Negative health effects caused by particulate matter have been analyzed in many studies. A large-scale general review can be found in Pope and Dockery [21]. The National Association of Clean Air Agencies reported that PM2.5 is the worst air pollutant because its small size makes it relatively easy to inhale. These particulates also contain heavy metals, solid and liquid chemical elements and toxic organic compounds. It is crucial to develop good prediction and modeling techniques for the concentration of these pollutants in the air.
The Clean Air Act requires the US Environmental Protection Agency (EPA) to set "National Ambient Air Quality Standards for pollutants considered harmful to public health and the environment". These standards, along with the particle pollution data gathered for this study, are provided at EPA.gov. More specifically, the data are small-particle measurements recorded in the US city of Los Angeles (2001-2011), taken from the monitored air quality data of the EPA Air Quality System Data Mart (www.epa.gov/ttn/airs/aqsdatamart/). Early years (2001-2005) were used to fit models of particles recorded; later years (2006-2011) were reserved for out-of-sample comparison and contrast.
The AQI is a simple index for reporting daily air quality. AQI values map to an ordinal scale, that is, one where categories may be ordered but assignment of numerical values would be arbitrary and so theoretically inappropriate. For ordinal data, we limit ourselves to statistical models that do not rely on numerical assignments. Linear regression models of particle pollution were used to generate predictions on a continuous scale, which were then mapped to predictions on the ordinal scale. Ordered logistic regression models were used to generate predictions directly on the ordinal AQI scale: Good, Moderate, Unhealthy for Sensitive Groups (USG), Unhealthy, Very Unhealthy, and Hazardous.
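The mapping from a continuous particle pollution value to an ordinal AQI category can be sketched as a lookup against category upper bounds. The function name and the bounds below are assumptions: the bounds follow the 24-hour PM2.5 breakpoints that EPA adopted in 2012, and the paper does not state which breakpoint table was applied, so the table in force for the data at hand should be substituted.

```python
from bisect import bisect_left

# Ordinal AQI categories, in order of increasing severity.
CATEGORIES = ["Good", "Moderate", "USG", "Unhealthy", "Very Unhealthy", "Hazardous"]

# ASSUMPTION: upper bounds (micrograms per cubic meter, 24-hour PM2.5) for the
# first five categories. These are illustrative and not taken from the paper.
UPPER_BOUNDS = [12.0, 35.4, 55.4, 150.4, 250.4]


def aqi_category(pm25, upper_bounds=UPPER_BOUNDS):
    """Return the AQI category whose range contains a PM2.5 concentration.

    A value equal to an upper bound stays in the lower category; anything
    above the last bound is Hazardous.
    """
    if pm25 < 0:
        raise ValueError("concentration must be non-negative")
    return CATEGORIES[bisect_left(upper_bounds, pm25)]


if __name__ == "__main__":
    for value in (8.0, 12.0, 20.0, 60.0, 300.0):
        print(value, aqi_category(value))
```

Only the ordering of the categories is used after this step; the numeric concentration is discarded once the category is assigned, which is consistent with treating AQI as ordinal.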
As independent variables, we considered only reasonable lags of particle pollution observed today (PT): particle pollution observed yesterday (PD), particle pollution observed exactly one week ago (PW), and particle pollution observed exactly one year ago (PY). In other words, we examined time series models as opposed to econometric ones, which enjoy the benefit of external independent variables.
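Constructing the lagged variables can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name is hypothetical, the input is assumed to be a mapping from calendar date to a daily reading, and "exactly one year ago" is assumed to mean 365 days earlier. Days for which any lag is missing are dropped.

```python
from datetime import date, timedelta

# ASSUMPTION: lags in days for PD (yesterday), PW (one week), PY (one year).
LAG_DAYS = {"PD": 1, "PW": 7, "PY": 365}


def lagged_rows(series, lags=LAG_DAYS):
    """Build (day, PT, PD, PW, PY) rows from a {date: reading} mapping.

    A day is kept only if today's reading and every requested lag exist.
    """
    rows = []
    for day in sorted(series):
        lagged = [series.get(day - timedelta(days=n)) for n in lags.values()]
        if all(value is not None for value in lagged):
            rows.append((day, series[day], *lagged))
    return rows


if __name__ == "__main__":
    start = date(2001, 1, 1)
    # Synthetic readings, for illustration only.
    readings = {start + timedelta(days=i): float(i) for i in range(400)}
    for row in lagged_rows(readings)[:3]:
        print(row)
```

Because a row needs every lag to be present, a model with more lagged variables retains fewer usable days than a model with fewer.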
The rest of the paper is structured as follows: Section 2 presents in-sample results of linear regression. Section 3 presents in-sample results of ordered logistic regression. Comparison and contrast are included in Section 4. Discussion of conclusions and future work is featured in Section 5. Tests of statistical significance are based on α = 0.10.

Linear Regression
In this study, we assume that particle pollution observed today (PT) is normally distributed with constant variance. The mean, however, is assumed to be a linear function of lagged observations of the response: particle pollution observed yesterday (PD), particle pollution observed exactly one week ago (PW), and particle pollution observed exactly one year ago (PY).
$$\mathrm{PT} \sim \mathrm{normal}(\mu, \sigma), \qquad \mu = f(\mathrm{PD}, \mathrm{PW}, \mathrm{PY}), \quad f \text{ linear}$$

The important question at first is whether or not the coefficients (on the independent variables) are generally different from zero. Expectations based on the main effects model are given by the least squares fit (fit to particles recorded in years 2001-2005):

$$\mathrm{PT} = 0.6891171\,\mathrm{PD} + 0.0511629\,\mathrm{PW} + 0.0335756\,\mathrm{PY} + 4.601193$$

However, we fail to reject the hypothesis that $\beta(\mathrm{PY}) = 0$, so we fit the full second order model to investigate interaction. No significant interaction in the full second order model includes particle pollution observed exactly one year ago (PY), so we drop it and re-estimate the function of main effects:

$$\mathrm{PT} = 0.6914947\,\mathrm{PD} + 0.0377937\,\mathrm{PW} + 5.483561$$

This model explains $R^2 = 48.37\%$ of the variation in particles recorded in years 2001-2005.
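As a worked illustration, the re-estimated main effects equation can be evaluated directly. The coefficients are those reported above; the function name and the input values are hypothetical and chosen only to show the arithmetic.

```python
# Coefficients of the re-estimated main effects model reported in the text
# (fit to 2001-2005): PT = 0.6914947*PD + 0.0377937*PW + 5.483561
COEF_PD = 0.6914947
COEF_PW = 0.0377937
INTERCEPT = 5.483561


def predict_pt(pd_value, pw_value):
    """Expected particle pollution today given yesterday's (PD) and last
    week's (PW) observations, under the fitted linear model."""
    return COEF_PD * pd_value + COEF_PW * pw_value + INTERCEPT


if __name__ == "__main__":
    # Hypothetical inputs: 20.0 observed yesterday, 15.0 observed a week ago.
    print(round(predict_pt(20.0, 15.0), 4))  # 19.8804
```

The coefficient on PD dominates, so the prediction is driven mostly by yesterday's observation and only slightly by the observation one week earlier.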

Ordered Logistic Regression
In ordered logistic regression, a direct generalization of logistic regression, we estimate by maximum likelihood an underlying score as a linear function of the independent variables, together with a set of cut-points. The probability of observing an outcome is the probability that the estimated linear function, plus a random error u, falls within the outcome's cut-point range. We estimate the coefficients β together with the cut-points k, where u is logistically distributed. Outcomes are Good, Moderate, Unhealthy for Sensitive Groups (USG), Unhealthy, Very Unhealthy, and Hazardous. Coefficients of the main effects ordered logistic regression model correspond to particle pollution observed yesterday (PD), particle pollution observed exactly one week ago (PW), and particle pollution observed exactly one year ago (PY). Only the coefficient on PD is significantly different from zero for years 2001-2005, so again we fit the full second order model to investigate interaction. No significant interaction includes PW, and none includes PY, so they are dropped from the main effects model, which is re-estimated with PD as the lone independent variable. We report a "pseudo" $R^2 = 23.75\%$, computed with the formula $1 - L_1/L_0$, where $L_0$ is the constant-only log-likelihood and $L_1 = -1063.8966$ is that of the model under consideration.
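The category probabilities implied by an ordered logistic model can be sketched as follows, using the standard form in which the probability of category i is F(k_i − score) − F(k_{i−1} − score), with F the logistic distribution function. The paper does not report its estimated coefficient or cut-points, so the values in the example are hypothetical placeholders, as are the function names. The pseudo R² helper implements the formula 1 − L1/L0 quoted above.

```python
import math


def logistic_cdf(z):
    """Logistic distribution function, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def category_probabilities(score, cutpoints):
    """P(outcome = i) for each ordered category, given the linear score
    and increasing cut-points (one fewer than the number of categories)."""
    cumulative = [logistic_cdf(k - score) for k in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


def most_likely(probs, labels):
    """Label of the category with the highest probability."""
    return labels[max(range(len(probs)), key=probs.__getitem__)]


def pseudo_r2(loglik_model, loglik_constant_only):
    """Pseudo R^2 = 1 - L1/L0."""
    return 1.0 - loglik_model / loglik_constant_only


if __name__ == "__main__":
    labels = ["Good", "Moderate", "USG", "Unhealthy", "Very Unhealthy", "Hazardous"]
    # HYPOTHETICAL coefficient and cut-points, not estimates from the paper.
    beta_pd = 0.1
    cutpoints = [2.0, 5.0, 7.0, 9.0, 11.0]
    probs = category_probabilities(beta_pd * 25.0, cutpoints)
    print([round(p, 3) for p in probs], most_likely(probs, labels))
```

Predicting the most likely category, rather than rounding an expected value, keeps the prediction on the ordinal scale without assigning numbers to the categories.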

Comparison and Contrast
To gain some relative insight into the power of linear and ordered logistic regression for particle pollution, we evaluated the models out of sample (2006-2011). Expected values based on linear regression were mapped to the ordinal scale; for ordered logistic regression, the expected outcome was the most likely one according to the predicted probabilities. Results of linear regression are in Table 1; those of ordered logistic regression are in Table 2. Total observations for ordered logistic regression are greater because fewer independent variables meant fewer missing data.
To summarize, Table 3 reports the power to predict each outcome for linear (REGRESS) and ordered logistic (OLOGIT) regression.
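One way to compute a per-category summary of this kind from paired observed and expected outcomes is sketched below. The definition used is an assumption, since the text does not define it: the power to predict a category is taken as the share of days observed in that category for which the model predicted the same category. The function name and the sample data are hypothetical.

```python
from collections import Counter


def power_to_predict(observed, predicted):
    """Share of observed days in each category that were predicted correctly.

    ASSUMPTION: this per-category hit rate stands in for the paper's
    'power to predict'; the source does not give its exact definition.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    totals = Counter(observed)
    hits = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {category: hits[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Made-up outcomes, for illustration only.
    observed = ["Good", "Good", "Moderate", "Moderate", "Moderate", "USG"]
    predicted = ["Good", "Moderate", "Moderate", "Moderate", "Good", "Moderate"]
    print(power_to_predict(observed, predicted))
```

Reporting the rate per category, rather than one overall accuracy, is what allows two methods to look comparable overall while differing sharply on individual categories.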

Table 1. Observed versus expected outcomes based on linear regression (out of sample).

Table 2. Observed versus expected outcomes based on ordered logistic regression (out of sample).

Table 3. The power to predict.