
A Machine Learning Approach for Air Quality Prediction: Model Regularization and Optimization

Department of Computer Science, University of Iowa, Iowa City, IA 52242, USA
Department of Occupational and Environmental Health, University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA
Department of Management Sciences, University of Iowa, Iowa City, IA 52242, USA
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2018, 2(1), 5;
Received: 28 December 2017 / Revised: 16 February 2018 / Accepted: 19 February 2018 / Published: 24 February 2018
(This article belongs to the Special Issue Learning with Big Data: Scalable Algorithms and Novel Applications)


In this paper, we tackle air quality forecasting by using machine learning approaches to predict the hourly concentration of air pollutants (e.g., ozone, particulate matter (PM2.5) and sulfur dioxide). Machine learning, as one of the most popular techniques, is able to efficiently train a model on big data by using large-scale optimization algorithms. Although there exist some works applying machine learning to air quality prediction, most of the prior studies are restricted to a few years of data and simply train standard regression models (linear or nonlinear) to predict the hourly air pollution concentration. In this work, we propose refined models that predict the hourly air pollution concentration on the basis of the meteorological data of previous days by formulating the prediction over 24 h as a multi-task learning (MTL) problem. This enables us to select a good model with different regularization techniques. We propose a useful regularization that enforces the prediction models of consecutive hours to be close to each other, and we compare it with several typical regularizations for MTL, including standard Frobenius norm regularization, nuclear norm regularization, and $\ell_{2,1}$-norm regularization. Our experiments show that the proposed parameter-reducing formulations and consecutive-hour-related regularizations achieve better performance than existing standard regression models and existing regularizations.

1. Introduction

Adverse health impacts from exposure to outdoor air pollutants are complicated functions of pollutant compositions and concentrations [1]. Major outdoor air pollutants in cities include ozone (O3), particulate matter (PM), sulfur dioxide (SO2), carbon monoxide (CO), nitrogen oxides (NOx), volatile organic compounds (VOCs), pesticides, and metals, among others [2,3]. Increased mortality and morbidity rates have been found in association with increased concentrations of air pollutants (such as O3, PM and SO2) [3,4,5]. According to a report from the American Lung Association [6], a 10 parts per billion (ppb) increase in the O3 mixing ratio might cause over 3700 premature deaths annually in the United States (U.S.). Chicago, like many other megacities in the U.S., has struggled with air pollution as a result of industrialization and urbanization. Although emissions of O3 precursors (such as VOCs, NOx, and CO) have decreased significantly since the late 1970s, O3 levels in Chicago have not been in compliance with the standards set by the Environmental Protection Agency (EPA) to protect public health [7]. Particle size is critical in determining where particles deposit in the human respiratory system [8]. PM2.5, referring to particles with a diameter less than or equal to 2.5 μm, has been an increasing concern, as these particles can be deposited in the lung's gas-exchange region, the alveoli [9]. The U.S. EPA revised the annual PM2.5 standard by lowering the concentration limit to 12 μg/m³ to provide improved protection against health effects associated with long- and short-term exposure [10]. SO2, an important precursor of new particle formation and particle growth, has also been found to be associated with respiratory diseases in many countries [11,12,13,14,15]. Therefore, we selected O3, PM2.5 and SO2 for testing in this study.
Meteorological conditions, including regional and synoptic meteorology, are critical in determining air pollutant concentrations [16,17,18,19,20,21]. According to the study by Holloway et al. [22], the O3 concentration over Chicago was found to be most sensitive to air temperature, wind speed and direction, relative humidity, incoming solar radiation, and cloud cover. For example, lower ambient temperature and incoming solar radiation slow down photochemical reactions and lead to less secondary air pollution, such as O3 [23]. Increasing wind speed can either increase or decrease air pollutant concentrations. For instance, when the wind speed was low (weak dispersion/ventilation), the pollutants associated with traffic were found at the highest concentrations [24,25]. However, strong winds might form dust storms by blowing up particles from the ground [26]. High humidity is usually associated with high concentrations of certain air pollutants (such as PM, CO and SO2) but with low concentrations of other air pollutants (such as NO2 and O3), because of their various formation and removal mechanisms [25]. In addition, high humidity can be an indicator of precipitation events, which result in strong wet deposition leading to low concentrations of air pollutants [27]. Because various particle compositions and their interactions with light were found to be the most important factors in attenuating visibility [28,29], low visibility can be an indicator of high PM concentrations. Clouds can scatter and absorb solar radiation, which is significant for the formation of some air pollutants (e.g., O3) [23,30]. Therefore, these important meteorological variables were selected to predict air pollutant concentrations in this study.
Statistical models have been applied for air pollution prediction on the basis of meteorological data [31,32,33,34,35]. However, existing studies on statistical modeling have mostly been restricted to simply utilizing standard classification or regression models, which have neglected the nature of the problem itself or ignored the correlation between sub-models in different time slots. On the other hand, machine learning approaches have been developing for over 60 years and have achieved tremendous success in a variety of areas [36,37,38,39,40,41]. There exist various new tools and techniques invented by the machine learning community, which allow for more refined modeling of a specific problem. In particular, model regularization is a fundamental technique for improving the generalization performance of a predictive model. Accordingly, many efficient optimization algorithms have been developed for solving various machine learning formulations with different regularizations.
In this study, we focus on refined modeling for predicting hourly air pollutant concentrations on the basis of historical meteorological data and air pollution data. A striking difference between this work and previous works is that we emphasize how to regularize the model in order to improve its generalization performance and how to learn a complex regularized model from big data with advanced optimization algorithms. We collected 10 years' worth of meteorological and air pollution data from the Chicago area. The air pollutant data was from the EPA [42,43], and the meteorological data was from MesoWest [44]. From their databases, we fetched consecutive hourly measurements of various meteorological variables and pollutants reported by two weather stations and two air pollutant monitoring sites in the Chicago area. Each record of hourly measurements included meteorological variables such as solar radiation, wind direction and speed, temperature, and atmospheric pressure, as well as air pollutants, including PM2.5, O3, and SO2. We used two methods for model regularization: (i) explicitly controlling the number of parameters in the model; (ii) explicitly enforcing a certain structure on the model parameters. For controlling the number of parameters, we compared three different model formulations, which can be considered in a unified multi-task learning (MTL) framework with a diagonal- or full-matrix model. For enforcing a certain structure on the model matrix, we considered the relationship between the prediction models of different hours and compared three different regularizations with standard Frobenius norm regularization. The experimental results show that the model with the intermediate size and the proposed regularization, which enforces the prediction models of two consecutive hours to be close, achieved the best results and was far better than standard regression models.
We have also developed efficient optimization algorithms for solving different formulations and demonstrated their effectiveness through experiments.
The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe the data collection and preprocessing. In Section 4, we describe the proposed solutions, including formulations, regularizations and optimizations. In Section 5, we present the experimental studies and the results. In Section 6, we give conclusions and indicate future work.

2. Related Work

Many previous works have applied machine learning algorithms to air quality prediction. Some researchers have aimed to predict targets as discretized levels. Kalapanidas et al. [32] studied the effects on air pollution of meteorological features alone, such as temperature, wind, precipitation, solar radiation, and humidity, and classified air pollution into different levels (low, med, high, and alarm) by using a lazy learning approach, the case-based reasoning (CBR) system. Athanasiadis et al. [45] employed the σ-fuzzy lattice neurocomputing classifier to predict and categorize O3 concentrations into three levels (low, mid, and high) on the basis of meteorological features and other pollutants such as SO2, NO, and NO2. Kurt and Oktay [33] modeled geographic connections in a neural network model and predicted daily concentration levels of SO2, CO, and PM10 three days in advance. However, converting regression tasks to classification tasks is problematic, as it ignores the magnitude of the numeric data and is consequently inaccurate.
Other researchers have worked on predicting the concentrations of pollutants. Corani [46] trained neural network models to predict hourly O3 and PM10 concentrations on the basis of data from the previous day, mainly comparing the performance of feed-forward neural networks (FFNNs) and pruned neural networks (PNNs). Further efforts have been made on FFNNs: Fu et al. [47] applied a rolling mechanism and a gray model to improve traditional FFNN models. Jiang et al. [48] explored multiple models (a physical and chemical model, a regression model, and a multilayer perceptron) on the air pollutant prediction task, and their results show that statistical models are competitive with the classical physical and chemical models. Ni et al. [49] compared multiple statistical models on the basis of PM2.5 data around Beijing, and their results implied that linear regression models can, in some cases, be better than the other models.
MTL focuses on learning multiple tasks that have commonalities [50], which can improve the efficiency and accuracy of the models. It has achieved tremendous success in many fields, such as natural language processing [37], image recognition [38], bioinformatics [39,40], marketing prediction [41], and so on. A variety of regularizations can be utilized to enhance the commonalities of the related tasks, including the $\ell_{2,1}$-norm [51], nuclear norm [52], spectral norm [53], Frobenius norm [54], and so on. However, most prior machine learning works on air pollutant prediction did not consider the similarities between the models and only focused on improving the model performance for a single task, that is, improving the prediction performance for each hour either separately or identically.
Therefore, we decided to use meteorological and pollutant data to perform predictions of hourly concentrations on the basis of linear models. In this work, we focused on three different prediction model formulations and used the MTL framework with different regularizations. To the best of our knowledge, this is the first work that has utilized MTL for the air pollutant prediction task. We exploited analytical approaches and optimization techniques to obtain the optimal solutions. The model’s evaluation metric was the root-mean-squared error (RMSE).
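The RMSE metric mentioned above is straightforward to compute; a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error over all hourly predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```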

3. Data Collection and Preprocessing

3.1. Data Collection

We collected air pollutant data from two air quality monitoring sites and meteorological data from two weather stations from 2006 to 2015 (summarized in Table 1). The air pollutant data in this study included the concentrations of O3, PM2.5 and SO2. We downloaded the air pollutant data from the U.S. EPA's Air Quality System (AQS) database, which has been widely used for model evaluation [42,43]. We selected the meteorological variables that would affect the air pollutant concentrations, including air temperature, relative humidity, wind speed and direction, wind gust, precipitation accumulation, visibility, dew point, wind cardinal direction, pressure, and weather conditions. We downloaded the meteorological data from MesoWest, a project within the Department of Meteorology at the University of Utah, which has been aggregating meteorological data since 2002 [44].
The locations of the two air quality monitoring sites and two weather stations are shown in Figure 1. The Alsip Village (AV) air quality monitoring site is located in a suburban residential area in southern Cook County, Illinois (AQS ID: 17-031-0001; latitude/longitude: 41.670992/−87.732457). The Lemont Village (LV) air quality monitoring site is located in a suburban residential area in southwestern Cook County, Illinois (AQS ID: 17-031-1601; latitude/longitude: 41.66812/−87.99057). The weather station situated at Lansing Municipal Airport (LMA) is the closest meteorological site (MesoWest ID: KIGQ; latitude/longitude: 41.54125/−87.52822) to the AV air quality monitoring site. The weather station positioned at Lewis University (LU) is the closest meteorological site (MesoWest ID: KLOT; latitude/longitude: 41.60307/−88.10164) to the LV air quality monitoring site.

3.2. Preprocessing

We paired the collected meteorological data and air pollutant data by time to obtain the required data format for applying the machine learning methods. In particular, for each variable, we formed one value for each hour. However, the original data may have contained multiple records or missing values for some hours. To preprocess the data, we calculated the hourly mean of each numeric variable when there were multiple observed records within an hour and chose the category with the highest frequency per hour for each categorical variable when there were multiple values. Missing values existed for some variables, which the machine learning methods used in this study cannot handle. Therefore, we imputed the missing values by using the closest-neighbor values for four continuous variables and one categorical variable: wind gust, pressure, altimeter reading, precipitation, and weather conditions. We deleted the days that still had missing values after imputing. We applied dummy coding for two categorical variables, the cardinal wind direction (16 values, e.g., N, S, E, W, etc.) and weather conditions (31 values, e.g., sunny, rainy, windy, etc.). Then, we added weekday and weekend indicators as two Boolean features. Finally, we obtained 60 features in total (9 numerical meteorological features, 16 dummy codings for wind direction, 31 dummy codings for weather conditions, 2 Boolean features for weekday/weekend, 1 numerical feature for pollutants, and 1 bias term). We applied normalization to all the features and pollutant targets to make their values fall in the range [0, 1].
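As a minimal illustration, the preprocessing steps above (hourly aggregation by mean/mode, nearest-neighbor imputation, dummy coding, weekday/weekend indicators, and min–max normalization) might be sketched in pandas as follows; the column names and helper are hypothetical, not the authors' code:

```python
import pandas as pd

def preprocess(df):
    """Hypothetical preprocessing sketch; df is indexed by timestamp, with
    numeric columns (e.g. 'temp') and categorical columns (e.g. 'wind_dir')."""
    numeric = df.select_dtypes("number")
    categorical = df.select_dtypes(exclude="number")

    # One value per hour: mean for numeric, most frequent for categorical.
    hourly_num = numeric.resample("h").mean()
    hourly_cat = categorical.resample("h").agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else None
    )
    hourly = pd.concat([hourly_num, hourly_cat], axis=1)

    # Impute remaining gaps with the closest neighboring hour.
    hourly = hourly.ffill().bfill()

    # Dummy-code categorical variables; add weekday/weekend Boolean features.
    hourly = pd.get_dummies(hourly, columns=list(hourly_cat.columns))
    hourly["weekday"] = hourly.index.dayofweek < 5
    hourly["weekend"] = ~hourly["weekday"]

    # Min-max normalization of numeric features into [0, 1].
    num_cols = hourly_num.columns
    rng = hourly[num_cols].max() - hourly[num_cols].min()
    hourly[num_cols] = (hourly[num_cols] - hourly[num_cols].min()) / rng.replace(0, 1)
    return hourly
```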

4. Machine Learning Approaches for Air Pollution Prediction

In this section, we describe the proposed approaches for predicting the ambient concentration of air pollutants.

4.1. A General Formulation

Our goal is to predict the concentration of air pollutants on the next day on the basis of historical meteorological and air pollutant data. In this work, we have focused on using the previous day's data to predict the next day's hourly pollutant concentrations. In particular, we let $(x_i; y_i)$ denote the $i$th training example, where $y_i \in \mathbb{R}^{24\times 1}$ denotes the hourly concentration of a certain air pollutant on a day, and $x_i = (u_i; v_i)$ denotes the observed data on the previous day, which includes two components, where a semicolon ";" represents the column layout. The first component $u_i = (u_{i,1}; \dots; u_{i,D}) \in \mathbb{R}^{24D \times 1}$ includes all meteorological data over the 24 h of the previous day, where $u_{i,j} \in \mathbb{R}^{24\times 1}$ denotes the $j$th meteorological feature over the 24 h and $D$ is the number of meteorological features; the second component $v_i \in \mathbb{R}^{24\times 1}$ includes the hourly concentration of the same air pollutant on the previous day. The general formulation can be expressed as
$$\min_W \frac{1}{n}\sum_{i=1}^{n} \| f(W, x_i) - y_i \|_2^2 + \varphi(W)$$
where $W$ denotes the parameters of the model, $f(W, x_i)$ denotes the prediction of the air pollutant concentration, and $\varphi(\cdot)$ denotes a regularization function of the model parameters $W$.
Next, we introduce two levels of model regularization. The first level is to explicitly control the number of model parameters. The second level is to explicitly impose a certain regularization on the model parameter. For the first level, we consider three models that are described below:
  • Baseline Model. The first model is a baseline model that has been considered in existing studies and has the fewest parameters. In particular, the prediction of the air pollutant concentration is given by
    $$f_k(W, x_i) = \sum_{j=1}^{D} e_k^\top u_{i,j}\, w_j + e_k^\top v_i\, w_{D+1} + w_0, \quad k = 1, \dots, 24$$
    where $e_k \in \mathbb{R}^{24\times 1}$ is a basis vector with 1 at only the $k$th position and 0 elsewhere; $w_0, w_1, \dots, w_D, w_{D+1} \in \mathbb{R}$ are the model parameters, where $w_0$ is the bias term. We denote this model by $W = (w_0, w_1, \dots, w_{D+1})$. It is notable that this model predicts the hourly concentration on the basis of the same hour's historical data from the previous day and that it has $D + 2$ parameters. This simple model assumes that all 24 h share the same model parameters.
  • Heavy Model. The second model takes all the data of the previous day into account when predicting the concentration at every hour of the second day. In particular, for the $k$th hour, the prediction is given by
    $$f_k(W, x_i) = \sum_{j=1}^{D} u_{i,j}^\top w_{k,j} + v_i^\top w_{k,D+1} + w_{k,0}, \quad k = 1, \dots, 24$$
    where $w_{k,j} \in \mathbb{R}^{24\times 1}$, $j = 1, \dots, D+1$, and $w_{k,0} \in \mathbb{R}$. This model is defined by
    $$W = \begin{pmatrix} w_{1,0} & w_{2,0} & \cdots & w_{24,0} \\ w_{1,1} & w_{2,1} & \cdots & w_{24,1} \\ \vdots & \vdots & & \vdots \\ w_{1,D+1} & w_{2,D+1} & \cdots & w_{24,D+1} \end{pmatrix}$$
    We note that each column of $W$ corresponds to the prediction model for one hour. There are a total of $24 \times (24 \times (D+1) + 1)$ parameters. It is notable that the baseline model is a special case obtained by enforcing all columns of $W$ to be the same and each $w_{k,j}$ to have only one non-zero element, at the $k$th position.
  • Light Model. The third model sits between the baseline model and the heavy model. It considers the 24 h pattern of the air pollutant on the previous day and the same hour's meteorological data from the previous day to predict the concentration at a particular hour. The prediction is given by
    $$f_k(W, x_i) = \sum_{j=1}^{D} e_k^\top u_{i,j}\, w_{k,j} + v_i^\top w_{k,D+1} + w_{k,0}, \quad k = 1, \dots, 24$$
    where $w_{k,j} \in \mathbb{R}$, $j = 1, \dots, D$, and $w_{k,D+1} \in \mathbb{R}^{24\times 1}$. This model is defined by
    $$W = \begin{pmatrix} w_{1,0} & w_{2,0} & \cdots & w_{24,0} \\ w_{1,1} & w_{2,1} & \cdots & w_{24,1} \\ \vdots & \vdots & & \vdots \\ w_{1,D+1} & w_{2,D+1} & \cdots & w_{24,D+1} \end{pmatrix}$$
    It is also notable that each column corresponds to the predictive model for one hour and that $W$ has a total of $24 \times (D+1) + 24 \times 24$ parameters.
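To make the three formulations concrete, the following NumPy sketch computes the hour-$k$ prediction under each model for a single example; the shapes, random data, and function names are our own illustration, not the authors' code:

```python
import numpy as np

D, H = 9, 24  # hypothetical: 9 meteorological features, 24 hours

gen = np.random.default_rng(0)
u = gen.random((D, H))   # u[j, t]: jth meteorological feature at hour t (previous day)
v = gen.random(H)        # hourly pollutant concentration on the previous day

def baseline_predict(w, w_poll, w0, k):
    """Baseline: parameters shared across hours; uses only hour-k data."""
    return u[:, k] @ w + v[k] * w_poll + w0          # w: (D,)

def heavy_predict(Wm, Wp, b, k):
    """Heavy: per-hour parameters over ALL 24 h of every feature."""
    return np.sum(u * Wm[k]) + v @ Wp[k] + b[k]      # Wm: (H, D, H), Wp: (H, H), b: (H,)

def light_predict(Wm, Wp, b, k):
    """Light: per-hour scalar weights for hour-k meteorology, full 24 h pollutant pattern."""
    return u[:, k] @ Wm[k] + v @ Wp[k] + b[k]        # Wm: (H, D), Wp: (H, H), b: (H,)
```

The parameter counts of the heavy and light parameterizations match the totals stated in the text: $24 \times (24 \times (D+1) + 1)$ and $24 \times (D+1) + 24 \times 24$, respectively.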

4.2. Regularization of Model Parameters

In this section, we describe different regularizations for the model parameter matrices W in the heavy and light models. We consider the problem using MTL, in which predicting the concentration of air pollutants over one hour is one task. In the literature, a number of regularizations have been proposed by considering the relationship between different tasks. We first describe three baseline regularizations in the literature and then present the proposed regularization that takes the dimension of time into consideration for modeling the relationship between models at different times.
  • Frobenius norm regularization. Frobenius norm regularization is the generalization of standard Euclidean norm regularization to the matrix case, for which
    $$\varphi(W) = \lambda \|W\|_F^2$$
    where $\lambda > 0$ is a regularization parameter.
  • $\ell_{2,1}$-norm regularization. $\ell_{2,1}$-norm regularization has been used for feature selection in MTL. The norm is formed by first computing the $\ell_2$-norm of each row of the matrix $W$ (across different tasks) and then computing the $\ell_1$-norm of the resulting vector. In particular, for $W \in \mathbb{R}^{d \times K}$,
    $$\|W\|_{2,1} = \sum_{j=1}^{d} \|W_{j,\cdot}\|_2$$
    where $W_{j,\cdot}$ denotes the $j$th row of $W$. We consider an $\ell_{2,1}$-norm regularizer $\varphi(W) = \lambda \|W\|_{2,1}$.
  • Nuclear norm regularization. The nuclear norm is defined as the sum of the singular values of a matrix and is a standard regularization for enforcing a matrix to have low rank. The motivation for using a low-rank matrix is that the models for consecutive hours are highly correlated, which could render the matrix $W$ low rank. We denote by $\|W\|_*$ the nuclear norm of a matrix $W$; the regularization is $\varphi(W) = \lambda \|W\|_*$.
  • Consecutive close (CC) regularization. Finally, we propose a useful regularization for the considered problem that explicitly enforces the predictive models for two consecutive hours to be close to each other. The intuition is that the concentrations of air pollutants at two consecutive hours are usually close to each other. We denote the model by $W = (w_1, \dots, w_K)$ and define $\mathrm{Cons}(W) = [(w_1 - w_2), (w_2 - w_3), \dots, (w_{K-1} - w_K)]$. The CC regularization is given by
    $$\varphi(W) = \lambda \sum_{j=1}^{K-1} \|w_j - w_{j+1}\|_p^p$$
    where $p = 1$ or $p = 2$.
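The four regularizers above can be evaluated directly in NumPy; a minimal sketch (function names ours) that, matching the CC definition, treats the columns of $W$ as the hourly models:

```python
import numpy as np

def frobenius_reg(W, lam):
    return lam * np.sum(W ** 2)                       # lambda * ||W||_F^2

def l21_reg(W, lam):
    return lam * np.sum(np.linalg.norm(W, axis=1))    # row-wise l2-norms, then l1-norm

def nuclear_reg(W, lam):
    return lam * np.sum(np.linalg.svd(W, compute_uv=False))  # sum of singular values

def cc_reg(W, lam, p=2):
    diffs = W[:, :-1] - W[:, 1:]                      # w_j - w_{j+1}, columns = hourly models
    return lam * np.sum(np.abs(diffs) ** p)           # lambda * sum_j ||w_j - w_{j+1}||_p^p
```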

4.3. Stochastic Optimization Algorithms for Different Formulations

With the exception of the Frobenius norm regularized model (with or without $\ell_2$-norm CC regularization), which has a closed-form solution, we solved the other models via advanced stochastic optimization techniques. We denote $F(W, x_i) = [f_1(W, x_i), \dots, f_{24}(W, x_i)]$ and $Y_i = [y_{i,1}, \dots, y_{i,24}]$; the total number of features is $D$. Although the standard stochastic (sub)gradient method [55] could be utilized to solve all the formulations considered in this work, it does not necessarily yield the fastest convergence. To address this issue, we considered advanced stochastic optimization techniques tailored to each formulation.

4.3.1. Optimizing the $\ell_{2,1}$-Norm Regularized Model

We utilized the accelerated stochastic subgradient (ASSG) method [56] with proximal mapping to optimize this model. The algorithm runs in multiple stages, and each stage calls the standard stochastic gradient method with a constant step size. To handle the non-smooth $\ell_{2,1}$-norm, we used proximal mapping [57]. The stochastic gradient descent step is
$$\hat W_t = W_{t-1} - 2\eta_s\, \frac{\partial F(W_{t-1}, x_i)}{\partial W_{t-1}}\, e\, (F(W_{t-1}, x_i) - Y_i)$$
where $\eta_s$ is the stage-wise step size, $i$ is a sampled index, and $e$ is a vector with 1 for all its elements. The subsequent proximal mapping is as follows (denoting $\tilde\lambda = 2\eta_s\lambda$):
$$W_t = \arg\min_W \|W - \hat W_t\|_F^2 + \tilde\lambda \|W\|_{2,1}$$
The above problem has an analytical solution. Denoting by $w_i$ a column of $\hat W_t$ and by $w_i'$ the corresponding column of $W_t$, the solution can be computed by the following [51]:
$$w_i' = \begin{cases} \left(1 - \dfrac{\tilde\lambda}{\|w_i\|_2}\right) w_i, & \tilde\lambda > 0,\ \|w_i\|_2 > \tilde\lambda \\ 0, & \tilde\lambda > 0,\ \|w_i\|_2 \le \tilde\lambda \\ w_i, & \tilde\lambda = 0 \end{cases}$$
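The closed-form solution above amounts to column-wise soft-thresholding; a small NumPy sketch (implementing the stated formula verbatim; the function name is ours):

```python
import numpy as np

def prox_l21(W, lam_tilde):
    """Column-wise soft-thresholding, as in the closed form above:
    shrink a column toward zero by lam_tilde / ||column||, or zero it out
    entirely when its l2-norm is at most lam_tilde."""
    out = W.astype(float).copy()
    for i in range(W.shape[1]):
        norm = np.linalg.norm(W[:, i])
        if lam_tilde > 0:
            out[:, i] = (1 - lam_tilde / norm) * W[:, i] if norm > lam_tilde else 0.0
    return out
```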
The pseudocode of the algorithm is as follows:
Algorithm 1: ASSG method with proximal mapping for solving the $\ell_{2,1}$-norm regularized model.
  Input: X, Y, W 0 , η 0 , S, and T

4.3.2. Optimizing Nuclear Norm Regularized Model

The challenge in solving the nuclear norm regularized problem with most optimization algorithms lies in computing the full singular value decomposition (SVD) of the matrix $W$, which is an expensive operation. To avoid the full SVD, the SVD-free convex–concave algorithm extended to the stochastic setting (SECONE-S) [58] was employed to solve the problem. The algorithm solves the following min–max problem:
$$\min_{W \in \mathbb{R}^{D \times K}} \max_{U \in \mathbb{R}^{D \times K}} \frac{1}{n} \sum_{i=1}^{n} \|F(W, x_i) - Y_i\|_2^2 + \lambda\, \mathrm{tr}(U^\top W) - \rho\, [\|U\|_2 - 1]_+$$
Then stochastic gradient descent and ascent are used to update $W$ and $U$ at each iteration:
$$W_t = W_{t-1} - \eta_{t-1} \left( 2\, \frac{\partial F(W_{t-1}, x_i)}{\partial W_{t-1}}\, e\, (F(W_{t-1}, x_i) - Y_i) + \lambda U_{t-1} \right)$$
$$U_t = U_{t-1} + \tau_{t-1} \left( \lambda W_{t-1} - \rho\, \partial [\|U_{t-1}\|_2 - 1]_+ \right)$$
where $\rho \ge \|Y\|_F^2$ and the subgradient $\partial [\|U_t\|_2 - 1]_+$ can be computed as $u_1 v_1^\top \mathbb{1}[\sigma_1 > 1]$, with $(u_1, v_1)$ being the top left and right singular vectors of $U_t$ and $\sigma_1$ being the top singular value. The pseudocode for the algorithm is as follows:
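The subgradient term $u_1 v_1^\top \mathbb{1}[\sigma_1 > 1]$ can be sketched as follows. For clarity this sketch uses a full SVD, even though SECONE-S is motivated by avoiding exactly that; in practice, a power iteration for the top singular pair would be used instead (the function name is ours):

```python
import numpy as np

def spectral_subgrad(U):
    """Subgradient of [||U||_2 - 1]_+ : the rank-one matrix u1 v1^T when the
    top singular value exceeds 1, and the zero matrix otherwise."""
    u, s, vt = np.linalg.svd(U, full_matrices=False)
    if s[0] > 1:
        return np.outer(u[:, 0], vt[0])
    return np.zeros_like(U)
```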
  Algorithm 2: SECONE-S solving nuclear norm regularized model.
Input: X, Y, T, η 0 , and τ 0

4.3.3. Optimizing Consecutive Close Regularized Model

The challenge in tackling the proposed CC regularization is that its standard proximal mapping cannot be computed efficiently. We addressed this challenge by using the alternating direction method of multipliers (ADMM). We utilized a recently proposed locally adaptive stochastic ADMM (LA-SADMM) [59] to solve the CC regularized model. Below, we discuss the updates for the choice of $p = 1$ (i.e., using the $\ell_1$-norm) in Equation (2). The updates for the choice of $p = 2$ can be derived similarly.
The objective function can be written as
$$\min_{W \in \mathbb{R}^{D \times K}} \frac{1}{n} \sum_{i=1}^{n} \|F(W, x_i) - Y_i\|_2^2 + \lambda \|W E\|_{1,1}$$
Here, $E = (\hat e_1, \dots, \hat e_{K-1})$, where $\hat e_i = (0, \dots, 1, -1, \dots, 0)^\top$, $i = 1, \dots, K-1$, has its $i$th element equal to 1 and its $(i+1)$th element equal to $-1$. Therefore, $\mathrm{Cons}(W) = W E$. A dummy variable $U = W E$ was introduced to decouple the last term from the first term, and an augmented Lagrangian function was formed as follows:
$$L(W, U, \Lambda) = \frac{1}{n} \sum_{i=1}^{n} \|F(W, x_i) - Y_i\|_2^2 + \lambda \|U\|_{1,1} - \mathrm{tr}\big(\Lambda^\top (W E - U)\big) + \frac{\beta}{2} \|W E - U\|_F^2$$
where $\Lambda$ is the Lagrangian multiplier and $\beta$ is the penalty parameter.
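The differencing matrix $E$ can be built explicitly; a small sketch (function name ours) illustrating that $WE$ stacks the column differences:

```python
import numpy as np

def diff_matrix(K):
    """E in R^{K x (K-1)} with E[i, i] = 1 and E[i+1, i] = -1, so that
    column i of W @ E equals w_i - w_{i+1}, i.e., Cons(W) = W E."""
    E = np.zeros((K, K - 1))
    for i in range(K - 1):
        E[i, i] = 1.0
        E[i + 1, i] = -1.0
    return E
```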
This can then be solved by optimizing over each variable alternately. The update rules for SADMM are as follows:
$$W_\tau = \arg\min_{W \in \mathbb{R}^{D \times K}} L(W, U_{\tau-1}, \Lambda_{\tau-1}) = \arg\min_{W \in \mathbb{R}^{D \times K}} \tilde F(W_{\tau-1}, x_i) + \mathrm{tr}\left\{ \frac{\partial \tilde F(W_{\tau-1}, x_i)}{\partial W}^{\!\top} (W - W_{\tau-1}) \right\} + \frac{\beta}{2} \left\| W E - U_{\tau-1} - \frac{1}{\beta} \Lambda_{\tau-1} \right\|_F^2 + \frac{\|W - W_{\tau-1}\|_F^2}{\eta_{\tau-1}}$$
$$U_\tau = \arg\min_{U \in \mathbb{R}^{D \times K}} L(W_\tau, U, \Lambda_{\tau-1}) = \arg\min_{U \in \mathbb{R}^{D \times K}} \lambda \|U\|_{1,1} + \frac{\beta}{2} \left\| W_\tau E - U - \frac{1}{\beta} \Lambda_{\tau-1} \right\|_F^2$$
$$\Lambda_\tau = \Lambda_{\tau-1} - \beta (W_\tau E - U_\tau)$$
where $\tilde F(W_{\tau-1}, x_i) = \|F(W_{\tau-1}, x_i) - Y_i\|_2^2$.
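The $U$-update above has a closed form: elementwise soft-thresholding at $\lambda/\beta$. A minimal sketch (function names ours):

```python
import numpy as np

def soft_threshold(X, t):
    """Elementwise soft-thresholding: shrink each entry toward zero by t."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def admm_u_update(WE, Lam, lam, beta):
    """U-step of SADMM: argmin_U lam*||U||_{1,1} + (beta/2)*||WE - U - Lam/beta||_F^2,
    solved by soft-thresholding WE - Lam/beta at lam/beta."""
    return soft_threshold(WE - Lam / beta, lam / beta)
```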
LA-SADMM solves the problem more efficiently by increasing the penalty parameter stage-wise. The pseudocode for the algorithm is as follows:
Algorithm 3: LA-SADMM for solving the consecutive close (CC) regularized problem with the $\ell_1$-norm.
  Input: X, Y, W 0 , U 0 , Λ 0 , β 1 , η 1 , S, and T

4.4. Extensive Discussion

It is noteworthy that the main contribution of this work is the incorporation of model parameter reduction and regularized MTL into air pollutant prediction. As illustrated above, for the parameter reduction part, our light formulation reduces the number of model parameters by removing, from each hour's submodel, the meteorological parameters of the other hours. For the MTL part, we considered that the models for consecutive hours may share similarities, and we therefore added appropriate regularizers for this purpose.
The high-level idea of MTL lies in transfer learning, which generally aims to transfer knowledge from a related source task to a target task and consequently improve the performance for the target task. There are multiple variants for transfer learning, such as inductive transfer learning, transductive transfer learning and unsupervised transfer learning, and the approaches for transfer learning mainly include instance transfer, feature-representation transfer, parameter transfer and relational-knowledge transfer [60]. One of the most common examples is feature-representation transfer for deep neural networks. After either supervised or unsupervised learning from other related datasets, the pretrained model can be appropriately reused for learning the target task with a better performance. The MTL technique in this work is an example of parameter transfer in an inductive-transfer-learning setting.
A similar idea can be applied to other kinds of work. First, if the submodels are built not for each hour but for each day (or even for each location, from a spatial perspective), we can still apply the parameter reduction idea, keeping only the more important information and removing information with low priority. Second, for the MTL part, we can still add regularizations for the similarities of the submodels. Furthermore, in this work, each submodel $w_i$ was a linear regression model; it is also practical to replace it with support vector regression (SVR), nonlinear regression, neural networks, and so on. Finally, the techniques used in this work can be further combined with many other transfer learning techniques, such as feature-representation transfer for deep neural networks.

5. Experiments

We used the names of the paired air quality monitoring sites and two weather stations to denote the two datasets, that is, LU–LV and LMA–AV. LU–LV contained the data to predict the concentration of the two air pollutants O 3 and SO 2 . LMA–AV contained the data to predict the concentration of the two air pollutants O 3 and PM 2.5 .
We compared 11 different models that were learned with different combinations of model formulations and regularizations. The 11 models were the following:
  • Baseline: the baseline model with standard Frobenius norm regularization.
  • Heavy–F: the heavy model with standard Frobenius norm regularization.
  • Light–F: the light model with standard Frobenius norm regularization.
  • Heavy–$\ell_{2,1}$: the heavy model with $\ell_{2,1}$-norm regularization.
  • Heavy–nuclear: the heavy model with nuclear-norm regularization.
  • Heavy–CCL2: the heavy model with CC regularization using the 2 -norm.
  • Heavy–CCL1: the heavy model with CC regularization using the 1 -norm.
  • Light–$\ell_{2,1}$: the light model with $\ell_{2,1}$-norm regularization.
  • Light–nuclear: the light model with nuclear-norm regularization.
  • Light–CCL2: the light model with CC regularization using the 2 -norm.
  • Light–CCL1: the light model with CC regularization using the 1 -norm.
It is noteworthy that we also added the standard Frobenius norm regularizer to the heavy/light–nuclear, –CCL2, and –CCL1 models, because their regularizers mainly control the similarities of the submodels and may not be enough to prevent overfitting. We divided each dataset into two parts: training data and testing data. Each model was trained on the training data, with the regularization parameters and the learning rate selected on the basis of 5-fold cross-validation. Each trained model was evaluated on the testing data. The data was split by dividing all days into chunks of 11 consecutive days, with the first 8 days of each chunk used for training and the next 3 days used for testing. We used the RMSE as the evaluation metric.
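The chunked train/test split described above (11-day chunks, first 8 days for training, next 3 for testing) can be sketched as follows; the helper name is ours:

```python
def split_days(num_days, chunk=11, train_days=8):
    """Assign consecutive day indices to training/testing: within each chunk
    of `chunk` days, the first `train_days` go to training, the rest to testing."""
    train, test = [], []
    for start in range(0, num_days, chunk):
        block = list(range(start, min(start + chunk, num_days)))
        train.extend(block[:train_days])
        test.extend(block[train_days:])
    return train, test
```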
We first report the improvement of each method over the baseline method, measured as a signed percentage of the baseline performance, that is, (RMSE of the compared method − RMSE of the baseline method) × 100/RMSE of the baseline method, so that negative values indicate an improvement. The results are shown in Figure 2 and Figure 3. To facilitate comparison between methods, for each air pollutant of each dataset we report two figures, one grouping the results by regularization and the other by model formulation. From the results, we can see that (i) the light model formulation had a clear advantage over the heavy and baseline formulations, which implies that controlling the number of parameters is important for improving generalization performance; and (ii) the proposed CC regularization yielded better performance than the other regularizations, which verifies that exploiting the similarities between models of consecutive hours is helpful. We also report the exact RMSE of each method in Table 2.
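The evaluation metric and the reported improvement percentage can be computed as below (a minimal sketch; the function names are ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def improvement_pct(rmse_method, rmse_baseline):
    """Signed percentage change relative to the baseline RMSE.

    Negative values mean the compared method beats the baseline
    (lower RMSE), matching the convention used in the figures.
    """
    return (rmse_method - rmse_baseline) * 100.0 / rmse_baseline
```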
Finally, we compared the convergence speed of the employed optimization algorithms with their standard counterparts. In particular, we compared the ASSG and SSG methods for optimizing the ℓ2,1-norm regularized problem, SSG for solving the nuclear norm regularized problem, and SADMM for solving the CC regularized problem. The results are plotted in Figure 4 and demonstrate that the employed advanced optimization techniques converged much faster than the classical techniques.
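For context, proximal-style updates for the ℓ2,1-norm regularizer rely on its well-known closed-form proximal mapping, which is group-wise soft-thresholding. The sketch below is our own illustration of that mapping (not the authors' implementation); it groups the weight matrix by rows:

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1} (group = row of W).

    Each row is shrunk toward zero by `tau` in Euclidean norm; rows
    whose norm is at most `tau` are zeroed out entirely, which is what
    produces shared feature selection across the hourly tasks.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W
```

Applying this mapping after each (stochastic) gradient step on the loss gives the standard proximal update for the ℓ2,1-regularized objective.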

6. Conclusions

In this paper, we have developed efficient machine learning methods for air pollutant prediction. We formulated the problem as regularized MTL and employed advanced optimization algorithms for solving the different formulations. We focused on alleviating model complexity by reducing the number of model parameters and on improving performance with a structured regularizer. Our results show that the proposed light formulation achieves much better performance than the other two model formulations, and that the regularization enforcing the prediction models of two consecutive hours to be close can also boost prediction performance. We have also shown that advanced optimization techniques are important for improving the convergence of optimization and for speeding up training on big data. For future work, we will further consider the commonalities between nearby meteorology stations and combine them in an MTL framework, which may further boost the predictions.


Acknowledgments

The authors would like to thank the Environmental Health Sciences Research Center at the University of Iowa and National Science Foundation Grant No. IIS-1566386 for funding and facilitating this research.

Author Contributions

Dixian Zhu, Tianbao Yang, and Xun Zhou conceived and designed the experiments; Changjie Cai collected the data; Dixian Zhu and Changjie Cai analyzed the data; Dixian Zhu performed the experiments; Xun Zhou and Tianbao Yang contributed to the progress of research idea; Tianbao Yang, Changjie Cai and Dixian Zhu wrote the paper. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Locations of measurement sites. Blue stars denote the two air quality monitoring sites. Red circles denote the two meteorological sites.
Figure 2. Improvement of different methods over the baseline method for Lewis University–Lemont Village (LU–LV) dataset.
Figure 3. Improvement of different methods over the baseline method for Lansing Municipal Airport–Alsip Village (LMA–AV) dataset.
Figure 4. Convergence of the employed optimization techniques compared with their standard counterparts.
Table 1. Summary of measurement sites and observed variables.
Measurement Sites | Observed Variables
Alsip Village (AV) | Ozone concentration and PM2.5 concentration
Lemont Village (LV) | Ozone concentration and sulfur dioxide concentration
Lansing Municipal Airport (LMA) | Temperature, relative humidity, wind speed and direction, wind gust, precipitation accumulation, visibility, dew point, wind cardinal direction, pressure, and weather conditions
Lewis University (LU) | The same as for the LMA site
Table 2. Root-mean-squared error (RMSE) for all approaches and datasets. The best approaches are marked as bold.
Approaches | LMA–AV: O3 | LMA–AV: PM2.5 | LU–LV: O3 | LU–LV: SO2
Heavy–ℓ2,1 | 0.12569 | 0.041 | 0.0883 | 0.033591
Light–ℓ2,1 | 0.11591 | 0.037 | 0.085376 | 0.033411
