Air Quality Prediction in Smart Cities Using Machine Learning Technologies Based on Sensor Data: A Review

: The inﬂuence of machine learning technologies is rapidly increasing and penetrating almost in every ﬁeld, and air pollution prediction is not being excluded from those ﬁelds. This paper covers the revision of the studies related to air pollution prediction using machine learning algorithms based on sensor data in the context of smart cities. Using the most popular databases and executing the corresponding ﬁltration, the most relevant papers were selected. After thorough reviewing those papers, the main features were extracted, which served as a base to link and compare them to each other. As a result, we can conclude that: (1) instead of using simple machine learning techniques, currently, the authors apply advanced and sophisticated techniques, (2) China was the leading country in terms of a case study, (3) Particulate matter with diameter equal to 2.5 micrometers was the main prediction target, (4) in 41% of the publications the authors carried out the prediction for the next day, (5) 66% of the studies used data had an hourly rate, (6) 49% of the papers used open data and since 2016 it had a tendency to increase, and (7) for efﬁcient air quality prediction it is important to consider the external factors such as weather conditions, spatial characteristics, and temporal features. Error (RMSE), Normalised Root Mean Square Error (NRMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE) and Pearson correlation coefﬁcient (R). The results showed that compared to other methods, SLI-ESN performed better results. In addition, the authors compared methods, in terms of the time factor, and the results showed that ESN and ELM based methods were faster than deep learning model, because the latter one consumes much time in training step for optimal subset selection and model optimization. The proposed method was not the fastest one, but it was in an acceptable time frame. About the limitations, the main problem was that for the longer term, the result was not satisfactory. [57], CMAQ [58], and neural network. As evaluation metrics were used Relative Absolute Error (RAE), Relative Squared Error (RSE) and R. The results showed that the proposed method outperformed other methods.


Introduction
The numbers show that more and more people are moving to cities. According to United Nations (UN) urban population as of 2018 is about 55.3% [1] and by 2050 it will become 68% [2]. Growth of urbanization causes some problems related to different aspects of life, such as transportation, health care, air quality, etc. The smart city concept was created to solve these problems, which by integrating Information and Communication Technology (ICT) with citizens and existing resources can support sustainable development and life quality improvement. There are different definitions describing smart cities, such as 'A smart city is a city in which there are six main components including smart economy, smart transportation, smart environment, smart citizens, smart life, and smart management' [3] or 'The use of smart computing technologies to make the critical infrastructure components and services of a city-which include city administration, education, healthcare, public safety, real estate, transportation, and utilities-more intelligent, interconnected, and efficient' [4]. Using the services built around the smart city notion allows us to capture a huge amount of data about the current situation and to see the real picture all over the city. The availability of data provided by sensors is a significant feature of smart cities [5,6].
From the above-mentioned issues and definitions, we can notice that air quality is considered to be an essential component in the smart city concept. Air quality has become a massive problem in many

Search Strategy and Inclusion/Exclusion Criteria
To select and research relevant papers, first of all, we selected databases, including Scopus and IEEE Xplore repositories. Then, we defined the searching terms, and the entry query was as follows: 'Machine Learning' AND 'Air Quality Predict*' OR 'Air Pollut*'. The next step was year and source type restrictions by selecting journal papers and conference proceedings published since 2002. The output of this step provided 316 papers.
It should be noted that recently Rybarczyk and Zalakeviciute published a paper about 'Machine learning approaches for outdoor air quality modelling: A systematic review' [11]. By reviewing this paper, we have defined key features which were taken into consideration during our study. In the first place, we have narrowed the scope of the models by choosing only forecasting models, while the authors mentioned above also included papers concentrated on the estimation models. Another reduction is that we selected papers in which only sensor data in the smart cities context are being used. After these steps and after excluding duplicated manuscripts and reviewing titles and abstracts, the filtration output reached 131. Finally, applying quality assessment, irrelevant papers were excluded, and as a result, we had 41 selected papers. The key questions, on which we focused during the quality assessment, are listed below: 1. Are the research aims clearly specified? 2. Was the study designed to achieve these aims? 3. Are the used techniques clearly described and their selection justified? 4. Are the data collection methods adequately described? 5. Is the purpose of the data analysis clear? 6. Are the findings convincing? 7. How clear are the links between data, interpretation and conclusions?
The inclusion and exclusion criteria used during the review are listed in Table 1 and the overall workflow of the review is represented in Figure 1.

Overview of the Included Studies
After examining the works, the following aspects were extracted: publication years, countries which were served as a case study and machine learning algorithms applied in the papers, which will enable us to obtain a general picture of the present scene.
Regarding publication years, Figure 2 shows the progress of the publications related to air quality prediction in smart cities using machine learning techniques based on sensor data. About countries, Figure 3 on the world map demonstrates the countries where are located the cities which were served as a case study in the publications. It can be noted that China is leading this kind of research works with 26 papers, followed by Italy (3 papers), Spain (2 papers), USA (2 papers), South Korea (1 paper), Iran (1 paper), Egypt (1 paper), Romania (1 paper), Qatar (1 paper), Finland (1 paper) and Saudi Arabia (1 paper).
Related to the algorithms, we categorised the applied methods based on machine learning algorithms. Figure 4 shows the output of categorization (Neural Network (NN), Regression, Ensemble, Hybrid Model, Others). It can be seen that the neural network is leading other algorithms by use in 17 papers. The next most used algorithm is regression, applied in 11 manuscripts, then ensemble in 10 papers, hybrid models in five papers, and Others are two papers, one of which is focused on the regularization and optimization, and the other study applied multinomial naïve bayes and multinomial logistic regression methods.

Exhaustive Descriptions of Included Studies
This section includes a brief description of each selected paper, involving applied methods and obtained results. We grouped the papers based on machine learning algorithms represented in Figure 4 (NN, Regression, Ensemble, Hybrid Model, Others).

Group 1: Neural Network (NN)
Prediction of Air Pollution Concentration Based on mRMR and Echo State Network [12]: to predict PM 2.5 , Xu and Ren employed a Supplementary Leaky Integrator Echo State Network (SLI-ESN) which can memorise historical information. First of all, they used minimum Redundancy Maximum Relevance (mRMR) feature selection method to solve a problem related to data redundancy, which increased computational speed. Then they applied phase space reconstruction to extract evolutionary information of relevant variables, and finally, to perform prediction, SLI-ESN was applied. The following methods were used for comparison purposes: Echo State Network (ESN), Leaky Integrator Echo State Network (LI-ESN), Extreme Learning Machine (ELM), Hierarchical ELM and Stacked Auto-Encoder. The dataset consisted of air pollution (PM 2.5 , particulate matter with diameter equal to 10 micrometers (PM 10 ), nitrogen dioxide (NO 2 ), carbon monoxide (CO), ground-level ozone (O 3 ), sulfur dioxide (SO 2 )) and meteorological (temperature, pressure, humidity, wind speed, wind direction) data. The predictive indicators used for evaluating the model were Root Mean Square Error (RMSE), Normalised Root Mean Square Error (NRMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE) and Pearson correlation coefficient (R). The results showed that compared to other methods, SLI-ESN performed better results. In addition, the authors compared methods, in terms of the time factor, and the results showed that ESN and ELM based methods were faster than deep learning model, because the latter one consumes much time in training step for optimal subset selection and model optimization. The proposed method was not the fastest one, but it was in an acceptable time frame. About the limitations, the main problem was that for the longer term, the result was not satisfactory.
Spatiotemporal The result showed that when the window size was five, it was the optimal size for the hourly as well as for daily and weekly granularities. The limitation was that only the historic air pollution data were used and other relevant data (the meteorological and urban information) were not included.
Air Pollution Forecasting Using a Deep Learning Model Based on 1D Convnets and Bidirectional GRU [14]: Tao et al. introduced short-term forecasting method for PM 2.5 which included the Convolutional-based Bidirectional Gated Recurrent Unit (CBGRU) combined with 1D convnets and Bidirectional Gated Recurrent Unit (BGRU) neural networks. The former one was responsible for feature extraction, and the latter one was for time series forecasting. For checking the effectiveness of the method, the authors compared it to SVR, Gradient Boosting Regression (GBR), Decision Tree Regression (DTR), simple RNN, LSTM, Gated Recurrent Unit (GRU) and BGRU. The data used in this study were from the machine learning repository at the University of California, Irvine (UCI) [15] and meteorological data from Beijing Capital International Airport. RMSE, MAE and SMAPE were used for evaluation purposes. Comparing to the traditional ones, the prediction results demonstrated that the error of the CBGRU model was lower, including RMSE-14.5319, MAE-10.4789 and SMAPE-0.2055.
A Deep CNN-LSTM Model for Particulate Matter (PM 2.5 ) Forecasting in Smart Cities [16]: to forecast PM 2.5 , the combination of Convolutional Neural Network (CNN) and LSTM was applied. CNN was responsible for features extraction, LSTM was for analysing the extracted features and for estimating the PM 2.5 concentration of the next point in time. The method proposed here (APNet) used PM 2.5 concentration, cumulated wind speeds, and cumulated hours of rain over the last 24 h in order to predict PM 2.5 for the next hour. Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Multilayer Perceptron (MLP), CNN, and LSTM were used for comparison purposes. As an evaluation metrics, the authors selected MAE, RMSE, R and Index of Agreement (IA). The results showed that, although CNN and LSTM separately achieved good results, the combination of them, the proposed CNN-LSTM model (APNet) was better having the following results: MAE-14.63446, RMSE-24.22874, R-0.959986, IA-0.97831.
A Sequence-to-Sequence Air Quality Predictor Based on the n-Step Recurrent Prediction [17]: taking into account that sequence-to-sequence (seq2seq) had some problems (slow training speed, error accumulation), Liu et al. proposed to use an Attention-based Air Quality Predictor (AAQP) with n-step recurrent prediction. To accelerate the training process, RNN in encoder was replaced with a Fully Connected (FC) layer and also considering that FC layer was not as powerful as RNN during the process of sequential data, position embedding was applied. To improve the accuracy, n-step recurrent prediction was applied. MAE and determination coefficient (R 2 ) were used as performance metrics. The following methods were used to compare and measure the proposed method: ANN, SVM, GRU, LSTM, seq2seq, seq2seq-mean, seq2seq-attention and n-step AAQP. The Olympic Center station (smaller fluctuations of PM 2.5 ) and Dongsi station (big fluctuations of PM 2.5 ) were selected as the target stations. The results showed that attention-based models demonstrated better results and also recurrent prediction induced better results compared to direct prediction. The proposed AAQP (GRU) method in the Olympic Center station had similar performances to the original seq2seq model with attention. According to the MAE score, the best performance was seq2seq-attention (GRU)-33.109, and for R 2 was seq2seq-attention(GRU)-0.253. In the Dongsi station according to MAE score and R 2 , the best performance was 1-step AAQP (GRU)-41.468 and 0.228 respectively, which confirmed that proposed AAQP (GRU) method was better compared to other methods. Related to the steps analysis, the results showed that 12-step AAQP was the best. The authors also compared training and prediction time for each model. The results showed that training time (s) of 12-step AAQP (GRU) and the prediction time of 12-step AAQP (LSTM) had better performances. Concerning future work, it was suggested to work on spatial attention and to collect more weather forecast data.
An End-to-End Adaptive Input Selection With Dynamic Weights for Forecasting Multivariate Time Series [18]: presents the Adaptive Input Selection with Recurrent Neural Network (AIS-RNN) for multivariate time series forecasting. The model consisted of two parts; the first model generated context-dependent importance weights for selecting proper inputs; afterwards, the second model based on the inputs predicted the target variable. For the latter part Elman networks (simple RNN), LSTM and GRU were applied. RMSE, MAE and MAPE were applied to estimate model performances. The dataset used in this research consisted of 3 different types: financial, energy use of appliances and air quality dataset. The latter one was taken from the machine learning repository at UCI [19]. For comparison purposes the following methods were taken: LSTM, GRU, RNN, SVM, RF, AdaBoost, DT based on Recursive Feature Elimination, VAR-based and without anything, and also LSTM, RNN, GRU based on AIS-RNN. The results showed that the proposed model outperformed other models. As future work, the authors proposed to extend AIS-RNN as an end-to-end ensemble model.
Prediction of Urban PM 2.5 Concentration Based on Wavelet Neural Network [20]: focuses on prediction of PM 2.5 using Wavelet Neural Network (WNN). Different techniques were chosen in order to evaluate the effectiveness of WNN, including ELM, Fuzzy Neural Network (FNN) and Least Squares Support Vector Machine (LSSVM). The dataset contained hour average concentration of temperature, relative humidity, O 3 , CO, NO 2 , SO 2 , PM 10 and PM 2.5 . Firstly, the Pearson correlation and the bilateral significance test were used to calculate the correlation between PM 2.5 and other pollutants. After inputting PM 10 , CO, NO 2 , SO 2 and O 3 , temperature and relative humidity were considered for predicting the concentration of PM 2.5 . For evaluating the prediction models, the statistical parameter of R 2 , RMSE, and MAPE were chosen. Comparing to other methods, the results showed that detection results based on WNN were more accurate, only in terms of R 2 for 1 hour the FNN had comparatively better results (0.099). However, making WNN still is challenging because of the determination of proper wavelet basis function and hidden layer nodes. For evaluation purposes were used R and MAE. In terms of haze content, PM 2.5 and PM 10 were taken, and in terms of meteorological content, wind speed, wind direction, temperature, humidity, light, and atmospheric pressure were obtained for the period of 2016-2017. Multiple regression, ARMA, Classification and Regression Tree, and NN were applied for comparison purposes. The results showed that DBN-H outperformed other methods, having R-0.767 and MAE-26.5 mug/m 3 results. Overall, DBN-H model provided a correlation result with 18% better than others, and MAE was decreased by 15.7 µg/m 3 . As a limitation, the lack of data was reported.
Deep Distributed Fusion Network for Air Quality Prediction [22]: Yi et al. offered the Deep Neural Network (DNN)-based approach consisted of a spatial transformation component and a deep distributed fusion network. The model was applied for 48 hour fine-grained air quality forecasts for more than 300 Chinese cities. Air quality data consisted of hourly collected elements, including PM 2.5 , PM 10 , NO 2 , CO, O 3 , and SO 2 . A meteorological dataset consisted of weather (sunny, cloudy, overcast, foggy, snow, small rain, moderate rain, and heavy rain), humidity, temperature, pressure, wind speed, and wind direction was used. Weather forecast dataset consisted of weather, temperature, wind strength and wind direction. The proposed model was compared to the following methods: ARIMA, lasso, GBDT, FFA [23], LSTM, DeepST [24], DMVST-Net [25], DeepSD [26], DeepFM [27], WFM [28]. As an evaluation, metrics accuracy (ACC) and MAE were selected. The proposed approach outperformed other methods. The final results had 2.4%, 12.2%, and 63.2% relative accuracy improvements on short-term, long-term and sudden changes prediction, respectively compared to the previous system. Regarding future work, the long-term sudden changes prediction was considered.
Prediction of Air Pollutants Concentration Based on an Extreme Learning Machine: The Case of Hong Kong [29]: to increase air pollution prediction Zhang and Ding applied the extreme learning machine which performed good generalization with fast learning speed. The dataset used in this study included air quality (NO 2 , nitrogen oxide (NO x ), O 3 , PM 2.5 , SO 2 ) and meteorological (temperature, wind speed, wind direction, relative humidity) data during the period of 2010-2015. The following parameters were applied in order to evaluate the proposed methods: MAE, RMSE, IA, and R 2 . Compared to FeedForward Neural Network based on Back Propagation (FFANN-BP) and Multiple Linear Regression (MLR), the proposed method performed better.
Relevance analysis and short-term prediction of PM 2.5 concentrations in Beijing based on multi-source data [30]: focuses on the short-term prediction of PM 2.5 in Beijing. Multivariate statistical analysis method and Back Propagation Neural Network (BPNN) were applied in order to study correlation analysis. Afterwards, ARIMA was applied for predicting PM 2.5 . The dataset consisted of air quality data (CO, NO 2 , SO 2 , PM 10 ), meteorological data (average rainfall, daily mean temperature, average relative humidity, average wind speed, maximum wind speed) and social media data (microblog data) collected during January and December in 2014. To analyse the correlation, the authors used R and the Spearman Correlation Coefficient and to evaluate the method RMSE was applied, which reached to 6.76 mg/m 3 .
Evolving Keras Architectures for Sensor Data Analysis [31]: presents the genetic algorithm for the architecture of deep neural networks using the KERAS library [32]. Applying air pollution data, the results showed that the proposed model could increase the accuracy of the air pollution prediction. The target pollutants were CO, NO 2 , NO x , benzene (C 6 H 6 ), and non-methane hydrocarbons (NMHC). Compared to SVM and selected fixed architectures, the proposed method performed better.
Forecasting PM 2.5 Concentration using Spatio-Temporal Extreme Learning Machine [33]: by taking into account fast training, fewer configuration parameters, and ease of obtaining global optima, Spatio-Temporal Extreme Learning Machine (STELM) method was applied for enhancing forecasting of PM 2.5 for the next 72 hours. The dataset consisted of air quality data (NO 2 , CO, SO 2 , O 3 , PM 10 , and PM 2.5 ) and meteorological spatio-temporal sequences (temperature, humidity, wind direction, wind force, and precipitation) collected during April and May in 2014. Mean Relative Error (MRE) and MAE were used to evaluate the proposed method. Overall, the precision in the first 12, 24, 48, 72 hours were 82%, 78%, 71%, and 63%, respectively.
Urban Air Pollution Monitoring System With Forecasting Models [34]: aims to monitor urban air pollution and based on the results to make a prediction. Shaban et al. applied the following machine learning algorithm, including SVM, M5P, and ANN with univariate and multivariate models to forecast O 3 , NO 2 , and SO 2 for the next 1, 8, 12, and 24 h. The data were collected every 15 min. To compare the methods, the following metrics were used, including Prediction Trend Accuracy (PTA), RMSE and NRMSE. The results showed that M5P outperformed other methods. Additionally, the results confirmed that the multivariate approach had better performances compared to the univariate approach.
Air Quality Forecasting using Neural Networks [35]: Zhao et al. suggested to apply extreme learning machine-based approach to forecast air quality. The case study was Helsinki, and the data included hourly air quality data (nitric oxide (NO), O 3 , PM 10 , PM 2.5 ) and meteorological data (relative humidity, pressure, temperature, and wind). Taking into account the challenges related to big data analysis, the authors applied forward selection in order to select most correlated variables, later by applying Principal Component Analysis (PCA) they reduced the dimensionality. In general, the proposed method provided good results; however, for the future work, the authors suggested an ensemble extreme learning machine to enhance the accuracy of the prediction.
Predicting minority class for suspended particulate matters level by extreme learning machine [36]: taking into consideration the problem related to the imbalance dataset which can affect on the prediction result, Vong et al. applied ELM and SVM methods to predict PM 10 by handling the problem mentioned above. They also applied prior duplication strategy, which also aims to improve the output of the prediction. The data were provided by Macau government meteorological center (SMG) [37], including air quality (PM 10 , NO 2 , SO 2 , O 3 ) and meteorological data (atmospheric pressure, temperature, mean relative humidity, wind speed, rainfall, sunshine hour, wind direction) from 2003 to 2010. The results showed that ELM with or without prior duplication predicted minority classes better than SVM, also in terms of the training time and the memory ELM outperformed SVM model.
Three improved neural network models for air quality forecasting [38]: taking into account the drawbacks of the neural network (computationally expensive training, local minima, overfitting, etc.), Wang et al. suggested to apply Adaptive Radial Basis Function (ARBF) network with and without PCA, and improved SVM to predict air quality. The dataset consisted of air quality (Respirable Suspend Particles (RSP), SO 2 , NO x , NO, NO 2 , CO) and meteorological (wind speed, wind direction, outdoor and indoor temperature, solar radiation) data of the city of Hong Kong during 2000. MAE, RMSE and Willmott's index of agreement (WIA) were used to evaluate the methods. The results confirmed the advantages of each proposed method (ARBM automatically defined the network architecture and had fast learning speed, ARBF/PCA was an improved version of ARBF by simplifying the latter method, and, finally, SVM had higher accuracy).

Group 2: Regression
Comparative Analysis of Machine Learning Techniques for Predicting Air Quality in Smart Cities [39]: Ameer et al. used different models for predicting air quality, such as DTR, Random Forest Regression (RFR), MLP and GBR. The dataset used in this study included year, month, day, hour, season, PM 2.5 , dew point, temperature, humidity, pressure, combined wind direction, accumulated wind speed, hourly precipitation, accumulated precipitation. For evaluation criteria, MAE and RMSE were used. Here were the best methods of each city in terms of RMSE and MAE: Beijing city-DTR (RMSE-0.07) and RFR (MAE-16.92%); Shanghai city-MLP (RMSE-0.03, MAE-13.84%); Shenyang city-RFR (RMSE -0.059) and MLP (MAE-13.65%); Guangzhou city-MLP (RMSE-0.045, MAE-12.2%); Chengdu city-RFR (RMSE-0.08, MAE-10.5%). In terms of processing time, DTR and RFR were faster compared to the other two methods. After hyperparameter tuning on a single Spark node, the results showed that RF was the best technique, which also was able to find the peak values. For future work, the authors mentioned using additional factors related to air pollution.
Air-Pollution Prediction in Smart Cities through Machine Learning Methods: A Case of Study in Murcia, Spain [40]: was focused on the prediction of the ozone level (O 3 ) in the region of Murcia (Spain). The machine learning techniques used in this paper are: bagging-with REPTree classifier, a random committee with random tree based classifier, RF, M5P and an instance-based technique with K Nearest Neighbors (KNN). The dataset included average per hour of chemical elements (NO, NO 2 , SO 2 , NO x , PM 10 , C 6 H 6 , toluene (C 7 H 8 ), xylene (XIL)) and climatic parameters (temperature, relative humidity, wind direction, wind speed, atmospheric pressure and solar radiation) for each day for 2013-2014 years. For the evaluation intention, the models were measured by MAE, RMSE and R 2 . The results showed that RF had lower RMSE and MAE than the other machine learning models. Related to R 2 , above 0.75R 2 was considered a satisfactory result and all the methods obtained higher from this threshold. The results indicated R 2 setting between 80% and 90% overall. In addition, except for this, Martínez-España et al. applied the Wilcoxon Signed Ratings Test, which confirmed that RF had better results than the other machine learning techniques with 99% confidence level. After choosing the best model, the next step was to do hierarchical clustering in order to know how many models would be needed for O 3 prediction in the region of Murcia. For that purpose, the Discrete Wavelet Transform (DWT) and Euclidean distance measurement were applied. The output indicated that air pollution monitoring area can be divided into two zones: three cities except Caravaca were unified as one cluster and Caravaca remained as a separate cluster. For future work, new elements (PM 10 , SO 2 ) must be considered and analysed, and, also another improvement would be to transfer information to the target groups.
Air Pollution Forecasting Model Based on Chance Theory and Intelligent Techniques [41]: to forecast PM 10 hourly concentration for the next hour, Eldakhly et al. suggested to apply chance Weighted Support Vector Regression (chWSVR). The method can deal with interval-valued uncertainty. The dataset consisted of air pollution data (SO 2 , CO, O 3 , PM 10 , PM 2.5 , NO x ) and meteorological data (air temperature, relative humidity, atmospheric pressure, planetary boundary layer height, wind speed and direction) collected from 2007 to 2010. With the data mentioned above, temporal variables were also included as an input. The following parameters were used as an evaluation metrics: RMSE, R, fisher r-to-z transformation (z') and t-value (significant at α 0.05). Compared to RF and bootstrap aggregating techniques, the proposed model demonstrated better results. Apart from air quality and meteorological data, the following variables also were included in the study: proximity to transportation, topographical characteristics, neighbourhood characteristics. R 2 was applied as an evaluation metric. The results showed that the proposed method could provide an efficient prediction.
Particulate Matter Air Pollutants Forecasting using Inductive Learning Approach [45]: Oprea et al. applied an inductive learning approach to forecast PM 10 for the next three days using the data of the previous 8 days. The two methods that were used in this study are M5P and REPTree. The data set included air quality (sulfur dioxide, nitrogen monoxide, carbon monoxide, nitrogen oxides, nitrogen dioxide, particulate matter, ozone, o-xylene, m-xylene, benzene, toluene, p-xylene, butadiene, ethyl-benzene), and meteorological data (temperature, relative humidity, solar radiation, atmospheric pressure, wind direction, wind speed, precipitations), over a period of January 2009 to December 2009, January 2011 to December 2011, and January 2015 to April 2015. With the help of PCA, the most correlated variables were selected, including SO 2 , NO 2 , air temperature and relative humidity. The evaluation metrics used in this study were R, MAE and RMSE. The results showed that M5P enhances the accuracy of the prediction.
Wind-sensitive Interpolation of Urban Air Pollution Forecasts [46]: is focused on the prediction NO, NO 2 , SO 2 , O 3 and interpolation real-time forecasts in city Valencia. Wind aspect was used for prediction taking into account the factor in Valencia. This air quality data were taken from the Valencia City Council, and the meteorological data (temperature, relative humidity, pressure, wind speed, rain) were taken from the Meteorological Agency of the Government of Spain (AEMET) [47]. Additional to these data traffic intensity features were extracted (traffic level in the surrounding stations and traffic level 3 h before). The following machine learning techniques were applied, including Linear Regression (LR), Quantile Regression (QR) with lasso method, KNN with k = 10, DTR, and RF. To measure the methods mentioned above, the authors used RMSE. The results showed that RF had comparable better results. Afterwards, the authors analysed the wind direction effect on air pollution to enrich the forecasting model, and they applied Local IDW for interpolation purposes, which includes a wind direction factor.
Comparing the Performance of Statistical Models for Predicting PM 10 Concentrations [48]: is focused on the hourly PM 10 prediction. The following machine learning methods were applied, including MLR, QR, Generalised Additive Model (GAM), and Boosted Regression Trees 1-way (BRT1) and 2-way (BRT2). The dataset included air quality (CO, SO 2 , NO, NO 2 , PM 10 ) and meteorological (wind speed, wind direction, temperature, relative humidity, rainfall, pressure) data of the city of Makkah during 2012. To evaluate the methods, the Mean Bias Error (MBE), MAE, RMSE, the fraction of prediction within a Factor of Two (FACT2), R, and IA were applied. The results showed that QR outperformed other methods. As a limitation it was mentioned that only one city was considered as a case study, and also the time period was short.
Forecasting daily ambient air pollution based on least squares support vector machines [49]: aims to perform air quality prediction using LSSVM. The data used in this study were collected from 2003 to 2006. To evaluate the method, it was compared to MLP by applying relative error measure. The data were taken from 2003 to 2006 years. The results confirmed the advantages of the proposed method.
Online prediction model based on support vector machine [50]: is concentrated on the prediction of air quality in the city of Hong Kong using an online SVM, which was compared to conventional SVM. The dataset consisted of hourly measurement of air quality data (CO, NO, NO 2 , SO 2 , NO x , O 3 , RSP), and meteorological data (indoor and outdoor temperature, solar radiation, wind direction, wind speed). To evaluate the methods, the following metrics were taken, including MAE, RMSE and WIA. The results showed the superiority of the online SVM.
Air pollutant parameter forecasting using support vector machines [51]: Lu et al. studied air quality prediction by applying SVM. They compared SVM to Radial Basis Function (RBF). The data used in this research contained hourly measurements of air quality of the city of Hong Kong during the year of 1999. Taking into consideration the effects of RSP on the case study, the authors selected this pollutant to evaluate the proposed method. The data of June and December were taken. In case of analysing data during December, meteorological data were ignored, while during June the data were included. As an evaluation metric, MAE was used. The results showed that SVM is better in the term of generalization performance, and it provides higher accuracy.

Group 3: Ensemble
A predictive data feature exploration-based air quality prediction approach [52]: Zhang et al. proposed Light Gradient Boosting Machine (LightGBM) model and combining predictive and historical data executed prediction of the PM 2.5 concentration over the next 24 h. This method helped to process the high-dimensional large-scale data and support parallel learning. The problem of the lack of data was solved by applying the sliding window mechanism, which increases the training dimensions to millions. PCA dimension reduction method was used to discard redundant information. Afterwards, all data were integrated, including air quality (PM 2.5 , PM 10 , NO 2 , CO, O 3 , SO 2 of the 35 air quality monitoring stations in Beijing from 2017 to 2018), temporal, meteorological (temperature, weather, humidity, wind direction, wind speed), weather forecast and statistical features. The proposed method was compared to Adaboost, GBDT, XGboost, DNN and also with LGBT without forecasting.
To evaluate the prediction model, three evaluation functions were used, including, SMAPE, Mean Square Error (MSE), and MAE. The results showed that LightGBM outperformed other methods. This is due to that LightGBM is a histogram-based algorithm that supports parallel learning, which causes faster training rate and higher accuracy. In addition, it is worth noting that the proposed method outperformed LightGBM without predictive data.
A multiple kernel learning approach for air quality prediction [53]: Zheng et al. proposed multiple kernel learning model with support vector classifier (MKSVC) as the base learner, which combines feature selection, metric learning and ensemble method for predicting air quality. For learning kernels, the centred alignment approach was applied, and for determining the optimal number of kernels, a boosting approach was applied. The case study was Hong Kong and Beijing. Air pollutant dataset contained Fine Suspended Particulates, NO 2 , NO x , O 3 , RSP, and SO 2 . The meteorology dataset contained temperature, atmospheric pressure at weather station level, atmospheric pressure reduced to mean sea level, pressure tendency, relative humidity, mean wind direction, mean wind speed, dew, dew point. Timestamp features were contained month, week, day and hour. The prediction targets for this study were the Air Quality Health Index (AQHI) in Hong Kong and the PM 2.5 Individual Air Quality Level (IAQL) in Beijing. The model was compared to ARIMA, RF and SVM, MLP and LSTM. For evaluating the effectiveness of the methods, the authors used accuracy, MSE, Weighted Precision (WP), Weighted Recall (WR), and Weighted F1-score (WF). The results of forecasting future 1, 3, 6, 9, and 12 hours' AQHI in Hong Kong showed that MKSVC was the best among all methods. MKSVC was best also for forecasting the PM 2.5 IAQL of Beijing. Compared to other methods, the proposed approach demonstrated relatively good performances for long-term prediction and severe air pollution prediction; however, for effective air quality prediction, more exploration should be done.
A data ensemble approach for real-time air quality forecasting using extremely randomised trees and deep neural networks [54]: Eslami et al. applied extremely randomised tree (extra-trees method) and DNN, generalised ensemble models, for forecasting ozone concentration. The ensemble model integrated two regression models: low-and high-ozone peak models. Two models were generalised, such as merging all samples from all sources and uniformly distributing the samples based on target ozone peaks. In addition, regularised models were developed in order to focus more on episodes with high-ozone peaks more significant than the threshold (90 Parts Per Billion (PPB)). The data used in this paper included the observed hourly values of O 3 and NO x concentrations, surface temperature, relative humidity, wind speed, direction, dew point temperature, surface pressure, and precipitation. For evaluation purposes, IA was applied. The results showed that yearly IA was in the range of 0.84-0.89 and yearly Rs were in the range of 0.72-0.80. As a limitation was mentioned that high-ozone episodes were underpredicted, particularly during the high-ozone season (April-September).
A Deep Spatial-Temporal Ensemble Model for Air Quality Prediction [55]: Wang and Song proposed a deep Spatial-Temporal Ensemble (STE) model, which included weather pattern-based partitioning strategy, spatial correlation and temporal predictor based on deep LSTM. The dataset consisted of air quality data (CO, NO 2 , SO 2 , O 3 , PM 10 , PM 2.5 ) and weather forecast data (temperature, humidity, wind speed, wind direction) from May 2013 to April 2017 from 35 monitoring stations in Beijing, China. To evaluate the effectiveness of the model, MAE, RMSE and accuracy were used. The following baselines were used: LR, Regression Tree (RT), DNN, FFA [23]. The results showed that STE outperformed other methods.
Early Air Pollution Forecasting as a Service: an Ensemble Learning Approach [56] : is focused on the air pollution prediction using Multi-channel Ensemble Learning via Supervised Assignment (MELSA) algorithm reported in web service. The case study for this research was Beijing city. The air pollution data consisted of PM 2.5 , PM 10 , SO 2 , CO, NO x , O 3 , and meteorological data were relative humidity, dew point temperature, surface pressure. The aim of this study is using the features mentioned above as an input to predict air quality index (AQI) for 24-72 h temporal resolution. The proposed method was compared to the following methods, including stacking, RF, AdaBoosting, bagging, WRFChem [57], CMAQ [58], and neural network. As evaluation metrics were used Relative Absolute Error (RAE), Relative Squared Error (RSE) and R. The results showed that the proposed method outperformed other methods.

A Comprehensive Evaluation of Air Pollution Prediction Improvement by a Machine Learning Method [59]:
Xi et al. applied the machine learning techniques in order to predict air pollution with better accuracy. As an input variables were taken air quality (PM 2.5 , PM 10 , SO 2 , NO 2 , CO, O 3 ), meteorological (wind speed, direction, pressure, humidity, temperature), chemical components (organic carbon, black carbon, dust) from October 2013 to April 2015. The methods (RF, gradient boosting, SVM, DT and combined models of these models) were applied in 74 cities in China. The results showed that in the case of including more features, the accuracy would increase, also the combination of the methods performed better results than each method separately.
Ensemble forecasting with machine learning algorithms for ozone, nitrogen dioxide and PM 10 on the Prev'Air platform [60]: describes the Prev'Air operational platform which is served to generate a daily map for forecasting O 3 , NO 2 and PM 10 . The data were collected between 2008 and 2010 years. To evaluate the performance indicators in order to include in the platform, Normalized Mean Square Error (NMSE), correlation, daily observed mean vs. daily simulated mean were applied. The Discounted Ridge Regression (DRR) was applied in order to compute new weights before the prediction; afterwards, the authors compared it to Best Model and to the Best Constant Linear Combination. RMSE metric was used to evaluate the methods. The result showed that respectively O 3 was reduced by about 29%, 35% and 19% for hourly, daily and peak, NO 2 was reduced by about 19%, 26% and 20% for hourly, daily and peak, PM 10 was reduced by about 17%, 19% and 11% for hourly, daily and peak.

Group 4: Hybrid Model
A Weight-adjusting Approach on an Ensemble of Classifiers for Time Series Forecasting [61]: is focused on forecasting time series using hybrid heterogeneous forecasting model including ARIMA model, SVM and ANN. The approach used in this paper is to take each model's weight based on their ability and history of predicting numerical values. The data used in this study were taken from the machine learning repository at UCI [62]. It included CO, relative humidity, Benzene concentration, etc., from March 2004 to April 2005. For this study from the air quality data set the hourly averaged CO was used. For evaluation purposes, MAE and MAPE were used. Comparing to each single classifier in the ensemble and with RF, the results showed that the proposed method had better performances (MAE-0.5779 and MAPE-30.52%) and also time complexity was O(N), where N is the size of validation data set. The experiments showed that after weight adjusting, the weight of SVM was always larger, which confirms that the role of SVM is more important. The weight of ARIMA was always the smallest, which raises doubts about the choice of ARIMA for time series prediction. Regarding future work, the authors mentioned the use of more than three classifiers and removing classifiers with negative weight.
Application of a Hybrid Model Based on Echo State Network and Improved Particle Swarm Optimization in PM 2.5 Concentration Forecasting: A Case Study of Beijing, China [63]: Xu and Ren proposed a hybrid model based on ESN and an Improved Particle Swarm Optimization (IPSO) to forecast PM 2.5 in Beijing city. First of all, the authors applied Phase Space Reconstruction to map the original data to the high-dimensional space, then Particle Swarm Optimization (PSO) for increasing the searching speed, and by taking into account the fact that PSO can face the problem to find the global minimum, the Convergence Cross-Mapping for proper subset selection was applied. Finally, ESN was applied for prediction. The dataset included hourly averages of PM 2.5 , PM 10 , SO 2 , NO 2 , O 3 , CO, temperature, pressure, humidity, wind speed, and wind direction from 1 January 2016, to 31 December 2016. The following prediction criteria were used for evaluation of the effectiveness of the proposed hybrid model, including RMSE, MAE, SMAPE, and R. The following models were selected for comparison purposes, such as the original model (ESN), Single-hidden Layer Feedforward Network, ELM, BPNN, LSSVM, and LSTM. The authors provided one-step and 10-step forecasting experiments. The results for both steps showed that the proposed model provided better performances among all models. The limitation was that it failed to consider the potential factors in extreme conditions (e.g., radon emissions). An additional extension can be to achieve medium-and long-term forecasts in terms of the time factor. PM 2.5 forecasting using SVR with PSOGSA algorithm based on CEEMD, GRNN and GCA considering meteorological factors [64]: is focused on forecasting of next 30 days' PM 2. 5  For evaluating the following metrics were used: MAE, MAPE, RMSE, R, and IA. The results showed that the suggested hybrid model had relatively better performances. As future work, the authors proposed to apply the method to forecast other air pollution indexes and to evaluate the air quality in other cities.
A novel optimal-hybrid model for daily air quality index prediction considering air pollutant factors [65]: Wu and Lin suggested optimal-hybrid model combined with Secondary Decomposition (SD), AI method and optimization algorithm for forecasting air quality index. In the proposed SD method, Wavelet Decomposition (WD) was chosen as the primary decomposition technique to generate a high-frequency detail sequence WD (D) and a low-frequency approximation sequence WD (A). Variational Mode Decomposition (VMD) improved by Sample Entropy (SE) was adopted to smooth the WD (D). LSTM with good ability of learning and time series memory were applied to make it easy to be predicted. LSSVM with the parameters optimized by the Bat Algorithm (BA) considered air pollutant factors including PM 2.5 , PM 10 , SO 2 , CO, NO 2 and O 3 , which is suitable for forecasting WD (A) that retains original information of AQI series. The dataset was from 1 December 2016 to 31 December 2018 respectively collected from Beijing and Guilin located in China. RMSE, MAE, MAPE and R were selected as evaluation metrics. The results showed that the proposed method outperformed other methods.
A novel hybrid model for air quality index forecasting based on two-phase decomposition technique and modified extreme learning machine [66]: to accurately predict air quality index Wang et al. used a novel hybrid model based on two-phase decomposition, extreme learning machine and different evolution. The two-phase decomposition was based on CEEMD and VMD, which helps to handle non-stationary features. The dataset was from 1 July 2014 to 30 June 2016 of the cities Beijing and Shanghai. To evaluate the model MAE, RMSE and MAPE were applied. The result showed that the proposed method outperformed other methods (MAE-2.53, RMSE-3.27, MAPE-5.09).

Group 5: Others
Regularization and optimization [67]: Zhu et al. proposed parameter-reducing formulations and consecutive-hour-related regularizations for forecasting concentration of air pollutants for the next day. The dataset consisted of meteorological and air pollution data from 2006 to 2015 (Chicago area). The main steps of this study are to explicitly control the number of model parameters and then, to enforce a certain regularization on the model parameter explicitly. For the first step, three models were selected, including Baseline, Heavy and Light. For regularization task, Frobenius norm regularization, 2,1-norm regularization, nuclear norm regularization and Consecutive Close (CC) regularization were proposed. The following models were compared, including the baseline model with standard Frobenius norm regularization (Baseline), the heavy model with standard Frobenius norm regularization (Heavy-F), the light model with standard Frobenius norm regularization (Light-F), the heavy model with 2,1-norm regularization (Heavy-2,1), the heavy model with nuclear-norm regularization (Heavy-nuclear), the heavy model with CC regularization using the 2-norm (Heavy-CCL2), the heavy model with CC regularization using the 1-norm (Heavy-CCL1), the light model with 2,1-norm regularization (Light-2,1), the light model with nuclear-norm regularization (Light-nuclear), the light model with CC regularization using the 2-norm (Light-CCL2), the light model with CC regularization using the 1-norm (Light-CCL1). As evaluation metric was chosen RMSE, and comparatively better results performed Light-CCL1 for Lansing Municipal Airport-Alsip Village (LMA-AV): O 3 and Lewis University-Lemont Village (LU-LV): SO 2 with the score 0.11535 and 0.03248 respectively, Light-nuclear for LMA-AV: PM 2.5 with score 0.0368, and Light-CCL2 for LU-LV: O 3 with score 0.0845. As a limitation was mentioned that similarities between nearby meteorology stations were not considered which could improve the prediction.
Predictive mapping of urban air pollution using Apache Spark on a Hadoop cluster [68]: represents the platform based on Apache Spark and Hadoop cluster, which predicts air pollution in city Tehran for the next 24 hours. To provide efficient prediction, the authors used Multinomial Naïve Bayes and Multinomial Logistic Regression algorithms. Then, applying the IDW method, the predictive map was generated for the whole city. The dataset used in this study consisted of air pollution data (CO, SO 2 , PM 10 , PM 2.5 , NO 2 , O 3 ) captured from 21 monitoring stations and meteorological data (temperature, pressure, cloud cover, relative humidity, wind speed, wind direction) obtained from 4 weather stations between 2009 and 2013. The results showed that the Naïve Bayes model creates more classes than the Logistic Regression model. To compare models, the following metrics were used: precision, recall and F1 score. The logistic regression has comparable higher accuracy (0.68), but it failed to predict classes 4, 5, 6, 7. Meanwhile, the Naïve Bayes model could perform better results for those classes. Overall, the two methods provided good outcomes; however, there were problems related to handling imbalanced data. Based on the latter limitation for future work, more advanced machine learning techniques should be used.

Discussion
After describing all the selected papers, we created a comparison table by extracting the main features of the papers (Table 2). Table 2  Year: as we have already seen (Section 3.1), Figure 2 displays the evolution of the publications over the years. We can see a significant increase since 2014-2015, which can be explained with the appearance of smart cities and open data portals notions in science.
Algorithms: having information about the distribution of the publications per each machine learning algorithm (Figure 4), it would be interesting to know how the usage of the algorithms changed throughout the years. Figure 5 presents the publications for each machine learning algorithm over the years. It can be noted that, in recent years, the number of publications (used neural network, ensemble and hybrid models) has an increasing trend, which is not applicable to the regression method. The latter one has been applied since 2002 almost with the unchangeable trend, moreover, in recent publications, the regression method was mainly applied along with other algorithms for comparison purposes.    [12] AQ, MET YES mRMR is preferable for future selection. Longer term is not satisfactory, long time consuming on optimal subset selection and the model optimization.
[13] AQ, Spatial NO IDW helped to improve BLSTM by 5.6%. Using only the historic air pollution data. [52] AQ, MET, WFD, Spatial NO Faster training rate, higher accuracy. N/S [14] AQ, MET YES To obtain a sequence pattern. N/S [39] AQ, MET NO RFR reduces overfitting, detects peak values.
To use additional factors related to the air pollution.
[61] AQ, MET YES Relatively better result, time complexity is linear.
To use more than three classifiers and remove classifiers with negative weight.
[17] AQ, MET ON REQUEST Reduction of error addition and the training time.
To work on spatial attention, to collect more weather forecast data [63] AQ, MET NO Comparatively better accuracy.
It fails to consider the potential factors in extreme conditions, additional extension-to achieve mediumand long-term forecasts in terms of time factor.
[54] AQ, MET YES The models' computation time for real-time hourly prediction is less compared to the station-specific machine learning models.
high-ozone episodes were underpredicted, particularly during the high-ozone season.
[67] AQ, MET YES To improve the convergence of optimization and to speed up the training process for big data.
Consider the commonalities between nearby meteorology stations.
[53] AQ, MET, Temporal YES Better for Short-term and severe air pollution prediction. More exploration must be done.
[16] AQ, MET YES Relatively better result. N/S [20] AQ, MET NO High stability and robustness. Difficulties to make WNN.
[21] AQ, MET YES Correlation result is 18% better, while the MAE declines by 15.7 µg/m 3 . The lack of data.
The long-term sudden changes prediction.
To forecast other air pollution indexes, to evaluate the AQ in other cities.
[55] AQ, WFD, Spatial NO Effective and reaches nearly 60% in accuracy. N/S Evaluation Metrics: are the metrics used in order to measure applied methods. Figure 6 displays for each metric the number of publications where the metric was applied. We can notice that the most used metrics are RMSE (Equation (1)) and MAE (Equation (2)), each of them applied in 24 publications.
where n is the number of instances, E i and A i are the estimated and actual values. The lower value of these two metrics corresponds to a better prediction. Prediction Target: is the main pollutant in the case study, for which prediction different techniques were applied in order to monitor, to measure, and in the final step, to predict the concentration of that pollutant. Figure 7 represents the pollutants which were considered as a prediction target in the selected studies. The targets are PM 2.5 , NO x , O 3 , PM 10 which in early publications was mentioned as RSP, SO 2 , CO, Suspended Particulate Matter (SPM), NMHC, C 6 H 6 , black carbon (BC), and N/S is the number of publications that did not specify the prediction target. It can be noted that PM 2.5 is the principal pollutant, being as the prediction target in 19 studies. In general, the prediction of particulate matters has always been the main focus of researchers. The only significant change over the years is the ability to measure and monitor finer particulate matters with the help of new sensors (Figure 8). After finding out the main pollutant of the selected papers, it is interesting to explore countries distribution per pollutant. In Figure 9, we can see that the case study of 18 publications out of 19, having PM 2.5 as a pollutant target, is China. Time Granularity: is time resolution which is considered as the prediction interval. In Figure 10, we can notice that most used time resolutions were 24 h in 17 papers, and just in four papers the authors tried to make weekly prediction. The main reason is the issue related to the accuracy of long term predictions (for example, in [12] RMSE for the next 1h was 9.3953, for the next 5 h it was 37.6874 and for the next 10 h it was 65.7108). Data Rates: is a frequency of the data acquisition from the sensors. Figure 11 presents data obtainment frequency for the selected studies. As it might be seen, in the majority of the papers (27) sensors provided hourly data, in nine studies sensors provide daily data, in one paper the frequency was 15 min, and the rest includes publications which did not provide any information about data rates. Dataset Types: include types of data which were used in order to perform analysis. The used dataset types involve AQ: air quality data, MET: meteorological data, Temporal: include the day of the month (values from 1 to 31), day of the week (values from 1 to 7), the hour of the day (values from 1 to 24), WFD: weather forecast data, Spatial: in one paper it refers to proximity to transportation, topographical characteristics, neighbourhood characteristics, and for the rest cases it indicates the locations of the stations, Social Media: microblog data, Chemical: chemical component forecast data (organic carbon, black carbon, sea salt, etc.) and TIF: traffic intensity features, which contains information about traffic level in the surrounding stations (vehicles/hour). As shown in Figure 12, in the majority of the papers (36) air quality data was combined with meteorological data, considering the importance of the latter one in air quality prediction. The next types are temporal data in six papers, weather forecast data in five papers, spatial data in five papers, and social media, chemical forecast data, traffic intensity features, each of them in one paper. Open Data: includes information about data accessibility. Taking into consideration the role of reproducibility nowadays, we explored to know which papers provide a link to the dataset used to carry out the experiments. However, it is worth mentioning that reproducibility is not only data; it also refers to code availability [69]. No paper provided code scripts, although the algorithms were available and were explained in the papers. Figure 13 shows data accessibility of the selected studies. We can see that 20 papers for their analysis used open data, 16 papers used private data, and Others includes five papers; in the first paper the authors mentioned that data can be available through the request, the second paper used partially available data, the third study provided link to access to the data, but now the link is not available, and the other two studies mentioned that they took data from Hong Kong Environmental Protection Department without providing any reference. Figure 14 displays the number of publications for data accessibility over the years. We can notice that since 2010 the authors have started to use open data portals to capture data to perform analysis and since 2017, in contrast to papers using private data, the number of publications having open data increased.  Advantages: are the main findings of the methods which can improve the accuracy of the prediction. Here are several findings extracted from the studies. According to Xu and Ren [12] ESN and ELM consume less time to train data than the deep learning model. Compared to the random forest, correlation feature selection, fast correlation-based filter, mutual information, information gain, regularization models, relief-based algorithms, and genetic algorithm, mRMR is preferable for future selection. Ma et al. [13] mentioned that LSTM based algorithms performed better than RNN, and because of bidirectional modelling concept BLSTM provided better results compared to LSTM, integration of IDW improved BLSTM by 5.6% because the spatial factor was taken into consideration. Zhang et al. [52] pointed out the advantage of LightGBM, being a histogram-based algorithm, processes high-dimensional big data better compared to the other boost algorithms. Tao et al. [14] also mentioned the superiority of LSTM compared to RNN and confirmed the advantage of the bidirectional model. According to Ameer et al. [39] compared to a decision tree, gradient boosting and multilinear perception, random forest obtained better results by reducing overfitting and detecting peak values. Although, on the other hand, Li and Ngan [61] mentioned that random forest could have a challenge with fitting a wide variety of data distribution. Shaban et al. [34] noted that M5P compared to SVM and ANN, generalised better, and SVM can manage high dimensional data better than ANN.
Limitation/Future work: are the main reasons that authors considered as a challenge for obtaining higher accuracy. Some authors mentioned limitation, some of them propose to expand the work applying certain mechanism. It can be noticed that in many studies as a limitation was mentioned the lack of the data (also considering data types).
After finding out that the most used metric are MAE and RMSE, the most used time granularity is 24 h, and the most used prediction target is PM 2.5 , we have decided to extract the accuracy from the papers which predicted PM 2.5 for the next 24 h and which measured the accuracy using MAE and RMSE in order to compare machine learning algorithms (Table 3). Table 3 shows the output after the extraction process. It can be seen that there are not many papers matching our criteria, which created some difficulties to complete our comparison. For MAE_24h we have papers applying neural networks and ensemble algorithms, and for RMSE_24h the papers used neural networks, and one paper used regularization and optimization. Looking at the values, we can notice that there is a significant difference between neural networks and regularization-optimization (the latter one has quite a high accuracy: 0.03), which is not applicable to MAE_24h. Overall, because of the lack of information, it is challenging to compare the accuracy of machine learning algorithms.

Conclusions
The objective of this paper is to give a general perception of the current approaches presented related to the air quality prediction concept by reviewing the recent publications. As air quality prediction is a huge topic, we have defined a set of key points in order to narrow the scope and focus on a specific task. To select papers, we inserted a beforehand defined query in the following databases: Scopus and IEEE Xplore repositories. For further observation, we have selected studies published since 2002 and, afterwards, by excluding irrelevant papers based on the inclusion/exclusion criteria. Eventually, 41 manuscripts were selected. Reviewing the chosen papers, we have extracted the essential features and based on the latter findings, the papers were linked, and further comparison was carried out.
Taking into consideration the geographical component, the result shows that China was the leading country being the case study in 26 papers. The important finding was that to increase the accuracy of air quality prediction, it is valuable, in addition to the air quality data, to include also a dataset of other factors that affects the air quality. Thus, in most cases, the authors used meteorological data, and some of them also involved other types of data, such as calendar features, traffic intensity features, spatial features, etc.
Related to the prediction target, the outcome shows that PM 2.5 was the main element, applied in 19 papers, 18 of which utilised data of the cities located in China. Most cases, the authors performed a prediction for the next day. Twenty-seven studies used data hourly collected from the sensors.
Among the analysed works, 20 of them use open data to perform air quality predictions. These works were carried out from 2014 until now, coinciding with the movement of open data within the cities [70]. Therefore, we can affirm that the open data movement has increased the number of research works in the field of machine learning, especially in the prediction of air quality.
Regarding machine learning techniques, the studies used neural networks (38%), regression(24%), ensemble (22%), hybrid (11%) models, one study used regularization and optimization, and the other research applied multinomial naïve bayes and multinomial logistic regression methods. For evaluating applied techniques tailored to the algorithms mentioned above, different metrics were applied. Overall, 29 metrics were applied, from which MAE and RMSE were the most used metrics, each of them being applied in 24 papers. It is very important to mention the challenges regarding the data. First of all, for an accurate air quality prediction it is essential to capture as much relevant data as possible, including weather forecast data, air quality data, meteorological data, etc. Then, it is necessary to apply different techniques (e.g., PCA) to remove redundant data and to select representative subsets for further analysis. It is also important to mention that air quality prediction for a long temporal resolution is a challenging task, as the accuracy decreases with the increase of the prediction interval. Another essential aspect is time complexity; for example, methods based on neural network algorithms to train data, usually require a long time.
In general, it is very difficult to compare the results obtained during analysis of the papers, as they used different data, and they analysed different temporal granularity. As future work, an exhaustive work is proposed. Using all the suggested methods, they should be developed and tested using the same datasets. In this way, the results could be compared in a similar and fair scale.
Author Contributions: Conceptualization, D.I. and F.R.; formal analysis, S.T.; funding acquisition, S.T.; methodology, D.I. and F.R.; supervision, S.T. and F.R.; writing-original draft, D.I.; writing-review and editing, S.T. and F.R. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: The heavy model with 2,1-norm regularization Heavy-nuclear The heavy model with nuclear-norm regularization Heavy-CCL2 The heavy model with CC regularization using the 2-norm Heavy-CCL1 The heavy model with CC regularization using the 1-norm Light-2,1 The light model with 2,1-norm regularization Light-nuclear The light model with nuclear-norm regularization Light-CCL2 The light model with CC regularization using the 2-norm Light-CCL1 The light model with CC regularization using the 1-norm LMA-AV