Features Exploration from Datasets Vision in Air Quality Prediction Domain

Abstract: Air pollution and its consequences are negatively impacting the world population and the environment, which makes techniques for monitoring and forecasting air quality essential tools to combat this problem. To predict air quality with maximum accuracy, it is crucial to consider not only the implemented models and the quantity of data but also the dataset types. This study selected a set of research works in the field of air quality prediction and concentrates on exploring the datasets used in them. The most significant findings of this research work are: (1) meteorological datasets were used in 94.6% of the papers, far ahead of the other dataset types, and were complemented with others, such as temporal data and spatial data; (2) combinations of various datasets have been used since 2009; and (3) open data have been used since 2012: 32.3% of the studies used open data, and 63.4% of the studies did not provide the data.


Introduction
According to the United Nations (UN), in 2018 more than 55% of the world's population lived in urban areas. The trend shows that by 2050 the urban share will increase to 68%; in particular, the urban population will grow faster in Asia and Africa than in other regions, considering that these regions have larger rural populations [1]. Alongside positive effects, such as better employment and education opportunities, an enhanced healthcare system and greater access to social services, urbanisation also has negative consequences, such as air pollution and increased demands on resources. According to the World Health Organisation (WHO), more than seven million people die every year from causes related to air pollution [2].
It is important to understand which pollutants are considered when determining air quality, and how air quality indicators are calculated and represented. Pollutants originate from both natural and anthropogenic sources. The WHO identifies the following pollutants as having serious impacts: particulate matter with a diameter of less than 2.5 micrometers (PM2.5), particulate matter with a diameter of less than 10 micrometers (PM10), nitrogen oxides (NOx), ground-level ozone (O3) and sulfur dioxide (SO2) [3]. Depending on the region and the predominant pollutants, different indices are proposed for calculating air quality, for example, the United States Environmental Protection Agency (EPA) Air Quality Index (AQI), the Canadian Air Quality Health Index (AQHI), the Common Air Quality Index (CAQI) or the Daily Air Quality Index (DAQI), among others.
Information about predicted air quality can prompt authorities and decision-makers to apply protective measures to reduce air pollution, and it helps citizens organise their daily activities so as to avoid highly polluted areas [4,5]. To predict air quality more accurately, it is important to consider external factors that influence air quality and to include them as model inputs. Examples of such external factors are precipitation, wind direction, traffic intensity and population density, among others [6][7][8][9].
The value of publishing this kind of data as open data should also be emphasised, since its availability benefits both governments and citizens. Open data have an impact in many areas, such as increased transparency, improved efficiency and effectiveness of government services, empowerment of citizens, and engagement and participation of citizens in governance [10,11]. At the same time, researchers can use these data as real inputs to their models.
Taking the aspects mentioned above into account, the main goal of this manuscript is to analyse and synthesise studies related to air quality prediction using Machine Learning (ML) technologies, and to find out: (1) What types of datasets are used to improve air quality predictions? (2) What characteristics of the dataset are important for efficient and effective air quality forecasting? and (3) Which features are the most used to define ML models? We believe that this work can be useful for new works in the field of air quality prediction. Furthermore, given how broadly the topic may be addressed, it should be noted that this work takes a data science perspective and considers how the obtained results can be used to start a new study to predict air quality in a certain area.
The rest of the paper is organised as follows. Section 2 explains the methodology. Section 3 presents the obtained results and discusses them. Finally, Section 4 presents the conclusions.

Methods
To achieve the goals of this study, we used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [12] to select relevant papers. The Association for Computing Machinery (ACM), IEEE Xplore, Scopus and Web of Science (WoS) databases were searched using the following query: ("machine learning") AND ("prediction" OR "forecast") AND ("air quality" OR "air pollution"), applied to title, abstract and keywords. In the first stage, all papers published until 28 September 2020 (the search date) were selected, which resulted in 1214 papers in total. Then duplicated and non-empirical manuscripts were removed. Afterwards, based on the inclusion/exclusion criteria listed in Table 1, the title, abstract and keywords were screened and the full text was assessed. Later, the set of manuscripts was filtered by focusing on several aspects. The main emphasis was on selecting papers that concentrate on forecasting models of outdoor air pollution and perform their analyses with ML technologies. Another essential point was the type of datasets: in addition to air quality data, the studies had to include other datasets, such as meteorological, spatial or traffic data, among others. It should also be mentioned that only journal papers were included in the final set, which has 93 items. After reviewing those papers, the key features were extracted; they are presented in detail in the next section. The workflow of the selection procedure is illustrated in Figure 1.
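The identification and de-duplication steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used in the study: the record structure (`title`, `abstract`, `keywords`, `source`), the sample records and the title-based duplicate key are all hypothetical, and each database applies its own matching rules to the query.

```python
import re

# Term groups taken from the search string; a record must match every group.
QUERY_GROUPS = [
    ["machine learning"],
    ["prediction", "forecast"],
    ["air quality", "air pollution"],
]


def matches_query(record):
    """Return True if title, abstract and keywords together satisfy the query."""
    text = " ".join(
        [record.get("title", ""), record.get("abstract", ""), " ".join(record.get("keywords", []))]
    ).lower()
    return all(any(term in text for term in group) for group in QUERY_GROUPS)


def normalise_title(title):
    """Lower-case and drop punctuation so the same paper from two databases compares equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first record seen for each normalised title (assumed duplicate key)."""
    seen = set()
    unique = []
    for record in records:
        key = normalise_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records exported from different databases.
records = [
    {"title": "Machine learning for PM2.5 forecasting", "abstract": "We study air quality in a city.",
     "keywords": ["LSTM"], "source": "Scopus"},
    {"title": "Machine Learning for PM2.5 Forecasting.", "abstract": "We study air quality in a city.",
     "keywords": ["LSTM"], "source": "WoS"},
    {"title": "Indoor comfort survey", "abstract": "A questionnaire study.",
     "keywords": ["comfort"], "source": "ACM"},
]

selected = deduplicate([r for r in records if matches_query(r)])
print(len(selected), selected[0]["source"])
```

Substring matching is used here so that "forecast" also matches "forecasting"; screening against the inclusion/exclusion criteria and the full-text assessment remain manual steps.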

Results and Discussion
This section introduces the main results obtained from analysing the set of manuscripts, together with the exploration and observations based on those results. The following essential components of the selected studies were extracted, and the result is summarised in Table A1 in Appendix A: Year, Case Study, Prediction Target, Dataset Type, Data Rate, Period (Days), Open Data, Algorithm, Time Granularity and Evaluation Metric.
Figure 2 shows that, among the 26 dataset types, meteorological data ('MET') is the most used, appearing in eighty-eight publications. The next most frequent dataset types are 'Temporal', 'Spatial', 'Traffic', 'AOD' and 'Land Use'. Figure 2 shows the number of publications for each dataset type; however, it is also important to see the number of publications for dataset combinations. From the dataset types mentioned above, thirty combinations were formed and used in the publications. Table 2 shows the number of publications for each dataset combination. The most frequent combination is meteorological data together with air quality data only, appearing in forty-five papers. It should be noted that twenty-three dataset combinations each appear in only one publication, so they are grouped as 'Others' for the convenience of further analysis.
Year: the year of publication. Figure 3 shows the distribution of the dataset combinations over the years, with the number of publications for each year, which helps to identify the progress throughout the period.
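The counting behind the dataset-type and dataset-combination tallies can be sketched as follows. The paper records, the `min_papers` threshold and the function names are hypothetical and serve only to show how single-occurrence combinations are merged into 'Others'; they do not reproduce Table A1.

```python
from collections import Counter

# Hypothetical records: each paper lists the dataset types used besides air quality data.
papers = [
    {"id": "p1", "datasets": ["MET"]},
    {"id": "p2", "datasets": ["MET"]},
    {"id": "p3", "datasets": ["Temporal", "MET"]},
    {"id": "p4", "datasets": ["MET", "Temporal"]},
    {"id": "p5", "datasets": ["MET", "Traffic", "Land Use"]},
    {"id": "p6", "datasets": ["AOD"]},
]


def count_types(papers):
    """Number of papers in which each dataset type appears."""
    return Counter(t for p in papers for t in set(p["datasets"]))


def count_combinations(papers, min_papers=2):
    """Papers per combination; combinations with fewer than min_papers papers become 'Others'."""
    # Sorting makes the combination label independent of the order in which types were listed.
    raw = Counter(", ".join(sorted(set(p["datasets"]))) for p in papers)
    merged = Counter()
    for combo, n in raw.items():
        merged[combo if n >= min_papers else "Others"] += n
    return merged


type_counts = count_types(papers)
combo_counts = count_combinations(papers)
met_share = 100 * type_counts["MET"] / len(papers)
print(type_counts.most_common(3), dict(combo_counts), round(met_share, 1))
```

Applied to the figures reported in this review, the same share calculation gives 88 / 93 ≈ 94.6% for 'MET' and 45 / 93 ≈ 48.4% for 'MET' combined with air quality data only.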
It can be observed that dataset combinations have been applied intensively since 2016, particularly during 2019 and 2020. The combination of air quality data with meteorological data only was dominant throughout the whole period. The increase in the number of manuscripts can be attributed to the open data movement promoted by governments [13]. This aspect will be analysed later.
Case Study: the country that served as a case study in a paper. China was the most frequent case study (forty papers). The remaining countries, with the number of publications, are: USA (six); Taiwan (six); India (four); Iran (four); South Korea (four); UK (three); Canada (two); Ecuador (two); Egypt (two); Europe (two); France (two); Italy (two); Kuwait (two); Saudi Arabia (two); Turkey (two); Germany (one); Jordan (one); Mongolia (one); Poland (one); Qatar (one); Slovenia (one); Spain (one); Thailand (one); and Tunisia (one). Apart from this examination, it is also helpful to know the dataset combinations for each case study. Figure 4 illustrates the distribution of dataset combinations in terms of the case study. As may be noted, China was a case study in papers with most of the dataset combinations, except for 'MET, Spatial, Temporal'; China with 'MET' is the dominant pairing (twenty-one papers).
Prediction Target: the dominant pollutant in a certain area, for whose prediction different techniques have been applied. In general, seventeen prediction targets were used: PM2.5, O3, NOx, PM10, air quality index (AQI), SO2, carbon monoxide (CO), ultrafine particles (UFP or PM0.1), particulate matter less than 0.1 micrometers in diameter, air quality health index (AQHI), individual air quality index (IAQI), ammonia (NH3), particle number concentrations (PNCs; the total number of particles per unit volume of air [14]), particles less than 10 nanometers (PN10), black carbon (BC), suspended particulate matter (SPM) and carbon dioxide (CO2).
As mentioned in the introduction, several indices help to interpret air pollution. Figure 5 presents the distribution of dataset combinations in terms of prediction target, and it can be seen that the prediction target can be an individual pollutant as well as an air quality index. However, the prevailing targets are individual pollutants, particularly PM2.5, O3, NOx and PM10, which can be explained by the importance of those pollutants. Moreover, according to the United States Environmental Protection Agency (USEPA), air quality in a certain area is defined by the above-mentioned pollutants [15]. PM2.5, the most used prediction target (forty-eight papers), appears in publications with all the combinations; its pairing with 'MET' is the most used by researchers (twenty-one papers). It is noteworthy that the development of technology makes it possible to observe finer particles (PM0.1, PN10 [16,17]), which have higher toxicity and are easily inhaled.
Data Rate: the time interval at which the sensors provided data. Figure 6 shows the distribution of dataset combinations in terms of data rate. Overall, biweekly, daily, hourly, minutely, secondly, 15 min, 5 min and 5 s data rates were used in the studies, and nine studies did not provide information about the data rate. The hourly data rate, the most used (fifty-six papers), appears in publications with all combinations; its pairing with 'MET' is the most used by researchers (thirty-two papers).
Regarding the data collection period, the most frequent value is 365 days (nine papers). Connecting this feature with the data rate gives an idea of the volume of data used for the analysis. Obviously, it cannot provide any guarantee about the quality of the data, which can include noise; however, we assume that the final utilised data were not reduced significantly by the data cleaning process.
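How the collection period and the data rate together bound the data volume can be illustrated with a short calculation. The sketch below is an assumption-laden illustration: the function name is hypothetical, 'biweekly' is read as one sample every fourteen days, and the result is an upper bound per sensor and variable before any cleaning.

```python
# Samples per day for the data rates reported in the reviewed studies.
# Assumption: "biweekly" means one sample every 14 days.
SAMPLES_PER_DAY = {
    "biweekly": 1 / 14,
    "daily": 1,
    "hourly": 24,
    "15 min": 24 * 4,
    "5 min": 24 * 12,
    "minutely": 24 * 60,
    "5 s": 24 * 60 * 12,
    "secondly": 24 * 60 * 60,
}


def expected_samples(period_days, data_rate):
    """Upper bound on the number of records per sensor and variable, before cleaning."""
    return round(period_days * SAMPLES_PER_DAY[data_rate])


# The most frequent period (365 days) combined with the most frequent data rate (hourly).
print(expected_samples(365, "hourly"))  # 8760
print(expected_samples(365, "daily"))   # 365
```

Missing values and removed outliers reduce these counts, so the estimate indicates scale rather than the final training set size.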
Open Data: information about data availability. Figure 7 illustrates the distribution of dataset combinations in terms of data availability. There are three categories: Yes, No and Partially. The first two indicate whether or not the authors provide the data used in the study; Partially refers to studies where the authors provided only part of the data. It is also interesting to look at data accessibility throughout the period. Figure 8 shows that authors started to use open data in their research in 2012, which, interestingly, corresponds to the period when the ideas of open data portals [18,19] and smart cities [20] appeared. Figure 9 displays the data availability per case study. It can be observed that China includes all three categories.
It is also interesting to observe the relation between the authors' affiliations and the case study of a given work. The results show that in the majority of the papers (fifty-five), the affiliations of all the co-authors are located in the corresponding case study country. In eleven papers, the authors' affiliations are located in countries different from the case study. For example, in [21], the authors' affiliations are located in China and the case study is the USA. In twenty-seven papers, the co-authors' affiliations partially correspond to the case study. For instance, in [22] the case study is Canada and the authors' affiliations are in China and Canada.
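The three affiliation categories used above amount to a simple set comparison. The following sketch shows one way to express it; the function name and the category labels are hypothetical, and the rule (subset, overlap, disjoint) is our reading of the description rather than a documented procedure.

```python
def affiliation_match(affiliation_countries, case_study_countries):
    """Classify a paper by how its authors' affiliation countries relate to the case study."""
    affiliations = set(affiliation_countries)
    case_studies = set(case_study_countries)
    if affiliations <= case_studies:
        return "all"        # every affiliation lies in a case study country
    if affiliations & case_studies:
        return "partial"    # some, but not all, affiliations lie in a case study country
    return "none"           # no affiliation lies in a case study country


# The two examples mentioned in the text.
print(affiliation_match(["China"], ["USA"]))               # none
print(affiliation_match(["China", "Canada"], ["Canada"]))  # partial
print(affiliation_match(["China"], ["China"]))             # all
```

The three reported counts (fifty-five, eleven and twenty-seven) add up to the ninety-three reviewed papers, so the categories are mutually exclusive.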
Algorithm: the ML algorithms on which the applied methods are based. Given the different prediction targets and methods, it is valuable to see whether there is any relation between them, in order to figure out which methods are used to predict a particular target. According to the results of the study, the following connections were detected (main prediction targets and corresponding methods): PM: LSTM, SVM, RF; O3: MLP, RNN; NOx: SVM, RF, RNN; SO2: SVM; CO: LSTM; AQI: SVM.
Time Granularity: the time interval for which the prediction was made. Figure 11 shows the distribution of dataset combinations in terms of time resolution. The time resolutions used are 1 h, 2 h, 3 h, 4 h, 6 h, 8 h, 10 h, 12 h, 24 h, 48 h, 72 h, five days, one week, 15 days and one month. It must be mentioned that these extracted intervals are the maximum intervals applied in each article. 24 h is the most used time resolution in terms of both the number of publications and the number of different dataset combinations. Furthermore, the longest prediction time resolution, one month, is applied in a publication with the 'Others' combination; considering that a longer resolution decreases the accuracy, it is not surprising that only one paper implemented a one-month prediction.
Evaluation Metric: the measures used to evaluate the applied method. Overall, sixty-nine metrics were used to evaluate the methods, of which the most used are Root Mean Square Error (RMSE), in seventy-seven papers, and Mean Absolute Error (MAE), in forty-two papers. Figure 12 shows the distribution of dataset combinations in terms of evaluation metric (each dataset combination is marked with a different colour). Compared to other dataset combinations, 'MET', 'MET, Temporal' and 'Others' were paired with more metrics; in particular, RMSE with 'MET' (forty-one papers) and MAE with 'MET' (twenty-four papers) are the most used pairings. Additionally, taking into consideration the most used prediction target (PM2.5) and the most used time resolution (24 h), the results show that PM2.5 was the prediction target in eighteen papers with the combination of RMSE and 'MET' and in ten papers with the combination of MAE and 'MET', and that 24 h was the predicted time resolution in ten papers with the RMSE and 'MET' combination and in six papers with the MAE and 'MET' combination. Furthermore, the metrics used in more than six publications are displayed in Table 3 with the corresponding equations and descriptions. These metrics are RMSE, MAE, Coefficient of Determination (R²), Correlation Coefficient (R), Mean Absolute Percentage Error (MAPE), Index of Agreement (IA), Mean Square Error (MSE) and Normalised Root Mean Square Error (NRMSE) [23][24][25][26][27][28][29][30]. Another point that deserves attention is how to choose the best and most acceptable model given the variety of evaluation metrics. To select the best model, the majority of the authors chose different benchmark models and, applying the same validation metrics to all models, picked the one that outperformed the others. Only a few authors, such as Goulier et al. [31] and Zhang et al. [32], have focused on the importance of testing whether the model performs well enough to be acceptable. It is important to follow evaluation studies to ensure that the evaluation procedure is correct. For example, the articles by Kadiyala and Kumar [33], Alexander et al. [34] and Janssen et al. [35] can serve as a guide for researchers.

Figure 12. The number of publications of dataset combinations in terms of evaluation metrics.
It is worth mentioning the limitations noted by the authors in their works. The accuracy of model performance depends on many factors, such as ML algorithms, spatial characteristics, prediction targets and temporal resolution. Several authors have mentioned structural limitations of the algorithms, such as a tendency to overfit, complexity, difficulty of interpretation and long computation times [36][37][38]. Regarding the prediction target, the accuracy may vary depending on which pollutant is predicted, since the pollutants differ in chemical structure. For example, Li et al. [39] found that their proposed model predicts PM2.5 better than NOx, as NOx is highly reactive and has larger temporal variability. Therefore, many studies mentioned applying the proposed model to other pollutants as future work [21,40]. Another limitation is the lack of data at sufficient spatiotemporal resolution [41,42]. Missing values can also be included in this scope: depending on their quantity, the performance can be significantly reduced [43,44]. Another important factor is the presence of sudden changes. One solution might be to collect more data, as the training dataset will then include more sudden changes, which in turn will lead to better performance when sudden changes occur [42]. Including other datasets, such as aerosol optical depth data and meteorological data, can help to overcome this issue [45]. It might also be useful to apply techniques for handling imbalanced datasets [40]. Another limitation, already mentioned, is prediction with a long temporal resolution: due to the accumulated error, the accuracy decreases as the temporal resolution increases [46,47].
Table 3. The most used metrics (more than six publications) with corresponding equations and definitions (where N is the number of predicted days, O_i and P_i are the observed and predicted values, respectively, and Ō is the average of the observed data).

Metrics   Description
RMSE      It measures the geometric difference between observed and predicted data.
MAE       It measures the average magnitude of the errors in a set of predictions, without considering their direction.
R²        It shows how differences in one variable can be explained by a difference in a second variable.
R         It measures the strength and the direction of a linear relationship between two variables.
MAPE      It measures the size of the error in percentage terms.
IA        It is the ratio of the mean square error and the potential error.
MSE       It measures the average squared difference between the observed and the predicted values.
NRMSE     It is the normalised version of RMSE, which makes it easier to compare different models with different scales.
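For readers who want to compute the metrics of Table 3, a minimal pure-Python sketch is given below. It follows the commonly used textbook definitions, which are assumptions here and may differ in detail from the equations used in individual reviewed papers: R² is taken as 1 - SS_res / SS_tot, R as the Pearson coefficient, IA as Willmott's index (one minus the ratio of squared error to potential error), and NRMSE as RMSE divided by the observed mean (other normalisations, such as the observed range, are also in use). The sample values are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def mse(obs, pred):
    """Mean of the squared differences between observed and predicted values."""
    return _mean([(o - p) ** 2 for o, p in zip(obs, pred)])


def rmse(obs, pred):
    return math.sqrt(mse(obs, pred))


def mae(obs, pred):
    return _mean([abs(o - p) for o, p in zip(obs, pred)])


def mape(obs, pred):
    """Error in percentage terms; undefined when an observed value is zero."""
    return 100 * _mean([abs((o - p) / o) for o, p in zip(obs, pred)])


def r2(obs, pred):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    o_bar = _mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - o_bar) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def pearson_r(obs, pred):
    """Strength and direction of the linear relationship between the two series."""
    o_bar, p_bar = _mean(obs), _mean(pred)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def index_of_agreement(obs, pred):
    """Willmott's index (assumed form): 1 - squared error / potential error."""
    o_bar = _mean(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2 for o, p in zip(obs, pred))
    return 1 - num / den


def nrmse(obs, pred):
    """RMSE divided by the observed mean (one of several conventions)."""
    return rmse(obs, pred) / _mean(obs)


# Hypothetical observed and predicted concentrations.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

for name, fn in [("RMSE", rmse), ("MAE", mae), ("R2", r2), ("R", pearson_r),
                 ("MAPE", mape), ("IA", index_of_agreement), ("MSE", mse), ("NRMSE", nrmse)]:
    print(name, round(fn(observed, predicted), 4))
```

RMSE, MAE and MSE are scale-dependent and are best compared for the same target and unit, whereas R², R, IA, MAPE and NRMSE are dimensionless and easier to compare across studies.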

Conclusions
Predicting air quality with higher accuracy is becoming more important and necessary day by day. Therefore, it is essential to explore the state of the art of the field. Of the numerous aspects of this research field, this article focuses, through a review of studies, on datasets, in order to examine which datasets are used by researchers and to identify the additional variables that they take into account when predicting air quality. A set of the most relevant papers in this field was selected using the ACM, IEEE Xplore, Scopus and WoS databases. Overall, ninety-three papers were selected and reviewed, and afterwards the essential dataset features were extracted and synthesised (Year, Case Study, Prediction Target, Dataset Type, Data Rate, Period (Days), Open Data, Algorithm and Time Granularity). The results show that twenty-six dataset types are used to supplement data collected by air quality sensors, including 'MET', 'Temporal', 'Spatial' and 'Social Media', among others. 'MET' stands out as the dominant dataset, used in 94.6% of the studies; 48.4% of the studies combined it with air quality data only.
Regarding data availability, it was shown that a new stage began in 2012, associated with the use of open data portals [48]. Open data are crucial for science: they contribute to the improvement and development of various research fields and encourage the emergence of new results, which, in turn, has also led to an increase in the number of publications.
Another important finding concerns which methods are most commonly used to predict a specific target; for example, LSTM, SVM and RF were found to be the most commonly used methods for predicting particulate matter.
In general, it may be inferred that additional datasets can be of significant importance, and that involving them in the analysis could improve air quality prediction and yield more accurate results. However, it is difficult to indicate which datasets are more valuable, and it should also be noted that including many datasets is not always advisable, since a huge dataset requires more training time and may contain redundant data.
Therefore, future work can address the establishment of a framework based on the same conditions (model, prediction target, evaluation metric, time resolution) with the objective of validating and comparing the improvement contributed by each dataset type.
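One possible shape for such a framework is sketched below: the model, target, metric and train/test split are held fixed while one dataset type at a time is added to a base feature set. Everything in the sketch is an assumption made for illustration, including the function names, the toy records, the feature names and the nearest-neighbour placeholder model; it is not a result of this review.

```python
import math


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nearest_neighbour_predict(train_x, train_y, test_x):
    """Placeholder model: predict the target of the closest training row (unscaled features)."""
    predictions = []
    for x in test_x:
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_x]
        predictions.append(train_y[distances.index(min(distances))])
    return predictions


def compare_dataset_types(train, test, target, base, extras, model=nearest_neighbour_predict):
    """RMSE of the same model on the base features and on base plus each extra feature group."""
    def columns(rows, names):
        return [[row[n] for n in names] for row in rows]

    train_y = [row[target] for row in train]
    test_y = [row[target] for row in test]
    scores = {}
    for label, added in [("base", [])] + list(extras.items()):
        names = base + added
        pred = model(columns(train, names), train_y, columns(test, names))
        scores[label] = rmse(test_y, pred)
    return scores


# Toy data: the target depends on both the previous PM2.5 value and the wind speed.
train = [
    {"pm25_lag": 20, "wind": 1, "pm25": 30},
    {"pm25_lag": 20, "wind": 5, "pm25": 10},
    {"pm25_lag": 40, "wind": 1, "pm25": 60},
    {"pm25_lag": 40, "wind": 5, "pm25": 20},
]
test = [
    {"pm25_lag": 21, "wind": 5, "pm25": 11},
    {"pm25_lag": 39, "wind": 1, "pm25": 58},
]

scores = compare_dataset_types(train, test, target="pm25",
                               base=["pm25_lag"], extras={"MET": ["wind"]})
print(scores)
```

Because every run shares the model, target, metric and split, a lower score for a given label can be attributed to the added dataset type; in practice the features would need scaling and the comparison would be repeated over several splits.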

Figure 1. PRISMA flow diagram for the systematic review (n is the number of papers).

Figure 2. The number of publications per dataset type.

Figure 3. The distribution of the dataset combinations throughout the years.

Figure 4. The number of publications of dataset combinations in terms of case study.

Figure 5. The number of publications of dataset combinations in terms of prediction target.

Figure 6. The number of publications of dataset combinations in terms of data rate.
Period (Days): the period (the number of days) of data collection. The summary statistics reveal a mean of 1300.63 days (Std. Dev.: 1484.68) and a median of 731 days (Min: 3 and Max: 8023).
Figure 10 shows the distribution of dataset combinations in terms of ML algorithms.

Figure 7. The number of publications of dataset combinations in terms of data availability.
Figure 8.

Figure 9. Data availability per case study.

Figure 10.The number of publications of dataset combinations in terms of ML algorithms.

Figure 11.The number of publications of dataset combinations in terms of time granularity.

Author Contributions: Conceptualisation, D.I. and F.R.; Formal analysis, S.T.; Funding acquisition, S.T.; Methodology, D.I. and F.R.; Supervision, S.T. and F.R.; Writing-original draft, D.I.; Writing-review & editing, S.T. and F.R. All authors have read and agreed to the published version of the manuscript.
Funding: Ditsuhi Iskandaryan has been funded by the predoctoral programme PINV2018-Universitat Jaume I (PREDOC/2018/61). S.T. has been funded by the Juan de la Cierva-Incorporación postdoctoral programme of the Ministry of Science and Innovation-Spanish government (IJC2018-035017-I). This work has been funded by the Generalitat Valenciana through the Subvenciones para la realización de proyectos de I+D+i desarrollados por grupos de investigación emergentes program (GV/2020/035).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Table 1. Inclusion and exclusion criteria.

Table 2. The number of publications of dataset combinations.
To find out the dataset features used in each research work, each component of Table A1 in Appendix A was observed in terms of dataset types, and the results of the observation are displayed below.
can serve as a guide for researchers.