Machine Learning Regression to Predict Pollen Concentrations of Oleaceae and Quercus Taxa in Thessaloniki, Greece †

Abstract: Airborne pollen triggers allergic reactions in up to 40% of the global population. The incidence of pollen allergies is increasing in Thessaloniki, Greece, and it is predicted that more than 50% of the European Union's inhabitants will suffer from allergic rhinitis by 2025. It is therefore essential to investigate and predict high pollen concentrations. This study used Gradient Boosting Regression (GBR), a machine learning technique, to estimate pollen concentrations of Oleaceae and Quercus taxa from daily meteorological and land surface data obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF). The method accurately predicted pollen concentrations for both taxa, with an Index of Agreement (IoA) of 0.86 for Oleaceae and 0.78 for Quercus, despite the limited size of the dataset.


Introduction
Pollen, a significant environmental factor, has a considerable impact on human health, triggering various respiratory diseases in European cities and affecting up to 40% of the global population [1,2]. Allergy-related respiratory diseases are among the critical public health concerns of the 21st century [3,4]. The European Union has estimated that over half of its population will suffer from allergic rhinitis and/or asthma by 2025, resulting in reduced quality of life, decreased workplace productivity, and increased healthcare costs [5-7]. Atmospheric pollen concentration doubles every decade [8,9]; therefore, predicting and monitoring pollen concentrations is of utmost importance.
Previous studies on predicting pollen concentrations in Thessaloniki, Greece have been based only on observational data [8,18] and in-field measurements [19,20]. Current research on the prediction of pollen concentrations using machine learning techniques is limited and usually depends on extensive datasets. The primary focus has been on K-means clustering algorithms [17] and data-driven modeling methods such as the multi-layer perceptron, support vector regression, and regression trees [21]. These methods have been applied to develop effective prediction models for mean daily pollen concentrations of highly allergenic taxa, such as Oleaceae. Nonetheless, no comparable study has yet applied machine learning techniques to forecast pollen concentrations of Quercus taxa or to work with small datasets.
Environ. Sci. Proc. 2023, 26, 2
The objective of this research is to develop a machine learning approach based on Gradient Boosting Regression (GBR) to estimate pollen concentrations for Oleaceae and Quercus taxa. The proposed method uses daily meteorological and land surface data obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF). The training dataset covers daily pollen concentration measurements from 2016 to 2021 (no data are available for 2018). The final year of the dataset, 2022, is reserved for an independent test of the machine learning model's performance. In addition, the main pollen season is determined for all years.

Pollen Data
Airborne pollen in Thessaloniki was collected using a 7-day recording volumetric spore trap of the Hirst design [22], located at 30 m a.g.l. on the roof of the Department of Biology at Aristotle University of Thessaloniki in the city center (40°37′ N, 22°57′ E) [19,20]. The station has been operating continuously since 1987, following the standard guidelines of the European Aerobiology Society for pollen counting [23]. Measurements are expressed as average daily pollen concentrations (grains/m³). The main pollen season of the Oleaceae and Quercus taxa was identified using the 95% method [24].
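As an illustration of how a main pollen season can be delimited, the sketch below applies one common reading of the 95% method: the season starts on the day the cumulative sum reaches 2.5% of the annual total and ends on the day it reaches 97.5%. These thresholds, the function name `main_pollen_season`, and the treatment of missing days as zero are assumptions made for this example; the exact definition used in the study is the one given in [24].

```python
import numpy as np
import pandas as pd


def main_pollen_season(daily, lower=0.025, upper=0.975):
    """Return (start, end) index labels of the main pollen season.

    daily: pandas Series of daily concentrations for a single year.
    lower/upper: cumulative fractions bounding the season (assumed values).
    """
    values = daily.fillna(0.0)  # assumption: missing days count as zero
    total = values.sum()
    if total <= 0:
        raise ValueError("no pollen recorded in this series")
    cumulative = values.cumsum() / total
    # First day on which each cumulative threshold is reached.
    start = cumulative[cumulative >= lower].index[0]
    end = cumulative[cumulative >= upper].index[0]
    return start, end


# Toy series: 100 days with one grain/m3 each, indexed by day number.
toy = pd.Series(np.ones(100), index=range(1, 101))
season_start, season_end = main_pollen_season(toy)
```

With real data the Series would be indexed by date, and the function would be called once per year and per taxon.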

ECMWF Data
The daily meteorological and land surface contextual data were sourced from the ECMWF reanalysis [25]. Following the methodology of Zewdie et al. [11], 22 predictor variables were used, including the total column water, cloud cover, surface and mean sea level pressures, vertical and horizontal wind speed, soil temperature at various levels, skin temperature, surface albedo, total column ozone, volumetric soil water, dew point temperature at 2 m, surface and 2 m temperature, precipitation, and high and low vegetation cover.

GBR and Analysis
Gradient Boosting Regression (GBR) is a widely used machine learning algorithm that is particularly suited to tabular datasets. It can identify complex, nonlinear relationships between a model's target and its features, and it is highly adaptable, handling both missing values and outliers effectively [26,27].
The GBR model was developed using pollen and ECMWF data collected from 2016 to 2021 (2018 is missing due to a lack of data), with the data from 2022 used to test the model's predictive performance. All input parameters, including the sine of the Julian day, were time-lagged by up to 30 days to capture the relationship between pollen abundance and the atmospheric and land surface conditions of the preceding days. The GBR algorithm was implemented with Friedman's mean squared error as the splitting criterion, and no feature normalization was required. To identify the best combination of hyperparameters, the RandomizedSearchCV function from the scikit-learn library was employed [28]. The function performed 1000 iterations on the training data, exploring various hyperparameter settings. The loss function used was Huber, and the ensemble comprised 300 estimators.
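The workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the study's code: the column names (`t2m`, `tp`), the helper `add_lags`, the form of the seasonal term (sine of day of year over 365.25), the searched hyperparameter grid, the scoring metric, and the 3-fold cross-validation are all assumptions. Only the Huber loss, the Friedman MSE criterion, the 300 estimators, and the use of `RandomizedSearchCV` come from the text; the lag depth and iteration count are reduced here so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV


def add_lags(features, max_lag=30):
    """Append copies of every column shifted back by 1..max_lag days."""
    lagged = [features] + [
        features.shift(lag).add_suffix(f"_lag{lag}")
        for lag in range(1, max_lag + 1)
    ]
    # Rows without a full lag history contain NaN and are dropped.
    return pd.concat(lagged, axis=1).dropna()


# Synthetic stand-in data (NOT the study's data).
rng = np.random.default_rng(0)
days = pd.date_range("2021-01-01", periods=200, freq="D")
weather = pd.DataFrame(
    {"t2m": rng.normal(15, 5, len(days)), "tp": rng.gamma(1.0, 2.0, len(days))},
    index=days,
)
# Assumed form of the seasonal predictor.
weather["sin_doy"] = np.sin(2 * np.pi * days.dayofyear.to_numpy() / 365.25)

X = add_lags(weather, max_lag=3)  # the paper lags inputs by up to 30 days
y = pd.Series(rng.gamma(2.0, 10.0, len(days)), index=days).loc[X.index]

model = GradientBoostingRegressor(
    loss="huber", criterion="friedman_mse", n_estimators=300, random_state=0
)
search = RandomizedSearchCV(
    model,
    param_distributions={  # hypothetical search space
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [2, 3, 4],
        "subsample": [0.7, 0.85, 1.0],
    },
    n_iter=5,  # the paper reports 1000 iterations
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)
predictions = search.best_estimator_.predict(X)
```

In a setting like the one described, `X` and `y` would hold the 2016 to 2021 training years, and the fitted `search.best_estimator_` would be applied to the lagged 2022 predictors. Because the data are a time series, a time-aware splitter such as scikit-learn's `TimeSeriesSplit` could be passed as `cv`; the paper does not state which scheme was used.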

Results and Discussion
Figure 1 depicts the time series of pollen concentrations for the Oleaceae and Quercus taxa for the years 2016 to 2022. Notably, Oleaceae shows elevated concentrations in 2019, reaching a peak of 152 grains/m³ on 28 May. In the remaining years, a consistent pattern is observed: the main pollen season begins in mid- to late March and ends in early July, with concentrations staying below 60 grains/m³. In contrast, Quercus shows higher concentrations, with peak values observed in 2016 (670 grains/m³ on 20 April) and 2021 (654 grains/m³ on 1 May). In 2022, however, concentrations are lower than in previous years, not exceeding 150 grains/m³. The pollination period for Quercus extends from mid-April to mid-June.

Figure 2 illustrates the time series of observed and predicted daily concentrations for the Oleaceae and Quercus taxa for 2022. The GBR model performs satisfactorily in predicting the observed pollen concentrations for both taxa, albeit with a slight underestimation of the peaks. Specifically, the model underestimates the concentrations of Oleaceae at the onset of the main pollen period and then overestimates them during the secondary peaks. Conversely, for Quercus, the model initially overestimates the concentrations and then underestimates them at the peaks.

The statistical metrics presented in Table 1 and defined in Appendix A (MB, Equation (A1); MAE, Equation (A2); NMAE, Equation (A3); IoA, Equation (A4)) further support the satisfactory estimation of daily concentrations by the GBR model. The observed and predicted values agree closely, with an IoA of 0.86 for Oleaceae and 0.78 for Quercus, highlighting the model's effectiveness in capturing the pollen concentrations of both taxa.
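For readers without access to Appendix A, the four metrics can be sketched as below. The definitions are the commonly used ones and are assumptions here: MB is taken as the mean of predicted minus observed, NMAE as MAE divided by the observed mean, and IoA as Willmott's index of agreement. The paper's Equations (A1) to (A4) may differ in sign convention or normalization. The function name `evaluation_metrics` is hypothetical.

```python
import numpy as np


def evaluation_metrics(observed, predicted):
    """Return MB, MAE, NMAE and IoA for paired observed/predicted values."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    error = pred - obs  # assumed sign convention: positive means overestimation
    obs_mean = obs.mean()

    mb = error.mean()             # mean bias
    mae = np.abs(error).mean()    # mean absolute error
    nmae = mae / obs_mean         # MAE normalized by the observed mean
    # Willmott's index of agreement: 1 is a perfect match.
    denom = np.sum((np.abs(pred - obs_mean) + np.abs(obs - obs_mean)) ** 2)
    ioa = 1.0 - np.sum(error ** 2) / denom
    return {"MB": mb, "MAE": mae, "NMAE": nmae, "IoA": ioa}


# Toy example: every prediction is one unit too high.
metrics = evaluation_metrics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

For the toy input, MB and MAE are both 1, NMAE is 0.5, and IoA is 1 - 3/11, roughly 0.73.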

Conclusions
The present study used the Gradient Boosting Regression (GBR) technique to estimate daily pollen concentrations for the Oleaceae and Quercus taxa in Thessaloniki, Greece. The model's accuracy was confirmed by the agreement between the observed and predicted values, and its ability to forecast the timing of the main pollen season was demonstrated. These findings have significant implications for allergy management and the implementation of preventive measures, addressing the growing concern about pollen allergies in the population.


Figure 2. Time series of the observed and predicted (a) Oleaceae and (b) Quercus daily pollen concentrations (2022).


Table 1. Statistical metrics for the evaluation of the GBR model.

Table 2 confirms the GBR model's successful prediction of the peak day and timing of the main pollen season. The actual and predicted peak days for Oleaceae aligned closely, occurring on DOY 145 (25 May) and DOY 146 (26 May), respectively. Similarly, the actual and predicted peak days for Quercus nearly coincided, falling on DOY 117 (27 April) and DOY 118 (28 April), respectively. There was also notable agreement between the predicted and observed start and end dates for both taxa. The GBR model estimated the start and end dates for Oleaceae as DOY 89 (30 March) and DOY 196 (15 July), respectively, and for Quercus as DOY 111 (21 April) and DOY 165 (14 June), respectively. These findings demonstrate the GBR model's reliable estimation of the timing of the main pollen season, providing valuable insights for allergy management and preventive measures.
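The day-of-year (DOY) values quoted above can be cross-checked against calendar dates with a few lines of standard-library Python. The helper names `doy_to_date` and `date_to_doy` are hypothetical and introduced only for this check.

```python
from datetime import date, timedelta


def doy_to_date(year, doy):
    """Convert a 1-based day of year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


def date_to_doy(day):
    """Convert a calendar date to its 1-based day of year."""
    return day.timetuple().tm_yday


# Peak days reported for 2022 (a non-leap year).
oleaceae_actual_peak = doy_to_date(2022, 145)    # 2022-05-25
oleaceae_predicted_peak = doy_to_date(2022, 146) # 2022-05-26
quercus_actual_peak = doy_to_date(2022, 117)     # 2022-04-27
quercus_predicted_peak = doy_to_date(2022, 118)  # 2022-04-28
```

The same conversion confirms the season boundaries given in the text: DOY 89 is 30 March, DOY 196 is 15 July, DOY 111 is 21 April, and DOY 165 is 14 June.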

Table 2. Actual and Expected Dates of Start, End, and Peak of the Main Pollen Season (2022) in Day of Year (DOY).