1. Introduction
Air pollution is a global problem with significant impacts on human health. In particular, fine suspended particles (PM2.5, particles with an aerodynamic diameter ≤ 2.5 μm) penetrate deep into the respiratory system and the bloodstream, causing chronic respiratory and cardiovascular problems and reduced life expectancy [1,2].
Forecasting PM2.5 concentrations is essential for issuing early warnings to the public and for supporting public health policies. Traditional methods are based on statistical models (e.g., ARIMA) [3,4,5], which often fail to capture nonlinear relationships and seasonal patterns [6].
Machine learning and deep neural networks can learn complex relationships between environmental variables and PM2.5 concentrations. These approaches can combine temporal patterns, environmental factors, and past PM values for improved prediction [7,8].
The aim of this work is to develop a comprehensive pipeline for PM2.5 prediction that evaluates different classes of models and examines the trade-off between predictive accuracy and computational cost.
2. Methods
2.1. Data
The data come from air quality sensors and include columns for temperature, humidity, and particulate matter concentrations (PM1, PM2.5, PM10). The analysis covers a two-month period, September and October 2025, in the Agia Paraskevi area of Athens, Greece. The data are recorded on an hourly basis, and the reference data were obtained from the IQAir website [9]. The data were sorted chronologically, and a complete time series was created with a defined measurement frequency. The forecasting task in this study is formulated as a one-step-ahead, short-term prediction of PM2.5 concentrations at an hourly resolution, using a fixed-length historical window of observations. This one-hour-ahead horizon reflects operational requirements for urban air quality monitoring and early warning systems. Sequence-based models such as LSTM and CNN–LSTM exploit temporal dependencies within the input window, whereas tree-based models rely on engineered lag and rolling features to capture short-term trends. Defining the task in this manner ensures transparent, reproducible evaluation across model classes and highlights the practical relevance of accurate short-term forecasts for public health and environmental policy.
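As an illustration of this formulation, the sketch below builds fixed-length input windows and one-hour-ahead targets from an hourly PM2.5 series. The 24 h window length and the column name pm25 are illustrative assumptions, not values reported in this study.

```python
# Minimal sketch: turn an hourly PM2.5 series into (X, y) pairs for
# one-step-ahead forecasting. Assumes a pandas Series with an hourly
# DatetimeIndex; the 24 h window is an illustrative choice.
import numpy as np
import pandas as pd

def make_windows(series: pd.Series, window: int = 24):
    values = series.to_numpy(dtype=float)
    X, y = [], []
    for t in range(window, len(values)):
        X.append(values[t - window:t])  # the past `window` hourly observations
        y.append(values[t])             # the next hour's PM2.5 value
    return np.asarray(X), np.asarray(y)

# Example usage (assuming df is the sensor DataFrame described above):
# X, y = make_windows(df["pm25"], window=24)
```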
2.2. Pre-Processing
The pre-processing included the following steps (illustrated in the code sketch after the list):
1. Data cleaning and filling gaps with forward/backward fill.
2. Handling outliers via IQR capping of PM2.5, reducing the impact of rare events and sensor errors [10].
3. Feature engineering:
Hourly lag features for PM2.5.
Rolling statistics: means and standard deviations over 3, 6, and 12 h windows.
Heat index calculation from temperature and humidity.
Temporal features (hour, day, month, and weekend indicator).
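The sketch below illustrates these steps in pandas. The column names (pm25, temp, rh), the 1.5 × IQR capping multiplier, and the simplified heat-index proxy are assumptions for illustration; the exact formulas used in the pipeline are not reproduced here.

```python
# Hedged pre-processing sketch. Assumes a DataFrame with an hourly
# DatetimeIndex and columns "pm25", "temp" (°C), and "rh" (%).
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_index()
    df = df.ffill().bfill()                       # 1. fill gaps forward/backward

    q1, q3 = df["pm25"].quantile([0.25, 0.75])    # 2. IQR capping of PM2.5
    iqr = q3 - q1
    df["pm25"] = df["pm25"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    for lag in (1, 2, 3):                         # 3. hourly lag features
        df[f"pm25_lag{lag}"] = df["pm25"].shift(lag)
    for w in (3, 6, 12):                          # rolling means and std devs
        df[f"pm25_mean{w}h"] = df["pm25"].rolling(w).mean()
        df[f"pm25_std{w}h"] = df["pm25"].rolling(w).std()

    # Placeholder heat-index proxy; the study's exact formula is not shown here.
    df["heat_index"] = df["temp"] + 0.05 * df["rh"]

    df["hour"] = df.index.hour                    # temporal features
    df["day"] = df.index.day
    df["month"] = df.index.month
    df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
    return df.dropna()
```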
2.3. Models
2.3.1. Tree-Based Models
Random Forest (RF): Ensemble of decision trees for capturing nonlinear relationships and reducing overfitting.
XGBoost/HistGradientBoosting (XGB/HGB): Gradient-boosted trees offering fast training and strong predictive performance. A minimal fitting sketch for both tree-based baselines follows.
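The sketch uses illustrative default hyperparameters, not the fixed settings described in Section 2.3.4, and placeholder data in place of the engineered features.

```python
# Hedged sketch of the tree-based baselines in scikit-learn.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 12))  # placeholder for the engineered features
y_train = rng.normal(size=500)        # placeholder for next-hour PM2.5 targets

rf = RandomForestRegressor(n_estimators=300, random_state=42)
hgb = HistGradientBoostingRegressor(max_iter=300, random_state=42)
rf.fit(X_train, y_train)
hgb.fit(X_train, y_train)
```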
2.3.2. Deep Learning Models
LSTM: Captures long-term dependencies in the time series.
CNN–LSTM: Combines convolutional layers for detecting short-term patterns with LSTM layers for long-term trends. A grid search was performed over conv_filters, kernel_size, lstm_units, and dropout; a sketch of the architecture follows.
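The sketch below shows one possible Keras realization of such a CNN–LSTM; the layer sizes are sample points from the grid-search space, not the final selected configuration.

```python
# Hedged CNN–LSTM sketch (Keras). The default hyperparameters are
# illustrative candidates, not the tuned values from the study.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Input, LSTM, MaxPooling1D

def build_cnn_lstm(window: int, n_features: int, conv_filters: int = 32,
                   kernel_size: int = 3, lstm_units: int = 64,
                   dropout: float = 0.2) -> Sequential:
    model = Sequential([
        Input(shape=(window, n_features)),
        # convolution detects short-term local patterns within the window
        Conv1D(conv_filters, kernel_size, activation="relu"),
        MaxPooling1D(pool_size=2),
        # LSTM aggregates the convolved sequence into longer-term trends
        LSTM(lstm_units),
        Dropout(dropout),
        Dense(1),  # one-step-ahead PM2.5 prediction
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```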
2.3.3. Hybrid Deep Learning Models
LSTM + RF (Hybrid): LSTM embeddings were used as input to a Random Forest, combining temporal pattern learning with the stability of tree-based models.
2.3.4. Hyperparameter Tuning and Hybrid Model
The hyperparameter tuning strategy was designed to ensure transparency and fair comparison among the examined models. For the tree-based models (Random Forest and XGBoost/HistGradientBoosting), fixed hyperparameter settings were adopted based on values commonly used in the literature for air quality prediction tasks. This choice reflects the relative robustness of these models to moderate hyperparameter variations and limits the computational overhead.
Similarly, the standalone LSTM model was trained using a fixed architecture and predefined training parameters, serving as a baseline deep learning approach for comparison with more complex models.
In contrast, explicit hyperparameter tuning was applied to the CNN–LSTM model due to its increased architectural complexity and sensitivity to configuration choices. A grid search was conducted over key parameters, including the number of convolutional filters, kernel size, number of LSTM units, and dropout rate, with the final configuration selected based on validation performance.
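Such a grid search can be implemented as an exhaustive loop over the candidate configurations, as sketched below using the build_cnn_lstm helper from Section 2.3.2. The candidate values and the variables X_tr, y_tr, X_val, and y_val are assumptions; the exact grid is not listed in this paper.

```python
# Hedged grid-search sketch over the CNN–LSTM hyperparameters.
from itertools import product

grid = {
    "conv_filters": [16, 32, 64],
    "kernel_size": [2, 3],
    "lstm_units": [32, 64],
    "dropout": [0.1, 0.2],
}

best_loss, best_cfg = float("inf"), None
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    model = build_cnn_lstm(window=24, n_features=1, **cfg)
    history = model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
                        epochs=20, batch_size=32, verbose=0)
    val_loss = min(history.history["val_loss"])
    if val_loss < best_loss:                 # keep the configuration with
        best_loss, best_cfg = val_loss, cfg  # the best validation performance
```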
A hybrid LSTM–Random Forest model was also implemented to combine temporal feature learning with tree-based regression stability. After training the LSTM on fixed-length input sequences, feature representations were extracted from the final LSTM hidden layer. These embeddings were concatenated with the feature vector of the last time step of each input window and used as input to a Random Forest regressor for the final PM2.5 prediction.
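A sketch of this hybrid step is given below. It assumes a trained Keras model lstm_model whose layer before the final Dense head is the LSTM, sequence inputs X_train_seq of shape (samples, window, features), and targets y_train; all of these names are illustrative.

```python
# Hedged sketch: LSTM embeddings + last-step features -> Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from tensorflow.keras import Model

# Auxiliary model that outputs the activations of the layer just before
# the final Dense head (the LSTM hidden state under the assumed layout).
embedder = Model(inputs=lstm_model.input,
                 outputs=lstm_model.layers[-2].output)

embeddings = embedder.predict(X_train_seq)   # temporal representations
last_step = X_train_seq[:, -1, :]            # feature vector of last time step
hybrid_X = np.concatenate([embeddings, last_step], axis=1)

rf_head = RandomForestRegressor(n_estimators=300, random_state=42)
rf_head.fit(hybrid_X, y_train)               # final PM2.5 regressor
```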
2.4. Training and Assessment
The models were trained and evaluated using a time-split training-test approach, with 80% of the dataset used for training and the remaining 20% for testing. This approach preserves the temporal structure of the PM2.5 data and ensures that the models are evaluated against future, unobserved values, which is crucial for forecasting applications.
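A minimal sketch of this chronological split, assuming window arrays X and y as constructed in Section 2.1:

```python
# Time-ordered 80/20 split: the first 80% of the windows are used for
# training, the final 20% (future, unseen hours) for testing. No shuffling.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```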
Predictive performance was quantified using several metrics. The Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE) were used to assess the overall accuracy of the forecasts, while the Mean Absolute Scaled Error (MASE) accounted for seasonal variations in the time series, thus providing a scaled measure that facilitates the comparison of different models.
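Since MASE is not available in scikit-learn, a sketch of its computation is shown below alongside MSE and RMSE. The variables y_test and y_pred are assumed to come from the held-out split and a fitted model, and the seasonal period m = 24 (a daily cycle in hourly data) is an assumption, not a value stated in this paper.

```python
# Hedged metrics sketch: MSE and RMSE via scikit-learn, MASE following
# the standard definition (forecast MAE scaled by the in-sample MAE of
# a seasonal naive forecast).
import numpy as np
from sklearn.metrics import mean_squared_error

def mase(y_true, y_pred, y_train, m: int = 24) -> float:
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / naive_mae)

mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))
scaled_error = mase(y_test, y_pred, y_train)
```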
In addition to accuracy, computational efficiency was systematically measured by monitoring the training time of each model. This allows for a comparison of the ratio of predictive performance to computational effort, which is particularly important for real-time applications or large datasets. By combining the aforementioned evaluation criteria, the study provides a comprehensive understanding of both the effectiveness and the practicality of the individual modeling approaches.
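Training time can be recorded with a wall-clock timer around the fit call, for example:

```python
# Hedged timing sketch; the same pattern applies to the Keras models.
import time

t0 = time.perf_counter()
rf.fit(X_train, y_train)   # any of the models from Section 2.3
train_time_s = time.perf_counter() - t0
print(f"Training time: {train_time_s:.2f} s")
```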
3. Results and Discussion
To provide a quantitative assessment of model performance, key prediction metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Scaled Error (MASE), and training time, were systematically collected for each model.
Table 1 summarizes these metrics, facilitating a clear and comprehensive comparison of both predictive accuracy and computational efficiency across the different modeling approaches.
Following the model-specific insights, three evaluation plots were analyzed for each model to further assess their performance and provide a more nuanced understanding of how each model behaves in practice.
The results presented in the table illustrate the differences in prediction accuracy and computational efficiency among the investigated machine learning models. CNN–LSTM achieves the best accuracy, with the lowest values for MSE (5.3891), RMSE (2.3214), and MASE (0.1848). This performance demonstrates the ability of the combined convolutional and recurrent structure to recognize both short- and long-term patterns in the time series. However, the price for this high accuracy is the very long training time (482.32 s), which restricts the model’s use to environments with ample computational resources or infrequent retraining requirements.
The hybrid LSTM + Random Forest model offers the best compromise between accuracy and computational effort. It exhibits noticeably higher accuracy than the pure tree-based models (RMSE = 2.5872) and a lower error than the plain LSTM, while its training time (161.21 s) is considerably shorter than that of the pure deep learning models. This makes it suitable for near-real-time applications where a balance between speed and accuracy is required.
In contrast, the tree-based Random Forest and XGB/HGB models demonstrate the highest computational efficiency. XGB/HGB is the fastest model in the study (0.91 s), completing training almost instantly and achieving remarkable accuracy (RMSE = 2.8950). Random Forest is also fast (7.61 s) and shows slightly higher accuracy than XGB/HGB, making it suitable for applications where a stable, simple, and interpretable model without high computational overhead is desired. However, both tree-based models exhibit a higher MASE value compared to the deep learning models, indicating that while they adequately represent the general trends of the time series, they do not capture the strong short-term fluctuations with the same level of detail.
The classic LSTM shows moderate accuracy (RMSE = 2.8844), similar to the tree-based models, but with a significantly longer training time (235.83 s). This indicates that, without additional structures such as convolutions or hybrid approaches, the LSTM’s extra computational cost does not translate into significantly better results. Nevertheless, it captures longer-term patterns better than the tree-based models, as the corresponding prediction curves illustrate.
Overall, the results show that the choice of the optimal model depends on the application goal. CNN–LSTM is the best choice when the highest accuracy is paramount and computational cost is not a concern. Tree-based models, especially XGB/HGB, are ideally suited for real-time use and frequent retraining, while the hybrid LSTM + RF represents a promising intermediate solution that combines the advantages of both approaches.
Furthermore, analysis of the “True vs. Predicted” plots (Figures 1–5) provides a more detailed understanding of model behavior, complementing the quantitative metrics in the table. Each plot highlights how a model tracks the PM2.5 time series and allows an assessment of its ability to capture both general trends and strong short-term fluctuations. The insights drawn from the plots were based on a visual comparison of the alignment of predictions with actual values, the ability of the models to reproduce peaks in the time series, and the response time to sharp changes.
The tree-based models (RF and XGB/HGB) capture the general trends satisfactorily but show smoothing at the peaks and a small delay at fast changes. The LSTM captures medium-term patterns better, but still misses some peaks. CNN–LSTM exhibits the highest fidelity, closely following the shape of the true curve and accurately representing both trends and short-term fluctuations. The LSTM + RF hybrid model exhibits intermediate behavior, recording more peaks than the tree-based models, but with less accuracy than CNN–LSTM. The visual analysis, therefore, confirms the quantitative metrics and strengthens the conclusions drawn from the MSE, RMSE, and MASE values.
4. Conclusions
This study presented a comprehensive pipeline for predicting PM2.5 concentrations, which includes extensive data preprocessing and feature engineering, such as lag features, rolling statistics, and heat index calculation. In addition, different classes of models were evaluated, including tree-based models (Random Forest, XGB/HGB), deep neural networks (LSTM, CNN–LSTM), and hybrid models combining LSTM embeddings with Random Forest. The analysis compared predictive accuracy using the MSE, RMSE, and MASE metrics and recorded the computational requirements of each model.
The results showed that CNN–LSTM offers the highest predictive accuracy, capturing short-term fluctuations of PM2.5 with high fidelity, although it requires significant training time. Tree-based models, such as XGB/HGB and Random Forest, are highly computationally efficient and provide good generalization for the underlying trends of the time series, making them suitable for real-time applications and large datasets. The hybrid combination of LSTM embeddings with Random Forest offers a balanced compromise between accuracy and training time, providing stable and precise predictions that are suitable for practical applications.
Overall, the choice of the appropriate model depends on the application goal. CNN–LSTM is ideal for maximum prediction accuracy, XGB/HGB for speed and scaling on large datasets, while the hybrid model offers a balance between accuracy and computational cost. The results of the study support the development of practical air quality forecasting systems and the implementation of public health measures.
Author Contributions
Conceptualization, K.O. and I.C.; methodology, K.O. and V.T.; software, K.O. and S.M.; validation, K.O., S.M. and I.C.; formal analysis, V.T.; investigation, K.O. and S.M.; resources, K.O.; data curation, K.O. and S.M.; writing—original draft preparation, K.O. and S.M.; writing—review and editing, V.T. and I.C.; visualization, K.O.; supervision, V.T.; project administration, K.O. and I.C.; funding acquisition, I.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
All of the data created in this study are presented in the context of this article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Pope, C.A., III; Dockery, D.W. Health Effects of Fine Particulate Air Pollution: Lines that Connect. J. Air Waste Manag. Assoc. 2006, 56, 709–742.
2. WHO. Air Quality Guidelines for Particulate Matter, Ozone, Nitrogen Dioxide and Sulfur Dioxide; World Health Organization: Geneva, Switzerland, 2006.
3. Christakis, I.; Moutzouris, K.; Tsakiridis, O.; Stavrakas, I. Barometric Pressure as a Correction Factor for Low-Cost Particulate Matter Sensors. IOP Conf. Ser. Earth Environ. Sci. 2022, 1123, 012068.
4. Christakis, I.; Sarri, E.; Tsakiridis, O.; Stavrakas, I. Identification of the Safe Variation Limits for the Optimization of the Measurements in Low-Cost Electrochemical Air Quality Sensors. Electrochem 2023, 5, 1–28.
5. Mahajan, S.; Helbing, D. Dynamic calibration of low-cost PM2.5 sensors using trust-based consensus mechanisms. npj Clim. Atmos. Sci. 2025, 8, 257.
6. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; Wiley: Hoboken, NJ, USA, 2015.
7. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
8. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271.
9. IQAir. Empowering the World to Breathe Cleaner Air | IQAir. Available online: https://www.iqair.com (accessed on 9 January 2026).
10. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Reading, MA, USA, 1977.