Handling Missing Air Quality Data Using Bidirectional Recurrent Imputation for Time Series and Random Forest: A Case Study in Mexico City
Abstract
1. Introduction
Reference | Datasets and Missing Rate | Input Variables and Missing Rate | Techniques | Target | Key Findings |
---|---|---|---|---|---|
Kim et al. (2021) [4] | Minute-level air quality data (2020–2021) from South Korea: Guro-gu (24 stations) and Dangjin-si (42 stations). | PM2.5 and PM10 concentrations. Missing Rate (Real/Artificial): 7.91–16.1%/20%. | A novel N-BEATS deep learning model with interpretable blocks (trend, seasonality, and residual) was compared with baseline methods (mean, spatial average, and MICE). | Imputation of missing PM2.5 and PM10 values. | The N-BEATS model outperforms traditional methods and allows interpretability via component decomposition but struggles with long-term seasonal patterns due to fixed-period Fourier terms. |
Alahamade and Lake (2021) [20] | Hourly pollutant data (PM10, PM2.5, O3, and NO2) from the Automatic Urban and Rural Network (AURN) in the UK (2015–2017), covering 167 stations of different types (urban, suburban, rural, roadside, and industrial). | Concentrations of PM10, PM2.5, O3, and NO2. Missing Rate (Real/Artificial): Not quantified (some pollutants entirely missing at certain stations)/Not applied. | Evaluated models: (1) Clustering based on multivariate time series similarity (CA, CA+ENV, and CA+REG); (2) Spatial methods using the 1 or 2 nearest stations (1NN and 2NN); (3) Ensemble method: median of all five approaches. | Imputation of missing PM10, PM2.5, O3, and NO2 using temporal and spatial similarity. | The ensemble method performed best for O3, PM10, and PM2.5, while CA+ENV performed best for NO2. Performance depended on the pollutant and station type. MVTS clustering allowed full imputation; extremes were slightly under- or overestimated. |
Wang et al. (2023) [9] | Hourly air quality data (2019–2022) from 16 stations in Qinghai Province and Haidong City, China. | Multivariate time series of six pollutants (PM2.5, PM10, O3, NO2, SO2, and CO). Missing Rate (Real/Artificial): 5–22%/30%. | BRITS-ALSTM (BRITS encoder + LSTM with attention) was compared to Mean, KNN, MICE, MissForest, M-RNN, BRITS, and BRITS-LSTM. | Imputation of missing pollutants (PM2.5, PM10, O3, NO2, SO2, and CO) data with high and irregular missing rates. | BRITS-ALSTM outperformed baselines for all pollutants and missing patterns. |
He et al. (2023) [5] | Air quality and meteorological data from Mexico City (2005–2019; 42 stations), combined with satellite data from OMI and TROPOMI and reanalysis data from CAMS. | Daily NO2 concentrations, wind speed/direction, temperature, cloud coverage, and satellite-based NO2 columns. Missing Rate (Real/Artificial): Not quantified/Not simulated. | Comparative modeling using RF, XGBoost, and GAM. Missing values were imputed using Random Forest. | Predict daily NO2 surface concentrations in Mexico City. | XGBoost and RF outperformed GAM. The model integrated hybrid sources (ground, satellite, and meteorological) to improve NO2 prediction. |
Colorado Cifuentes and Flores Tlacuahuac (2020) [19] | Hourly air quality and meteorological data (2012–2017) from 13 monitoring stations in the Monterrey Metropolitan area, Mexico. | Pollutants and meteorological variables over a 24-h window. Missing Rate (Real/Artificial): <25%/Not simulated. | A deep neural network (DNN). Missing values were imputed using interpolation for non-seasonal time series. | 24-h ahead prediction of O3, PM2.5, and PM10. | The DNN model achieved good predictive accuracy for all target pollutants, and the imputation process preserved model performance. |
Hua et al. (2024) [10] | Six real-world hourly datasets from Germany, China, Taiwan, and Vietnam. The datasets include ~15,000 to 1.2 million samples. | Pollutant and meteorological variables. Missing Rate (Real/Artificial): 0–42%/10–80%. | Mean, median, KNN, MICE, SAITS, BRITS, MRNN, and transformer. | Evaluate the impact of different imputation strategies on the performance of air quality forecasting models aimed at predicting 24-h concentrations of AQI, PM2.5, PM10, CO, SO2, and O3. | SAITS achieved the highest accuracy, followed by BRITS. KNN performed well on large datasets with high missing rates. MICE was effective on smaller datasets but was slower. The transformer model performed worse than the top methods. |
Zhang and Zhou (2024) [8] | PM2.5 time series (2018–2020; 225 stations) from Xi’an, China, with single, block, and long-interval missing patterns. | PM2.5 data from multiple stations, with spatial-temporal dependencies. Missing Rate (Real/Artificial): <1%/10%, 20%, 30%, and 50%. | TMLSTM-AE was compared to KNN, SVD, ST-MVL, LSTM, and DAE. | Impute complex missing patterns in single-feature PM2.5 time series using spatial-temporal dependencies. | TMLSTM-AE outperforms traditional and baseline methods, especially on long and block missing data. |
2. Materials and Methods
2.1. Database
2.1.1. Study Area
2.1.2. Data Collection and Integration
2.1.3. Missing Data Patterns Analysis and Station Filtering
2.1.4. Missing Data Patterns in Selected Monitoring Station
2.2. Model Training and Evaluation Pipeline
- Missing data identification: binary masks were created to identify real missing values in the dataset.
- Data splitting: each station’s dataset was divided into 80% for training and 20% for testing.
- Hyperparameter optimization: hyperparameter tuning was performed in sequential steps due to computational constraints.
- Artificial missingness for evaluation: to enable controlled performance assessment, 20% of the observed values in the test set were randomly removed under a Missing Completely At Random (MCAR) assumption. This masking allowed direct comparison between imputed and ground-truth values.
- Normalization: for the BRITS method, all variables were rescaled to a common range using min-max normalization (MinMaxScaler), applied separately to the training and test datasets.
- Model Training and Evaluation: after hyperparameter optimization, models were retrained using the full training set and evaluated on the masked test set. Imputation performance was assessed using several evaluation metrics, including MAE, RMSE, Wasserstein distance, and TOST equivalence tests.
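The data-preparation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `split_and_mask` and `minmax_scale`, the chronological (unshuffled) split, the random seed, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def minmax_scale(block):
    """Scale each column to [0, 1] using that block's own min and max, ignoring NaN."""
    lo = np.nanmin(block, axis=0)
    hi = np.nanmax(block, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (block - lo) / span


def split_and_mask(values, train_frac=0.8, mask_frac=0.2, seed=0):
    """80/20 split (assumed chronological) plus MCAR masking of observed test cells.

    values: 2-D float array (time steps x variables); NaN marks real gaps.
    Returns train, masked test, untouched test, and the boolean evaluation mask.
    """
    rng = np.random.default_rng(seed)
    n_train = int(len(values) * train_frac)
    train, test = values[:n_train], values[n_train:].copy()

    observed = ~np.isnan(test)                 # True where a real measurement exists
    candidates = np.flatnonzero(observed)      # flat indices of observed cells
    n_hide = int(round(mask_frac * candidates.size))
    hidden = rng.choice(candidates, size=n_hide, replace=False)

    flat_mask = np.zeros(test.size, dtype=bool)
    flat_mask[hidden] = True                   # cells hidden for evaluation only
    eval_mask = flat_mask.reshape(test.shape)

    test_masked = test.copy()
    test_masked[eval_mask] = np.nan
    return train, test_masked, test, eval_mask


# Synthetic stand-in for one station: 100 time steps, 3 variables (hypothetical data).
rng = np.random.default_rng(1)
data = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
data[5:10, 0] = np.nan    # a real gap in the training part
data[90:95, 1] = np.nan   # a real gap in the test part

real_missing = np.isnan(data)  # binary mask of real missing values
train, test_masked, test_true, eval_mask = split_and_mask(data)
train_scaled = minmax_scale(train)       # scaled separately, as described above
test_scaled = minmax_scale(test_masked)
```

Because only cells that were actually observed are hidden, every position in `eval_mask` has a ground-truth value in `test_true` against which an imputed value can be compared.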
2.3. Model Training
2.3.1. Random Forest (RF) Hyperparameter Optimization
- n_estimators: number of trees in the ensemble.
- max_depth: maximum depth of individual trees.
- max_iter: number of iterations in the iterative imputation process (specific to IterativeImputer).
- min_samples_split: minimum number of samples required to split an internal node.
- min_samples_leaf: minimum number of samples required to be at a leaf node.
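To make the role of these five hyperparameters concrete, the sketch below wires them into scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as the per-variable estimator. The source names `IterativeImputer` and the hyperparameters; the exact wiring, the helper name `build_rf_imputer`, the seed, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def build_rf_imputer(n_estimators=100, max_depth=None, max_iter=10,
                     min_samples_split=2, min_samples_leaf=1, seed=0):
    """Iterative imputer whose per-variable regressor is a random forest."""
    forest = RandomForestRegressor(
        n_estimators=n_estimators,            # number of trees in the ensemble
        max_depth=max_depth,                  # None lets trees grow until leaves are pure
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=seed,
    )
    # max_iter belongs to the imputer, not the forest: it caps the imputation rounds.
    return IterativeImputer(estimator=forest, max_iter=max_iter, random_state=seed)


# Hypothetical data: four correlated variables with about 10% of cells removed.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])
missing = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[missing] = np.nan

imputer = build_rf_imputer(n_estimators=10, max_depth=10, max_iter=5)
X_imputed = imputer.fit_transform(X_missing)
```

Small values of `n_estimators` and `max_iter` are used here only to keep the example fast; observed cells pass through unchanged and only the NaN cells are filled.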
2.3.2. BRITS Hyperparameter Optimization
- RNN units: number of hidden units in the RNN layers. Larger values increase model capacity to learn temporal patterns.
- Subsequence length: input window size used during training.
- Learning rate: step size with which the optimizer updates the model parameters.
- Batch size: number of samples per training batch.
- Use_regularization: whether to apply dropout/L2 regularization to prevent overfitting.
- Dropout_rate: proportion of neurons randomly deactivated during training.
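As an illustration of what the subsequence length controls, the sketch below cuts a station series into fixed-length windows and builds the observation masks that a recurrent imputation model typically consumes. The helper name `make_subsequences`, the non-overlapping stride, and the zero-filling of gaps are assumptions for the example and are not taken from the source.

```python
import numpy as np


def make_subsequences(series, seq_len, stride=None):
    """Cut a (time x variables) array into windows of seq_len steps.

    Returns (values, masks): values has NaN replaced by 0, masks is 1 where observed.
    """
    if len(series) < seq_len:
        raise ValueError("series is shorter than the requested subsequence length")
    stride = seq_len if stride is None else stride   # assumed: non-overlapping windows
    starts = range(0, len(series) - seq_len + 1, stride)
    windows = np.stack([series[s:s + seq_len] for s in starts])
    masks = (~np.isnan(windows)).astype(np.float32)
    values = np.nan_to_num(windows, nan=0.0).astype(np.float32)
    return values, masks


# Hypothetical hourly series: 100 time steps, 3 variables, one short gap.
rng = np.random.default_rng(0)
series = rng.random((100, 3))
series[40:44, 2] = np.nan

values, masks = make_subsequences(series, seq_len=32)
n_batches = int(np.ceil(values.shape[0] / 16))  # batches per epoch for batch size 16
```

With 100 time steps and a subsequence length of 32, three full windows are produced and the trailing four steps are dropped; a longer subsequence gives the model more temporal context per sample but fewer samples per epoch.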
2.4. Model Evaluation
- Mean Absolute Error (MAE) quantifies the average absolute difference between the imputed values $\hat{y}_t$ and the observed values $y_t$ over the $n$ masked points:

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|$$

- Root Mean Squared Error (RMSE) measures the square root of the average squared differences between imputed and observed values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$

- Wasserstein distance, also known as Earth Mover’s Distance (EMD), measures the dissimilarity between two probability distributions by calculating the minimum effort required to transform one distribution into another [22]. In the context of air quality time series, it is useful for comparing distributions of imputed and observed values. Given two cumulative distribution functions, $F_o(x)$ for the observed values and $F_i(x)$ for the imputed values, the first-order Wasserstein distance is defined as:

$$W_1(F_o, F_i) = \int_{-\infty}^{\infty}\left|F_o(x) - F_i(x)\right|\,dx$$

- The Two One-Sided Test (TOST [23]) is a statistical procedure used to assess the equivalence between the means of imputed and observed values. Unlike conventional statistical tests that seek to detect significant differences, TOST is specifically designed to demonstrate equivalence, that is, to confirm that the difference is small enough to be considered negligible within a predefined tolerance margin $\delta$. Let $\mu_o$ and $\mu_i$ be the means of observed and imputed values, respectively. According to Santamaría-Bonfil et al. [23], TOST is implemented as two one-sided t-tests:

$$H_{01}: \mu_o - \mu_i \le -\delta \quad \text{versus} \quad H_{a1}: \mu_o - \mu_i > -\delta$$

$$H_{02}: \mu_o - \mu_i \ge \delta \quad \text{versus} \quad H_{a2}: \mu_o - \mu_i < \delta$$

Equivalence is concluded when both null hypotheses are rejected.
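The four metrics listed above can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names are hypothetical, the Wasserstein distance is computed between empirical distributions of two samples, and the TOST uses a large-sample normal approximation (via the standard library's `NormalDist`) in place of the t-distribution named in the text, which is only reasonable for large samples.

```python
import math
from statistics import NormalDist

import numpy as np


def mae(observed, imputed):
    """Mean absolute difference between observed and imputed values."""
    return float(np.mean(np.abs(np.asarray(observed, float) - np.asarray(imputed, float))))


def rmse(observed, imputed):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((np.asarray(observed, float) - np.asarray(imputed, float)) ** 2)))


def wasserstein_1(a, b):
    """First-order Wasserstein distance: integral of |F_a(x) - F_b(x)| over x."""
    a = np.sort(np.asarray(a, float))
    b = np.sort(np.asarray(b, float))
    grid = np.sort(np.concatenate([a, b]))
    widths = np.diff(grid)
    cdf_a = np.searchsorted(a, grid[:-1], side="right") / a.size  # empirical CDFs
    cdf_b = np.searchsorted(b, grid[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * widths))


def tost_equivalent(observed, imputed, delta, alpha=0.05):
    """Two one-sided tests on the difference of means (normal approximation)."""
    o = np.asarray(observed, float)
    i = np.asarray(imputed, float)
    diff = o.mean() - i.mean()
    se = math.sqrt(o.var(ddof=1) / o.size + i.var(ddof=1) / i.size)
    z = NormalDist()
    p_lower = 1.0 - z.cdf((diff + delta) / se)  # H0: diff <= -delta
    p_upper = z.cdf((diff - delta) / se)        # H0: diff >= +delta
    return max(p_lower, p_upper) < alpha        # equivalent only if both H0 are rejected


# Hypothetical samples standing in for observed and imputed values.
rng = np.random.default_rng(0)
obs = rng.normal(10.0, 1.0, 2000)
imp = rng.normal(10.0, 1.0, 2000)
print(mae([1, 2, 3, 4], [1, 2, 5, 8]))        # 1.5
print(wasserstein_1([0, 1, 2], [1, 2, 3]))    # distance between two shifted samples
print(tost_equivalent(obs, imp, delta=0.5))   # True: means agree within the margin
```

MAE and RMSE are computed point by point on the masked cells, whereas the Wasserstein distance and TOST compare whole distributions and means, so they can be evaluated even when no paired ground truth exists.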
3. Results
3.1. Results of Hyperparameter Optimization
3.2. Results of Imputation Models
3.2.1. Performance Evaluation on Masked Subset
MAE Results
RMSE Results
3.2.2. Distributional Similarity Assessment
Results of Wasserstein Distance
Results of TOST
Visualization of Kernel Density Estimation (KDE)
Time Series Reconstruction from 2014–2023: MER Station
3.3. Comparative Evaluation of Temporal Autocorrelation and Extreme Event Preservation Between RF and BRITS
4. Discussion
5. Limitations and Future Perspectives
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
MAE and RMSE are computed on the masked subset; all remaining columns are computed on the complete distribution dataset.
# | Station | Variable | MAE (RF) | MAE (BRITS) | RMSE (RF) | RMSE (BRITS) | Original Mean (Om) | Imputed Mean (Im), RF | Imputed Mean (Im), BRITS | Mean Difference (Om-Im), RF | Mean Difference (Om-Im), BRITS | TOST Interval ± | Wasserstein Distance (%), RF | Wasserstein Distance (%), BRITS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | MER | CO | 0.17 | 0.34 | 0.23 | 0.47 | 0.70 | 0.69 | 0.67 | −0.02 | 0.01 | 0.40 | 0.29 | 0.25 |
2 | MER | NO | 6.57 | 16.63 | 16.92 | 26.69 | 23.90 | 23.70 | 23.59 | 0.16 | 0.27 | 20.00 | 0.73 | 0.33 |
3 | MER | NO2 | 4.57 | 9.14 | 9.26 | 11.92 | 32.50 | 32.88 | 32.16 | −0.36 | 0.36 | 6.60 | 0.86 | 0.55 |
4 | MER | NOx | 8.89 | 22.49 | 19.28 | 32.36 | 56.40 | 56.55 | 55.53 | −0.20 | 0.82 | 23.40 | 0.79 | 0.38 |
5 | MER | O3 | 7.23 | 16.05 | 11.43 | 21.35 | 25.80 | 25.56 | 25.25 | 0.27 | 0.59 | 8.30 | 0.25 | 0.54 |
6 | MER | PM10 | 4.76 | 15.13 | 9.90 | 20.22 | 47.00 | 46.99 | 46.71 | 0.01 | 0.28 | 23.30 | 0.68 | 0.31 |
7 | MER | PM2.5 | 3.86 | 9.81 | 7.95 | 13.64 | 24.40 | 24.35 | 24.38 | 0.09 | 0.06 | 19.00 | 0.48 | 0.24 |
8 | MER | PMCO | 3.93 | 8.72 | 7.94 | 12.15 | 22.60 | 22.65 | 22.61 | −0.09 | −0.05 | 18.50 | 0.48 | 0.21 |
9 | MER | RH | 8.59 | 13.46 | 11.18 | 16.90 | 52.30 | 52.36 | 52.54 | −0.05 | −0.23 | 5.00 | 0.26 | 0.50 |
10 | MER | SO2 | 2.84 | 4.33 | 5.36 | 7.02 | 4.70 | 4.77 | 4.61 | −0.04 | 0.12 | 12.60 | 0.06 | 0.05 |
11 | MER | TMP | 1.70 | 3.01 | 2.28 | 3.76 | 18.00 | 18.05 | 17.84 | −0.05 | 0.16 | 1.50 | 0.23 | 0.77 |
12 | MER | WDR | 83.03 | 98.15 | 105.60 | 120.78 | 183.70 | 184.29 | 184.37 | −0.64 | −0.72 | 18.00 | 0.75 | 0.56 |
13 | MER | WSP | 0.48 | 0.74 | 0.66 | 0.96 | 2.10 | 2.12 | 2.12 | 0.00 | 0.00 | 0.40 | 0.17 | 0.19 |
14 | TLA | CO | 0.15 | 0.27 | 0.20 | 0.37 | 0.60 | 0.60 | 0.61 | 0.01 | 0.01 | 0.30 | 0.26 | 0.31 |
15 | TLA | NO | 6.15 | 17.69 | 17.64 | 30.27 | 23.50 | 23.55 | 23.02 | −0.07 | 0.46 | 19.20 | 0.70 | 0.29 |
16 | TLA | NO2 | 4.34 | 8.64 | 8.49 | 12.13 | 30.10 | 30.16 | 29.93 | −0.06 | 0.17 | 7.00 | 0.85 | 0.41 |
17 | TLA | NOx | 8.23 | 20.62 | 19.58 | 29.91 | 53.60 | 53.69 | 53.40 | −0.12 | 0.16 | 21.50 | 0.82 | 0.38 |
18 | TLA | O3 | 7.17 | 15.55 | 10.63 | 20.76 | 25.60 | 25.28 | 25.36 | 0.30 | 0.21 | 8.40 | 0.24 | 0.57 |
19 | TLA | PM10 | 5.13 | 18.32 | 11.44 | 25.60 | 49.50 | 49.59 | 48.25 | −0.12 | 1.22 | 27.50 | 0.69 | 0.32 |
20 | TLA | PM2.5 | 3.97 | 10.22 | 7.96 | 13.60 | 23.90 | 23.93 | 23.22 | −0.05 | 0.66 | 10.60 | 0.95 | 0.49 |
21 | TLA | PMCO | 4.32 | 11.18 | 8.67 | 18.76 | 25.60 | 25.66 | 25.05 | −0.07 | 0.55 | 26.00 | 0.44 | 0.17 |
22 | TLA | RH | 8.25 | 12.52 | 10.88 | 15.52 | 49.30 | 49.40 | 49.39 | −0.09 | −0.08 | 4.40 | 0.41 | 1.08 |
23 | TLA | SO2 | 4.66 | 5.10 | 8.15 | 9.93 | 7.00 | 7.00 | 6.83 | −0.05 | 0.13 | 12.00 | 0.09 | 0.09 |
24 | TLA | TMP | 1.79 | 3.01 | 2.39 | 3.79 | 17.20 | 17.15 | 17.22 | 0.07 | −0.01 | 1.50 | 0.33 | 0.77 |
25 | TLA | WDR | 87.39 | 105.27 | 115.21 | 141.17 | 248.00 | 246.58 | 251.15 | 1.38 | −3.20 | 18.00 | 1.79 | 1.37 |
26 | TLA | WSP | 0.65 | 0.92 | 0.87 | 1.18 | 2.20 | 2.19 | 2.22 | 0.04 | 0.00 | 0.50 | 0.39 | 0.52 |
27 | SAG | CO | 0.15 | 0.29 | 0.20 | 0.39 | 0.60 | 0.57 | 0.53 | −0.02 | 0.02 | 0.30 | 0.60 | 0.53 |
28 | SAG | NO | 5.19 | 11.52 | 15.29 | 20.16 | 16.50 | 16.62 | 15.43 | −0.07 | 1.12 | 19.80 | 0.74 | 0.30 |
29 | SAG | NO2 | 4.06 | 7.09 | 8.02 | 9.26 | 22.90 | 22.98 | 22.68 | −0.05 | 0.25 | 5.50 | 1.36 | 0.62 |
30 | SAG | NOx | 7.19 | 16.95 | 17.25 | 25.65 | 39.50 | 39.60 | 38.36 | −0.12 | 1.12 | 22.20 | 0.90 | 0.39 |
31 | SAG | O3 | 7.23 | 15.01 | 10.29 | 19.40 | 25.90 | 25.85 | 25.34 | 0.02 | 0.53 | 8.00 | 0.55 | 0.83 |
32 | SAG | PM10 | 5.24 | 17.65 | 12.14 | 25.80 | 49.80 | 49.72 | 47.32 | 0.08 | 2.48 | 40.30 | 0.83 | 0.37 |
33 | SAG | PM2.5 | 4.40 | 9.83 | 8.94 | 14.31 | 23.10 | 23.16 | 22.07 | −0.04 | 1.05 | 34.90 | 0.49 | 0.21 |
34 | SAG | PMCO | 4.79 | 10.83 | 10.43 | 17.81 | 26.70 | 26.68 | 25.19 | 0.00 | 1.49 | 33.30 | 0.64 | 0.31 |
35 | SAG | RH | 9.49 | 14.03 | 12.03 | 17.17 | 54.40 | 54.75 | 55.00 | −0.38 | −0.63 | 4.90 | 1.09 | 1.37 |
36 | SAG | SO2 | 2.94 | 3.47 | 6.86 | 8.56 | 4.30 | 4.13 | 3.87 | 0.13 | 0.39 | 16.20 | 0.15 | 0.16 |
37 | SAG | TMP | 1.84 | 2.78 | 2.40 | 3.53 | 18.40 | 18.35 | 18.33 | 0.03 | 0.05 | 1.60 | 0.53 | 0.56 |
38 | SAG | WDR | 89.93 | 100.27 | 111.19 | 128.92 | 137.10 | 137.39 | 135.76 | −0.32 | 1.31 | 18.00 | 1.66 | 0.98 |
39 | SAG | WSP | 0.46 | 0.60 | 0.63 | 0.78 | 1.70 | 1.63 | 1.65 | 0.03 | 0.00 | 0.40 | 0.51 | 0.33 |
40 | UAX | CO | 0.12 | 0.20 | 0.16 | 0.27 | 0.50 | 0.48 | 0.47 | 0.00 | 0.00 | 0.20 | 0.32 | 0.34 |
41 | UAX | NO | 3.19 | 7.82 | 9.76 | 13.85 | 10.00 | 9.99 | 9.79 | 0.02 | 0.22 | 18.60 | 0.35 | 0.15 |
42 | UAX | NO2 | 3.14 | 7.16 | 6.47 | 9.98 | 21.50 | 21.67 | 21.51 | −0.13 | 0.04 | 6.60 | 0.71 | 0.38 |
43 | UAX | NOx | 4.79 | 13.11 | 11.42 | 19.37 | 31.50 | 31.67 | 31.30 | −0.13 | 0.24 | 20.40 | 0.51 | 0.24 |
44 | UAX | O3 | 8.63 | 18.33 | 13.07 | 23.89 | 32.40 | 32.34 | 32.17 | 0.08 | 0.25 | 8.90 | 0.37 | 0.64 |
45 | UAX | PM2.5 | 7.19 | 9.99 | 10.41 | 15.00 | 20.20 | 19.93 | 19.83 | 0.23 | 0.34 | 10.40 | 0.18 | 0.28 |
46 | UAX | RH | 8.79 | 14.53 | 11.73 | 18.00 | 54.40 | 55.02 | 55.46 | −0.64 | −1.08 | 4.90 | 0.85 | 1.86 |
47 | UAX | SO2 | 1.80 | 1.66 | 4.08 | 4.57 | 2.50 | 2.56 | 2.45 | −0.04 | 0.07 | 5.30 | 0.13 | 0.11 |
48 | UAX | TMP | 1.81 | 2.94 | 2.40 | 3.68 | 17.20 | 17.21 | 17.20 | 0.02 | 0.03 | 1.50 | 0.55 | 1.55 |
49 | UAX | WDR | 72.65 | 84.76 | 95.10 | 106.61 | 175.00 | 174.36 | 172.58 | 0.59 | 2.37 | 18.00 | 2.54 | 2.46 |
50 | UAX | WSP | 0.52 | 0.73 | 0.75 | 0.99 | 2.00 | 2.03 | 2.01 | 0.01 | 0.03 | 0.50 | 0.35 | 0.82 |
51 | FAC | CO | 0.13 | 0.33 | 0.21 | 0.50 | 0.60 | 0.58 | 0.57 | 0.00 | 0.01 | 0.30 | 0.37 | 0.23 |
52 | FAC | NO | 6.85 | 14.99 | 19.25 | 25.19 | 20.20 | 20.06 | 19.65 | 0.11 | 0.52 | 17.70 | 0.59 | 0.19 |
53 | FAC | NO2 | 4.05 | 9.14 | 7.68 | 12.73 | 24.30 | 24.33 | 24.21 | −0.06 | 0.06 | 7.50 | 0.64 | 0.29 |
54 | FAC | NOx | 8.59 | 20.45 | 20.65 | 30.23 | 44.40 | 44.36 | 43.95 | 0.07 | 0.48 | 21.50 | 0.65 | 0.26 |
55 | FAC | O3 | 6.68 | 14.90 | 10.31 | 19.95 | 28.90 | 28.56 | 28.64 | 0.31 | 0.23 | 9.30 | 0.31 | 0.42 |
56 | FAC | PM10 | 12.78 | 17.60 | 18.44 | 24.21 | 37.40 | 36.30 | 37.59 | 1.09 | −0.20 | 22.40 | 0.35 | 0.34 |
57 | FAC | RH | 8.92 | 15.56 | 11.81 | 19.07 | 55.20 | 55.04 | 55.16 | 0.16 | 0.04 | 5.00 | 0.51 | 0.71 |
58 | FAC | SO2 | 3.39 | 4.02 | 7.05 | 9.16 | 4.90 | 4.87 | 4.76 | 0.05 | 0.15 | 10.90 | 0.10 | 0.09 |
59 | FAC | TMP | 2.25 | 4.09 | 2.97 | 5.07 | 17.00 | 17.16 | 17.06 | −0.13 | −0.02 | 1.80 | 0.50 | 0.85 |
60 | FAC | WDR | 69.72 | 89.64 | 94.79 | 113.11 | 216.90 | 215.55 | 216.94 | 1.31 | −0.08 | 18.00 | 1.11 | 0.70 |
61 | FAC | WSP | 0.42 | 0.66 | 0.58 | 0.87 | 1.80 | 1.75 | 1.75 | 0.00 | 0.00 | 0.40 | 0.27 | 0.25 |
62 | NEZ | CO | 0.12 | 0.27 | 0.18 | 0.38 | 0.60 | 0.54 | 0.55 | 0.01 | 0.01 | 0.50 | 0.21 | 0.18 |
63 | NEZ | NO | 3.90 | 11.71 | 11.65 | 23.20 | 13.00 | 12.98 | 12.44 | −0.01 | 0.53 | 25.80 | 0.40 | 0.15 |
64 | NEZ | NO2 | 3.62 | 8.56 | 7.41 | 11.14 | 24.30 | 24.37 | 24.16 | −0.08 | 0.13 | 5.80 | 1.12 | 0.56 |
65 | NEZ | NOx | 5.63 | 15.26 | 13.46 | 23.23 | 37.20 | 37.34 | 36.58 | −0.09 | 0.66 | 28.40 | 0.53 | 0.23 |
66 | NEZ | O3 | 7.39 | 16.24 | 10.85 | 20.76 | 29.40 | 30.23 | 28.92 | −0.84 | 0.47 | 7.90 | 0.62 | 0.85 |
67 | NEZ | PM2.5 | 7.62 | 10.82 | 13.56 | 19.38 | 21.90 | 21.72 | 21.34 | 0.17 | 0.55 | 19.60 | 0.17 | 0.19 |
68 | NEZ | RH | 8.96 | 12.82 | 11.60 | 15.78 | 51.10 | 51.20 | 51.08 | −0.12 | 0.00 | 4.60 | 0.54 | 0.43 |
69 | NEZ | SO2 | 2.17 | 2.16 | 4.90 | 5.96 | 3.20 | 3.31 | 3.02 | −0.15 | 0.14 | 10.30 | 0.15 | 0.10 |
70 | NEZ | TMP | 1.74 | 2.65 | 2.29 | 3.38 | 17.30 | 17.39 | 17.26 | −0.07 | 0.05 | 1.70 | 0.30 | 0.31 |
71 | NEZ | WDR | 85.24 | 104.66 | 106.92 | 129.10 | 146.90 | 147.32 | 145.05 | −0.46 | 1.81 | 18.00 | 2.21 | 1.84 |
72 | NEZ | WSP | 0.70 | 1.12 | 0.94 | 1.45 | 2.50 | 2.48 | 2.47 | 0.02 | 0.03 | 0.60 | 0.35 | 0.59 |
73 | PED | CO | 0.08 | 0.16 | 0.11 | 0.23 | 0.40 | 0.38 | 0.38 | 0.01 | 0.00 | 0.10 | 0.41 | 0.41 |
74 | PED | NO | 2.54 | 6.03 | 6.57 | 10.49 | 7.10 | 7.07 | 6.89 | 0.01 | 0.19 | 11.10 | 0.44 | 0.18 |
75 | PED | NO2 | 2.95 | 6.55 | 5.87 | 8.73 | 21.20 | 21.28 | 20.94 | −0.10 | 0.25 | 6.00 | 0.85 | 0.41 |
76 | PED | NOx | 4.10 | 10.31 | 8.45 | 14.92 | 28.20 | 28.29 | 27.88 | −0.06 | 0.34 | 13.00 | 0.70 | 0.31 |
77 | PED | O3 | 8.09 | 17.90 | 11.51 | 23.41 | 33.90 | 33.35 | 33.25 | 0.53 | 0.63 | 9.20 | 0.39 | 0.76 |
78 | PED | PM10 | 3.63 | 12.48 | 8.54 | 17.74 | 32.70 | 32.75 | 31.62 | −0.08 | 1.06 | 20.80 | 1.54 | 0.79 |
79 | PED | PM2.5 | 3.17 | 7.39 | 6.41 | 10.09 | 18.80 | 18.81 | 18.71 | −0.03 | 0.07 | 8.20 | 2.39 | 1.40 |
80 | PED | PMCO | 2.98 | 6.37 | 6.88 | 8.71 | 13.90 | 13.91 | 13.18 | −0.01 | 0.72 | 19.00 | 0.85 | 0.38 |
81 | PED | RH | 8.67 | 14.57 | 11.60 | 17.88 | 53.10 | 53.23 | 53.36 | −0.08 | −0.21 | 5.00 | 0.47 | 0.43 |
82 | PED | SO2 | 1.89 | 1.89 | 3.75 | 3.78 | 3.00 | 2.94 | 2.89 | 0.06 | 0.11 | 5.30 | 0.12 | 0.14 |
83 | PED | TMP | 1.75 | 2.83 | 2.34 | 3.50 | 16.90 | 16.85 | 16.92 | 0.05 | −0.01 | 1.50 | 0.30 | 0.33 |
84 | PED | WDR | 71.86 | 85.18 | 96.45 | 108.78 | 187.70 | 188.45 | 187.66 | −0.70 | 0.08 | 18.00 | 0.62 | 0.48 |
85 | PED | WSP | 0.47 | 0.63 | 0.66 | 0.83 | 1.90 | 1.91 | 1.92 | 0.01 | 0.00 | 0.50 | 0.18 | 0.12 |
86 | VIF | CO | 0.11 | 0.22 | 0.16 | 0.32 | 0.40 | 0.43 | 0.40 | 0.00 | 0.02 | 0.20 | 0.46 | 0.56 |
87 | VIF | NO | 3.99 | 11.04 | 11.73 | 19.64 | 12.20 | 12.26 | 10.87 | −0.01 | 1.38 | 12.30 | 1.10 | 0.56 |
88 | VIF | NO2 | 3.61 | 7.86 | 6.98 | 11.01 | 18.60 | 18.64 | 17.55 | −0.06 | 1.03 | 5.20 | 1.71 | 1.08 |
89 | VIF | NOx | 5.57 | 16.68 | 13.63 | 24.44 | 30.80 | 30.78 | 28.71 | 0.04 | 2.11 | 14.10 | 1.45 | 0.80 |
90 | VIF | O3 | 7.25 | 15.13 | 10.64 | 19.40 | 28.40 | 28.86 | 27.04 | −0.45 | 1.37 | 7.90 | 0.41 | 1.15 |
91 | VIF | PM10 | 19.98 | 28.70 | 34.05 | 44.21 | 54.30 | 52.71 | 52.19 | 1.59 | 2.10 | 38.60 | 0.31 | 0.33 |
92 | VIF | RH | 10.08 | 15.08 | 13.31 | 18.43 | 54.50 | 54.51 | 54.41 | 0.00 | 0.09 | 5.00 | 0.37 | 0.98 |
93 | VIF | SO2 | 4.67 | 4.49 | 10.22 | 11.22 | 5.30 | 5.73 | 4.86 | −0.39 | 0.47 | 13.30 | 0.27 | 0.18 |
94 | VIF | TMP | 2.05 | 3.09 | 2.65 | 3.89 | 17.00 | 17.03 | 16.83 | −0.06 | 0.14 | 1.80 | 0.30 | 0.87 |
95 | VIF | WDR | 103.68 | 114.29 | 120.98 | 140.72 | 206.30 | 202.17 | 197.34 | 4.16 | 9.00 | 18.00 | 3.60 | 3.43 |
96 | VIF | WSP | 0.52 | 0.74 | 0.71 | 0.99 | 1.90 | 1.90 | 1.83 | −0.01 | 0.05 | 0.40 | 0.41 | 0.91 |
97 | MGH | CO | 0.15 | 0.25 | 0.21 | 0.33 | 0.50 | 0.49 | 0.54 | 0.04 | −0.01 | 0.20 | 1.09 | 0.57 |
98 | MGH | NO | 6.44 | 15.90 | 19.19 | 27.46 | 21.80 | 21.80 | 20.68 | −0.01 | 1.11 | 24.90 | 0.99 | 0.49 |
99 | MGH | NO2 | 4.43 | 8.61 | 9.19 | 11.52 | 27.80 | 27.93 | 27.78 | −0.11 | 0.04 | 7.10 | 1.48 | 0.84 |
100 | MGH | NOx | 8.88 | 21.55 | 21.21 | 32.51 | 49.60 | 49.79 | 48.30 | −0.22 | 1.28 | 26.80 | 1.20 | 0.63 |
101 | MGH | O3 | 7.13 | 17.13 | 10.78 | 22.64 | 28.90 | 28.55 | 28.05 | 0.34 | 0.85 | 9.50 | 1.27 | 1.00 |
102 | MGH | RH | 9.54 | 14.86 | 12.42 | 18.20 | 48.60 | 47.79 | 49.31 | 0.82 | −0.71 | 5.00 | 2.00 | 1.32 |
103 | MGH | SO2 | 3.54 | 3.31 | 7.32 | 6.43 | 4.00 | 3.74 | 3.65 | 0.23 | 0.33 | 7.30 | 0.19 | 0.27 |
104 | MGH | TMP | 1.60 | 2.71 | 2.17 | 3.34 | 18.00 | 18.33 | 17.75 | −0.37 | 0.21 | 1.50 | 1.70 | 1.58 |
105 | MGH | WDR | 76.70 | 95.76 | 99.03 | 118.26 | 195.90 | 180.07 | 198.38 | 15.81 | −2.49 | 18.00 | 4.82 | 3.03 |
106 | MGH | WSP | 0.48 | 0.72 | 0.67 | 0.92 | 2.00 | 2.13 | 1.95 | −0.16 | 0.03 | 0.50 | 1.92 | 0.91 |
107 | CUA | CO | 0.13 | 0.18 | 0.16 | 0.24 | 0.40 | 0.42 | 0.41 | 0.01 | 0.01 | 0.20 | 0.69 | 0.47 |
108 | CUA | NO | 3.13 | 7.05 | 9.06 | 13.52 | 9.20 | 9.10 | 8.51 | 0.06 | 0.64 | 16.10 | 0.52 | 0.22 |
109 | CUA | NO2 | 3.20 | 6.43 | 6.33 | 8.95 | 20.70 | 20.74 | 20.21 | −0.06 | 0.47 | 6.70 | 1.22 | 0.60 |
110 | CUA | NOx | 4.71 | 13.33 | 10.79 | 21.38 | 29.80 | 29.85 | 28.71 | −0.02 | 1.11 | 18.40 | 0.82 | 0.39 |
111 | CUA | O3 | 9.24 | 15.03 | 12.70 | 20.23 | 34.30 | 34.83 | 33.99 | −0.54 | 0.29 | 10.50 | 0.90 | 0.69 |
112 | CUA | PM10 | 9.77 | 12.81 | 13.70 | 16.77 | 31.70 | 31.18 | 31.03 | 0.51 | 0.66 | 19.20 | 0.46 | 0.47 |
113 | CUA | RH | 10.97 | 14.04 | 14.22 | 17.31 | 58.60 | 57.63 | 58.30 | 1.02 | 0.35 | 5.00 | 2.04 | 1.25 |
114 | CUA | SO2 | 2.02 | 2.19 | 4.32 | 4.48 | 3.00 | 2.90 | 2.82 | 0.13 | 0.20 | 6.10 | 0.18 | 0.26 |
115 | CUA | TMP | 1.71 | 2.30 | 2.32 | 2.91 | 14.50 | 14.79 | 14.50 | −0.25 | 0.05 | 1.60 | 1.14 | 0.85 |
116 | CUA | WDR | 71.01 | 85.65 | 92.62 | 106.44 | 165.10 | 168.76 | 160.36 | −3.64 | 4.76 | 18.00 | 4.42 | 3.98 |
117 | CUA | WSP | 0.49 | 0.63 | 0.67 | 0.82 | 2.00 | 2.07 | 2.00 | −0.03 | 0.04 | 0.40 | 1.49 | 1.25 |
118 | SFE | CO | 0.09 | 0.16 | 0.12 | 0.21 | 0.36 | 0.34 | 0.36 | 0.02 | −0.01 | 0.11 | 1.78 | 1.62 |
119 | SFE | NO | 3.11 | 6.58 | 7.52 | 11.02 | 8.23 | 8.50 | 7.21 | −0.27 | 1.01 | 8.10 | 1.83 | 0.72 |
120 | SFE | NO2 | 3.09 | 6.54 | 6.23 | 8.76 | 20.30 | 20.21 | 19.91 | 0.09 | 0.39 | 5.80 | 2.30 | 1.29 |
121 | SFE | NOx | 4.45 | 11.34 | 9.37 | 15.80 | 28.52 | 28.74 | 27.31 | −0.22 | 1.21 | 10.35 | 2.46 | 1.24 |
122 | SFE | SO2 | 1.95 | 2.05 | 3.74 | 3.97 | 2.83 | 2.62 | 2.56 | 0.21 | 0.27 | 4.70 | 0.38 | 0.39 |
123 | SFE | O3 | 8.21 | 17.14 | 11.83 | 22.42 | 33.60 | 34.14 | 32.87 | −0.54 | 0.73 | 9.10 | 2.04 | 1.81 |
124 | SFE | PM10 | 3.44 | 10.70 | 7.12 | 13.93 | 32.98 | 32.99 | 32.32 | −0.01 | 0.66 | 16.80 | 1.64 | 1.02 |
125 | SFE | PM2.5 | 3.00 | 6.77 | 6.12 | 9.06 | 18.00 | 18.03 | 17.46 | −0.03 | 0.54 | 8.90 | 1.81 | 1.17 |
126 | SFE | PMCO | 2.93 | 5.95 | 6.05 | 8.06 | 14.98 | 14.97 | 15.02 | 0.02 | −0.04 | 15.55 | 0.90 | 0.43 |
127 | SFE | RH | 8.86 | 14.91 | 11.58 | 18.60 | 58.58 | 55.62 | 58.66 | 2.96 | −0.07 | 4.90 | 3.88 | 1.82 |
128 | SFE | TMP | 1.75 | 2.74 | 2.27 | 3.42 | 15.16 | 15.44 | 15.08 | −0.29 | 0.07 | 1.41 | 2.02 | 1.30 |
129 | SFE | WDR | 66.33 | 80.77 | 93.80 | 103.12 | 184.07 | 176.04 | 185.70 | 8.04 | −1.63 | 18.00 | 5.62 | 3.82 |
130 | SFE | WSP | 0.52 | 0.71 | 0.72 | 0.93 | 2.32 | 2.39 | 2.30 | −0.07 | 0.02 | 0.57 | 1.49 | 0.94 |
131 | AJM | CO | 0.13 | 0.16 | 0.16 | 0.22 | 0.40 | 0.35 | 0.40 | 0.05 | 0.00 | 0.11 | 2.94 | 1.68 |
132 | AJM | NO | 1.94 | 3.96 | 4.41 | 6.57 | 4.43 | 4.65 | 4.23 | −0.23 | 0.20 | 4.80 | 1.89 | 0.86 |
133 | AJM | NO2 | 2.59 | 5.55 | 5.48 | 7.54 | 16.98 | 17.01 | 17.01 | −0.03 | −0.03 | 5.30 | 2.68 | 1.46 |
134 | AJM | NOx | 3.28 | 7.71 | 6.62 | 10.90 | 21.42 | 21.62 | 21.16 | −0.20 | 0.26 | 7.25 | 2.91 | 1.51 |
135 | AJM | SO2 | 2.08 | 1.97 | 3.78 | 4.04 | 3.02 | 2.47 | 2.69 | 0.55 | 0.33 | 4.65 | 0.74 | 0.52 |
136 | AJM | O3 | 8.04 | 16.69 | 10.94 | 21.63 | 41.02 | 39.91 | 39.78 | 1.12 | 1.24 | 9.80 | 2.64 | 2.08 |
137 | AJM | PM10 | 3.15 | 10.17 | 6.36 | 13.09 | 31.83 | 31.92 | 31.76 | −0.08 | 0.08 | 10.20 | 3.41 | 1.65 |
138 | AJM | PM2.5 | 2.92 | 6.62 | 6.05 | 8.62 | 18.92 | 18.88 | 18.56 | 0.04 | 0.36 | 6.30 | 3.43 | 1.82 |
139 | AJM | PMCO | 2.53 | 5.79 | 5.12 | 7.87 | 12.92 | 12.91 | 13.37 | 0.01 | −0.46 | 8.70 | 1.94 | 0.97 |
140 | AJM | RH | 8.25 | 15.37 | 11.14 | 18.71 | 54.33 | 51.91 | 54.99 | 2.42 | −0.66 | 4.90 | 4.88 | 2.98 |
141 | AJM | TMP | 1.87 | 2.56 | 2.50 | 3.30 | 15.45 | 15.23 | 15.57 | 0.23 | −0.12 | 1.67 | 2.83 | 2.14 |
142 | AJM | WDR | 74.25 | 93.31 | 99.86 | 116.66 | 179.97 | 182.28 | 179.44 | −2.31 | 0.53 | 18.00 | 6.50 | 3.86 |
143 | AJM | WSP | 0.77 | 1.13 | 1.06 | 1.54 | 2.76 | 2.79 | 2.74 | −0.03 | 0.03 | 0.73 | 1.88 | 1.10 |
References
1. World Health Organization (WHO). Ambient (Outdoor) Air Pollution. Available online: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (accessed on 19 July 2025).
2. State of Global Air (SoGA). State of Global Air Report 2024. Available online: https://www.stateofglobalair.org/hap (accessed on 19 July 2025).
3. World Health Organization (WHO). Air Pollution. Available online: https://www.who.int/health-topics/air-pollution#tab=tab_1 (accessed on 19 July 2025).
4. Kim, T.; Kim, J.; Yang, W.; Lee, H.; Choo, J. Missing Value Imputation of Time-Series Air-Quality Data via Deep Neural Networks. Int. J. Environ. Res. Public Health 2021, 18, 12213.
5. He, M.Z.; Yitshak-Sade, M.; Just, A.C.; Gutiérrez-Avila, I.; Dorman, M.; de Hoogh, K.; Kloog, I. Predicting Fine-Scale Daily NO2 over Mexico City Using an Ensemble Modeling Approach. Atmos. Pollut. Res. 2023, 14, 101763.
6. Gobierno de México. Programa de Gestión para Mejorar la Calidad del Aire (PROAIRE). Available online: https://www.gob.mx/semarnat/acciones-y-programas/programas-de-gestion-para-mejorar-la-calidad-del-aire (accessed on 19 July 2025).
7. Gobierno de la Ciudad de México. Mexico City Atmospheric Monitoring System (SIMAT). Available online: http://www.aire.cdmx.gob.mx/default.php (accessed on 19 July 2025).
8. Zhang, X.; Zhou, P. A Transferred Spatio-Temporal Deep Model Based on Multi-LSTM Auto-Encoder for Air Pollution Time Series Missing Value Imputation. Future Gener. Comput. Syst. 2024, 156, 325–338.
9. Wang, Y.; Liu, K.; He, Y.; Fu, Q.; Luo, W.; Li, W.; Xiao, S. Research on Missing Value Imputation to Improve the Validity of Air Quality Data Evaluation on the Qinghai-Tibetan Plateau. Atmosphere 2023, 14, 1821.
10. Hua, V.; Nguyen, T.; Dao, M.S.; Nguyen, H.D.; Nguyen, B.T. The Impact of Data Imputation on Air Quality Prediction Problem. PLoS ONE 2024, 19, e0306303.
11. Alkabbani, H.; Ramadan, A.; Zhu, Q.; Elkamel, A. An Improved Air Quality Index Machine Learning-Based Forecasting with Multivariate Data Imputation Approach. Atmosphere 2022, 13, 1144.
12. Camastra, F.; Capone, V.; Ciaramella, A.; Riccio, A.; Staiano, A. Prediction of Environmental Missing Data Time Series by Support Vector Machine Regression and Correlation Dimension Estimation. Environ. Model. Softw. 2022, 150, 105343.
13. Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci. Rep. 2018, 8, 6085.
14. Cao, W.; Wang, D.; Li, J.; Zhou, H.; Li, L.; Li, Y. BRITS: Bidirectional Recurrent Imputation for Time Series. Adv. Neural Inf. Process. Syst. 2018, 31, 6775–6785.
15. Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing Data Imputation Using Generative Adversarial Nets. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5689–5698.
16. Shahbazian, R.; Greco, S. Generative Adversarial Networks Assist Missing Data Imputation: A Comprehensive Survey and Evaluation. IEEE Access 2023, 11, 88908–88928.
17. Cini, A.; Marisca, I.; Alippi, C. Filling the Gaps: Multivariate Time Series Imputation by Graph Neural Networks. arXiv 2021, arXiv:2108.00298.
18. Du, W.; Côté, D.; Liu, Y. SAITS: Self-Attention-Based Imputation for Time Series. Expert Syst. Appl. 2023, 219, 119619.
19. Colorado Cifuentes, G.U.; Flores Tlacuahuac, A. A Short-Term Deep Learning Model for Urban Pollution Forecasting with Incomplete Data. Can. J. Chem. Eng. 2021, 99, S417–S431.
20. Alahamade, M.; Lake, A. Handling Missing Data in Air Quality Time Series: Evaluation of Statistical and Machine Learning Approaches. Atmosphere 2021, 12, 1130.
21. World Population Review. Mexico City Population. Available online: https://worldpopulationreview.com/cities/mexico/mexico-city (accessed on 19 July 2025).
22. Farjallah, R.; Selim, B.; Jaumard, B.; Ali, S.; Kaddoum, G. Evaluation of Missing Data Imputation for Time Series Without Ground Truth. arXiv 2025, arXiv:2503.05775v1.
23. Santamaría-Bonfil, G.; Santoyo, E.; Díaz-González, L.; Arroyo-Figueroa, G. Equivalent Imputation Methodology for Handling Missing Data in Compositional Geochemical Databases of Geothermal Fluids. Geothermics 2022, 104, 102440.
24. Cini, A.; Rinaldi, A.; Bianchi, F.M.; Alippi, C. GRIN: A Graph Recurrent Imputation Network for Multivariate Time Series. In Proceedings of the Tenth International Conference on Learning Representations, Virtual, 25–29 April 2022; Available online: https://openreview.net/pdf?id=kOu3-S3wJ7 (accessed on 15 August 2025).
25. Chen, X.; Liu, J.; Lu, C. Adaptive Graph Convolutional Imputation Network for Environmental Sensor Data Recovery. Front. Environ. Sci. 2022, 10, 1025268.
26. Li, Y.; Wang, J.; Ma, S.; Wang, Y. Physics-inspired Deep Graph Learning for Air Quality Assessment. npj Clim. Atmos. Sci. 2023, 6, 65.
27. Dimitri, G.M.; Cappelli, I.; Scarselli, F.; Fort, A.; Gori, M. Graph Neural Networks for Missing Data Imputation in Time Series from Meteorological Sensors. In Proceedings of the 2024 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), St. Albans, UK, 21–23 October 2024; pp. 1242–1247.
Hyperparameter | Default Value | Values Evaluated |
---|---|---|
(a) Random Forest | ||
n_estimators | 100 | [10, 20, 50, 80] |
max_depth | None | [10, 20, None] |
max_iter | 10 | [5, 10, 15, 20] |
min_samples_split | 2 | [2, 5, 10, 15] |
min_samples_leaf | 1 | [1, 2, 4, 6] |
(b) BRITS | ||
RNN units | 64 | [64, 128, 256, 512] |
subsequence length | 24 | [16, 32, 64, 128, 168] |
learning rate | 0.001 | [0.001, 0.005, 0.009] |
batch size | 64 | [16, 32, 64, 128, 168] |
use regularization | False | [False, True] |
dropout rate | 0 | [0.1, 0.2, 0.3] |
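The pipeline description says that hyperparameters were tuned in sequential steps because of computational constraints, without giving further detail. One possible reading is a one-at-a-time search that starts from the default values and fixes each hyperparameter in turn; the sketch below illustrates that reading using the Random Forest grid from the table above. The search order, the function name `sequential_search`, and the stand-in scoring function `toy_score` are assumptions, not the authors' procedure.

```python
import math

# Defaults and candidate values copied from the Random Forest part of the table.
RF_DEFAULTS = {"n_estimators": 100, "max_depth": None, "max_iter": 10,
               "min_samples_split": 2, "min_samples_leaf": 1}
RF_GRID = {"n_estimators": [10, 20, 50, 80],
           "max_depth": [10, 20, None],
           "max_iter": [5, 10, 15, 20],
           "min_samples_split": [2, 5, 10, 15],
           "min_samples_leaf": [1, 2, 4, 6]}


def sequential_search(defaults, grid, score_fn):
    """Tune one hyperparameter at a time, keeping the best value found so far."""
    best = dict(defaults)
    n_evals = 0
    for name, candidates in grid.items():
        scores = {}
        for value in candidates:
            trial = {**best, name: value}
            scores[value] = score_fn(trial)   # lower is better (e.g. a validation error)
            n_evals += 1
        best[name] = min(scores, key=scores.get)
    return best, n_evals


def toy_score(params):
    """Hypothetical stand-in for training a model and measuring its validation error."""
    return (abs(params["n_estimators"] - 50) / 100
            + (0.0 if params["max_depth"] == 20 else 0.1)
            + abs(params["max_iter"] - 15) / 100
            + abs(params["min_samples_split"] - 5) / 100
            + abs(params["min_samples_leaf"] - 2) / 100)


best, n_evals = sequential_search(RF_DEFAULTS, RF_GRID, toy_score)
full_grid = math.prod(len(v) for v in RF_GRID.values())
print(best)
print(n_evals, "evaluations instead of", full_grid)
```

The appeal of such a scheme is cost: the number of model fits grows with the sum of the grid sizes rather than their product. The trade-off is that interactions between hyperparameters are not explored, so the result can depend on the order in which they are tuned.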
Station columns give the best parameters for each station.
Hyperparameter | Values Evaluated | PED | MER | SAG | TLA | FAC | NEZ | UAX | CUA | VIF | MGH | SFE | AJM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
n_estimators | [10, 20, 50, 80] | 20 | 80 | 50 | 20 | 50 | 50 | 80 | 20 | 10 | 20 | 50 | 50 |
max_depth | [10, 20, None] | None | None | 10 | 20 | 10 | 20 | None | 10 | 10 | None | 20 | None |
max_iter | [5, 10, 15, 20] | 15 | 10 | 5 | 15 | 15 | 5 | 10 | 10 | 15 | 15 | 10 | 10 |
min_samples_split | [2, 5, 10, 15] | 2 | 5 | 5 | 5 | 2 | 2 | 5 | 10 | 2 | 5 | 5 | 5 |
min_samples_leaf | [1, 2, 4, 6] | 2 | 2 | 1 | 2 | 2 | 4 | 1 | 4 | 1 | 4 | 2 | 2 |
Station columns give the best parameters for each station.
Hyperparameter | Values Evaluated | PED | MER | SAG | TLA | FAC | NEZ | UAX | CUA | VIF | MGH | SFE | AJM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RNN units | [64, 128, 256, 512] | 512 | 256 | 128 | 512 | 512 | 512 | 64 | 512 | 128 | 128 | 256 | 512 |
subsequence length | [16, 32, 64, 128, 168] | 16 | 128 | 32 | 64 | 32 | 128 | 32 | 32 | 32 | 168 | 128 | 168 |
learning rate | [0.001, 0.005, 0.009] | 0.001 | 0.001 | 0.009 | 0.009 | 0.009 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 | 0.005 |
batch size | [16, 32, 64, 128, 168] | 64 | 16 | 168 | 32 | 64 | 64 | 16 | 32 | 64 | 32 | 16 | 32 |
use regularization | [False, True] | False | True | False | False | False | False | False | True | True | False | False | False |
dropout rate | [0.1, 0.2, 0.3] | 0.1 | 0.1 | 0.1 | 0.3 | 0.3 | 0.2 | 0.2 | 0.1 | 0.3 | 0.3 | 0.3 | 0.2 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Díaz-González, L.; Trujillo-Uribe, I.; Pérez-Sansalvador, J.C.; Lakouari, N. Handling Missing Air Quality Data Using Bidirectional Recurrent Imputation for Time Series and Random Forest: A Case Study in Mexico City. AI 2025, 6, 208. https://doi.org/10.3390/ai6090208