Machine Learning for Urban Air Quality Prediction Using Google AlphaEarth Foundations Satellite Embeddings: A Case Study of Quito, Ecuador

Alvarez, Cesar Ivan; Ulloa Vaca, Carlos Andrés; Echeverria Llumipanta, Neptali Armando

doi:10.3390/rs17203472


by Cesar Ivan Alvarez ¹, Carlos Andrés Ulloa Vaca ²,* and Neptali Armando Echeverria Llumipanta ³

¹ Centre for Climate Resilience, University of Augsburg, Universitäts Strasse 12a, 86159 Augsburg, Germany
² Grupo de Investigación en Ciencias Ambientales GRICAM, Carrera de Ingeniería Ambiental, Universidad Politécnica Salesiana, Quito 170702, Ecuador
³ Topografia Automatizada y Fotogrametria Digital, Universidad Catolica de Santiago de Guayaquil, Guayaquil 090505, Ecuador
* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(20), 3472; https://doi.org/10.3390/rs17203472

Submission received: 17 August 2025 / Revised: 4 October 2025 / Accepted: 16 October 2025 / Published: 17 October 2025

(This article belongs to the Special Issue Machine Learning and GeoAI for Remote Sensing Environmental Monitoring (2nd Edition))


Highlights

What are the main findings?

Machine learning using Google AlphaEarth Foundations satellite embeddings in Google Earth Engine accurately predicted NO₂ and SO₂ concentrations in Quito (R² = 0.71), capturing fine-scale pollution patterns at 10 m resolution.
SHAP analysis revealed that only a small subset of embedding bands drives accurate predictions, demonstrating that compact, globally consistent features can explain urban air quality dynamics without handcrafted indices or auxiliary datasets.

What is the implication of the main finding?

Embedding-based remote sensing models provide a scalable solution for urban air quality monitoring in the Global South, overcoming sparse ground stations and persistent cloud cover.
The approach supports policy-relevant applications such as hotspot detection, trend analysis, and sustainable urban planning, offering transferable methods for data-scarce cities worldwide.

Abstract

Many Global-South cities lack dense monitoring and suffer persistent cloud cover, hampering fine-scale trend detection. This study evaluates the potential of annual multi-sensor satellite embeddings from the AlphaEarth Foundations model in Google Earth Engine to predict and map major air pollutants in Quito, Ecuador, between 2017 and 2024. The 64-dimensional embeddings integrate Sentinel-1 radar, Sentinel-2 optical imagery, Landsat surface reflectance, ERA5-Land climate variables, GRACE terrestrial water storage, and GEDI canopy structure into a compact representation of surface and climatic conditions. Annual median concentrations of NO₂, SO₂, PM_2.5, CO, and O₃ from the Red Metropolitana de Monitoreo Atmosférico de Quito (REEMAQ) were paired with collocated embeddings and modeled using five machine learning algorithms. Support Vector Regression achieved the highest accuracy for NO₂ and SO₂ (R² = 0.71 for both), capturing fine-scale spatial patterns and multi-year changes, including COVID-19 lockdown-related reductions. PM_2.5 and CO were predicted with moderate accuracy, while O₃ remained challenging due to its short-term photochemical and meteorological drivers and the mismatch with annual aggregation. SHAP analysis revealed that a small subset of embedding bands dominated predictions for NO₂ and SO₂. The approach provides a scalable and transferable framework for high-resolution urban air quality mapping in data-scarce environments, supporting long-term monitoring, hotspot detection, and evidence-based policy interventions.

Keywords: urban air quality; satellite embeddings; Google Earth Engine; machine learning; Quito

1. Introduction

Urban air pollution remains a major environmental and public health challenge, particularly in cities of the Global South where resources for continuous air quality monitoring are limited [1,2]. In many cases, establishing and maintaining dense air quality networks or implementing low-cost sensor alternatives is financially and logistically unfeasible [3,4], thereby restricting decision-makers’ ability to assess pollutant dynamics, evaluate policy interventions, and protect public health. Exposure to nitrogen dioxide (NO₂), sulfur dioxide (SO₂), fine particulate matter (PM_2.5), ozone (O₃), and carbon monoxide (CO) is consistently associated with respiratory and cardiovascular diseases, premature mortality, and significant socio-environmental impacts, highlighting the need for cost-effective and transferable approaches to estimate their spatial and temporal variability [5,6].

Quito, Ecuador, located at 2850 m above sea level in the tropical Andes, faces complex air quality issues driven by vehicular emissions, industrial activity, and topographically induced thermal inversions [7]. The city’s Red Metropolitana de Monitoreo Atmosférico de Quito (REEMAQ) is the only operational network in Ecuador and provides valuable ground-based measurements; however, its limited spatial coverage and low station density hinder its ability to capture the city’s pronounced spatial heterogeneity [8]. Most of the Ecuadorian territory remains without systematic air quality monitoring.

Previous studies in Quito have used remote sensing combined with regression-based models, such as land-use regression (LUR) approaches using Landsat or MODIS-derived indices and meteorological data, achieving good performance but facing recurring limitations: reliance on handcrafted features from specific sensors, dependence on additional ground-based variables, limited spatial transferability requiring city-specific recalibration, high cloud density during most of the year that affects optical remote sensing data [9], and weak temporal generalization for multi-year predictions [10,11].

Recent advances in artificial intelligence have introduced satellite image embeddings—dense, information-rich feature vectors generated from multi-sensor Earth observation datasets through self-supervised learning—which address many of these constraints [12]. Google DeepMind’s AlphaEarth Foundations (AEF) model integrates over 3 billion geospatial observations from Sentinel-1 SAR, Sentinel-2 optical bands, Landsat imagery, ERA5-Land meteorology, GRACE hydrology, and GEDI LiDAR canopy data, producing globally consistent 64-dimensional embeddings summarizing spectral, seasonal, and structural characteristics at 10 m resolution [13]. Available via Google Earth Engine (GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL), these embeddings are robust to cloud contamination, require no manual feature engineering, and enable scalable applications in data-scarce environments [14].

While embeddings have been successfully applied to land cover classification, biomass estimation, and environmental monitoring, their use for urban air pollution prediction remains unexplored. This study addresses that gap by assessing the predictive capacity of AEF embeddings for annual NO₂, SO₂, PM_2.5, CO, and O₃ concentrations in Quito, using only satellite-derived features and machine learning [15,16,17]. Multiple regression algorithms—Support Vector Regression, Ridge Regression, Random Forest, Gradient Boosting, and k-Nearest Neighbors—are compared, and model interpretation is carried out using Shapley Additive Explanations (SHAP) [18,19]. Building on the best-performing models, we generate high-resolution (10 m) prediction maps for NO₂ and SO₂ for 2017 and 2024 to analyze spatial patterns and temporal changes. The objectives are to (i) evaluate the performance of machine learning models in predicting annual pollutant concentrations from satellite embeddings, (ii) identify the most influential embedding features, and (iii) produce fine-scale maps using only remote sensing data from embedding features to assess multi-year changes, offering a scalable framework for urban air quality assessment in the Global South.

2. Materials and Methods

2.1. Study Area

This study focuses on Quito, the capital of Ecuador, situated in the tropical Andes at approximately 0°13′S, 78°30′W, at an elevation of about 2850 m above sea level (Figure 1). The Metropolitan District of Quito covers roughly 4200 km² and has a population exceeding 2.7 million. The city’s topography is diverse, ranging from densely urbanized basins on the valley floor to mountainous peripheries. Ecuador lies in northwestern South America, bordered by Colombia to the north and Peru to the south, while the Amazon basin to the east occasionally contributes to long-range transport of biomass burning emissions. The city exhibits moderate seasonal variability but is strongly influenced by meteorological phenomena such as thermal inversions, which intensify pollutant accumulation. Air quality is primarily affected by vehicular traffic, industrial emissions, biomass burning, and social disruptions such as strikes or exceptional events like the COVID-19 pandemic [20]. Frequent cloud cover and the limited spatial distribution of monitoring stations make Quito a representative case of a data-scarce urban environment in the Global South.

2.2. Ground-Based Air Quality Data (REEMAQ)

Ground-truth data were obtained from the Red Metropolitana de Monitoreo Atmosférico de Quito (REEMAQ), the city’s official air quality monitoring network, managed by the Secretaría de Ambiente de Quito and available through the open data portal https://datosambiente.quito.gob.ec/ [21], accessed on 10 August 2025. The dataset comprised measurements of five pollutants (NO₂, SO₂, PM_2.5, O₃, and CO) recorded between 2017 and 2025 at the nine monitoring stations shown in Figure 1: San Antonio, Carapungo, Cotocollao, Belisario, Centro, El Camal, Guamaní, Los Chillos, and Tumbaco. Before aggregation, the dataset was checked for null values, and missing or invalid records were removed. The hourly REEMAQ records were then aggregated to annual medians, which matches the yearly scale of the embedding features, reduces the influence of extreme values, and improves statistical robustness. All nine stations with complete annual records for each pollutant were retained in the final dataset. For model training and prediction, the geographic coordinates of each station were used to extract the corresponding 64-band satellite embedding values from Google Earth Engine. Across the study period, data completeness exceeded 90% for NO₂, SO₂, and CO, while PM_2.5 and O₃ showed slightly lower coverage (85–88%) due to occasional instrument downtime.
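The aggregation step described above can be sketched with pandas. This is a minimal illustration rather than the authors' code: the column names (station, pollutant, timestamp, value), the rule that negative readings count as invalid, and the toy readings are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def annual_medians(hourly: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly records to one median per station, pollutant and year."""
    clean = hourly.dropna(subset=["value"])        # remove missing readings
    clean = clean[clean["value"] >= 0]             # assumed validity rule: no negatives
    clean = clean.assign(year=clean["timestamp"].dt.year)
    return (
        clean.groupby(["station", "pollutant", "year"], as_index=False)["value"]
        .median()
        .rename(columns={"value": "annual_median"})
    )


# Toy hourly records (hypothetical values, not REEMAQ data).
hourly = pd.DataFrame(
    {
        "station": ["Centro"] * 5 + ["Tumbaco"] * 3,
        "pollutant": ["NO2"] * 8,
        "timestamp": pd.to_datetime(
            [
                "2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 02:00",
                "2017-01-01 03:00", "2017-01-01 04:00",
                "2017-06-01 00:00", "2017-06-01 01:00", "2017-06-01 02:00",
            ]
        ),
        "value": [20.0, 22.0, np.nan, 300.0, 21.0, 10.0, 12.0, 11.0],
    }
)

medians = annual_medians(hourly)
print(medians)
```

In the toy data the single spike of 300 at Centro would pull an annual mean to about 90, whereas the median stays at 21.5, which is the outlier robustness the text refers to.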

2.3. Satellite Embeddings (A00–A63)

We employed the AlphaEarth Foundations (AEF) embedding dataset, a global multi-sensor representation model designed to encode spatial patterns of Earth surface and climate variables into compact, information-rich feature vectors [22]. The embeddings were generated using a transformer-based architecture trained on a diverse range of Earth observation datasets, including Sentinel-1 synthetic aperture radar (SAR) backscatter, Sentinel-2 and Landsat-8 multispectral reflectance, MODIS vegetation indices, GRACE gravity anomalies, GEDI canopy height, topography, soil moisture, and atmospheric parameters from ERA5-Land reanalysis. This combination captures domains such as land cover, vegetation structure and phenology, biomass, hydrology, and climate.

Although Sentinel-1 SAR backscatter does not directly measure atmospheric gases such as NO₂ or SO₂, its inclusion within the AlphaEarth Foundations embeddings provides structural information on urban morphology, imperviousness, and surface roughness [23]. These characteristics indirectly influence pollutant dispersion and the location of emission hotspots, thereby contributing useful contextual features when combined with optical, climatic, and structural inputs. This approach is consistent with existing air quality studies that also include the relationship with SAR measurements [24].

For this study, we accessed the annual embedding layers through Google Earth Engine under the collection ID GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL. Each layer is provided at a 10 m spatial resolution for the years 2017–2024, with each pixel containing a 64-dimensional vector that encodes latent information from the original multi-sensor datasets. These embeddings integrate both static and dynamic environmental properties, including optical and radar signals, vegetation metrics, topographic context, and long-term climate patterns.

A key advantage of the AEF embeddings over traditional satellite-derived indices is their capability for data fusion, which enables the combination of heterogeneous data sources into a unified representation. This enables robust predictive modeling even in data-scarce environments. Additionally, because they incorporate climate reanalysis data from ERA5-Land, the embeddings implicitly encode meteorological and seasonal patterns without requiring separate climate covariates.

At inference time, only the input datasets are required to generate embeddings for new locations or years, making the system highly scalable and suitable for applications in regions without extensive monitoring infrastructure. In this study, the embeddings served as the sole predictor variables in machine learning models for air pollutant concentrations, eliminating the need for manually engineered features.

The inclusion of ERA5-Land variables means that the embeddings contain not only surface spectral and structural information but also annual climate summaries. This integration enables the model to consider long-term meteorological context, despite the fact that short-term weather variability is not explicitly represented due to annual aggregation.

The embeddings are provided at 10 m resolution, are robust to cloud contamination, and are consistent across years and regions. For each REEMAQ station and year, we extracted the full set of embedding bands (A00–A63) as predictor variables for the machine learning models.
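The per-station extraction could look roughly like the sketch below, which uses the Earth Engine Python API. It is an assumption-laden illustration, not the authors' script: the helper names band_names and embedding_at_point are hypothetical, the choice of a first-value reducer at a 10 m scale is one of several reasonable options, and the function expects that ee.Initialize() has already been called with valid credentials.

```python
def band_names(n_bands: int = 64) -> list:
    """Names of the embedding bands, A00 through A63 for the 64-band product."""
    return [f"A{i:02d}" for i in range(n_bands)]


def embedding_at_point(lon: float, lat: float, year: int) -> list:
    """Return the 64 embedding values at one coordinate for one year.

    Hypothetical helper: requires the earthengine-api package and a prior
    call to ee.Initialize(); the import is kept local so that the rest of
    this sketch can be used without Earth Engine installed.
    """
    import ee

    point = ee.Geometry.Point([lon, lat])
    image = (
        ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")
        .filterDate(f"{year}-01-01", f"{year + 1}-01-01")
        .filterBounds(point)
        .mosaic()
    )
    # Sample the pixel under the point at the native 10 m scale.
    values = image.reduceRegion(
        reducer=ee.Reducer.first(), geometry=point, scale=10
    ).getInfo()
    return [values[name] for name in band_names()]


# Example call (needs Earth Engine access). The coordinate is the approximate
# city location quoted in the study-area description, not a station position:
# features_2020 = embedding_at_point(-78.5, -0.2167, 2020)
```

Repeating the call for every station and year gives one 64-value row per station–year sample, which is the predictor table used in the modelling step.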

2.4. Machine Learning Models and Evaluation

We trained and evaluated five machine learning regression models using the 64-band annual embeddings (A00–A63) to predict the yearly concentrations of each pollutant. The models tested were Support Vector Regression (SVR) [25], Ridge Regression [26], Random Forest Regressor [27], Gradient Boosting Regressor [28], and k-Nearest Neighbors (KNN) [29]. These models were selected for their diversity in complexity, interpretability, and proven applicability in environmental modeling. SVR is well-suited to high-dimensional datasets with nonlinear relationships and achieved the best overall performance in our study, particularly for NO₂ and SO₂. Ridge Regression, a regularized linear approach, served as a robust and interpretable baseline model, effectively controlling overfitting. Random Forest, a nonlinear ensemble method, captures feature interactions and handles noisy data while providing estimates of feature importance. Gradient Boosting, another ensemble approach, allows for fine-tuned optimization and often achieves high predictive accuracy on structured datasets. KNN, a simple non-parametric model, was included to assess the influence of proximity in the embedding feature space.

The dataset contained between 42 and 49 station–year samples per pollutant, depending on station availability, yielding approximately 60 samples in total for each pollutant. Given this relatively small sample size, all models were implemented with the scikit-learn Python library and trained with a 5-fold cross-validation scheme to ensure robust and balanced evaluation [30]. Hyperparameters were optimized using a grid search strategy to maximize predictive performance, and the same training–testing splits were applied across all models to allow direct comparability [31].
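A compact way to reproduce this kind of comparison in scikit-learn is sketched below. The data are synthetic stand-ins (60 samples by 64 features), and the hyperparameter grids, the use of StandardScaler, and the random seeds are illustrative assumptions; the source does not report the grids that were actually searched.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the station-year table: 60 samples x 64 embedding bands.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 64))
y = 20 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=60)

# One splitter object with a fixed seed, so every model sees identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Candidate models with small, purely illustrative grids.
candidates = {
    "SVR": (
        make_pipeline(StandardScaler(), SVR()),
        {"svr__C": [1, 10, 100], "svr__kernel": ["linear", "rbf"]},
    ),
    "Ridge": (
        make_pipeline(StandardScaler(), Ridge()),
        {"ridge__alpha": [0.1, 1.0, 10.0]},
    ),
    "RandomForest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100], "max_depth": [None, 5]},
    ),
    "GradientBoosting": (
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [100], "learning_rate": [0.05, 0.1]},
    ),
    "KNN": (
        make_pipeline(StandardScaler(), KNeighborsRegressor()),
        {"kneighborsregressor__n_neighbors": [3, 5, 7]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=cv, scoring="r2")
    search.fit(X, y)
    results[name] = search.best_score_  # mean R2 over the five test folds

for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name:>16s}  mean CV R2 = {score:.2f}")
```

Because the grid is tuned and scored on the same folds, the reported score is somewhat optimistic; a nested scheme would separate the two at the cost of even smaller training sets.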

Model evaluation was performed using 5-fold cross-validation (CV), a protocol widely applied in machine learning and remote sensing studies as a balance between computational cost, bias, and variance. Each pollutant dataset comprised approximately 60 samples (station–year combinations), and the 5-fold split yielded approximately 70% for training and 30% for testing in each fold. This ensured sufficient training size for model stability while maintaining independent test samples for robust assessment. We acknowledge that 5-fold CV mixes samples across stations and years, which may yield optimistic results compared to stricter station- or year-based validation. However, for this initial feasibility study, the primary objective was to benchmark the performance of embedding-based models relative to traditional predictors in a consistent manner across all pollutants. Future work will extend this to stricter spatial and temporal hold-out strategies (e.g., Leave-One-Station-Out or Leave-One-Year-Out). In addition, the models are constrained by the relatively small sample size (~60 station–year observations per pollutant) and the use of annual embeddings, which smooth short-term variability and limit predictive skill for pollutants strongly driven by daily meteorological and photochemical processes, such as O₃.
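The stricter hold-out strategies mentioned above map directly onto scikit-learn's grouped splitters. The following sketch shows how Leave-One-Station-Out and Leave-One-Year-Out predictions could be obtained; the complete 9-station by 8-year panel, the synthetic target, and the fixed SVR settings are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic complete panel: 9 stations x 8 years (2017-2024), 64 features each.
rng = np.random.default_rng(1)
n_stations, n_years = 9, 8
stations = np.repeat(np.arange(n_stations), n_years)   # group label per sample
years = np.tile(np.arange(2017, 2017 + n_years), n_stations)
X = rng.normal(size=(n_stations * n_years, 64))
y = 20 + 3 * X[:, 0] + rng.normal(scale=0.5, size=n_stations * n_years)

model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10))
logo = LeaveOneGroupOut()

# Each station (or year) is predicted by a model that never saw it in training.
pred_by_station = cross_val_predict(model, X, y, groups=stations, cv=logo)
pred_by_year = cross_val_predict(model, X, y, groups=years, cv=logo)

print("Leave-One-Station-Out R2:", round(r2_score(y, pred_by_station), 2))
print("Leave-One-Year-Out R2:   ", round(r2_score(y, pred_by_year), 2))
```

With real data, a gap between the shuffled 5-fold score and these grouped scores would indicate how much of the apparent skill comes from samples of the same station or year appearing on both sides of the split.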

Model evaluation focused on three key metrics: the coefficient of determination (R²), the mean absolute error (MAE), and the root mean squared error (RMSE). Final model selection for each pollutant was based primarily on R² performance on the test folds, ensuring that the chosen model provided the best balance between predictive accuracy and generalization ability.
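For reference, the three metrics can be computed from their standard definitions as in the short sketch below. The function name and the toy numbers are hypothetical. The second example illustrates that R² becomes negative whenever the predictions are worse than simply predicting the mean of the observations.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """R2, MAE and RMSE from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # variation around the mean
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(residuals)),
        "RMSE": np.sqrt(np.mean(residuals ** 2)),
    }


# Predictions that track the observations closely.
good = regression_metrics([1, 2, 3, 4], [1.5, 2.0, 2.5, 4.5])
# Predictions that are worse than the mean of the observations.
poor = regression_metrics([1, 2, 3, 4], [4, 3, 2, 1])

print(good)  # R2 = 0.85, MAE = 0.375, RMSE ~ 0.433
print(poor)  # R2 = -3.0, MAE = 2.0, RMSE ~ 2.236
```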

2.5. Feature Importance Analysis (SHAP)

To enhance the interpretability of the modeling results, we applied Shapley Additive Explanations (SHAP) to the best-performing model for each pollutant. SHAP values quantify the contribution of each embedding band to individual predictions, providing a transparent assessment of which spectral–textural features most strongly influence pollutant estimates [32]. This approach allowed us to identify the most relevant embedding dimensions, gain insights into the relationships between pollutant concentrations and surface characteristics, and inform potential dimensionality reduction strategies for future applications.
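To make the attribution idea concrete, the sketch below computes exact Shapley values from scratch for a toy four-feature model. It is not the shap library used in the study and not the authors' code: the function exact_shapley, the linear toy model, and the background sample are all illustrative assumptions, and the brute-force enumeration is only feasible for a handful of features (the 64-band case needs the approximations that shap provides).

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, background):
    """Exact Shapley values for one sample.

    Features outside a coalition are marginalised by keeping the values of
    the background rows; features inside it are set to the values of x.
    """
    n = x.shape[0]

    def value(subset):
        data = background.copy()
        idx = list(subset)
        if idx:
            data[:, idx] = x[idx]
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition with `size` members.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Toy linear model standing in for a fitted regressor.
weights = np.array([2.0, -1.0, 0.5, 0.0])


def predict(data):
    return data @ weights + 3.0


rng = np.random.default_rng(0)
background = rng.normal(size=(20, 4))
x = np.array([1.0, 2.0, -1.0, 0.5])

phi = exact_shapley(predict, x, background)
print("Shapley values:", np.round(phi, 3))
print("Sum of values: ", round(phi.sum(), 3))
print("f(x) - mean f: ", round(predict(x[None, :])[0] - predict(background).mean(), 3))
```

The last two printed numbers agree, which is the additivity property that lets per-band contributions be read as shares of each prediction. The fourth feature has zero weight and therefore receives zero attribution.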

The final models with the highest accuracy were applied to the complete embedding rasters of Quito for the years 2017 and 2024. Predictions were generated at 10 m spatial resolution, enabling a detailed visualization of the spatial distribution of pollutants and their temporal changes within a consistent analytical framework. All coding and model execution were carried out in Google Colab, with model applications and raster processing conducted using Google Earth Engine and the geemap Python package. The resulting maps were further processed and visualized using Python libraries such as matplotlib v3.10.7, rasterio v1.4.0, and shap v0.48.0. Final georeferenced rasters were exported to Google Drive and refined for cartographic figures in ArcGIS Pro v3.5, ensuring compatibility with spatial planning workflows and supporting their use in health risk communication. The workflow diagram is presented in Figure 2.
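Applying a fitted regressor to an embedding raster amounts to reshaping the band stack into a pixel table, predicting, and reshaping back. The sketch below shows this on an in-memory array; it is an assumption-based local illustration (the study ran this step through Google Earth Engine and geemap), and the function predict_raster, the synthetic training data, and the tiny 4 by 5 tile are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR


def predict_raster(model, embedding: np.ndarray) -> np.ndarray:
    """Predict one value per pixel from a (bands, height, width) stack.

    Pixels with any non-finite band value are returned as NaN.
    """
    bands, height, width = embedding.shape
    pixels = embedding.reshape(bands, -1).T            # (height * width, bands)
    valid = np.all(np.isfinite(pixels), axis=1)
    out = np.full(height * width, np.nan)
    if valid.any():
        out[valid] = model.predict(pixels[valid])
    return out.reshape(height, width)


# Fit a stand-in model on synthetic station-year samples.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 64))
y_train = 20 + 3 * X_train[:, 0] + rng.normal(scale=0.5, size=60)
model = SVR(kernel="linear", C=10).fit(X_train, y_train)

# Small synthetic tile standing in for an exported 64-band embedding raster.
tile = rng.normal(size=(64, 4, 5))
tile[:, 0, 0] = np.nan                                  # simulate a no-data pixel

prediction_map = predict_raster(model, tile)
print(prediction_map.shape)
```

Running the same function on the 2017 and 2024 stacks and subtracting the two outputs would give a per-pixel change map of the kind discussed in the results.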

3. Results

3.1. Analysis of Ground-Based REEMAQ Data

The REEMAQ dataset for Quito, spanning from 2017 to 2025, revealed distinct pollutant-specific patterns and spatial variability across the monitoring network (Figure 3). Temporal aggregation of annual means showed that NO₂ concentrations were consistently higher at stations located near high-traffic corridors and in the historic city center, with some sites exceeding the WHO yearly guideline value of 10 µg/m³ in multiple years [33]. SO₂ exhibited strong spatial localization, with the highest values recorded in the southern industrial sector and generally low concentrations in residential and green areas.

PM_2.5 displayed both seasonal and interannual variability, with elevated values during the dry season, particularly in central and southern districts. These peaks likely reflect a combination of traffic emissions, industrial activity, and regional transport from biomass burning events, with annual averages at several sites exceeding the WHO guideline value of 5 µg/m³. CO levels were relatively homogeneous across the network, with modest peaks in areas of dense traffic flow. In contrast, O₃ concentrations were higher in peripheral and elevated regions, showing the typical inverse spatial relationship with NO₂, consistent with photochemical production processes.

Interannual trends suggested modest declines in NO₂ and CO at several central monitoring stations, potentially linked to fleet modernization and traffic management policies. However, increases in PM_2.5 and SO₂ were observed in some southern and peri-urban stations, indicating localized emission growth. Overall, these patterns underscore the heterogeneity of Quito’s air pollution profile and highlight the importance of spatially resolved modeling approaches.

3.2. Machine Learning Model Performance

Table 1 summarizes the predictive performance of the best-performing models for each pollutant using annual AlphaEarth Foundations embeddings. The evaluation was based on R², RMSE, and MAE metrics under a 5-fold cross-validation scheme. The strongest results were obtained for NO₂ and SO₂, with both achieving R² = 0.71 using Support Vector Regression (SVR). For NO₂, k-Nearest Neighbors (KNN) matched this performance, with RMSE values around 2.91–2.92 µg/m³ and MAE between 2.33 and 2.53 µg/m³, indicating good agreement between predictions and observations. For SO₂, Random Forest ranked second with R² = 0.66 and RMSE = 0.43 µg/m³. PM_2.5 predictions reached moderate accuracy (R² = 0.55) with Ridge Regression and Elastic Net, while CO was moderately well predicted by SVR (R² = 0.61) and less so by Gradient Boosting (R² = 0.48). O₃ proved difficult to model using embeddings alone: Random Forest and Ridge Regression produced negative R² values (−0.02 and −0.04), meaning that they performed no better than predicting the mean, which highlights the need for dynamic meteorological predictors. The weak performance for O₃ reflects its dependence on short-term photochemical reactions and meteorological variability, which are smoothed when using annual embeddings. For this reason, we did not emphasize O₃ predictions in Figure 4 and Figure 5 and instead focused on pollutants with more stable source patterns, such as NO₂ and SO₂, where the models achieved substantially stronger performance.

Scatterplots of observed versus predicted values (Figure 4) confirm these patterns. For NO₂ and SO₂, data points cluster closely around the 1:1 line, indicating strong predictive alignment. For PM_2.5 and CO, dispersion increases at higher observed concentrations, with a tendency to underpredict peak events. For O₃, the absence of a clear trend line reflects the weak model fit. Each point in Figure 4 corresponds to a station–year observation from the test folds of the 5-fold cross-validation. With approximately 60 observations available per pollutant and about 20 used for testing in each fold, the scatterplots display ~20 points per pollutant.

Because the models for NO₂ and SO₂ achieved the highest accuracy, we constructed spatial prediction maps for these two pollutants at 10 m resolution (Figure 6 and Figure 7), which captured fine-scale variability across Quito. In 2017, NO₂ hotspots were aligned with major roads and dense urban zones, while by 2024, concentrations had decreased in central areas but increased in rapidly expanding southern peri-urban districts. SO₂ hotspots remained concentrated in the southern industrial corridor in both years, with some intensification observed in 2024. The high spatial resolution of these maps enables their direct use in local policy-making and targeted interventions.

4. Discussion

4.1. Main Findings

The REEMAQ monitoring data and the embedding-based model predictions together provide a clear view of what a fully remote-sensing-driven approach can achieve for urban air-quality modelling in a data-scarce context such as Quito [11].

Ground observations confirm persistent spatial differences across pollutants: NO₂ is concentrated in the central business district and along major traffic corridors where topography-induced inversions trap emissions; SO₂ is localised in the southern industrial corridor; PM_2.5 exhibits seasonal peaks during the dry season and regional biomass-burning events; CO is more evenly distributed but shows localised spikes in traffic-heavy areas; O₃ is highest in peripheral elevated zones dominated by photochemical production away from NO₂ sources [34,35].

The AlphaEarth Foundations embeddings—integrating Sentinel-1, Sentinel-2, Landsat, ERA5-Land meteorology, GRACE hydrology and GEDI LiDAR—captured this heterogeneity particularly well for pollutants with stable sources. Within the embeddings, Sentinel-1 radar provides indirect but valuable context on impervious-surface distribution, built-up density and surface roughness, factors that influence traffic emissions and pollutant accumulation in the urban basin.

NO₂ achieved the highest predictive accuracy, with Support-Vector Regression and k-Nearest Neighbours both reaching R² = 0.71 and RMSE ≈ 2.9 µg/m³, indicating that the embeddings encode strong spatial proxies for traffic networks, impervious-surface distribution and urban morphology [36,37,38]. The good skill for NO₂ reflects its spatial stability and strong correlation with persistent land-use features, which are well represented in multi-sensor composites.

In contrast, pollutants with more dynamic or secondary-formation processes—PM_2.5 and CO—reached only moderate accuracy (R² = 0.55 and 0.61, respectively), as their behaviour depends on short-term meteorology and episodic transport not captured by annual embeddings. O₃ proved the most difficult to predict, yielding negative R² values, consistent with its non-linear chemistry and strong dependence on short-term meteorological variability and precursor interactions [39,40].

4.2. Comparison with Existing Studies

Our results fall within the upper range of previous Quito-focused modelling efforts and are comparable to outcomes reported for data-rich regions (Table 2).

Earlier studies in Quito relied on regression-type approaches with Landsat or MODIS-derived indices and meteorological data [9,10,11], or on Sentinel-5P TROPOMI or MODIS AOD retrievals combined with regression-kriging or land-use-regression models [41,42], generally achieving R² ≈ 0.40–0.65 at 1 km resolution.

By contrast, our embedding-based framework produced R² = 0.71 for NO₂ and SO₂ at 10 m resolution using only globally consistent multi-sensor inputs.

These results also approach those of high-data contexts such as Great Britain, where Random-Forest models with extensive local predictors achieved R² ≈ 0.75–0.80 at 1 km [25].

Importantly, our models required no handcrafted indices, auxiliary land-use datasets or pollutant-specific retrievals. The ten-metre detail demonstrates the potential of embedding-based models for neighbourhood-scale exposure assessment in the Global South.

4.3. SHAP Interpretability

The SHAP analysis provided further insights into model interpretability. For NO₂, a limited set of embedding bands (e.g., A12, A47, and A03) dominated predictions, likely linked to proxies of urban density, impervious materials, and vegetation cycles. SO₂ predictions were strongly influenced by A05, A26, and A51, which appear to capture industrial land-use characteristics. By contrast, PM_2.5 and CO relied on more diffuse patterns across multiple embedding bands, suggesting weaker and less stable predictors. These findings indicate that NO₂ and SO₂ models could be streamlined by prioritizing a smaller subset of highly relevant embedding features, reducing computational demand while maintaining accuracy [43,44]. Bands with near-zero SHAP values provided little to no contribution, but their inclusion ensured completeness in this exploratory application.
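The streamlining idea can be sketched as a ranking of bands by mean absolute SHAP value followed by selection of the top few. The example below uses a synthetic SHAP matrix in which three columns are deliberately inflated to mimic a dominant subset; the function top_bands, the choice of k, and all numbers are illustrative assumptions rather than values from the study.

```python
import numpy as np


def top_bands(shap_values: np.ndarray, k: int = 3) -> list:
    """Rank embedding bands by mean absolute SHAP value and keep the top k.

    shap_values has one row per sample and one column per band.
    """
    names = [f"A{i:02d}" for i in range(shap_values.shape[1])]
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]
    return [(names[i], float(importance[i])) for i in order]


# Synthetic SHAP matrix: 60 samples x 64 bands, mostly near zero.
rng = np.random.default_rng(0)
shap_matrix = rng.normal(scale=0.01, size=(60, 64))
shap_matrix[:, 12] += 2.0      # dominant band (synthetic)
shap_matrix[:, 47] += 1.0
shap_matrix[:, 3] += 0.5

ranking = top_bands(shap_matrix, k=3)
selected = [name for name, _ in ranking]
print(selected)
```

A reduced model would then be refitted on the selected columns only and compared against the full 64-band model under the same cross-validation folds before any bands are dropped for good.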

4.4. Strengths, Novelty, Policy Relevance and Transferability

This study is one of the first applications of satellite embeddings for urban air-quality modelling in the Global South, combining Sentinel-1/2, Landsat, ERA5-Land, GRACE and GEDI to capture dispersion-relevant urban morphology and climate context.

Unlike earlier Quito methods based on Sentinel-5P TROPOMI or MODIS AOD retrievals [10,36,41,42], the embedding approach is cloud-robust, lightweight, and transferable, avoiding the need for emissions inventories or handcrafted indices required by chemical-transport or land-use-regression models.

High-resolution predictions reveal policy-relevant patterns, such as declines in NO₂ in the historic centre but persistent SO₂ in the southern industrial corridor between 2017 and 2024, that can support traffic management and industrial emission control strategies.

Because the embeddings are globally consistent, the framework can be transferred to other Global-South cities with minimal recalibration.

An additional strength is the potential to link fine-scale air-quality predictions with the identification of respiratory-disease hotspots, thereby reinforcing the connection between environmental monitoring and public health planning [45].

The future integration of Sentinel-4/5 products and dense IoT sensor networks will enhance temporal resolution and facilitate near-real-time urban air-quality surveillance [46,47,48].

4.5. Integrated Limitations

Despite these advances, three interrelated factors limit performance:

Sparse monitoring network—Quito has only nine stations, constraining representativeness and increasing spatial-interpolation uncertainty.
Annual aggregation—the use of annual embeddings smooths short-term meteorological and photochemical variability, lowering predictive skill for pollutants such as O₃ that depend on day-to-day processes.
Topographic complexity—Quito’s setting in a high Andean valley (~2850 m) surrounded by steep mountains promotes thermal inversions and weak circulation that trap pollutants [49]. Model smoothing across steep terrain and limited station density resulted in some apparent NO₂ spill-over into mountain slopes, an artefact also observed in other mountainous regions where satellite-based NO₂ retrievals often correlate poorly with surface concentrations due to vertical-profile uncertainties and representation errors [50,51,52].

These findings suggest that predictions of elevated NO₂ in mountainous zones should be interpreted with caution. While embedding-based models provide valuable coverage in regions with sparse monitoring, future efforts should integrate higher-frequency meteorological predictors, localized emission inventories, and additional ground validation to better resolve the complex interactions between topography, circulation, and pollution distribution in Quito and similar Andean cities.

5. Conclusions

This study demonstrates the potential of machine learning combined with Google’s AlphaEarth Foundations satellite embeddings to improve urban air quality modelling in data-scarce regions, using Quito, Ecuador as a representative case.

By relying solely on globally available multi-sensor embeddings—integrating Sentinel-1, Sentinel-2, Landsat, ERA5-Land, GRACE, and GEDI—we generated 10 m-resolution predictions of annual pollutant concentrations without the need for handcrafted features, auxiliary land-use layers, or pollutant-specific retrievals.

The models performed best for NO₂ (R² = 0.71) and SO₂ (R² = 0.71), pollutants with stable, localised emission sources that are well represented in the embeddings, confirming their ability to capture proxies for traffic intensity, industrial activity, and urban morphology.

Performance for PM_2.5, CO, and O₃ was more modest, reflecting the limitation of annual aggregation in representing short-term meteorological variability, chemical transformations, and episodic events such as biomass burning or inversion layers.

Compared with earlier Quito studies, which were mainly based on Sentinel-5P vertical-column retrievals or MODIS AOD products, this embedding-based framework provides a finer-scale, cloud-robust, and transferable alternative that better resolves intra-urban heterogeneity while reducing dependence on local ancillary data, which is often unavailable in Global South settings.

The use of SHAP analysis further improved interpretability by identifying the most influential embedding bands for different pollutants and indicating how model complexity can be reduced in future applications.
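As a minimal sketch of how a mean-absolute-SHAP ranking can point to a reduced band set, the example below uses the closed form that holds for a linear model with independent features, where the SHAP value of feature j on a sample x is w_j (x_j − mean(x_j)). The weights, the four band values, and the five samples are hypothetical, and this is not the study's pipeline; non-linear models such as SVR or KNN would need a model-agnostic explainer instead.

```python
from statistics import mean


def linear_shap(weights, samples):
    """SHAP values for f(x) = b + sum_j w_j * x_j, assuming independent features:
    phi_j(x) = w_j * (x_j - mean_j)."""
    n_features = len(weights)
    feature_means = [mean(row[j] for row in samples) for j in range(n_features)]
    return [
        [weights[j] * (row[j] - feature_means[j]) for j in range(n_features)]
        for row in samples
    ]


def mean_abs_shap(shap_rows):
    """Global importance per feature: mean of |phi_j| over all samples."""
    n_features = len(shap_rows[0])
    return [mean(abs(row[j]) for row in shap_rows) for j in range(n_features)]


# Hypothetical embedding bands, model weights, and samples (illustration only).
bands = ["A00", "A01", "A02", "A03"]
weights = [4.0, 0.1, -2.5, 0.0]
samples = [
    [0.10, 0.30, -0.20, 0.05],
    [0.20, -0.10, 0.10, 0.00],
    [-0.10, 0.20, 0.30, -0.05],
    [0.00, 0.00, -0.10, 0.10],
    [0.30, 0.10, 0.40, -0.10],
]

shap_rows = linear_shap(weights, samples)
importance = mean_abs_shap(shap_rows)

# Rank bands by importance and keep the two most influential ones.
ranking = sorted(zip(bands, importance), key=lambda pair: pair[1], reverse=True)
top_bands = [name for name, _ in ranking[:2]]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
print("bands kept for a streamlined model:", top_bands)
```

A band with zero weight receives zero importance, which mirrors the near-zero bands that a streamlined model could drop; whether accuracy is preserved after dropping them would still need to be checked by refitting.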

Embedding-based models, therefore, help to fill the critical gap left by global air-quality models, which often underperform in cities with sparse monitoring networks or incomplete emissions inventories, by leveraging globally consistent EO data with a minimal set of ground observations.

Looking ahead, combining this approach with pollutant-specific vertical-column retrievals (Sentinel-5P and forthcoming Sentinel-4/5), higher-frequency meteorological inputs, and low-cost IoT sensor networks could enable hybrid frameworks that improve temporal generalisation and support near-real-time urban air quality monitoring.

Overall, this work highlights that embedding-based machine learning offers a scalable, policy-relevant, and globally transferable methodology for neighbourhood-scale air quality prediction, providing urgently needed information for public health protection, climate resilience planning, and sustainable urban development in cities of the Global South.

Author Contributions

Conceptualization, C.I.A.; methodology, C.I.A.; software, C.I.A.; validation, C.I.A.; formal analysis, C.I.A. and N.A.E.L.; investigation, C.I.A.; resources, N.A.E.L.; data curation, C.I.A. and C.A.U.V.; writing—original draft preparation, C.I.A.; writing—review and editing, C.I.A.; visualization, C.I.A.; supervision, C.I.A.; project administration, C.A.U.V.; funding acquisition, C.A.U.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We thank the research team for all the help and support provided while developing this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

REEMAQ	Red Metropolitana de Monitoreo Atmosférico de Quito
PM_2.5	Particulate Matter with aerodynamic diameter ≤2.5 μm
NO₂	Nitrogen Dioxide
SO₂	Sulfur Dioxide
O₃	Ozone
CO	Carbon Monoxide
AEF	AlphaEarth Foundations
ERA5	ECMWF Reanalysis v5
SVR	Support Vector Regression
SHAP	Shapley Additive Explanations

References

Kim, S.Y.; Kerr, G.H.; van Donkelaar, A.; Martin, R.V.; West, J.J.; Anenberg, S.C. Tracking air pollution and CO₂ emissions in 13,189 urban areas worldwide using large geospatial datasets. Commun. Earth Environ. 2025, 6, 311.
Zalakeviciute, R.; Lopez-Villada, J.; Ochoa, A.; Moreno, V.; Byun, A.; Proaño, E.; Mejía, D.; Bonilla-Bedoya, S.; Rybarczyk, Y.; Vallejo, F. Urban Air Pollution in the Global South: A Never-Ending Crisis? Atmosphere 2025, 16, 487.
Kushwaha, M.; Mehta, S.; Arora, P.; Dye, T.; Matte, T. Integrated Use of Low-Cost Sensors to Strengthen Air Quality Management; Vital Strategies: New York, NY, USA, 2022; Available online: https://www.vitalstrategies.org/resources/integrated-use-of-low-cost-sensors-to-strengthen-air-quality-management-in-indian-cities/ (accessed on 16 August 2025).
Castell, N.; Dauge, F.R.; Schneider, P.; Vogt, M.; Lerner, U.; Fishbain, B.; Broday, D.; Bartonova, A. Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates? Environ. Int. 2017, 99, 293–302.
World Health Organization. Types of Pollutants. In Air Quality, Energy and Health; World Health Organization; Available online: https://www.who.int/teams/environment-climate-change-and-health/air-quality-and-health/health-impacts/types-of-pollutants (accessed on 16 August 2025).
Institute of Environmental Science and Research (ESR). Health Effects of Air Pollution; ESR: Porirua, New Zealand, 2022; Available online: https://www.phfscience.nz/media/cofl2ahi/esr-environmental-health-report-health-effects-pollution.pdf (accessed on 16 August 2025).
Vallejo, F.; Villacrés, P.; Yánez, D.; Espinoza, L.; Bodero-Poveda, E.; Díaz-Robles, L.A.; Oyaneder, M.; Campos, V.; Palmay, P.; Cordovilla-Pérez, A.; et al. Prolonged Power Outages and Air Quality: Insights from Quito’s 2023–2024 Energy Crisis. Atmosphere 2025, 16, 274.
Secretaría de Ambiente del Distrito Metropolitano de Quito. Red Metropolitana de Monitoreo de la Calidad del Aire (REMMAQ). This Platform Provides Real-Time Monitoring and Analysis of Air Quality (e.g., PM₁₀, PM_2.5, NO₂, SO₂, O₃, CO, and VOCs) Across Quito. Available online: https://ambiente.quito.gob.ec/red-metropolitana-de-monitoreo-de-la-calidad-del-aire/ (accessed on 16 August 2025).
Alvarez-Mendoza, C.I.; Teodoro, A.; Ramirez-Cando, L. Improving NDVI by removing cirrus clouds with optical remote sensing data from Landsat-8—A case study in Quito, Ecuador. Remote Sens. Appl. Soc. Environ. 2019, 13, 257–274.
Alvarez-Mendoza, C.I.; Teodoro, A.; Ramirez-Cando, L. Spatial estimation of surface ozone concentrations in Quito Ecuador with remote sensing data, air pollution measurements and meteorological variables. Environ. Monit. Assess. 2019, 191, 155.
Alvarez-Mendoza, C.I.; Teodoro, A.C.; Torres, N.; Vivanco, V. Assessment of Remote Sensing Data to Model PM10 Estimation in Cities with a Low Number of Air Quality Stations: A Case of Study in Quito, Ecuador. Environments 2019, 6, 85.
Rolf, E.; Proctor, J.; Carleton, T.; Bolliger, I.; Shankar, V.; Ishihara, M.; Recht, B.; Hsiang, S. A generalizable and accessible approach to machine learning with global satellite imagery. Nat. Commun. 2021, 12, 4392.
Brown, C.F.; Kazmierski, M.R.; Pasquarella, V.J.; Rucklidge, W.J.; Samsikova, M.; Zhang, C.; Shelhamer, E.; Lahera, E.; Wiles, O.; Ilyushchenko, S.; et al. AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data. arXiv 2025, arXiv:2507.22291.
Kozlov, M. Google AI model creates maps of Earth ‘at any place and time’. Nature 2025, 644, 313–314.
Tang, D.; Zhan, Y.; Yang, F. A review of machine learning for modeling air quality: Overlooked but important issues. Atmos. Res. 2024, 300, 107261.
Agbehadji, I.E.; Obagbuwa, I.C. Systematic Review of Machine Learning and Deep Learning Techniques for Spatiotemporal Air Quality Prediction. Atmosphere 2024, 15, 1352.
Méndez, M.; Merayo, M.G.; Núñez, M. Machine learning algorithms to forecast air quality: A survey. Artif. Intell. Rev. 2023, 56, 10031–10066.
Xu, Z.; Zhang, H.; Zhai, A.; Kong, C.; Zhang, J. Stacking Ensemble Learning and SHAP-Based Insights for Urban Air Quality Forecasting: Evidence from Shenyang and Global Implications. Atmosphere 2025, 16, 776.
Tao, C.; Zhang, Q.; Huo, S.; Ren, Y.; Han, S.; Wang, Q.; Wang, W. PM_2.5 pollution modulates the response of ozone formation to VOC emitted from various sources: Insights from machine learning. Sci. Total Environ. 2024, 916, 170009.
Alvarez, C.I.; López, S.; Vásquez, D.; Gualotuña, D. Assessing Air Quality Dynamics during Short-Period Social Upheaval Events in Quito, Ecuador, Using a Remote Sensing Framework. Remote Sens. 2024, 16, 3436.
Secretaria de Ambiente de Quito. DATOS HISTÓRICOS REMMAQ (2004–2024)—Historic Air Quality Data. Available online: https://datosambiente.quito.gob.ec/ (accessed on 16 August 2025).
Google DeepMind. AlphaEarth Foundations Helps Map Our Planet in Unprecedented Detail. Discover (DeepMind Blog), 2025. Available online: https://deepmind.google/discover/blog/alphaearth-foundations-helps-map-our-planet-in-unprecedented-detail/ (accessed on 16 August 2025).
Stamou, A.; Karachaliou, E.; Tavantzis, I.; Bakousi, A.; Dosiou, A.; Tsifodimou, Z.-E.; Stylianidis, E. Satellite Imagery for Comprehensive Urban Morphology and Surface Roughness Analysis: Leveraging GIS Tools and Google Earth Engine for Sustainable Urban Planning. Urban Sci. 2025, 9, 213.
Mohamadi, B.; Abu, G.; Mohamed, O.; Li, H.; Al-Sabbagh, T.A.; Younes, A. Integrating InSAR coherence and air pollution detection satellites to study the impact of war on air quality. Int. J. Appl. Earth Obs. Geoinf. 2025, 142, 104687.
Chatterjee, K.; Kumar, S.S.; Kumar, R.P.; Bandyopadhyay, A.; Swain, S.; Mallik, S.; Al-Rasheed, A.; Abbas, M.; Soufiene, B.O. Future Air Quality Prediction Using Long Short-Term Memory Based on Hyper Heuristic Multi-Chain Model. IEEE Access 2024, 12, 123678–123693.
Khattab, I.G.; Ali, M.C.; Abonazel, M.R.; Elshamy, H.M.; Azazy, A.R. Air Quality Forecasting Based on Socio-Economic Environmental Indicators: Combining Statistical Machine Learning Techniques. Int. J. Anal. Appl. 2025, 23, 183.
Chen, J.; Zhu, S.; Wang, P.; Zheng, Z.; Shi, S.; Li, X.; Xu, C.; Yu, K.; Chen, R.; Kan, H.; et al. Predicting particulate matter, nitrogen dioxide, and ozone across Great Britain with high spatiotemporal resolution based on random forest models. Sci. Total Environ. 2024, 926, 171831.
Alfasanah, Z.; Niam, M.Z.H.; Wardiani, S.; Ahsan, M.; Lee, M.H. Monitoring air quality index with EWMA and individual charts using XGBoost and SVR residuals. MethodsX 2025, 14, 103107.
Alhathloul, S.H.; Mishra, A.K.; Khan, A.A. Low visibility event prediction using random forest and K-nearest neighbor methods. Theor. Appl. Climatol. 2024, 155, 1289–1300.
Singh, S.; Kumar, M.; Verma, B.K.; Kumar, S. Optimizing Air Pollution Prediction with Random Forest Algorithm. Aerosol Sci. Eng. 2025.
Sawah, M.S.; Elmannai, H.; El-Bary, A.A.; Lotfy, K.; Sheta, O.E. Improving air quality prediction using hybrid BPSO with BWAO for feature selection and hyperparameters optimization. Sci. Rep. 2025, 15, 13176.
Yao, T.; Lu, S.; Wang, Y.; Li, X.; Ye, H.; Duan, Y.; Fu, Q.; Li, J. Revealing the drivers of surface ozone pollution by explainable machine learning and satellite observations in Hangzhou Bay, China. J. Clean. Prod. 2024, 440, 140938.
World Health Organization. WHO Global Air Quality Guidelines: Particulate Matter (PM_2.5 and PM₁₀), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide; World Health Organization: Geneva, Switzerland, 2021. Available online: https://www.ncbi.nlm.nih.gov/books/NBK574594/?utm_source=chatgpt.com (accessed on 10 August 2025).
Parra, R. Modeling PM_2.5 Levels Due to Combustion Activities and Fireworks in Quito (Ecuador) for Forecasting Using WRF-Chem. Atmosphere 2025, 16, 495.
Cazorla, M.; Trujillo, M.; Seguel, R.; Gallardo, L. Comparative ozone production sensitivity to NO_X and VOCs in Quito, Ecuador, and Santiago, Chile. Atmos. Chem. Phys. 2025, 25, 7087–7109.
Rowley, A.; Karakuş, O. Predicting air quality via multimodal AI and satellite imagery. Remote Sens. Environ. 2023, 293, 113609.
Mejía, C.D.; Faican, G.; Zalakeviciute, R.; Matovelle, C.; Bonilla, S.; Sobrino, J.A. Spatio-temporal evaluation of air pollution using ground-based and satellite data during COVID-19 in Ecuador. Heliyon 2024, 10, e28152.
Chau, P.N.; Zalakeviciute, R.; Thomas, I.; Rybarczyk, Y. Deep Learning Approach for Assessing Air Quality During COVID-19 Lockdown in Quito. Front. Big Data 2022, 5, 842455.
Tavella, R.A.; das Neves, D.F.; Silveira, G.d.O.; Vieira de Azevedo, G.M.G.; Brum, R.d.L.; Bonifácio, A.d.S.; Machado, R.A.; Brum, L.W.; Buffarini, R.; Adamatti, D.F.; et al. The Relationship Between Surface Meteorological Variables and Air Pollutants in Simulated Temperature Increase Scenarios in a Medium-Sized Industrial City. Atmosphere 2025, 16, 363.
Lakra, K.; Avishek, K. Influence of meteorological variables and air pollutants on fog/smog formation in seven major cities of Indo-Gangetic Plain. Environ. Monit. Assess. 2024, 196, 533.
Kassem, H.; El Hajjar, S.; Abdallah, F.; Omrani, H. Multi-view deep embedded clustering: Exploring a new dimension of air pollution. Eng. Appl. Artif. Intell. 2025, 139, 109509.
Jiménez-Navarro, M.J.; Martínez-Ballesteros, M.; Martínez-Álvarez, F.; Asencio-Cortés, G. Explaining deep learning models for ozone pollution prediction via embedded feature selection. Appl. Soft Comput. 2024, 157, 111504.
Morillas, C.; Alvarez, S.; Serio, C.; Masiello, G.; Martinez, S. TROPOMI NO₂ Sentinel-5P data in the Community of Madrid: A detailed consistency analysis with in situ surface observations. Remote Sens. Appl. Soc. Environ. 2024, 33, 101083.
Alvarez-Mendoza, C.I. The Use of Remote Sensing in Air Pollution Control and Public Health. In Socio-Environmental Research in Latin America; López, S., Ed.; The Latin American Studies Book Series; Springer: Cham, Switzerland, 2023.
Alvarez-Mendoza, C.I.; Teodoro, A.; Freitas, A.; Fonseca, J. Spatial estimation of chronic respiratory diseases based on machine learning procedures—An approach using remote sensing data and environmental variables in Quito, Ecuador. Appl. Geogr. 2020, 123, 102273.
Fraunhofer Institute for Applied Optics and Precision Engineering IOF. ESA Mission Sentinel 5 Launches with Optics from Jena; Fraunhofer Institute for Applied Optics and Precision Engineering IOF: Jena, Germany, 2025.
De Vito, S.; Del Giudice, A.; D’Elia, G.; Esposito, E.; Fattoruso, G.; Ferlito, S.; Formisano, F.; Loffredo, G.; Massera, E.; D’Auria, P.; et al. Future Low-Cost Urban Air Quality Monitoring Networks: Insights from the EU’s AirHeritage Project. Atmosphere 2024, 15, 1351.
Connolly, R.E.; Yu, Q.; Wang, Z.; Chen, Y.-H.; Liu, J.Z.; Collier-Oxandale, A.; Papapostolou, V.; Polidori, A.; Zhu, Y. Long-term evaluation of a low-cost air sensor network for monitoring indoor and outdoor air quality at the community scale. Sci. Total Environ. 2022, 807, 150797.
Mancheno, G.; Jorquera, H. High spatial resolution WRF-Chem modeling in Quito, Ecuador. Environ. Sci. Adv. 2025, 4, 1310–1332.
Chang, B.; Liu, H.; Zhang, C.; Xing, C.; Tan, W.; Liu, C. Relating satellite NO₂ tropospheric columns to near-surface concentrations: Implications from ground-based MAX-DOAS NO₂ vertical profile observations. npj Clim. Atmos. Sci. 2025, 8, 1.
Cazorla, M.; Gallardo, L.; Jimenez, R. The complex Andes region needs improved efforts to face climate extremes. Elem. Sci. Anth. 2022, 10, 92.
Rijsdijk, P.; Eskes, H.; Dingemans, A.; Boersma, K.F.; Sekiya, T.; Miyazaki, K.; Houweling, S. Quantifying uncertainties in satellite NO₂ superobservations for data assimilation and model evaluation. Geosci. Model Dev. 2025, 18, 483–509.

Figure 1. Location of the study area in the urban area of Quito, Ecuador, showing the city’s administrative parish boundaries in black lines. The positions of REEMAQ air quality monitoring stations are indicated by red dots.

Figure 2. Workflow diagram summarizing the main steps of the study.

Figure 3. Annual median (a) CO, (b) NO₂, (c) PM_2.5, (d) SO₂, and (e) O₃ concentrations at Quito’s REEMAQ stations from 2017 to 2025, with WHO limits marked by red dashed lines and the COVID-19 lockdown period highlighted in gray, showing pollutant-specific trends and station-level differences.

Figure 4. Observed versus predicted pollutant concentrations for the best-performing models based on 5-fold cross-validation. Each panel shows results for (a) NO₂, (b) SO₂, (c) PM_2.5, and (d) CO, with the 1:1 line shown for reference. NO₂ and SO₂ models exhibit strong agreement, while PM_2.5 and CO show moderate performance.

Figure 5. SHAP feature importance plots for the best-performing models of each pollutant. The horizontal axis shows the mean absolute SHAP value for each embedding band (A00–A63), representing its contribution to model predictions. Bands with higher SHAP values have greater influence, with a small subset dominating predictions for (a) NO₂ and (b) SO₂ and a more even distribution observed for (c) PM_2.5 and (d) CO. Bands with very low SHAP contributions (close to zero) had negligible influence on predictions for NO₂ and SO₂. Their inclusion in the models ensured that the full 64-dimensional embedding representation was evaluated, but they did not affect predictive performance. This highlights redundancy within the embedding set, suggesting that future streamlined models could prioritize only the most relevant bands without losing accuracy.

Figure 6. Spatial distribution of predicted NO₂ concentrations in Quito at 10 m resolution for 2017 (left) and 2024 (right), using the best-performing model. Higher concentrations are observed along major transport corridors and in central districts, with reductions in the city center and increases in southern peri-urban areas over the study period. Apparent enhancements in some adjacent mountainous areas likely reflect model smoothing and sparse monitoring coverage and should be interpreted with caution.

Figure 7. Spatial distribution of predicted SO₂ concentrations in Quito at 10 m resolution for 2017 (left) and 2024 (right), using the best-performing model. Persistent hotspots are observed in the southern industrial corridor, with localized intensification between 2017 and 2024.

Table 1. Performance metrics of best-performing models.

Pollutant	Model	No. Train	No. Test	MAE (µg/m³)	RMSE (µg/m³)	R²
CO	SVR	42	18	0.06	0.07	0.61
CO	Gradient Boosting	42	18	0.07	0.08	0.48
NO₂	SVR	42	18	2.53	2.91	0.71
NO₂	KNN	42	18	2.33	2.92	0.71
O₃	Random Forest	48	21	3.78	4.56	−0.02
O₃	Ridge	48	21	3.67	4.60	−0.04
PM_2.5	Ridge	49	22	1.20	1.57	0.55
PM_2.5	Elastic Net	49	22	1.21	1.57	0.55
SO₂	SVR	44	19	0.28	0.39	0.71
SO₂	Random Forest	44	19	0.36	0.43	0.66
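For readers who want to reproduce the three scores reported in Table 1, the following is a minimal sketch of the standard definitions of MAE, RMSE, and R². The observed and predicted values are hypothetical and chosen only to show that R² turns negative whenever a model's squared error exceeds that of simply predicting the observed mean, which is how values such as those reported for O₃ arise.

```python
import math


def regression_metrics(observed, predicted):
    """Return (MAE, RMSE, R^2) for paired observed and predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    obs_mean = sum(observed) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)   # variance around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical annual concentrations (µg/m³) at four held-out stations.
observed = [20.0, 24.0, 28.0, 32.0]
good_model = [21.0, 23.0, 29.0, 31.0]   # tracks the station-to-station pattern
poor_model = [25.0, 27.0, 26.0, 25.0]   # nearly flat, misses the pattern

good_mae, good_rmse, good_r2 = regression_metrics(observed, good_model)
poor_mae, poor_rmse, poor_r2 = regression_metrics(observed, poor_model)

print(f"good model: MAE={good_mae:.2f} RMSE={good_rmse:.2f} R2={good_r2:.2f}")
print(f"poor model: MAE={poor_mae:.2f} RMSE={poor_rmse:.2f} R2={poor_r2:.2f}")
```

With small test sets such as the 18 to 22 samples in Table 1, R² is sensitive to a few stations, so it is worth reading it alongside MAE and RMSE rather than alone.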

Table 2. Comparison of predictive performance with existing studies.

Study	Location	Data/Method	Pollutant(s)	Reported R²	Resolution
Alvarez-Mendoza et al. (2019) [10]	Quito	Landsat + meteorological regression	O₃	0.55	30 m
Mejía et al. (2024) [37]	Quito	Sentinel-5P + land-use regression	NO₂	0.40–0.60	1 km
Chau et al. (2022) [38]	Quito	Deep learning + Sentinel-5P	PM_2.5, NO₂	0.45–0.65	1 km
Chen et al. (2024) [25]	Great Britain	Random Forest + multiple predictors	NO₂, O₃	0.75–0.80	1 km
This study	Quito	AlphaEarth embeddings + SVR	NO₂, SO₂	0.71	10 m

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
