Article

Automating Air Pollution Map Analysis with Multi-Modal AI and Visual Context Engineering

1
Department of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Krakow, Mickiewicza 30, 30-059 Krakow, Poland
2
Institute of Physics Belgrade, National Institute of the Republic of Serbia, Pregrevica 118, 11000 Belgrade, Serbia
*
Author to whom correspondence should be addressed.
Atmosphere 2026, 17(1), 2; https://doi.org/10.3390/atmos17010002
Submission received: 6 November 2025 / Revised: 10 December 2025 / Accepted: 13 December 2025 / Published: 19 December 2025
(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)

Abstract

The increasing volume of data from IoT sensors has made manual inspection time-consuming and prone to bias, particularly for spatiotemporal air pollution maps. While rule-based methods are adequate for simple datasets or individual maps, they are insufficient for interpreting multi-year time series data with 1 h timestamps, which require both domain-specific expertise and significant time investment. This limitation is especially critical in environmental monitoring, where analyzing long-term spatiotemporal PM2.5 maps derived from 52 low-cost sensors remains labor-intensive and susceptible to human error. This study investigates the potential of generative artificial intelligence, specifically multi-modal large language models (MLLMs), for interpreting spatiotemporal PM2.5 maps. Both open-source models (Janus-Pro and LLaVA-1.5) and commercial large language models (GPT-4o and Gemini 2.5 Pro) were evaluated. The initial results showed limited performance, highlighting the difficulty of extracting meaningful information directly from raw sensor-derived maps. To address this, a visual context engineering framework was introduced, comprising systematic optimization of colormaps, normalization of intensity ranges, and refinement of map layers and legends to improve clarity and interpretability for AI models. Evaluation using the G-Eval metric demonstrated that visual context engineering increased interpretation accuracy (defined as the detection of PM2.5 spatial extrema) by 32.3% (relative improvement). These findings provide strong evidence that tailored visual preprocessing enables MLLMs to effectively interpret complex environmental time series data, representing a novel approach that bridges data-driven modeling with ecological monitoring and offers a scalable solution for automated, reliable, and reproducible analysis of high-resolution air quality datasets.

1. Introduction

Heightened environmental concentrations of fine particulate matter (PM) result in numerous health-related complications, including respiratory [1,2,3,4,5,6,7] and cardiovascular [1,3,4,8,9,10,11] issues, as well as increased mortality rates [12,13,14]. PMs are classified by their size, with the most commonly measured PM concentrations being those smaller than 10 μm and 2.5 μm, the latter of which can infiltrate deeply into the lungs [15]. Low air quality (AQ) is associated with low-income areas or nations with poor economic conditions, suggesting that air pollution (AP) reflects socioeconomic disparities, as well as disparities between urban and rural areas, with urban settings demonstrating elevated AP levels relative to rural areas [16,17,18,19].
In light of significant evidence linking increased AP to adverse health effects, the World Health Organization (WHO) and the United Nations Sustainable Development Goals (SDGs) established specific objectives, guidelines, and initiatives to combat AP. The WHO released the 2021 global AQ guidelines and established revised thresholds for annual and daily PM concentrations, reducing them from the 2005 edition of the guidelines [20,21]. Likewise, SDGs 3.9 and 11.6 aim to reduce pollution-related health and mortality risks, as well as urban environmental impacts, especially in terms of AQ and waste [22].
Advancements in sensors have reduced the entry cost associated with monitoring AP, thus providing more information for data-driven policy and regulation purposes. Conventional sensors employed by governmental or regulatory agencies can impose a considerable financial strain when attempting to monitor AP on a larger scale or at a fine resolution. Low-cost sensors can potentially fill that gap in order to efficiently monitor AP on a broader scale and/or lower resolution [23] at a substantially reduced cost. Low-cost sensors have been shown to achieve an acceptable accuracy compared to regulatory-grade AQ monitoring sensors when a correction equation is applied [23,24,25], but the data from low-cost sensors should be interpreted with caution [26].
The development of novel analysis, processing, and interpretation workflows for both regulatory-grade and low-cost sensors is particularly beneficial in the era of artificial intelligence (AI). When dealing with a large volume of low-cost sensor data, manual inspection and interpretation of spatially dependent AP data is susceptible to bias and reliant on the domain-specific expertise of the researcher, in addition to being labor-intensive.
Large language models (LLMs) have shown several promising examples of their applicability to AP data. For instance, MLLMs trained on peer-reviewed literature in atmospheric sciences have effectively been utilized to deliver regulatory information, data analysis, and management recommendations for AQ data, as demonstrated by AirGPT [27]. MLLMs were also utilized in a comprehensive literature review regarding the effects of climate change on global AQ with notable precision [28]. Additionally, MLLMs were applied on AQ data during the Los Angeles wildfires in January 2025, successfully generating health recommendations, policies, and summary reports [29]. Beyond text-based applications, a wide array of AI techniques, particularly machine learning, have been applied in AQ forecasting, yielding promising results on AQ datasets [30,31,32,33,34] and references therein.
Despite the broad applicability of MLLMs in AQ research, previous studies have focused primarily on text-based datasets. To the best of our knowledge, no research has systematically investigated the utilization of MLLMs on image-based spatial AQ maps or analyzed the impact of visual context design on the interpretive potential of MLLMs. Considering the positive results of utilizing AI and MLLMs for AP data, it was considered worthwhile to examine this research gap by testing the capabilities of MLLMs in the interpretation of AP maps, particularly PM2.5 maps. Both commercial and open-source MLLMs were applied, with map optimizations such as colormap adjustments, intensity normalization, and layer refinement to enhance interpretability. As in classical modeling, where neglecting key spatial parameters can lead to misinterpretation [35], careful visual preparation of maps is critical for reliable MLLM analysis. The possible outcomes of this research could provide a basis for the automation of AQ interpretability where large data streams are present, thus decreasing labor-intensive processes susceptible to researcher biases. The findings of this research may be extrapolated beyond AQ data to other environmental and ecological datasets, particularly those utilizing high-resolution data and substantial volumes of data, where automation can enhance interpretation, automate labor-intensive processes, and mitigate bias.

2. Materials and Methods

2.1. Data and Maps

The dataset used for the MLLM analysis originates from Kraków (Poland) and its surrounding areas (see Figure 1), covering the period from 00:00 on 1 January 2022 to 00:00 on 1 January 2023. These data represent outputs from forecasting models developed in previous pipeline studies [36,37]. The dataset consists of hourly mean PM2.5 values. With regard to preprocessing, the raw data provided by Airly (www.airly.org (accessed on 1 November 2025)) were used with minimal intervention: only linear interpolation of occasional missing values was applied, cross-checked with nearest-neighbour sensor consistency and with the nearest governmental reference station. No additional cleaning, outlier removal, or bias correction was performed, as preserving the natural variability of the measurements was an important part of the analysis. The dataset was scaled (normalized) solely to ensure numerical stability of the models. For the MLLM experiments, we utilized both historical observations and forecasted values. The distinction between observed and predicted data was secondary, as the primary objective was to evaluate the MLLM’s general capability to comprehend and process the underlying data patterns. In total, 52 low-cost sensors (LCSs) provided by Airly (https://airly.org/ (accessed on 1 November 2025)) were employed. These sensors operate based on the principle of light scattering. Although LCS units generally exhibit higher bias and uncertainty compared to reference-grade monitoring stations, which typically rely on gravimetric methods, numerous studies have reported strong agreement between their measurements and reference data under a wide range of environmental conditions [38,39]. Other research has evaluated the performance of these sensors in detecting smog episodes and in assessing the influence of various environmental factors on prediction accuracy [36].
Furthermore, the effects of meteorological conditions and local topography on sensor readings have been extensively investigated using explainable artificial intelligence (XAI) methods [40]. Kraków was selected as a representative area for studying AP due to its location within a moderate climate zone characterized by pronounced seasonal weather variability. The city is situated in Poland (European Union), where coal combustion during the winter months remains one of the primary sources of deteriorating AP [41], a phenomenon that shares many similarities with urban environments in China [42], India [43], and other regions facing comparable challenges. At the same time, Kraków represents a unique case study, as it is an isolated urban area that has implemented a complete ban on the use of fossil fuels for domestic heating. This transition is occurring under the influence of European Union legislation enforcing strict air quality standards, making Kraków an exceptional natural laboratory for assessing the effectiveness of environmental policies and the dynamics of air pollution under evolving regulatory frameworks.
In the current study, an additional pipeline was developed to generate spatial distribution maps of PM2.5 concentrations. Both forecasted and observed values were interpolated using ordinary kriging [46,47] with the following semi-variogram parameters: sill = 5.0, range = 0.6°, nugget = 0.1; linear model. The interpolation was performed on a regular grid with a resolution of 0.02°. A single global variogram model was applied, and Kriging was performed using the same time-invariant parameters estimated from the combined dataset. This global approach may smooth episodic variations in the autocorrelation structure and attenuate local extremes, which could reduce the visibility of sharp gradients and local hotspots for MLLM interpretation. However, this is acceptable, as the primary objective of the study is to assess the overall capability of MLLMs to interpret spatial patterns in PM2.5. The resulting Kriging grids were visualized as maps representing the spatial distribution of PM2.5 concentrations.
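As a rough illustration of the interpolation step, the sketch below implements ordinary kriging for a single target location in plain Python. It is not the pipeline used in the study: the helper names (`variogram`, `solve`, `ordinary_kriging`), the three sensor coordinates, and the PM2.5 values are hypothetical. The variogram is assumed to be a bounded linear form that rises from the nugget to the sill over the stated range, which is only one possible reading of a "linear model" with sill = 5.0, range = 0.6°, and nugget = 0.1.

```python
import math

# Variogram parameters quoted in the text (range in degrees).
SILL, RANGE, NUGGET = 5.0, 0.6, 0.1


def variogram(h):
    """Bounded linear semi-variogram (an assumed reading of 'linear model')."""
    if h == 0.0:
        return 0.0
    return NUGGET + (SILL - NUGGET) * min(h / RANGE, 1.0)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ordinary_kriging(points, values, target):
    """Estimate the value at `target` from (lon, lat) `points` and `values`."""
    n = len(points)
    # Kriging system: semi-variances between sensors, plus a Lagrange row and
    # column that force the weights to sum to one (unbiasedness constraint).
    a = [[variogram(math.dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    weights = solve(a, b)[:n]
    return sum(w * v for w, v in zip(weights, values)), weights


# Hypothetical sensors (lon, lat in degrees) and hourly PM2.5 values in µg/m3.
sensors = [(19.90, 50.05), (19.95, 50.08), (20.00, 50.03)]
pm25 = [18.0, 42.0, 27.0]

# Evaluate on a small regular grid with the 0.02° spacing mentioned in the text.
grid = [(19.90 + 0.02 * i, 50.03 + 0.02 * j) for j in range(3) for i in range(6)]
surface = [ordinary_kriging(sensors, pm25, node)[0] for node in grid]
```

Because the same variogram parameters are reused for every time step, only the right-hand side of the system changes with the target location. This is the time-invariant, global behaviour described above.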

2.2. Multimodal Large Language Models

These visualizations were used to assess how effectively open-source and commercial multimodal large language models can interpret such data. Multimodal models are capable of processing and integrating multiple modalities such as text, images, and audio. In this study, the modalities used were images and text. Typically, such architectures consist of three main components: a visual encoder, a connector (or projector), and an LLM (Figure 2).
In the Cambrian-1 study by Tong et al. [48], it was shown that MLLMs often face difficulties with the visual component, relying primarily on the language model for interpretation and reasoning. This observation proved highly valuable for the present research and informed several subsequent design decisions. During the experiments, two open-source MLLM architectures were evaluated: Janus-Pro-7B [49] and LLaVA-1.5-7B-HF [50]. For comparison with commercial models, GPT-4o [51] and Gemini 2.5 Pro [52] were also tested.
All models utilized the identical prompt template. Inference was performed with a temperature of 0.3 and default top-k settings. Results were averaged across multiple generations to ensure stability.

2.3. Visual Context Optimization for MLLM Interpretation

To improve the interpretability of PM2.5 spatial maps by MLLMs, a dedicated visual optimization process was implemented. The procedure focused on the selection of color palettes, scaling methods, and normalization techniques that enhance contrast and semantic consistency between visual and textual modalities.
The distribution of PM2.5 concentrations (Figure 3) was first analyzed to identify the characteristic range of values. More than 95.7% of all observations were below 50 µg/m3, while extreme values were rare. Based on this finding, a nonlinear power normalization (PowerNorm) was applied to the input data to improve contrast in low-value regions. The transformation was defined as [53]:
$$c = \left( \frac{x^{*} - v_{\min}}{v_{\max} - v_{\min}} \right)^{\gamma}, \qquad \gamma = 0.4, \quad v_{\min} = 0, \quad v_{\max} = 289,$$
where $c$ is the normalized concentration value, $x^{*}$ is the original (pre-normalized) PM2.5 concentration, $v_{\min}$ is the minimum value used in normalization (here: 0), $v_{\max}$ is the maximum value used in normalization (here: 289), and γ is the exponent controlling the degree of nonlinearity (here: 0.4). The parameter γ = 0.4 was determined experimentally through visual inspection. It was selected to enhance contrast within the predominant low-concentration range, as linear scaling resulted in insufficient visual differentiation.
Alternative γ values were evaluated but proved less effective in highlighting spatial patterns.
For γ < 1, the transformation expands the lower part of the value range, resulting in stronger differentiation between small concentrations while compressing higher values. This adjustment improves visual readability in the most frequent range of data (0–50 µg/m3). To ensure intuitive interpretation both by MLLMs and by human users familiar with air quality indicators, a color palette inspired by Air Quality Index (AQI) scales was adopted (Figure 4). The gradient transitions from green → yellow → orange → red → violet/burgundy, reflecting increasing pollution levels. This color semantics supports fast perception of risk gradients and aligns with common visual conventions used in air quality communication. A fixed maximum scale value of 289 was applied across all maps to maintain stable and comparable visualization between temporal and spatial scenarios. Values exceeding this limit were clipped to the highest color class, indicated with an upward triangular end on the color bar. Since such extreme observations constituted less than 0.02% of the dataset, this clipping did not introduce bias into hotspot detection. This approach prevents inconsistent scaling that could lead to misinterpretation by MLLMs when comparing different map contexts.
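The effect of the power normalization can be checked numerically. The following minimal sketch applies the transformation with the parameters stated above (γ = 0.4, minimum 0, maximum 289) and compares it with plain linear scaling. The function name `power_norm` and the sample concentrations are illustrative assumptions, and clipping out-of-range values to the ends of the scale follows the description of the fixed maximum.

```python
V_MIN, V_MAX, GAMMA = 0.0, 289.0, 0.4  # values quoted in the text


def power_norm(x, v_min=V_MIN, v_max=V_MAX, gamma=GAMMA):
    """Map a PM2.5 concentration to a colormap position in [0, 1]."""
    clipped = min(max(x, v_min), v_max)  # out-of-range values hit the end colors
    return ((clipped - v_min) / (v_max - v_min)) ** gamma


# Compare linear and power scaling for a few sample concentrations (ug/m3).
for concentration in (5, 25, 50, 150, 289, 400):
    linear = min(concentration, V_MAX) / V_MAX
    power = power_norm(concentration)
    print(f"{concentration:>4}  linear={linear:.3f}  power={power:.3f}")
```

Under these assumptions, 50 µg/m3 lands near the middle of the color range rather than in its lowest fifth, so roughly half of the palette is spent on the 0–50 µg/m3 interval where most observations fall. Matplotlib provides a `PowerNorm` class implementing this kind of normalization; whether the study used that class directly is not stated here.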

2.4. Evaluation

Two complementary evaluation approaches were applied to assess the interpretability of the generated visual–textual outputs: manual subjective evaluation and automated G-Eval scoring.

2.4.1. Manual Subjective Evaluation

A manual qualitative assessment was conducted to capture aspects of model perception that are difficult to express numerically. The evaluation focused on the following key capabilities of multimodal large language models:
  • Temporal reasoning—recognizing and describing changes over time between consecutive visualizations (e.g., sequences of PM2.5 maps);
  • Identification of spatial extremes—detecting and describing hot spots (high-concentration regions) and cold spots (low-concentration regions);
  • Recognition of spatial gradients—interpreting gradual attenuation or intensification of PM2.5 concentrations and identifying the general direction of dispersion;
  • Tracking of PM2.5 cluster displacements—detecting movement or shifting of pollution concentration clusters between time steps, indicating dynamic atmospheric transport or meteorological influence.
Each of these aspects was analyzed through direct visual comparison and examination of model-generated descriptions, with attention to whether the model correctly captured environmental dynamics and spatial relations.

2.4.2. Automated Evaluation with G-Eval

To complement the manual analysis, an automated quantitative metric based on G-Eval was used. G-Eval [50] is a large-language-model-based evaluation framework designed to assess text generation quality with strong correlation to human judgments. It employs a chain-of-thought (CoT) reasoning process combined with a form-filling scoring paradigm. The evaluator first generates intermediate evaluation steps from predefined task instructions and criteria (e.g., clarity, coherence, consistency), and then outputs a continuous score between 1 and 5. Instead of using discrete ratings, the probability of each possible score $p(s_i)$ is extracted from the MLLM’s output and combined into a weighted continuous metric:
$$\text{score} = \sum_{i=1}^{n} p(s_i) \times s_i,$$
where $p(s_i)$ is the probability assigned to score $s_i$, and $s_i$ represents each discrete rating option (e.g., 1–5).
This method provides fine-grained, continuous evaluation scores that more accurately capture subtle differences between generated responses. In this study, the G-Eval metric was applied to assess the textual interpretations of PM2.5 spatial maps, their relevance to observed map features, and the accuracy of spatial reasoning. The evaluation criteria focused on verifying the correctness of identifying key measurement points (hot-spots and cold-spots). The dataset used for this evaluation consisted of 100 manually annotated PM2.5 maps, serving as high-quality reference data.
Evaluation criterion.
In the correct answer, the following sensor IDs (hot-spots) must be included: [Hot-spots]. Additionally, the following sensor IDs (cold-spots) must also be included: [Cold-spots]. If a required point (e.g., 99) is replaced with an incorrect one (e.g., 77), a stronger penalty should be applied. If a required point is simply missing (no ID provided), a smaller penalty should be applied.
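The probability-weighted score defined in this section can be sketched in a few lines. The function name `geval_score` and the example probabilities are hypothetical. The renormalization step is an added assumption for cases where the extracted probabilities do not sum exactly to one; it is not part of the formula as stated.

```python
def geval_score(probabilities):
    """Probability-weighted score: sum of p(s_i) * s_i over rating options."""
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must have positive mass")
    # Renormalize in case the evaluator's token probabilities do not sum to one
    # (an assumption of this sketch; the formula in the text takes p as given).
    return sum(score * p / total for score, p in probabilities.items())


# Hypothetical probabilities for the ratings 1 to 5 returned by an evaluator.
example = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}
print(round(geval_score(example), 2))  # 3.7
```

Two responses that would both receive a discrete rating of 4 can thus be separated when the evaluator assigns different probability mass to the neighbouring ratings.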

3. Results

The open-source models evaluated in this study, Janus-Pro-7B and LLaVA-1.5-7B-HF, exhibited significant difficulties in handling visual context. Their spatially grounded responses were often inconsistent and demonstrated limited spatial understanding. Moreover, older versions performed poorly on tasks requiring Optical Character Recognition (OCR), specifically the recognition of numerical sensor ID labels visible on the map images. For instance, when prompted with “List me ID of point with the most pollution on map?”, LLaVA-1.5-7B-HF responded: “The point with the most pollution on the map is located at approximately 1.33, 0.48, and 0.52. This point has a high concentration of red dots, indicating a significant amount of pollution in the area.” Quantitative assessment via G-Eval confirmed these limitations, with open-source models achieving negligible scores (averaging 2.5%). However, error analysis reveals that this underperformance stems primarily from OCR hallucinations rather than a lack of visual attention. While the models often correctly located high-pollution regions visually, they failed to transcribe the specific sensor IDs required by the ground truth. This indicates that for current open-source MLLMs, the bottleneck lies in fine-grained text recognition (OCR) rather than general spatial reasoning.
After the evaluation of open-source models, the focus was shifted toward commercial MLLMs, which were assessed under controlled and systematically varied visualization conditions. Unless stated otherwise, all results presented in this section were obtained using the G-Eval. Commercial models were tested on four differently prepared PM2.5 spatial maps (Figure 5), designed to assess the impact of visual context and preprocessing on model interpretation:
(i) Map I: a correctly prepared map after context engineering, using a customized color scale inspired by the AQI scheme (see Section 2.3, Visual Context Optimization for MLLM Interpretation);
(ii) Map II: a correctly prepared map with an additional overlaid shapefile;
(iii) Map III: a raw, unprocessed map rendered with a variable color scale (Viridis), corresponding to near-default plotting parameters;
(iv) Map IV: a raw Viridis map with an overlaid shapefile.
Figure 5. (a) Map I: context-engineered map (AQI color scale); (b) Map II: context-engineered map + shapefile overlay; (c) Map III: raw map (Viridis color scale); (d) Map IV: raw map (Viridis color scale + shapefile overlay).
The experiments indicate (Table 1) that adding a shapefile overlay strongly shifts model attention toward the overlay, leading to misinterpretation of hot and cold spots. The overlay darkens the color field, prompting the model to infer higher concentrations where a human reader would immediately recognize an added vector layer. With a well-prepared map, the model attains an average G-Eval score of 0.38; with a shapefile overlay, performance drops to 0.253, reducing image-analysis accuracy by over 33.6%. Crucially, visual context aligned with human perception also benefits the models: consistent color normalization combined with an AQI-like, high-contrast color scale (a palette widely associated with air-pollution risk) improves interpretability for both humans and MLLMs. This human–model alignment arises because the AQI palette foregrounds monotonic risk cues (e.g., yellow → orange → red) that models readily associate with textual descriptions of elevated PM2.5, yielding higher G-Eval scores and improving hotspot/cold-spot detection by up to 32.3%. Moreover, an AQI-like palette likely biases the model even at the language-component level by activating learned linguistic priors that link yellow → orange → red with “higher risk,” tightening the mapping between text tokens and the spatial signal. See scores in Table 1.
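The percentages in this section are relative changes between average G-Eval scores. As a small worked example, the snippet below recomputes the overlay-induced drop from the two rounded averages quoted above; the function name `relative_change` is an illustrative choice.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Average G-Eval scores quoted in the text: without and with the shapefile overlay.
without_overlay, with_overlay = 0.38, 0.253
print(f"{relative_change(without_overlay, with_overlay):.1f}%")  # -33.4%
```

With these rounded inputs the drop comes out at about 33.4%, close to the reported 33.6%; the small gap is presumably due to rounding of the averaged scores.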
Manual tests were also conducted for Temporal Reasoning, Identification of Spatial Extremes, Recognition of Spatial Gradients, and Tracking of PM2.5 Cluster Displacements. These tests showed that visual context engineering substantially improves performance on all tasks, especially when models receive more than one kriging map. For cluster tracking over time, a fixed color scale and consistent normalization kept the model aligned: with a single, stable legend, the model could compare panels directly and did not lose track of clusters. Under variable scales, the model often flagged locations that merely shifted position on the legend rather than exhibiting true changes in concentration—failing to capture the actual cold/hot spots. With multiple maps, the model sometimes confused the colormap semantics (e.g., interpreting “warm” as “cold” and vice versa), an error that disappeared when the maps were prepared with consistent scaling, normalization, and explicit context cues.
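The legend-shift effect described for variable scales can be reproduced with a toy example. In the sketch below, the two panels and their values are hypothetical, and linear scaling is used for brevity instead of the power normalization. An identical reading of 40 µg/m3 is placed at very different legend positions when each map is scaled to its own range, but at the same position under a fixed global range.

```python
def per_map_norm(value, panel):
    """Min-max scaling using the range of a single map (variable legend)."""
    low, high = min(panel), max(panel)
    return (value - low) / (high - low)


def fixed_norm(value, v_min=0.0, v_max=289.0):
    """Scaling against one global range shared by every map (fixed legend)."""
    return (min(max(value, v_min), v_max) - v_min) / (v_max - v_min)


# Hypothetical PM2.5 values (µg/m3) at the same sensors for two time steps.
clean_hour = [5.0, 12.0, 25.0, 40.0]
smog_hour = [30.0, 40.0, 85.0, 120.0]

reading = 40.0  # identical concentration in both panels
variable = [per_map_norm(reading, panel) for panel in (clean_hour, smog_hour)]
fixed = [fixed_norm(reading) for _ in (clean_hour, smog_hour)]
print(variable)  # differs between panels
print(fixed)     # identical in both panels
```

Under per-map scaling the same reading appears as the hottest color in the first panel and near the coldest in the second. This is the kind of apparent change, without any real change in concentration, that a model comparing panels can mistake for a moving hot spot.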

4. Discussion

In this study, 52 low-cost sensors were deployed across Kraków and its surrounding areas to investigate the potential application of MLLMs for the spatial analysis of PM2.5 air pollution in 2022. Kraków was chosen as a reference city. Solid fuel burning is completely banned within the city. However, the country still relies heavily on coal for energy, resulting in high air pollution levels, especially in winter. Given this energy mix and geographical location, the city provides a robust evidence base and a valuable reference point for other low- and middle-income countries to assess the effectiveness of air quality improvement policies implemented by the European Union. The results of this study confirm that the use of MLLMs, such as GPT-4o and Gemini 2.5 Pro, can significantly enhance the interpretation of spatial distributions of PM2.5 air pollution. A key factor contributing to this improvement was visual context engineering. This proved to be a critical component determining the quality of model interpretation, rather than merely a supplementary step in data preprocessing. Visual context engineering refers to the deliberate design of visual data, such as maps and satellite imagery, enabling multimodal models to effectively understand spatial relationships and geographic patterns. In practice, this means that the way a map is represented (its color palette, scale, proportions, and layer structure) can substantially influence the model’s ability to interpret spatial information accurately. In this work, techniques such as colormap optimization, scale normalization, and layout standardization contributed to a noticeable improvement in model accuracy. Additionally, contextual map overlays (including administrative boundaries, emission zones, and transportation networks) allowed the models to better identify local anomalies and gradients in PM2.5 concentration. Optimizing the visual context increased the accuracy of model responses by 32.3% (relative improvement).
Furthermore, it improved the coherence and interpretability of the generated explanations. This enhancement led to more stable outputs under varying input conditions, which is crucial for reliable application of MLLMs in environmental data analysis. It is important to emphasize that the effectiveness of MLLMs is highly dependent on visual consistency. Even minor changes in color scale or projection can alter the model’s interpretation, underscoring the need for standardized visual preprocessing in multimodal environmental research. Future research may also focus on automatic map tuning and generation of appropriate color scales tailored to the specific characteristics of the analyzed problem, using Generative Adversarial Networks (GANs) and other machine learning approaches. In summary, visual context engineering should be considered a key component of multimodal analysis, bridging data visualization, geoinformatics, and machine learning. Proper preparation of visual input data is essential for achieving high-quality, interpretable outputs from MLLMs in environmental applications.

5. Conclusions

This work proposes a new methodological direction for map interpretation that leverages MLLMs to extract, represent, and reason about complex spatiotemporal relationships. Unlike traditional approaches based on visual inspection or static statistical methods, the proposed framework enables scalable, data-driven interpretation of spatial phenomena, facilitating the discovery of latent patterns and causal relationships that remain hidden or difficult to detect through conventional analysis. The experiments demonstrate that the quality and consistency of visual context are crucial for accurate spatial map interpretation by MLLMs. The open-source models exhibited limited spatial understanding and struggled with OCR-related and visually grounded reasoning tasks, confirming their immaturity in visual analysis. It should be noted that a quantitative comparison (e.g., G-Eval) between these open-source models and commercial counterparts was deemed methodologically unsuitable. Preliminary tests revealed that open-source architectures exhibited a fundamental perceptual limitation in resolving small sensor ID numbers (OCR failure). Since the models could not correctly perceive the input data, evaluating their subsequent spatial reasoning capabilities would yield misleading results. Therefore, we report these limitations qualitatively rather than through metrics that conflate visual acuity with reasoning logic. In contrast, commercial MLLMs achieved notably better results, though their performance was highly dependent on the preparation and clarity of the input maps. Adding a shapefile overlay significantly distorted interpretation, as the models tended to focus on the vector layer rather than the underlying color field, leading to systematic misjudgment of PM2.5 concentration levels. The highest performance was obtained when using a consistent, AQI-inspired color scale, which improved both human and model interpretability and increased accuracy by over 30%. 
Stable legends and uniform normalization enabled models to reliably track spatial and temporal changes, reducing confusion caused by inconsistent colormaps or scaling. Well-structured visual context is crucial in enabling reliable spatial analysis with MLLMs, especially in environmental monitoring and assessment focused on particulate matter. While current models demonstrate competence in basic map interpretation, they exhibit systematic and reproducible errors when visual cues are ambiguous, inconsistent, or poorly standardized. This highlights the need for continued advancements in both model architecture and visualization protocols to ensure accurate, high-precision geospatial analysis, which is essential for informed environmental decision-making and policy development.

6. Limitations and Future Work

While this study focuses on a single city, sensor network, and pollutant (PM2.5), future studies should include other urban areas, diverse chemical regimes, alternative map types (e.g., wind fields, NO2 plumes), and additional pollutants, including gaseous species, to assess the generalizability of our findings. Further work could also explore the performance of different MLLMs, examine alternative visual context designs, and evaluate robustness across multiple years, seasons, and spatial resolutions. Such extensions would provide a more comprehensive understanding of how MLLMs can interpret environmental maps in varied real-world settings.

Author Contributions

Conceptualization, S.C., M.Z., T.D. and F.A.; methodology, S.C. and M.Z.; validation, S.C., M.Z., F.A. and T.D.; formal analysis, M.Z. and S.C.; investigation, S.C. and M.Z.; resources, S.C., M.Z. and T.D.; data curation, T.D.; writing—original draft preparation, M.Z., S.C. and F.A.; writing—review and editing, M.Z. and S.C.; visualization, M.Z. and S.C.; supervision, M.Z.; project administration, S.C. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Datasets from Airly sensors were analyzed in this study and can be found here: (https://map.airly.org/, accessed on 11 November 2025). API documentation from Airly is available here: (https://developer.airly.org/en/docs, accessed on 11 November 2025). The Airly sensor data analyzed in this study were obtained under a free academic API key issued by Airly S.A. in accordance with the Airly API Service Terms. These terms restrict the redistribution of the raw data and prohibit the sharing of the API key with third parties. Consequently, the underlying data are not publicly available. Researchers can obtain current data directly from Airly by registering for an academic research API key at (https://map.airly.org/ (accessed on 11 November 2025)) (see Airly API Terms of Service).

Acknowledgments

This research project was partly supported by the AGH University of Krakow, Faculty of Geology, Geophysics and Environmental Protection, as part of a statutory project, and partly by the program “Excellence initiative – research university” for the AGH University. Artificial intelligence tools were employed exclusively for research purposes within this study. Specifically, two open-source multimodal large language models (MLLMs), Janus-Pro-7B (https://deepseek-januspro.com/, accessed on 1 November 2025) and LLaVA-1.5-7B-HF (https://llava-vl.github.io/, accessed on 1 November 2025), were used to evaluate open-source performance on spatial reasoning and visual interpretation tasks. For comparison, two commercial MLLMs, GPT-4o (https://platform.openai.com/docs/models/gpt-4o, accessed on 1 November 2025) and Gemini 2.5 Pro (https://deepmind.google/models/gemini/pro/, accessed on 1 November 2025), were also tested under the same experimental conditions. All AI models were utilized solely for experimental analysis and benchmarking. All study design, interpretation, and reporting were carried out by the authors. During the preparation of this work, Grammarly, Writefull, and OpenAI were used for language corrections to refine grammar, improve wording, and enhance overall clarity. After using these tools, the authors reviewed and edited the content as needed and carry full responsibility for the final published article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dominici, F.; Peng, R.D.; Bell, M.L.; Pham, L.; McDermott, A.; Zeger, S.L.; Samet, J.M. Fine Particulate Air Pollution and Hospital Admission for Cardiovascular and Respiratory Diseases. JAMA 2006, 295, 1127–1134.
  2. Adamkiewicz, G.; Liddie, J.; Gaffin, J.M. The Respiratory Risks of Ambient/Outdoor Air Pollution. Clin. Chest Med. 2020, 41, 809–824.
  3. Ren, Z.; Liu, X.; Liu, T.; Chen, D.; Jiao, K.; Wang, X.; Suo, J.; Yang, H.; Liao, J.; Ma, L. Effect of Ambient Fine Particulates (PM2.5) on Hospital Admissions for Respiratory and Cardiovascular Diseases in Wuhan, China. Respir. Res. 2021, 22, 128.
  4. Danesh Yazdi, M.; Wang, Y.; Di, Q.; Wei, Y.; Requia, W.J.; Shi, L.; Sabath, M.B.; Dominici, F.; Coull, B.A.; Evans, J.S.; et al. Long-Term Association of Air Pollution and Hospital Admissions Among Medicare Participants Using a Doubly Robust Additive Model. Circulation 2021, 143, 1584–1596.
  5. Chen, J.; Zeng, Y.; Lau, A.K.H.; Guo, C.; Wei, X.; Lin, C.; Huang, B.; Lao, X.Q. Chronic Exposure to Ambient PM2.5/NO2 and Respiratory Health in School Children: A Prospective Cohort Study in Hong Kong. Ecotoxicol. Environ. Saf. 2023, 264, 114558.
  6. Lin, S.; Xue, Y.; Thandra, S.; Qi, Q.; Hopke, P.K.; Thurston, S.W.; Croft, D.P.; Utell, M.J.; Rich, D.Q. PM2.5 and Its Components and Respiratory Disease Healthcare Encounters—Unanticipated Increased Exposure–Response Relationships in Recent Years after Environmental Policies. Environ. Pollut. 2024, 360, 124585.
  7. Hamanaka, R.B.; Mutlu, G.M. Particulate Matter Air Pollution: Effects on the Respiratory System. J. Clin. Investig. 2025, 135, e194312.
  8. Miller, M.R. The Cardiovascular Effects of Air Pollution: Prevention and Reversal by Pharmacological Agents. Pharmacol. Ther. 2022, 232, 107996.
  9. de Bont, J.; Jaganathan, S.; Dahlquist, M.; Persson, Å.; Stafoggia, M.; Ljungman, P. Ambient Air Pollution and Cardiovascular Diseases: An Umbrella Review of Systematic Reviews and Meta-Analyses. J. Intern. Med. 2022, 291, 779–800.
  10. Alexeeff, S.E.; Deosaransingh, K.; Van Den Eeden, S.; Schwartz, J.; Liao, N.S.; Sidney, S. Association of Long-Term Exposure to Particulate Air Pollution with Cardiovascular Events in California. JAMA Netw. Open 2023, 6, e230561.
  11. Khoshakhlagh, A.H.; Mohammadzadeh, M.; Gruszecka-Kosowska, A.; Oikonomou, E. Burden of Cardiovascular Disease Attributed to Air Pollution: A Systematic Review. Glob. Health 2024, 20, 37.
  12. Di, Q.; Dai, L.; Wang, Y.; Zanobetti, A.; Choirat, C.; Schwartz, J.D.; Dominici, F. Association of Short-Term Exposure to Air Pollution with Mortality in Older Adults. JAMA 2017, 318, 2446–2456.
  13. Chen, J.; Hoek, G. Long-Term Exposure to PM and All-Cause and Cause-Specific Mortality: A Systematic Review and Meta-Analysis. Environ. Int. 2020, 143, 105974.
  14. Lelieveld, J.; Pozzer, A.; Pöschl, U.; Fnais, M.; Haines, A.; Münzel, T. Loss of Life Expectancy from Air Pollution Compared to Other Risk Factors: A Worldwide Perspective. Cardiovasc. Res. 2020, 116, 1910–1917.
  15. Xing, Y.F.; Xu, Y.H.; Shi, M.H.; Lian, Y.X. The Impact of PM2.5 on the Human Respiratory System. J. Thorac. Dis. 2016, 8, E69–E74.
  16. Rentschler, J.; Leonova, N. Global Air Pollution Exposure and Poverty. Nat. Commun. 2023, 14, 4432.
  17. Hajat, A.; Hsia, C.; O’Neill, M.S. Socioeconomic Disparities and Air Pollution Exposure: A Global Review. Curr. Environ. Health Rep. 2015, 2, 440–450.
  18. Strosnider, H.; Kennedy, C.; Monti, M.; Yip, F. Rural and Urban Differences in Air Quality, 2008–2012, and Community Drinking Water Quality, 2010–2015—United States. MMWR Surveill. Summ. 2017, 66, 1–10.
  19. Han, W.; Li, Z.; Guo, J.; Su, T.; Chen, T.; Wei, J.; Cribb, M. The Urban–Rural Heterogeneity of Air Pollution in 35 Metropolitan Regions across China. Remote Sens. 2020, 12, 2320.
  20. World Health Organization. Air Quality Guidelines: Global Update 2005—Particulate Matter, Ozone, Nitrogen Dioxide and Sulfur Dioxide; WHO Regional Office for Europe: Copenhagen, Denmark, 2006.
  21. World Health Organization. WHO Global Air Quality Guidelines: Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide; World Health Organization: Geneva, Switzerland, 2021.
  22. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. 2015. Available online: https://sdgs.un.org/goals (accessed on 25 October 2025).
  23. Malings, C.; Tanzer, R.; Hauryliuk, A.; Saha, P.K.; Robinson, A.L.; Presto, A.A.; Subramanian, R. Fine Particle Mass Monitoring with Low-Cost Sensors: Corrections and Long-Term Performance Evaluation. Aerosol Sci. Technol. 2019, 53, 1272–1287.
  24. Holder, A.L.; Mebust, A.K.; Maghran, L.A.; McGown, M.R.; Stewart, K.E.; Vallano, D.M.; Elleman, R.A.; Baker, K.R. Field Evaluation of Low-Cost Particulate Matter Sensors for Measuring Wildfire Smoke. Sensors 2020, 20, 4796.
  25. Levy Zamora, M.; Rice, J.; Koehler, K. One Year Evaluation of Three Low-Cost PM2.5 Monitors. Atmos. Environ. 2020, 235, 117615.
  26. Feenstra, B.; Papapostolou, V.; Hasheminassab, S.; Zhang, H.; Der Boghossian, B.; Cocker, D.; Polidori, A. Performance Evaluation of Twelve Low-Cost PM2.5 Sensors at an Ambient Air Monitoring Site. Atmos. Environ. 2019, 216, 116946.
  27. Song, J.; Ma, C.; Ran, M. AirGPT: Pioneering the Convergence of Conversational AI with Atmospheric Science. NPJ Clim. Atmos. Sci. 2025, 8, 179.
  28. Lai, Y.; Lu, M.; Chen, G.; Fu, B.; Xu, Z.; Xin, J.; Li, G.; Zhang, W.; Li, B.; Cao, J. Unraveling the Complex Impact of Climate Change on Air Quality in the World. NPJ Clean Air 2025, 1, 25.
  29. Gao, K.; Lu, D.; Li, L.; Chen, N.; He, H.; Du, J.; Xu, L.; Li, J. Instructor–Worker Large Language Model System for Policy Recommendation: A Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires. Int. J. Appl. Earth Obs. Geoinf. 2025, 133, 104774.
  30. Esager, M.W.M.; Ünlü, K.D. Forecasting Air Quality in Tripoli: An Evaluation of Deep Learning Models for Hourly PM2.5 Surface Mass Concentrations. Atmosphere 2023, 14, 478.
  31. Liu, Q.; Cui, B.; Liu, Z. Air Quality Class Prediction Using Machine Learning Methods Based on Monitoring Data and Secondary Modeling. Atmosphere 2024, 15, 553.
  32. Yıldırım Özüpak, F.; Alpsalaz, F.; Aslan, E. Air Quality Forecasting Using Machine Learning: Comparative Analysis and Ensemble Strategies for Enhanced Prediction. Water Air Soil Pollut. 2025, 236, 464.
  33. Makhdoomi, A.; Sarkhosh, M.; Ziaei, S. PM2.5 Concentration Prediction Using Machine Learning Algorithms: An Approach to Virtual Monitoring Stations. Sci. Rep. 2025, 15, 8076.
  34. Utku, A.; Can, U.; Alpsülün, M.; Balıkçı, H.C.; Amoozegar, A.; Pilatin, A.; Barut, A. Advancing Air Quality Monitoring: Deep Learning-Based CNN–RNN Hybrid Model for PM2.5 Forecasting. Atmosphere 2025, 16, 1003.
  35. Zareba, M.; Danek, T.; Zajac, J. On Including Near-surface Zone Anisotropy for Static Corrections Computation—Polish Carpathians 3D Seismic Processing Case Study. Geosciences 2020, 10, 66.
  36. Zareba, M.; Cogiel, S.; Danek, T.; Weglinska, E. Machine Learning Techniques for Spatio-Temporal Air Pollution Prediction to Drive Sustainable Urban Development in the Era of Energy and Data Transformation. Energies 2024, 17, 2738.
  37. Zareba, M.; Cogiel, S.; Danek, T. Spatio-Temporal PM2.5 Forecasting Using Machine Learning and Low-Cost Sensors: An Urban Perspective. Eng. Proc. 2025, 101, 6.
  38. Danek, T.; Zaręba, M. The Use of Public Data from Low-Cost Sensors for the Geospatial Analysis of Air Pollution from Solid Fuel Heating during the COVID-19 Pandemic Spring Period in Krakow, Poland. Sensors 2021, 21, 5208.
  39. Szewczyk, E.; Lupa, M.; Zaręba, M.; Węglińska, E.; Danek, T.; Mishra, A.K. Emergency Medical Interventions in Areas with High Air Pollution: A Case Study from Małopolska Voivodeship, Poland. Atmosphere 2025, 16, 983.
  40. Zareba, M.; Danek, T. A novel methodology for Explainable Artificial Intelligence integrated with geostatistics for air pollution control and environmental management. Ecol. Inform. 2025, 92, 103450.
  41. Zareba, M. Assessing the Role of Energy Mix in Long-Term Air Pollution Trends: Initial Evidence from Poland. Energies 2025, 18, 1211.
  42. Zhang, J.; Li, X.; Pan, L. Policy Effect on Clean Coal-Fired Power Development in China. Energies 2022, 15, 897.
  43. Singh, R.P.; Kumar, S.; Singh, A.K. Elevated Black Carbon Concentrations and Atmospheric Pollution around Singrauli Coal-Fired Thermal Power Plants (India) Using Ground and Satellite Data. Int. J. Environ. Res. Public Health 2018, 15, 2472.
  44. European Environment Agency (EEA). Copernicus Land Monitoring Service 2022—Digital Terrain Model (EU-DEM). 2022. European Union, Copernicus Land Monitoring Service, European Environment Agency (EEA). Available online: https://land.copernicus.eu/imagery-in-situ/eu-dem (accessed on 1 November 2025).
  45. OpenStreetMap Contributors. OpenStreetMap. 2025. Available online: https://www.openstreetmap.org (accessed on 5 November 2025).
  46. Matheron, G. Principles of Geostatistics. Econ. Geol. 1963, 58, 1246–1266.
  47. Cressie, N.A.C. Statistics for Spatial Data, revised ed.; Wiley-Interscience: New York, NY, USA, 1993.
  48. Tong, S.; Brown, E.; Wu, P.; Woo, S.; Middepogu, M.; Akula, S.C.; Yang, J.; Yang, S.; Iyer, A.; Pan, X.; et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv 2024, arXiv:2406.16860.
  49. Chen, X.; Wu, Z.; Liu, X.; Pan, Z.; Liu, W.; Xie, Z.; Yu, X.; Ruan, C. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv 2025, arXiv:2501.17811.
  50. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv 2023, arXiv:2303.16634.
  51. OpenAI. GPT-4o System Card: An Omni-modal Foundation Model for Text, Audio, Image, and Video. arXiv 2024, arXiv:2410.21276.
  52. Google DeepMind. Gemini 2.X: Pushing the Frontier with Advanced Reasoning—Model Card and Technical Report for Gemini 2.5 Pro & Flash. arXiv 2025, arXiv:2507.06261.
  53. Hunter, J.D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95.
Figure 1. Spatial distribution of air pollution monitoring stations included in the study. The map presents the topography of Kraków based on a digital terrain model (DTM), showing the locations and identifiers of Airly sensors (white rectangles) as well as the administrative district boundaries (black lines). The digital terrain model was obtained from the European Union’s Copernicus Land Monitoring Service (2022) and the European Environment Agency (EEA) [44]. The background map is derived from OpenStreetMap data [45].
Figure 2. MLLM Workflow.
Figure 3. Distribution of PM2.5 concentration values with coverage by standard deviation intervals. Over 95% of all observations fall below 50 µg/m3, motivating the use of non-linear power normalization.
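The coverage figures summarized in the Figure 3 caption can be computed in a few lines. The sketch below is illustrative only: the function name `coverage_summary`, its arguments, and the synthetic log-normal sample are assumptions standing in for the Airly measurements, which cannot be redistributed.

```python
import numpy as np


def coverage_summary(values, threshold=50.0, max_k=3):
    """Share of observations below `threshold` and within mean +/- k*std."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # drop NaN/inf sensor gaps
    mean, std = values.mean(), values.std()
    below = float((values < threshold).mean())
    within = {
        k: float((np.abs(values - mean) <= k * std).mean())
        for k in range(1, max_k + 1)
    }
    return below, within


# Synthetic, right-skewed stand-in data (NOT the study's measurements).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=2.5, sigma=0.7, size=10_000)

below_50, within_k = coverage_summary(sample)
print(f"share below 50 ug/m3: {below_50:.1%}")
for k, share in within_k.items():
    print(f"share within mean +/- {k} std: {share:.1%}")
```

A strongly right-skewed distribution of this kind concentrates most values in a narrow low range, which is the situation the caption gives as the motivation for a non-linear normalization.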
Figure 4. Applied colormap and power-law normalization (PowerNorm) used for PM2.5 concentration mapping. The color scale follows Air Quality Index (AQI) semantics to improve contrast and interpretability in low-value ranges.
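A minimal sketch of how an AQI-style colormap can be combined with Matplotlib's `PowerNorm` [53] is given below. The colour list (the widely used US EPA AQI hues), `gamma=0.5`, and the 0–200 µg/m3 range are assumptions chosen for illustration; the caption does not state the exact values used for Figure 4.

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, PowerNorm

# Assumed AQI-like ramp: green, yellow, orange, red, purple, maroon.
AQI_COLORS = ["#00e400", "#ffff00", "#ff7e00", "#ff0000", "#8f3f97", "#7e0023"]
aqi_cmap = LinearSegmentedColormap.from_list("aqi_like", AQI_COLORS)

# gamma < 1 stretches the low end of the range, where most values lie.
# gamma, vmin and vmax are illustrative assumptions, not the paper's settings.
norm = PowerNorm(gamma=0.5, vmin=0.0, vmax=200.0, clip=True)

concentrations = np.array([0.0, 12.5, 50.0, 200.0])  # ug/m3, example values
# PowerNorm computes ((x - vmin) / (vmax - vmin)) ** gamma.
scaled = np.ma.getdata(norm(concentrations))
rgba = aqi_cmap(scaled)  # one RGBA row per concentration

print(scaled)  # [0.   0.25 0.5  1.  ]
print(rgba.shape)
```

With these settings, 50 µg/m3 already maps to the middle of the colour scale, so half of the available colours are spent on the range that holds most observations. The same `aqi_cmap` and `norm` objects could be passed to a plotting call such as `imshow(..., cmap=aqi_cmap, norm=norm)`.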
Table 1. Average G-Eval Scores by Map Variant.
| Map Variant | Map Type (Color Scale) | Shapefile Overlay | G-Eval Score |
|---|---|---|---|
| Map I | Context-engineered (AQI) | No | 0.381 |
| Map II | Context-engineered (AQI) | Yes | 0.253 |
| Map III | Raw map (Viridis) | No | 0.288 |
| Map IV | Raw map (Viridis) | Yes | 0.274 |
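The relative changes implied by Table 1 can be checked with a short calculation. The pairing used here (context-engineered versus raw map at the same overlay setting) is an assumption about how the variants are compared; `relative_change` is a hypothetical helper name.

```python
# Average G-Eval scores copied from Table 1.
scores = {"Map I": 0.381, "Map II": 0.253, "Map III": 0.288, "Map IV": 0.274}


def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Context-engineered vs. raw map, both without the shapefile overlay.
no_overlay = relative_change(scores["Map I"], scores["Map III"])
# Context-engineered vs. raw map, both with the shapefile overlay.
with_overlay = relative_change(scores["Map II"], scores["Map IV"])

print(f"without overlay: {no_overlay:+.1%}")  # +32.3%
print(f"with overlay:    {with_overlay:+.1%}")  # -7.7%
```

Under this pairing, the comparison without the overlay gives a relative gain of about 32.3%, while the two variants with the overlay differ by about 7.7% in the opposite direction.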
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cogiel, S.; Zareba, M.; Danek, T.; Arnaut, F. Automating Air Pollution Map Analysis with Multi-Modal AI and Visual Context Engineering. Atmosphere 2026, 17, 2. https://doi.org/10.3390/atmos17010002

