LLM-ROM: A Novel Framework for Efficient Spatiotemporal Prediction of Urban Pollutant Dispersion
Abstract
1. Introduction
2. Related Work
2.1. Machine Learning for Air Quality Prediction
2.2. CFD Simulations of Pollutant Dispersion
2.3. Reduced-Order Models in Fluid Mechanics
2.4. Large Language Models for Time-Series Forecasting
3. Methodology
3.1. Dilated Convolutional Autoencoder
3.2. Pre-Trained Large Language Models
1. Masked Language Modeling (MLM): Random tokens in the input sequence are masked, and the model predicts the masked tokens.
2. Next Sentence Prediction (NSP): Given two sentences, the model determines whether the second sentence logically follows the first. Through these pre-training tasks, the model acquires robust semantic understanding and contextual awareness.
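As a minimal illustration of the MLM objective, the sketch below corrupts a toy token sequence and records the positions the model would be trained to recover. This is an illustrative toy (plain whitespace tokens, a fixed seed, and a higher-than-BERT mask rate), not BERT's actual tokenizer or masking schedule.

```python
import random

def mask_tokens(tokens, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Corrupt a token sequence for masked language modeling: randomly
    replace tokens with a mask symbol and record the targets to predict."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            targets[i] = tok  # the model is trained to recover these
    return masked, targets

masked, targets = mask_tokens("the pollutant plume rises over the canyon".split())
```

During pre-training, the loss is computed only at the masked positions, which is what gives the model its bidirectional contextual awareness.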
3.3. Reversible Instance Normalization
3.4. Multi-Head Cross-Attention Mechanism
4. Model Architecture
1. Dimensionality Reduction: DCAE projects high-dimensional flow field data into a low-dimensional latent space.
2. Prediction and Reconstruction: The LLM predicts low-dimensional dynamics, followed by a Dilated Convolutional Autodecoder (DCAD) reconstructing the full-dimensional flow field.

1. DCAE: Performs nonlinear dimensionality reduction and reconstruction of flow field data.
2. Temporal Text Embedding: Encodes low-dimensional flow field sequences into text-like vectors.
3. Textual Prompt Template: Integrates meteorological and contextual metadata to guide LLM predictions.
4. Pre-trained LLM: Executes few-shot inference on the embedded flow field representations.
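The end-to-end data flow through these four modules can be sketched as a single forward pass. Every function body below is a deliberately trivial stand-in (flatten-and-truncate encoder, persistence "LLM", tiled decoder) chosen only to make the shapes concrete; none of them is the paper's actual implementation, and the latent dimension of 128 and field shape 28 × 65 × 65 are taken from the experiment description.

```python
import numpy as np

rng = np.random.default_rng(0)

def dcae_encode(field):          # stand-in for the DCAE encoder
    return field.reshape(-1)[:128] * 0.01

def embed_patches(latent_seq):   # stand-in for temporal text embedding
    return np.stack(latent_seq)

def llm_predict(prompt, tokens): # stand-in for the frozen LLM: persistence forecast
    return tokens[-1]

def dcae_decode(latent):         # stand-in for the DCAD decoder
    n = 28 * 65 * 65 // latent.size + 1
    return np.tile(latent, n)[: 28 * 65 * 65].reshape(28, 65, 65)

# Forward pass: encode 12 historical fields, predict one latent step, reconstruct.
history = [rng.random((28, 65, 65)) for _ in range(12)]
latents = [dcae_encode(f) for f in history]
tokens = embed_patches(latents)
z_next = llm_predict("predict PM10 for the next step", tokens)
field_next = dcae_decode(z_next)
```

The key architectural point this illustrates is that prediction happens entirely in the 128-dimensional latent space; the full 28 × 65 × 65 field is only materialized at the decoding step.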
4.1. Dilated Convolutional Autodecoder
4.2. Temporal Text Embedding
1. Preserving Local Context: Aggregating localized details within each segment.
2. Efficient Tokenization: Constructing a compact input token sequence to minimize computational resource consumption.
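The patching operation itself is a sliding window over the latent sequence; a minimal sketch, assuming illustrative values for the patch length L, stride S, and latent dimension d (the paper's own hyperparameters may differ):

```python
import numpy as np

def make_patches(seq, L=4, S=2):
    """Split a (T, d) latent sequence into overlapping patches of length L
    with stride S; each patch is flattened into one token of size L * d."""
    T, d = seq.shape
    n = (T - L) // S + 1          # number of patches
    return np.stack([seq[i * S : i * S + L].reshape(-1) for i in range(n)])

tokens = make_patches(np.zeros((12, 128)), L=4, S=2)
# 12 steps -> (12 - 4) // 2 + 1 = 5 tokens, each of dimension 4 * 128 = 512
```

Because each token now summarizes L consecutive steps, the LLM sees a sequence of 5 tokens instead of 12 raw steps, which is exactly the "compact input token sequence" motivation above.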
4.3. Patch Reprogramming for Physics-to-Text Alignment
4.3.1. Design Motivation: From Direct Word Embedding to Text Prototypes
1. High-dimensional sparsity: Physical tokens must select representations from tens of thousands of words, resulting in an extremely sparse re-encoding space in which numerous irrelevant words introduce noise.
2. Semantic misalignment: Natural language words (e.g., “apple,” “car”) have fundamentally different semantics from physical temporal patterns (e.g., “rapid rise,” “periodic oscillation”), making it difficult to form meaningful alignments through direct projection.
3. Computational redundancy: Learning mappings in a 50,000-dimensional vocabulary space requires a massive number of parameters, which easily leads to overfitting with limited data.
4.3.2. Definition of the Text Prototype Codebook
4.3.3. Semantic Projection of Physical Tokens
4.3.4. Prototype Fusion via Attention Mechanism
4.3.5. Multi-Head Extension
4.3.6. Training Process and Gradient Flow
- The attention weights, and through them the query projection parameters of the alignment layer;
- Each prototype vector in the codebook C.
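The reprogramming step described in Sections 4.3.3–4.3.5 amounts to cross-attention in which queries come from the physical patch tokens and keys/values come from the small prototype codebook. The single-head sketch below uses illustrative sizes (5 tokens of dimension 512, K = 100 prototypes, projection dimension 64) and random weights; in training, gradients would flow into W_q and the codebook exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reprogram(tokens, codebook, W_q, W_k, W_v):
    """Map physical patch tokens into the text-prototype space via
    cross-attention: queries from tokens, keys/values from the codebook."""
    Q = tokens @ W_q                                 # (N, d)
    K = codebook @ W_k                               # (K, d)
    V = codebook @ W_v                               # (K, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, K) attention weights
    return attn @ V                                  # semantic tokens

d_tok, d, K = 512, 64, 100
tokens = rng.standard_normal((5, d_tok))
codebook = rng.standard_normal((K, d))               # K << vocabulary size
W_q = rng.standard_normal((d_tok, d)) * 0.05
W_k = rng.standard_normal((d, d)) * 0.05
W_v = rng.standard_normal((d, d)) * 0.05
semantic = reprogram(tokens, codebook, W_q, W_k, W_v)
```

The multi-head extension of Section 4.3.5 would split the d-dimensional projections into h independent heads and concatenate their outputs; attending over K = 100 prototypes rather than a 50,000-word vocabulary is what removes the sparsity and redundancy problems listed in Section 4.3.1.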
4.4. Textual Prompt Template
1. Dataset Context: This part provides domain background information about the input time series, helping the LLM understand the physical source and basic characteristics of the data. Specifically, it includes: building layout (aspect ratio H/W = 0.5, building height 21 m), meteorological conditions (prevailing wind direction 225°, inflow wind profile following an exponential law, temperature and humidity based on Shanghai 15 July 2024 measured data), pollution source information (cross-shaped line source, emission rate 8.0 × 10−6 kg/(m·s)), building thermal properties (thermal conductivity, albedo, etc.), and vegetation configuration (tree height 10 m, shrub height 1.5 m). This information provides the model with foundational knowledge for understanding the current physical scenario.
2. Task Instruction: This part not only contains the specific prediction directive but also incorporates prior knowledge related to the pollutant dispersion task, aiming to activate the LLM’s general knowledge relevant to physical processes. Specifically, it includes the following:
   - Prediction Directive: Clearly indicates the form of the task, e.g., “Based on the historical latent variable sequence of the previous 12 timesteps (2 h), predict the PM10 concentration field for the next 6 timesteps (1 h) in an autoregressive manner.”
   - Task Prior Knowledge: Introduces fundamental physical principles of pollutant dispersion, such as “Pollutant dispersion in street canyons is governed by advection, turbulent diffusion, and source emissions; the evolution of the concentration field is continuous and satisfies mass conservation,” and “The prediction must follow temporal causality, where future states depend only on historical information.” This prior knowledge helps focus the LLM’s general sequence modeling capabilities on the specific task of physical field prediction.
3. Input Statistics: Augments the time series with key statistical metrics (e.g., min/max values, trends, lag correlations) to support pattern recognition and reasoning.
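Assembling the three components into a single prompt string can be sketched as below. The tag names, wording, and statistic values are illustrative stand-ins, not the paper's exact template.

```python
def build_prompt(context, instruction, stats):
    """Assemble the three-part textual prompt (Dataset Context,
    Task Instruction, Input Statistics) into one string."""
    return (
        f"<Dataset Context> {context}\n"
        f"<Task Instruction> {instruction}\n"
        f"<Input Statistics> min={stats['min']:.2f}, max={stats['max']:.2f}, "
        f"trend={stats['trend']}"
    )

prompt = build_prompt(
    context="Street canyon, H/W = 0.5, wind 225 deg at 1 m/s, PM10 line source",
    instruction="Given the previous 12 steps (2 h), predict the next 6 steps (1 h).",
    stats={"min": 3.1, "max": 18.4, "trend": "rising"},
)
```

The prompt is then passed through the LLM's word embedding layer and prepended to the reprogrammed patch tokens, so the statistics are recomputed for every input window.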
4.5. Pre-Trained LLM and Autoregressive Generation Mechanism
5. Experiment
5.1. Experiment Setups
5.1.1. Evaluation Metrics
1. Root Mean Square Error (RMSE): RMSE measures the average magnitude of the prediction error, giving higher weight to large errors. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{DHW}\sum_{i=1}^{D}\sum_{j=1}^{H}\sum_{k=1}^{W}\left(c_{ijk} - \hat{c}_{ijk}\right)^{2}},$$
where $c_{ijk}$ and $\hat{c}_{ijk}$ denote the ground truth and predicted PM10 concentration at grid point $(i, j, k)$, respectively. The dimensions D, H, and W correspond to the vertical layers, height, and width of the 3D field (D = 28, H = 65, W = 65).
2. Structural Similarity Index (SSIM): SSIM is a perceptual metric that quantifies the similarity between two images in terms of luminance, contrast, and structure. For 3D fields, we compute SSIM slice-wise along the vertical dimension and average the results. Specifically, for each vertical layer $i$, we treat the horizontal slice $c_i$ and its prediction $\hat{c}_i$ as 2D images, and calculate the SSIM using the standard formula:
$$\mathrm{SSIM}(c_i, \hat{c}_i) = \frac{\left(2\mu_{c_i}\mu_{\hat{c}_i} + C_1\right)\left(2\sigma_{c_i\hat{c}_i} + C_2\right)}{\left(\mu_{c_i}^{2} + \mu_{\hat{c}_i}^{2} + C_1\right)\left(\sigma_{c_i}^{2} + \sigma_{\hat{c}_i}^{2} + C_2\right)},$$
where $\mu_{c_i}$ and $\mu_{\hat{c}_i}$ are the mean intensities of the ground truth and predicted slices, $\sigma_{c_i}$ and $\sigma_{\hat{c}_i}$ are their standard deviations, and $\sigma_{c_i\hat{c}_i}$ is the covariance. The constants $C_1$ and $C_2$ are small stabilizers. The overall SSIM for the 3D field is obtained by averaging over all vertical layers:
$$\mathrm{SSIM}_{3D} = \frac{1}{D}\sum_{i=1}^{D}\mathrm{SSIM}(c_i, \hat{c}_i).$$
SSIM ranges from 0 to 1, with values closer to 1 indicating higher structural fidelity. This metric is particularly sensitive to the preservation of spatial patterns such as pollutant plume morphology and concentration gradients.
3. Coefficient of Determination ($R^2$): $R^2$ measures the proportion of variance in the ground truth data that is explained by the model. It is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{D}\sum_{j=1}^{H}\sum_{k=1}^{W}\left(c_{ijk} - \hat{c}_{ijk}\right)^{2}}{\sum_{i=1}^{D}\sum_{j=1}^{H}\sum_{k=1}^{W}\left(c_{ijk} - \bar{c}\right)^{2}},$$
where $\bar{c}$ is the mean of the ground truth concentrations over all grid points. $R^2$ can be negative if the model performs worse than simply predicting the mean, and a value of 1 indicates perfect prediction. This metric provides a global assessment of the model’s explanatory power.
5.1.2. Experiment Configurations and Hyperparameter Choices
5.2. Experimental Dataset Construction
5.3. Training and Reconstruction Performance Evaluation of DCAE
5.4. Prediction Performance Analysis of LLM-ROM on Benchmark Scenarios
5.4.1. Comparative Experiments with Baseline Models
1. LSTM+Autoencoder (w/o DCAE): This baseline does not use DCAE for dimensionality reduction. Instead, the LSTM operates directly on the original concentration field (28 × 65 × 65) to validate the necessity of DCAE compression. The LSTM has 256 hidden units, and the output layer is a fully connected layer that reconstructs the original field.
2. POD-GPR: A representative traditional reduced-order model, using Proper Orthogonal Decomposition to retain 30 modes combined with Gaussian Process Regression for time-series prediction.
3. DCAE-LSTM: A classic deep ROM paradigm, combining the frozen DCAE encoder with a two-layer LSTM (256 hidden units).
4. DCAE-GRU: Gated Recurrent Unit, a simplified variant of LSTM, with 256 hidden units.
5. DCAE-ConvLSTM: Convolutional LSTM, leveraging its convolutional structure to capture spatiotemporal dependencies simultaneously. The architecture consists of two ConvLSTM layers with 64 hidden channels and 3 × 3 kernels.
6. DCAE-Transformer: A standard Transformer encoder (4 layers, 8 heads, without pre-training) serving as a self-attention baseline.
7. DCAE-TFT: Temporal Fusion Transformer, a Transformer variant specifically designed for time-series forecasting, with default configurations.
8. DCAE-U-Net: The latent-space prediction problem is reformulated as an image-to-image translation task. The historical 12-step latent vectors are stacked into an input tensor (12 × 128) and mapped directly to the future 6-step latent vectors via a standard 4-layer U-Net architecture.
9. DCAE-FNO: Fourier Neural Operator, which learns temporal evolution directly in the latent space. The input sequence is treated as discrete samples in function space, and the mapping is learned in the spectral domain via Fourier layers. The model uses 4 Fourier layers with 16 Fourier modes.
1. Necessity of DCAE Dimensionality Reduction: The LSTM+Autoencoder operating directly on the original field achieves an RMSE of 24.43 and an SSIM of only 0.532, performing significantly worse than all DCAE-based methods. This strongly validates the critical role of compressing high-dimensional physical fields into a low-dimensional latent space for time-series prediction—dimensionality reduction not only substantially reduces computational complexity but, more importantly, eliminates redundant information, enabling the model to focus on core dynamical features.
2. Limitations of Traditional Deep ROMs: DCAE-LSTM and DCAE-GRU achieve RMSEs of 12.32 and 11.44, respectively. While these outperform linear methods such as POD-GPR, they still show a considerable gap compared to advanced models. This indicates the inherent limitations of recurrent neural networks in handling long-term temporal dependencies.
3. Effectiveness of Self-Attention Mechanisms: DCAE-ConvLSTM, DCAE-Transformer, and DCAE-TFT outperform DCAE-LSTM, with RMSEs ranging from 8.21 to 9.86, demonstrating the advantages of self-attention mechanisms in modeling long-range dependencies. Among these, TFT, as a Transformer variant specifically designed for time-series forecasting, performs better than the standard Transformer encoder.
4. Performance of Advanced Spatiotemporal Models: U-Net and FNO, as current SOTA models, achieve RMSEs of 7.22 and 6.71, with SSIMs of 0.920 and 0.918, respectively. FNO, as a representative neural operator method, outperforms U-Net, demonstrating the advantages of learning in the spectral domain.
5. Significant Advantages of LLM-ROM: LLM-ROM outperforms all compared methods across all metrics, achieving a 68.3% reduction in RMSE compared to the second-best method (FNO), with SSIM improvements exceeding 5.3%. This advantage strongly validates the core design of this work: by mapping continuous latent vectors to discrete semantic prototypes via the physics-to-text alignment module, the pre-trained general sequence knowledge of LLMs is activated, enabling more accurate capture of complex physical dynamics in the latent space.
5.4.2. Ablation Study
1. DCAE-LLM: Uses the complete prompt template containing all three components described above, i.e., the full Dataset Context, task instruction (including prediction directive and prior knowledge), and input statistics.
2. DCAE-LLM (-): Retains only the input statistics and the most basic prediction directive from Components 1 and 2 (e.g., only the phrase “predict future concentration”), while completely removing the Dataset Context and all task prior knowledge beyond the basic directive. Specifically, the simplified module omits extensive domain background information such as building layout, meteorological conditions, pollution source details, building properties, and physical prior knowledge about pollutant dispersion.
3. DCAE-LLM (–): Completely removes all prompt information, inputting only the patched semantic embedding sequence from the physics-to-text alignment module.
1. The three components of text prompts (Dataset Context, task instruction, input statistics) are encoded via the same word embedding layer into fixed-dimensional vectors (consistent with LLM-ROM’s prompt embedding dimension).
2. These prompt vectors are concatenated with the patched latent vector sequence as input to the LSTM.
3. The LSTM adopts a two-layer architecture with 256 hidden units, consistent with DCAE-LSTM.
4. The output layer maps back to the latent vector space via a fully connected layer, followed by DCAE decoder reconstruction to physical fields.
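The input assembly for this conditioned-LSTM baseline can be sketched as follows. All shapes are illustrative (20 prompt tokens, 5 patch tokens, a 768-dimensional embedding), and the LSTM itself is replaced by a mean-pool stand-in purely to keep the sketch self-contained; only the concatenate-then-readout structure mirrors the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 768, 128          # embedding / latent sizes (illustrative)

# Step 1: prompt components encoded by a (stand-in) word embedding layer.
prompt_vecs = rng.standard_normal((20, d_model))

# Step 2: concatenate prompt vectors with the patched latent-token sequence.
patch_tokens = rng.standard_normal((5, d_model))
seq = np.concatenate([prompt_vecs, patch_tokens], axis=0)   # (25, d_model)

# Step 3: the two-layer, 256-unit LSTM is replaced here by a mean-pool stand-in.
hidden = seq.mean(axis=0)                                   # (d_model,)

# Step 4: a fully connected readout maps back to the latent space, which the
# DCAE decoder would then reconstruct into a physical field.
W_out = rng.standard_normal((d_model, d_latent)) * 0.01
z_pred = hidden @ W_out                                     # (d_latent,)
```

The point of the baseline is that the LSTM receives exactly the same embedded sequence as the LLM; any performance difference must therefore come from what the backbone does with it.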
1. Importance of Domain and Task Prior Knowledge: Compared to the full model (DCAE-LLM), the simplified module (DCAE-LLM (-)) shows a 122% increase in RMSE and a decrease in SSIM to 0.943. This proves that Component 1 (Dataset Context) and the rich task prior knowledge in Component 2 significantly contribute to model performance. Background information such as building layout, meteorological conditions, pollution source characteristics, and physical principles of pollutant dispersion helps the model more accurately understand the physical meaning of the latent space sequence, leading to more precise predictions.
2. Auxiliary Role of Statistical Information: The no-prompt module (DCAE-LLM (–)) performs worse than the simplified module (DCAE-LLM (-)) (RMSE increasing from 4.73 to 5.96), indicating that even in the absence of domain prior knowledge, basic statistical information still provides a useful summary of the sequence state and plays an auxiliary role.
3. Even with identical prompt embedding sequences, the lightweight conditioned LSTM achieves an RMSE of 11.89, significantly higher than LLM-ROM’s 2.13—a performance gap of 458%. This proves that the performance gain primarily stems from the LLM’s inherent sequence modeling capability and semantic understanding ability, not the prompt embedding sequence itself. Although LSTM can receive word embeddings as input, it treats these vectors as ordinary numerical features, learning statistical correlations with the prediction target through its gating mechanisms. In contrast, the pre-trained LLM’s embedding space itself encodes rich semantic knowledge. When prompt embeddings are fed into the LLM, it can activate the semantic understanding acquired during pre-training on massive corpora, truly “comprehending” the physical concepts represented by these prompts and their interrelationships (such as the opposition between “rise” and “fall,” the contrast between “high wind speed” and “low wind speed,” etc.). This fundamental difference in semantic understanding capability is the root cause of the significant performance gap between the lightweight conditioned LSTM and LLM-ROM, even when they receive identical prompt embedding sequences.
1. Decisive Role of Pre-trained Knowledge: Replacing Time-LLM’s pre-trained weights with random initialization causes a dramatic performance drop: RMSE skyrockets from 2.13 to 10.72, a 403% increase, and SSIM drops from 0.967 to 0.816. This result fully demonstrates that in few-shot scenarios with only 116 training samples, the Transformer architecture alone cannot learn effective physical dynamics from scratch. Pre-trained knowledge is the cornerstone of LLM-ROM performance, providing powerful temporal priors that enable rapid adaptation to physical field prediction tasks.
2. Trade-offs in Fine-tuning Strategies: We compared two fine-tuning strategies, parameter-efficient fine-tuning and full fine-tuning. The results show that full fine-tuning brings only a 2.8% accuracy improvement (RMSE from 2.13 to 2.07), but at enormous cost:
   (a) Trainable parameters increase from 1.2M to 1.5B (a 1250× increase).
   (b) Training time extends from 0.25 h to 13 h.
   (c) GPU memory usage skyrockets from 14.2 GB to over 80 GB.
   (d) Overfitting risk increases in small-data scenarios (validation loss decreases, then increases).
This indicates that in data-scarce CFD scenarios, the benefits of full fine-tuning are negligible, while the computational cost is enormous. Parameter-efficient fine-tuning achieves performance close to full fine-tuning with minimal parameters (0.04%), while avoiding overfitting and catastrophic forgetting, making it the superior strategy.
5.5. Transferability Experiment
1. With only 5 samples for fine-tuning, LLM-ROM achieves an RMSE of 5.78 with an SSIM of 0.903. This indicates that the model can rapidly capture the core dynamics of the new scenario even with extremely sparse labels. In contrast, DCAE-LSTM trained from scratch performs poorly, with an RMSE of 19.69 and an SSIM of only 0.588, essentially failing to learn.
2. With only 20 samples for fine-tuning, LLM-ROM’s RMSE drops to 3.24, approaching the full-training performance on the source domain (2.13), with SSIM improving to 0.952. This demonstrates that the model achieves effective domain adaptation with less than 14% of the target domain data.
- Error Doubling Step (EDS): Defined as the number of steps at which RMSE first exceeds twice the RMSE at step 15 (the initial prediction step). This metric measures the model’s ability to maintain low error levels; a larger EDS indicates better long-term stability.
- Average Error Growth Rate (AEGR): Defined as the average per-step increase in RMSE from step 15 to step 145. This metric quantifies the speed of error accumulation.
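Both stability metrics are simple to compute from an RMSE trajectory; a sketch with a purely synthetic error series (not the paper's measured values), where AEGR is taken as the end-to-end RMSE change divided by the number of steps:

```python
def error_doubling_step(steps, rmse_series):
    """First prediction step at which RMSE exceeds twice the RMSE at the
    initial step; returns None if the error never doubles."""
    threshold = 2 * rmse_series[0]
    for step, err in zip(steps, rmse_series):
        if err > threshold:
            return step
    return None

def avg_error_growth_rate(steps, rmse_series):
    """Average per-step RMSE increase between the first and last steps."""
    return (rmse_series[-1] - rmse_series[0]) / (steps[-1] - steps[0])

# Synthetic illustration:
steps = [15, 45, 75, 105, 145]
errs = [1.0, 1.4, 1.9, 2.6, 3.1]
eds = error_doubling_step(steps, errs)     # step 105: 2.6 > 2 * 1.0
aegr = avg_error_growth_rate(steps, errs)  # (3.1 - 1.0) / 130
```

A larger EDS and smaller AEGR together indicate that errors both start accumulating later and accumulate more slowly.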
1. LLM-ROM exhibits the slowest error growth: From step 15 to 145, LLM-ROM’s RMSE increases from 4.58 to 9.28 (a 103% increase), while DCAE-Transformer’s increases by 159% and DCAE-LSTM’s by 187%. LLM-ROM’s final RMSE (9.28) is even lower than DCAE-Transformer’s RMSE around step 35.
2. Significant advantage in EDS: LLM-ROM achieves an EDS of 94 steps, compared to 35 for DCAE-Transformer and 23 for DCAE-LSTM.
3. Lowest AEGR: LLM-ROM’s AEGR is 0.034, less than one-third of DCAE-Transformer’s 0.103 and one-quarter of DCAE-LSTM’s 0.141.
1. Initial Stage (steps 15–30): During this stage, LLM-ROM exhibits extremely slow error growth, with RMSE increasing only marginally from 4.58 to 4.92, an increase of approximately 7.4%. The curve remains nearly flat, indicating that the model maintains very high fidelity in short-to-medium term predictions, accurately capturing the dominant dynamical modes of pollutant dispersion with minimal initial error and negligible accumulation.
2. Mid-term (steps 31–90): Error begins to increase at an approximately linear rate, with RMSE rising steadily from 5.31 to 7.65, an increase of about 44%. This stage corresponds to the low-wind nighttime period (20:00–06:00), during which pollutants continuously accumulate on the leeward side, leading to complex flow structures and enhanced nonlinearity. Although the model maintains reasonably good prediction accuracy, the error accumulation rate accelerates compared to the initial stage. Nevertheless, the growth slope of LLM-ROM remains significantly lower than that of DCAE-Transformer and DCAE-LSTM, demonstrating its superior adaptability to complex dynamics.
3. Long-term (steps 91–145): Error growth slows considerably and gradually approaches saturation, with RMSE increasing slowly from 8.12 to 9.28, an increase of only 14.3%. This phenomenon does not indicate performance degradation but rather results from a combination of factors: (1) pollutant concentrations have physical upper bounds, preventing model predictions from diverging indefinitely from ground truth; (2) LLM-ROM has sufficiently learned the long-term dynamics of the system, such that subsequent error accumulation primarily stems from the slow propagation of initial errors rather than newly introduced biases; and (3) the natural saturation effect inherent in autoregressive prediction—once the model reaches its inherent error ceiling, incremental errors in subsequent steps gradually approach zero. The final RMSE of 9.28 μg/m³ corresponds to a relative error of approximately 18.6% of the source domain mean concentration, which is acceptable for 24 h ultra-long-term prediction and is substantially lower than that of the compared methods.
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| CFD | Computational Fluid Dynamics |
| LLM | Large Language Models |
| CAE | Convolutional Autoencoder |
| ARIMA | Autoregressive Integrated Moving Average |
| PM | Particulate matter |
| LSTM | Long Short-Term Memory |
| ROM | Reduced-Order Models |
| POD | Proper Orthogonal Decomposition |
| DMD | Dynamic Mode Decomposition |
Appendix A. Nomenclature
| Symbol | Description |
|---|---|
| Basic Dimensions | |
| D | Number of grid points in depth direction |
| H | Number of grid points in height direction |
| W | Number of grid points in width direction |
| T | Length of input sequence |
| F | Length of prediction sequence |
| d | Dimension of latent vector |
| | Dimension of text embedding |
| | Dimension of token |
| K | Number of text prototypes |
| L | Patch length |
| S | Patch stride |
| | Number of patches |
| | Length of prompt |
| h | Number of attention heads |
| Physical Variables | |
| | Concentration field at time t |
| | Concentration value at grid point |
| | Predicted concentration field |
| Latent Space Variables | |
| | Latent vector at time t |
| | Normalized latent vector |
| | Predicted latent vector |
| Patching Related | |
| | i-th patch |
| | i-th patch token |
| | Patch projection weight matrix |
| | Patch projection bias |
| Text Prototype Related | |
| C | Text prototype codebook |
| | k-th text prototype |
| | i-th query vector |
| | Alignment layer projection weight |
| | Alignment layer projection bias |
| | Attention weight |
| | Semantic token |
| | Predicted semantic token |
| LLM Related | |
| | Prompt embedding |
| | LLM input sequence at step k |
| | Hidden state of last position |
| Inverse Projection Related | |
| | Inverse projection weight matrix |
| | Inverse projection bias |
| Network Modules | |
| | DCAE encoder |
| | DCAE decoder |
| Greek Letters | |
| μ | Mean |
| σ | Standard deviation |
| | Small constant |
| | Trade-off coefficient |
References
- Wang, Y.; An, Z.; Zhao, Y.; Yu, H.; Wang, D.; Hou, G.; Cui, Y.; Luo, W.; Dong, Q.; Hu, P.; et al. PM2.5-bound polycyclic aromatic hydrocarbons (PAHs) in urban Beijing during heating season: Hourly variations, sources, and health risks. Atmos. Environ. 2025, 349, 121126.
- Linda, J.; Uhlík, O.; Köbölová, K.; Pospíšil, J.; Apeltauer, T. Recognition of wind-induced resuspension of PM10 and its fractions PM10-2.5, PM2.5-1, and PM1 in urban environments. Aerosol Sci. Technol. 2025, 59, 567–579.
- Chen, X.; Wei, F. Reducing PM2.5 and O3 through optimizing urban ecological land form based on its size thresholds. Atmos. Pollut. Res. 2025, 16, 102466.
- Xiao, D.; Fang, F.; Zheng, J.; Pain, C.; Navon, I. Machine learning-based rapid response tools for regional air pollution modelling. Atmos. Environ. 2019, 199, 463–473.
- Arriagada, N.B.; Morgan, G.G.; Buskirk, J.V.; Gopi, K.; Yuen, C.; Johnston, F.H.; Guo, Y.; Cope, M.; Hanigan, I.C. Daily PM2.5 and seasonal-trend decomposition to identify extreme air pollution events from 2001 to 2020 for continental Australia using a random forest model. Atmosphere 2024, 15, 1341.
- Mogollón-Sotelo, C.; Casallas, A.; Vidal, S.; Celis, N.; Ferro, C.; Belalcazar, L. A support vector machine model to forecast ground-level PM2.5 in a highly populated city with a complex terrain. Air Qual. Atmos. Health 2021, 14, 399–409.
- Díaz-González, L.; Trujillo-Uribe, I.; Pérez-Sansalvador, J.C.; Lakouari, N. Handling missing air quality data using bidirectional recurrent imputation for time series and random forest: A case study in Mexico City. AI 2025, 6, 208.
- Wei, Q.; Zhang, H.; Yang, J.; Niu, B.; Xu, Z. PM2.5 concentration prediction using a whale optimization algorithm based hybrid deep learning model in Beijing, China. Environ. Pollut. 2025, 371, 125953.
- Sayeed, A.; Gupta, P.; Henderson, B.; Kondragunta, S.; Zhang, H.; Liu, Y. GOES-R PM2.5 evaluation and bias correction: A deep learning approach. Earth Space Sci. 2025, 12, e2024EA004012.
- Jahromi, M.S.B.; Kalantar, V.; Akhijahani, H.S.; Salami, P. Application of artificial neural network, evolutionary polynomial regression, and life cycle assessment techniques to predict the performance of a new designed solar air ventilator with phase change material. Appl. Therm. Eng. 2025, 269, 126117.
- Yu, Q.; Yuan, H.W.; Liu, Z.L.; Xu, G.M. Spatial weighting EMD-LSTM based approach for short-term PM2.5 prediction research. Atmos. Pollut. Res. 2024, 15, 102256.
- Li, M.M.; Wang, X.L.; Yue, J.; Chen, L.; Wang, W.Y.; Yang, A.Q. PM2.5 prediction based on EOF decomposition and CNN-LSTM neural network. Huan Jing Ke Xue = Huanjing Kexue 2025, 46, 715–726.
- Feng, Y.; Kim, J.S.; Yu, J.W.; Ri, K.C.; Yun, S.J.; Han, I.N.; Qi, Z.; Wang, X. Spatiotemporal Informer: A new approach based on spatiotemporal embedding and attention for air quality forecasting. Environ. Pollut. 2023, 336, 122402.
- Hu, Z.-Z.; Min, Y.-T.; Leng, S.; Li, S.; Lin, J.-R. A multi-factor-fusion framework for efficient prediction of pedestrian-level wind environment based on deep learning. IEEE Access 2025, 13, 52912–52924.
- Su, Y.; Wang, M.C. An AutoML algorithm: Multiple-steps ahead forecasting of correlated multivariate time series with anomalies using gated recurrent unit networks. AI 2025, 6, 267.
- Manekar, K.; Bhaiyya, M.L.; Hasamnis, M.A.; Kulkarni, M.B. Intelligent microfluidics for plasma separation: Integrating computational fluid dynamics and machine learning for optimized microchannel design. Biosensors 2025, 15, 94.
- Boikos, C.; Ioannidis, G.; Rapkos, N.; Tsegas, G.; Katsis, P.; Ntziachristos, L. Estimating daily road traffic pollution in Hong Kong using CFD modelling: Validation and application. Build. Environ. 2025, 267, 112168.
- Badach, J.; Wojnowski, W.; Gebicki, J. Spatial aspects of urban air quality management: Estimating the impact of micro-scale urban form on pollution dispersion. Comput. Environ. Urban Syst. 2023, 99, 101890.
- Schilders, W.H.; van der Vorst, H.A.; Rommes, J. Model Order Reduction: Theory, Research Aspects and Applications; Springer: Berlin/Heidelberg, Germany, 2008; Volume 13.
- Balajewicz, M.; Dowell, E.H. Stabilization of projection-based reduced order models of the Navier–Stokes. Nonlinear Dynam. 2012, 70, 1619–1632.
- Cuong, N.N.; Jaime, P. Efficient and accurate nonlinear model reduction via first-order empirical interpolation. J. Comput. Phys. 2023, 494, 112512.
- Cao, Y.; Zhu, J.; Luo, Z.; Navon, I. Reduced-order modeling of the upper tropical Pacific ocean model using proper orthogonal decomposition. Comput. Math. Appl. 2006, 52, 1373–1386.
- Hesse, H.; Palacios, R. Reduced-order aeroelastic models for dynamics of maneuvering flexible aircraft. AIAA J. 2014, 52, 1717–1732.
- Zhu, C.; Xiao, D.; Fu, J.; Feng, Y.; Fu, R.; Wang, J. A data-driven computational framework for non-intrusive reduced-order modelling of turbulent flows passing around bridge piers. Ocean Eng. 2024, 308, 118308.
- Kerschen, G.; Golinval, J.C.; Vakakis, A.F.; Bergman, L.A. The method of proper orthogonal decomposition for dynamical characterization and order reduction of mechanical systems: An overview. Nonlinear Dynam. 2005, 41, 147–169.
- Wu, P.; Sun, J.; Chang, X.; Zhang, W.; Arcucci, R.; Guo, Y.; Pain, C.C. Data-driven reduced order model with temporal convolutional neural network. Comput. Methods Appl. Mech. Eng. 2020, 360, 112766.
- Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv 2019, arXiv:1905.07213.
- Ostrogonac, S.; Pakoci, E.; Sečujski, M.; Mišković, D. Morphology-based vs unsupervised word clustering for training language models for Serbian. Acta Polytech. Hung. 2019, 16, 183–197.
- Liu, P.; Guo, H.; Dai, T.; Li, N.; Bao, J.; Ren, X.; Jiang, Y.; Xia, S.-T. CALF: Aligning LLMs for time series forecasting via cross-modal fine-tuning. Proc. AAAI Conf. Artif. Intell. 2025, 39, 18915–18923.
- Jin, M.; Wang, S.; Ma, L.; Chu, Z.; Zhang, J.Y.; Shi, X.; Chen, P.-Y.; Liang, Y.; Li, Y.-F.; Pan, S.; et al. Time-LLM: Time series forecasting by reprogramming large language models. arXiv 2023, arXiv:2310.01728.
- Gopali, S.; Namini, S.S.; Abri, F.; Namin, A.S. The performance of the LSTM-based code generated by large language models (LLMs) in forecasting time series data. Nat. Lang. Process. J. 2024, 9, 100120.
- Wu, T.; Ling, Q. GLALLM: Adapting LLMs for spatio-temporal wind speed forecasting via global–local aware modeling. Knowl.-Based Syst. 2025, 323, 113739.
- Xiao, Q.; Chen, X.; Wang, Q.; Guo, X.; Wang, B.; Chen, W.; Wang, Z.; Liu, Y.; Xia, R.; Zou, H.; et al. LLM4Fluid: Large language models as generalizable neural solvers for fluid dynamics. arXiv 2026, arXiv:2601.21681.
- Jun, H.; Park, J.; Zhijian, Y.; Bo, Y.; Li, L.K. Early detection of global instability via a large language model. In Division of Fluid Dynamics Annual Meeting 2025; APS: College Park, MD, USA, 2025.
- Zhang, Y. A Better Autoencoder for Image: Convolutional Autoencoder. In ICONIP17-DCEC. 2018. Available online: https://www.semanticscholar.org/paper/A-Better-Autoencoder-for-Image%3A-Convolutional-Zhang/b1786e74e233ac21f503f59d03f6af19a3699024 (accessed on 26 February 2026).
- Ayachi, R.; Afif, M.; Said, Y.; Atri, M. Strided convolution instead of max pooling for memory efficiency of convolutional neural networks. In Proceedings of the 8th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18); Springer: Berlin/Heidelberg, Germany, 2020; Volume 1, pp. 234–243.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.-H.; Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
- Jia, W.; He, G.; Chundong, L. Analysing the influence of different street vegetation on particulate matter dispersion using microscale simulations. Desalin. Water Treat. 2018, 110, 319–327.
- Wania, A.; Bruse, M.; Blond, N.; Weber, C. Analysing the influence of different street vegetation on traffic-induced particle dispersion using microscale simulations. J. Environ. Manag. 2012, 94, 91–101.
| Parameter | Value |
|---|---|
| Meteorological Conditions | Instantaneous air temperature and humidity sourced from weather station data (Longitude: 121.6°, Latitude: 30.8°) on 15 July 2024; longwave sky radiation calculated by ENVI-met; specific humidity at 2500 m altitude: 7 g/kg; reference roughness length: 0.1 m; wind direction: 180°, 225°; wind speed: 1 m/s (at 10 m height) |
| Street Configuration | Aspect ratio: 0.5 (18/33), 0.9 (18/21), 1.2 (18/15) |
| Pollution Source | Type: PM10 (10 µm diameter) Emission height: 0.3 m Source geometry: Line source |
| Vegetation | Trees: Height = 10 m, Canopy width = 5 m, LAD = 2 (m2/m3) Shrubs: Height = 1.5 m, LAD = 2 (m2/m3) |
| Building | External walls: K = 1.0 (W/(m3·K)), = 0.4 Roof: K = 0.9 (W/(m3·K)), = 0.3 |
| Ground Structure & Thermal Properties | Layers: 20 cm concrete–10 cm sand-soil Concrete: = 0.3, = 1.51 (W/(m·K)), = 2300 (kg/m3) |
| Module | Layer | Dilation Rate | Stride | Output Size |
|---|---|---|---|---|
| Input | - | - | - | |
| Block 1 | 3D Dilated Convolution | 1 | 1 | |
| | 3D Strided Convolution | 1 | 2 | |
| Block 2 | 3D Dilated Convolution | 2 | 1 | |
| | 3D Strided Convolution | 1 | 2 | |
| Block 3 | 3D Dilated Convolution | 4 | 1 | |
| | 3D Strided Convolution | 1 | 2 | |
| Block 4 | 3D Dilated Convolution | 8 | 1 | |
| | 3D Strided Convolution | 1 | 2 | |
| Block 5 | 3D Dilated Convolution | 16 | 1 | |
| | 3D Strided Convolution | 1 | 2 | |
| Flatten | Flatten | - | - | 4086 |
| Fully Connected Layer | Linear | - | - | 128 |
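As a rough sanity check on the encoder layout above, the spatial shrinkage implied by the table's dilation rates and strides can be traced with the standard convolution output-size formula. This is an illustrative sketch only: the kernel size (3), the padding scheme ("same" padding for the dilated layers, padding 1 for the strided layers), and the example input size are assumptions, since the table's output-size column did not survive extraction.

```python
# Sketch: trace how the five dilated + strided conv blocks shrink one spatial
# dimension. Kernel size 3, "same" padding for the dilated layers, and the
# example input size 96 are assumptions; only the dilation rates and strides
# come from the table above.

def conv_out(n, kernel=3, stride=1, dilation=1, padding=0):
    """Standard convolution output-size formula for one dimension."""
    effective = dilation * (kernel - 1) + 1  # dilated kernel footprint
    return (n + 2 * padding - effective) // stride + 1

def encoder_trace(n):
    sizes = [n]
    for dilation in (1, 2, 4, 8, 16):  # Blocks 1-5, as in the table
        # Dilated conv, stride 1, "same" padding: spatial size unchanged.
        n = conv_out(n, dilation=dilation, padding=dilation)
        # Strided conv, stride 2, padding 1: spatial size roughly halves.
        n = conv_out(n, stride=2, padding=1)
        sizes.append(n)
    return sizes

print(encoder_trace(96))  # → [96, 48, 24, 12, 6, 3]
```

Under these assumptions each block halves the spatial resolution, after which the flatten and linear layers compress the result to the 128-dimensional latent code listed in the table.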
| Metric | DCAE | POD |
|---|---|---|
| Average RMSE (1 × 10−5) | 2.86 | 4.42 |
| Number of parameters (M) | 0.25 | 3.55 |
| Training time (h) | 0.43 | 0.1 |
| Prediction time (s) | 14.7 | 18.2 |
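For context on the comparison above, the POD baseline is a linear reduced-order model: snapshots are stacked into a matrix and a truncated SVD supplies the encode/decode pair. The sketch below illustrates this with synthetic data; the snapshot count, state dimension, and rank are illustrative values, not the paper's settings.

```python
import numpy as np

# Sketch of a POD baseline: stack snapshots into a matrix, truncate its SVD,
# and use the leading left-singular vectors as a linear encode/decode basis.
# Snapshot count (40), state dimension (500), and rank (8) are illustrative.

rng = np.random.default_rng(0)
snapshots = rng.standard_normal((500, 40))   # 40 snapshots of a 500-dim state

U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
rank = 8
basis = U[:, :rank]                          # POD modes (columns)

coeffs = basis.T @ snapshots                 # encode: project onto the modes
reconstruction = basis @ coeffs              # decode: linear reconstruction

# By the Eckart-Young theorem, the relative Frobenius reconstruction error
# equals the energy in the discarded singular values.
err = np.linalg.norm(snapshots - reconstruction) / np.linalg.norm(snapshots)
bound = np.sqrt((s[rank:] ** 2).sum() / (s ** 2).sum())
print(f"relative reconstruction error: {err:.3f}")
```

The contrast with the table is that POD's encode/decode maps are linear projections, whereas the DCAE learns a nonlinear pair, which is why the DCAE reaches a lower RMSE at a fraction of the parameter count despite its longer training time.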
| Method | RMSE (1 × 10−2 μg/m3) | SSIM | | Trainable Parameters |
|---|---|---|---|---|
| LSTM+Autoencoder | 24.43 | 0.532 | 0.326 | 25.6M |
| POD-GPR | 18.86 | 0.703 | 0.654 | - |
| DCAE-LSTM | 12.32 | 0.848 | 0.823 | 8.2M |
| DCAE-GRU | 11.44 | 0.853 | 0.839 | 8.0M |
| DCAE-ConvLSTM | 8.57 | 0.884 | 0.875 | 9.3M |
| DCAE-Transformer | 9.86 | 0.878 | 0.866 | 12.1M |
| DCAE-TFT | 8.21 | 0.891 | 0.886 | 9.5M |
| DCAE-U-Net | 7.22 | 0.920 | 0.897 | 14.2M |
| DCAE-FNO | 6.71 | 0.918 | 0.911 | 4.8M |
| LLM-ROM (Ours) | 2.13 | 0.967 | 0.963 | 1.2M |
| Configuration | RMSE (1 × 10−2 μg/m3) | SSIM | | RMSE Increase |
|---|---|---|---|---|
| DCAE-LLM | 2.13 | 0.967 | 0.963 | - |
| DCAE-LLM (-) | 4.73 | 0.943 | 0.936 | +122% |
| DCAE-LLM (–) | 5.96 | 0.932 | 0.926 | +179% |
| Lightweight conditioned LSTM | 11.89 | 0.851 | 0.834 | +458% |
| Configuration | Trainable Parameters | RMSE (1 × 10−2 μg/m3) | SSIM | |
|---|---|---|---|---|
| LLM-ROM (ours) | 1.2M | 2.13 | 0.967 | 0.963 |
| w/o LoRA | 1.5B | 2.07 | 0.968 | 0.965 |
| w/o pre-training | 1.2M | 10.72 | 0.816 | 0.709 |
| Aspect Ratio (H/W) | Wind Direction (°) | Vegetation |
|---|---|---|
| 0.5, 0.9 | 180, 225 | trees, bushes |
| 1.2 | 180, 225 | bushes |
| Aspect Ratio (H/W) | Wind Direction (°) | Vegetation |
|---|---|---|
| 1.2 | 180, 225 | trees |
| Fine-Tuning Samples | Method | RMSE (1 × 10−2 μg/m3) | SSIM | |
|---|---|---|---|---|
| 5 samples | DCAE-LSTM | 19.69 | 0.588 | 0.464 |
| | LLM-ROM (Ours) | 5.78 | 0.903 | 0.894 |
| 10 samples | DCAE-LSTM | 16.69 | 0.678 | 0.583 |
| | LLM-ROM (Ours) | 4.71 | 0.929 | 0.921 |
| 20 samples | DCAE-LSTM | 12.41 | 0.767 | 0.729 |
| | LLM-ROM (Ours) | 3.24 | 0.952 | 0.945 |
| Method | RMSE @ 15-Step | RMSE | Error Doubling Step | AEGR |
|---|---|---|---|---|
| DCAE-LSTM | 10.24 | 29.34 | 23 | 0.141 |
| DCAE-Transformer | 8.56 | 22.13 | 35 | 0.103 |
| LLM-ROM (Ours) | 4.58 | 9.28 | 94 | 0.034 |
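The two rollout-stability metrics in the table can be computed from a per-step RMSE curve; the sketch below shows one plausible reading. Both definitions are assumptions on my part, since the text here does not define them: "error doubling step" is taken as the first step whose RMSE reaches twice the first-step RMSE, and AEGR as the average per-step logarithmic growth rate of the error.

```python
import math

# Sketch of two rollout-stability metrics, computed from a per-step RMSE
# curve. Both definitions are assumptions: "error doubling step" = first step
# whose RMSE reaches twice the first-step RMSE; AEGR = average per-step log
# growth rate of the error over the rollout horizon.

def error_doubling_step(rmse):
    first = rmse[0]
    for step, value in enumerate(rmse, start=1):
        if value >= 2.0 * first:
            return step
    return None  # error never doubled within the horizon

def aegr(rmse):
    steps = len(rmse) - 1
    return math.log(rmse[-1] / rmse[0]) / steps

curve = [1.0 * 1.05 ** t for t in range(30)]  # synthetic 5%-per-step growth
print(error_doubling_step(curve), round(aegr(curve), 4))  # → 16 0.0488
```

Under this reading, a larger doubling step and a smaller AEGR both indicate slower autoregressive error accumulation, which matches the ordering of the three methods in the table.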
| Method | Single Prediction Time (s) | Scenario Adaptation Time (s) | Speedup |
|---|---|---|---|
| CFD simulation | 20,000 | 20,000 | — |
| LLM-ROM (ours) | 30 | 2030 (2000 s for 10% CFD data + 30 s extrapolation) | |
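The runtime figures above imply the speedups directly; the short arithmetic sketch below just makes the bookkeeping explicit, reading the adaptation cost as 10% of one CFD run (to generate fine-tuning data) plus one extrapolation pass.

```python
# Worked arithmetic behind the runtime table: one full CFD run takes ~20,000 s
# and one LLM-ROM prediction ~30 s. Adapting to a new scenario is read here as
# 10% of a CFD run (fine-tuning data) plus one extrapolation pass.

cfd_time = 20_000        # seconds per full CFD simulation
rom_predict = 30         # seconds per LLM-ROM prediction

adaptation = 0.10 * cfd_time + rom_predict   # 10% CFD data + 30 s extrapolation
single_speedup = cfd_time / rom_predict      # per-prediction speedup
adaptation_speedup = cfd_time / adaptation   # speedup including adaptation cost

print(adaptation, round(single_speedup), round(adaptation_speedup, 1))
# → 2030.0 667 9.9
```

So a trained model is roughly three orders of magnitude faster per prediction, while adapting to an unseen scenario still yields close to a tenfold saving over rerunning the CFD from scratch.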
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wu, P.; Qin, Z.; Yang, Y. LLM-ROM: A Novel Framework for Efficient Spatiotemporal Prediction of Urban Pollutant Dispersion. AI 2026, 7, 104. https://doi.org/10.3390/ai7030104