Article

LLM-ROM: A Novel Framework for Efficient Spatiotemporal Prediction of Urban Pollutant Dispersion

School of Computer Science and Engineering, Shanghai University, Shanghai 200444, China
* Author to whom correspondence should be addressed.
AI 2026, 7(3), 104; https://doi.org/10.3390/ai7030104
Submission received: 31 January 2026 / Revised: 26 February 2026 / Accepted: 7 March 2026 / Published: 11 March 2026

Abstract

Deep learning-based flow field prediction for microclimate pollutant dispersion represents an emerging and promising methodology, where effectively integrating meteorological, spatial, and temporal information remains a critical challenge. To address this, we propose a novel non-intrusive reduced-order model (ROM) that synergizes a Dilated Convolutional Autoencoder (DCAE) with pre-trained large language models (LLMs). The DCAE, leveraging nonlinear mapping, was employed for extracting low-dimensional spatiotemporal flow field features. These features were then combined with textual prototypes via text embedding to enable few-shot inference with the LLM-based flow field prediction method. To optimize the utilization of pre-trained LLMs, we designed a specialized textual description template tailored for pollutant dispersion data, which enriches the contextual input of meteorological conditions to guide model predictions. Experimental validation through three-dimensional urban canyon simulations demonstrated the efficacy of the convolutional autoencoder and LLM-based framework in predicting pollutant dispersion flow fields. The proposed method exhibits remarkable transfer learning capabilities across varying street canyon geometries and meteorological conditions while achieving a 9.85× acceleration in prediction compared to Computational Fluid Dynamics (CFD).

1. Introduction

Fine particulate matter (PM), as a critical atmospheric pollutant, adversely impacts human health [1] and terrestrial vegetation ecosystems [2], while also significantly influencing air quality and climate change [3]. Accurate prediction of pollutant dispersion in urban environments, particularly within street canyons where human exposure is highest, is therefore of paramount importance for public health protection and environmental policy-making.
Traditional approaches for studying pollutant dispersion include field measurements, wind tunnel experiments, and numerical simulations. Among these, Computational Fluid Dynamics (CFD) has emerged as a powerful tool capable of providing high-fidelity, full-field information by solving the Navier–Stokes equations governing fluid flow and scalar transport. However, despite its accuracy, CFD simulations entail substantial computational costs, especially for transient three-dimensional problems. A single 24 h large eddy simulation of a street canyon can require weeks of runtime on high-performance computing clusters. This computational burden severely limits the applicability of CFD in scenarios requiring real-time predictions or extensive parameter sweeps for urban planning and policy assessment.
To address this fundamental challenge, reduced-order models (ROMs) have been developed as efficient surrogates that capture the dominant dynamics of full-order systems while drastically reducing computational complexity. Traditional ROM techniques, such as Proper Orthogonal Decomposition (POD), project high-dimensional systems onto low-dimensional linear subspaces. However, these linear methods struggle to represent strongly nonlinear phenomena characteristic of turbulent flows in complex urban geometries. The advent of deep learning has opened new avenues for constructing nonlinear ROMs with enhanced representational capacity. Convolutional Autoencoders (CAEs), in particular, have demonstrated remarkable capability in learning compact latent representations of high-dimensional flow fields.
More recently, large language models (LLMs) pre-trained on massive corpora have shown unprecedented success in sequence modeling tasks, with emerging evidence of their ability to capture complex temporal dependencies. This capability has sparked interest in adapting LLMs for time-series forecasting, including applications in physical systems. However, the potential of integrating LLMs with nonlinear dimensionality reduction techniques for spatiotemporal prediction of pollutant dispersion remains largely unexplored.
This paper proposes a novel ROM framework that synergistically integrates a Dilated Convolutional Autoencoder (DCAE) with a pre-trained LLM for efficient and accurate prediction of pollutant dispersion in three-dimensional street canyons. The DCAE extracts low-dimensional spatiotemporal features from CFD-generated flow field snapshots, while the LLM, augmented with text prototype learning and prompt engineering, performs temporal inference in the latent space. A tailored textual template enriches meteorological context to guide LLM predictions. Through comprehensive experiments on a dataset comprising 12 high-fidelity CFD cases with varying geometric configurations and meteorological conditions, we demonstrate that the proposed framework achieves superior prediction accuracy, strong generalization capability across unseen scenarios, and remarkable computational efficiency—achieving about 9.85× speedup compared to traditional CFD simulation.
The remainder of this paper is organized as follows. Section 2 reviews related work on machine learning for air quality prediction, CFD simulations of pollutant dispersion, reduced-order models in fluid mechanics, and applications of LLMs in time-series forecasting. Section 3 introduces the methodology, detailing the core principles of the model. Section 4 details the proposed LLM-ROM framework. Section 5 presents the experimental setup and results analysis. Section 6 concludes the paper with a summary of contributions and future research directions.

2. Related Work

This section reviews existing literature relevant to the proposed LLM-ROM framework, organized into four thematic areas.

2.1. Machine Learning for Air Quality Prediction

With the widespread adoption of machine learning, models such as Autoregressive Integrated Moving Average (ARIMA) [4], Random Forest [5], and Support Vector Machines [6] have been applied to predict PM2.5 concentrations at monitoring stations. For example, Lorena Díaz-González et al. [7] used data from the Mexico City air quality monitoring network (2014–2023) to compare Random Forest and Bidirectional Recurrent Imputation for Time Series (BRITS). Recent advancements in deep learning enable hierarchical feature representation through deep neural networks to capture nonlinear relationships in data [8]. Researchers have increasingly utilized deep learning models for predictive tasks [9]. For instance, Mohamed [10] employed artificial neural networks to predict the Air Quality Index in Ahvaz, Iran, demonstrating their applicability through comparative experiments. Qiao and Hong [11] explored Long Short-Term Memory (LSTM) networks for PM2.5 prediction, noting that while LSTMs excel in time-series forecasting, their high memory requirements limit scalability. Ming et al. [12] proposed a hybrid CNN-LSTM model with superior prediction accuracy and generalization capabilities. Yang Feng et al. [13] proposed a spatiotemporal Informer model, which uses a new spatiotemporal embedding and spatiotemporal attention, to improve AQI forecast accuracy. Zhen-Zhong Hu et al. [14] proposed a novel Multi-Factor-Fusion framework to extract and fuse multiple factors and create an end-to-end neural network model capable of directly predicting wind fields. Ying Su et al. [15] utilized Gated Recurrent Unit (GRU) networks to tackle multiple correlated time-series forecasting problems.

2.2. CFD Simulations of Pollutant Dispersion

The continuous development of computational technologies and fluid dynamics theories has expanded the application domains of CFD to encompass diverse engineering practices and scientific research [16]. For example, CFD has been used to investigate the effectiveness of various air pollution mitigation strategies. Boikos et al. [17] proposed an approach that combines RANS CFD modeling with real-time meteorological data, 10 min resolution road traffic emissions and background concentration measurements for a congested area in Mong Kok, Hong Kong, to validate the CFD model. Badach et al. [18] adopted a methodological approach incorporating geographic information system (GIS) tools, 3D parametric modeling, and CFD simulations. In CFD, numerical simulations demand meticulous grid generation, advanced Navier–Stokes equation solvers, and precise handling of dynamic grid deformations. These complexities escalate computational costs, hindering real-time optimization in structural design [19].

2.3. Reduced-Order Models in Fluid Mechanics

To address these challenges, reduced-order models (ROMs) have emerged as a pivotal research focus in CFD [20]. ROMs map high-dimensional systems to low-dimensional subspaces, enabling efficient simulations with minimal computational resources. Current ROM methodologies are broadly categorized into intrusive (requiring code modifications) [21] and non-intrusive (data-driven) approaches [22,23]. Non-intrusive ROMs, leveraging machine learning, offer superior flexibility [24]. Traditional ROM techniques include system identification (e.g., Volterra series, ARMA models) and flow feature-based methods (e.g., Proper Orthogonal Decomposition (POD), Dynamic Mode Decomposition (DMD)). While POD excels in linear systems, its performance degrades in highly nonlinear scenarios. Autoencoders, as nonlinear deep learning models, have demonstrated enhanced feature extraction capabilities. Gaetan Kerschen et al. [25] replaced POD with autoencoders for nonlinear dimensionality reduction, achieving superior ROM performance. Wu et al. [26] combined autoencoders with LSTMs to model dynamic systems and introduced self-attention mechanisms to capture long-range dependencies.

2.4. Large Language Models for Time-Series Forecasting

Pre-trained large language models (LLMs), such as BERT [27] and GPT-2 [28], have revolutionized natural language processing and are now extending to time-series forecasting. Temporal sequences (e.g., time series) and textual data exhibit inherent structural parallels in their sequential organization and contextual dependencies. One fundamental similarity lies in their inherent sequentiality: both modalities are structured as ordered sequences of discrete elements—lexical tokens in textual data and numerical observations in temporal sequences—where the spatial or temporal arrangement of these elements encodes critical semantic or dynamic patterns. Furthermore, contextual dependence manifests analogously across both domains: in linguistic systems, the interpretation of a lexical unit is contingent upon its syntactic and semantic context, while in temporal systems, the value of a variable at a given timestep is probabilistically governed by its historical patterns, reflecting dependencies akin to autoregressive processes. Building on these parallels, pioneering works attempt to fine-tune powerful LLMs for time-series generation. Among them, Liu et al. [29] align the distributions of time-series and textual data to enhance the LLM’s effectiveness. TimeLLM [30] bridges the modalities of time-series and textual data by reprogramming time-series data with text prototypes, thereby unlocking the time-series forecasting (TSF) performance of LLMs. Saroj Gopali et al. [31] demonstrated that fine-tuned LLMs achieve state-of-the-art performance in time-series tasks with minimal training data. Wu et al. [32] proposed GLALLM, a framework adapting LLMs for spatiotemporal wind speed forecasting. Xiao [33] introduced LLM4Fluid, a spatiotemporal prediction framework leveraging LLMs as generalizable neural solvers for fluid dynamics. Hur et al. [34] applied Chronos, a foundation model based on the LLM architecture, to early detection of global instability in low-density jets.

3. Methodology

This section introduces the key components of the proposed model: the Dilated Convolutional Autoencoder, reversible instance normalization, multi-head cross-attention mechanisms, and the pre-trained LLM architecture.

3.1. Dilated Convolutional Autoencoder

Autoencoders are neural network models based on unsupervised learning, capable of performing tasks such as feature extraction, dimensionality reduction, and data generation [35]. In the study of CFD, autoencoders are employed as nonlinear ROMs. Compared to linear models using POD, autoencoders achieve superior approximations of the original dynamic systems within lower-dimensional spaces. An autoencoder consists of two components: an encoder and a decoder. A latent layer connects these two parts—the encoder maps input data to the latent layer, while the decoder reconstructs the output from the latent layer.
Traditional autoencoders utilize fully connected layers for encoding and decoding, mapping input data into a low-dimensional latent space to learn key features. However, for structured grid data, conventional autoencoders often fail to capture spatial structures and local dependencies effectively, limiting their performance in such tasks. Convolutional autoencoders address this limitation by leveraging convolutional and pooling layers from convolutional neural networks (CNNs) to extract spatial features. While pooling layers rapidly reduce feature dimensions, they discard significant valuable information during downsampling. To mitigate this, strided convolutions replace pooling layers in our model, preserving essential information while eliminating redundancy [36].
Furthermore, standard convolutional autoencoders process only local information within their receptive fields. To expand the receptive field without increasing parameters, dilated convolutions [37] are introduced. These convolutions capture broader contextual information by incorporating gaps (dilations) between kernel elements. As network depth increases, multi-layer dilated convolutions progressively fill these gaps, enhancing global feature extraction. The effective kernel size $k_{\text{eff}}$ of a dilated convolutional layer is
$k_{\text{eff}} = k + (k - 1)(r - 1)$
where $k$ denotes the nominal kernel size and $r$ represents the dilation rate.
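The effective kernel size formula above can be checked with a short sketch. The following minimal Python example (illustrative only, not the authors' implementation) computes the effective kernel size and, as a corollary, the receptive field of a stack of stride-1 dilated layers, where each layer enlarges the field by $k_{\text{eff}} - 1$:

```python
def effective_kernel_size(k: int, r: int) -> int:
    """Effective kernel size of a dilated convolution: k_eff = k + (k - 1)(r - 1)."""
    return k + (k - 1) * (r - 1)

def receptive_field(kernel_sizes, dilation_rates) -> int:
    """Receptive field of stacked stride-1 dilated conv layers:
    each layer adds (k_eff - 1) on top of the previous field."""
    rf = 1
    for k, r in zip(kernel_sizes, dilation_rates):
        rf += effective_kernel_size(k, r) - 1
    return rf

print(effective_kernel_size(3, 2))            # 5: a 3-tap kernel with dilation 2 spans 5 cells
print(receptive_field([3, 3, 3], [1, 2, 4]))  # 15: exponentially growing dilations fill the gaps
```

With dilation rates doubling per layer (1, 2, 4), three small kernels already cover 15 cells, which illustrates how dilation widens context without adding parameters.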

3.2. Pre-Trained Large Language Models

In traditional machine learning and deep learning, addressing new target tasks typically requires the development of task-specific models, the construction of comprehensive datasets, and extensive parameter tuning to achieve satisfactory performance. This process entails significant resource consumption, prompting researchers to seek viable methods for transferring knowledge from source tasks to target tasks. By leveraging accumulated domain-agnostic knowledge, new models can achieve desirable outcomes with minimal data samples and training time. With the continuous advancement of deep learning, pre-trained large models have achieved remarkable success in fields such as natural language processing (NLP) and Computer Vision (CV). These models, pre-trained on massive datasets, capture rich semantic information and provide robust feature representations for downstream tasks.
Pre-trained large models typically adopt deep neural network architectures, such as the Transformer. The Transformer incorporates a self-attention mechanism capable of modeling long-range dependencies in sequential data. Within each Transformer block, the encoder processes a set of input sequences $x = [x_1, \ldots, x_n]$ into a set of output sequences $y = [y_1, \ldots, y_n]$. Figure 1 illustrates the architecture of the pre-trained large model. In the diagram, the input sequence undergoes embedding and positional encoding operations to generate the initial tokenized input tensor. This tensor is then processed by the multi-head self-attention mechanism to extract global feature representations. Subsequently, a feed-forward neural network propagates these features, ultimately yielding enriched semantic representations. The pre-training of large models involves two primary tasks:
1.
Masked Language Modeling (MLM): Random tokens in the input sequence are masked, and the model predicts the masked tokens.
2.
Next Sentence Prediction (NSP): Given two sentences, the model determines whether the second sentence logically follows the first. Through these pre-training tasks, the model acquires robust semantic understanding and contextual awareness.

3.3. Reversible Instance Normalization

In deep learning, normalization layers are a common technique to accelerate model training and enhance performance. In our model, the normalization layer scales and shifts the input data so that it achieves a zero mean and unit standard deviation. The corresponding normalization formulas are referenced in the Methodology Section of [38], which provides a rigorous mathematical derivation and implementation details. The input data undergoes normalization before being fed into the model, and the model's predictions are then inversely normalized to restore the data to its original scale. This normalization-inverse normalization workflow ensures stable data distributions across network layers while allowing the model to learn meaningful features from the data.
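The normalize-then-denormalize workflow can be sketched as follows. This is a minimal NumPy illustration of the idea, not the full RevIN of [38] (which also includes learnable affine parameters); the function names are ours:

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Instance-wise normalization: zero mean, unit std per instance.
    Returns the normalized data plus the statistics needed to invert it."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True) + eps
    return (x - mean) / std, (mean, std)

def revin_denormalize(x_norm, stats):
    """Restore model outputs to the original data scale."""
    mean, std = stats
    return x_norm * std + mean

x = np.array([[1.0, 2.0, 3.0, 4.0]])
x_norm, stats = revin_normalize(x)   # zero mean, unit std
x_back = revin_denormalize(x_norm, stats)
print(np.allclose(x_back, x))        # True: the transform is reversible
```

Storing the per-instance statistics is what makes the normalization reversible, so predictions made in the normalized space can be mapped back exactly.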

3.4. Multi-Head Cross-Attention Mechanism

The multi-head attention mechanism is a critical component for processing sequential data. It enables models to focus on salient parts of the input by assigning differentiated attention weights. In Transformer-based architectures, this mechanism is partitioned into multiple parallel “heads”, each with its own parameters, to capture distinct subsets of information from different representation subspaces. The core of the attention mechanism computes outputs based on interactions among queries (Q), keys (K), and values (V). Given an input matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ (where $n$ is the sequence length and $d_{\text{model}}$ is the model dimension), we first linearly project it into query, key, and value matrices:
$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$
where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learnable parameter matrices, with $d_k$ and $d_v$ being the dimensions of keys and values, respectively. Attention weights are derived from the compatibility between queries and keys, which then weight the values to generate the final output:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$
where the scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large, which could lead to vanishing gradients. This mechanism enables each position to attend to all positions in the sequence.
Multi-head attention repeats this process h times, each with different projection matrices, then concatenates the outputs from all heads and applies a linear transformation:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the projection matrices for the $i$-th head, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the output projection matrix. Through this parallel mechanism, the model can jointly capture diverse feature interactions from different representation subspaces.
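The equations above can be realized in a few lines. The following NumPy sketch (illustrative only; the even split of the projected dimension across heads is an assumption, matching the usual $d_k = d_{\text{model}}/h$ convention) implements scaled dot-product attention and its multi-head extension:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Project X, split into h heads, attend per head, concatenate, project."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1] // h
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d = 6, 8                                 # sequence length, model dim
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h=2)
print(out.shape)  # (6, 8): one d_model-dim output per position
```

Each head sees only its own $d_k$-dimensional slice of the projections, which is what lets different heads specialize in different representation subspaces.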

4. Model Architecture

The primary challenge in constructing flow field prediction models for multi-scenario applications lies in the limited adaptability of conventional models. While existing models excel in single-scenario, large-data regimes, they struggle to generalize across diverse flow fields, necessitating costly retraining—a significant practical limitation. To address this, we propose the DCAE-LLM framework, which integrates pre-trained large language models (LLMs) to achieve high performance with minimal training samples and exceptional transfer learning capabilities. For readers’ convenience, the main mathematical symbols used in this paper are listed in Appendix A. The model workflow comprises two stages:
1.
Dimensionality Reduction: DCAE projects high-dimensional flow field data into a low-dimensional latent space.
2.
Prediction and Reconstruction: The LLM predicts low-dimensional dynamics, followed by a Dilated Convolutional Autodecoder (DCAD) reconstructing the full-dimensional flow field.
As illustrated in Figure 2, the DCAE-LLM adopts an encoder–decoder architecture and consists of four core modules:
1.
DCAE: Performs nonlinear dimensionality reduction and reconstruction of flow field data.
2.
Temporal Text Embedding: Encodes low-dimensional flow field sequences into text-like vectors.
3.
Textual Prompt Template: Integrates meteorological and contextual metadata to guide LLM predictions.
4.
Pre-trained LLM: Executes few-shot inference on the embedded flow field representations.
The following sections elaborate on each module’s design and implementation.

4.1. Dilated Convolutional Autodecoder

When processing the historical input flow field sequence of $T$ timesteps $X_{\text{in}} = [X_1, X_2, \ldots, X_T]$, the DCAE encoder first reduces the dimensionality of the flow field to obtain the latent vector sequence $Z = [z_1, z_2, \ldots, z_T]$. The resulting low-dimensional data is then fed into the model for temporal reasoning. This approach significantly reduces computational overhead and enhances efficiency. Following inference, a decoder—mirroring the encoder’s architecture—reconstructs the data back to its original high-dimensional space, yielding the full-resolution flow field. The structural diagram of this process is depicted in Figure 3.

4.2. Temporal Text Embedding

First, reversible instance normalization is applied to each input channel to ensure zero mean and unit variance, mitigating distributional shifts in the time series. Subsequently, the normalized data $\hat{z}_t = \mathrm{RevIN}(z_t)$, $t = 1, 2, \ldots, T$, is partitioned into multiple segments of length $L$, which may overlap or remain non-overlapping. The total number of segments $N_p$ is calculated as
$N_p = \left\lfloor \dfrac{T - L}{S} \right\rfloor + 1$
where $S$ represents the sliding stride between consecutive segments. Thus, the $i$-th patched latent vector sequence $P_i$ is calculated as
$P_i = \left[\hat{z}_{i \cdot S + 1}, \hat{z}_{i \cdot S + 2}, \ldots, \hat{z}_{i \cdot S + L}\right] \in \mathbb{R}^{L \times 128}, \quad i = 0, 1, \ldots, N_p - 1$
This segmentation serves a dual purpose:
1.
Preserving Local Context: Aggregating localized details within each segment.
2.
Efficient Tokenization: Constructing a compact input token sequence to minimize computational resource consumption.
Then, each patch is flattened and projected to obtain patch tokens:
$t_i = W_{\text{patch}} \cdot \mathrm{Flatten}(P_i) + b_{\text{patch}} \in \mathbb{R}^{d_{\text{token}}}$
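The patching and projection steps can be sketched as follows. This minimal NumPy example (illustrative only; the function name and the specific dimensions are our assumptions, with the latent width 128 reduced for brevity) slices a latent sequence into $N_p$ patches and maps each flattened patch to a $d_{\text{token}}$-dimensional token:

```python
import numpy as np

def patch_and_project(Z, L, S, W_patch, b_patch):
    """Slice a latent sequence Z (T x d_latent) into N_p = (T - L)//S + 1
    patches of length L with stride S, flatten each, project to d_token."""
    T = Z.shape[0]
    N_p = (T - L) // S + 1
    tokens = []
    for i in range(N_p):
        P_i = Z[i * S : i * S + L]                 # patch: (L, d_latent)
        tokens.append(W_patch @ P_i.reshape(-1) + b_patch)
    return np.stack(tokens)                        # (N_p, d_token)

T, d_latent, L, S, d_token = 12, 16, 4, 2, 8
rng = np.random.default_rng(1)
Z = rng.normal(size=(T, d_latent))
W = rng.normal(size=(d_token, L * d_latent))
b = np.zeros(d_token)
tokens = patch_and_project(Z, L, S, W, b)
print(tokens.shape)  # (5, 8): N_p = (12 - 4)//2 + 1 = 5 tokens
```

With stride $S < L$ the patches overlap, so neighboring tokens share local context; with $S = L$ they tile the sequence without overlap.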

4.3. Patch Reprogramming for Physics-to-Text Alignment

To establish an effective mapping between the physical latent space and the semantic space of pre-trained language models, we propose a textual prototype learning module. The core idea is to construct a learnable, discrete semantic prototype codebook and employ a soft attention mechanism to “translate” the continuous latent vector sequence into a sequence of semantic tokens comprehensible to the language model. The entire module is trained in an end-to-end manner, optimizing the prototype codebook and mapping parameters via backpropagation of the prediction loss. The whole structure of patch reprogramming is shown in Figure 4.

4.3.1. Design Motivation: From Direct Word Embedding to Text Prototypes

The most straightforward approach is to map physical tokens to the pre-trained word embedding space of a language model, i.e., using its pre-trained embedding matrix $E \in \mathbb{R}^{V \times d_{\text{text}}}$ (where $V$ is the vocabulary size, typically on the order of 50,000) to re-encode the tokens. However, physical temporal patterns lack explicit prior alignment with natural language vocabulary. Directly utilizing the large vocabulary $E$ leads to the following issues:
1.
High-dimensional sparsity: Physical tokens need to select representations from tens of thousands of words, resulting in an extremely sparse re-encoding space where numerous irrelevant words introduce noise.
2.
Semantic misalignment: Natural language words (e.g., “apple,” “car”) have fundamentally different semantics from physical temporal patterns (e.g., “rapid rise,” “periodic oscillation”), making it difficult to form meaningful alignments through direct projection.
3.
Computational redundancy: Learning mappings in a 50,000-word vocabulary space requires a massive number of parameters ($d_{\text{text}} \times V$), which easily leads to overfitting with limited data.
To address these issues, this paper proposes textual prototype learning, which constructs a learnable set of text prototypes $C \in \mathbb{R}^{K \times d_{\text{text}}}$ ($K \ll V$) that is much smaller than the original vocabulary but highly refined. Physical patterns are mapped into the low-dimensional semantic subspace spanned by these prototypes, achieving more efficient and precise modality alignment.

4.3.2. Definition of the Text Prototype Codebook

The definition of the text prototype codebook is expressed as $C \in \mathbb{R}^{K \times d_{\text{text}}}$, where $K$ is the number of prototypes and $d_{\text{text}}$ is the text embedding dimension of Time-LLM. Each vector $c_k \in \mathbb{R}^{d_{\text{text}}}$ ($k = 1, \ldots, K$) represents a learnable semantic primitive, corresponding to a basic temporal pattern (e.g., “rapid rise,” “slow decline,” “periodic oscillation,” “stable,” etc.). These prototypes are initialized at the beginning of training via K-means clustering on pre-trained word embeddings to obtain reasonable semantic priors, and are subsequently optimized jointly with other model parameters.

4.3.3. Semantic Projection of Physical Tokens

For the sequence of patched physical tokens $t_1, t_2, \ldots, t_{N_p}$, where each $t_i \in \mathbb{R}^{d_{\text{token}}}$, we first map them to the same semantic space as the text prototypes via a linear projection layer:
$h_i = W_a t_i + b_a \in \mathbb{R}^{d_{\text{text}}}$
where $W_a \in \mathbb{R}^{d_{\text{text}} \times d_{\text{token}}}$ and $b_a \in \mathbb{R}^{d_{\text{text}}}$ are learnable parameters. The projected $h_i$ serves as a query vector for interacting with the prototype codebook.

4.3.4. Prototype Fusion via Attention Mechanism

We employ a multi-head cross-attention mechanism to compute the relevance between each physical token and all text prototypes. Taking a single head as an example, treating $h_i$ as the query and the entire codebook $C$ as both keys and values, the attention weights are computed as
$\alpha_{i,k} = \dfrac{\exp\!\left(h_i^\top c_k / \sqrt{d_{\text{text}}}\right)}{\sum_{j=1}^{K} \exp\!\left(h_i^\top c_j / \sqrt{d_{\text{text}}}\right)}, \quad k = 1, \ldots, K$
where $\alpha_{i,k}$ represents the semantic similarity between the $i$-th physical token and the $k$-th text prototype, satisfying $\sum_{k=1}^{K} \alpha_{i,k} = 1$. The scaling factor $\sqrt{d_{\text{text}}}$ stabilizes gradient propagation.
The final aligned text embedding $w_i$ is obtained as the weighted sum of all prototypes:
$w_i = \sum_{k=1}^{K} \alpha_{i,k} c_k \in \mathbb{R}^{d_{\text{text}}}$
This soft assignment mechanism allows each physical token to be expressed as a combination of multiple semantic primitives, enhancing representational richness and flexibility. For instance, a complex pattern describing pollutant concentration as “rapidly rising then slowly declining” can be semantically composed by assigning high weights to prototypes corresponding to “rapid rise” (red line) and “steady decline” (blue line), as shown in Figure 5.
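The single-head soft assignment described above can be sketched directly. This NumPy example (illustrative only; function name and dimensions are assumptions) computes the attention weights over the prototype codebook and the resulting aligned embedding:

```python
import numpy as np

def prototype_align(h, C):
    """Soft-assign a projected physical token h (d_text,) to prototypes
    C (K x d_text): alpha_k = softmax(h . c_k / sqrt(d_text)),
    then return w = sum_k alpha_k * c_k and the weights alpha."""
    d_text = h.shape[0]
    scores = C @ h / np.sqrt(d_text)     # similarity to each prototype
    scores -= scores.max()               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ C, alpha              # weighted sum of prototypes

K, d_text = 8, 16
rng = np.random.default_rng(2)
C = rng.normal(size=(K, d_text))         # learnable codebook (random here)
h = rng.normal(size=d_text)              # one projected physical token
w, alpha = prototype_align(h, C)
print(round(alpha.sum(), 6))  # 1.0: weights form a distribution over prototypes
print(w.shape)                # (16,): aligned embedding lives in the text space
```

Because `w` is a convex combination of codebook rows, it always lies inside the semantic subspace spanned by the prototypes, which is exactly what makes the soft assignment a "translation" rather than a free projection.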

4.3.5. Multi-Head Extension

To capture diverse types of semantic interactions, we extend the above process to a multi-head formulation. With $h$ heads, for each head $m = 1, \ldots, h$, we use independent projection matrices $W_a^{(m)}, b_a^{(m)}$ and prototype subspaces $C^{(m)} \in \mathbb{R}^{K \times d_{\text{text}}/h}$. The attention output for head $m$ is computed as $w_i^{(m)}$, and all head outputs are concatenated and fused via a linear layer:
$w_i = \mathrm{Concat}\!\left(w_i^{(1)}, \ldots, w_i^{(h)}\right) W^O$
where $W^O \in \mathbb{R}^{d_{\text{text}} \times d_{\text{text}}}$ is the output projection matrix (the concatenation of $h$ head outputs of dimension $d_{\text{text}}/h$ yields a $d_{\text{text}}$-dimensional vector). The multi-head mechanism enables the model to attend to diverse features of the prototypes from different representation subspaces.

4.3.6. Training Process and Gradient Flow

The entire textual prototype learning module is trained end-to-end, with its parameters optimized by minimizing the downstream prediction loss $\mathcal{L}_{\text{pred}}$. Specifically, after the forward pass generates the aligned embedding sequence $w_i$ and the subsequent Time-LLM inference, gradients are backpropagated through the chain rule to:
  • The attention weights $\alpha_{i,k}$, consequently updating the query projection parameters $W_a$, $b_a$;
  • Each vector $c_k$ in the prototype codebook $C$.
Notably, because the attention weights $\alpha_{i,k}$ are continuous functions of the prototypes, the codebook can directly receive gradients for updates, allowing the prototypes to gradually evolve into semantic primitives that best represent the distribution of physical patterns during training. The initial K-means clustering of prototypes from the pre-trained word embedding space provides a good starting point and accelerates convergence.

4.4. Textual Prompt Template

Textual prompts are widely recognized as a direct and effective method to activate task-specific functionalities of large language models (LLMs). However, directly converting time-series data into natural language poses significant challenges, hindering automated dataset tracking and limiting the effective utilization of in-context prompts without compromising model performance. Recent studies demonstrate that other data modalities, such as images, can be seamlessly integrated as prompt prefixes to facilitate reasoning based on these inputs. Inspired by these advancements, and to enable real-world applicability of our method to time-series data, we propose a novel concept: leveraging prompts as prefixes to enrich contextual input information and guide the transformation of time-series patches. This approach markedly enhances the adaptability of LLMs in downstream tasks.
During implementation, we identify three critical elements for constructing effective prompts, as shown in Figure 6:
1.
Dataset Context: This part provides domain background information about the input time series, helping the LLM understand the physical source and basic characteristics of the data. Specifically, it includes: building layout (aspect ratio H/W = 0.5, building height 21 m), meteorological conditions (prevailing wind direction 225°, inflow wind profile following an exponential law, temperature and humidity based on Shanghai 15 July 2024 measured data), pollution source information (cross-shaped line source, emission rate 8.0 × 10⁻⁶ kg/(m·s)), building thermal properties (thermal conductivity, albedo, etc.), and vegetation configuration (tree height 10 m, shrub height 1.5 m). This information provides the model with foundational knowledge for understanding the current physical scenario.
2.
Task Instruction: This part not only contains the specific prediction directive but also incorporates prior knowledge related to the pollutant dispersion task, aiming to activate the LLM’s general knowledge relevant to physical processes. Specifically, it includes the following:
  • Prediction Directive: Clearly indicates the form of the task, e.g., “Based on the historical latent variable sequence of the previous 12 timesteps (2 h), predict the PM10 concentration field for the next 6 timesteps (1 h) in an autoregressive manner.”
  • Task Prior Knowledge: Introduces fundamental physical principles of pollutant dispersion, such as “Pollutant dispersion in street canyons is governed by advection, turbulent diffusion, and source emissions; the evolution of the concentration field is continuous and satisfies mass conservation,” and “The prediction must follow temporal causality, where future states depend only on historical information.” This prior knowledge helps focus the LLM’s general sequence modeling capabilities on the specific task of physical field prediction.
3.
Input Statistics: Augments the time series with key statistical metrics (e.g., min/max values, trends, lag correlations) to support pattern recognition and reasoning.
During the input stage, these three components are integrated into a coherent natural-language prompt, converted into a continuous prompt embedding sequence $T_{\text{prompt}}$ via the LLM's own word embedding layer, and concatenated with the patched semantic embedding sequence from the physics-to-text alignment module to form the complete model input $S^{(0)}$:
$$S^{(0)} = \left[ T;\, w_1, w_2, \ldots, w_{N_p} \right] \in \mathbb{R}^{(L_p + N_p) \times d_{\text{text}}}$$
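As an illustration, the concatenation above can be sketched with stand-in arrays; the dimensions ($d_{\text{text}} = 768$, $L_p = 96$ prompt tokens) are assumed for illustration only, and the random matrices are placeholders for the frozen word-embedding layer and the physics-to-text alignment module:

```python
import numpy as np

rng = np.random.default_rng(0)

d_text = 768  # LLM embedding width (assumed; BERT-class backbone)
L_p    = 96   # number of tokenized prompt tokens (hypothetical)
N_p    = 3    # patches from 12 input steps with L = 6, S = 3: (12 - 6)/3 + 1 = 3

# Prompt text assembled from the three components (abbreviated here).
prompt = ("Dataset context: H/W = 0.5, wind 225 deg, line source 8.0e-6 kg/(m*s). "
          "Task: predict the next 6 timesteps (1 h) autoregressively. "
          "Statistics: min/max, trend, lag correlations.")

# Random stand-ins for the frozen word-embedding layer and the alignment module.
T_prompt = rng.normal(size=(L_p, d_text))  # embedded prompt tokens T
w_tokens = rng.normal(size=(N_p, d_text))  # patched semantic embeddings w_1..w_Np

# S^(0) = [T; w_1, ..., w_Np] in R^{(L_p + N_p) x d_text}
S0 = np.concatenate([T_prompt, w_tokens], axis=0)
assert S0.shape == (L_p + N_p, d_text)
```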

4.5. Pre-Trained LLM and Autoregressive Generation Mechanism

When the frozen large language model (LLM) finishes processing, we discard the output positions corresponding to the prompt prefix and extract the remaining output representations. These representations are then flattened and projected into a low-dimensional space via a linear transformation to obtain the final flow field predictions.
The core of the LLM reasoning component is the autoregressive generation mechanism of the Transformer decoder. Given the contextual input sequence $S^{(0)} = [T;\, w_1, w_2, \ldots, w_{N_p}]$ (where $T$ represents the text prompt embeddings and $w_i$ denotes the historical physical token sequence), at the $k$-th prediction step the model computes the conditional probability distribution for the next token:
$$P\!\left(w_{N_p+k} \mid S^{(k-1)}\right) = \mathrm{Softmax}\!\left(\mathrm{Linear}\!\left(h_{\text{last}}^{(k-1)}\right)\right)$$
where $h_{\text{last}}^{(k-1)}$ is the hidden state of the last position after passing $S^{(k-1)}$ through the Transformer decoder. By iteratively applying this process, the model progressively generates the complete future token sequence $\hat{w}_{N_p+1}, \hat{w}_{N_p+2}, \ldots, \hat{w}_{N_p+F}$. The core of this mechanism is causal attention, which ensures that each position can only attend to previous positions:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$
where $M$ is an upper triangular mask matrix with $M_{ij} = -\infty$ for $i < j$, which prevents information leakage from future positions.
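A minimal NumPy sketch of this masked attention (a single head with no learned projections) illustrates how the additive mask enforces causality:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with an additive causal mask.
    M[i, j] = -inf for j > i, so position i cannot attend to the future."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (L, L) similarity logits
    M = np.triu(np.full(scores.shape, -np.inf), k=1)  # strictly upper triangle
    scores = scores + M
    # Row-wise softmax (numerically stabilized); exp(-inf) evaluates to 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d = 5, 8
x = rng.normal(size=(L, d))
out, w = causal_attention(x, x, x)

assert np.isclose(w[0, 0], 1.0)          # the first position attends only to itself
assert np.allclose(np.triu(w, k=1), 0.0) # no position attends to any future position
```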
It is important to emphasize that the prediction loss of the LLM is not computed in the semantic space but in the final physical field space. The generated semantic token sequence $\hat{w}_i$ is first mapped back to latent vectors via an inverse projection layer:
$$\hat{z}_t = W_{\text{patch}} \hat{w}_i + b_{\text{patch}} \in \mathbb{R}^{d}$$
After de-patching and reversible denormalization, the complete future latent vector sequence $\hat{Z}_{\text{future}}$ is recovered and finally reconstructed into physical fields $\hat{X}_{t+1}, \ldots, \hat{X}_{t+F}$ by the DCAE decoder. The optimization objective is to minimize the mean squared error between the predicted and ground truth physical fields:
$$\mathcal{L}_{\text{pred}} = \frac{1}{F} \sum_{k=1}^{F} \left\| X_{t+k} - \hat{X}_{t+k} \right\|_F^2$$
This design allows the LLM to reason in the semantic space while the loss function directly supervises in the physical space, ensuring the physical consistency of the predictions.
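The generate-and-decode loop described in this subsection can be sketched as follows; every operator here (the frozen LLM, the linear head, the inverse projection, and the DCAE decoder) is replaced by a random toy surrogate, so the sketch demonstrates only the data flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
d_text, d_latent, n_grid, F = 64, 30, 100, 6  # toy sizes; F future tokens

# Random stand-ins for the frozen/trained components (data flow only).
A      = rng.normal(size=(d_text, d_text)) * 0.1    # "LLM" mixing matrix
W_head = rng.normal(size=(d_text, d_text)) * 0.1    # Linear head -> next token
W_inv  = rng.normal(size=(d_text, d_latent)) * 0.1  # inverse projection W_patch
b_inv  = np.zeros(d_latent)                         # b_patch
D      = rng.normal(size=(d_latent, n_grid)) * 0.1  # DCAE-decoder stand-in

S = rng.normal(size=(10, d_text))  # S^(0): prompt + historical patch tokens
fields = []
for k in range(F):                   # autoregressive rollout
    h_last = np.tanh(S[-1] @ A)      # hidden state of the last position
    w_next = h_last @ W_head         # next semantic token \hat{w}
    S = np.vstack([S, w_next])       # append the token, condition the next step
    z_next = w_next @ W_inv + b_inv  # map back to the latent space
    fields.append(z_next @ D)        # "reconstruct" the physical field

# Physical-space MSE loss against (synthetic) ground truth fields.
truth = rng.normal(size=(F, n_grid))
L_pred = np.mean([np.sum((truth[k] - fields[k]) ** 2) for k in range(F)])
assert len(fields) == F and np.isfinite(L_pred)
```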
In the proposed model architecture, only the parameters of the lightweight input transformation and output projection layers are updated, while the backbone language model remains frozen. Compared to conventional multimodal language models—which typically require fine-tuning with paired cross-modal data—DCAE-TimeLLM optimizes directly and operates efficiently with minimal time-series data and a few training epochs. This approach maintains high computational efficiency and imposes significantly fewer resource constraints compared to building domain-specific large models from scratch or fine-tuning existing ones.

5. Experiment

In this section, we evaluate the performance of the proposed reduced-order model (ROM) through two experimental setups. First, we validate the model's accuracy in predicting pollutant dispersion under specific meteorological conditions and green infrastructure configurations using a three-dimensional urban canyon model. Second, we conduct transfer learning experiments by varying vegetation distributions under controlled meteorological conditions to assess the model's generalizability. To ensure solution fidelity, all experimental data are derived from high-fidelity numerical simulations on structured grids using ENVI-met 5.1 (non-hydrostatic solver).

5.1. Experiment Setups

In this subsection, we describe the experiment setups, including the evaluation metrics, experiment configurations, and hyperparameter choices.

5.1.1. Evaluation Metrics

The evaluation metrics are carefully selected to quantify the prediction accuracy from multiple perspectives, and the detailed definitions are provided below.
1.
Root Mean Square Error (RMSE):
RMSE measures the average magnitude of the prediction error, giving higher weight to large errors. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{D \times H \times W} \sum_{i=1}^{D} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( C_{i,j,k} - \hat{C}_{i,j,k} \right)^{2}}\ \left[\mu\mathrm{g}/\mathrm{m}^{3}\right]$$
where $C_{i,j,k}$ and $\hat{C}_{i,j,k}$ denote the ground truth and predicted PM10 concentration at grid point $(i, j, k)$, respectively. The dimensions $D$, $H$, and $W$ correspond to the vertical layers, height, and width of the 3D field ($D = 28$, $H = 65$, $W = 65$).
2.
Structural Similarity Index (SSIM):
SSIM is a perceptual metric that quantifies the similarity between two images in terms of luminance, contrast, and structure. For 3D fields, we compute SSIM slice-wise along the vertical dimension and average the results. Specifically, for each vertical layer $i$, we treat the horizontal slice $C_i \in \mathbb{R}^{H \times W}$ and its prediction $\hat{C}_i \in \mathbb{R}^{H \times W}$ as 2D images and calculate the SSIM using the standard formula:
$$\mathrm{SSIM}\!\left(C_i, \hat{C}_i\right) = \frac{\left(2 \mu_i \hat{\mu}_i + \epsilon_1\right)\left(2 \sigma_{i,\hat{i}} + \epsilon_2\right)}{\left(\mu_i^2 + \hat{\mu}_i^2 + \epsilon_1\right)\left(\sigma_i^2 + \hat{\sigma}_i^2 + \epsilon_2\right)}$$
where $\mu_i$ and $\hat{\mu}_i$ are the mean intensities of the ground truth and predicted slices, $\sigma_i$ and $\hat{\sigma}_i$ are their standard deviations, and $\sigma_{i,\hat{i}}$ is their covariance. The constants $\epsilon_1$ and $\epsilon_2$ are small stabilizers. The overall SSIM for the 3D field is obtained by averaging over all vertical layers:
$$\mathrm{SSIM}_{3D} = \frac{1}{D} \sum_{i=1}^{D} \mathrm{SSIM}\!\left(C_i, \hat{C}_i\right)$$
SSIM is bounded above by 1, with values closer to 1 indicating higher structural fidelity; for the non-negative concentration fields considered here, it effectively ranges from 0 to 1. This metric is particularly sensitive to the preservation of spatial patterns such as pollutant plume morphology and concentration gradients.
3.
Coefficient of Determination ( R 2 ):
R 2 measures the proportion of variance in the ground truth data that is explained by the model. It is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{D} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( C_{i,j,k} - \hat{C}_{i,j,k} \right)^2}{\sum_{i=1}^{D} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( C_{i,j,k} - \bar{C} \right)^2}$$
where $\bar{C}$ is the mean of the ground truth concentrations over all grid points. $R^2$ can be negative if the model performs worse than simply predicting the mean, and a value of 1 indicates perfect prediction. This metric provides a global assessment of the model's explanatory power.
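The three metrics can be sketched in a few lines of NumPy; the grid dimensions match the experiment (D = 28, H = 65, W = 65), while the single-window SSIM below is a simplification of the usual sliding-window implementation, and the stabilizer values are illustrative:

```python
import numpy as np

def rmse_3d(C, C_hat):
    """RMSE over a (D, H, W) concentration grid, in the field's units (ug/m^3)."""
    return np.sqrt(np.mean((C - C_hat) ** 2))

def ssim_slice(x, y, eps1=1e-4, eps2=9e-4):
    """Single-window SSIM between two 2D slices (luminance x contrast/structure)."""
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + eps1) * (2 * cov + eps2)) / \
           ((mu_x ** 2 + mu_y ** 2 + eps1) * (x.var() + y.var() + eps2))

def ssim_3d(C, C_hat):
    """Average slice-wise SSIM along the vertical dimension (axis 0)."""
    return np.mean([ssim_slice(C[i], C_hat[i]) for i in range(C.shape[0])])

def r2_3d(C, C_hat):
    """Coefficient of determination over all grid points."""
    ss_res = np.sum((C - C_hat) ** 2)
    ss_tot = np.sum((C - C.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
C = rng.uniform(0, 50, size=(28, 65, 65))     # synthetic PM10 "ground truth"
C_hat = C + rng.normal(0, 2.0, size=C.shape)  # prediction with ~2 ug/m^3 noise

assert 1.5 < rmse_3d(C, C_hat) < 2.5          # RMSE near the injected noise level
assert np.isclose(ssim_3d(C, C), 1.0)         # identical fields -> SSIM = 1
assert abs(r2_3d(C, np.full_like(C, C.mean()))) < 1e-9  # mean predictor -> R^2 = 0
```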

5.1.2. Experiment Configurations and Hyperparameter Choices

All experiments were conducted on identical computational hardware with the following software configuration: Python 3.9.7, PyTorch 1.11.0, and CUDA Toolkit 11.3.1. This study adopts Time-LLM as the temporal reasoning backbone, with pre-trained weights based on the BERT-medium (355M parameters) architecture. Specifically, we utilize the officially released pre-trained Time-LLM model (version: time-llm-base-355M), which has been pre-trained on large-scale general-purpose time-series data and possesses powerful sequence modeling capabilities. The model weights are loaded via the Hugging Face Transformers library.
Additional network hyperparameters are as follows: the pre-trained LLM is BERT, the number of epochs is 20, the batch size is 12, the activation function is GELU, and the loss function is mean squared error (MSE). The patch length $L = 6$ corresponds to a 1 h time window, which matches the short-term correlation timescale of pollutant dispersion (approximately 30–60 min) and effectively captures local dynamic features. The stride $S = 3$ yields 50% overlap between consecutive patches, enhancing local context continuity while maintaining temporal resolution. Following the recommendation of the original LoRA paper [39], we choose a small rank to maximize parameter efficiency while maintaining performance; experiments show that with $r = 8$, only 0.04% of the parameters (1.2M) are trainable, achieving performance close to full fine-tuning. The number of text prototypes $K = 128$ was determined through a preliminary grid search over $K = 64$ to $256$, using validation-set SSIM as the criterion; this value achieves the best balance between representational capacity and computational efficiency. The AdamW optimizer with an initial learning rate of 5 × 10⁻⁴ and cosine annealing decay is adopted. This configuration was validated through multiple experiments to provide the best trade-off between training stability and convergence speed.
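A minimal sketch of the overlapping patching described above (patch length L = 6, stride S = 3, applied to a 12-step window of 30-dimensional latent vectors):

```python
import numpy as np

def make_patches(seq, L=6, S=3):
    """Split a (T, d) latent sequence into overlapping patches of length L, stride S."""
    T = seq.shape[0]
    starts = range(0, T - L + 1, S)
    return np.stack([seq[s:s + L] for s in starts])  # shape (N_p, L, d)

z = np.arange(12 * 30, dtype=float).reshape(12, 30)  # 12 input steps, 30-dim latents
patches = make_patches(z)

assert patches.shape == (3, 6, 30)  # (12 - 6)/3 + 1 = 3 patches
# Stride 3 with length 6 gives 50% overlap between consecutive patches.
assert np.array_equal(patches[0][3:], patches[1][:3])
```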

5.2. Experimental Dataset Construction

In this subsection, we validate the performance of the proposed reduced-order model (ROM) using a canonical cross-shaped urban canyon pollutant dispersion scenario [40]. The computational domain spans 195 m × 195 m horizontally with a vertical height of 84 m, discretized into 3 m × 3 m × 3 m grid cells, resulting in a total of 118,300 grid points. As illustrated in Figure 7, gray regions represent 21 m tall buildings, intermittent green patches denote 10 m tall trees, and continuous green areas indicate 1.5 m tall hedges. Red lines mark vehicular-emission-dominated pollution sources. The street aspect ratio ($H/W$), where $H$ is the building height and $W$ is the street width, is a critical parameter in urban canyon modeling. Three aspect ratios are defined: 0.5, 0.9, and 1.2, as shown in Figure 8 [41]. Two wind directions are considered: 180° (southerly) and 225° (southwesterly).
Temperature and humidity follow measured profiles for 15 July 2024 at coordinates 121.6° E, 30.8° N. High-fidelity pollutant dispersion fields were generated using ENVI-met CFD simulations, with key parameters summarized in Table 1. A 24 h pollutant dispersion simulation was conducted, recording PM10 concentrations at 10 min intervals and yielding 145 high-resolution snapshots. The dataset was partitioned into 116 timesteps (80%) for training and 29 timesteps (20%) for testing. This experiment focuses on the configuration with H/W = 0.5 and southwesterly wind (225°) to validate model efficacy.
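Given this split, the training windows used later (12 input steps, 6 output steps) can be constructed as follows; the latent values here are random placeholders:

```python
import numpy as np

def sliding_windows(latents, in_len=12, out_len=6):
    """Build (input, target) pairs: 12 historical steps (2 h) -> 6 future steps (1 h)."""
    X, Y = [], []
    for t in range(latents.shape[0] - in_len - out_len + 1):
        X.append(latents[t:t + in_len])
        Y.append(latents[t + in_len:t + in_len + out_len])
    return np.stack(X), np.stack(Y)

T, d = 145, 30                  # 145 snapshots at 10 min intervals, 30-dim latents
z = np.random.default_rng(0).normal(size=(T, d))
train, test = z[:116], z[116:]  # 116/29 temporal split (80%/20%)
X_tr, Y_tr = sliding_windows(train)

assert (len(train), len(test)) == (116, 29)
assert X_tr.shape == (99, 12, d)  # 116 - 12 - 6 + 1 = 99 training windows
assert Y_tr.shape == (99, 6, d)
```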

5.3. Training and Reconstruction Performance Evaluation of DCAE

For the dimensionality reduction and reconstruction module, the parameters of the Dilated Convolutional Autoencoder (DCAE) are listed in Table 2. In the encoder, strided dilated convolutions replace traditional pooling layers to reduce dimensionality, while the decoder mirrors the encoder’s architecture. Training employs the AdamW optimizer with a learning rate of 0.001, weight decay rate of 0.01, and runs for 1000 epochs.
As observed in Figure 9, when reducing the dimension to 30, the reconstruction root mean square error (RMSE) of the DCAE across pollutant dispersion snapshots (timesteps 0 to 28 in the test set) is consistently lower than that of the Proper Orthogonal Decomposition (POD)-based method. The maximum error is approximately 3.7 × 10⁻⁵, and the minimum error is 2.1 × 10⁻⁵. This demonstrates that the dilated convolutional network outperforms linear projection methods such as POD in extracting spatially localized features, particularly on the three-dimensional structured grids used in this experiment.
Table 3 compares the average RMSE, parameter amount, training time, and inference time between the POD method and the DCAE. Figure 10 juxtaposes high-fidelity simulation results with reconstructed flow fields from both POD and DCAE at timestep 18 of the test set. Additionally, Figure 11 visually contrasts the absolute error distributions between the two methods and the high-fidelity data.

5.4. Prediction Performance Analysis of LLM-ROM on Benchmark Scenarios

Our core idea is not to use an LLM merely as a time-series predictor, but to leverage its pre-trained knowledge of sequence patterns and its powerful long-term dependency modeling capabilities to understand the complex dynamics of physical processes in a latent space. The dimensionality reduction via DCAE is a crucial step for modality alignment (“physics-to-language”), which enables us to reprogram the LLM’s general knowledge for physical simulation.

5.4.1. Comparative Experiments with Baseline Models

To comprehensively evaluate the predictive performance of LLM-ROM, this section systematically compares it with various representative methods, covering traditional reduced-order models, deep learning approaches, and advanced neural operator models. The following representative methods are selected as performance baselines:
1.
LSTM+Autoencoder (w/o DCAE): This baseline does not use DCAE for dimensionality reduction. Instead, LSTM operates directly on the original concentration field (28 × 65 × 65) to validate the necessity of DCAE compression. The LSTM has 256 hidden units, and the output layer is a fully connected layer that reconstructs the original field.
2.
POD-GPR: A representative traditional reduced-order model, using Proper Orthogonal Decomposition to retain 30 modes combined with Gaussian Process Regression for time-series prediction.
3.
DCAE-LSTM: A classic deep ROM paradigm, combining the frozen DCAE encoder with a two-layer LSTM (256 hidden units).
4.
DCAE-GRU: Gated Recurrent Unit, a simplified variant of LSTM, with 256 hidden units.
5.
DCAE-ConvLSTM: Convolutional LSTM, leveraging its convolutional structure to capture spatiotemporal dependencies simultaneously. The architecture consists of two ConvLSTM layers with 64 hidden channels and 3 × 3 kernels.
6.
DCAE-Transformer: A standard Transformer encoder (4 layers, 8 heads, without pre-training) serving as a self-attention baseline.
7.
DCAE-TFT: Temporal Fusion Transformer, a Transformer variant specifically designed for time-series forecasting, with default configurations.
8.
DCAE-U-Net: The latent-space prediction problem is reformulated as an image-to-image translation task. The historical 12-step latent vectors are stacked into an input tensor (12 × 30) and mapped directly to the future 6-step latent vectors via a standard 4-layer U-Net architecture.
9.
DCAE-FNO: Fourier Neural Operator, which learns temporal evolution directly in the latent space. The input sequence is treated as discrete samples in function space, and the mapping is learned in the spectral domain via Fourier layers. The model uses 4 Fourier layers with 16 Fourier modes.
To ensure a fair comparison, all models utilize the identical pre-trained DCAE encoder (with frozen parameters) to compress the 3D concentration fields into 30-dimensional latent vector sequences, with only the latent-space temporal prediction module being replaced. As a supplementary baseline, we also include an LSTM+Autoencoder model that operates directly on the original concentration field (28 × 65 × 65) to validate the necessity of DCAE dimensionality reduction. The experiments are conducted on the benchmark case with a temporal split: the first 116 timesteps (00:00–19:10) are used for training, and the last 29 timesteps (19:20–24:00) for testing. The input window consists of 12 timesteps (2 h), and the prediction window is 6 timesteps (1 h). The results are summarized in Table 4.
The following conclusions can be drawn from the experimental results:
1.
Necessity of DCAE Dimensionality Reduction: The LSTM+Autoencoder operating directly on the original field achieves an RMSE of 24.43 and an SSIM of only 0.532, performing significantly worse than all DCAE-based methods. This strongly validates the critical role of compressing high-dimensional physical fields into a low-dimensional latent space for time-series prediction—dimensionality reduction not only substantially reduces computational complexity but, more importantly, eliminates redundant information, enabling the model to focus on core dynamical features.
2.
Limitations of Traditional Deep ROMs: DCAE-LSTM and GRU achieve RMSEs of 12.32 and 11.44, respectively. While these outperform linear methods such as POD-GPR, they still show a considerable gap compared to advanced models. This indicates the inherent limitations of recurrent neural networks in handling long-term temporal dependencies.
3.
Effectiveness of Self-Attention Mechanisms: ConvLSTM, DCAE-Transformer, and TFT outperform LSTM, with RMSEs ranging from 8.21 to 9.86, demonstrating the advantages of self-attention mechanisms in modeling long-range dependencies. Among these, TFT, as a Transformer variant specifically designed for time-series forecasting, performs better than the standard Transformer encoder.
4.
Performance of Advanced Spatiotemporal Models: U-Net and FNO, as current SOTA models, achieve RMSEs of 7.22 and 6.71, with SSIMs of 0.920 and 0.918, respectively. FNO, as a representative neural operator method, outperforms U-Net, demonstrating the advantages of learning in the spectral domain.
5.
Significant Advantages of LLM-ROM: LLM-ROM outperforms all compared methods across all metrics, achieving a 68.3% reduction in RMSE compared to the second-best method (FNO), with SSIM improvements exceeding 5.3%. This advantage strongly validates the core design of this work: by mapping continuous latent vectors to discrete semantic prototypes via the physics-to-text alignment module, the pre-trained general sequence knowledge of LLMs is activated, enabling more accurate capture of complex physical dynamics in the latent space.

5.4.2. Ablation Study

In this study, we propose a method leveraging textual templates to guide large language models (LLMs) in pollutant prediction. To validate its efficacy, an ablation study was conducted comparing four variants: DCAE-LLM, which incorporates the complete, detailed textual template; DCAE-LLM (-), which uses a less detailed template, as shown in Figure 12; DCAE-LLM (–), which excludes the textual template entirely; and a lightweight conditioned LSTM.
The following explains the specific differences between DCAE-LLM, DCAE-LLM (-), and DCAE-LLM (–):
1.
DCAE-LLM: Uses the complete prompt template containing all three components described above, i.e., full Dataset Context, task instruction (including prediction directive and prior knowledge) and input statistics.
2.
DCAE-LLM (-): Retains only the input statistics and the most basic prediction directive from Components 1 and 2 (e.g., only the phrase "predict future concentration"), while completely removing the Dataset Context and all task prior knowledge beyond the basic directive. Specifically, the simplified module omits extensive domain background information such as building layout, meteorological conditions, pollution source details, building properties, and physical prior knowledge about pollutant dispersion.
3.
DCAE-LLM (–): Completely removes all prompt information, inputting only the patched semantic embedding sequence from the physics-to-text alignment module.
To distinguish the contribution of the prompt embedding sequence from that of the LLM's capability, we constructed a lightweight conditioned LSTM model, designed so that it receives the identical prompt embedding sequence as LLM-ROM but uses a lightweight conditioned temporal model instead of the LLM as the reasoning core. The implementation details are as follows:
1.
The three components of text prompts (Dataset Context, task instruction, input statistics) are encoded via the same word embedding layer into fixed-dimensional vectors (consistent with LLM-ROM’s prompt embedding dimension).
2.
These prompt vectors are concatenated with the patched latent vector sequence as input to the LSTM.
3.
The LSTM adopts a two-layer architecture with 256 hidden units, consistent with DCAE-LSTM.
4.
The output layer maps back to the latent vector space via a fully connected layer, followed by DCAE decoder reconstruction to physical fields.
The only difference between this model and LLM-ROM is that the lightweight LSTM replaces the pre-trained LLM as the temporal reasoning core, while receiving identical prompt embedding sequences.
The experiments are conducted under the 1 h prediction window, comparing the configurations described above. The results are shown in Table 5:
The key findings from the experiments are summarized below:
1.
Importance of Domain and Task Prior Knowledge: Compared to the full model (DCAE-LLM), the simplified module (DCAE-LLM (-)) shows a 122% increase in RMSE and a decrease in SSIM to 0.943. This proves that Component 1 (Dataset Context) and the rich task prior knowledge in Component 2 significantly contribute to model performance. Background information such as building layout, meteorological conditions, pollution source characteristics, and physical principles of pollutant dispersion helps the model more accurately understand the physical meaning of the latent space sequence, leading to more precise predictions.
2.
Auxiliary Role of Statistical Information: The no-prompt module (DCAE-LLM (–)) performs worse than the simplified module (DCAE-LLM (-)) (RMSE increasing from 4.73 to 5.96), indicating that even in the absence of domain prior knowledge, basic statistical information still provides a useful summary of the sequence state and plays an auxiliary role.
3.
Even with identical prompt embedding sequences, the lightweight conditioned LSTM achieves an RMSE of 11.89, significantly higher than LLM-ROM's 2.13, a performance gap of 458%. This proves that the performance gain primarily stems from the LLM's inherent sequence modeling and semantic understanding abilities, not from the prompt embedding sequence itself. Although the LSTM can receive word embeddings as input, it treats these vectors as ordinary numerical features, learning statistical correlations with the prediction target through its gating mechanisms. In contrast, the pre-trained LLM's embedding space itself encodes rich semantic knowledge. When prompt embeddings are fed into the LLM, they activate the semantic understanding acquired during pre-training on massive corpora, allowing the model to truly "comprehend" the physical concepts represented by these prompts and their interrelationships (such as the opposition between "rise" and "fall," or the contrast between "high wind speed" and "low wind speed"). This fundamental difference in semantic understanding capability is the root cause of the significant performance gap between the lightweight conditioned LSTM and LLM-ROM, even when they receive identical prompt embedding sequences.
Figure 13 compares pollutant concentrations in the vertical and horizontal directions across all three models against high-fidelity CFD simulations at timestep 18 of the test set. To clearly visualize flow field prediction errors, we generated absolute error contour maps of pollutant concentrations at timestep 18 for the three model variants (Figure 14). Analysis of these error distributions reveals that the proposed ROM's errors are predominantly concentrated near ground-level intersection hubs and along wind direction transition zones. These observations robustly demonstrate the model's exceptional performance under few-shot training conditions, highlighting the critical role of the flow field textual template in enhancing predictive accuracy.
To quantify the contribution of LLM pre-trained knowledge and fine-tuning strategies to model performance, we designed the extra ablation experiments. As shown in Table 6, the key findings are as follows:
1.
Decisive Role of Pre-trained Knowledge
Replacing Time-LLM’s pre-trained weights with random initialization causes a dramatic performance drop—RMSE skyrockets from 2.13 to 10.72, a 403% increase, and SSIM drops from 0.967 to 0.816. This result fully demonstrates that in few-shot scenarios with only 116 training samples, the Transformer architecture alone cannot learn effective physical dynamics from scratch. Pre-trained knowledge is the cornerstone of LLM-ROM performance, providing powerful temporal priors that enable rapid adaptation to physical field prediction tasks.
2.
Trade-offs in Fine-tuning Strategies
We compared two fine-tuning strategies: parameter-efficient fine-tuning and full fine-tuning. The results show that full fine-tuning brings only a 2.8% accuracy improvement (RMSE from 2.13 to 2.07), but at enormous cost:
(a)
Trainable parameters increase from 1.2M to 1.5B (a 1250× increase).
(b)
Training time extends from 0.25 h to 13 h.
(c)
GPU memory usage skyrockets from 14.2 GB to over 80 GB.
(d)
Overfitting risk increases in small-data scenarios (validation loss decreases then increases).
This indicates that in data-scarce CFD scenarios, the benefits of full fine-tuning are negligible, while the computational cost is enormous. Parameter-efficient fine-tuning achieves performance close to full fine-tuning with minimal parameters (0.04%), while avoiding overfitting and catastrophic forgetting, making it the superior strategy.

5.5. Transferability Experiment

The experimental results in preceding sections demonstrate that the proposed model achieves high predictive efficiency and accuracy in the 3D street canyon scenario. However, trained ROMs typically exhibit scenario-specific limitations, requiring retraining when environmental conditions change. A notable drawback of deep neural networks lies in their computationally intensive training processes, which diminishes their practical advantages compared to conventional methods. To address this limitation, we investigated the transfer learning potential of the model.
To validate the model's transferability and training acceleration efficacy, we modified meteorological conditions in the canonical 3D street canyon flow field. Under the new conditions, two ROM variants maintained identical architectures: one model was initialized with random parameters, while the transfer model utilized pre-trained parameters from the previous section. Specifically, for the 3D street canyon configuration, we selected two aspect ratios (0.9 and 1.2) while keeping other meteorological parameters constant, only altering the wind direction (southerly vs. southwesterly) to evaluate transferability under minor conditional variations.
Figure 15 compares the convergence rates of loss functions between the transfer model and the randomly initialized model under these configurations. The results reveal that the transfer model consistently achieved significantly lower loss values throughout the training iterations. Quantitative analysis demonstrates that after 30 training epochs, the randomly initialized model exhibited a loss value 2.8× higher than the transfer model. These findings indicate that the proposed model possesses robust transfer learning capabilities under modified meteorological conditions, substantially accelerating convergence. Remarkably, the transfer model achieved convergence within 14 min, representing a 55% reduction in training time compared to the randomly initialized counterpart.
To rigorously evaluate the transfer learning capability of LLM-ROM under truly few-shot scenarios, we constructed 12 training cases using the ENVI-met numerical simulation platform to generate high-fidelity datasets; the source domain is summarized in Table 7. The extrapolation performance was then tested on two additional cases (target domain), shown in Table 8, where the change in vegetation configuration leads to significantly altered flow fields and concentration distributions, making them ideal for validating transfer capability.
From the 145 timesteps of the target domain, we randomly sample 5, 10, and 20 timesteps as the fine-tuning training set (corresponding to 3.4%, 6.9%, and 13.8% of the total data), with the remaining samples used for testing. To ensure statistical robustness, each experiment is repeated 5 times and the results are averaged. The fine-tuning strategy is as follows: the LLM backbone and DCAE encoder parameters are completely frozen, with updates applied only to the patch reprogramming layer (projection matrices $W_a$, $b_a$), the text prototypes $E$, and the inverse projection layer ($W_{\text{patch}}$, $b_{\text{patch}}$), resulting in approximately 1.2M trainable parameters. Training uses the AdamW optimizer with a learning rate of 1 × 10⁻⁴ for up to 50 epochs with early stopping. As a baseline for comparison, DCAE-LSTM is trained from scratch on the target domain with the same number of samples (no parameters frozen). The experimental results are presented in Table 9.
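The parameter budget of this freezing strategy can be illustrated with hypothetical per-module counts; the module names follow the text, but the individual numbers are invented, and only the frozen/trainable split and the orders of magnitude are meaningful:

```python
# Hypothetical per-module parameter counts (illustrative only; the ~1.2M
# trainable figure is the paper's, the breakdown below is invented).
params = {
    "llm_backbone":        355_000_000,  # frozen
    "dcae_encoder":          2_000_000,  # frozen
    "patch_reprogramming":     600_000,  # W_a, b_a         -> trainable
    "text_prototypes":          98_304,  # E: 128 x 768     -> trainable
    "inverse_projection":      500_000,  # W_patch, b_patch -> trainable
}
trainable_keys = {"patch_reprogramming", "text_prototypes", "inverse_projection"}

trainable = sum(v for k, v in params.items() if k in trainable_keys)
total = sum(params.values())

assert trainable < 1_300_000       # on the order of the reported ~1.2M
assert trainable / total < 0.005   # a tiny fraction of all parameters
```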
The key observations from the experiments are summarized as follows:
1.
With only 5 samples for fine-tuning, LLM-ROM achieves an RMSE of 5.78 × 10⁻² μg/m³ with an SSIM of 0.903. This indicates that the model can rapidly capture the core dynamics of the new scenario even with extremely sparse labels. In contrast, DCAE-LSTM trained from scratch performs poorly, with an RMSE of 19.69 × 10⁻² μg/m³ and an SSIM of only 0.588, essentially failing to learn.
2.
With only 20 samples for fine-tuning, LLM-ROM's RMSE drops to 3.24 × 10⁻² μg/m³, approaching the full-training performance on the source domain (2.13 × 10⁻² μg/m³), with SSIM improving to 0.952. This demonstrates that the model achieves effective domain adaptation with less than 14% of the target domain data.
Then, the model autoregressively predicted the latter 90% (130 timesteps) of flow field sequences using only the initial 10% (15 timesteps) from the extrapolation test set, with a sliding time window of 15 steps. Consistent with the methodology outlined in the preceding section, a Dilated Convolutional Autoencoder (DCAE) was employed to project the flow field data into a 30-dimensional latent space for efficient feature extraction. During the validation phase, we tested the model on flow field data under untrained scenarios featuring a street canyon aspect ratio (H/W) of 1.2, 10 m tall trees, and wind directions of 180° (southerly) and 225° (southwesterly). As illustrated in Figure 16, comparative analyses between Computational Fluid Dynamics (CFD) results and model predictions were conducted at three representative timesteps (Step 18, 70, and 145). The predicted concentration fields exhibited strong alignment with CFD benchmarks. Error maps revealed a maximum absolute error of 0.32 at later timesteps, attributable to error accumulation inherent in autoregressive extrapolation processes. Despite this expected temporal error propagation, the model maintained robust performance in transfer learning tasks, demonstrating its potential for generalization across diverse flow conditions.
In order to systematically evaluate the stability of LLM-ROM in long-term autoregressive prediction, we introduce the following two quantitative metrics:
  • Error Doubling Step (EDS): Defined as the number of steps at which RMSE first exceeds twice the RMSE at step 15 (the initial prediction step). This metric measures the model’s ability to maintain low error levels; a larger EDS indicates better long-term stability.
  • Average Error Growth Rate (AEGR): Defined as the average per-step increase in RMSE from step 15 to step 145:
    $$\mathrm{AEGR} = \frac{\mathrm{RMSE}_{145} - \mathrm{RMSE}_{15}}{145 - 15}$$
    This metric quantifies the speed of error accumulation.
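Both stability metrics are straightforward to compute from a recorded RMSE curve; the curve below is synthetic and chosen only to exercise the definitions:

```python
def error_doubling_step(steps, rmse, start_step=15):
    """First step at which RMSE exceeds twice the RMSE at the initial step."""
    base = rmse[steps.index(start_step)]
    for s, e in zip(steps, rmse):
        if s > start_step and e > 2 * base:
            return s
    return None  # error never doubled within the horizon

def aegr(steps, rmse, start=15, end=145):
    """Average per-step RMSE growth between the start and end steps."""
    return (rmse[steps.index(end)] - rmse[steps.index(start)]) / (end - start)

# Synthetic, linearly growing error curve (illustration only, not the paper's data).
steps = list(range(15, 146))
rmse = [4.0 + 0.0625 * (s - 15) for s in steps]

assert error_doubling_step(steps, rmse) == 80  # 4.0 first doubles once 0.0625*(s-15) > 4
assert aegr(steps, rmse) == 0.0625             # constant slope by construction
```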
The results in Table 10 demonstrate the following:
1.
LLM-ROM exhibits the slowest error growth: from step 15 to 145, its RMSE increases from 4.58 × 10⁻² μg/m³ to 9.28 × 10⁻² μg/m³ (a 103% increase), while DCAE-Transformer's grows by 159% and DCAE-LSTM's by 187%. LLM-ROM's final RMSE (9.28 × 10⁻²) is even lower than DCAE-Transformer's RMSE at around step 35.
2.
Significant advantage in EDS: LLM-ROM achieves an EDS of 94 steps, compared to 35 for DCAE-Transformer and 23 for DCAE-LSTM.
3.
Lowest AEGR: LLM-ROM's AEGR is 0.034, less than one-third of DCAE-Transformer's 0.103 and about one-quarter of DCAE-LSTM's 0.141.
Additionally, we record the full-field RMSE at each prediction step k (k = 15, 16, …, 145) and plot the RMSE evolution curve as a function of prediction steps.
Further analysis of the RMSE evolution curves in Figure 17 reveals that the error accumulation process of LLM-ROM can be divided into three distinct stages:
1.
Initial Stage (steps 15–30): During this stage, LLM-ROM exhibits extremely slow error growth, with RMSE increasing only marginally from 4.58 × 10⁻² μg/m³ to 4.92 × 10⁻² μg/m³, an increase of approximately 7.4%. The curve remains nearly flat, indicating that the model maintains very high fidelity in short-to-medium term predictions, accurately capturing the dominant dynamical modes of pollutant dispersion with minimal initial error and negligible accumulation.
2.
Mid-term (31–90 steps): Error begins to increase at an approximately linear rate, with RMSE rising steadily from 5.31 × 10 2 μ g / m 3 to 7.65 × 10 2 μ g / m 3 —an increase of about 44%. This stage corresponds to the low-wind nighttime period (20:00–06:00), during which pollutants continuously accumulate on the leeward side, leading to complex flow structures and enhanced nonlinearity. Although the model maintains reasonably good prediction accuracy, the error accumulation rate accelerates compared to the initial stage. Nevertheless, the growth slope of LLM-ROM remains significantly lower than that of DCAE-Transformer and DCAE-LSTM, demonstrating its superior adaptability to complex dynamics.
3.
Long-term (91–145 steps): Error growth slows considerably and gradually approaches saturation, with RMSE increasing slowly from 8.12 × 10 2 μ g / m 3 to 9.28 × 10 2 μ g / m 3 —an increase of only 14.3%. This phenomenon does not indicate performance degradation but rather results from a combination of factors: (1) pollutant concentrations have physical upper bounds, preventing model predictions from diverging indefinitely from ground truth; (2) LLM-ROM has sufficiently learned the long-term dynamics of the system, such that subsequent error accumulation primarily stems from the slow propagation of initial errors rather than newly introduced biases; and (3) the natural saturation effect inherent in autoregressive prediction—once the model reaches its inherent error ceiling, incremental errors in subsequent steps gradually approach zero. The final RMSE of 9.28 μ g/m3 corresponds to a relative error of approximately 18.6% of the source domain mean concentration, which is acceptable for 24 h ultra-long-term prediction and is substantially lower than that of the compared methods.
As shown in Table 11, when evaluating new meteorological conditions or building layouts, traditional CFD methods require a complete full-order recomputation (20,000 s). Our method adopts a more efficient strategy: after the 900 s training of LLM-ROM, only the initial 10% of timesteps (15 steps) under the new scenario needs to be simulated as model input, which takes approximately 2000 s (10% of the CFD time); the pre-trained LLM-ROM then extrapolates the remaining 90% (130 steps) in 30 s. This accounting omits some practical steps, such as geometric modeling of the new scenario and boundary-condition setup. Nevertheless, for scenario-adaptation tasks, the total time of our method is approximately 2030 s, a roughly 9.85× speedup over traditional CFD's 20,000 s, saving roughly 90% of the computational time.
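The timing arithmetic can be reproduced in a few lines; this is a sketch of the accounting in Table 11 (variable names are ours), not of the solver itself:

```python
# Timing model for scenario adaptation (all values in seconds, from Table 11).
cfd_full   = 20_000            # full-order CFD re-run for a new scenario
warmup_cfd = 0.10 * cfd_full   # simulate the first 15 steps (10%) as model input
rollout    = 30                # LLM-ROM extrapolates the remaining 130 steps

adaptation_total = warmup_cfd + rollout   # total LLM-ROM pipeline time
speedup = cfd_full / adaptation_total     # speedup over full CFD

print(f"total {adaptation_total:.0f} s, speedup {speedup:.2f}x")
```

Note that the one-off 900 s training cost is excluded here, as in the paper's speedup figure; amortized over repeated scenario evaluations it quickly becomes negligible.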

6. Conclusions and Future Work

This study proposes a non-intrusive reduced-order model that integrates a Dilated Convolutional Autoencoder with pre-trained large language models to address flow field prediction for microclimate pollutant dispersion, leveraging meteorological, spatial, and temporal information synergistically. The model employs the DCAE to extract low-dimensional spatiotemporal features from high-fidelity numerical simulations, while pre-trained LLMs uncover dynamic temporal relationships within these features. Prior to LLM-based prediction, a textual description template tailored for pollutant dispersion data is designed to enrich contextual inputs with meteorological conditions, urban geometry, and domain-specific prior knowledge, thereby guiding the LLM’s reasoning. Text embedding techniques further ensure that low-dimensional flow field data are formatted in a manner interpretable to the LLM. This approach significantly enhances the LLM’s performance in flow field time-series prediction and enables robust transfer learning across diverse scenarios.
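The exact prompt template is given in Figure 6; the sketch below only illustrates the idea of enriching the LLM input with meteorological and geometric context. All field names and wording here are illustrative, not the paper's template:

```python
def build_prompt(wind_speed, wind_dir, aspect_ratio, horizon):
    """Illustrative textual context for pollutant-dispersion prediction.
    This is NOT the paper's exact template (see Figure 6); it only shows
    how scenario metadata can be serialized into a prompt string."""
    return (
        f"Task: predict the next {horizon} latent states of a PM10 "
        "concentration field.\n"
        f"Meteorology: wind speed {wind_speed} m/s, direction {wind_dir} deg.\n"
        f"Geometry: street canyon with aspect ratio H/W = {aspect_ratio}.\n"
        "Prior knowledge: pollutants accumulate on the leeward side "
        "under low-wind conditions."
    )

prompt = build_prompt(1.0, 180, 0.9, 130)
print(prompt)
```

The resulting string would be embedded and prepended to the reprogrammed flow-field tokens before LLM inference.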
To validate the proposed DCAE-LLM framework, two experiments of varying complexity were conducted. Results demonstrate that under limited training data, the ROM achieves superior prediction accuracy and enhanced stability compared to conventional Deep ROMs. In transfer learning experiments, the model maintains high precision when extrapolating to data with differing meteorological conditions or building configurations. These findings indicate that DCAE-LLM effectively captures hydrodynamic characteristics and substantially improves prediction accuracy and generalizability across heterogeneous flow fields. Experimental results demonstrate that LLM-ROM achieves an RMSE of 2.13 × 10⁻² μg/m³ and an SSIM of 0.967 for 1 h PM10 concentration prediction in the baseline scenario, significantly outperforming existing deep ROMs and advanced time-series models. In few-shot transfer scenarios, the model requires only 5–10 target-domain samples for effective domain adaptation.
While the ROM excels in few-shot transfer learning, challenges persist in highly dynamic and variable scenarios. These limitations partly stem from data scarcity, which complicates model training and warrants further investigation. For example, we plan to conduct systematic parametric sensitivity tests (e.g., wind speed ±20%, wind direction ±10°, different temperature and humidity profiles) to further quantify the model's robustness boundaries, and future work will explore advanced data selection strategies to enhance adaptability in complex, transient environments and under fluctuating meteorological conditions. We also intend to investigate model sensitivity to key parameters systematically using methods such as Latin Hypercube Sampling.
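As a sketch of the planned sensitivity tests, the following plain-Python Latin Hypercube Sampling routine generates a space-filling design over the wind-speed and wind-direction ranges mentioned above. The function name and the two-parameter setup are illustrative; a production study would likely use a library implementation (e.g., scipy's quasi-Monte Carlo module):

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Plain Latin Hypercube Sampling: each dimension is split into
    n_samples equal strata, each stratum is sampled exactly once, and
    strata are shuffled independently per dimension."""
    rng = random.Random(seed)
    dims = len(bounds)
    # For each dimension, a shuffled assignment of strata 0..n-1 to samples.
    strata = [rng.sample(range(n_samples), n_samples) for _ in range(dims)]
    samples = []
    for i in range(n_samples):
        point = []
        for d, (lo, hi) in enumerate(bounds):
            u = (strata[d][i] + rng.random()) / n_samples  # uniform in stratum
            point.append(lo + u * (hi - lo))
        samples.append(point)
    return samples

# Wind speed 1 m/s +/-20% and wind direction 180 deg +/-10 deg, as proposed.
plan = latin_hypercube(10, [(0.8, 1.2), (170.0, 190.0)])
```

Each of the 10 design points would then define one perturbed ENVI-met scenario for evaluating the model's robustness boundary.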
The present study is validated solely on high-fidelity CFD simulation data and has not yet incorporated real-world urban air quality monitoring data. Although simulation data provide a physically consistent benchmark, a domain gap remains, mainly in the following respects: (1) Emission source dynamics: real traffic flow exhibits randomness and time-varying characteristics, while simulations typically employ simplified hourly patterns. (2) Meteorological disturbances: real wind speed, wind direction, temperature, and humidity exhibit stronger randomness and non-stationarity. (3) Measurement noise: real-world data contain instrument errors and sampling uncertainty, while simulation data are idealized outputs. (4) Boundary conditions: the complexity of real urban underlying surfaces (such as building aging and vegetation growth) is difficult to model fully. Therefore, the generalization capability of the proposed model when transferred to real-world data scenarios remains to be validated.
Furthermore, the geometric configuration in this study is limited to an idealized cross-shaped canyon; future work could extend the framework to more complex urban morphologies. The current deterministic prediction framework could also be extended to probabilistic forecasting to quantify predictive uncertainty. We believe that the LLM-ROM framework opens new avenues for data-driven urban microclimate simulation and holds promise for applications in smart city management and environmental planning.

Author Contributions

Conceptualization, P.W. and Z.Q.; methodology, P.W. and Z.Q.; software, Z.Q.; validation, P.W. and Y.Y.; formal analysis, Y.Y.; investigation, Z.Q.; resources, P.W.; data curation, Z.Q.; writing—original draft preparation, Z.Q.; writing—review and editing, P.W. and Y.Y.; visualization, Z.Q.; supervision, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. This study did not involve humans.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CFD    Computational Fluid Dynamics
LLM    Large Language Model
CAE    Convolutional Autoencoder
ARIMA  Autoregressive Integrated Moving Average
PM     Particulate Matter
LSTM   Long Short-Term Memory
ROM    Reduced-Order Model
POD    Proper Orthogonal Decomposition
DMD    Dynamic Mode Decomposition

Appendix A. Nomenclature

For readers’ convenience, this appendix provides a comprehensive list of the mathematical symbols and physical variables used in this paper, listed as English letters followed by Greek letters, together with their descriptions. Vectors and matrices are denoted in bold.
Table A1. Nomenclature.
Symbol      Description

Basic Dimensions
D           Number of grid points in depth direction
H           Number of grid points in height direction
W           Number of grid points in width direction
T           Length of input sequence
F           Length of prediction sequence
d           Dimension of latent vector
d_text      Dimension of text embedding
d_token     Dimension of token
K           Number of text prototypes
L           Patch length
S           Patch stride
N_p         Number of patches
L_p         Length of prompt
h           Number of attention heads

Physical Variables
X_t         Concentration field at time t
C_{i,j,k}   Concentration value at grid point (i, j, k)
X̂_t         Predicted concentration field

Latent Space Variables
z_t         Latent vector at time t
z̃_t         Normalized latent vector
ẑ_t         Predicted latent vector

Patching Related
P_i         i-th patch
t_i         i-th patch token
W_patch     Patch projection weight matrix
b_patch     Patch projection bias

Text Prototype Related
C           Text prototype codebook
c_k         k-th text prototype
h_i         i-th query vector
W_a         Alignment layer projection weight
b_a         Alignment layer projection bias
α_{i,k}     Attention weight
w_i         Semantic token
ŵ_i         Predicted semantic token

LLM Related
T           Prompt embedding
S^(k)       LLM input sequence at step k
h_last^(k)  Hidden state of the last position

Inverse Projection Related
W_patch     Inverse projection weight matrix
b_patch     Inverse projection bias

Network Modules
E_φ         DCAE encoder
D_ψ         DCAE decoder

Greek Letters
μ           Mean
σ           Standard deviation
ϵ           Small constant
λ           Trade-off coefficient

Figure 1. The structure of the encoder in the pre-trained LLM.
Figure 2. The overall structure.
Figure 3. The structure of DCAE.
Figure 4. The structure of patch reprogramming.
Figure 5. Patch reprogramming.
Figure 6. Textual prompt template for pollutant dispersion prediction.
Figure 7. Grid schematic diagram. (a) H/W ratio 0.5 grid configuration; (b) H/W ratio 0.9 grid configuration; (c) H/W ratio 1.2 grid configuration; (d) H/W ratio 0.5 grid with pollution line source; (e) H/W ratio 0.9 grid with pollution line source; (f) H/W ratio 1.2 grid with pollution line source.
Figure 8. Model schematic diagram. (a) Cross-sectional views of three aspect ratios (H/W). (b) Distribution of two pollution line sources. (c) Height and spatial arrangement of two vegetation types.
Figure 9. The reconstruction RMSE of the DCAE and POD when reducing dimensions to 30.
Figure 10. Comparison of high-fidelity, POD, and DCAE reconstructed snapshots at the 18th timestep of the test set.
Figure 11. The absolute error distributions between the high-fidelity results and the DCAE and POD methods.
Figure 12. The less detailed textual template (DCAE-LLM (-)).
Figure 13. Comparison of high-fidelity snapshots and DCAE-LLM, DCAE-LLM (-), DCAE-LLM (–), and DCAE-Transformer results at the 18th timestep of the test set.
Figure 14. The absolute error distributions between high-fidelity, DCAE-LLM, DCAE-LLM (-), and DCAE-Transformer results at the 18th timestep of the test set.
Figure 15. The loss function of the transferred model and the randomly initialized model under a different wind direction. LOSS represents the loss function value; Epoch represents the number of iterations.
Figure 16. Comparison between Computational Fluid Dynamics (CFD) results and extrapolation predictions at timesteps 18, 40, and 145.
Figure 17. RMSE evolution over autoregressive prediction steps (k = 15 to 145).
Table 1. Key parameters for ENVI-met simulation.

Meteorological Conditions:
  Instantaneous air temperature and humidity sourced from weather station data (Longitude: 121.6°, Latitude: 30.8°) on 15 July 2024
  Longwave sky radiation calculated by ENVI-met
  Specific humidity at 2500 m altitude: 7 g/kg
  Reference roughness length: 0.1 m
  Wind direction: 180°, 225°
  Wind speed: 1 m/s (at 10 m height)
Street Configuration:
  Aspect ratio: 0.5 (18/33), 0.9 (18/21), 1.2 (18/15)
Pollution Source:
  Type: PM10 (10 µm diameter)
  Emission height: 0.3 m
  Source geometry: line source
Vegetation:
  Trees: height = 10 m, canopy width = 5 m, LAD = 2 m²/m³
  Shrubs: height = 1.5 m, LAD = 2 m²/m³
Building:
  External walls: K = 1.0 W/(m³·K), α = 0.4
  Roof: K = 0.9 W/(m³·K), α = 0.3
Ground Structure and Thermal Properties:
  Layers: 20 cm concrete, 10 cm sand-soil
  Concrete: α = 0.3, β = 1.51 W/(m·K), ρ = 2300 kg/m³
K is the heat transfer coefficient; LAD is the leaf area density; α is the reflectance; β is the thermal conductivity; ρ is the density.
Table 2. The parameters of DCAE.

Module | Layer | Dilation Rate | Stride | Output Size
Input | - | - | - | 28 × 65 × 65
Block 1 | 3D Dilated Convolution | 1 | 1 | 28 × 65 × 65
Block 1 | 3D Strided Convolution | 1 | 2 | 14 × 33 × 33
Block 2 | 3D Dilated Convolution | 2 | 1 | 14 × 33 × 33
Block 2 | 3D Strided Convolution | 1 | 2 | 7 × 17 × 17
Block 3 | 3D Dilated Convolution | 4 | 1 | 7 × 17 × 17
Block 3 | 3D Strided Convolution | 1 | 2 | 4 × 9 × 9
Block 4 | 3D Dilated Convolution | 8 | 1 | 4 × 9 × 9
Block 4 | 3D Strided Convolution | 1 | 2 | 2 × 5 × 5
Block 5 | 3D Dilated Convolution | 16 | 1 | 2 × 5 × 5
Block 5 | 3D Strided Convolution | 1 | 2 | 1 × 3 × 3
Flatten | Flatten | - | - | 4086
Fully Connected Layer | Linear | - | - | 128
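The output sizes in Table 2 follow from standard convolution arithmetic, assuming 3 × 3 × 3 kernels (the kernel size is our assumption; the table lists only dilation and stride): a dilated convolution with padding equal to its dilation rate preserves the spatial size, while each stride-2 convolution (padding 1) roughly halves it. A quick shape-only check in Python, with no weights involved:

```python
def conv_out(n, kernel=3, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one axis."""
    eff = dilation * (kernel - 1) + 1          # effective (dilated) kernel size
    return (n + 2 * padding - eff) // stride + 1

shape = (28, 65, 65)                           # input grid (D, H, W) per Table 2
shapes = []
for d in (1, 2, 4, 8, 16):                     # dilation rate of each block
    # dilated conv with padding = dilation keeps the size unchanged
    shape = tuple(conv_out(n, padding=d, dilation=d) for n in shape)
    # strided conv (stride 2, padding 1) halves each axis, rounding up
    shape = tuple(conv_out(n, stride=2, padding=1) for n in shape)
    shapes.append(shape)

print(shapes)  # [(14, 33, 33), (7, 17, 17), (4, 9, 9), (2, 5, 5), (1, 3, 3)]
```

The per-block shapes reproduce the Output Size column of Table 2 exactly.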
Table 3. The comparison of average RMSE, number of parameters, training time, and prediction time between DCAE and POD.

Metric | DCAE | POD
Average RMSE (1 × 10⁻⁵) | 2.86 | 4.42
Number of parameters (M) | 0.25 | 3.55
Training time (h) | 0.43 | 0.1
Prediction time (s) | 14.7 | 18.2
Table 4. The comparative experiments between representative methods.

Method | RMSE (1 × 10⁻² μg/m³) | SSIM | R² | Trainable Parameters
LSTM+Autoencoder | 24.43 | 0.532 | 0.326 | 25.6M
POD-GPR | 18.86 | 0.703 | 0.654 | -
DCAE-LSTM | 12.32 | 0.848 | 0.823 | 8.2M
DCAE-GRU | 11.44 | 0.853 | 0.839 | 8.0M
DCAE-ConvLSTM | 8.57 | 0.884 | 0.875 | 9.3M
DCAE-Transformer | 9.86 | 0.878 | 0.866 | 12.1M
DCAE-TFT | 8.21 | 0.891 | 0.886 | 9.5M
DCAE-U-Net | 7.22 | 0.920 | 0.897 | 14.2M
DCAE-FNO | 6.71 | 0.918 | 0.911 | 4.8M
LLM-ROM (Ours) | 2.13 | 0.967 | 0.963 | 1.2M
Table 5. The ablation experiments of the textual prompt.

Configuration | RMSE (1 × 10⁻² μg/m³) | SSIM | R² | RMSE Increase
DCAE-LLM | 2.13 | 0.967 | 0.963 | -
DCAE-LLM (-) | 4.73 | 0.943 | 0.936 | +122%
DCAE-LLM (–) | 5.96 | 0.932 | 0.926 | +179%
Lightweight conditioned LSTM | 11.89 | 0.851 | 0.834 | +458%
Table 6. Ablation study results for LLM-ROM (1 h prediction).

Configuration | Trainable Parameters | RMSE (1 × 10⁻² μg/m³) | SSIM | R²
LLM-ROM (Ours) | 1.2M | 2.13 | 0.967 | 0.963
w/o LoRA | 1.5B | 2.07 | 0.968 | 0.965
w/o pre-training | 1.2M | 10.72 | 0.816 | 0.709
Table 7. The training set of the transferability experiment.

Aspect Ratio (H/W) | Wind Direction (°) | Vegetation
0.5, 0.9 | 180, 225 | trees, bushes
1.2 | 180, 225 | bushes
Table 8. The extrapolation set of the transferability experiment.

Aspect Ratio (H/W) | Wind Direction (°) | Vegetation
1.2 | 180, 225 | trees
Table 9. The few-shot fine-tuning experiments between DCAE-LSTM and LLM-ROM.

Fine-Tuning Samples | Method | RMSE (1 × 10⁻² μg/m³) | SSIM | R²
5 samples | DCAE-LSTM | 19.69 | 0.588 | 0.464
5 samples | LLM-ROM (Ours) | 5.78 | 0.903 | 0.894
10 samples | DCAE-LSTM | 16.69 | 0.678 | 0.583
10 samples | LLM-ROM (Ours) | 4.71 | 0.929 | 0.921
20 samples | DCAE-LSTM | 12.41 | 0.767 | 0.729
20 samples | LLM-ROM (Ours) | 3.24 | 0.952 | 0.945
Table 10. Comparison of error accumulation characteristics for different methods.

Method | RMSE @ Step 15 | RMSE @ Step 145 | Error Doubling Step | AEGR
DCAE-LSTM | 10.24 | 29.34 | 23 | 0.141
DCAE-Transformer | 8.56 | 22.13 | 35 | 0.103
LLM-ROM (Ours) | 4.58 | 9.28 | 94 | 0.034
Table 11. The comparison of extrapolation time between CFD simulation and LLM-ROM.

Method | Single Prediction Time (s) | Scenario Adaptation Time (s) | Speedup
CFD simulation | 20,000 | 20,000 | 1×
LLM-ROM (Ours) | 30 | 2030 (2000 s CFD warm-up + 30 s extrapolation) | 9.85×
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, P.; Qin, Z.; Yang, Y. LLM-ROM: A Novel Framework for Efficient Spatiotemporal Prediction of Urban Pollutant Dispersion. AI 2026, 7, 104. https://doi.org/10.3390/ai7030104
