Article

Edge Computing-Enabled Secure Forecasting Nationwide Industry PM2.5 with LLM in the Heterogeneous Network

1 College of Computer Science and Software Engineering, Hohai University, Nanjing 210098, China
2 Yuxin Electronic Technology Group Co., Ltd., Zhengzhou 450046, China
3 Shenzhen Urban Transport Planning Center Co., Ltd., Shenzhen 518000, China
4 College of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(13), 2581; https://doi.org/10.3390/electronics13132581
Submission received: 1 June 2024 / Revised: 24 June 2024 / Accepted: 26 June 2024 / Published: 30 June 2024
(This article belongs to the Special Issue Network Security Management in Heterogeneous Networks)

Abstract

The heterogeneous network formed by the deployment and interconnection of various network devices (e.g., sensors) has attracted widespread attention. Forecasting P M 2.5 across the entire industrial region of mainland China is an important application of heterogeneous networks and is of great significance for factory management and human health. Recently, Large Language Models (LLMs) have exhibited notable capabilities in time series prediction. However, applying existing LLMs to forecast nationwide industry P M 2.5 encounters two issues. First, most LLM-based models use centralized training, which requires uploading large amounts of data from sensors to a central cloud. This transmission process creates security risks of data leakage. Second, LLMs fail to extract spatiotemporal correlations in the nationwide sensor network (heterogeneous network). To tackle these issues, we present a novel framework entitled Spatio-Temporal Large Language Model with Edge Computing Servers (STLLM-ECS) to securely predict nationwide industry P M 2.5 in China. In particular, we first partition the entire sensor network, located in the national industrial region, into several subgraphs. Each subgraph is allocated an edge computing server (ECS) for training and inference, avoiding the security risks caused by data transmission. Additionally, a novel LLM-based approach named Spatio-Temporal Large Language Model (STLLM) is developed to extract spatiotemporal correlations and infer prediction sequences. Experimental results prove the effectiveness of our proposed model.

1. Introduction

With the deployment and interconnection of diverse network devices (e.g., sensors and servers), the heterogeneous network has emerged as a widespread network scenario covering a wide range of geographic areas and integrating various types of information for more effective decision-making [1,2,3,4]. Recently, the Chinese government has deployed numerous sensors across national industry regions, forming a nationwide sensor network (heterogeneous network) to monitor and collect industry Particulate Matter 2.5 ( P M 2.5 ), i.e., particulate matter with a diameter of 2.5 μm or less, in industrial regions. Based on these data in the heterogeneous network, an important application is nationwide industry P M 2.5 prediction, which has important value for regulating industrial production and protecting human health [5,6].
Lately, we have witnessed the birth of Large Language Models (LLMs) [7] and the revolution they have brought to Natural Language Processing (NLP) [8]. The core idea is to pretrain LLMs on massive corpora, yielding abundant intrinsic knowledge that facilitates downstream tasks. Until now, many efforts have attempted to use LLMs for time series forecasting on sensor networks. Most of these model a sensor network with small spatial coverage (e.g., Beijing, Shanghai, the Yangtze River Delta, or the Pearl River Delta) and train on a central server [9,10]. In our study, however, we broaden the scope to collectively predict industry P M 2.5 across the industrial regions of the Chinese mainland at fine spatial granularity, covering more than a thousand sensors. More data are generated due to the large number of sensors. If we continue to use the centralized training pattern, more data need to be uploaded to the central cloud. This increase in data transmission volume is bound to raise the security risks around data leakage.
Several studies have shown that spatiotemporal correlations exist among sensors in the nationwide sensor network [11]. In particular, the nationwide sensor network is quite complex. In a sensor network with multiple interlinked sensors, the industry P M 2.5 concentration at a given sensor is affected by its neighbors; these influences are identified as the spatial dependencies. The future industry P M 2.5 concentration of each sensor is also substantially influenced by its own history, termed the temporal dependencies. These two types of dependencies change dynamically over time and interact with each other. Collectively, we call them the spatiotemporal correlations. As a decoder-only structure, an LLM simply generates output sequences. Thus, if LLMs are employed directly on the sensor network, the spatiotemporal correlations between the sensors are ignored.
In summary, two urgent challenges necessitate solutions in nationwide industry P M 2.5 forecasting via LLM. First, handling large volumes of data on a central server heightens the related security risks. Second, LLMs struggle to capture spatiotemporal correlations among sensors. Motivated by these challenges, a novel LLM-based approach entitled Spatio-Temporal Large Language Model with Edge Computing Servers (STLLM-ECS) is developed to securely forecast nationwide industry P M 2.5 on ECSs. In detail, we first represent the nationwide sensor network as an undirected graph. Our NodeSort method is then used to partition the graph into several subgraphs. We deploy an ECS on each subgraph accordingly. The sensor data of each subgraph are allocated to the corresponding ECS for training instead of to a central cloud. This means that the data do not need to be transmitted to the central cloud, avoiding the security risks triggered by data transmission. Next, for each subgraph we develop an LLM-based module named the Spatio-Temporal Large Language Model (STLLM) to learn the spatiotemporal correlations and infer the output sequences. STLLM thus bridges the gap between LLMs and spatiotemporal feature modeling. The idea of our proposed method is depicted in Figure 1. The contributions of our work are summarized below:
  • To mitigate the security risks of centralized training due to data leakage during transmission, we present an edge-based distributed learning framework, STLLM-ECS, to securely forecast nationwide industry P M 2.5 on ECSs. In detail, we develop a novel method named NodeSort to partition the nationwide sensor network graph into several subgraphs. The data and training tasks of each subgraph are then uploaded to an individual ECS rather than to a central cloud. This avoids the security risks around data leakage when transmitting data from sensors to the central cloud. In addition, we design an edge training strategy between neighboring subgraphs to speed up training and achieve the “training-during-inference” pattern. Meanwhile, the strategy facilitates sharing of similar industry P M 2.5 changes among neighboring subgraphs, thereby improving prediction accuracy.
  • An LLM-based model called STLLM is presented. A spatiotemporal module (STM) is developed to capture spatiotemporal correlations, while GPT-2 [7] is adopted to produce output sequences. This is a novel hybrid framework that introduces a spatiotemporal feature extraction module into the LLM for industry P M 2.5 prediction. It effectively provides the LLM with the ability to model spatiotemporal features. In addition, considering the weak computing power of ECS, a pruning strategy is developed to further lighten model deployment on the ECS.
  • We conduct extensive experiments on a nationwide industry P M 2.5 dataset comprising data from over 1000 sensors collected from across China’s industrial regions. Our results indicate that the proposed STLLM-ECS is superior to all compared baselines.

2. Related Work

2.1. Edge Computing

Presently, various concepts relevant to edge computing have been defined. One definition declares it to be a methodology that conducts computing at the edge of a network, i.e., ECSs are deployed to process computational tasks near the data source [12]. In [13], the researchers classified edge computing resources into three main categories. The first is edge servers, which mainly contain Cloudlets, local clouds, etc.; in contrast to traditional cloud computing, edge servers have weaker computing power. The second is devices that coordinate among terminal devices; compared to edge servers, these have lower computing power but are more portable. The third is communication technologies (e.g., opportunistic computing) in the device cloud that accomplish resource migration and exploration among terminal devices. In addition, a vehicle cloud can be treated as a form of edge computing. In this scenario, the resources of the vehicle cloud are temporarily requisitioned. The edge, core cloud, and mobile users constitute the edge architecture. The computational tasks of mobile users can be offloaded from the central cloud and assigned to the edge for processing [14,15,16]. To date, edge applications have appeared in a variety of air pollution analysis tasks, including air pollution monitoring [17] and air pollution prediction in Beijing [18]. However, few works have focused on forecasting nationwide industry P M 2.5 in China using edge computing.

2.2. LLMs for Time Series Analysis

LLMs exhibit powerful capabilities in understanding the complex dependencies of heterogeneous textual data and offering plausible generation [19]. Representative LLMs include GPT [20], GPT-2 [7], and GPT-4 [21]. Their presence has revolutionized various fields, especially time series analysis. To date, several researchers have employed LLMs in time series analysis. For instance, Yu et al. [22] designed an explainable financial forecasting approach based on Open-LLaMA and GPT-4. In [23], LLM4TS was proposed for time series forecasting. Specifically, Chang et al. designed a two-stage fine-tuning strategy, with Stage 1 consisting of supervised pretraining and Stage 2 of fine-tuning according to specific tasks. Zhou et al. [24] conducted fine-tuning using a Frozen Pretrained Transformer (FPT) without adjusting its feedforward or self-attention layers. After fine-tuning, the FPT was deployed on different time series analysis tasks. Nevertheless, existing LLMs encounter two issues in industry P M 2.5 prediction. First, most LLM-based models use centralized training, requiring data to be transmitted from sensors to the central cloud. The large amount of data transmission increases the security risks around data leakage. Second, LLMs cannot extract the spatiotemporal correlations in the sensor network.

2.3. Air Pollution Forecasting

Air pollution forecasting methods are classified into two main categories, namely, physics-based and data-driven models.
Physics-based models: Such models treat the emission and diffusion of pollutants as a dynamic process which can be simulated by numerical functions. In order to achieve this, researchers must trace back the air pollution to its main causes, e.g., factories and vehicles [25,26]. However, it can be challenging to accurately collect these data sources.
Data-driven models: This type of model has become the most popular approach for air pollution prediction. This line of study adopts parameterized methods, e.g., deep neural networks, to mine the spatiotemporal correlations within air pollution data. In contrast to physics-based models, data-driven models are more flexible and demand less sophisticated domain knowledge. For example, Zheng et al. [27] designed a hybrid data-driven model which integrates predicted outcomes from different perspectives. Yi et al. [28] proposed a novel model called DeepAir which uses deep neural networks. Their experimental results proved that DeepAir is significantly superior to other shallow baselines in both long- and short-term forecasting. Several follow-ups have studied whether Graph Convolutional Network (GCN) approaches or attention-based approaches are more effective for capturing spatiotemporal correlations [29,30]. Unfortunately, these methods encounter a number of issues in the context of nationwide industry P M 2.5 prediction, including performance degradation and inefficiency.

3. Preliminaries

Currently, a large number of sensors are deployed in industrial regions throughout the country to monitor and collect time series P M 2.5 data. This allows critical spatial information (e.g., connectivity and distances) among sensors to be calculated. Given this spatial and temporal information, our proposed model has the capacity to forecast future industry P M 2.5 . Three definitions are introduced below to facilitate the explanation of this process.
Definition I: Nationwide Sensor Network. The nationwide sensor network is treated as an undirected graph G = ( V , E ) , where V is the set of nodes that denote the sensors in nationwide industry regions. Assuming that N is the total number of nodes, we have | V | = N and V = ( V_1 , V_2 , … , V_N ), with E as the set of edges, revealing whether specific nodes are connected.
Definition II: Subgraphs. In order to employ edge computing, it is necessary to partition G into a number of subgraphs depending on the number of deployed ECSs. Each ECS is deployed in the region where a subgraph is located and is responsible for processing the data of the nodes in the subgraph. Given E ECSs, the subgraphs are identified as SG = ( SG_1 , SG_2 , … , SG_E ), where SG_j = ( V_{j,1} , V_{j,2} , … , V_{j,n_j} ), j ∈ [ 1 , E ], and n_j stands for the total number of nodes in subgraph j.
Definition III: Subgraph Representations. For brevity, X_{t_k}^j = ( x_{t_k,1} , x_{t_k,2} , … , x_{t_k,n_j} ) ∈ R^{n_j × F} is used to indicate the representations of subgraph j at timestep t_k, where F denotes the number of node features (in our case, industry P M 2.5 ; thus, F = 1 ). Hence, x_{t_k,i} is the industry P M 2.5 value of node i at timestep t_k.
Problem Formulation. For a subgraph j, given its subgraph representations over the past P timesteps χ^j = ( X_1^j , X_2^j , … , X_P^j ), we propose learning a mapping function f_j ( · ) which infers the industry P M 2.5 values of the n_j nodes over the next Q timesteps. The problem can be formulated as follows:
\left( \chi^j , SG_j \right) \xrightarrow{\; f_j(\cdot) \;} Y^j ,
where Y^j = ( \hat{Y}_{t_{P+1}}^j , \hat{Y}_{t_{P+2}}^j , … , \hat{Y}_{t_{P+Q}}^j ) denotes the output sequences and SG_j is the topology of subgraph j.
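To make the shapes in this formulation concrete, the following minimal sketch builds one training sample for a hypothetical subgraph (the sliding-window construction, node count, and variable names are our assumptions for illustration):

```python
import numpy as np

# Hypothetical subgraph j with n_j = 50 sensors, hourly PM2.5 readings (F = 1).
P, Q, n_j, F = 96, 36, 50, 1               # history length, horizon, nodes, features
series = np.random.rand(10_000, n_j, F)    # stand-in for the recorded PM2.5 history

# One training sample: the past P steps are the input, the next Q steps the target.
t = 500
chi_j = series[t - P:t]                    # shape (P, n_j, F) -> input of f_j
y_j = series[t:t + Q]                      # shape (Q, n_j, F) -> prediction target Y^j
print(chi_j.shape, y_j.shape)
```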

4. STLLM-ECS Design

In this section, we describe the architecture of STLLM-ECS. First, a systematic review is presented, after which we detail the three main components of STLLM-ECS.

4.1. System Overview

To solve the above-mentioned challenges, STLLM-ECS is presented. An overview of STLLM-ECS is depicted in Figure 2. First, graph partitioning is performed. In detail, we denote the entire nationwide sensor network as a graph. To allocate data and computing tasks to the ECSs, the graph must be partitioned into subgraphs. Unfortunately, the procedure of graph partitioning leads to information loss, as crucial edges in the graph are severed. In view of this, we propose the novel NodeSort method. It can evaluate node significance and retain the edges of significant nodes, effectively alleviating information loss. NodeSort follows the basic theory of PageRank [31], the classic algorithm for ranking web pages. Furthermore, we introduce Betweenness Centrality [32] to incorporate valuable information into PageRank. In contrast to traditional PageRank, NodeSort can better adapt to the characteristics of the nationwide sensor network and measure node importance.
After graph partitioning, we allocate each subgraph to the corresponding ECS. For each subgraph, an intelligent method based on STM and Generative Large Language Model (GLLM) named STLLM is developed to capture spatiotemporal correlations and infer prediction sequences. In particular, spatial and temporal attention dynamically adjust model attention in both spatial and temporal dimensions, enabling us to identify complicated relationships in two dimensions. Thus, we introduce them into the STM to capture spatiotemporal correlations. Moreover, the LLM facilitates the generation of prediction sequences due to its extensive intrinsic knowledge acquired from pretraining. We introduce the LLM into the GLLM to generate future industry P M 2.5 predictions.
In addition, an edge training strategy is designed to reduce training time. In particular, considering the weak computational power of ECSs, pruning operations are adopted to make STLLM more lightweight and decrease the training workload.

4.2. Graph Partitioning Design

As mentioned above, the entire graph is partitioned into several subgraphs, which are then allocated to corresponding ECSs. To mitigate information loss during graph partitioning, we develop a novel method named NodeSort. The design details are as follows.
Similarly to other networks, e.g., road networks, node significance is a key factor in nationwide sensor networks. By measuring node significance, we can construct subgraphs centered on the most important nodes. As a result, the edges of the center nodes are preserved. These edges contain more information, which can help to reduce information loss. This type of method for effectively measuring node importance is indispensable. Therefore, we introduce PageRank with Betweenness Centrality to form NodeSort for implementation.
(1) PageRank: PageRank was initially applied to rank web pages based on their importance. Because PageRank can be defined on any directed graph, it has since been adopted in other domains, e.g., text summarization. Based on a random walk, given that the degree of node s is D, the likelihood of industrial P M 2.5 diffusing from node s to other nodes is
LK_{s,t} = \begin{cases} 1/D, & s \text{ links to } t \\ 0, & \text{otherwise}, \end{cases}
where s , t ∈ { 1 , 2 , … , N } with s ≠ t, LK_{s,t} stands for the transition likelihood between nodes s and t, and LK = [ LK_{s,t} ]_{N × N} ∈ R^{N × N} denotes the transition matrix. In addition, two characteristics are present in LK, i.e., LK_{s,t} ≥ 0 and ∑_{s=1}^{N} LK_{s,t} = 1.
Now, coming to the value of PageRank, we let PR_{Value} = [ PR(Value_1) , PR(Value_2) , … , PR(Value_N) ], where PR(Value_n) represents the PageRank value of node n. Given a complete random walk model, each element in its transition matrix \overline{LK} is 1/N. The PageRank method can be presented as follows:
PR_{Value} = \beta \cdot LK \cdot PR_{Value} + (1-\beta) \cdot \overline{LK} \cdot PR_{Value} = \beta \cdot LK \cdot PR_{Value} + \frac{1-\beta}{N},
where β ∈ [ 0 , 1 ] stands for the damping factor, interpreted as the resistance encountered when moving from one node to others. Owing to the stationary distribution property of Markov chains, this recursion can be iterated to solve for the PR_{Value} of all nodes.
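As a reference point before introducing NodeSort, the recursion above can be solved by power iteration. The following sketch assumes a toy symmetric adjacency matrix and uniform teleportation (our illustration, not the paper's code):

```python
import numpy as np

def pagerank(adj, beta=0.85, iters=100, tol=1e-8):
    """Standard PageRank by power iteration on a column-stochastic transition matrix."""
    N = adj.shape[0]
    deg = adj.sum(axis=0)                       # degree of each source node
    deg[deg == 0] = 1                           # avoid division by zero for isolated nodes
    LK = adj / deg                              # LK[t, s] = 1/D if s links to t, else 0
    pr = np.full(N, 1.0 / N)
    for _ in range(iters):
        new = beta * LK @ pr + (1.0 - beta) / N
        if np.abs(new - pr).sum() < tol:
            break
        pr = new
    return pr

# Toy 4-node sensor graph (symmetric adjacency).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
print(pagerank(A))
```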
(2) Betweenness Centrality: Brandes [32] defined the Betweenness Centrality of a node as the sum, over all node pairs, of the fraction of shortest paths passing through that node, formulated as follows:
B_c(n) = \sum_{s \neq n \neq t \in V} \frac{\rho_{s,t}(n)}{\rho_{s,t}},
where ρ_{s,t}(n) indicates the number of shortest paths from node s to node t that pass through node n, ρ_{s,t} is the total number of shortest paths from s to t, and V is the set of nodes in the nationwide sensor network. According to this definition, we can infer that nodes with large B_c have a tendency to become industrial P M 2.5 pollution centers, as they lie on the shortest paths of many routes.
(3) NodeSort: By leveraging PageRank, it is possible to acquire the importance of nodes in the nationwide sensor network. However, PageRank fails when exploited directly on the nationwide sensor network due to two factors. First, the distance among nodes is a crucial feature that helps to determine industrial P M 2.5 diffusion and propagation; although we have defined the transition matrix, the distance feature is neglected. Second, PageRank allocates the same weight to each node without considering discrepancies in node importance. In light of these factors, our novel NodeSort method is designed to adapt to the characteristics of the nationwide sensor network and assess node importance more realistically. In particular, following the intuition of Betweenness Centrality, nodes potentially serving as industrial P M 2.5 pollution centers should be treated as more vital. Thus, Betweenness Centrality is first used to weight node importance in the transition likelihoods, and LK_{s,t} is reconstructed as
LK_{s,t} = \begin{cases} \frac{BC_t}{\sum_t BC}, & s \text{ links to } t \\ 0, & \text{otherwise}, \end{cases}
where BC_t denotes the Betweenness Centrality of node t and ∑_t BC indicates the sum of the Betweenness Centrality values of the nodes connected with t.
In PageRank, the damping factor is usually treated as a constant value of 0.85. In NodeSort, the distance is employed to compute this factor, which is possible because distance strongly resists movement between nodes. Let β be a diagonal matrix β = diag( β_1 , β_2 , … , β_N ), in which β_t is defined as
\beta_t = \gamma \cdot \frac{1}{\sum_s \frac{1}{d_{s,t}}},
with d_{s,t} representing the distance between nodes s and t and γ the scaling factor. The Markov chain is reconstructed as follows:
PR_{Value} = \mathrm{diag}(\beta_1, \ldots, \beta_N) \cdot LK \cdot PR_{Value} + \frac{1}{N} \left( 1-\beta_1, \ldots, 1-\beta_N \right)^{\top}.
Leveraging the value vector PR_{Value} of NodeSort, the most important nodes in the nationwide sensor network are acquired. After that, the subgraphs are constructed around these nodes.
In summary, NodeSort is adopted to calculate the importance of nodes in the large-scale network. The top-E most important nodes are leveraged as central nodes to construct the subgraphs. Hence, the edges of important nodes are retained, which assists in more precise predictions.
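A compact sketch of how NodeSort could be implemented is given below. The betweenness computation via networkx, the handling of zero distances, and the normalization of the betweenness-weighted transition matrix over the neighbors of the source node are our assumptions; the distance-based damping follows our reading of the equation for β_t above:

```python
import numpy as np
import networkx as nx

def nodesort(adj, dist, gamma=0.9, iters=100):
    """NodeSort: betweenness-weighted transitions plus distance-derived damping."""
    N = adj.shape[0]
    G = nx.from_numpy_array(adj)
    bc = np.array([nx.betweenness_centrality(G)[i] for i in range(N)]) + 1e-9

    # Transition likelihood: proportional to BC_t for every node t linked to s (column s).
    LK = adj * bc[:, None]
    LK = LK / LK.sum(axis=0, keepdims=True).clip(min=1e-12)

    # Distance-based damping per node: beta_t = gamma / sum_s 1/d_{s,t} (assumed form).
    inv_d = np.zeros_like(dist, dtype=float)
    mask = dist > 0
    inv_d[mask] = 1.0 / dist[mask]
    beta = np.clip(gamma / inv_d.sum(axis=0).clip(min=1e-12), 0.0, 0.99)

    pr = np.full(N, 1.0 / N)
    for _ in range(iters):
        pr = beta * (LK @ pr) + (1.0 - beta) / N   # diag(beta) * LK * PR + (1 - beta)/N
    return pr

def top_e_centers(pr_value, E=10):
    """The E highest-ranked nodes serve as subgraph centers."""
    return np.argsort(pr_value)[::-1][:E]
```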

4.3. STLLM Design

As shown in Figure 3, STLLM is composed of three parts: (1) Input Embedding Layer; (2) STM; and (3) GLLM. Specifically, STLLM is a pipeline structure. In the following, we illustrate how to apply each part to capture spatiotemporal correlations and predict industrial P M 2.5 , taking the processing of subgraph j as an example.
(1) Input Embedding Layer: For subgraph j, because deep neural networks cannot directly process the raw industry P M 2.5 data, we need to transform the data dimensions. Specifically, the industry P M 2.5 data X^j of the P historical timesteps for the n_j nodes are transformed into higher-dimensional features. In detail, two stacked FC layers are used to transform the feature dimension from F to d_model, represented as H^j ∈ R^{P × n_j × d_model}.
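A minimal PyTorch sketch of such an input embedding follows (two stacked fully connected layers; the layer names and the ReLU between them are our assumptions):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Lift raw PM2.5 readings (feature dim F) to d_model-dimensional features."""
    def __init__(self, F_in=1, d_model=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(F_in, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))

    def forward(self, x):              # x: (P, n_j, F)
        return self.fc(x)              # -> (P, n_j, d_model)

emb = InputEmbedding()
h = emb(torch.rand(96, 50, 1))         # 96 history steps, 50 nodes, 1 feature
print(h.shape)                         # torch.Size([96, 50, 64])
```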
(2) STM: After obtaining the input embedding features, the STM is deployed to capture spatiotemporal correlations among the nodes of the subgraph. It is composed of L stacked Spatio-Temporal blocks (ST blocks). Each ST block comprises spatial attention, temporal attention, and the Spatio-Temporal Fusion (ST Fusion) mechanism. Spatial attention is proposed to capture spatial dependencies, while temporal attention is designed to extract temporal dependencies. Depending on the impact of the temporal and spatial dependencies on the prediction, ST Fusion adaptively fuses them without human intervention. Figure 3 illustrates the structure of the STM. Specifically, let H^j ∈ R^{P × n_j × d_model} denote the input of the L ST blocks. The input of the l-th ST block is the output of the (l−1)-th ST block, denoted as H_{ST}^{l−1} ∈ R^{P × n_j × d_model}. The spatial representations generated by spatial attention are H_S^l ∈ R^{P × n_j × d_model}, in which hs_{t_k}^l ∈ R^{n_j × d_model} is the spatial representation at timestep t_k. The temporal representations generated by temporal attention are H_T^l ∈ R^{P × n_j × d_model}, in which ht_n^l ∈ R^{P × d_model} is the temporal representation of node n. The spatiotemporal representations produced by ST Fusion are indicated as H_{ST}^l ∈ R^{P × n_j × d_model}. In addition, residual connections are used to enable a larger receptive field and boost training speed; hence, the output of the l-th ST block is H_{ST}^l ← H_{ST}^l + H_{ST}^{l−1}. The details of spatial attention, temporal attention, and ST Fusion in the l-th ST block are described below.
Spatial Attention: To extract spatial dependencies, spatial attention based on a one-layer graph attention network is designed. Through an attention mechanism, such a network can dynamically assign weights to sensors in the nationwide sensor network based on the relevance between sensors.
In the l-th ST block at timestep t_k, let hst_{t_k}^{l−1} ∈ R^{n_j × d_model} be the input of the one-layer graph attention network. The operation of a single h-th head is expressed as
hs_{t_k}^{l,h} = \mathrm{Softmax}\left( \alpha \, Q_h^S \left( K_h^S \right)^{\top} \right) V_h^S,
where hs_{t_k}^{l,h} ∈ R^{n_j × (d_model / N_h)} refers to the spatial representations of the h-th head. The query Q_h^S ∈ R^{n_j × (d_model / N_h)}, key K_h^S ∈ R^{n_j × (d_model / N_h)}, and value V_h^S ∈ R^{n_j × (d_model / N_h)} are generated by the linear mappings hst_{t_k}^{l−1} W_Q^S, hst_{t_k}^{l−1} W_K^S, and hst_{t_k}^{l−1} W_V^S, respectively, while W_Q^S, W_K^S, and W_V^S ∈ R^{d_model × (d_model / N_h)} are the weight parameters for linear mapping, N_h is the number of heads, and α is a scaling factor. Subsequently, the spatial representations hs_{t_k}^l ∈ R^{n_j × d_model} at timestep t_k are obtained through the following concatenation operation:
hs_{t_k}^{l} = \left[ hs_{t_k}^{l,1}, hs_{t_k}^{l,2}, \ldots, hs_{t_k}^{l,N_h} \right] W_o^S,
where W_o^S ∈ R^{d_model × d_model} is the trainable mapping matrix.
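A minimal PyTorch sketch of this per-timestep spatial attention is given below (the scaling α = 1/√(d_model/N_h) and the absence of an adjacency mask are our assumptions; the temporal attention described next is the analogous operation applied along the P timesteps of each node):

```python
import math
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Multi-head attention across nodes at a single timestep (input: n_j x d_model)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)   # W_Q^S
        self.k = nn.Linear(d_model, d_model, bias=False)   # W_K^S
        self.v = nn.Linear(d_model, d_model, bias=False)   # W_V^S
        self.o = nn.Linear(d_model, d_model, bias=False)   # W_o^S

    def forward(self, h_in):                               # (n_j, d_model)
        n_j, _ = h_in.shape
        def split(x):                                      # -> (n_heads, n_j, d_head)
            return x.view(n_j, self.n_heads, self.d_head).transpose(0, 1)
        Q, K, V = split(self.q(h_in)), split(self.k(h_in)), split(self.v(h_in))
        alpha = 1.0 / math.sqrt(self.d_head)               # assumed scaling factor
        attn = torch.softmax(alpha * Q @ K.transpose(-1, -2), dim=-1)
        out = (attn @ V).transpose(0, 1).reshape(n_j, -1)  # concatenate the heads
        return self.o(out)                                 # (n_j, d_model)

sa = SpatialAttention()
print(sa(torch.rand(50, 64)).shape)                        # torch.Size([50, 64])
```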
Temporal Attention: To capture temporal dependencies, we develop the temporal attention with one-layer Multi-Head Self-Attention (MHSA). The main reason for this is that the attention can dynamically allocate weights to different timesteps according to their significance.
Assuming that the input of the l-th ST block for node n is hst_n^{l−1} ∈ R^{P × d_model}, the operation of a single h-th head is expressed as follows:
ht_{n}^{l,h} = \mathrm{Softmax}\left( \alpha \, Q_h^T \left( K_h^T \right)^{\top} \right) V_h^T,
where ht_n^{l,h} ∈ R^{P × (d_model / N_h)} refers to the temporal representations generated by the one-layer MHSA operation in the h-th head, Q_h^T = hst_n^{l−1} W_Q^T, K_h^T = hst_n^{l−1} W_K^T, and V_h^T = hst_n^{l−1} W_V^T are the query, key, and value, respectively, and W_Q^T, W_K^T, and W_V^T ∈ R^{d_model × (d_model / N_h)} are the learnable parameters for linear mapping. The output results of each head are concatenated and further mapped to obtain the temporal representations ht_n^l ∈ R^{P × d_model} of node n in the l-th ST block, denoted as
ht_{n}^{l} = \left[ ht_{n}^{l,1}, ht_{n}^{l,2}, \ldots, ht_{n}^{l,N_h} \right] W_o^T,
where W_o^T ∈ R^{d_model × d_model} is the trainable mapping matrix.
ST Fusion: The industry P M 2.5 value of a node at a specific timestep is associated with its previous timesteps and with other nodes. To adaptively fuse the temporal and spatial dependencies, we design ST Fusion as described in Figure 3. In the l-th ST block, the outputs of spatial attention and temporal attention are denoted as H_S^l ∈ R^{P × n_j × d_model} and H_T^l ∈ R^{P × n_j × d_model}, respectively. Both first pass through an FC layer and layer normalization and are then fused together:
z = \sigma\left( \left( H_S^l \odot H_T^l \right) W_z^{ST} + H_T^l W_z^T + b_z \right),
H_{ST}^l = H_S^l \odot z + H_T^l \odot (1 - z),
where H_{ST}^l ∈ R^{P × n_j × d_model} denotes the spatiotemporal representations generated by the l-th ST block, z is the gate, σ represents the sigmoid function, ⊙ is the element-wise product, and W_z^{ST} ∈ R^{d_model × d_model}, W_z^T ∈ R^{d_model × d_model}, and b_z ∈ R^{d_model} are trainable parameters.
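A sketch of this gated fusion in PyTorch (the exact input to the gate follows our reading of the flattened equation above and should be treated as an assumption):

```python
import torch
import torch.nn as nn

class STFusion(nn.Module):
    """Gate-controlled fusion of spatial (H_S) and temporal (H_T) representations."""
    def __init__(self, d_model=64):
        super().__init__()
        self.w_st = nn.Linear(d_model, d_model, bias=False)  # W_z^{ST}
        self.w_t = nn.Linear(d_model, d_model, bias=True)    # W_z^{T} (+ b_z)

    def forward(self, h_s, h_t):                             # both (P, n_j, d_model)
        z = torch.sigmoid(self.w_st(h_s * h_t) + self.w_t(h_t))   # gate
        return z * h_s + (1.0 - z) * h_t                     # H_ST^l

fuse = STFusion()
out = fuse(torch.rand(96, 50, 64), torch.rand(96, 50, 64))
print(out.shape)                                             # torch.Size([96, 50, 64])
```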
(3) GLLM: After modeling the spatiotemporal correlations, we need to generate the output sequences depending on these features. Because the LLM has acquired rich intrinsic knowledge through pretraining and has been applied to downstream tasks, we design a novel LLM-based method named GLLM to infer industry P M 2.5 in the future, leveraging GPT-2 [7] as the backbone model. Concretely, GLLM is composed of channel-independence and patching, token and positional encoding, GPT-2, and the output layer. Next, we present these individual structures.
Channel-Independence and Patching: To adapt the spatiotemporal representations for GPT-2, the channel-independence and patching operations of the PatchTST method [33] are employed to tokenize these features. Specifically, we first use FC layers to restore the spatiotemporal features H_{ST}^L ∈ R^{P × n_j × d_model} generated by the L-th ST block to the shape P × n_j × F. Channel-independence then treats the multi-node spatiotemporal representations ( P × n_j × F ) as n_j independent single-node series ( P × 1 × F ), which the model processes independently. Channel-blending models directly leverage cross-channel data, whereas channel-independence indirectly extracts cross-channel interactions through weight sharing, thereby providing more precise predictions; the underlying reason is that channel-blending models often encounter data limitations and overfitting. After applying channel-independence, the subsequent patching process groups adjacent timesteps into a single patch-based token. This approach expands the input’s historical span without increasing the token length, providing more valuable information for GPT-2.
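The following sketch illustrates channel-independence and patching (the patch length and stride of 12 come from the experimental settings in Section 5; the unfold-based implementation is an assumption):

```python
import torch

def channel_independent_patches(h, patch_len=12, stride=12):
    """h: (P, n_j, F) spatiotemporal features restored from the last ST block.
    Returns (n_j, num_patches, patch_len): each node is treated as an independent
    univariate series and its timeline is cut into patch tokens."""
    P, n_j, F = h.shape
    per_node = h.permute(1, 0, 2)                         # (n_j, P, F): channel independence
    per_node = per_node.reshape(n_j, P * F)               # flatten features per step (F = 1)
    patches = per_node.unfold(dimension=1, size=patch_len, step=stride)
    return patches                                        # (n_j, num_patches, patch_len)

tokens = channel_independent_patches(torch.rand(96, 50, 1))
print(tokens.shape)                                       # torch.Size([50, 8, 12])
```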
Token Encoding and Positional Encoding: After obtaining a sequence of tokens through patching, token encoding is adopted to transform these tokens to ensure compatibility with the GPT-2 backbone model. In traditional NLP practice, token encoding is usually accomplished by exploiting a learnable lookup table to project tokens into a high-dimensional space. However, as we are patching for spatiotemporal features that denote vectors rather than scalars, we use a one-dimensional convolutional layer instead.
For positional encoding, we employ the structure in the transformer [34] to map the patch locations. During the training phase, token encoding and positional encoding need to be fine-tuned.
GPT-2: Furthermore, we use a pretrained GPT-2 with six layers as the backbone architecture of GLLM. To preserve the foundational knowledge acquired by pretraining, most parameters are frozen during the training phase, including those of the MHSA and FC layers. In addition to lowering the data requirements, retaining most of the pretrained parameters as non-trainable tends to result in better predictive performance than training LLMs from scratch.
To enhance downstream tasks at minimal cost, we fine-tune the layer normalization, which is viewed as a common practice.
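A sketch of how the frozen six-layer GPT-2 backbone with trainable layer normalization could be set up using the Hugging Face transformers library is shown below. The 1D-convolution token encoder, the learnable positional encoding, and the name-based rule for selecting layer-norm parameters are our assumptions; a real deployment would load the pretrained GPT-2 weights rather than a randomly initialized config:

```python
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Config

embed_dim, patch_len = 768, 12

# Six-layer GPT-2 backbone; in practice the pretrained weights would be loaded,
# e.g., from GPT2Model.from_pretrained("gpt2") truncated to its first six blocks.
gpt2 = GPT2Model(GPT2Config(n_layer=6, n_embd=embed_dim))

# Freeze everything except the layer-normalization parameters.
for name, p in gpt2.named_parameters():
    p.requires_grad = ("ln" in name)          # ln_1 / ln_2 / ln_f stay trainable

# Token encoding: 1D convolution mapping each patch (a vector) to one GPT-2 token.
token_enc = nn.Conv1d(in_channels=patch_len, out_channels=embed_dim, kernel_size=1)
pos_enc = nn.Parameter(torch.zeros(1, 64, embed_dim))      # learnable positional encoding

patches = torch.rand(50, 8, patch_len)                     # (nodes, num_patches, patch_len)
tok = token_enc(patches.transpose(1, 2)).transpose(1, 2)   # (nodes, num_patches, embed_dim)
out = gpt2(inputs_embeds=tok + pos_enc[:, :tok.size(1)]).last_hidden_state
print(out.shape)                                           # torch.Size([50, 8, 768])
```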
Output layer: After GPT-2, the output layer is adopted to produce the future industry P M 2.5 values. Because the output of GPT-2 retains the form of patches, essentially a series of tokens, we utilize flattening, FC, and rearrangement operations, all of which must be fine-tuned during the training phase. In particular, once the output tokens for a specific node n are produced, flattening is first used to straighten the tokens; FC layers are then employed to modify the dimensions; finally, rearrangement is utilized to generate the unpatched time series for the next Q timesteps as the output of node n, denoted as R^{Q × 1 × F}. We iterate separately over the spatiotemporal representations of the n_j channel-independent nodes to obtain the industry P M 2.5 of the next Q timesteps, denoted as Y^j ∈ R^{Q × n_j × F}.
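A sketch of the output layer for a single subgraph (the dimensions follow the settings used elsewhere in the paper; the exact layer composition is assumed):

```python
import torch
import torch.nn as nn

n_j, num_patches, embed_dim, Q, F = 50, 8, 768, 36, 1

flatten = nn.Flatten(start_dim=1)                    # (n_j, num_patches * embed_dim)
project = nn.Linear(num_patches * embed_dim, Q * F)  # FC mapping tokens to the horizon

gpt_out = torch.rand(n_j, num_patches, embed_dim)    # GPT-2 output, one token row per node
y_j = project(flatten(gpt_out)).reshape(n_j, Q, F)   # rearrange into an unpatched series
print(y_j.permute(1, 0, 2).shape)                    # (Q, n_j, F), i.e., Y^j
```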
Eventually, we optimize the parameters of STLLM-ECS for subgraph j by minimizing the Mean Squared Error (MSE) loss function, represented as
\mathrm{MSE} = \frac{ \sum_{t_s=1}^{Q} \sum_{n=1}^{n_j} \left( y_{t_s,n} - \hat{y}_{t_s,n} \right)^2 }{ Q \times n_j } + \frac{\lambda}{2} \left\| W \right\|^2,
where n_j is the total number of nodes in subgraph j and Q denotes the length of the predicted sequence. The observed and predicted values at timestep t_s for node n are y_{t_s,n} and \hat{y}_{t_s,n}, respectively, while λ denotes the regularization coefficient and W represents the learnable parameters.
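The loss can be written directly in PyTorch as below (the regularization weight is a placeholder; in practice the λ/2·‖W‖² term is often applied through the optimizer's weight decay instead):

```python
import torch

def stllm_loss(y_pred, y_true, params, lam=1e-4):
    """MSE over the Q x n_j predictions plus an L2 penalty on trainable weights."""
    mse = torch.mean((y_pred - y_true) ** 2)
    l2 = sum(p.pow(2).sum() for p in params if p.requires_grad)
    return mse + 0.5 * lam * l2

loss = stllm_loss(torch.rand(36, 50, 1), torch.rand(36, 50, 1),
                  [torch.rand(8, 8, requires_grad=True)])
print(loss.item())
```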

4.4. Edge Training Strategy

Due to the similarity in terms of land use, we generally believe that there is an association between the industry P M 2.5 concentrations of adjacent subgraphs. For instance, on weekdays in industrial areas, industry P M 2.5 not only affects the local area but also spreads to the surrounding areas; in other words, the surrounding areas exhibit similar industry P M 2.5 features due to diffusion. As a result, transfer learning is introduced between neighboring subgraphs by sharing trainable network parameters across ECSs, which aims to shorten the training time and improve prediction precision. In particular, assume that the network parameters used to model subgraph j are transferred to its neighboring subgraph k. We first train the j-th STLLM using the j-th subgraph representations. After training, this STLLM produces output sequences on its ECS. Meanwhile, the parameters of the STM and the unfrozen parameters of the GLLM are transferred to the k-th ECS adjacent to the j-th ECS for initialization. The k-th subgraph representations are then used to train from this initialization. After multiple rounds of iteration, we obtain the optimized k-th STLLM. In this way, while the j-th STLLM is performing inference, the k-th STLLM starts training; this “training-during-inference” mode further decreases the training time. Because the frozen GLLM parameters do not participate in training, we uniformly deploy these frozen model structures to each ECS before training. In addition, considering the limited computing power of the ECSs, pruning operations are performed before uploading, i.e., cutting heads of the MHSA and STBMHSA to implement dimensionality reduction and lightweight deployment.
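A sketch of the parameter-transfer step between neighboring ECSs (our interpretation; the `stm`/`gllm` attribute names are hypothetical placeholders for the two STLLM components):

```python
import torch

def transfer_parameters(stllm_j, stllm_k):
    """Initialize subgraph k's STLLM from the trained STLLM of subgraph j.

    Only the STM weights and the unfrozen (trainable) GLLM weights are copied;
    the frozen GPT-2 weights are assumed to be deployed identically on every ECS
    beforehand, so they never travel between servers."""
    stllm_k.stm.load_state_dict(stllm_j.stm.state_dict())
    src = dict(stllm_j.gllm.named_parameters())
    with torch.no_grad():
        for name, p in stllm_k.gllm.named_parameters():
            if p.requires_grad:                  # layer norms, encodings, output layer
                p.copy_(src[name])
```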

5. Experiments

5.1. Experimental Settings

5.1.1. Dataset

STLLM-ECS was assessed on nationwide P M 2.5 concentration data from throughout China’s industrial areas ranging from 1 January 2015 to 31 December 2018. We collected industrial P M 2.5 concentrations from 1065 sites covering industrial areas in 186 cities. The data collection frequency was one hour. In addition, the method used for data normalization was Z-score normalization. We split the dataset in chronological order, with the initial two years as the training set, the third year as the cross-validation set, and the fourth year as the test set.
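A minimal sketch of the preprocessing described above (the exact array layout is hypothetical; fitting the Z-score statistics on the training years only is our assumption about the procedure):

```python
import numpy as np

hours_per_year = 365 * 24
data = np.random.rand(4 * hours_per_year, 1065)   # stand-in for hourly PM2.5, 2015-2018

# Chronological split: years 1-2 for training, year 3 for validation, year 4 for testing.
train = data[:2 * hours_per_year]
val = data[2 * hours_per_year:3 * hours_per_year]
test = data[3 * hours_per_year:]

# Z-score normalization using statistics of the training split.
mu, sigma = train.mean(), train.std()
train_n, val_n, test_n = (train - mu) / sigma, (val - mu) / sigma, (test - mu) / sigma
```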

5.1.2. Baselines

STLLM-ECS was compared with advanced baselines affiliated with the following four classes:
  • Classical statistics and shallow machine learning models: History Average (HA) [35] was adopted to predict industrial P M 2.5 using the average of historical observed values. Support Vector Regression (SVR) [36] performs regression with support vector machines.
  • Spatio-Temporal Graph Convolutional Networks (STGCNs)-based models: Selected STGCNs (e.g., Diffusion Convolutional Recurrent Neural Network (DCRNN) [29] and Spatio-Temporal Graph Convolutional Network (STGCN) [30]) were used as baselines. DCRNN and STGCN generalize well to nationwide industrial P M 2.5 prediction.
  • Attention-based models: Spatio-Temporal Graph Attention (ST-GRAT) [34], Graph Multi-Attention Network (GMAN) [37], ST-Transformer [38], and Airformer [39] are transformer variants used for spatiotemporal prediction that can easily accommodate industrial P M 2.5 prediction.
  • LLM-based models: Two LLM-based time series prediction models, LLM4TS [23] and FPT [24], were chosen for comparison.

5.1.3. Evaluation Metrics

The performance of STLLM-ECS and the baselines is tested through four metrics divided into two groups: (a) Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were used to evaluate the prediction accuracy; and (b) training time and GPU memory were utilized to assess model efficiency. In detail:
(1) MAE:
\mathrm{MAE} = \frac{1}{Q \times n_j} \sum_{t_s=1}^{Q} \sum_{n=1}^{n_j} \left| y_{t_s,n} - \hat{y}_{t_s,n} \right| .
(2) RMSE:
\mathrm{RMSE} = \sqrt{ \frac{1}{Q \times n_j} \sum_{t_s=1}^{Q} \sum_{n=1}^{n_j} \left( y_{t_s,n} - \hat{y}_{t_s,n} \right)^2 } .
(3) Training time: The training time is composed of the Total Time (TT) and Average Time (AT); TT represents the overall cost of the entire training phase until the model converges, while AT is the average training time of the subgraphs. The training efficiency of a model can be measured through the training time. For instance, if the accuracy of two models is comparable, a shorter training time implies more efficient training.
(4) GPU M: The GPU M is the memory usage of the GPU in the training phase, which can be used to evaluate the space overhead of the model. For example, a lower GPU M means that fewer model parameters need to be trained, showing that the model does not consume excessive GPU resources during training.

5.1.4. Parameter Settings

We reproduced HA through the statsmodels package in Python 3.8. The SVR was implemented using the sklearn package in Python. The remaining models were implemented using the PyTorch library. For STLLM-ECS, some parameters were set as follows.
In graph partitioning, the number of subgraphs was set to 10. Hence, we employed ten Nvidia 3090 Ti GPU cards (Nvidia, CA, USA) to simulate ten ECSs. In addition, we set the scaling factor γ to 0.9.
In each STLLM, the number of predicted timesteps was set to 36 ( Q = 36 ), i.e., predicting the nationwide industry P M 2.5 of the 1065 sites throughout China for the next 36 h. The referenced history was 96 h ( P = 96 ). We set the learning rate and batch size to 0.0002 and 32, respectively. Stochastic gradient descent was selected as the optimizer, with d_model set to 64. In addition, hyperparameters need to be set for two parts, i.e., the STM and the GLLM. For the STM, the initial number of heads in spatial and temporal attention N_h, the dimension of each head d, and the number of ST blocks L were set to 4, 16, and 1, respectively. For the GLLM, we selected a patch length of 12 and a stride of 12 in patching. The number of GPT-2 layers L_GPT was 6.
In pruning, the initial number of heads for MHSA and STBMHSA was 16. After every two subgraphs, we reduced the number of heads in MHSA and STBMHSA through a pruning operation.

5.2. Experimental Results

5.2.1. Performance Comparisons

STLLM-ECS was contrasted with the above baselines for nationwide industry P M 2.5 prediction in the next 36 h, as shown in Table 1. GPUM is the GPU memory usage, while ‘-’ indicates that the model does not run using the GPU.
(1) Prediction Accuracy Comparison: From Table 1, we can draw the following conclusions: (1) deep learning-based models outperform the classical statistical and shallow machine learning models (e.g., HA and SVR), as the latter lack spatiotemporal feature extraction capability; (2) attention-based models (e.g., ST-GRAT, GMAN, ST-Transformer, and Airformer) are superior to STGCN-based models (e.g., DCRNN and STGCN), as spatiotemporal attention further improves the capacity to extract global and dynamic spatiotemporal features; (3) although the LLM-based models (LLM4TS and FPT) were originally applied to other spatiotemporal analysis tasks such as traffic flow prediction, they generalize well to nationwide industry P M 2.5 prediction; and (4) compared with all baselines, STLLM-ECS demonstrates the best prediction accuracy, proving that the STM can effectively extract spatiotemporal correlations and that the GLLM can generate prediction sequences thanks to the extensive intrinsic knowledge in the pretrained GPT-2.
(2) Training Efficiency Comparison: According to the TT, AT, and GPUM values shown in Table 1, the following conclusions can be drawn. First, the TT of HA and SVR is lower than that of the deep learning-based models due to their simple structures. Second, among the deep learning-based models, the LLM-based models (FPT and LLM4TS) have the lowest TT and GPUM. This is because most of the parameters in these models are frozen, and only a small portion of the parameters need to be trained. Third, except for FPT and LLM4TS, the TT of STLLM-ECS is the shortest. Meanwhile, the AT of STLLM-ECS does not exceed 0.5 h. This reveals that STLLM-ECS maintains satisfactory accuracy with small time overheads. Finally, due to its small-scale structure, the GPU memory overhead of STLLM-ECS is relatively small. The main reason is that each processed subgraph is small, allowing STLLM-ECS to maintain relatively low GPU memory usage even though the nationwide sensor network is large.

5.2.2. Case Study

A case study was conducted to visualize the fitting results of STLLM-ECS. The JiNanHuaGongChang site in Shandong and the TongZhouXinCheng site in Beijing were chosen for evaluation. We plotted the fitting curves for 500 continuous hours using HA, GMAN, and STLLM-ECS, as shown in Figure 4, and observed the following. First, HA fails to learn the complex nonlinear relationships in industry P M 2.5 data. Second, compared with HA, GMAN can extract spatiotemporal correlations, improving its fitting ability; however, GMAN cannot recognize sudden changes in industry P M 2.5 at the JiNanHuaGongChang and TongZhouXinCheng sites. Third, our proposed STLLM-ECS achieves the best fit. One potential reason is that the pretrained GPT-2 contains rich intrinsic knowledge, which helps to identify various patterns of change in industry P M 2.5 .

5.2.3. Effect of Hyperparameters

Figure 5 illustrates the MAE and RMSE of STLLM-ECS on the 1065 sites under different hyperparameter settings when predicting the next 36 h. When one hyperparameter was adjusted, the other hyperparameters were kept at their default optimal values (e.g., N_h = 4 , d = 16 , L = 1 , and L_GPT = 6 ). As shown in Figure 5a,b,d, overly complex model structures tend to overfit the data, while overly simple structures tend to underfit. Figure 5c illustrates that the model with fewer ST blocks achieves the best prediction accuracy, showing that stacking too many ST blocks leads to error accumulation.

6. Conclusions

In this paper, we have proposed a novel framework entitled STLLM-ECS for securely predicting nationwide industry P M 2.5 in China. Specifically, the nationwide sensor network is first partitioned into several subgraphs, and each subgraph is assigned an ECS. We then deploy an STLLM on each ECS to extract spatiotemporal correlations and infer prediction sequences for the subgraphs. We conducted experiments on a nationwide sensor network covering China’s industrial areas. Our experimental results show that STLLM-ECS is superior to state-of-the-art baselines in prediction performance.

Author Contributions

Conceptualization, Z.H., M.C. and X.H.; Data curation, C.Y. and M.C.; Formal analysis, C.Y., Z.H., M.C. and X.H.; Funding acquisition, Y.M.; Investigation, C.Y. and Y.R.; Methodology, C.Y., Z.H., M.C., X.H. and Y.R.; Project administration, Y.M.; Resources, C.Y., Y.M. and X.H.; Software, C.Y. and M.C.; Supervision, Y.M.; Validation, Z.H., M.C. and Y.R.; Visualization, C.Y.; Writing—original draft, C.Y. and Z.H.; Writing—review and editing, Y.M., M.C., X.H. and Y.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Research on Distribution Room Condition Sensing Early Warning and Distribution Cable Operation and Inspection Smart Decision-Making Technology, No. 524609220092.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

Author Zhenyuan He was employed by the company Yuxin Electronic Technology Group Co., Ltd. Author Meng Chen was employed by the company Shenzhen Urban Transport Planning Center Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, Y.; Wang, K.; Lin, Y.; Xu, W. LightChain: A lightweight blockchain system for industrial internet of things. IEEE Trans. Ind. Inform. 2019, 15, 3571–3581. [Google Scholar] [CrossRef]
  2. Liu, Y.; Du, H.; Niyato, D.; Kang, J.; Xiong, Z.; Jamalipour, A.; Shen, X. ProSecutor: Protecting Mobile AIGC Services on Two-Layer Blockchain via Reputation and Contract Theoretic Approaches. IEEE Trans. Mob. Comput. 2024. [Google Scholar] [CrossRef]
  3. Dong, Y.; Hu, Z.; Wang, K.; Sun, Y.; Tang, J. Heterogeneous network representation learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; Volume 20, pp. 4861–4867. [Google Scholar]
  4. Zhang, T.; Xu, C.; Lian, Y.; Tian, H.; Kang, J.; Kuang, X.; Niyato, D. When Moving Target Defense Meets Attack Prediction in Digital Twins: A Convolutional and Hierarchical Reinforcement Learning Approach. IEEE J. Sel. Areas Commun. 2023, 41, 3293–3305. [Google Scholar] [CrossRef]
  5. Hu, K.; Rahman, A.; Bhrugubanda, H.; Sivaraman, V. HazeEst: Machine learning based metropolitan air pollution estimation from fixed and mobile sensors. IEEE Sens. J. 2017, 17, 3517–3525. [Google Scholar] [CrossRef]
  6. Han, Q.; Liu, P.; Zhang, H.; Cai, Z. A wireless sensor network for monitoring environmental quality in the manufacturing industry. IEEE Access 2019, 7, 78108–78119. [Google Scholar] [CrossRef]
  7. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  8. Otter, D.W.; Medina, J.R.; Kalita, J.K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 604–624. [Google Scholar] [CrossRef]
  9. Wang, J.; Du, H.; Tian, Z.; Niyato, D.; Kang, J.; Shen, X. Semantic-aware sensing information transmission for metaverse: A contest theoretic approach. IEEE Trans. Wirel. Commun. 2023, 22, 5214–5228. [Google Scholar] [CrossRef]
  10. Zhang, Q.; Wu, S.; Wang, X.; Sun, B.; Liu, H. A PM2.5 concentration prediction model based on multi-task deep learning for intensive air quality monitoring stations. J. Clean. Prod. 2020, 275, 122722. [Google Scholar] [CrossRef]
  11. Hu, Y.; Cao, N.; Guo, W.; Chen, M.; Rong, Y.; Lu, H. FedDeep: A Federated Deep Learning Network for Edge Assisted Multi-Urban PM2.5 Forecasting. Appl. Sci. 2024, 14, 1979. [Google Scholar] [CrossRef]
  12. Shi, W.; Dustdar, S. The promise of edge computing. Computer 2016, 49, 78–81. [Google Scholar] [CrossRef]
  13. Toczé, K.; Nadjm-Tehrani, S. A taxonomy for management and optimization of multiple resources in edge computing. Wirel. Commun. Mob. Comput. 2018, 2018, 7476201. [Google Scholar] [CrossRef]
  14. Zhang, T.; Xu, C.; Zou, P.; Tian, H.; Kuang, X.; Yang, S.; Zhong, L.; Niyato, D. How to mitigate DDoS intelligently in SD-IoV: A moving target defense approach. IEEE Trans. Ind. Inform. 2022, 19, 1097–1106. [Google Scholar] [CrossRef]
  15. Wang, J.; Du, H.; Niyato, D.; Kang, J.; Xiong, Z.; Rajan, D.; Mao, S.; Shen, X. A unified framework for guiding generative ai with wireless perception in resource constrained mobile edge networks. IEEE Trans. Mob. Comput. 2024. [Google Scholar] [CrossRef]
  16. Zhang, T.; Xu, C.; Shen, J.; Kuang, X.; Grieco, L.A. How to Disturb Network Reconnaissance: A Moving Target Defense Approach Based on Deep Reinforcement Learning. IEEE Trans. Inf. Forensics Secur. 2023, 18, 5735–5748. [Google Scholar] [CrossRef]
  17. Su, X.; Liu, X.; Motlagh, N.H.; Cao, J.; Su, P.; Pellikka, P.; Liu, Y.; Petäjä, T.; Kulmala, M.; Hui, P.; et al. Intelligent and scalable air quality monitoring with 5G edge. IEEE Internet Comput. 2021, 25, 35–44. [Google Scholar] [CrossRef]
  18. Wardana, I.N.K.; Gardner, J.W.; Fahmy, S.A. Collaborative Learning at the Edge for Air Pollution Prediction. IEEE Trans. Instrum. Meas. 2023, 73, 2503612. [Google Scholar] [CrossRef]
  19. Wang, J.; Du, H.; Niyato, D.; Xiong, Z.; Kang, J.; Mao, S.; Shen, X.S. Guiding AI-generated digital content with wireless perception. IEEE Wirel. Commun. 2024. [Google Scholar] [CrossRef]
  20. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 25 June 2024).
  21. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  22. Yu, X.; Chen, Z.; Ling, Y.; Dong, S.; Liu, Z.; Lu, Y. Temporal Data Meets LLM–Explainable Financial Time Series Forecasting. arXiv 2023, arXiv:2306.11025. [Google Scholar]
  23. Chang, C.; Peng, W.C.; Chen, T.F. Llm4ts: Two-stage fine-tuning for time-series forecasting with pre-trained llms. arXiv 2023, arXiv:2308.08469. [Google Scholar]
  24. Zhou, T.; Niu, P.; Sun, L.; Jin, R. One fits all: Power general time series analysis by pretrained lm. Adv. Neural Inf. Process. Syst. 2023, 36, 43322–43355. [Google Scholar]
  25. Arystanbekova, N.K. Application of Gaussian plume models for air pollution simulation at instantaneous emissions. Math. Comput. Simul. 2004, 67, 451–458. [Google Scholar] [CrossRef]
  26. Daly, A.; Zannetti, P. Air pollution modeling—An overview. Ambient. Air Pollut. 2007, 15–28. Available online: https://www.researchgate.net/profile/Arideep-Mukherjee/post/What-are-the-models-for-modelling-air-pollution/attachment/5bc95d70cfe4a76455fbd37d/AS%3A683302050607104%401539923312818/download/Modeling.pdf (accessed on 25 June 2024).
  27. Zheng, Y.; Yi, X.; Li, M.; Li, R.; Shan, Z.; Chang, E.; Li, T. Forecasting fine-grained air quality based on big data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 2267–2276. [Google Scholar]
  28. Yi, X.; Zhang, J.; Wang, Z.; Li, T.; Zheng, Y. Deep distributed fusion network for air quality prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 965–973. [Google Scholar]
  29. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017, arXiv:1707.01926. [Google Scholar]
  30. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875. [Google Scholar]
  31. Brin, S. The PageRank citation ranking: Bringing order to the web. Proc. ASIS 1998, 98, 161–172. [Google Scholar]
  32. Brandes, U. A faster algorithm for betweenness centrality. J. Math. Sociol. 2001, 25, 163–177. [Google Scholar] [CrossRef]
  33. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
  34. Park, C.; Lee, C.; Bahng, H.; Tae, Y.; Jin, S.; Kim, K.; Ko, S.; Choo, J. ST-GRAT: A novel spatio-temporal graph attention networks for accurately forecasting dynamically changing road speed. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020; pp. 1215–1224. [Google Scholar]
  35. Bhatti, U.A.; Yan, Y.; Zhou, M.; Ali, S.; Hussain, A.; Qingsong, H.; Yu, Z.; Yuan, L. Time series analysis and forecasting of air pollution particulate matter (PM2.5): An SARIMA and factor analysis approach. IEEE Access 2021, 9, 41019–41031. [Google Scholar] [CrossRef]
  36. Zhang, B.; Rong, Y.; Yong, R.; Qin, D.; Li, M.; Zou, G.; Pan, J. Deep learning for air pollutant concentration prediction: A review. Atmos. Environ. 2022, 290, 119347. [Google Scholar] [CrossRef]
  37. Zheng, C.; Fan, X.; Wang, C.; Qi, J. Gman: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Hilton, NY, USA, 7–12 February 2020; Volume 34, pp. 1234–1241. [Google Scholar]
  38. Yu, M.; Masrur, A.; Blaszczak-Boxe, C. Predicting hourly PM2.5 concentrations in wildfire-prone areas using a SpatioTemporal Transformer model. Sci. Total Environ. 2023, 860, 160446. [Google Scholar] [CrossRef] [PubMed]
  39. Liang, Y.; Xia, Y.; Ke, S.; Wang, Y.; Wen, Q.; Zhang, J.; Zheng, Y.; Zimmermann, R. Airformer: Predicting nationwide air quality in china with transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14329–14337. [Google Scholar]
Figure 1. The architecture of STLLM-ECS for nationwide industry P M 2.5 prediction. A nationwide sensor network with over 1000 sensors located throughout China’s industrial areas is first partitioned into E subgraphs. Each Edge Computing Server (ECS) covers a specific subgraph. Sensors installed on the subgraph record industrial P M 2.5 data, which are uploaded to the surrounding ECS. Meanwhile, an STLLM is deployed on each ECS. Each ECS is responsible for dealing with industry P M 2.5 data, training the STLLM, and inferring future industry P M 2.5 in the specific subgraph. Subsequently, the future industry P M 2.5 information is transmitted to the management department for factory rectification. To accelerate training, the parameter transfer strategy is used to initialize STLLM deployment on each ECS.
Figure 2. Overview of STLLM-ECS. SA is the spatial attention. TA is the temporal attention.
Figure 3. Framework of STLLM. STLLM is composed of STM and GLLM, in which STM contains L stacked ST blocks.
Figure 4. Results of the industry P M 2.5 prediction case study using the JiNanHuaGongChang and TongZhouXinCheng sites.
Figure 5. Experimental results under different hyperparameter settings on the nationwide industry P M 2.5 dataset.
Table 1. Industry P M 2.5 prediction accuracy comparison of STLLM-ECS and baselines on the nationwide sensor network. Bolding indicates the best results.
| Model | GPUM | 1–12 h MAE | 1–12 h RMSE | 13–24 h MAE | 13–24 h RMSE | 25–36 h MAE | 25–36 h RMSE | Average MAE | Average RMSE | TT | AT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HA | - | 47.28 | 92.65 | 47.28 | 92.65 | 47.28 | 92.65 | 47.28 | 92.65 | 1.95 h | - |
| SVR | - | 31.04 | 64.75 | 34.57 | 70.41 | 37.83 | 75.24 | 34.48 | 70.07 | 2.04 h | - |
| DCRNN | 5.03 G | 15.63 | 29.59 | 16.72 | 31.15 | 17.48 | 34.34 | 16.52 | 31.36 | 4.84 h | - |
| STGCN | 4.78 G | 15.37 | 30.28 | 15.98 | 31.24 | 16.82 | 32.36 | 16.06 | 31.29 | 3.89 h | - |
| ST-GRAT | 5.33 G | 16.36 | 31.27 | 18.01 | 36.43 | 19.86 | 40.24 | 18.08 | 35.98 | 5.61 h | - |
| GMAN | 6.73 G | 16.84 | 33.65 | 17.47 | 36.92 | 19.24 | 39.85 | 17.85 | 36.81 | 6.94 h | - |
| ST-Transformer | 5.17 G | 16.24 | 30.89 | 17.82 | 35.89 | 19.01 | 39.14 | 17.69 | 35.31 | 5.12 h | - |
| Airformer | 4.21 G | 15.58 | 29.37 | 16.96 | 34.27 | 18.41 | 38.12 | 16.98 | 33.92 | 3.98 h | - |
| LLM4TS | 1.72 G | 14.23 | 27.84 | 15.99 | 30.17 | 17.72 | 33.46 | 15.98 | 30.49 | 2.35 h | - |
| FPT | 1.58 G | 14.12 | 28.54 | 16.39 | 33.95 | 16.03 | 32.67 | 15.51 | 31.72 | 2.14 h | - |
| STLLM-ECS | 2.24 G | 13.25 | 25.32 | 15.37 | 28.89 | 16.93 | 32.17 | 15.18 | 28.79 | 2.67 h | 0.27 h |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
