Image-Based Spatio-Temporal Graph Learning for Diffusion Forecasting in Digital Management Systems

Du, Chenxi; Fu, Zhengjie; Hu, Yifan; Liu, Yibin; Cao, Jingwen; Liu, Siyuan; Zhan, Yan

doi:10.3390/electronics15020356

by Chenxi Du ^1,†, Zhengjie Fu ^1,2,†, Yifan Hu ^1,3,†, Yibin Liu ^2, Jingwen Cao ^1, Siyuan Liu ^1,2 and Yan Zhan ^1,4,*

^1 National School of Development, Peking University, Beijing 100871, China

^2 China Agricultural University, Beijing 100083, China

^3 School of Management, Zhejiang University, Hangzhou 310058, China

^4 Artificial Intelligence Research Institute, Tsinghua University, Beijing 100084, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Electronics 2026, 15(2), 356; https://doi.org/10.3390/electronics15020356

Submission received: 9 December 2025 / Revised: 29 December 2025 / Accepted: 12 January 2026 / Published: 13 January 2026

(This article belongs to the Special Issue Application of Machine Learning in Graphics and Images, 2nd Edition)

Abstract

With the widespread application of high-resolution remote sensing imagery and unmanned aerial vehicle technologies in agricultural scenarios, accurately characterizing spatial pest diffusion from multi-temporal images has become a critical issue in intelligent agricultural management. To overcome the limitations of existing machine learning approaches that focus mainly on static recognition and lack effective spatio-temporal diffusion modeling, a UAV-based pest diffusion prediction and simulation framework is proposed. Multi-temporal UAV RGB and multispectral imagery are jointly modeled using a graph-based representation of farmland parcels, while temporal modeling and environmental embedding mechanisms are incorporated to enable simultaneous prediction of diffusion intensity and propagation paths. Experiments conducted on two real agricultural regions, Bayan Nur and Tangshan, demonstrate that the proposed method consistently outperforms representative spatio-temporal baselines. Compared with ST-GCN, the proposed framework achieves approximately 17–22% reductions in MAE and MSE, together with 8–12% improvements in PMR, while maintaining robust classification performance with precision, recall, and F1-score exceeding 0.82. These results indicate that the proposed approach can provide reliable support for agricultural information systems and diffusion-aware decision generation.

Keywords:

agricultural image analysis; multi-temporal image modeling; UAV remote sensing; attention-based graph learning; spatio-temporal graph neural network; digital agricultural management; data-driven decision support

1. Introduction

Pest diffusion is recognized as one of the most destructive ecological processes in global agricultural production, posing long-term threats to food security, crop yield stability, and ecosystem sustainability [1]. Under the combined influence of climate change, farmland fragmentation, and increasing agricultural intensification, both the frequency and intensity of pest outbreaks have shown an upward trend, making accurate monitoring and prediction of pest diffusion a critical component of modern agricultural management [2]. Conventional pest monitoring in farmland primarily relies on manual field inspection and fixed trapping devices, whose effectiveness is constrained by limited spatial coverage, low monitoring frequency, and high labor costs, making large-scale and real-time monitoring impractical [3].

Traditional pest diffusion prediction methods are predominantly based on ecological dynamic models, such as diffusion equation models, particle diffusion models, and SIR (susceptible, infected, and recovered)-like infection models [4]. While these approaches possess clear biological interpretability, they also exhibit notable limitations. In particular, they typically rely on fixed parameters and idealized assumptions [5], such as homogeneous diffusion or unidirectional migration, thereby overlooking the highly nonlinear behaviors observed in natural environments. Furthermore, complex terrain-induced barriers and interactions among multiple wind directions are difficult to handle within these models, resulting in limited predictive capability in real-world farmland scenarios [6]. With the rapid development of agricultural big data, statistical models and machine learning techniques have been gradually introduced into pest prediction tasks, including ARIMA (autoregressive integrated moving average)-based time series analysis and random forest-based classification methods [7]. However, these methods commonly treat observations at different time points as independent samples, lacking the ability to model spatio-temporal dependencies and thus failing to reveal dynamic diffusion patterns across heterogeneous farmland parcels and varying meteorological conditions [8].

In recent years, the rapid advancement of unmanned aerial vehicle (UAV) [9] and remote sensing technologies [10] has provided new opportunities for agricultural pest monitoring [11]. Nevertheless, such imagery inherently represents static observations at discrete time points and cannot directly characterize continuous pest propagation patterns over time [12]. Consequently, deep learning approaches have been increasingly explored for pest detection and trend prediction, including convolutional neural networks (CNNs), recurrent neural networks such as long short-term memory (LSTM), and ConvLSTM architectures [13]. Although these methods have achieved progress in detection accuracy [14] and temporal modeling [15], they are predominantly focused on image-level or pixel-level static recognition tasks. For instance, LSTM-based models are effective in capturing temporal dependencies but lack explicit representations of spatial topology [16], while ConvLSTM methods perform well on regular grids yet struggle to adapt to irregular farmland layouts and wind-dominated propagation paths commonly observed in real agricultural environments [17]. Moreover, heterogeneous environmental factors such as wind speed, humidity, and soil properties often exhibit spatial distributions inconsistent with remote sensing imagery, posing significant challenges for effective multimodal data fusion in deep learning models [18]. Li et al. [19] proposed a UAV-based farmland stress detection model integrating stride attention and cross-modality fusion, achieving deep interactive fusion of multimodal data and significantly enhancing robust representations of drought, pest infestation, and disease stress features. Lu et al. [20] introduced a hyperspectral maize nitrogen content estimation model that combines spatio-temporal attention mechanisms with graph neural networks, enabling deep coupling of spectral, temporal, and spatial information. Ye et al. [21] developed a multi-scale attention-UNet based on multi-scale feature fusion and attention mechanisms to enhance satellite-based detection of pine forest pest damage.

To address the aforementioned limitations, a UAV-based pest diffusion prediction and simulation system integrating graph neural networks (GNNs) and spatio-temporal attention mechanisms, termed UAV-GNN-Pest, is proposed. The core concept is to represent farmland as a spatio-temporal graph composed of parcels or grid units, where nodes correspond to multi-source parcel-level features, including UAV-derived image representations, vegetation indices, meteorological variables, and terrain attributes, while edges encode spatial adjacency relationships and wind-weighted potential propagation pathways. Building upon this representation, a spatio-temporal attention graph encoder (STAGE), an environmental embedding fusion (EEF) module, and a diffusion path simulation and explainability (DPSE) module are constructed to jointly model temporal dependencies and spatial diffusion relationships. The main contributions of this study are summarized as follows:

A parcel-level structured pest diffusion graph modeling strategy is proposed, enabling unified representation of UAV imagery, meteorological data, and terrain information within a graph framework and facilitating efficient modeling of irregular farmland spatial relationships;
A GNN framework (STAGE) combining temporal convolution and dynamic spatial attention is designed to learn time-varying diffusion intensity and propagation directions of pest spread;
An environment-driven diffusion response modeling mechanism (EEF) is introduced to automatically learn the influence of wind direction, wind speed, and terrain barriers on pest propagation;
An interpretable diffusion path simulation module (DPSE) is developed to identify dominant diffusion channels and key contributing nodes, enhancing the practical applicability of the model in agricultural management;
Extensive validation is conducted on multi-region and multi-crop field datasets, demonstrating clear advantages in prediction accuracy, diffusion consistency, and generalization capability.

2. Related Work

2.1. Application of UAV and Remote Sensing Imagery in Pest Monitoring

The rapid development of unmanned aerial vehicle (UAV) and remote sensing technologies in agricultural monitoring has enabled large-scale observation of pest infestations [22]. The core principle of UAV-based image monitoring lies in capturing spectral reflectance characteristics of surface vegetation and crop canopies using high-resolution sensors, through which leaf damage, plant wilting, and variations in pest population density can be identified from image texture, color, spectral curves, and their temporal changes [23]. When pest infestation induces phenomena such as leaf spotting, increased defoliation rates, or abnormal reflectance in specific spectral bands, deep learning models are able to extract discriminative features from remote sensing imagery and map them to the spatial distribution of pest occurrence [24]. However, the underlying mechanism of such visual recognition models inherently emphasizes static image characteristics at the current time instance [25].

2.2. Pest Diffusion Modeling and Ecological Dynamic Prediction

In pest diffusion studies, ecological and mathematical models have historically played a dominant role [26]. Early diffusion models were primarily grounded in classical statistics and dynamical equations, aiming to describe pest population growth and migration using a limited set of parameters [27]. For instance, the SIR model characterizes pest population dynamics by partitioning populations into susceptible, infected, and recovered groups, while diffusion equation models conceptualize pest migration as a continuous spatial diffusion process, representing pest density variations over time and space through partial differential equations [28]. However, the fundamental assumptions underlying these models, such as environmental homogeneity, continuity of diffusion processes, and parameter stationarity, differ substantially from real agricultural ecosystems [29]. Nonlinear factors including boundary barriers between farmland parcels, heterogeneity in cultivation practices, and abrupt meteorological changes often induce pronounced spatio-temporal heterogeneity in pest diffusion, making fixed-parameter models inadequate for capturing realistic propagation pathways [30].

2.3. Graph Neural Networks and Spatio-Temporal Attention in Ecological Scenarios

The introduction of graph neural networks (GNNs) has provided a new theoretical foundation for addressing spatio-temporal coupling problems [31]. Unlike traditional image recognition or sequence-based models, GNNs structurally allow information to propagate from one farmland parcel to another [32], effectively resembling the real migration behavior of pests across parcels [33]. Moreover, when temporal dimensions are incorporated into graph neural networks to form spatio-temporal GNNs (ST-GNNs), temporal convolutions, self-attention mechanisms, or time encodings enable the simultaneous modeling of multi-temporal imagery [34], meteorological drivers, and temporal evolution of pest density [35]. Although GNNs and spatio-temporal attention models have achieved remarkable success in other domains [36], agricultural ecosystems are characterized by high complexity, strong dependence on natural processes, and significant data noise, which poses challenges when transferring methods from those domains [37]. More importantly, agricultural ecosystems involve unique driving factors, including temperature accumulation, crop phenological stages, vegetation indices, rainfall anomalies, and terrain occlusion, all of which exhibit complex nonlinear coupling with pest diffusion processes and are difficult to unify within traditional modeling approaches [38]. Consequently, constructing graph-based models capable of integrating multi-source remote sensing imagery, terrain information, meteorological drivers, and pest-related visual features has emerged as a critical research direction [39].

3. Materials and Method

3.1. Data Collection

Experimental data collection was conducted from June 2022 to August 2023 in two representative agricultural production regions, Bayan Nur in the Inner Mongolia Autonomous Region and Tangshan in Hebei Province, supplemented by samples gathered from the Internet, as shown in Table 1. The primary study objects consisted of large-scale cultivated maize (Zea mays) and winter wheat (Triticum aestivum) fields in these regions. The selected areas exhibit pronounced differences in climatic conditions, terrain structures, and pest composition. Specifically, Bayan Nur is characterized by arid and semi-arid climates with relatively open terrain, where pest diffusion is largely dominated by wind direction and wind speed, whereas Tangshan is strongly influenced by a temperate monsoon climate, showing higher farmland spatial heterogeneity and more complex spatio-temporal pest propagation patterns. Six major agricultural pest species with long-term high incidence and notable migration or diffusion characteristics in both regions were selected as the primary targets, including Mythimna separata, Spodoptera frugiperda, Ostrinia furnacalis, Sitobion avenae, Rhopalosiphum padi, and Mythimna loreyi, as shown in Figure 1. These pest species present substantial differences in outbreak cycles, migration capabilities, and sensitivity to meteorological conditions, thereby providing diverse scenarios for diffusion modeling and prediction. The dataset used in this study is a multimodal dataset consisting of visual samples and structured textual knowledge. Temporal and spatial information is not provided as additional imagery; instead, it is encoded in the form of structured textual data, including acquisition timestamps, parcel-level spatial indices, adjacency relationships among parcels, and associated environmental records. Formally, the input for each grid unit is formatted as a multimodal tuple $X = (I_{\mathrm{patch}}, V_{\mathrm{attr}})$, where $I_{\mathrm{patch}}$ represents the cropped UAV image tensor and $V_{\mathrm{attr}}$ denotes the structured vector containing synchronized meteorological readings and spatio-temporal metadata. These textual attributes are aligned with image samples and mapped to node- and edge-level features in the graph structure. Through this design, images, together with temporally and spatially indexed textual knowledge, jointly constitute the model input, enabling integrated spatio-temporal diffusion modeling.

UAV-based remote sensing imagery was acquired using a multirotor UAV platform, on which RGB and multispectral cameras were synchronously deployed. The spatial resolution of the RGB imagery was approximately 10 cm/pixel, while the multispectral imagery covered blue, green, red, and near-infrared bands to characterize crop canopy spectral properties and pest stress responses. UAV flight altitude was maintained within the range of 60–80 m, with forward and side overlap ratios kept at no less than 75% and 70%, respectively, to ensure the accuracy and consistency of orthomosaic generation and temporal comparison. Image acquisition was scheduled at a frequency of once every five days and was intensified during pest outbreak and rapid diffusion stages in order to capture critical periods of pest emergence, spread, and decline. The raw imagery was subjected to radiometric calibration, geometric correction, and orthorectification, resulting in spatially aligned multi-temporal remote sensing image sequences.

Meteorological data were obtained from automatic weather stations deployed within and around the study areas. The collected variables included near-surface air temperature, relative humidity, wind speed, wind direction, and precipitation, with a temporal resolution of 10 min. To match the UAV image acquisition schedule, meteorological variables were aggregated within corresponding temporal windows, enabling the representation of short-term and cumulative meteorological conditions affecting pest diffusion behavior. Terrain information was derived from a digital elevation model (DEM) sourced from high-resolution regional surveying data and resampled to the same spatial resolution as the remote sensing imagery, facilitating the characterization of micro-topographic variations, slope changes, and their constraining effects on diffusion pathways. In addition, the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI) were calculated based on multispectral imagery to reflect spatio-temporal variations in crop growth status, canopy density, and host suitability.
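The vegetation indices mentioned above can be illustrated with a minimal sketch. It assumes per-pixel surface reflectance values in the range 0–1 and uses the commonly cited EVI coefficients (gain 2.5, aerosol terms 6 and 7.5, canopy term 1); the article does not state which constants were used, and the helper `cell_mean_index` is a hypothetical name introduced here for averaging an index over one grid cell.

```python
def ndvi(nir, red, eps=1e-9):
    """Normalized difference vegetation index from surface reflectance."""
    return (nir - red) / (nir + red + eps)


def evi(nir, red, blue):
    """Enhanced vegetation index with the commonly used coefficients
    (assumed here; the article does not list its constants)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def cell_mean_index(pixels, index_fn):
    """Average an index over the pixels of one grid cell.
    `pixels` is a list of per-pixel reflectance tuples (hypothetical helper)."""
    values = [index_fn(*p) for p in pixels]
    return sum(values) / len(values)


# Two pixels given as (nir, red); a denser canopy gives a higher NDVI.
cell = [(0.5, 0.1), (0.4, 0.2)]
mean_ndvi = cell_mean_index(cell, ndvi)
```

Averaging per cell is only one plausible way to turn pixel-level indices into a node feature; other statistics such as the median or variance could serve the same role.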

During spatial structure modeling, farmland parcels were first delineated according to actual field boundaries derived from UAV orthomosaic imagery. Within each parcel, a regular grid partitioning strategy was then applied, with each grid cell corresponding to a node in the graph structure, as shown in Figure 2. In this study, the grid cell size is set to 20 m × 20 m, determined by the UAV image resolution and agricultural management requirements. Each grid cell is directly aligned with real UAV RGB and multispectral imagery, and visual features are extracted by cropping the corresponding image region. As a result, each graph node represents a real and spatially continuous farmland subregion rather than an abstract unit, ensuring consistency between grid-based modeling and real-world imagery. Node features were composed of remote sensing texture features, vegetation indices, and synchronous meteorological and terrain statistics. Adjacency relationships among nodes were established based on geographic proximity and further adjusted by incorporating dominant wind direction and wind speed to explicitly characterize the asymmetric propagation behavior of pests under wind field influence. Specifically, the prevailing wind field at time step $t$ is represented as a two-dimensional vector

$$w_{t} = v_{t} (\cos \theta_{t}, \sin \theta_{t}), \tag{1}$$

where $v_{t}$ and $\theta_{t}$ denote wind speed and wind direction, respectively. For any pair of spatially adjacent nodes $i$ and $j$, a displacement vector $d_{ij}$ is defined from node $i$ to node $j$. A directional modulation factor is then computed as

$$\eta_{ij}^{t} = \max \left( 0, \frac{w_{t} \cdot d_{ij}}{\lVert w_{t} \rVert \, \lVert d_{ij} \rVert} \right), \tag{2}$$

which quantifies the alignment between wind direction and the potential diffusion path. Because $d_{ji} = -d_{ij}$, this formulation induces asymmetric edge weights, with $\eta_{ij}^{t} \neq \eta_{ji}^{t}$ whenever the wind is not perpendicular to the edge. To further account for wind intensity, wind speed is normalized by a maximum wind speed $v_{\max}$ to obtain a scaling factor $s_{t} = v_{t} / v_{\max}$. The final wind-aware edge weight is defined as

$$A_{ij}^{t} = A_{ij}^{(0)} \cdot (1 + \lambda s_{t} \eta_{ij}^{t}), \tag{3}$$

where $A_{ij}^{(0)}$ denotes the original spatial adjacency weight and $\lambda$ controls wind influence strength. This design explicitly captures direction-dependent and intensity-sensitive pest diffusion behavior under real wind field conditions. Through this procedure, a parcel-level pest diffusion graph with temporal consistency and spatial continuity was constructed. The final dataset comprises approximately 3000 spatial nodes and 12,000 weighted edges across 8 consecutive temporal snapshots, enabling a systematic representation of the spatio-temporal diffusion patterns of multiple pests in real agricultural scenarios, as shown in Table 1.
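The wind-aware edge weighting of Equations (1)–(3) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the wind direction is taken as the angle (in radians) toward which the wind blows, node positions are planar coordinates in metres, and the function names and the default value of `lam` are hypothetical.

```python
import math


def wind_vector(speed, direction_rad):
    """Eq. (1): wind as a 2-D vector (direction = angle the wind blows toward)."""
    return (speed * math.cos(direction_rad), speed * math.sin(direction_rad))


def directional_factor(w, pos_i, pos_j):
    """Eq. (2): clipped cosine between the wind and the displacement i -> j."""
    d = (pos_j[0] - pos_i[0], pos_j[1] - pos_i[1])
    norm = math.hypot(*w) * math.hypot(*d)
    if norm == 0.0:  # calm wind or coincident nodes: no directional boost
        return 0.0
    return max(0.0, (w[0] * d[0] + w[1] * d[1]) / norm)


def wind_aware_weight(a0, speed, direction_rad, pos_i, pos_j, v_max, lam=1.0):
    """Eq. (3): A_ij^t = A_ij^(0) * (1 + lambda * s_t * eta_ij^t)."""
    eta = directional_factor(wind_vector(speed, direction_rad), pos_i, pos_j)
    return a0 * (1.0 + lam * (speed / v_max) * eta)


# Wind blowing along +x at half the maximum speed; node j lies downwind of i.
downwind = wind_aware_weight(1.0, 5.0, 0.0, (0.0, 0.0), (20.0, 0.0), v_max=10.0)
upwind = wind_aware_weight(1.0, 5.0, 0.0, (20.0, 0.0), (0.0, 0.0), v_max=10.0)
```

In this example the downwind edge is strengthened to 1.5 while the upwind edge keeps its base weight of 1.0, which is the asymmetry the formulation is designed to produce.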

3.2. Data Preprocessing and Augmentation

To ensure standardized and learnable inputs for spatio-temporal modeling, a systematic data preprocessing and augmentation pipeline was established to eliminate sensor and environmental disturbances. At the image level, radiometric calibration was performed to convert digital numbers into surface reflectance with physical meaning, effectively mitigating inconsistencies caused by sensor response and illumination fluctuations. Geometric correction was subsequently applied using Ground Control Points (GCPs) and UAV pose parameters to ensure strict pixel-to-pixel alignment across temporal sequences, followed by cloud–shadow removal to eliminate anomalous pixel interferences.

At the annotation and feature levels, expert-delineated pest regions were rasterized into temporally continuous binary masks to serve as supervisory signals. Simultaneously, multi-source environmental data, including meteorological and terrain variables, underwent Z-score standardization and linear interpolation to ensure statistical uniformity and temporal completeness. Furthermore, to enhance model generalization and robustness against local perturbations and scale variations, data augmentation strategies—specifically CutMix and Random Crop—were employed during the training phase. The detailed mathematical formulations and implementation procedures for these preprocessing steps are provided in Appendix A.
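As a rough illustration of the environmental preprocessing, the sketch below fills missing readings by linear interpolation and then applies Z-score standardization. The exact procedure is given in Appendix A; the order of the two steps, the handling of gaps at the ends of a series, and the use of the population standard deviation are assumptions made here.

```python
import statistics


def interpolate_gaps(series):
    """Fill None entries by linear interpolation between the nearest known
    neighbours; leading/trailing gaps take the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


def z_score(series):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0.0:
        return [0.0 for _ in series]
    return [(v - mean) / std for v in series]


# Wind-speed readings with two missing samples.
filled = interpolate_gaps([1.0, None, 3.0, None])
standardized = z_score(filled)
```

In practice the mean and standard deviation would be estimated on the training period only, so that statistics from later observations do not leak into training.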

3.3. Proposed Method

3.3.1. Overall

The proposed framework organizes multi-source data into a parcel-level graph structure and processes it through four consecutive stages: feature extraction, graph encoding, environmental embedding fusion, and diffusion path simulation. In the graph construction, farmland parcels serve as nodes characterized by remote sensing textures and vegetation indices, while edges are defined by spatial proximity and dominant wind direction to represent potential propagation pathways. The core processing begins with the graph encoding module, which employs graph attention to aggregate local neighborhood information and temporal convolutions to capture historical dynamics. This design generates unified representations that encapsulate both structural and evolutionary dependencies without relying on separate processing streams. Subsequently, the environmental fusion module integrates meteorological and topographic variables into these node features via an attention mechanism, adaptively modulating diffusion states based on external regulatory factors. Finally, the updated representations are fed into a prediction head for pest intensity forecasting. Simultaneously, an explainable simulation module accumulates attention weights along propagation trajectories to construct diffusion heatmaps, enabling the visualization and simulation of pest evolution over future horizons of 48 h or 7 days.

3.3.2. Spatio-Temporal Attention Graph Encoder

The spatio-temporal attention graph encoder (STAGE) is constructed to jointly capture the structural diffusion patterns and temporal evolutionary trends of pests. As illustrated in Figure 3, the module processes the input tensor $X \in \mathbb{R}^{T \times N \times 128}$ through a sequential pipeline of temporal signal modeling, attention-based graph aggregation, and multimodal feature fusion.

The architecture consists of two primary components operating in coordination. First, the spatial encoding component utilizes a two-layer multi-head graph attention network. Each layer employs 4 attention heads with a projection dimension of 32, concatenated to maintain a 128-dimensional feature space. This mechanism allows the model to learn anisotropic diffusion strengths between farmland parcels, emphasizing key neighboring nodes that exert greater influence on the current infestation state. Second, the temporal encoding component adopts a three-layer one-dimensional causal convolutional network. The channel dimensions are progressively expanded ($64 \to 128 \to 128$) with a kernel size of 3. A critical feature of this design is the strict causal constraint, ensuring that the prediction at time step $t$ depends solely on historical information up to $t$, thereby preventing information leakage from future observations.

This design yields two significant theoretical properties: permutation invariance and temporal causality. The spatial aggregation is insensitive to the ordering of neighbor indices, ensuring robustness to changes in graph indexing, while the causal convolution enforces a valid chronological information flow. The final output tensor $H \in \mathbb{R}^{T \times N \times 128}$ serves as a structural backbone, which is subsequently modulated by environmental factors in the EEF module to account for region-specific climatic and topographic variations. The detailed mathematical formulations for the attention update and causal convolution are provided in Appendix B.
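The causal constraint can be made concrete with a single-channel sketch. It assumes zero left-padding so that the output keeps the input length; the encoder described above is multi-channel with learned weights, so the scalar kernel below is purely illustrative.

```python
def causal_conv1d(series, kernel):
    """1-D causal convolution: output[t] uses series[t], series[t-1], ...
    only. The input is left-padded with zeros so the length is preserved.
    kernel[0] weights the current step, kernel[k] the step k in the past."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    out = []
    for t in range(len(series)):
        window = padded[t:t + k]  # holds series[t-k+1 .. t]
        out.append(sum(w * x for w, x in zip(kernel, reversed(window))))
    return out


kernel = [0.5, 0.3, 0.2]  # kernel size 3, as in the temporal encoder
y = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)

# Changing the last observation must not affect any earlier output.
y_perturbed = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)
```

Comparing `y[:3]` with `y_perturbed[:3]` shows that the first three outputs are identical, which is exactly the leakage-free property the causal constraint is meant to guarantee.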

3.3.3. Environmental Embedding Fusion

The environmental embedding fusion (EEF) module is designed to modulate the structural diffusion patterns learned by STAGE using exogenous environmental constraints. As illustrated in Figure 4, the module processes three categories of environmental inputs: meteorological ($E_{m}^{t}$), topographic ($E_{g}$), and crop indices ($E_{c}^{t}$). These heterogeneous features are transformed into unified embeddings $Z_{k}^{t} \in \mathbb{R}^{128}$ ($k \in \{m, g, c\}$) via independent feature encoders to ensure semantic alignment with the node-level spatio-temporal representations $H^{t}$.

To quantify the dynamic influence of different environmental factors, an attention mechanism aggregates these embeddings into a unified modulation vector $Z_{\mathrm{env}}^{t}$. The core fusion is achieved through an environment-aware gating mechanism, which multiplicatively regulates the diffusion representations:

$$\tilde{H}^{t} = \phi \left( H^{t} \odot Z_{\mathrm{env}}^{t} + H^{t} \right), \tag{4}$$

where $\odot$ denotes the channel-wise Hadamard product and $\phi(\cdot)$ is a nonlinear activation function. This formulation allows environmental conditions to amplify or suppress diffusion intensities while the residual connection ensures the preservation of the underlying structural topology.

Ecologically, the attention weights derived during this process provide interpretability. For instance, a high meteorological weight ($\beta_{m}$) indicates diffusion driven by atmospheric transport (e.g., wind), whereas dominant vegetation weights ($\beta_{c}$) suggest host-availability regulated spread. This complementary design enables STAGE to capture the pathways of diffusion, while EEF determines the intensity based on environmental favorability. The detailed mathematical formulations are provided in Appendix C.
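A toy version of the fusion step helps to show how the attention weights and the gate of Equation (4) interact. The sketch assumes that the attention weights come from a softmax over one scalar score per environmental source and that the activation is a ReLU; both are assumptions, since the exact formulation is deferred to Appendix C, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_environment(embeddings, scores):
    """Attention-weighted sum of the environmental embeddings Z_k.
    `embeddings` maps k in {"m", "g", "c"} to equal-length vectors and
    `scores` maps the same keys to unnormalized attention logits."""
    keys = sorted(embeddings)
    betas = dict(zip(keys, softmax([scores[k] for k in keys])))
    dim = len(embeddings[keys[0]])
    z_env = [sum(betas[k] * embeddings[k][d] for k in keys) for d in range(dim)]
    return z_env, betas


def gate(h, z_env):
    """Eq. (4) with phi = ReLU (an assumption): phi(h * z_env + h)."""
    return [max(0.0, hi * zi + hi) for hi, zi in zip(h, z_env)]


embeddings = {"m": [1.0, 0.0], "g": [0.0, 1.0], "c": [0.0, 0.0]}
z_env, betas = fuse_environment(embeddings, {"m": 0.0, "g": 0.0, "c": 0.0})
h_tilde = gate([3.0, -3.0], z_env)
```

The returned `betas` play the role of the weights discussed above: with a larger meteorological score, `betas["m"]` would dominate and the modulation vector would follow the meteorological embedding more closely.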

3.3.4. Diffusion Path Simulation and Explainability

The diffusion path simulation and explainability (DPSE) module is designed to transform implicit spatio-temporal dependencies into explicit, interpretable diffusion pathways. Unlike standard self-attention mechanisms that primarily focus on feature reweighting, DPSE emphasizes the topological propagation of states along spatial edges and temporal axes. By performing path-level composition of attention weights, this module reconstructs causal diffusion chains, offering physically meaningful propagation directions consistent with ecological migration mechanisms.

As illustrated in Figure 5, DPSE employs a lightweight three-layer network to process node representations and edge attention weights. The input layer accepts node embeddings with dimensionality $N \times 128$, matching the upstream encoder outputs. The intermediate layer diffuses node states along spatial edges using weighted propagation and models temporal evolution via a sliding window. Finally, the output layer aggregates these propagation results into a path importance scoring map ($N \times N$), enabling the visualization of diffusion heatmaps.

Formally, at time step $t$, the edge-level diffusion weight $P_{ij}^{t}$ is computed by modulating the spatial attention weights with pairwise node interactions:

$$P_{ij}^{t} = \alpha_{ij}^{t} \cdot \sigma \left( (h_{i}^{t})^{\top} W_{p} h_{j}^{t} \right), \tag{5}$$

where $\alpha_{ij}^{t}$ denotes the spatial attention weights from STAGE, $h_{i}^{t}$ and $h_{j}^{t}$ are node embeddings, $W_{p}$ is a learnable projection matrix, and $\sigma(\cdot)$ is a nonlinear activation function. To capture cumulative diffusion effects, multi-step path importance is calculated via temporal recursion:

$$S_{ij}^{(T)} = \sum_{t = 1}^{T} \gamma^{T - t} P_{ij}^{t}, \tag{6}$$

where $\gamma \in (0, 1)$ is a decay factor that attenuates distant historical influences, focusing interpretability on recent dominant processes.
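Equations (5) and (6) can be sketched as follows. The example assumes a sigmoid for the activation and treats the node interaction as a bilinear form between the two embeddings; the decay value 0.8 and the function names are illustrative choices, not values reported in the article.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_diffusion_weight(alpha_ij, h_i, h_j, w_p):
    """Eq. (5): attention weight scaled by a bilinear node interaction.
    `w_p` is a square matrix given as a list of rows."""
    bilinear = sum(h_i[a] * w_p[a][b] * h_j[b]
                   for a in range(len(h_i)) for b in range(len(h_j)))
    return alpha_ij * sigmoid(bilinear)


def path_importance(p_over_time, gamma=0.8):
    """Eq. (6) as a recursion: S^(t) = gamma * S^(t-1) + P^t, S^(0) = 0."""
    score = 0.0
    for p_t in p_over_time:
        score = gamma * score + p_t
    return score


# One edge observed over three snapshots with a zero interaction matrix.
p = edge_diffusion_weight(0.5, [1.0, 0.0], [0.0, 1.0], [[0.0, 0.0], [0.0, 0.0]])
s = path_importance([1.0, 1.0, 1.0], gamma=0.5)
```

Unrolling the recursion reproduces the weighted sum in Equation (6): with a decay of 0.5 and three unit weights, the score is 0.25 + 0.5 + 1 = 1.75, so the most recent snapshot contributes the most.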

Theoretically, this design functions as an attention-constrained weighted path integral over the spatio-temporal graph. With normalized attention weights and $\gamma < 1$, the accumulation process converges, ensuring numerical stability. By constraining path inference with both structural and environmental information, DPSE avoids generating spurious diffusion channels. Consequently, the module supports both future diffusion simulation (e.g., 48 h horizons) and path-level explanation, transforming the model into a decision-support tool.
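The stability claim can be checked with a short bound. Assuming that the attention weights satisfy $0 \le \alpha_{ij}^{t} \le 1$ and that the activation $\sigma$ maps into $[0, 1]$ (as a sigmoid would; the article does not specify it), each edge weight satisfies $0 \le P_{ij}^{t} \le 1$, and the accumulated score is bounded by a geometric series:

```latex
0 \le S_{ij}^{(T)}
  = \sum_{t=1}^{T} \gamma^{T-t} P_{ij}^{t}
  \le \sum_{t=1}^{T} \gamma^{T-t}
  = \frac{1 - \gamma^{T}}{1 - \gamma}
  < \frac{1}{1 - \gamma}.
```

For example, with $\gamma = 0.8$ no path score can exceed 5, regardless of how many snapshots are accumulated.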

It is important to note that DPSE provides structural interpretations of information flow rather than verified causal explanations. Unlike feature-attribution methods (e.g., SHAP) that assess input sensitivity, DPSE aggregates attention along edges to reveal propagation topology. Thus, it complements existing explainability techniques by offering a path-level perspective tailored for diffusion modeling.

4. Results and Discussion

4.1. Experimental Settings

4.1.1. Platform and Training Configuration

The hardware platform was constructed on a high-performance computing environment. The core computational device was a deep learning workstation equipped with an NVIDIA RTX 4090 GPU featuring 24 GB of onboard memory. On the software side, the experimental environment was established based on the Python 3.8 deep learning ecosystem. The operating system was Ubuntu 22.04, and PyTorch 2.x was adopted as the primary deep learning framework for model training. Hardware acceleration was enabled through CUDA 12.x and cuDNN to facilitate efficient matrix operations. The dataset is split following a time-based strategy rather than random sampling. Specifically, all samples are chronologically ordered according to their UAV acquisition timestamps. The earliest 70% of the temporal sequence is used for training, the subsequent 15% for validation, and the most recent 15% for testing. This setting strictly preserves temporal causality and avoids information leakage from future observations into model training. To further evaluate model robustness, five-fold cross-validation is conducted within the training period only, ensuring that validation folds do not violate the temporal order of the data. Regarding training hyperparameters, the learning rate $\alpha$ was initialized as $1 \times 10^{-4}$, and AdamW was selected as the optimizer with a weight decay coefficient of $1 \times 10^{-5}$. The batch size was set to 32, and the maximum number of training epochs was 100. A cosine annealing learning rate scheduler was applied to achieve smoother convergence behavior. The loss function was selected as either cross-entropy loss or focal loss, depending on the task formulation. An early stopping strategy was further incorporated during training to mitigate overfitting, ensuring that optimal performance could be achieved during the testing phase.
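The time-based split and the learning rate schedule described above can be sketched as follows. The sample representation (tuples starting with a timestamp), the rounding of split sizes, and a minimum learning rate of zero are assumptions made for illustration; the function names are hypothetical.

```python
import math


def chronological_split(samples, train_frac=0.70, val_frac=0.15, key=None):
    """Sort by acquisition time, then cut into train / validation / test."""
    ordered = sorted(samples, key=key)
    n = len(ordered)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])


def cosine_annealing_lr(epoch, max_epochs=100, base_lr=1e-4, min_lr=0.0):
    """Cosine annealing: base_lr at epoch 0, min_lr at max_epochs."""
    cos = math.cos(math.pi * epoch / max_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)


# Twenty samples given as (timestamp, sample_id), deliberately out of order.
samples = [(t, "s%d" % t) for t in range(20, 0, -1)]
train, val, test = chronological_split(samples, key=lambda s: s[0])
lr_mid = cosine_annealing_lr(50)
```

Because the split is taken after sorting, every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp.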

4.1.2. Baseline Models and Evaluation Metrics

To evaluate the proposed method, multiple representative spatio-temporal prediction baseline models were selected for comparison, including the classical time series model LSTM [40], the spatio-temporal convolution-based ConvLSTM [41], the graph-structured ST-GCN [42], and the Transformer-based ViT-Spatio model with global attention mechanisms [43]. Multiple evaluation metrics were adopted to assess model performance from different perspectives, including mean absolute error (MAE), mean squared error (MSE), Pearson correlation coefficient (R), F1-score, and path match rate (PMR) [44]. MAE and MSE quantify numerical prediction error, R measures the overall correlation between predictions and observations, the F1-score reflects classification discriminability, and the PMR evaluates the structural consistency of the predicted diffusion paths. The mathematical definitions of these metrics are given as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|, \tag{7}$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}, \tag{8}$$

$$R = \frac{\sum_{i=1}^{N} (y_{i} - \bar{y})(\hat{y}_{i} - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_{i} - \bar{y})^{2}} \sqrt{\sum_{i=1}^{N} (\hat{y}_{i} - \bar{\hat{y}})^{2}}}, \tag{9}$$

$$F1 = \frac{2 \cdot P \cdot R_{c}}{P + R_{c}}, \tag{10}$$

$$\mathrm{PMR} = \frac{|P_{\mathrm{pred}} \cap P_{\mathrm{true}}|}{|P_{\mathrm{true}}|}. \tag{11}$$

Here, $N$ denotes the number of samples; $y_{i}$ and $\hat{y}_{i}$ represent the ground-truth value and predicted value, respectively; $\bar{y}$ and $\bar{\hat{y}}$ denote their corresponding mean values. $P$ represents precision, and $R_{c}$ denotes recall. $P_{\mathrm{pred}}$ and $P_{\mathrm{true}}$ represent the predicted diffusion path set and the ground-truth diffusion path set, respectively.

The ground-truth diffusion path set $P_{\mathrm{true}}$ is constructed based on the temporal evolution of expert-annotated pest occurrence regions in multi-temporal UAV imagery. Specifically, for each time step $t$, pest-affected areas are identified through expert labeling and mapped to parcel-level graph nodes. A diffusion link from node $i$ to node $j$ is established if node $j$ becomes newly infested at time $t + 1$ and is spatially adjacent to node $i$, which was already infested at time $t$. By aggregating such node-to-node transitions across all consecutive time steps, a set of ground-truth diffusion paths is obtained. This construction strictly follows temporal causality and reflects observed spatial expansion patterns of pest infestation, providing a reliable reference for evaluating diffusion path consistency using the PMR.

The PMR metric is conceptually related to structural trajectory consistency measures used in spatio-temporal propagation analysis, such as node- or edge-overlap ratios. Unlike point-wise error metrics that evaluate prediction accuracy at individual locations, the PMR focuses on whether the predicted diffusion follows the same topological paths as the observed process. In practice, both predicted and ground-truth diffusion results are represented as sets of node-to-node propagation paths. The PMR is then computed as the ratio between the number of matched paths and the total number of ground-truth paths, where a path is considered matched if its node sequence is consistent in direction and connectivity. This formulation enables the PMR to capture structural agreement in diffusion behavior, making it suitable for evaluating path-level consistency beyond numerical intensity prediction.
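The metric definitions and the ground-truth link construction above can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in the study. All function names are hypothetical, and each path is simplified to a single directed link $(i, j)$, which is an assumption; longer node sequences would need a sequence-level matching rule.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[int, int]  # directed diffusion link (source node, target node)


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def pearson_r(y: Sequence[float], y_hat: Sequence[float]) -> float:
    mean_y = sum(y) / len(y)
    mean_h = sum(y_hat) / len(y_hat)
    cov = sum((a - mean_y) * (b - mean_h) for a, b in zip(y, y_hat))
    norm_y = math.sqrt(sum((a - mean_y) ** 2 for a in y))
    norm_h = math.sqrt(sum((b - mean_h) ** 2 for b in y_hat))
    return cov / (norm_y * norm_h)


def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def true_diffusion_links(
    infested_by_step: List[Set[int]], neighbors: Dict[int, Set[int]]
) -> Set[Edge]:
    """Link i -> j when j is newly infested at t + 1 and adjacent to a
    node i that was already infested at t."""
    links: Set[Edge] = set()
    for now, nxt in zip(infested_by_step, infested_by_step[1:]):
        for j in nxt - now:                         # newly infested nodes
            for i in neighbors.get(j, set()) & now:  # infested neighbours
                links.add((i, j))
    return links


def pmr(pred: Set[Edge], true: Set[Edge]) -> float:
    """Share of ground-truth links recovered with the correct direction."""
    return len(pred & true) / len(true)
```

For a chain of parcels 0-1-2-3 where the infestation advances one parcel per step from node 0, `true_diffusion_links` yields the links (0, 1) and (1, 2); a prediction that reverses the second link recovers only one of the two.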

4.2. Performance Comparison on the Bayan Nur Dataset

This experiment is designed to systematically evaluate the modeling and prediction capabilities of different mainstream temporal and spatio-temporal approaches for pest diffusion processes in the typical farmland scenario of Bayan Nur. The study area is characterized by relatively open terrain, where pest diffusion exhibits strong spatial continuity and directional patterns, while being jointly influenced by wind field variations and temporal accumulation effects. Consequently, high demands are imposed on a model’s ability to capture temporal dependencies, model spatial propagation, and maintain diffusion path consistency. By adopting the MAE, MSE, the correlation coefficient R, F1-score, and the diffusion path matching rate PMR as evaluation metrics, model performance can be comprehensively assessed from multiple perspectives, including error magnitude, overall correlation, classification discriminability, and structural consistency of diffusion.

As shown in Table 2 and Figure 6, the conventional LSTM model exhibits the weakest performance across all metrics. This limitation can be primarily attributed to its architecture, which models only temporal dependencies without explicitly incorporating spatial relationships, thereby restricting its ability to capture pest propagation between neighboring parcels. ConvLSTM introduces spatial information through local convolution operations and consequently achieves moderate improvements in error and correlation-related metrics over LSTM; however, its spatial modeling remains confined to fixed local windows, making it difficult to adapt to irregular farmland structures and non-uniform diffusion channels. ViT-Spatio further improves overall prediction accuracy by modeling global dependencies via self-attention mechanisms, yet it mainly emphasizes feature-level spatial correlations and lacks explicit graph-structured constraints on pest migration, resulting in limited gains in diffusion path consistency as reflected by the PMR. ST-GCN explicitly incorporates graph structures for spatio-temporal modeling and achieves stable improvements in MAE, R, and PMR, indicating that graph convolution is effective in capturing propagation topology among farmland parcels. Nevertheless, its reliance on fixed adjacency weights limits its ability to reflect dynamically varying diffusion intensities. In contrast, UAV-GNN-Pest consistently outperforms all competing methods across all metrics, demonstrating lower prediction errors and higher diffusion path consistency. From a mathematical modeling perspective, this performance gain is closely related to the dynamic graph attention mechanism, which adaptively reweights adjacency relationships according to temporal states, enabling asymmetric propagation driven by wind direction to be effectively captured. 
Meanwhile, the temporal modeling component stably captures cumulative and lagged effects during pest diffusion, leading to superior performance in both the correlation coefficient R and the F1-score.

4.3. Performance Comparison on the Tangshan Dataset

This experiment aims to evaluate the capability of different models to model and predict pest diffusion processes in the complex agricultural environment of Tangshan. Compared with Bayan Nur, agricultural fields in Tangshan exhibit higher landscape heterogeneity and more frequent meteorological disturbances, with pest diffusion often characterized by the coexistence of multi-source initiation and non-uniform propagation. As a result, this dataset places greater emphasis on model generalization under complex spatial structures as well as the ability to capture multi-stage diffusion patterns.

As shown in Table 3 and Figure 7, the traditional LSTM model again yields the weakest performance across all evaluation metrics. Its primary limitation arises from focusing solely on temporal correlations while ignoring spatial diffusion among parcels, which leads to particularly constrained performance on the spatial consistency-related PMR metric. By embedding local convolution operations within recurrent structures, ConvLSTM achieves consistent improvements over LSTM across all metrics; however, its spatial modeling remains limited to regular grids, making it insufficient for accurately characterizing irregular field boundaries and complex diffusion channels commonly observed in Tangshan. ViT-Spatio benefits from self-attention mechanisms that allow more flexible modeling of spatial relationships, leading to further improvements in MAE, R, and F1-score. Nevertheless, due to the absence of explicit propagation structure constraints, the learned spatial dependencies largely reflect feature similarity rather than true diffusion routes, resulting in limited capability to fit actual propagation paths. After introducing graph-structured modeling, ST-GCN further improves performance on the Tangshan dataset, particularly in error and correlation metrics, demonstrating the advantage of graph convolution in adapting to complex parcel topologies. However, the use of fixed adjacency matrices and static weights restricts its ability to capture dynamically changing diffusion intensities under rapidly varying meteorological conditions and multi-directional propagation scenarios. In contrast, UAV-GNN-Pest achieves the best performance across all metrics, with particularly pronounced advantages in the PMR and the correlation coefficient R. 
From a mathematical standpoint, this performance can be attributed to the dynamic attention mechanism that assigns adaptive weights to different adjacency relationships, freeing spatial propagation from static structural assumptions, while the temporal modeling component effectively captures lagged and cumulative effects of pest diffusion, thereby enhancing overall predictive correlation. Moreover, the incorporation of diffusion path consistency constraints encourages the model to focus on propagation structures during learning, enabling high path matching rates to be maintained even in complex regional environments. These results demonstrate that jointly modeling dynamic spatio-temporal relationships and diffusion mechanisms is particularly effective for characterizing pest diffusion processes in high-complexity agricultural systems such as Tangshan.

4.4. Ablation Study of Different Modules on Two Datasets

This ablation study is conducted to systematically evaluate the practical contributions of the core modules within UAV-GNN-Pest across different regional datasets, and to verify that the observed performance improvements are not attributable to a single architectural component but rather to the joint effects of spatio-temporal modeling, environment-driven modulation, and diffusion explainability mechanisms. By independently removing the STAGE, EEF, and DPSE modules on two agricultural scenarios with distinct characteristics, namely Bayan Nur and Tangshan, and comparing the resulting performance with that of the complete model, the specific roles of each module in prediction accuracy, correlation strength, and diffusion path consistency can be analyzed. In the specific experimental settings, for w/o STAGE, the graph encoder is replaced by a linear embedding layer that projects raw node features directly into the EEF module; for w/o EEF, the environmental modulation is omitted, and the spatio-temporal representations from STAGE are fed directly into the prediction head; and for w/o DPSE, the explicit path simulation is disabled, with diffusion paths inferred implicitly based on the temporal adjacency of predicted infestation events rather than learned edge weights.

As shown in Table 4, the complete model achieves the best performance on both datasets, indicating that the proposed framework exhibits strong structural stability and cross-regional adaptability. When STAGE is removed, the most pronounced degradation is observed across all metrics in both regions, particularly in MAE, R, and PMR, demonstrating that without explicit spatio-temporal graph encoding, the model fails to accurately capture the underlying propagation backbone among parcels. When EEF is excluded, moderate performance degradation is observed in error and correlation-related indicators, although the decline is less severe than that caused by removing STAGE. This observation suggests that environmental embeddings primarily function to refine diffusion intensity and enhance regional adaptability rather than to determine the global propagation structure. In contrast, removing DPSE results in relatively limited changes in prediction errors, while the PMR is substantially reduced on both datasets, indicating that DPSE plays a critical role in maintaining diffusion path consistency but has a comparatively smaller impact on numerical prediction accuracy. From the perspective of mathematical modeling characteristics, these results are highly consistent with the functional responsibilities of each module. STAGE jointly models temporal dependencies and spatial propagation relations on graph structures, thereby governing the backbone evolution of the pest diffusion process; its absence causes the model to degenerate into a weakly structured temporal predictor, leading to overall degradation in both error and correlation metrics. 
EEF modulates spatio-temporal representations through environmental conditioning, enabling dynamic adjustment of diffusion rates under varying climatic and topographic conditions; its removal mainly affects stability and fine-grained prediction accuracy in cross-regional settings, with particularly pronounced effects on the environmentally complex Tangshan dataset. Although DPSE does not directly intervene in node state updating, it reconstructs implicit propagation patterns into consistent diffusion pathway constraints at the path level, guiding the learning process toward global structural coherence, which explains the substantial decrease in the PMR when this module is omitted.

4.5. Discussion

The proposed approach is evaluated under realistic agricultural production scenarios, and its practical value for UAV- and remote-sensing-driven pest diffusion prediction is systematically validated. In Bayan Nur, where farmland parcels are contiguous and wind fields dominantly govern pest migration, species such as Spodoptera frugiperda and Mythimna separata often diffuse rapidly along prevailing wind directions. As illustrated in Figure 8, our model is capable of precisely focusing on the pest occurrence regions within the imagery, thereby identifying the core diffusion sources. Conventional approaches relying on fixed monitoring stations or single-temporal remote sensing images are often insufficient for the timely characterization of such dynamic pathways, frequently resulting in delayed intervention. By jointly modeling spatial dependencies across neighboring parcels based on these localized infestation sources, the proposed method enables early identification of potential propagation directions and high-risk areas, thereby providing actionable guidance for local agricultural technicians to deploy traps in advance or to organize regional joint prevention strategies. Conversely, in Tangshan, where farmland structures are more fragmented, crop types are interwoven, and meteorological conditions fluctuate frequently, pest diffusion tends to exhibit multi-source initiation and discontinuous propagation. In such complex environments, the proposed model maintains high prediction accuracy and diffusion path consistency, indicating that the spatio-temporal graph-based modeling strategy closely aligns with the operational mechanisms of real agricultural ecosystems.

From a decision-making standpoint, the numerical prediction of pest intensity represents only one dimension of practical utility. By explicitly visualizing potential diffusion pathways, the proposed system facilitates a transition from point-based pesticide application to pathway-oriented intervention strategies. For instance, priority treatment or biological control measures can be precisely targeted along predicted propagation corridors, effectively reducing overall pesticide usage and mitigating adverse impacts on non-target crops and the surrounding ecological environment. Furthermore, for migratory pests that traverse township or county boundaries, the proposed framework provides quantitative support for region-level coordinated control, preventing redundant application or control gaps caused by fragmented information. When integrated into existing UAV field inspection and meteorological service platforms within grassroots agricultural extension systems, the system can generate dynamically updated risk warning maps, driving a paradigm shift from reactive control toward proactive, anticipatory management.

4.5.1. Support for Agricultural Information Systems and Decision-Making

Beyond predictive accuracy, the proposed framework provides practical value for agricultural information systems and decision-making processes. By jointly modeling pest diffusion intensity and propagation pathways, the framework enables the transformation of raw UAV imagery and environmental data into structured, decision-relevant information. The predicted diffusion paths can be directly integrated into agricultural information systems as dynamic risk layers, supporting early warning, spatial prioritization, and coordinated control planning. Unlike conventional parcel-wise monitoring approaches, the path-oriented representation allows decision-makers to identify key transmission corridors and upstream risk sources, facilitating targeted interventions and resource allocation. In addition, the interpretable outputs produced by the diffusion path simulation module can be used to support explainable decision generation, enhancing trust and usability for agricultural practitioners. These characteristics make the proposed approach suitable for integration with existing digital agriculture platforms and regional pest management systems, contributing to data-driven, proactive, and sustainable agricultural decision-making.

4.5.2. Failure Modes and Limitations

Despite its overall effectiveness, the proposed framework exhibits several limitations under specific conditions. First, for highly migratory pest species whose diffusion is dominated by large-scale atmospheric transport, the parcel-level graph structure may underestimate sudden long-distance spread beyond local neighborhoods. Second, under extreme or rapidly changing weather conditions, such as abrupt wind shifts or heavy rainfall, the temporal resolution of UAV observations may be insufficient to fully capture high-frequency diffusion dynamics, leading to delayed or inaccurate path predictions. Third, during early-stage or low-intensity infestations, visual and environmental signals are often weak, which may reduce the stability of diffusion path inference. These limitations indicate that the current model is most reliable for gradual diffusion scenarios with sufficient observational support, and further integration of higher-frequency sensing or large-scale atmospheric information may be required to address these challenging cases.

4.6. Limitations and Future Work

Although the proposed UAV- and remote-sensing-based pest diffusion prediction and simulation framework achieves stable and interpretable performance across multiple real-world agricultural scenarios, several aspects remain to be further improved. At the data acquisition level, the model primarily relies on periodic UAV flights and fixed meteorological station observations. While these data sources adequately reflect regional-scale diffusion trends, the current temporal resolution may be insufficient to capture high-frequency dynamics during rapid pest outbreaks or abrupt meteorological changes. Future research may investigate the integration of pest ecological knowledge graphs or expert-defined rules into model architectures to improve biological interpretability and cross-species generalization while preserving predictive flexibility.

5. Conclusions

This study addresses the challenge of dynamically characterizing pest diffusion and providing accurate early warnings in agricultural production. To overcome the limitations of conventional monitoring approaches, including restricted spatial coverage, insufficient temporal continuity, and the predominant focus of existing intelligent methods on static recognition, a pest diffusion prediction and simulation framework integrating UAV-based remote sensing, multi-source environmental information, and spatio-temporal graph neural networks is proposed. Based on real farmland scenarios and formulated from a parcel-level diffusion modeling perspective, the pest migration process is explicitly represented as a spatio-temporal evolution problem, thereby providing a technical pathway that is more consistent with practical production patterns for regional and coordinated pest control. In experimental evaluations conducted on two representative regional datasets from Bayan Nur and Tangshan, the proposed approach achieves the best performance among the compared models across all reported metrics; relative to ST-GCN, MAE and MSE are reduced by approximately 17–22% and the PMR is improved by 8–12%. By explicitly predicting pest diffusion pathways rather than isolated infestation levels, the proposed framework enables pathway-oriented intervention strategies, supporting precision pesticide application, reducing overall chemical input, and contributing to environmentally sustainable agricultural management. These results indicate that the proposed approach can provide reliable support for agricultural information systems and diffusion-aware decision generation.

Author Contributions

Conceptualization, C.D., Z.F., Y.H. and Y.Z.; Data curation, J.C. and S.L.; Formal analysis, Y.L.; Funding acquisition, Y.Z.; Investigation, Y.L.; Methodology, C.D., Z.F. and Y.H.; Project administration, Y.Z.; Resources, J.C. and S.L.; Software, C.D., Z.F. and Y.H.; Supervision, Y.Z.; Validation, Y.L.; Visualization, J.C. and S.L.; Writing—original draft, C.D., Z.F., Y.H., Y.L., J.C., S.L. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Mathematical Formulations for Data Preprocessing

Appendix A.1. Radiometric Calibration and Geometric Correction

Radiometric calibration converts digital numbers ($\mathrm{DN}$) into surface reflectance ($R$). The linear calibration model and the subsequent correction for solar zenith angle are defined as follows:

$$L = α \cdot \mathrm{DN} + β, \quad R = \frac{π L d^{2}}{E_{0} \cos θ}, \tag{A1}$$

where $L$ denotes radiance, $α$ and $β$ are sensor-specific coefficients, $d$ is the normalized earth–sun distance, $E_{0}$ is solar irradiance, and $θ$ is the solar zenith angle.
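As a worked illustration of Equation (A1), the sketch below converts a single digital number to reflectance. The function name `dn_to_reflectance`, the parameter names, and the choice to pass the zenith angle in degrees are assumptions made for the example; the coefficient values in the usage note are arbitrary and not taken from any sensor.

```python
import math


def dn_to_reflectance(
    dn: float,
    gain: float,        # alpha, sensor-specific
    offset: float,      # beta, sensor-specific
    d: float,           # normalized earth-sun distance
    e0: float,          # solar irradiance
    zenith_deg: float,  # solar zenith angle in degrees (assumed unit)
) -> float:
    """Apply L = alpha * DN + beta, then R = pi * L * d^2 / (E0 * cos(theta))."""
    radiance = gain * dn + offset
    cos_theta = math.cos(math.radians(zenith_deg))
    return math.pi * radiance * d ** 2 / (e0 * cos_theta)
```

With `gain=0.01`, `offset=0.0`, `d=1.0`, `e0=math.pi` and a zenith angle of 0°, a digital number of 100 maps to a reflectance of 1.0; at 60° the same radiance maps to roughly 2.0, because the cosine term halves the effective irradiance.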

Geometric correction maps image coordinates $(x, y)$ to geographic coordinates $(X, Y)$ via a transformation function $T(\cdot)$:

$$(X, Y) = T(x, y \mid \mathrm{GCP}, Θ), \tag{A2}$$

where $\mathrm{GCP}$ denotes the set of ground control points and $Θ$ represents UAV attitude parameters.

Appendix A.2. Cloud Removal and Label Generation

Cloud and shadow removal is implemented by masking an anomalous pixel set $Ω$:

$$I'(x, y) = I(x, y) \cdot (1 - M_{Ω}(x, y)), \tag{A3}$$

where $M_{Ω}$ is the binary mask. Pest annotations are converted into a label matrix $Y_{t}$:

$$Y_{t}(i, j) = \begin{cases} 1, & \text{if pixel } (i, j) \text{ is within the pest region}, \\ 0, & \text{otherwise}. \end{cases} \tag{A4}$$
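Equations (A3) and (A4) are element-wise operations, which the following sketch illustrates on nested lists. The helper names `apply_cloud_mask` and `rasterize_labels` are hypothetical, and representing the annotated pest region as a set of pixel coordinates is an assumption made to keep the example small.

```python
from typing import List, Set, Tuple

Grid = List[List[float]]


def apply_cloud_mask(image: Grid, mask: List[List[int]]) -> Grid:
    """I'(x, y) = I(x, y) * (1 - M(x, y)); mask is 1 on cloud or shadow pixels."""
    return [
        [pixel * (1 - m) for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def rasterize_labels(
    height: int, width: int, pest_pixels: Set[Tuple[int, int]]
) -> List[List[int]]:
    """Y_t(i, j) = 1 if pixel (i, j) lies in the annotated pest region, else 0."""
    return [
        [1 if (i, j) in pest_pixels else 0 for j in range(width)]
        for i in range(height)
    ]
```

Masked pixels are set to zero rather than removed, so the image keeps its shape and stays aligned with the label matrix.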

Appendix A.3. Data Augmentation and Feature Standardization

The CutMix augmentation strategy generates mixed samples $\tilde{X}$ and labels $\tilde{Y}$ as follows:

$$\tilde{Y} = λ Y_{A} + (1 - λ) Y_{B}, \quad \tilde{X} = M \odot X_{A} + (1 - M) \odot X_{B}, \tag{A5}$$

where $λ$ is the area ratio and $M$ is the binary mask.
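A minimal sketch of Equation (A5) follows. It assumes single-channel images stored as nested lists and scalar labels, and it takes the pasted rectangle as an explicit argument so that the result is deterministic; in practice the box is usually sampled at random. The function name `cutmix` and its signature are illustrative only.

```python
from typing import List, Tuple

Image = List[List[float]]


def cutmix(
    x_a: Image,
    x_b: Image,
    y_a: float,
    y_b: float,
    box: Tuple[int, int, int, int],  # (top, left, height, width) taken from B
) -> Tuple[Image, float, float]:
    """Return (mixed image, mixed label, lambda) following Equation (A5)."""
    top, left, height, width = box
    rows, cols = len(x_a), len(x_a[0])
    # M = 1 where sample A is kept, 0 inside the rectangle pasted from B.
    mask = [
        [0 if (top <= r < top + height and left <= c < left + width) else 1
         for c in range(cols)]
        for r in range(rows)
    ]
    mixed = [
        [mask[r][c] * x_a[r][c] + (1 - mask[r][c]) * x_b[r][c]
         for c in range(cols)]
        for r in range(rows)
    ]
    lam = sum(map(sum, mask)) / (rows * cols)  # area ratio kept from A
    label = lam * y_a + (1 - lam) * y_b
    return mixed, label, lam
```

Computing $λ$ from the mask itself keeps the label weights consistent with the pixels that were actually mixed, including when the box is clipped at the image border.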

For environmental variables, Z-score standardization and linear interpolation for missing values are applied:

$$x' = \frac{x - μ}{σ}, \quad x_{t} = x_{t_{1}} + \frac{t - t_{1}}{t_{2} - t_{1}} (x_{t_{2}} - x_{t_{1}}), \tag{A6}$$

where $μ$ and $σ$ denote the mean and standard deviation, and $t_{1}, t_{2}$ represent the nearest valid time points.
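The two operations in Equation (A6) can be sketched as follows. The function names are hypothetical; using the population standard deviation and leaving gaps at the start or end of a series unfilled (where only one valid neighbor exists) are assumptions, as the text does not specify either choice.

```python
import math
from typing import List, Optional, Sequence


def z_score(values: Sequence[float]) -> List[float]:
    """x' = (x - mu) / sigma, with the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]


def interpolate_missing(
    times: Sequence[float], values: Sequence[Optional[float]]
) -> List[Optional[float]]:
    """Fill None entries linearly between the nearest valid time points."""
    filled = list(values)
    valid = [k for k, v in enumerate(values) if v is not None]
    for k, v in enumerate(values):
        if v is not None:
            continue
        left = [i for i in valid if i < k]
        right = [i for i in valid if i > k]
        if not left or not right:
            continue  # edge gap: the formula needs both t1 and t2
        i1, i2 = left[-1], right[0]
        t1, t2 = times[i1], times[i2]
        filled[k] = values[i1] + (times[k] - t1) / (t2 - t1) * (values[i2] - values[i1])
    return filled
```

Interpolation uses the actual time stamps rather than the sample index, so unevenly spaced observations are weighted by their temporal distance to the neighboring valid points.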

Appendix B. Mathematical Formulations for STAGE

For the spatial graph $G_{t} = (V, E)$ at time step $t$, the node representation is updated via the multi-head attention mechanism. The first layer update for node $i$ is defined as

$$z_{i}^{(t, 1)} = \Big\Vert_{h = 1}^{H} σ \Big( \sum_{j \in N(i)} ω_{i j}^{(t, h)} W^{(1, h)} x_{j}^{(t)} \Big), \tag{A7}$$

where $H = 4$ denotes the number of heads, $ω_{i j}^{(t, h)}$ represents the normalized attention weight, and $\Vert$ denotes the concatenation operation.
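To make the structure of Equation (A7) concrete, the sketch below computes the update for one node with the attention weights supplied as inputs. It is a toy illustration, not the STAGE implementation: the name `attention_update` is hypothetical, $σ$ is assumed to be ReLU because the text does not name the activation, and how the weights $ω$ are computed is outside the scope of the example.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in w]


def attention_update(
    i: int,
    x: Dict[int, Vector],                        # node features x_j at time t
    neighbors: Dict[int, List[int]],             # N(i)
    omega: List[Dict[Tuple[int, int], float]],   # omega[h][(i, j)], normalized
    weights: List[Matrix],                       # W^{(1, h)} for each head h
) -> Vector:
    """Concatenate, over heads, the activated attention-weighted neighbor sum."""
    out: Vector = []
    for h, w in enumerate(weights):
        agg = [0.0] * len(w)
        for j in neighbors[i]:
            message = matvec(w, x[j])
            agg = [a + omega[h][(i, j)] * m for a, m in zip(agg, message)]
        out.extend(max(0.0, a) for a in agg)     # sigma assumed to be ReLU
    return out
```

The output dimension is the number of heads times the per-head output width, which is why concatenation rather than averaging is used in the first layer.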

Following the spatial update, the temporal evolution is modeled using causal convolution. The update rule for the first temporal layer is expressed as

$$u_{i}^{(t, 1)} = ρ \Big( \sum_{τ = 0}^{K - 1} W_{τ}^{(1)} z_{i}^{(t - τ, 2)} + b^{(1)} \Big), \tag{A8}$$

where $K = 3$ is the kernel size, $W_{τ}^{(1)}$ is the convolution kernel at lag $τ$, and $ρ(\cdot)$ denotes the nonlinear activation function. The subsequent fusion with environmental embeddings follows the linear modulation:

$$F_{i}^{t} = H_{i}^{t} + W_{e} z_{env}^{t}, \tag{A9}$$

where $H_{i}^{t}$ is the backbone representation and $z_{env}^{t}$ is the environmental embedding.
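Equations (A8) and (A9) can be illustrated with a scalar-channel sketch. Several simplifications are assumed: each node carries a single feature per time step so that the kernels $W_{τ}$ reduce to scalars, $ρ$ is taken to be ReLU, and time steps before the start of the sequence are treated as zeros. The function names are hypothetical.

```python
from typing import List, Sequence


def causal_conv(
    z_seq: Sequence[float], kernels: Sequence[float], bias: float
) -> List[float]:
    """u_t = rho(sum_{tau=0}^{K-1} W_tau * z_{t-tau} + b), using only past inputs."""
    out = []
    for t in range(len(z_seq)):
        total = bias
        for tau, w in enumerate(kernels):
            if t - tau >= 0:           # implicit zero padding before t = 0
                total += w * z_seq[t - tau]
        out.append(max(0.0, total))    # rho assumed to be ReLU
    return out


def fuse_environment(
    h: Sequence[float], z_env: Sequence[float], w_e: Sequence[Sequence[float]]
) -> List[float]:
    """F = H + W_e z_env: project the environmental embedding and add it."""
    projected = [sum(w * z for w, z in zip(row, z_env)) for row in w_e]
    return [a + b for a, b in zip(h, projected)]
```

Because the sum in `causal_conv` only reaches back in time, changing a later input leaves all earlier outputs unchanged, which is the property that keeps the temporal branch free of look-ahead.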

Appendix C. Mathematical Formulations for EEF

Appendix C.1. Feature Encoders

Meteorological and crop features are encoded via two-layer fully connected networks (widths 64 and 128) with ReLU activation. Topographic features are mapped via a single layer followed by linear projection to match the 128-dimensional channel space.

Appendix C.2. Environmental Attention

The environmental embeddings are aggregated using a dynamic attention mechanism. The attention weights $β_{k}^{t}$ and the fused embedding $Z_{env}^{t}$ are computed as

$$β_{k}^{t} = \frac{\exp (q^{t} \cdot Z_{k}^{t})}{\sum_{l \in \{m, g, c\}} \exp (q^{t} \cdot Z_{l}^{t})}, \quad Z_{env}^{t} = \sum_{k} β_{k}^{t} Z_{k}^{t}, \tag{A10}$$

where $q^{t}$ is a query vector derived from a linear projection of the node representation $H^{t}$.
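Equation (A10) is a softmax over the three environmental embeddings followed by a weighted sum, as the sketch below illustrates. The function name is hypothetical, and reading the index set $\{m, g, c\}$ as the meteorological, topographic, and crop encoders of Appendix C.1 is an assumption. Subtracting the maximum score before exponentiation is a standard numerical-stability step that does not change the resulting weights.

```python
import math
from typing import Dict, List, Sequence, Tuple


def environmental_attention(
    query: Sequence[float], embeddings: Dict[str, Sequence[float]]
) -> Tuple[Dict[str, float], List[float]]:
    """Return (beta, Z_env): softmax weights over modalities and their fusion."""
    scores = {
        k: sum(q * z for q, z in zip(query, v)) for k, v in embeddings.items()
    }
    peak = max(scores.values())
    exps = {k: math.exp(s - peak) for k, s in scores.items()}
    total = sum(exps.values())
    beta = {k: e / total for k, e in exps.items()}
    fused = [
        sum(beta[k] * embeddings[k][d] for k in embeddings)
        for d in range(len(query))
    ]
    return beta, fused
```

When the query aligns more strongly with one modality, that modality's embedding dominates the fused vector; with identical embeddings the weights are uniform and the fusion returns the shared embedding.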

References

Kumar, S.; Kumar, A.; Jleli, M. A numerical analysis for fractional model of the spread of pests in tea plants. Numer. Methods Partial Differ. Equ. 2022, 38, 540–565.
Adler, C.; Athanassiou, C.; Carvalho, M.O.; Emekci, M.; Gvozdenac, S.; Hamel, D.; Riudavets, J.; Stejskal, V.; Trdan, S.; Trematerra, P. Changes in the distribution and pest risk of stored product insects in Europe due to global warming: Need for pan-European pest monitoring and improved food-safety. J. Stored Prod. Res. 2022, 97, 101977.
Wu, Q.; Zeng, J.; Wu, K. Research and Application of Crop Pest Monitoring and Early Warning Technology in China. Front. Agric. Sci. Eng. 2022, 9, 19.
Ibrahim, E.A.; Salifu, D.; Mwalili, S.; Dubois, T.; Collins, R.; Tonnang, H.E. An expert system for insect pest population dynamics prediction. Comput. Electron. Agric. 2022, 198, 107124.
Zhang, L.; Zhang, Y.; Ma, X. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs). In Proceedings of the ICMLCA 2021; 2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; pp. 1–8.
Koralewski, T.E.; Wang, H.H.; Grant, W.E.; Brewer, M.J.; Elliott, N.C.; Westbrook, J.K. Modeling the dispersal of wind-borne pests: Sensitivity of infestation forecasts to uncertainty in parameterization of long-distance airborne dispersal. Agric. For. Meteorol. 2021, 301, 108357.
Wang, M.; Li, T. Pest and disease prediction and management for sugarcane using a hybrid autoregressive integrated moving average—A long short-term memory model. Agriculture 2025, 15, 500.
Alkan, E.; Aydin, A. Machine Learning-Based Prediction of Insect Damage Spread Using Auto-ARIMA Model. Croat. J. For. Eng. J. Theory Appl. For. Eng. 2024, 45, 351–364.
Abbas, A.; Zhang, Z.; Zheng, H.; Alami, M.M.; Alrefaei, A.F.; Abbas, Q.; Naqvi, S.A.H.; Rao, M.J.; Mosa, W.F.; Abbas, Q.; et al. Drones in plant disease assessment, efficient monitoring, and detection: A way forward to smart agriculture. Agronomy 2023, 13, 1524.
Skendžić, S.; Novak, H.; Zovko, M.; Pajač Živković, I.; Lešić, V.; Maričević, M.; Lemić, D. Hyperspectral Sensing and Machine Learning for Early Detection of Cereal Leaf Beetle Damage in Wheat: Insights for Precision Pest Management. Agriculture 2025, 15, 2482.
Aziz, D.; Rafiq, S.; Saini, P.; Ahad, I.; Gonal, B.; Rehman, S.A.; Rashid, S.; Saini, P.; Rohela, G.K.; Aalum, K.; et al. Remote sensing and artificial intelligence: Revolutionizing pest management in agriculture. Front. Sustain. Food Syst. 2025, 9, 1551460.
Zhao, X.; Zhang, J.; Huang, Y.; Tian, Y.; Yuan, L. Detection and discrimination of disease and insect stress of tea plants using hyperspectral imaging combined with wavelet analysis. Comput. Electron. Agric. 2022, 193, 106717.
Mittal, M.; Gupta, V.; Aamash, M.; Upadhyay, T. Machine learning for pest detection and infestation prediction: A comprehensive review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2024, 14, e1551.
Lin, X.; Wa, S.; Zhang, Y.; Ma, Q. A dilated segmentation network with the morphological correction method in farming area image Series. Remote Sens. 2022, 14, 1771.
Zhang, Y.; Wa, S.; Liu, Y.; Zhou, X.; Sun, P.; Ma, Q. High-accuracy detection of maize leaf diseases CNN based on multi-pathway activation function module. Remote Sens. 2021, 13, 4218.
Peng, M.; Liu, Y.; Khan, A.; Ahmed, B.; Sarker, S.K.; Ghadi, Y.Y.; Bhatti, U.A.; Al-Razgan, M.; Ali, Y.A. Crop monitoring using remote sensing land use and land change data: Comparative analysis of deep learning methods using pre-trained CNN models. Big Data Res. 2024, 36, 100448.
Mohyuddin, G.; Khan, M.A.; Haseeb, A.; Mahpara, S.; Waseem, M.; Saleh, A.M. Evaluation of machine learning approaches for precision farming in smart agriculture system: A comprehensive review. IEEE Access 2024, 12, 60155–60184.
Ojo, M.O.; Zahid, A. Deep learning in controlled environment agriculture: A review of recent advancements, challenges and prospects. Sensors 2022, 22, 7965.
Li, Y.; Wu, Y.; Wang, W.; Jin, H.; Wu, X.; Liu, J.; Hu, C.; Lv, C. Integrating Stride Attention and Cross-Modality Fusion for UAV-Based Detection of Drought, Pest, and Disease Stress in Croplands. Agronomy 2025, 15, 1199.
Lu, F.; Zhang, B.; Hou, Y.; Xiong, X.; Dong, C.; Lu, W.; Li, L.; Lv, C. A Spatiotemporal Attention-Guided Graph Neural Network for Precise Hyperspectral Estimation of Corn Nitrogen Content. Agronomy 2025, 15, 1041.
Ye, W.; Lao, J.; Liu, Y.; Chang, C.C.; Zhang, Z.; Li, H.; Zhou, H. Pine pest detection using remote sensing satellite images combined with a multi-scale attention-UNet model. Ecol. Inform. 2022, 72, 101906.
Zhang, H.; Wang, L.; Tian, T.; Yin, J. A review of unmanned aerial vehicle low-altitude remote sensing (UAV-LARS) use in agricultural monitoring in China. Remote Sens. 2021, 13, 1221.
Zhao, G.; Zhang, Y.; Lan, Y.; Deng, J.; Zhang, Q.; Zhang, Z.; Li, Z.; Liu, L.; Huang, X.; Ma, J. Application progress of UAV-LARS in identification of crop diseases and pests. Agronomy 2023, 13, 2232.
Bai, T.; Wang, L.; Yin, D.; Sun, K.; Chen, Y.; Li, W.; Li, D. Deep learning for change detection in remote sensing: A review. Geo-Spat. Inf. Sci. 2023, 26, 262–288.
Yan, T.; Xu, W.; Lin, J.; Duan, L.; Gao, P.; Zhang, C.; Lv, X. Combining multi-dimensional convolutional neural network (CNN) with visualization method for detection of aphis gossypii glover infection in cotton leaves using hyperspectral imaging. Front. Plant Sci. 2021, 12, 604510.
Aschauer, N.; Parnell, S. Analysis of mathematical modelling approaches to capture human behaviour dynamics in agricultural pest and disease systems. Agric. Syst. 2025, 226, 104303.
Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39.
Milgroom, M.G. Epidemiology and sir models. In Biology of Infectious Disease: From Molecules to Ecosystems; Springer: Berlin/Heidelberg, Germany, 2023; pp. 253–268.
Montes de Oca Munguia, O.; Pannell, D.J.; Llewellyn, R. Understanding the adoption of innovations in agriculture: A review of selected conceptual models. Agronomy 2021, 11, 139.
Karimzadeh, R.; Sciarretta, A. Spatial patchiness and association of pests and natural enemies in agro-ecosystems and their application in precision pest management: A review. Precis. Agric. 2022, 23, 1836–1855.
Li, Y.; Yu, D.; Liu, Z.; Zhang, M.; Gong, X.; Zhao, L. Graph neural network for spatiotemporal data: Methods and applications. arXiv 2023, arXiv:2306.00012.
Lira, H.; Martí, L.; Sanchez-Pi, N. A graph neural network with spatio-temporal attention for multi-sources time series data: An application to frost forecast. Sensors 2022, 22, 1486.
Lin, S.; Xiu, Y.; Kong, J.; Yang, C.; Zhao, C. An effective pyramid neural network based on graph-related attentions structure for fine-grained disease and pest identification in intelligent agriculture. Agriculture 2023, 13, 567.
Zhou, X.; Chen, S.; Ren, Y.; Zhang, Y.; Fu, J.; Fan, D.; Lin, J.; Wang, Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics 2022, 11, 911.
Ma, M.; Xie, P.; Teng, F.; Wang, B.; Ji, S.; Zhang, J.; Li, T. HiSTGNN: Hierarchical spatio-temporal graph neural network for weather forecasting. Inf. Sci. 2023, 648, 119580.
Pan, Y.A.; Li, F.; Li, A.; Niu, Z.; Liu, Z. Urban intersection traffic flow prediction: A physics-guided stepwise framework utilizing spatio-temporal graph neural network algorithms. Multimodal Transp. 2025, 4, 100207. [Google Scholar] [CrossRef]
Han, H.; Liu, Z.; Li, J.; Zeng, Z. Challenges in remote sensing based climate and crop monitoring: Navigating the complexities using AI. J. Cloud Comput. 2024, 13, 34. [Google Scholar] [CrossRef]
Liu, T.; Yu, L.; Liu, X.; Peng, D.; Chen, X.; Du, Z.; Tu, Y.; Wu, H.; Zhao, Q. A Global Review of Monitoring Cropland Abandonment Using Remote Sensing: Temporal–Spatial Patterns, Causes, Ecological Effects, and Future Prospects. J. Remote Sens. 2025, 5, 0584. [Google Scholar] [CrossRef]
Yin, J.; Li, W.; Shen, J.; Zhou, C.; Li, S.; Suo, J.; Yang, J.; Jia, R.; Lv, C. A Diffusion-Based Detection Model for Accurate Soybean Disease Identification in Smart Agricultural Environments. Plants 2025, 14, 675. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Lou, Y.; Zhang, C.; Zheng, Y.; Xie, X.; Wang, W.; Huang, Y. Map-matching for low-sampling-rate GPS trajectories. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 4–6 November 2009; pp. 352–361. [Google Scholar]

Figure 1. Image dataset samples.

Figure 2. Schematic illustration of the spatial grid partitioning strategy.

Figure 3. Schematic illustration of the spatio-temporal attention graph encoder (STAGE).

Figure 4. Schematic illustration of the environmental embedding fusion (EEF) module.

Figure 5. Illustration of the diffusion path simulation and explainability (DPSE) module.

Figure 6. Performance comparison on the Bayan Nur dataset.

Figure 7. Performance comparison on the Tangshan dataset.

Figure 8. Visualization of representative pest samples and their corresponding diffusion prediction results.

Table 1. Overview of the UAV-based pest diffusion dataset.

Data Type	Source	Quantity/Resolution
Study regions	Bayan Nur (Inner Mongolia), Tangshan (Hebei)	2 regions
Crop types	Maize, Wheat	2 crop categories
Target pests	Mythimna separata, Spodoptera frugiperda, Ostrinia furnacalis, Sitobion avenae, Rhopalosiphum padi, Mythimna loreyi	6 species
UAV RGB images	Multirotor UAV platform	10 cm/pixel
UAV multispectral images	Multispectral camera	Blue, Green, Red, NIR bands
Acquisition frequency	Periodic UAV flights	Every 5 days
Meteorological variables	Automatic weather stations	Temperature, humidity, wind, rainfall
Terrain data	Digital elevation model (DEM)	Spatially aligned
Vegetation indices	NDVI, EVI	Multispectral-derived
Graph nodes	Parcel-level grid units	∼3000 nodes
Graph edges	Spatial and wind-weighted connections	∼12,000 edges
Temporal snapshots	Time-aligned sequences	8 time steps
Temporal information	Textual knowledge records	Time stamps aligned with UAV flights
Spatial information	Textual spatial indexing	Parcel IDs, adjacency relations
Textual knowledge data	Field records and annotations	Spatio-temporal metadata

Table 2. Performance comparison on the Bayan Nur dataset. Bold represents the best result.

Method	MAE ↓	MSE ↓	R ↑	F1 ↑	PMR ↑
LSTM	0.192	0.061	0.681	0.712	0.642
ConvLSTM	0.176	0.054	0.705	0.734	0.667
ViT-Spatio	0.168	0.050	0.721	0.748	0.683
ST-GCN	0.160	0.046	0.739	0.761	0.694
UAV-GNN-Pest (Ours)	0.130	0.036	0.814	0.821	0.779

Table 3. Performance comparison on the Tangshan dataset. Bold represents the best result.

Method	MAE ↓	MSE ↓	R ↑	F1 ↑	PMR ↑
LSTM	0.185	0.058	0.694	0.723	0.651
ConvLSTM	0.171	0.051	0.719	0.741	0.676
ViT-Spatio	0.162	0.047	0.736	0.756	0.691
ST-GCN	0.155	0.043	0.753	0.769	0.704
UAV-GNN-Pest (Ours)	0.128	0.034	0.827	0.834	0.790
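
The relative gains over ST-GCN quoted in the abstract can be checked directly from Tables 2 and 3. The sketch below is only an illustrative recomputation, not part of the authors' pipeline; the names `RESULTS`, `relative_gain`, and `gains_vs_baseline` are hypothetical, and the numbers are copied from the two tables. Under the assumption that "improvement" means relative change with respect to the ST-GCN value, it gives MAE reductions of roughly 17 to 19%, MSE reductions of roughly 21 to 22%, and PMR gains of about 12% (about 8.5 points in absolute terms).

```python
# Values copied from Tables 2 and 3 (ST-GCN vs. UAV-GNN-Pest).
RESULTS = {
    "Bayan Nur": {
        "ST-GCN": {"MAE": 0.160, "MSE": 0.046, "PMR": 0.694},
        "UAV-GNN-Pest": {"MAE": 0.130, "MSE": 0.036, "PMR": 0.779},
    },
    "Tangshan": {
        "ST-GCN": {"MAE": 0.155, "MSE": 0.043, "PMR": 0.704},
        "UAV-GNN-Pest": {"MAE": 0.128, "MSE": 0.034, "PMR": 0.790},
    },
}

# Error metrics improve when they decrease; PMR improves when it increases.
LOWER_IS_BETTER = {"MAE", "MSE"}


def relative_gain(baseline, proposed, metric):
    """Percentage improvement of `proposed` over `baseline` for one metric."""
    if metric in LOWER_IS_BETTER:
        return 100.0 * (baseline - proposed) / baseline
    return 100.0 * (proposed - baseline) / baseline


def gains_vs_baseline(results, baseline="ST-GCN", proposed="UAV-GNN-Pest"):
    """Relative gain per dataset and metric (assumed definition, see lead-in)."""
    gains = {}
    for dataset, methods in results.items():
        gains[dataset] = {
            metric: relative_gain(methods[baseline][metric],
                                  methods[proposed][metric], metric)
            for metric in methods[baseline]
        }
    return gains


if __name__ == "__main__":
    for dataset, gains in gains_vs_baseline(RESULTS).items():
        line = ", ".join(f"{m}: {g:.1f}%" for m, g in gains.items())
        print(f"{dataset}: {line}")
```

The same helper can be pointed at any other baseline row (for example LSTM or ConvLSTM) by adding its values to `RESULTS` and changing the `baseline` argument.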

Table 4. Ablation study of different modules on two datasets. Bold represents the best result.

Model Variant	Bayan Nur MAE ↓	Bayan Nur MSE ↓	Bayan Nur R ↑	Bayan Nur F1 ↑	Bayan Nur PMR ↑	Tangshan MAE ↓	Tangshan MSE ↓	Tangshan R ↑	Tangshan F1 ↑	Tangshan PMR ↑
Full Model (STAGE + EEF + DPSE)	0.130	0.036	0.814	0.821	0.779	0.128	0.034	0.827	0.834	0.790
w/o STAGE	0.167	0.049	0.724	0.746	0.683	0.162	0.047	0.739	0.756	0.694
w/o EEF	0.149	0.042	0.763	0.782	0.721	0.145	0.041	0.771	0.790	0.736
w/o DPSE	0.134	0.037	0.801	0.819	0.692	0.131	0.035	0.815	0.829	0.701
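
To read Table 4 at a glance, it can help to express each ablation as a percentage degradation relative to the full model. The following is a minimal sketch under that assumption; the names `ABLATION`, `degradation`, and `summarize` are hypothetical, and the values are copied from Table 4. With this definition, removing DPSE leaves the error metrics almost unchanged but produces its largest degradation in PMR on both datasets, whereas removing STAGE degrades every metric substantially.

```python
METRICS = ("MAE", "MSE", "R", "F1", "PMR")
LOWER_IS_BETTER = {"MAE", "MSE"}

# Values copied from Table 4, in the order given by METRICS.
ABLATION = {
    "Bayan Nur": {
        "Full": (0.130, 0.036, 0.814, 0.821, 0.779),
        "w/o STAGE": (0.167, 0.049, 0.724, 0.746, 0.683),
        "w/o EEF": (0.149, 0.042, 0.763, 0.782, 0.721),
        "w/o DPSE": (0.134, 0.037, 0.801, 0.819, 0.692),
    },
    "Tangshan": {
        "Full": (0.128, 0.034, 0.827, 0.834, 0.790),
        "w/o STAGE": (0.162, 0.047, 0.739, 0.756, 0.694),
        "w/o EEF": (0.145, 0.041, 0.771, 0.790, 0.736),
        "w/o DPSE": (0.131, 0.035, 0.815, 0.829, 0.701),
    },
}


def degradation(full, variant):
    """Percent by which each metric worsens relative to the full model."""
    out = {}
    for name, f, v in zip(METRICS, full, variant):
        # For error metrics an increase is worse; for the others a decrease is.
        change = (v - f) if name in LOWER_IS_BETTER else (f - v)
        out[name] = 100.0 * change / f
    return out


def summarize(table):
    """Degradation per dataset and ablated variant (full model excluded)."""
    summary = {}
    for dataset, rows in table.items():
        full = rows["Full"]
        summary[dataset] = {
            variant: degradation(full, values)
            for variant, values in rows.items()
            if variant != "Full"
        }
    return summary


if __name__ == "__main__":
    for dataset, variants in summarize(ABLATION).items():
        for variant, deg in variants.items():
            worst = max(deg, key=deg.get)
            print(f"{dataset:10s} {variant:10s} "
                  f"largest degradation: {worst} ({deg[worst]:.1f}%)")
```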
