DPP: A Novel Disease Progression Prediction Method for Ginkgo Leaf Disease Based on Image Sequences

Ginkgo leaf disease poses a grave threat to Ginkgo biloba. The current management of Ginkgo leaf disease lacks precision guidance and intelligent technologies. To provide precision guidance for disease management and to evaluate the effectiveness of the implemented measures, the present study proposes a novel disease progression prediction (DPP) method for Ginkgo leaf blight with a multi-level feature translation architecture and an enhanced spatiotemporal attention module (eSTA). The proposed DPP method is capable of capturing key spatiotemporal dependencies of disease symptoms at various feature levels. Experiments demonstrated that the DPP method achieves state-of-the-art performance in disease progression prediction. Compared to the top-performing spatiotemporal predictive learning method (SimVP + TAU), our method reduced the mean absolute error (MAE) by 19.95% and the mean square error (MSE) by 25.35%. Moreover, it achieved a higher structural similarity index measure (SSIM) of 0.970 and a superior peak signal-to-noise ratio (PSNR) of 37.746 dB. The proposed method can accurately forecast the progression of Ginkgo leaf blight to a large extent, which is expected to provide valuable insights for precision and intelligent disease management. Additionally, this study presents a novel perspective for the extensive research on plant disease prediction.


Introduction
Ginkgo biloba, an ancient and invaluable tree species, not only possesses significant ecological and cultural value but also finds extensive application in medicine and landscape beautification [1][2][3][4]. However, the frequent outbreaks of Ginkgo leaf blight in various regions are a non-negligible issue. Ginkgo leaf blight typically manifests as a yellowing of the leaf margins at an early stage. As the disease progresses, the yellowing symptoms gradually spread inward, leading to an increase in the diseased area on the leaf. In severe cases, this can result in leaf wilting and abscission. Consequently, Ginkgo leaf blight seriously threatens the growth and health of Ginkgo biloba and greatly impacts its value realization, thus causing substantial economic losses to related industries [5][6][7]. Currently, the prevention and control of Ginkgo leaf blight primarily rely on regular field observation to monitor the disease progression of Ginkgo leaves and determine whether to use specific fungicides such as carbendazim and tebuconazole [2,6]. This traditional method of disease management requires the allocation of extensive human resources for regular inspection and evaluation, while the timeliness and effectiveness of disease management are often difficult to guarantee. Moreover, the excessive or improper use of fungicides can adversely affect the environment and potentially induce disease resistance, thereby increasing the challenges and costs of disease management [8,9].
Plant disease prediction mainly focuses on forecasting the occurrence and progression trends of disease, which is of paramount importance for scientific disease management. Weather conditions and environmental factors, such as temperature, humidity, and precipitation, are widely recognized as playing a crucial role in the occurrence and progression of plant disease [10]. Therefore, previous studies treated them as inputs to determine the complex relationship between meteorological factors and disease occurrence based on disease dynamics [11] and machine learning [12][13][14]. These approaches demonstrated high accuracy in forecasting the risk level of disease occurrence and the potential severity of disease outbreaks.
Remote sensing data have been extensively used in agriculture and forestry due to their wide monitoring range, high spatiotemporal resolution, and capabilities of real-time monitoring and multi-source information fusion [15][16][17][18]. Consequently, extensive studies have combined remote sensing data with meteorological factors to forecast the occurrence and severity of plant diseases at a regional scale [19][20][21][22].
Moreover, researchers have begun to focus on the prediction of plant disease progression. Förster et al. [23] employed a cycle-consistent generative adversarial network (CycleGAN) [24] to learn the daily changes in leaf disease from hyperspectral images, aiming to forecast the spread of disease symptoms on barley plants at an early stage. This method used the trained forward generator autoregressively: the image predicted at one timestep was fed back as input to generate the predicted image at the next timestep, yielding predictions over multiple timesteps. It exhibited promising visual predictive capabilities and offers valuable insights for the disease progression prediction of plant leaves.
Current studies on plant disease prediction mostly view it as a classification issue, while disease progression prediction is relatively unexplored [10]. It follows intuitively that disease progression can be visually demonstrated through consecutive images, which show the changes in shape, size, and color of the diseased areas over time. Therefore, unlike previous research [23], this study aims to investigate disease progression prediction based on image sequences. By analyzing and capturing dynamic variations in disease symptoms in image sequences, future disease symptoms can be forecasted. This contributes to a deeper understanding of disease progression and provides valuable guidance for scientific and rational disease management. Furthermore, this study can be used to evaluate the effectiveness of disease control and treatment measures by comparing the predicted disease symptoms with the real-time symptoms observed after implementing corresponding measures.
Spatiotemporal predictive learning investigates a spatiotemporal sequence forecasting problem, wherein both the input and predicted output are spatiotemporal sequences [25]. The goal is to learn the spatial correlations and temporal dependencies of spatiotemporal sequences to achieve future predictions. With the rapid advancement of artificial intelligence technology, spatiotemporal predictive learning has gradually become a research focus, demonstrating significant application potential in various domains, including video analysis and action recognition [26,27], traffic flow prediction [28,29], weather forecasting [30,31], healthcare, and disease prediction [32].
In essence, forecasting the disease progression based on image sequences is a spatiotemporal sequence forecasting problem where the input is a sequence of past images and the output is a sequence of a fixed number of future images. However, the spatiotemporal complexity of symptom variations in Ginkgo leaf blight presents multifaceted challenges for disease progression prediction using existing spatiotemporal predictive learning methods such as ConvLSTM [25], PredRNN [33], and SimVP [34][35][36]. Spatially, the disease symptoms exhibit heterogeneous patterns with variations in shape and size, adding to the diversity and complexity of the disease manifestations. Temporally, the development of disease symptoms shows gradual or sudden variations influenced by various environmental factors and internal disease process dynamics.
In response to these challenges, this study proposes a novel disease progression prediction method (DPP) for forecasting future disease symptoms of Ginkgo leaves. This method serves as an extension and improvement of SimVP. DPP introduces a multi-level feature translation architecture to capture the spatiotemporal dependencies of disease symptoms at different feature levels, thereby establishing a comprehensive cognitive framework for understanding the spatiotemporal variations in disease symptoms on Ginkgo leaves. Moreover, an enhanced spatiotemporal attention module (eSTA) is proposed to adaptively capture long-range spatial correlations and temporal dependencies of disease symptoms on Ginkgo leaves, thereby learning the key spatiotemporal dependencies of these symptoms. The primary contributions of this study can be summarized as follows:
• This study investigates plant disease progression prediction based on image sequences and constructs an image sequence dataset of Ginkgo leaf blight.
• To effectively forecast the progression of Ginkgo leaf blight, we propose a novel disease progression prediction method (DPP) with a multi-level feature translation architecture and an enhanced spatiotemporal attention module (eSTA). It possesses the capability to capture multi-level and robust spatiotemporal dependencies of disease symptoms on Ginkgo leaves.
• Experimental results validate that the proposed DPP method can accurately forecast the disease progression of Ginkgo leaf blight to a great extent, outperforming existing spatiotemporal predictive learning methods. This method has the potential to guide scientific management of Ginkgo leaf disease and offers a novel perspective for plant disease prediction research.

Materials and Methods
This section begins with a detailed description of how the image sequence dataset was constructed. We then review the background on spatiotemporal predictive learning and provide a detailed exposition of the proposed method (DPP). The final two parts cover the experimental setup and the quantitative metrics used to evaluate the method's performance.

Image Sequence Dataset of Ginkgo Leaf Blight
This study presents a dataset comprising 2261 image sequences of diseased Ginkgo leaves. Each sequence contains six images, reflecting the disease progression of individual Ginkgo leaves over six consecutive days. These image sequences were acquired through image capture and image sequence preprocessing.
Image capture was conducted at the Ginkgo plantation (34°9′19″ N, 113°48′57″ E) located in Zhangdi Community, Xuchang City, Henan Province, China. We selected 80 Ginkgo trees of varying ages, from each of which three to five diseased leaves were chosen. Ultimately, a total of 300 Ginkgo leaves exhibiting various degrees of disease were selected as subjects of photography. The capture process, which used a cellphone (iPhone 12 Pro Max, designed by Apple in Cupertino, CA, USA and assembled in China), is shown in Figure 1a. To carefully track leaf disease progression, a high-frequency capture strategy was employed, with each leaf photographed daily over a period of 20 days. During the entire image capturing process, the leaves were maintained under minimal stress to ensure that their natural growth state was not disturbed by human factors. Nevertheless, some of the selected diseased leaves dropped due to aging, pathological mechanisms, and uncontrollable factors such as wind and insect activity. Consequently, the number of captured images for some leaves was less than 20, resulting in a final set of 300 original image sequences with varying lengths.
During sequence preprocessing, a sliding window sampling approach was first applied to the original image sequences to generate image subsequences of a specified length. To balance the number of subsequences obtained against the subsequence length, the window size was set to six with a sliding step of one. Each time the window slid, the six consecutive images within the window were selected to form a new subsequence. After filtering out subsequences with insignificant changes in diseased areas, 2261 image subsequences were obtained. Subsequently, each image within these subsequences was cropped to highlight the Ginkgo leaf and eliminate irrelevant environmental background information.
Disease progression prediction using spatiotemporal predictive learning requires consistent Ginkgo leaf positions within each image sequence. For this reason, we utilized professional image processing software (Adobe Photoshop CC 2019) in the following subsequence preprocessing stage. The first image in each subsequence served as the base image, with the remaining five images set as reference images. As shown in Figure 1b, the leaf disease symptoms observed in the reference image were meticulously drawn and transferred onto corresponding locations in the further-clipped base image using the clone stamp tool and healing brush tool from Adobe Photoshop to generate the mixed image. Finally, the mixed image was resized to a resolution of 100 × 100 and saved as a new image. This processing procedure underwent multiple careful checks and adjustments to ensure that the newly depicted areas in the base image closely matched the corresponding parts of the reference images, thereby faithfully reconstructing the evolution of each diseased Ginkgo leaf while maintaining consistency in leaf position. By applying the same processing steps to each reference image, we obtained the corresponding new images. Eventually, the resized base image and these newly generated images formed a new image sequence, which constituted the image sequence dataset for this study.
Two examples from the image sequence dataset are shown in Figure 2. Additionally, the dataset was divided into training, validation, and testing sets, with the quantity distribution detailed in Table 1.
Figure 2. Two examples from our image sequence dataset. We added pink boxes to all images at the same position within each sequence. These boxes serve as a reference for visualizing changes in disease symptoms on Ginkgo leaves. In the first sequence, the disease symptoms gradually extend beyond the upper and right edges of the boxes, progressing inward over time. Similarly, in the second sequence, the healthy areas of the leaf within the boxes gradually become eroded by the disease symptoms.
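The sliding window sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function name `sliding_windows` and the string placeholders standing in for images are assumptions, and the later filtering of subsequences with insignificant changes is not modeled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Sequence[T], window: int = 6, step: int = 1) -> List[List[T]]:
    """Return every run of `window` consecutive frames, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(frames) - window  # negative when the sequence is too short
    return [list(frames[i:i + window]) for i in range(0, last_start + 1, step)]


# A leaf photographed on all 20 days yields 15 candidate subsequences,
# while a leaf that dropped after 5 days yields none.
full = sliding_windows([f"day{d:02d}" for d in range(1, 21)])
short = sliding_windows([f"day{d:02d}" for d in range(1, 6)])
print(len(full), len(short))
```

With a window of six and a step of one, an original sequence of n daily images contributes n − 5 candidate subsequences when n is at least six, which is why longer surviving sequences dominate the dataset before filtering.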

Spatiotemporal Predictive Learning
Imagine a dynamic system in a spatial region that is continually under our observation. This spatial region is divided into an $M \times N$ grid, where each cell possesses a state. Each cell state can be represented by $Q$ measurements at any given moment, so the dynamic system can be represented by a tensor, $X \in \mathbb{R}^{Q \times M \times N}$, at any time. The observations over $T$ timesteps are a sequence of tensors denoted as $X_{1,T} = \{X_1, X_2, \cdots, X_T\}$. Spatiotemporal predictive learning aims to forecast the most probable length-$K$ sequence in the future given the observations $X_{1,T}$ [25]:

$$\hat{X}_{T+1,T+K} = \arg\max_{X_{T+1,T+K}} \, p\left(X_{T+1,T+K} \mid X_{1,T}\right)$$

This can also be formulated as a mapping, $F_\theta : X_{1,T} \rightarrow X_{T+1,T+K}$, with learnable parameters, $\theta$, optimized by the following:

$$\theta^{*} = \arg\min_{\theta} \, \mathcal{L}\left(F_\theta(X_{1,T}),\, X_{T+1,T+K}\right)$$

where $\mathcal{L}$ represents a loss function [34].
Spatiotemporal predictive learning investigates the evolution of a dynamic system in a given spatial region. Our disease progression prediction, which was based on image sequences, primarily focused on disease progression on the leaves. However, the dynamic system presented in the original image sequence of a diseased Ginkgo leaf encompasses both leaf positional variations and disease progression. This poses a significant challenge for disease progression prediction using spatiotemporal predictive learning, as the dynamic variations in leaf position can substantially interfere with understanding disease progression. To address this issue, our study manually processed each obtained original image sequence to ensure leaf position consistency within the sequence. Through this process, interference from dynamic variations in leaf position was effectively eliminated, resulting in better disease progression prediction.
For the disease progression prediction of Ginkgo leaf blight, the observation at each timestep (one day) was recorded as an RGB image, with the channels, height, and width denoted as $C$, $H$, and $W$, respectively. We divided this image into an $H \times W$ grid, where each cell consists of $C$ pixel values treated as $Q$ measurements. Thus, it can be represented by a tensor, $I \in \mathbb{R}^{C \times H \times W}$. The observations obtained over $T$ timesteps can be described as follows:

$$I_{1,T} = \{I_1, I_2, \cdots, I_T\}$$

Our aim is to forecast the most likely disease symptoms of the Ginkgo leaf in the next $K$ images:

$$\hat{I}_{T+1,T+K} = \arg\max_{I_{T+1,T+K}} \, p\left(I_{T+1,T+K} \mid I_{1,T}\right)$$

Based on the foregoing analysis, the disease progression prediction of Ginkgo leaf blight essentially becomes a spatiotemporal prediction problem. In this study, we trained a neural network model with learnable parameters, $\theta$, to learn a mapping $F_\theta : I_{1,T} \rightarrow I_{T+1,T+K}$ by exploring both spatial correlations and temporal dependencies of disease symptoms in image sequences. The optimal parameters, $\theta^{*}$, were found by minimizing the MSE loss function:

$$\theta^{*} = \arg\min_{\theta} \, \mathrm{MSE}\left(F_\theta(I_{1,T}),\, I_{T+1,T+K}\right)$$
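To make the formulation concrete, the following sketch splits one six-image sequence into T = 3 observed and K = 3 target images and scores a predictor with the MSE objective. The random array stands in for a real leaf sequence, and `persistence` is a hypothetical placeholder for the mapping F (it simply repeats the last observed image); neither is part of the proposed method.

```python
import numpy as np

T, K = 3, 3              # input and output lengths used in the experiments
C, H, W = 3, 100, 100    # RGB images resized to 100 x 100

rng = np.random.default_rng(0)
# Stand-in for one image sequence of a leaf, shape (T + K, C, H, W).
sequence = rng.random((T + K, C, H, W), dtype=np.float32)

observed = sequence[:T]  # the T past images fed to the model
target = sequence[T:]    # the K future images to be predicted


def persistence(past: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical stand-in for F: repeat the last observed image k times."""
    return np.repeat(past[-1:], k, axis=0)


def mse(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean squared error over all frames, channels, and pixels."""
    return float(np.mean((pred - truth) ** 2))


pred = persistence(observed, K)
print(pred.shape, mse(pred, target))
```

A learned model replaces `persistence` with a parameterized network, and training searches for the parameters that minimize this same loss over the training sequences.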

A Novel Disease Progression Prediction Method for Ginkgo Leaf Blight
To address the challenges in forecasting the progression of Ginkgo leaf blight, this study proposed a novel disease progression prediction (DPP) method. The DPP method introduced a multi-level feature translation architecture to capture rich spatiotemporal dependencies from different-level feature maps of disease symptoms. This establishes a comprehensive framework for understanding the spatiotemporal dynamics of disease progression. Moreover, we proposed an enhanced spatiotemporal attention module (eSTA), which is used to build an enhanced spatiotemporal attention translator that captures key spatiotemporal dependencies of disease symptoms.

Multi-Level Feature Translation Architecture
The proposed multi-level feature translation architecture was designed to capture the comprehensive spatiotemporal dependencies of different feature levels of disease symptoms. As shown in Figure 3, this architecture consisted of three parts: a spatial encoder, multiple spatiotemporal translators, and a spatial decoder, which resembled the SimVP structure [35]. Given a batch of input images, $D \in \mathbb{R}^{B \times T \times C \times H \times W}$, with a batch size of $B$, the encoder and decoder reshape their input tensors to $(B \times T) \times C \times H \times W$. Accordingly, the spatial encoder and decoder treat each input image as an individual sample, focusing exclusively on per-image feature extraction and reconstruction without accounting for temporal dependencies across the image sequence. Subsequently, the hidden representations from the encoder are reshaped into a tensor with the shape of $B \times (T \times C) \times H \times W$ for input to the translator. By integrating multi-image-level features along the temporal axis, the translator is able to capture inherent spatiotemporal dependencies from the multi-image features.
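The two reshapes described above can be illustrated with NumPy. This is a shape-level sketch only: the helper names and the hidden channel width of 8 are assumptions chosen for the example, and no convolution is performed.

```python
import numpy as np


def to_per_image(batch: np.ndarray) -> np.ndarray:
    """(B, T, C, H, W) -> (B*T, C, H, W): each image becomes its own sample."""
    b, t, c, h, w = batch.shape
    return batch.reshape(b * t, c, h, w)


def to_translator_input(hidden: np.ndarray, batch_size: int) -> np.ndarray:
    """(B*T, C, H, W) -> (B, T*C, H, W): frames are stacked along channels."""
    bt, c, h, w = hidden.shape
    t = bt // batch_size
    return hidden.reshape(batch_size, t * c, h, w)


B, T, C, H, W = 2, 3, 3, 4, 4
batch = np.zeros((B, T, C, H, W), dtype=np.float32)
per_image = to_per_image(batch)

# Pretend encoder output with an assumed hidden width of 8 channels.
C_hid = 8
hidden = np.arange(B * T * C_hid * H * W, dtype=np.float32).reshape(B * T, C_hid, H, W)
stacked = to_translator_input(hidden, B)
print(per_image.shape, stacked.shape)
```

Because the reshape is row-major, channel index t × C_hid + c of sample b in the stacked tensor holds channel c of frame t, so the temporal order is preserved inside the channel axis that the translator operates on.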
The encoder incorporated convolutional networks (denoted by Convs) and downsampling operations (denoted by Downs) to extract multi-level features ranging from lower-level pixel attributes (such as color and texture) to higher-level semantic information (such as the morphology of diseased areas and leaf structure). This formed a comprehensive feature pyramid. Convs comprised $N_s$ groups of a vanilla convolutional layer, a normalization layer, and a nonlinear activation layer. Downs was achieved via a convolutional operation with a stride of two. The mapping function of the encoder is defined as follows:

where $\{F_i\}_{i=1}^{L}$ denotes the set of features at $L$ levels from lower to higher. No Downs is applied when $i = 1$ in the encoder.
The multi-level feature translation architecture employed $L$ spatiotemporal translators (STTs) for parsing and learning spatiotemporal information of specific-level features. The translator is illustrated in Section 2.3.2. The mapping process for each translator is provided by Equation (6):

$$H_i = \mathrm{STT}_i(F_i), \quad 1 \le i \le L \tag{6}$$

where $H_i$ denotes the feature representation of the $i$-th level after processing through the corresponding spatiotemporal translator.
The decoder comprised upsampling operations (Ups) and convolutional networks (Convs). The Ups were a combination of a vanilla convolutional layer with a kernel size and a stride of one, coupled with PixelShuffle [37]. The decoder sequentially merges higher-level spatiotemporal representations with lower-level representations through channel concatenation and decoding. Ultimately, it maps these representations into the image data space, generating images of Ginkgo leaves with predicted disease symptoms. The mapping process of the decoder is illustrated as follows:

where SkipC denotes skip connections.
The multi-level feature translation architecture enables the DPP method to capture and analyze spatiotemporal dependencies of disease feature maps at different levels. This approach enriches the spatiotemporal representation of disease symptoms, offering a more comprehensive understanding of disease progression. Additionally, it improves the utilization of low-level features compared to SimVP.
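As a rough guide to the sizes involved, the sketch below computes the feature shapes at each level for a 100 × 100 input with the feature dimensions 64, 128, and 256 and three input frames reported in the implementation details. The kernel size of 3 and padding of 1 for the stride-two downsampling are assumptions, since the text only fixes the stride, and the function names are illustrative.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_shapes(height: int, width: int, channels=(64, 128, 256), frames: int = 3):
    """Per-image and translator-input shapes for each feature level."""
    shapes = []
    h, w = height, width
    for level, c in enumerate(channels, start=1):
        if level > 1:  # no Downs is applied at the first level
            h, w = conv_out(h, stride=2), conv_out(w, stride=2)
        shapes.append({
            "level": level,
            "per_image": (c, h, w),               # one sample of the (B*T) batch
            "translator_in": (frames * c, h, w),  # frames stacked along channels
        })
    return shapes


shapes = pyramid_shapes(100, 100)
for s in shapes:
    print(s)
```

Under these assumptions the three levels have spatial sizes 100, 50, and 25, so each translator works on a progressively coarser but wider stack of channels.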

Enhanced Spatiotemporal Attention Translator
To enhance the model's ability to capture key spatiotemporal dependencies of disease symptoms at a specific feature level, this study proposed an enhanced spatiotemporal attention module (eSTA). Subsequently, we built the enhanced spatiotemporal attention translator by stacking $N_t$ modules, each comprising an eSTA and a multi-layer perceptron (MLP), as shown in Figure 4.
The enhanced spatiotemporal attention module consisted of spatial attention (SA) and temporal attention (TA), as shown in Figure 4. Motivated by recent research on large kernel convolutions [35,36,38] and gating mechanisms [39], we employed a point-wise convolution (PWConv), a depth-wise convolution (DWConv), and a depth-wise dilation convolution (DWDConv) to achieve a large spatial receptive field for capturing long-range spatial dependencies of disease symptoms. Furthermore, a gating unit was introduced as the pixel-wise attention module and nonlinear activation function to adaptively extract crucial spatial information. These components enable spatial attention to generate powerful spatial representations of disease symptoms. SA can be formulated as follows:

where $Z_{in}$, $Z_1$, and $G_s$ are tensors with the shape of $B \times (T \times C) \times H \times W$, Sigmoid represents a nonlinear activation function, and "•" is a Hadamard product.
To determine the temporal dependencies along the channels, we developed temporal attention (TA) based on the concept of squeeze and excitation [40]. In this approach, attention weights are allocated to individual channels according to their correlation with underlying temporal dependencies. This allocation allows the TA to selectively amplify channels rich in significant information while suppressing those with less relevance. Additionally, a skip connection from the input of the SA to the output of the TA was introduced to preserve the additional spatial features. The TA and the skip connection can be described mathematically as follows (Equation (10)):

where "*" denotes a channel-wise multiplication and $\alpha$ is a learnable parameter.
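For readers unfamiliar with squeeze and excitation, the sketch below shows the generic channel-reweighting idea that the TA builds on. It is not the paper's TA: the learnable parameter α and the skip connection from the SA input are not modeled, the reduction ratio `r` and the random weight matrices are placeholders, and all names are assumptions.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def se_channel_attention(z: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """Generic squeeze-and-excitation gate over the channel axis of z (B, C, H, W)."""
    squeezed = z.mean(axis=(2, 3))            # squeeze: global average pool -> (B, C)
    hidden = np.maximum(squeezed @ w1, 0.0)   # bottleneck with ReLU -> (B, C // r)
    weights = sigmoid(hidden @ w2)            # excitation: one weight per channel
    return z * weights[:, :, None, None], weights


rng = np.random.default_rng(0)
B, TC, H, W, r = 2, 12, 8, 8, 4               # TC plays the role of T x C channels
z = rng.standard_normal((B, TC, H, W))
w1 = rng.standard_normal((TC, TC // r)) * 0.1  # placeholder weights, not trained
w2 = rng.standard_normal((TC // r, TC)) * 0.1
out, weights = se_channel_attention(z, w1, w2)
print(out.shape, weights.shape)
```

Since the channel axis here stacks the T frames, reweighting channels is what lets such a gate emphasize or suppress information from particular timesteps.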
As a result, the eSTA module possesses the ability to adaptively capture long-range spatial correlations and temporal dependencies of disease symptoms at a specific feature level, learning the key spatiotemporal dependencies of disease symptoms. Furthermore, a multi-layer perceptron (MLP) was constructed using point-wise convolution and depth-wise convolution to further integrate and refine the spatiotemporal representation obtained from the eSTA module, as illustrated in Figure 4.

Implementation Details
The experiments were conducted on our image sequence dataset of diseased Ginkgo leaves, where both the input and output image sequences had a length of three. For this study, ConvLSTM [25], PredRNN [33], and SimVP [35,36] were selected as comparison methods for forecasting the disease progression of Ginkgo leaf blight.
Both ConvLSTM and PredRNN consisted of four layers, each with hidden states of 64 channels and convolutional kernels of a size of 5 × 5. For DPP and SimVP, $N_s$ and $N_t$ were set to two and four, respectively. The multi-level feature dimensions of DPP were set to 64, 128, and 256, respectively.
Each model was trained using the Adam optimizer [41] with the ReduceLROnPlateau learning rate scheduler in PyTorch, with an initial learning rate of 0.001. The batch size was set to 16 for all models but 8 for PredRNNv2 due to its substantial memory consumption. To reduce overfitting and improve the model's generalization performance, image sequences underwent random rotations of 90°, as well as random horizontal and vertical flips during the training phase. Moreover, all models were trained on a single NVIDIA Tesla V100 GPU for 100 epochs.
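The augmentation step can be sketched as below. The key point is that one random choice is drawn per sequence and applied to all of its images, so that the leaf stays aligned across timesteps. This is an illustration under assumptions: the rotation is taken to be a random multiple of 90°, each flip is applied with probability 0.5, and `augment_sequence` is a hypothetical name.

```python
import numpy as np


def augment_sequence(seq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one random rotation/flip choice to every frame of seq (T, C, H, W)."""
    k = int(rng.integers(0, 4))               # assumed: rotate by k * 90 degrees
    out = np.rot90(seq, k=k, axes=(2, 3))     # rotate in the H-W plane
    if rng.random() < 0.5:
        out = out[:, :, :, ::-1]              # horizontal flip
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]              # vertical flip
    return np.ascontiguousarray(out)


rng = np.random.default_rng(0)
sequence = rng.random((6, 3, 100, 100))       # stand-in for one leaf sequence
augmented = augment_sequence(sequence, rng)
print(augmented.shape)
```

Because the images are square, rotations by multiples of 90° leave the tensor shape unchanged, so augmented sequences can be batched together with unaugmented ones.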
Four metrics were used for quantitative evaluation: MAE, MSE, SSIM, and PSNR. Lower values of MAE and MSE, coupled with higher values of SSIM and PSNR, indicate better performance.
In image evaluation, the mean absolute error (MAE) and mean squared error (MSE) are commonly used to quantify the discrepancies between the predicted and target images. In our study, the MAE measures the average of the absolute deviations in overall pixel values between each image in the predicted sequence and its corresponding image in the target sequence, while the MSE calculates the average of the squared errors in overall pixel values between each image in the predicted sequence and its corresponding image in the target sequence. Lower MAE and MSE values indicate better accuracy of the method in predicting pixel values. The calculation formulas for MAE and MSE are as follows:

The structural similarity index measure (SSIM) [44] serves as a perceptual metric for assessing structural similarities between two images. In our study, the SSIM denotes a mean structural similarity index to evaluate overall image quality for a predicted image sequence. The detailed calculation process is as follows:

The peak signal-to-noise ratio (PSNR) is used to quantify image distortion, where higher values signify less distortion and superior visual quality. In our study, the PSNR measured the overall image distortion of the predicted image sequence, calculated as follows:

where $\mathrm{PMSE}_k$ represents the average of the squared errors between corresponding pixel values of the $k$-th image in a predicted sequence and the $k$-th image in a target sequence.
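A minimal NumPy sketch of three of these metrics follows, using common textbook definitions. The exact normalization used in the study (for example, whether errors are summed or averaged over pixels, and the peak value used for PSNR) is not recoverable from the text, so the per-pixel averaging and the default `max_val` of 255 are assumptions. SSIM is omitted because it requires a windowed implementation.

```python
import numpy as np


def mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over all frames, channels, and pixels."""
    return float(np.mean(np.abs(pred - target)))


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error over all frames, channels, and pixels."""
    return float(np.mean((pred - target) ** 2))


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Average over the K frames of 10 * log10(max_val^2 / PMSE_k)."""
    per_frame_mse = np.mean((pred - target) ** 2, axis=(1, 2, 3))  # PMSE_k per frame
    return float(np.mean(10.0 * np.log10(max_val ** 2 / per_frame_mse)))


# Toy sequences of shape (K, C, H, W) with a constant error of 5 grey levels.
target = np.zeros((3, 3, 4, 4), dtype=np.float64)
pred = target + 5.0
print(mae(pred, target), mse(pred, target), psnr(pred, target))
```

Note that `psnr` is undefined when a predicted frame matches its target exactly, because the per-frame error is then zero; a practical implementation would guard against that case.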

Results and Discussion
In this section, we first compare the proposed DPP with other spatiotemporal predictive learning methods in disease progression prediction through qualitative and quantitative analyses. Following this, we perform an ablation study to elucidate the importance of the multi-level feature translation architecture and the enhanced spatiotemporal attention module in disease progression prediction.

Qualitative Comparison
A visualization of predicted results for two Ginkgo leaf specimens is given in Figure 5. In the first leaf specimen, disease symptoms changed over time primarily within a localized region, as depicted in the target images of Figure 5a. The real progression of disease symptoms on the second leaf at the output timesteps (t = 4, 5, and 6) posed a more challenging scenario: the disease symptoms gradually expanded from all sides towards the center of the Ginkgo leaf to varying degrees, as shown in the target images of Figure 5b. Overall, no apparent distortion was observed in the predictions of any method. Close observation indicated that the overall disease symptoms in the images predicted by DPP were closer to those in the target images than those from other methods.

To better qualitatively compare the predictive performance of different methods in the disease progression prediction of Ginkgo leaves, one specified area on the first leaf specimen and two on the second were chosen for comparison. The disease symptoms in these specific areas show noticeable changes over time within the target images. Subsequently, these specified areas were individually magnified in both the target and predicted images of each method, as illustrated in Figure 6.
Figure 6 shows a detailed qualitative comparison of disease symptoms in specified areas on two leaf specimens. Compared to other methods, DPP presented minimal discrepancies in the shape, size, and color of disease symptoms between the predicted images of the two leaf specimens and the corresponding target images at each timestep. Moreover, the overall variation trends of disease symptoms over the output timesteps in the predicted images were similar to those observed in the target images. These results suggest that DPP captures the spatiotemporal dependencies of disease symptoms well and can therefore effectively forecast the disease progression of Ginkgo leaf blight.

In contrast, the disease symptoms in predictions of the first leaf specimen from the SimVP variants (SimVP + gSTA and SimVP + TAU) exhibited certain similarities with those in the target images, but such similarities were not observed in predictions of the second leaf specimen. This indicates their limitations in capturing the more complex spatiotemporal dynamics of disease symptoms, making it difficult to generate stable and effective prediction results. Furthermore, the disease symptoms in all specified areas in predictions of both leaf specimens from the ConvLSTM and PredRNN methods exhibited negligible changes across output timesteps, thereby inadequately reflecting disease progression. One reason for this outcome is that these methods seem to rely heavily on previous images, and they struggle to directly capture long-term dependencies within the complex spatiotemporal dynamics of disease symptoms [35]. Hence, it is difficult for the ConvLSTM and PredRNN methods to make effective predictions of disease progression.
The results described above demonstrate that DPP can accurately forecast the disease progression of Ginkgo leaf blight to a greater extent than other current spatiotemporal predictive learning methods. This allows relevant personnel to gain early insights into potential disease progression and holds promise for guiding the formulation of precise disease prevention and control measures. Moreover, it facilitates real-time evaluation of the effectiveness of implemented measures.

Quantitative Comparison
To further investigate the performance of various methods in disease progression prediction, four evaluation metrics were calculated for the predicted images of each method: the mean absolute error (MAE), mean square error (MSE), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR).
The quantitative results of the compared methods are shown in Table 2. Across all evaluation metrics, DPP consistently outperformed the other methods. Compared with the SimVP + TAU method, DPP exhibited notable reductions of 19.95% (from 227.92 to 182.45) in MAE and 25.35% (from 14.40 to 10.75) in MSE. Correspondingly, the SSIM increased by 0.011 (from 0.959 to 0.970), while the PSNR increased by 1.840 dB (from 35.906 dB to 37.746 dB). The lower MAE and MSE indicate a smaller pixel-level discrepancy between the predicted and target images, which directly reflects the higher predictive accuracy of the DPP method. The improved SSIM indicates a higher degree of similarity between predicted and target images in terms of structure and detail, while the superior PSNR denotes lower noise levels and better image quality in the predicted images. Benefiting from the multi-level feature translation architecture and enhanced spatiotemporal attention module, DPP achieves state-of-the-art performance in forecasting the disease progression of Ginkgo leaf blight, in terms of both pixel-level accuracy and image quality.

In Table 2, "↓" indicates that a lower metric value means better prediction performance, and "↑" indicates that a higher metric value means better prediction performance. A + B denotes model A with a translator built with module B. DPP is our proposed model with a multi-level feature translation architecture and an enhanced spatiotemporal attention module (eSTA) for disease progression prediction.
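The reported improvements over SimVP + TAU follow directly from the quoted metric values. The short check below recomputes them; the helper name `relative_reduction` is hypothetical, and the numbers are those stated in the text.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0

# Metric values quoted in the text: SimVP + TAU (baseline) versus DPP.
mae_reduction = relative_reduction(227.92, 182.45)
mse_reduction = relative_reduction(14.40, 10.75)
ssim_gain = 0.970 - 0.959
psnr_gain = 37.746 - 35.906

print(f"MAE reduction: {mae_reduction:.2f}%")
print(f"MSE reduction: {mse_reduction:.2f}%")
print(f"SSIM gain: {ssim_gain:.3f}")
print(f"PSNR gain: {psnr_gain:.3f} dB")
```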
Moreover, the SimVP variants and DPP, which are based on convolutional neural networks, outperform the ConvLSTM and PredRNN methods with recurrent architectures by a large margin in terms of the MAE, MSE, and PSNR. The results indicate that, without recurrent structures, pure convolutional networks consisting of a spatial encoder, a spatiotemporal translator, and a spatial decoder can still achieve excellent performance in disease progression prediction. This finding is consistent with the conclusions of other extensive experiments on synthetic moving digits, traffic flow forecasting, climate prediction, road driving, and human motion prediction [35]. It provides further empirical support for the strong performance of pure convolutional networks in spatiotemporal predictive learning.

Ablation Study
In this section, ablation experiments were conducted to investigate the significance of the multi-level feature translation architecture and the enhanced spatiotemporal attention module in disease progression prediction. Translators built with the gSTA, TAU, and eSTA modules were each used in both the SimVP and DPP architectures. Each resulting method was trained with the same training parameters and evaluated on the image sequence dataset of diseased Ginkgo leaves. The significance of the proposed multi-level feature translation architecture was assessed by comparing the test results of SimVP and DPP methods that employ the same translator. The importance of the enhanced spatiotemporal attention module was assessed by comparing the test results of SimVP and DPP methods that incorporate different translators.

Multi-Level Feature Translation Architecture
A comparative analysis of the three groups of methods listed in Table 3 (Methods 1 and 4, Methods 2 and 5, and Methods 3 and 6) reveals that the DPP methods offer significant advantages over the SimVP methods in terms of all four evaluation metrics when utilizing translators with identical modules. The detailed results are as follows:

1. In terms of the MAE, the DPP methods achieved a substantial decrease of at least 15.96% compared to the SimVP methods (Methods 2 and 5);
2. Considering the MSE, the DPP methods offered a reduction of at least 20.43% compared to the SimVP methods (Methods 3 and 6);
3. With respect to the SSIM, the DPP methods consistently presented an increase of 0.01 compared to the SimVP methods;
4. Regarding the PSNR, the DPP methods offered an improvement of at least 1.514 dB compared to the SimVP methods (Methods 3 and 6).

In Table 3, "↓" indicates that a lower metric value means better prediction performance, and "↑" indicates that a higher metric value means better prediction performance. A + B denotes model A with a translator built with module B.
The aforementioned results indicate that the multi-level feature translation architecture significantly outperforms the single-higher-level feature translation architecture (SimVP) in disease progression prediction. A single-higher-level feature map is often inadequate for comprehensively capturing image information due to a trade-off between coarse resolution and rich semantic content [45]. This limitation stems from the fact that high-level feature maps tend to encode more abstract and semantic information while sacrificing fine-grained spatial details. Consequently, capturing spatiotemporal dependencies solely from a single-higher-level feature map of disease symptoms is insufficient for the model to thoroughly understand the spatiotemporal dynamics of the disease symptoms. In contrast, the multi-level feature translation architecture can capture comprehensive spatiotemporal dependencies from feature maps of disease symptoms at different levels, thereby substantially improving the model's predictive accuracy. This finding may provide insights for other spatiotemporal prediction tasks, thereby contributing to the advancement of spatiotemporal predictive learning.
According to the quantitative analysis in Section 3.2, the contributions of the multi-level feature translation architecture to the reductions in the MAE and MSE achieved via the DPP method are at least 80% (15.96%/19.95%) and 80.6% (20.43%/25.35%), respectively. The multi-level feature translation architecture accounts for no less than 90.9% (0.01/0.011) and 82.3% (1.514 dB/1.84 dB) of the improvements in the SSIM and PSNR attained via DPP, respectively. These results clearly demonstrate that the multi-level feature translation architecture plays a decisive role in achieving the state-of-the-art performance of DPP for forecasting the disease progression of Ginkgo leaf blight.
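The contribution shares above are ratios between the minimum gain attributable to the architecture (from the ablation pairs) and the total gain of DPP over SimVP + TAU. A small sketch of that arithmetic, using only the figures quoted in the text (the helper name `share` is hypothetical):

```python
def share(architecture_gain, total_gain):
    """Fraction of the total improvement attributed to the architecture, in percent."""
    return architecture_gain / total_gain * 100.0

# (minimum gain from the multi-level architecture, total gain of DPP over SimVP + TAU)
shares = {
    "MAE": share(15.96, 19.95),    # both in percent
    "MSE": share(20.43, 25.35),    # both in percent
    "SSIM": share(0.010, 0.011),   # absolute SSIM increase
    "PSNR": share(1.514, 1.840),   # in dB
}

for metric, value in shares.items():
    print(f"{metric}: {value:.1f}%")
```

These ratios treat the architecture's gain and the total gain as directly comparable, which is a simplification: the two are measured against different baselines for some metrics, so the shares are best read as lower-bound estimates, as the text states.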

Enhanced Spatiotemporal Attention Module
First, the quantitative comparison of the eSTA, gSTA, and TAU modules within the SimVP architecture produced an interesting result, as shown in Table 3 (Methods 1, 2, and 3). The eSTA module achieved the best performance in terms of the MSE, SSIM, and PSNR metrics. Specifically, the eSTA module achieved a notable MSE reduction of 6.18% (from 14.40 to 13.51) and a PSNR increase of 0.326 dB (from 35.906 dB to 36.232 dB) compared to the TAU module. In terms of the MAE metric, the eSTA module performed marginally worse than the TAU module, with a narrow difference of just 0.17 (228.09 versus 227.92). We then compared the effects of the eSTA, TAU, and gSTA modules within the DPP architecture, as shown in Table 3 (Methods 4, 5, and 6). Here, the eSTA module surpassed both the gSTA and TAU modules across all four evaluation metrics. Notably, it reduced the MAE by 4.75% (from 191.55 to 182.45) compared to the TAU module.
Overall, the eSTA module demonstrates notable advantages over gSTA and TAU in forecasting disease progression, whether integrated within the SimVP or the DPP architecture. It can adaptively capture spatial correlations and temporal dependencies of the disease feature map, thereby learning the key spatiotemporal dependencies of disease symptoms. Of particular note is that, compared to the TAU module, the eSTA module reduced only the MSE within the SimVP architecture while leaving the MAE essentially unchanged. In the DPP architecture, on the other hand, the eSTA module decreased both the MSE and the MAE. The limitation of SimVP stems from its reliance on capturing spatiotemporal dependencies from only a single-higher-level feature map of disease symptoms. This reliance may have limited the eSTA module's ability to fully exhibit its potential for capturing more detailed spatiotemporal dependencies. Conversely, DPP overcomes this limitation by introducing a multi-level feature translation architecture that better exploits the potential of the eSTA module for capturing key spatiotemporal dependencies, thereby enhancing prediction accuracy.
Comparing the improvements in prediction performance between the multi-level feature translation architecture and the enhanced spatiotemporal attention module (eSTA) reveals that although the eSTA's contribution to enhancing prediction performance is not as pronounced as that of the former, it further elevates the DPP's performance to an advanced level in disease progression prediction.

Limitations and Future Work
Due to a trade-off between the number and length of image sequences, the sequence length in this study was set to only six. As a result, the performance of the proposed method in disease progression prediction could not be explored at longer sequence lengths. Moreover, this study was unable to obtain meteorological factors from the Ginkgo plantation, relying solely on image sequences to forecast the disease progression of Ginkgo leaf blight. This might limit further improvement of the prediction accuracy of the proposed method.
Consequently, future work will focus on the scale of the dataset, aiming to construct an image sequence dataset with a larger number of sequences and longer sequence lengths. Simultaneously, meteorological factors will be introduced to investigate more complex multi-modal disease progression prediction methods, aiming to obtain more precise prediction results. Furthermore, while this study focuses solely on forecasting the disease progression of Ginkgo leaf blight, future research will investigate the applicability of the proposed method to other plant leaf diseases.

Conclusions
This study investigated disease progression prediction using spatiotemporal predictive learning methods based on image sequences of diseased Ginkgo leaves. A novel disease progression prediction method (DPP) was proposed, which employs a multi-level feature translation architecture and an enhanced spatiotemporal attention module (eSTA) to boost predictive accuracy in forecasting the disease progression of Ginkgo leaf blight.
The experimental results demonstrate that the proposed DPP method can accurately forecast the disease progression of Ginkgo leaf blight to a large extent, significantly outperforming other existing spatiotemporal predictive learning methods. Compared with the SimVP + TAU method, DPP achieved notable reductions of 19.95% and 25.35% in mean absolute error (MAE) and mean square error (MSE), respectively. In addition, DPP achieved a higher structural similarity index measure (SSIM) of 0.970 and a higher peak signal-to-noise ratio (PSNR) of 37.746 dB. Furthermore, the ablation study indicated that the multi-level feature translation architecture accounts for at least 80% of the performance improvement of the proposed method. This suggests that capturing spatiotemporal dependencies from different levels of disease features and constructing multi-level spatiotemporal representations is crucial for precisely forecasting the disease progression of Ginkgo leaf blight.
These findings enable relevant personnel to understand potential disease progression at an early stage and have the potential to promote precision and intelligent management of Ginkgo leaf disease. This study also presents a novel perspective for extensive research on plant disease prediction. Future work will focus on expanding the dataset size and introducing meteorological factors to explore multi-modal disease progression prediction methods. Furthermore, the applicability of the proposed method to other plant leaf diseases is worth investigating.

Figure 1. The construction process for the image sequence dataset in this paper. (a) Description of the image capturing stage. All selected diseased Ginkgo leaves were labeled and numbered using specific tags. Each photograph was captured with the leaf placed on white paper adhered to a wooden board and flattened; (b) description of the drawing and transfer process of Ginkgo leaf-related changes from the Ref image to the Base image using the Adobe Photoshop software. This process aimed to obtain an image subsequence in which the position of the Ginkgo leaf remained consistent across all images within the subsequence.

Figure 2. Two examples from our image sequence dataset. We added pink boxes to all images at the same position within each sequence. These boxes serve as a reference for visualizing changes in disease symptoms on Ginkgo leaves. In the first sequence, the disease symptoms gradually extend beyond the upper and right edges of the boxes, progressing inward over time. Similarly, in the second sequence, the healthy areas of the leaf within the boxes gradually become eroded by the disease symptoms.

Figure 3. The overall framework of the DPP method with a multi-level feature translation architecture and L-enhanced spatiotemporal attention translators. T represents the length of the input image sequence. C, H, and W denote the channel, height, and width of the input image, respectively. C* represents the basic channels for features, and in this study, C* is set to 64.

Figure 4. The overall framework of the enhanced spatiotemporal attention translator. T represents the length of the input image sequence. C, H, and W denote the channel, height, and width of the features, respectively.
In the SSIM and PSNR formulas, $\mu_{\hat{I}_k}$ and $\mu_{I_k}$ are the means of the k-th image in a predicted sequence and the k-th image in a target sequence, respectively; $\sigma_{\hat{I}_k I_k}$ is the covariance between the k-th image in a predicted sequence and the k-th image in a target sequence; $\sigma^2_{\hat{I}_k}$ and $\sigma^2_{I_k}$ are the variances of the k-th image in a predicted sequence and the k-th image in a target sequence, respectively; $\mathrm{MAX}(I_k)$ represents the maximum of pixel values in the image $I_k$; $a_1 = 0.01$; and $a_2 = 0.03$.

Figure 5. Two examples of predicted results in the image sequence dataset of diseased Ginkgo leaves. SimVP + gSTA denotes the SimVP model with a translator built with a gSTA module. SimVP + TAU denotes the SimVP model with a translator built with a TAU module. DPP is our proposed model, with a multi-level feature translation architecture and an enhanced spatiotemporal attention module (eSTA) for disease progression prediction. (a) The observed disease symptoms primarily change within a local region over time. (b) The observed disease symptoms gradually expand from all sides towards the center of the Ginkgo leaf to varying degrees.

Figure 6. Qualitative visualization of the disease symptoms in specified areas. (a) The observed variations in disease symptoms, primarily concentrated in a specified localized region (circular annotations outlined by red dashed lines), which is selected and magnified for comparison within both the target images and the predicted images of each method (rectangular annotations outlined by red dashed lines). (b) The observed changes in disease symptoms, mainly distributed across two specified areas, which are selected and magnified for comparison within both the target images and the predicted images of each method (rectangular annotations outlined by red and blue dashed lines, respectively).

Table 1. The numbers of the training, validation, and testing datasets, respectively.

Table 2. Quantitative results of different methods conducted on our image sequence dataset of Ginkgo leaves.

Table 3. Quantitative results of the ablation study on our image sequence dataset of Ginkgo leaves.