MPSTAN: Metapopulation-Based Spatio–Temporal Attention Network for Epidemic Forecasting

Accurate epidemic forecasting plays a vital role for governments to develop effective prevention measures for suppressing epidemics. Most of the present spatio–temporal models cannot provide a general framework for stable and accurate forecasting of epidemics with diverse evolutionary trends. Incorporating epidemiological domain knowledge ranging from single-patch to multi-patch into neural networks is expected to improve forecasting accuracy. However, relying solely on single-patch knowledge neglects inter-patch interactions, while constructing multi-patch knowledge is challenging without population mobility data. To address the aforementioned problems, we propose a novel hybrid model called metapopulation-based spatio–temporal attention network (MPSTAN). This model aims to improve the accuracy of epidemic forecasting by incorporating multi-patch epidemiological knowledge into a spatio–temporal model and adaptively defining inter-patch interactions. Moreover, we incorporate inter-patch epidemiological knowledge into both model construction and the loss function to help the model learn epidemic transmission dynamics. Extensive experiments conducted on two representative datasets with different epidemiological evolution trends demonstrate that our proposed model outperforms the baselines and provides more accurate and stable short- and long-term forecasting. We confirm the effectiveness of domain knowledge in the learning model and investigate the impact of different ways of integrating domain knowledge on forecasting. We observe that using domain knowledge in both model construction and the loss function leads to more efficient forecasting, and selecting appropriate domain knowledge can improve accuracy further.


Introduction
In the past few years, COVID-19 has emerged as a significant threat to both human life and the global economy. Due to its highly contagious nature, millions of people have been infected, leading to enormous pressure on healthcare systems and social order [1]. Thus, it is imperative for governments and public health departments to devise effective epidemic prevention strategies, and accurate forecasting of the outbreak's future evolution is a critical factor in preventing disease transmission, mitigating its impact on public health and the economy, and enhancing the quality and efficacy of medical services [2].
Traditional epidemic forecasting models use compartmental models constructed from differential equations to simulate the potential transmission dynamics of an epidemic at the patch level, such as the SIR model [3], the SEIR model [4] and their variants [5,6]. The SIR model, for example, estimates the fluctuations in the number of susceptible, infected, and recovered individuals within a single patch to understand the dynamics of the epidemic in that patch. Many traditional time-series methods, such as ARIMA [7] and SVR [8], can directly model the temporal dependency of epidemic outbreaks. In recent years, deep learning has been widely used in time series forecasting, and several excellent models have been proposed, including LSTM [9], GRU [10], Transformer [11], and Neural ODE [12]. These models are designed to handle the characteristic properties of time series data, such as temporal correlation and periodicity. However, the above methods only consider the temporal dependence of the data and ignore the spatial dependence, which may lead to insufficiently accurate forecasting results. The reason is that the epidemic evolution of a patch is influenced not only by its own factors, such as the scale of infection and medical resources, but also by external factors, such as the mobility of people from other patches [13]. Therefore, it is crucial to consider spatial dependence to improve the accuracy of epidemiological trend analysis and forecasting.
The development of graph-based algorithms provides researchers with a powerful tool for treating epidemic forecasting as a spatio-temporal forecasting problem [14?]. Various methods [16][17][18] have been proposed for epidemic spatio-temporal forecasting. In essence, these methods construct a graph to predict multi-patch epidemics. Each patch is represented as a node, and each patch's historical data, such as infected cases, recovered cases, hospitalizations, and ICU admissions, are used as node features. By modeling the temporal and spatial dependencies in epidemic data, these methods can capture potential spatio-temporal correlations to predict future trends in the spread of the epidemic. Building on spatio-temporal forecasting work in the traffic flow field, most spatio-temporal models, such as [19][20][21], can also be directly applied to epidemic forecasting.
Nevertheless, epidemiological evolution trends can vary considerably depending on the timing, region, and preventive measures of the epidemic outbreak. Fig. 1 shows the number of active cases in the United States and Japan at different recording times. The two datasets show completely different epidemiological evolution trends: Fig. 1(a) indicates that the outbreak is ongoing, while Fig. 1(b) indicates that the outbreak is under control, and these different trends reflect vastly different transmission dynamics. Traditional spatio-temporal models only find a nonlinear mapping between input and output data, without the underlying physical information, which makes it difficult to provide stable and accurate forecasting in the face of complex trends [22]. In response to this issue, [23] points out that it is not reasonable to simply apply deep learning to epidemic forecasting. Furthermore, theory-guided data science demonstrates that incorporating domain knowledge into data-driven models helps improve algorithm performance [24]. Therefore, researchers have attempted to use epidemiological domain knowledge to help models better learn the underlying dynamics of epidemics. Some works, such as [25][26][27], incorporate single-patch epidemic models such as SIR and SIRD into spatio-temporal models, providing meaningful epidemiological context for neural networks and improving the performance of epidemic forecasting. However, they neglect inter-patch epidemic transmission, so some researchers [28] use population mobility data to construct a metapopulation epidemic transmission model and train the learning model using this domain knowledge.
Although existing methods have achieved success in this field, we find the following issues: (1) Most existing methods fail to make full use of more reasonable epidemiological domain knowledge to help model training. They utilize domain knowledge that either ignores inter-patch interactions [25,26] or requires additional population mobility data to construct inter-patch interactions [28]. The latter approach relies heavily on population mobility data, but collecting such data between patches is inherently challenging and inaccurate, which can bias the model. (2) Most existing domain knowledge-based models do not analyze in detail how effective domain knowledge is for model training. Most methods only apply epidemiological domain knowledge to the loss function [26,28], and some works also apply it to the model construction [27]. However, these methods do not separately analyze the effect of domain knowledge in model construction and in the loss function on epidemic forecasting.
To address the above issues, we propose a novel approach named Metapopulation-based Spatio-Temporal Attention Network (MPSTAN). MPSTAN employs the MP-SIR model, which considers inter-patch mobility, to help train the spatio-temporal model. Specifically, the MP-SIR physical model uses a neural network to learn both intra- and inter-patch physical model parameters, thus enabling adaptive construction of interactions between patches. Furthermore, we believe that different parameters are influenced by distinct types of information. The intra-patch parameters primarily represent the scale of the epidemic within a given patch, which reflects the temporal variations in population size for each state. The inter-patch parameters, on the other hand, capture the population mobility between patches and are also influenced by spatial information. Therefore, we design multiple parameter generators that derive the intra- and inter-patch parameters from inputs containing different information. In addition, we apply the physical model to both the model construction and the loss function of MPSTAN, and thoroughly analyze the effectiveness of different ways of combining the physical model with the learning model for epidemic forecasting. Furthermore, a single physical model does not accurately represent the potential epidemiological dynamics in various real-world environments. To make more accurate forecasts, it is necessary to select an epidemiological physical model tailored to the specific circumstances. In summary, the main contributions of this paper are as follows: (1) We propose a novel spatio-temporal epidemic forecasting model that employs an adaptive approach to construct the metapopulation epidemic transmission and integrates domain knowledge to aid neural network training.
The remainder of this paper is structured as follows: In Section 2, we introduce the related work. Section 3 describes the detailed design of our proposed model. Section 4 presents the experimental results and provides an analysis of the findings. Finally, a summary of the entire work is presented in Section 5.

Related work
Many methods have been proposed for epidemic forecasting. They can be divided into four types: traditional mathematical models, time series models, traditional spatio-temporal models, and domain knowledge-based spatio-temporal models.
Traditional mathematical models: Early researchers used epidemic transmission models or traditional time-series models to predict future epidemic trends. [29] uses the SIR model to predict epidemics and points out that the simple SIR model is not consistent with epidemic characteristics. [6,30] propose a series of variant models based on the SIR model to better adapt to complex and variable epidemic transmission. In addition, traditional time-series models can be directly used for epidemic forecasting due to the time-series nature of the data. [31] predicts the prevalence and incidence of epidemics with ARIMA. [8] uses SVR to fit the epidemiological data, but the presence of numerous spikes in the daily data resulted in a poor fit. The advantages of these methods lie in their simple structure and low computational cost, but this also means that they struggle to extract the potentially complex nonlinear dynamics.
Time series models: Deep learning is widely used in time series forecasting due to its powerful nonlinear mapping capability, where RNN and its variants LSTM and GRU are frequently applied to capture temporal dependence. [32,33] treat epidemic forecasting as a time series forecasting problem, mainly using LSTM and its variants, while [34] proposes a two-branch LSTM to aggregate different levels of epidemiological information. The attention mechanism is also commonly used for time-series forecasting. For example, [35] proposes a transformer-based model to predict the change in influenza cases and designs a new loss function to avoid the performance degradation of the target value. In addition, [36] combines a transformer with LSTM for effective short- and long-term epidemic forecasting. Time series forecasting models typically take into account only temporal dependence without considering spatial dependence. In the case of epidemic transmission, such models therefore ignore the effect of inter-patch interactions on epidemic evolution, and relying on temporal dependence alone can lead to inaccurate epidemic forecasting.
Traditional spatio-temporal models: Numerous studies have indicated that Graph Convolutional Networks (GCNs) show superior results in processing data with spatial structure [37,38], and epidemic transmission can naturally be represented as a graph structure due to its spatial nature [39,40]. [16] uses time series data as input to a GCN for epidemic forecasting. [17] proposes a dynamic location-aware attention mechanism to capture the spatial relationships between patches. Furthermore, [18] fuses multimodal information in a spatio-temporal model to explore regional correlations in the epidemic transmission process. Because spatio-temporal features are shared across domains, models from other domains can also be applied to epidemic forecasting: [21] proposes adaptive adjacency matrices to learn the relationships between nodes in a graph; [41] models the temporal and spatial dimensions in parallel, since the complex mapping of serial neural network structures may change the original spatio-temporal relationships; and [42] combines Neural ODE with GCN in a tensor-based model that captures the spatio-temporal dependencies simultaneously to avoid limiting the model's representation capability. Nevertheless, traditional spatio-temporal models that lack physical information have difficulty fitting the potentially complex dynamics [43].
Domain knowledge-based spatio-temporal models: Several works have incorporated domain knowledge from epidemiology into neural networks. [25] uses a spatio-temporal model to predict the infection rates and combines them with the SIR model to predict infected cases. [26] constructs a physics-guided dynamic constraint model which uses the SIR model to constrain the propagation dynamics in neural network forecasting. This dynamic constraint uses the infection and recovery rates, together with the data of the previous time step, to recursively derive the predicted values. Moreover, [27] proposes a causal encoder-decoder structure based on the SIRD model, which is applied not only to the loss function but also iteratively to the model construction. However, this domain knowledge (the SIRD model) neglects the interactions between patches. Additionally, [28] uses population mobility data to construct a metapopulation epidemic transmission model and incorporates this domain model into a neural network to help it learn potential epidemic transmission dynamics. However, it is worth noting that the accuracy and completeness of the mobility data can significantly affect its performance.

Methodology
In this section, we first give the problem description for epidemic forecasting. Then, we present an overview of the proposed model and details of the modules.

Problem Description
We use the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ to represent a spatial network, where $\mathcal{V}$ denotes the set of $N$ patches and $\mathcal{E}$ denotes the set of edges between patches. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ represents the connections between patches. In particular, we construct the adjacency matrix by using the gravity model [44]. The edge weight $w_{ij}$ between patches $v_i$ and $v_j$ is defined as:
$$w_{ij} = p_i^{\alpha_1}\, p_j^{\alpha_2}\, e^{-d_{ij}/r}, \tag{1}$$
where $p_i$ ($p_j$) denotes the population size of patch $v_i$ ($v_j$), $d_{ij}$ denotes the distance between patches $v_i$ and $v_j$, and $\alpha_1$, $\alpha_2$, $r$ are hyperparameters. This means that the larger the populations of a pair of patches and the shorter the distance between them, the more strongly their epidemic propagation is correlated. We further keep only the largest edge weights of each patch to make the adjacency matrix sparse and thus reduce the computational complexity: if $w_{ij}$ belongs to the set of maximum edge weights of patch $v_i$, then $a_{ij} = 1$; otherwise $a_{ij} = 0$.
We use $\mathcal{X} = [X_1, \dots, X_T] \in \mathbb{R}^{N \times T \times F}$ to denote the spatio-temporal graph feature matrix, where $X_t$, $t \in [1, T]$, is the graph feature matrix at time step $t$ and $F$ is the number of node features. Here, node features include the number of daily active cases, daily recovered cases, and daily susceptible cases. For epidemic forecasting, our goal is to learn a function $f(\cdot)$ which uses the adjacency matrix $A$ and the node feature matrix $X_{t-T:t}$ of $T$ historical time steps as inputs to predict the number of daily active cases $Y_{t+1:t+T'}$ of the future $T'$ time steps. The problem can be formulated as follows:
$$[X_{t-T:t};\ A] \xrightarrow{\ f(\cdot)\ } Y_{t+1:t+T'}. \tag{2}$$
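The graph construction described above can be illustrated with a minimal sketch. It assumes the common gravity-model form $w_{ij} = p_i^{\alpha_1} p_j^{\alpha_2} e^{-d_{ij}/r}$, excludes self-loops, and keeps the `k` strongest edges per patch. The function name `gravity_adjacency`, the parameter `k`, the default hyperparameter values and the toy populations and distances are all illustrative assumptions, not values taken from the paper.

```python
import math

def gravity_adjacency(pop, dist, alpha1=0.5, alpha2=0.5, r=100.0, k=2):
    """Return (weights, adj): gravity-model edge weights and a sparse 0/1 adjacency.

    pop  : list of patch population sizes
    dist : square list-of-lists with pairwise distances
    k    : number of strongest edges kept per patch (hypothetical parameter)
    """
    n = len(pop)
    # Larger populations and shorter distances give larger edge weights.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # assumption: no self-loops
                w[i][j] = (pop[i] ** alpha1) * (pop[j] ** alpha2) * math.exp(-dist[i][j] / r)
    # Sparsify: keep only the k largest weights in each row.
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        top = sorted(candidates, key=lambda j: w[i][j], reverse=True)[:k]
        for j in top:
            adj[i][j] = 1
    return w, adj

# Toy example with four patches; patch 2 is small and far from the others.
pop = [1000.0, 800.0, 50.0, 900.0]
dist = [[0, 10, 500, 20],
        [10, 0, 400, 30],
        [500, 400, 0, 450],
        [20, 30, 450, 0]]
weights, adj = gravity_adjacency(pop, dist, k=2)
```

Note that a per-row selection like this one does not guarantee a symmetric adjacency matrix: the small, distant patch still selects two neighbors, but no other patch selects it.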

Model Overview
The overall framework of the MPSTAN model is shown in Fig. 2. The model has a recurrent architecture, and each model cell contains four modules: the spatio-temporal module, the epidemiology module, the multiple parameter generators module, and the information fusion module. First, we use the spatio-temporal module to learn the spatio-temporal information from the input data. The learned spatio-temporal information is then passed into the multiple parameter generators module to learn the parameters of the epidemiological model. Next, the input and the learned parameters are passed into the epidemiology module to produce a physical epidemic forecast. Finally, the learned spatio-temporal information is fused with the physical forecasting information in the information fusion module, and the output containing the fused information is passed to the MPSTAN cell at the next time step.

The Spatio-Temporal Module
The spatio-temporal module uses the spatio-temporal feature matrix $\mathcal{X} \in \mathbb{R}^{N \times T \times F}$ and the adjacency matrix $A \in \mathbb{R}^{N \times N}$ to learn the spatio-temporal information of the epidemic data. This module embeds a graph attention network (GAT) into a gated recurrent unit (GRU) to learn the spatial and temporal dependence.
Temporal embedding. GRU is widely used for time series forecasting due to its ability to model time series efficiently; we therefore use a GRU to learn the temporal embedding of each patch. In the GRU, $u_t$ and $r_t$ denote the update gate and reset gate at time step $t$, $\tilde{H}_t$ denotes the hidden embedding at time step $t$, $H_{t-1}$ denotes the output of the MPSTAN cell at time step $t-1$, and $H_t^{\mathrm{temp}}$ denotes the output containing the temporal dependence at time step $t$:
$$u_t = \sigma\left(X_t W_u + H_{t-1} U_u + b_u\right),$$
$$r_t = \sigma\left(X_t W_r + H_{t-1} U_r + b_r\right),$$
$$\tilde{H}_t = \tanh\left(X_t W_h + (r_t \odot H_{t-1})\, U_h + b_h\right),$$
$$H_t^{\mathrm{temp}} = u_t \odot H_{t-1} + (1 - u_t) \odot \tilde{H}_t,$$
where $\odot$ denotes element-wise multiplication and $W_u, W_r, W_h, U_u, U_r, U_h, b_u, b_r, b_h$ denote the learnable parameters.
Spatial embedding. The epidemic evolution of each patch is not independent, but is influenced by other patches at the spatial level. This resembles how a GAT uses an attention mechanism to aggregate information from neighboring patches and update the embedding of each patch. Therefore, we use a two-layer multi-head GAT to capture the spatial dependence of epidemic evolution among patches. First, we take the embedding $h_i$ of each patch $v_i$ as input and use the multi-head mechanism to compute independent attention weights. The attention weight $e_{ij}^{k}$ between patch $v_i$ and patch $v_j$ at the $k$-th head is given by
$$e_{ij}^{k} = \sigma\left(\mathbf{a}_k^{\top}\left[W_k h_i \,\Vert\, W_k h_j\right]\right),$$
where $W_k$ and $\mathbf{a}_k$ denote the learnable parameters of the $k$-th head, $[\cdot \,\Vert\, \cdot]$ denotes vector concatenation, $\sigma$ denotes the nonlinear activation function, and $h_i$ omits the subscript $t$. Then, we use the softmax function to calculate the attention scores of all the edges. The attention score $\alpha_{ij}^{k}$ between patch $v_i$ and patch $v_j$ at the $k$-th head is expressed as:
$$\alpha_{ij}^{k} = \frac{\exp\left(e_{ij}^{k}\right)}{\sum_{v_l \in \mathcal{N}_i} \exp\left(e_{il}^{k}\right)}.$$
Finally, the attention scores are used to aggregate the information from neighboring patches and update the patch embeddings $H_t^{\mathrm{st}} \in \mathbb{R}^{N \times D_s}$, where $D_s$ denotes the embedding dimension of each patch. The embedding $h_i^{\mathrm{st}}$ of patch $v_i$ is calculated as:
$$h_i^{\mathrm{st}} = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{v_j \in \mathcal{N}_i} \alpha_{ij}^{k}\, W_k h_j\right),$$
where $K$ is the number of heads and $\mathcal{N}_i$ denotes the set of neighbors of patch $v_i$. If $a_{ij} = 1$, patch $v_j$ belongs to the set of neighbors of patch $v_i$.
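The attention steps above (unnormalised weight, softmax over neighbors, weighted aggregation) can be sketched for a single head in plain Python. This is only an illustration under stated assumptions: LeakyReLU is assumed as the nonlinear activation for the attention weight, the output activation and the combination of multiple heads are omitted, every patch is assumed to have at least one neighbor, and the helper names (`matvec`, `leaky_relu`, `gat_head`) and toy inputs are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_head(h, adj, W, a):
    """One attention head. Returns (attention scores, aggregated embeddings).

    h   : list of patch embeddings (one vector per patch)
    adj : 0/1 adjacency matrix; adj[i][j] == 1 means j is a neighbor of i
    W   : shared linear map applied to every embedding
    a   : attention vector of length 2 * (output dimension of W)
    """
    n = len(h)
    z = [matvec(W, h_i) for h_i in h]  # W h_i for every patch
    alpha = [[0.0] * n for _ in range(n)]
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] == 1]
        # Unnormalised weight: LeakyReLU(a^T [W h_i || W h_j]).
        e = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j]))) for j in nbrs]
        m = max(e)  # subtract the maximum for numerical stability
        ex = [math.exp(v - m) for v in e]
        total = sum(ex)
        for j, v in zip(nbrs, ex):
            alpha[i][j] = v / total  # softmax over the neighbors of patch i
        # Attention-weighted sum of the neighbors' transformed embeddings.
        out.append([sum(alpha[i][j] * z[j][d] for j in nbrs) for d in range(len(z[0]))])
    return alpha, out

# Toy example: three fully connected patches with 2-dimensional embeddings.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.25]
alpha, out = gat_head(h, adj, W, a)
```

With an all-zero attention vector the scores become uniform, so each patch simply averages its neighbors; a learned attention vector is what lets the model weight neighbors unequally.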

The Epidemiology Module
We observe that the results of epidemic forecasting using only spatio-temporal models are not accurate and stable, and it is also very challenging to predict for datasets with different epidemiological evolution trends (e.g., an ongoing outbreak versus an outbreak under control) [22]. Therefore, some works, such as [26,27], use epidemiological domain knowledge to help model training. These works mainly use compartmental models as domain knowledge, such as the SIR model. The SIR model is the most typical model in epidemic transmission, where S denotes the susceptible individuals, I denotes the infected individuals, and R denotes the recovered individuals. The model uses three differential equations to represent the change in the three state populations in patch $v_i$:
$$\frac{dS_i}{dt} = -\beta_i \frac{S_i I_i}{p_i},$$
$$\frac{dI_i}{dt} = \beta_i \frac{S_i I_i}{p_i} - \gamma_i I_i,$$
$$\frac{dR_i}{dt} = \gamma_i I_i,$$
where $\beta_i$ and $\gamma_i$ denote the infection and recovery rates of epidemic transmission in patch $v_i$, $i \in [1, \dots, N]$, and $p_i$ is the population size of the patch. However, the SIR model is limited to simulating epidemic transmission within a single patch and neglects inter-patch interactions. Therefore, [28] uses population mobility data to construct a metapopulation epidemic model, and iteratively calculates the daily confirmed cases using neural networks. In addition, other mobility change data (e.g., GPS trajectory data) can also be used to construct the metapopulation epidemic model. However, accurate collection of mobility data is challenging, and other data may not fully reflect actual population mobility patterns.
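As a concrete illustration of the single-patch SIR dynamics described above, the following sketch applies a forward-Euler discretisation with a one-day step and rolls the model forward for several days. The function name `sir_forecast`, the daily step size and the numeric values are assumptions made for illustration only.

```python
def sir_forecast(S, I, R, beta, gamma, steps):
    """Roll a single-patch SIR model forward and return the infected counts.

    Uses a forward-Euler update with a step of one day (an assumption);
    beta is the infection rate and gamma the recovery rate.
    """
    pop = S + I + R  # total population of the patch, constant in plain SIR
    infected = []
    for _ in range(steps):
        new_inf = beta * S * I / pop  # susceptible -> infected
        new_rec = gamma * I           # infected -> recovered
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        infected.append(I)
    return infected

# Toy example: 1000 people, 10 of them initially infected, 5-day horizon.
trajectory = sir_forecast(990.0, 10.0, 0.0, beta=0.3, gamma=0.1, steps=5)
```

Because every term depends only on the patch's own compartments, nothing in this update can represent infections imported from other patches, which is the limitation the metapopulation extension addresses.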
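To overcome the limitation of data availability, we develop an adaptive approach to define inter-patch interactions and construct a metapopulation epidemic model, named the metapopulation-based SIR (MP-SIR) model, which does not rely on mobility data. The MP-SIR model extends the original SIR model with inter-patch mobility parameters that represent the mobility of the population in each state between patches:
$$\frac{dS_i}{dt} = -\beta_i \frac{S_i I_i}{p_i} - D_i^{S} S_i + \sum_{v_j \in \mathcal{N}_i} P(v_i \mid v_j)\, D_j^{S} S_j, \tag{12}$$
$$\frac{dI_i}{dt} = \beta_i \frac{S_i I_i}{p_i} - \gamma_i I_i - D_i^{I} I_i + \sum_{v_j \in \mathcal{N}_i} P(v_i \mid v_j)\, D_j^{I} I_j, \tag{13}$$
$$\frac{dR_i}{dt} = \gamma_i I_i - D_i^{R} R_i + \sum_{v_j \in \mathcal{N}_i} P(v_i \mid v_j)\, D_j^{R} R_j, \tag{14}$$
where $P(v_j \mid v_i)$ denotes the mobility probability from patch $v_i$ to patch $v_j$, and $D_i^{S}$, $D_i^{I}$, $D_i^{R}$ denote the mobility rates of susceptible, infected, and recovered individuals in patch $v_i$.
Taking Eq. 13 as an example, the change in the number of infected individuals within patch $v_i$ is affected by four aspects: (i) susceptible individuals $S_i$ become infected with probability $\beta_i$ after contact with infected individuals $I_i$; (ii) infected individuals recover with probability $\gamma_i$; (iii) infected individuals within patch $v_i$ move to other patches with the mobility rate $D_i^{I}$; (iv) infected individuals from patch $v_j$ move toward patch $v_i$ with the mobility rate $D_j^{I}$. We simply assume that individuals leaving a patch are equally likely to migrate to each of its neighboring patches. Formally, the mobility probability $P(v_j \mid v_i)$ from patch $v_i$ to patch $v_j$ is computed as follows:
$$P(v_j \mid v_i) = \frac{a_{ij}}{\sum_{k=1}^{N} a_{ik}}.$$
We use neural networks to generate the intra- and inter-patch MP-SIR model parameters $[\beta_t, \gamma_t] \in \mathbb{R}^{N \times 2}$ and $[D_t^{S}, D_t^{I}, D_t^{R}] \in \mathbb{R}^{N \times 3}$, and will describe them in detail in Section 3.5. Finally, the epidemic data and the generated MP-SIR model parameters are used as inputs to the MP-SIR model for domain knowledge-based epidemic forecasting:
$$\Delta X_t^{\mathrm{phy}} = \text{MP-SIR}\left(X_t;\ \beta_t, \gamma_t, D_t^{S}, D_t^{I}, D_t^{R}\right),$$
$$X_{t+1}^{\mathrm{phy}} = X_t + \Delta X_t^{\mathrm{phy}},$$
where $\Delta X_t^{\mathrm{phy}} \in \mathbb{R}^{N \times 3}$ denotes the change in the number of individuals in each state at time step $t$ and $X_{t+1}^{\mathrm{phy}} = \left[S_{t+1}^{\mathrm{phy}}, I_{t+1}^{\mathrm{phy}}, R_{t+1}^{\mathrm{phy}}\right] \in \mathbb{R}^{N \times 3}$ denotes the epidemic forecasting at time step $t+1$.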
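The four effects listed above (infection, recovery, outflow, and inflow from neighboring patches) can be sketched as one discrete update per patch. This is a hedged illustration rather than the authors' implementation: it assumes a forward-Euler step of one day, takes the patch population as the current sum of the three compartments, and uses the equal-probability mobility rule over the neighbors given by the adjacency matrix. The function name `mp_sir_step`, the argument names and the toy numbers are hypothetical.

```python
def mp_sir_step(S, I, R, beta, gamma, d_s, d_i, d_r, adj):
    """Advance every patch by one step of a metapopulation SIR model.

    S, I, R          : per-patch compartment sizes (lists of floats)
    beta, gamma      : per-patch infection and recovery rates
    d_s, d_i, d_r    : per-patch mobility rates of S, I and R individuals
    adj              : 0/1 adjacency matrix defining each patch's neighbors
    """
    n = len(S)
    # Equal-probability mobility: P[i][j] = a_ij / sum_k a_ik.
    P = []
    for i in range(n):
        deg = sum(adj[i])
        P.append([adj[i][j] / deg if deg else 0.0 for j in range(n)])
    S_next, I_next, R_next = [], [], []
    for i in range(n):
        pop = S[i] + I[i] + R[i]               # assumption: current patch population
        new_inf = beta[i] * S[i] * I[i] / pop  # (i) infection
        new_rec = gamma[i] * I[i]              # (ii) recovery
        # (iv) inflow: individuals leaving patch j that arrive in patch i
        in_s = sum(P[j][i] * d_s[j] * S[j] for j in range(n))
        in_i = sum(P[j][i] * d_i[j] * I[j] for j in range(n))
        in_r = sum(P[j][i] * d_r[j] * R[j] for j in range(n))
        # (iii) outflow is the d_* term subtracted from each compartment
        S_next.append(S[i] - new_inf - d_s[i] * S[i] + in_s)
        I_next.append(I[i] + new_inf - new_rec - d_i[i] * I[i] + in_i)
        R_next.append(R[i] + new_rec - d_r[i] * R[i] + in_r)
    return S_next, I_next, R_next

# Toy example: patch 1 has no infections of its own but receives some from patch 0.
S1, I1, R1 = mp_sir_step(
    S=[990.0, 500.0], I=[10.0, 0.0], R=[0.0, 0.0],
    beta=[0.3, 0.3], gamma=[0.1, 0.1],
    d_s=[0.0, 0.0], d_i=[0.05, 0.05], d_r=[0.0, 0.0],
    adj=[[0, 1], [1, 0]],
)
```

In this example the second patch ends the step with a non-zero infected count purely through mobility, which a single-patch SIR update cannot produce. Setting all mobility rates to zero reduces the update to independent SIR steps.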

The Multiple Parameter Generators Module
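We use embeddings containing different information to learn the intra- and inter-patch physical model parameters $[\beta_t, \gamma_t] \in \mathbb{R}^{N \times 2}$ and $[D_t^{S}, D_t^{I}, D_t^{R}] \in \mathbb{R}^{N \times 3}$ separately, instead of directly using embeddings containing spatio-temporal information for both. The intra-patch physical model parameters $\beta_t$, $\gamma_t$ describe the epidemic evolution within a single patch and are mainly affected by the temporal dependence, while the inter-patch physical model parameters $D_t^{S}$, $D_t^{I}$, $D_t^{R}$ describe the inter-patch population mobility and are mainly affected by the spatio-temporal dependence. Therefore, we generate these two types of physical model parameters by passing the embedding containing only temporal dependence and the embedding containing spatio-temporal dependence to two fully connected layers, respectively:
$$[\beta_t, \gamma_t] = \mathrm{FC}_{\mathrm{intra}}\left(H_t^{\mathrm{temp}}\right), \qquad [D_t^{S}, D_t^{I}, D_t^{R}] = \mathrm{FC}_{\mathrm{inter}}\left(H_t^{\mathrm{st}}\right).$$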

The Information Fusion Module
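In this module, the neural network forecasting $H_t^{\mathrm{st}} \in \mathbb{R}^{N \times D_s}$ and the physical model forecasting $X_{t+1}^{\mathrm{phy}}$ are fused. First, the physical forecasting is passed through a fully connected layer so that it has the same dimensions as the neural network forecasting. Next, the neural network forecasting is concatenated with the transformed physical forecasting. Finally, a fully connected layer is used to generate the final output $H_t \in \mathbb{R}^{N \times D}$ of the MPSTAN cell at time step $t$, where $D$ denotes the dimension of the GRU:
$$Z_t^{\mathrm{phy}} = \mathrm{FC}_1\left(X_{t+1}^{\mathrm{phy}}\right), \qquad H_t = \mathrm{FC}_2\left(\left[H_t^{\mathrm{st}} \,\Vert\, Z_t^{\mathrm{phy}}\right]\right).$$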

Output Layer
The output of the MPSTAN model is divided into two parts: neural network forecasting and physical model forecasting.
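Neural network forecasting. We use the final output $H_t$ of MPSTAN as the input of a fully connected layer to predict the number of infected individuals $Y^{\mathrm{nn}} \in \mathbb{R}^{N \times T'}$ in all patches for the next $T'$ time steps:
$$Y^{\mathrm{nn}} = \mathrm{FC}\left(H_t\right).$$
Physical model forecasting. The input data from the last day and the final trained model parameters are used as inputs for the MP-SIR model to recursively predict the number of infected individuals $Y^{\mathrm{phy}} \in \mathbb{R}^{N \times T'}$ in all patches for the next $T'$ time steps:
$$X_{t+\tau+1}^{\mathrm{phy}} = X_{t+\tau}^{\mathrm{phy}} + \text{MP-SIR}\left(X_{t+\tau}^{\mathrm{phy}};\ \beta_t, \gamma_t, D_t^{S}, D_t^{I}, D_t^{R}\right), \quad X_t^{\mathrm{phy}} = X_t, \quad \tau = 0, \dots, T'-1,$$
$$Y^{\mathrm{phy}} = \left[I_{t+1}^{\mathrm{phy}}, \dots, I_{t+T'}^{\mathrm{phy}}\right].$$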

Optimization
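We use epidemiological domain knowledge in both the model construction and the loss function to help MPSTAN learn the epidemiological evolution trends more effectively. We compare the predicted values $Y^{\mathrm{nn}}$ and $Y^{\mathrm{phy}}$ of the neural network and the physical model with the ground truth $\hat{Y}$ and then optimize an MAE loss via gradient descent:
$$\mathcal{L} = \frac{1}{N T'} \sum_{i=1}^{N} \sum_{\tau=1}^{T'} \left( \left| Y_{i,\tau}^{\mathrm{nn}} - \hat{Y}_{i,\tau} \right| + \left| Y_{i,\tau}^{\mathrm{phy}} - \hat{Y}_{i,\tau} \right| \right).$$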
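A minimal sketch of this two-part objective is shown below. It assumes that the two MAE terms (neural network forecast versus ground truth, and physical model forecast versus ground truth) are simply summed with equal weight; any weighting used in practice is not stated here. The function names `mae` and `combined_mae_loss` and the toy values are hypothetical.

```python
def mae(pred, truth):
    """Mean absolute error over a patches-by-horizon matrix (list of lists)."""
    pairs = [(p, t) for p_row, t_row in zip(pred, truth) for p, t in zip(p_row, t_row)]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

def combined_mae_loss(y_nn, y_phy, y_true):
    """Sum of the MAE of the neural forecast and the MAE of the physical forecast.

    Assumption: both terms have equal weight.
    """
    return mae(y_nn, y_true) + mae(y_phy, y_true)

# Toy example: 2 patches, 2-step horizon.
y_true = [[1.0, 2.0], [3.0, 4.0]]
y_nn = [[1.0, 3.0], [3.0, 2.0]]    # absolute errors 0, 1, 0, 2
y_phy = [[2.0, 2.0], [3.0, 4.0]]   # absolute errors 1, 0, 0, 0
loss = combined_mae_loss(y_nn, y_phy, y_true)
```

Because the physical forecast depends on the generated MP-SIR parameters, the second term pushes the parameter generators toward epidemiologically plausible values, while the first term fits the data directly.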

Experiments

Datasets
Our experiments are conducted on two real-world datasets: the US dataset and the Japan dataset. As shown in Table 1, the US dataset is state-level data collected from the Johns Hopkins University Coronavirus Resource Center [45], which records the number of daily active cases, daily recovered cases, daily susceptible cases and total population for 52 states from May 1, 2020 to December 31, 2020 (245 days). The Japan dataset is prefecture-level data collected from the Japan LIVE Dashboard [46], which records the number of daily active cases, daily recovered cases, daily susceptible cases and total population for 47 prefectures from January 15, 2022 to June 14, 2022 (151 days).
Baselines. We compare MPSTAN with the following models:
(1) SIR [3]: The SIR model uses three differential equations to calculate the change in the number of susceptible, infected and recovered cases in a single patch.
(2) ARIMA [31]: The Auto-Regressive Integrated Moving Average model is widely used for time series forecasting. We use ARIMA to predict daily active cases for each patch.
(3) GRU [10]: The Gated Recurrent Unit is a variant of RNN that uses fewer parameters to implement the gating mechanism compared to LSTM. We use a GRU for each patch separately to predict daily active cases.
(4) GraphWaveNet [21]: GraphWaveNet combines an adaptive adjacency matrix, diffusion convolution, and gated TCN to capture spatio-temporal dependencies.
(5) STGODE [42]: STGODE proposes a spatio-temporal tensor model by combining Neural ODE with GCN to achieve unified modeling of spatio-temporal dependencies.
(6) CovidGNN [16]: CovidGNN uses the time series of each patch as node features and predicts epidemics using a GCN with skip connections.
(7) ColaGNN [17]: ColaGNN designs a dynamic adjacency matrix using an attention mechanism and adopts a multi-scale dilated convolutional layer for long- and short-term epidemic forecasting.
(8) STAN [26]: STAN applies epidemiological domain knowledge to the loss function; specifically, it constructs a dynamics constraint loss by combining the SIR model.
Settings. We split the two datasets into training, validation and test sets at a ratio of 60%-20%-20% and normalize all the data to the range (0, 1). To verify the effectiveness of the model in short- and long-term forecasting, we set the input time length to 5, the forecasting time length to 5 and 10 for short-term forecasting, and the forecasting time length to 15 and 20 for long-term forecasting. In the model, the dimensions of GRU and GAT are set to 64 and 32, respectively, and the number of heads K in GAT is set to 2. We train for 50 epochs and use the Adam optimizer with a learning rate of 1e-3.
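The data preparation in these settings can be sketched for a single patch's series as follows. This is an illustration under assumptions: a chronological split, min-max scaling as the normalization method, and sliding windows of 5 input days followed by the forecasting horizon. Whether the scaling statistics are computed on the full series or on the training part only is not stated in the text; the helper names are hypothetical.

```python
def min_max_normalize(series):
    """Scale a series to the unit interval (assumed normalization method)."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def split_60_20_20(series):
    """Chronological 60%-20%-20% split into train, validation and test parts."""
    n = len(series)
    a, b = n * 6 // 10, n * 8 // 10  # integer arithmetic avoids float rounding
    return series[:a], series[a:b], series[b:]

def sliding_windows(series, t_in=5, t_out=5):
    """Return (input window, target window) pairs over the series."""
    last_start = len(series) - t_in - t_out
    return [(series[s:s + t_in], series[s + t_in:s + t_in + t_out])
            for s in range(last_start + 1)]

# Toy example: a 245-day series, matching the length of the US dataset.
days = [float(d) for d in range(245)]
train, val, test = split_60_20_20(min_max_normalize(days))
samples = sliding_windows(train, t_in=5, t_out=10)
```

With 245 days, the split yields 147 training days and 49 days each for validation and testing, so longer horizons leave noticeably fewer windows in the smaller partitions.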
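Evaluation metrics. We evaluate the forecasting performance using the mean absolute error (MAE), root mean squared error (RMSE), mean absolute percentage error (MAPE), Pearson correlation coefficient (PCC) and concordance correlation coefficient (CCC). The CCC is defined as
$$\mathrm{CCC} = \frac{2 \rho\, \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},$$
where $\rho$ denotes the correlation coefficient between the two variables, $\mu_x$ and $\mu_y$ denote the means of the two variables, and $\sigma_x^2$, $\sigma_y^2$ are the corresponding variances.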
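The correlation-based metrics can be computed from the quantities just named (correlation coefficient, means and variances). The sketch below assumes the standard definitions of the Pearson and concordance correlation coefficients with population (divide-by-n) variances; the function name `pcc_ccc` and the toy values are hypothetical.

```python
import math

def pcc_ccc(pred, truth):
    """Return (PCC, CCC) for two equally long sequences.

    CCC = 2 * rho * sd_x * sd_y / (var_x + var_y + (mean_x - mean_y) ** 2),
    and rho * sd_x * sd_y equals the covariance of the two sequences.
    """
    n = len(pred)
    mean_x, mean_y = sum(pred) / n, sum(truth) / n
    var_x = sum((x - mean_x) ** 2 for x in pred) / n
    var_y = sum((y - mean_y) ** 2 for y in truth) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(pred, truth)) / n
    pcc = cov / math.sqrt(var_x * var_y)
    ccc = 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)
    return pcc, ccc

# A forecast that tracks the shape perfectly but is offset by a constant:
pcc, ccc = pcc_ccc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

The example illustrates why both metrics are reported: a constant offset leaves the PCC at its maximum, whereas the CCC is reduced because it also penalises the difference between the means.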

Forecasting Performance
As shown in Table 2 and Table 3, we evaluate the performance of our method against all the baselines on the US dataset and the Japan dataset, respectively, for predicting daily active cases. Bold and underlined values indicate the best and second-best results, and Improvement denotes the relative improvement of MPSTAN over the second-best result. On the US dataset, our method achieves state-of-the-art (SOTA) performance for both short-term (T=5,10) and long-term (T=15,20) forecasting. In particular, our results for all the forecasting tasks show significant improvements over the second-best results: MAE improves by at least 19.05%, RMSE by at least 7.72%, MAPE by at least 23.97%, PCC by at least 0.34%, and CCC by at least 0.11%. While our method does not achieve SOTA performance on every task on the Japan dataset, it achieves the best or competitive results compared to other models, where MAE improves by at least 16.45%, RMSE by at least 17.40%, MAPE by at least 6.38%, and CCC by at least 2.66%. In summary, compared to all baseline models, MPSTAN can provide more accurate and stable forecasting for different real-world epidemic datasets.
Next, we discuss the performance comparison between different models in more detail. Traditional mathematical models (e.g., SIR, ARIMA) often outperform neural network models in short-term forecasting, but their performance degrades in long-term forecasting. This may be because the predictive accuracy of traditional mathematical models is highly dependent on the time length, and long-term forecasting requires more historical data. Insufficient historical data can lead to forecasting errors, and the cumulative effect of errors increases with longer forecasting times, resulting in worse long-term forecasting results.
In addition, we observe that traffic flow models, particularly STGODE, face challenges in providing stable and accurate forecasting for different tasks. This may be attributed to the fact that epidemic data is sparser and noisier than traffic flow data, increasing the likelihood of these models overfitting when applied to epidemic data. We also observe that ColaGNN has difficulty providing accurate forecasting. We believe this is because ColaGNN was originally designed for influenza-like illnesses, while COVID-19 data is more complex and on a larger scale, so the model is not well-suited for these tasks. By comparing domain knowledge-based models (e.g., STAN, MPSTAN) with other baselines, we observe that STAN and MPSTAN outperform other models in terms of accuracy, indicating that neural networks incorporating epidemiological domain knowledge better capture the underlying dynamics of epidemic transmission and achieve more accurate forecasting. In particular, the results show that MPSTAN performs better than STAN, highlighting the value of this integrated neural network framework that combines epidemiological domain knowledge to achieve more accurate forecasting. This framework involves two main aspects: integrating domain knowledge and modeling metapopulation transmission. In Section 4.4, we discuss the impact of these two aspects on forecasting results, including the effects of integration methods and inter-patch interactions.

Ablation Study
To explore the impact of epidemiological domain knowledge on epidemic forecasting and to verify the effectiveness of the model components, we further conduct ablation experiments on the US and Japan datasets.We generate all the physical model parameters using a single parameter generator for embeddings containing spatio-temporal information.
The results of the ablation experiments are shown in Table 4      poorer forecasting performance.Therefore, we believe that incorporating domain knowledge into model construction is essential, and simultaneously applying it to the loss function can improve the predictive accuracy of the model.For the remaining model components, the effectiveness of the metapopulation model establishment and multiple parameter generators can be verified by using MPSTAN w/o Mobility and MPSTAN w/o MPG, respectively.On the US dataset, MPSTAN outperforms MPSTAN w/o Mobility for forecasting tasks with T=5, 10, and 15.However, the opposite result is observed for the T=20 task, which may be due to the fact that inter-patch physical parameters are no longer sufficient to define the population mobility when the forecasting time is longer.Overall, MP-SIR, a metapopulation epidemic model that considers population mobility, is more beneficial for model training than traditional SIR.Additionally, comparing MPSTAN with MPSTAN w/o MPG reveals that using only one parameter generator to generate all physical model parameters may lead to poorer predictive performance.
On the Japan dataset, we observe that MPSTAN w/o Mobility and MPSTAN w/o MPG mostly outperform MPSTAN. We believe this is because the two datasets were collected at different times and locations, leading to differences in disease control measures and public awareness. To confirm this, we randomly select five cities from each dataset and display their normalized daily active cases in Fig. 3. The figure shows that the US cities are experiencing a surge in active cases, while the Japanese cities are effectively controlling the spread of the disease, resulting in a decrease in active cases. Moreover, we examine the COVID-19 Community Mobility Reports [47] from Google for the time periods corresponding to the two datasets. We observe that park population movement in the US is higher than the pre-epidemic baseline, while in Japan it is lower than the baseline. A possible explanation is that the US data come from an earlier period, when COVID-19 prevention and control policies were not yet well established, resulting in greater population mobility. The Japan data, in contrast, come from a later period, when more comprehensive measures had been implemented and the public had become more aware of the importance of self-isolation, leading to lower population mobility. Therefore, on the Japan dataset, the traditional SIR model is more suitable for combination with neural networks for epidemic forecasting. The multiple parameter generators (MPG) are essentially built on the metapopulation epidemic model, and thus the forecasting accuracy of MPSTAN w/o MPG is higher on this dataset.
Furthermore, we recognize that no single form of domain knowledge applies universally to all complex epidemic data. Thus, when selecting domain knowledge to integrate into neural networks, it is necessary to consider the actual circumstances and choose the more representative knowledge to achieve more accurate forecasting.
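The difference between the traditional SIR model and a metapopulation variant, which underlies the MPSTAN w/o Mobility ablation, can be illustrated with a minimal sketch. This is a generic discrete-time metapopulation SIR update and not necessarily the MP-SIR formulation used in MPSTAN; the function names, the `mobility` matrix (fraction of a patch's population moving to another patch per step), and the choice to apply mobility after the local update are all assumptions made for illustration.

```python
def sir_step(S, I, R, beta, gamma):
    """One discrete-time SIR update for a single, isolated patch."""
    N = S + I + R
    new_inf = beta * S * I / N   # new infections in this step
    new_rec = gamma * I          # new recoveries in this step
    return S - new_inf, I + new_inf - new_rec, R + new_rec


def mp_sir_step(S, I, R, beta, gamma, mobility):
    """One update of a generic metapopulation SIR model (illustrative only).

    S, I, R, beta, gamma: lists with one entry per patch.
    mobility[i][j]: hypothetical fraction of patch i's population that
    moves to patch j per step; diagonal entries are ignored.
    """
    n = len(S)
    # Step 1: local epidemic dynamics inside every patch.
    local = [sir_step(S[i], I[i], R[i], beta[i], gamma[i]) for i in range(n)]
    # Step 2: exchange individuals between patches, per compartment.
    updated = []
    for i in range(n):
        out_rate = sum(mobility[i][j] for j in range(n) if j != i)
        patch = []
        for c in range(3):  # 0: S, 1: I, 2: R
            inflow = sum(mobility[j][i] * local[j][c] for j in range(n) if j != i)
            patch.append(local[i][c] + inflow - out_rate * local[i][c])
        updated.append(tuple(patch))
    return updated
```

With an all-zero `mobility` matrix the update reduces to independent SIR patches, which is the situation the w/o Mobility variant corresponds to in spirit; with non-zero mobility, infections can reach a patch that has no local cases.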

Effect of Hyperparameters
In this section, we study the effect of hyperparameters on performance, focusing on the dimensions of the GRU and GAT. We vary one parameter at a time while keeping the other constant. The dimension is chosen from {8, 16, 32, 64, 128}, the T=5 task on the US dataset is used, and MAE, RMSE, and MAPE are the evaluation metrics.
Fig. 4 shows the effects of the GRU and GAT dimensions on performance. Forecasting performance is poor when the dimension is small and gradually improves as the dimension increases, because more parameters are available to fit the underlying dynamics of the epidemic. As the dimension continues to increase, however, forecasting performance deteriorates again. A possible reason is that the epidemic data are sparse, so an excessive number of parameters leads to overfitting.
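The one-at-a-time sweep described above can be sketched as follows. The metric definitions are the standard ones (the exact MAPE variant used in the experiments is not stated here, so the `eps` guard is an assumption), and `train_and_eval`, `default_gru`, and `default_gat` are hypothetical placeholders for a training routine and for the fixed value of the dimension that is not being varied.

```python
import math

DIMS = [8, 16, 32, 64, 128]  # dimension range used in the sweep


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error in percent; eps avoids division by zero."""
    return 100.0 * sum(
        abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def sweep(train_and_eval, default_gru=32, default_gat=32):
    """Vary one dimension at a time while holding the other fixed.

    train_and_eval(gru_dim=..., gat_dim=...) is a hypothetical callable that
    trains a model and returns its evaluation result (e.g. MAE, RMSE, MAPE).
    """
    results = {}
    for d in DIMS:
        results[("gru", d)] = train_and_eval(gru_dim=d, gat_dim=default_gat)
    for d in DIMS:
        results[("gat", d)] = train_and_eval(gru_dim=default_gru, gat_dim=d)
    return results
```

Plotting each metric against the dimension for the `"gru"` and `"gat"` keys would give curves of the kind discussed for Fig. 4.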

Model Complexity
We analyze the model complexity by comparing the number of neural network parameters across all models. As shown in Fig. 5, MPSTAN has significantly fewer neural network parameters than the other spatio-temporal models. This is because MPSTAN makes extensive use of epidemiological domain knowledge (e.g., in model construction and in the loss function), which reduces its reliance on neural networks and lowers the number of parameters. Comparing GRU and MPSTAN, we find that their parameter counts are similar. However, the former ignores spatial dependence and the intrinsic propagation mechanism of the epidemic and can only be used for temporal forecasting of a single patch, while the latter addresses both issues and provides stable and accurate forecasting for different trends.
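To make the notion of a parameter count concrete, the following sketch computes the number of trainable parameters of a single GRU layer and of a fully connected layer from their dimensions. The option with two bias vectors per gate follows the convention used by PyTorch's GRU implementation; the function names and the example dimensions are illustrative assumptions and do not reproduce the counts reported in Fig. 5.

```python
def gru_param_count(input_dim, hidden_dim, two_biases=True):
    """Trainable parameters of one GRU layer.

    A GRU has three gates (reset, update, candidate); each has an
    input-to-hidden weight matrix, a hidden-to-hidden weight matrix,
    and one or two bias vectors depending on the implementation.
    """
    n_bias = 2 if two_biases else 1
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + n_bias * hidden_dim
    return 3 * per_gate


def linear_param_count(in_dim, out_dim, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


# Example: parameter count grows roughly quadratically with the hidden size.
for h in (8, 16, 32, 64, 128):
    print(h, gru_param_count(input_dim=1, hidden_dim=h))
```

The quadratic growth in the hidden dimension is one reason why large dimensions can overfit sparse epidemic data.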

Conclusion
In this paper, we propose a Metapopulation-based Spatio-Temporal Attention Network (MPSTAN) for epidemic forecasting. The model defines interactions between patches adaptively and applies the constructed domain model to both the model construction and the loss function of MPSTAN in order to better learn the underlying dynamics of epidemic propagation. Experiments show that MPSTAN outperforms the baselines and is more stable on two real datasets with different epidemiological evolution trends. We further analyze the effectiveness of incorporating domain knowledge and find that it improves the forecasting accuracy of the learning model. Specifically, domain knowledge plays a more critical role in model construction than in the loss function, and applying it to both yields a better fit to the underlying epidemiological dynamics. We also recognize that no single form of domain knowledge fits epidemic forecasting perfectly in all real-world situations. Instead, domain knowledge that is more representative of the actual circumstances should be selected to achieve more accurate forecasting. We also discuss the impact of hyperparameters on the model: excessively small or large values can lead to underfitting or overfitting, respectively, so appropriate hyperparameters must be chosen. Finally, we analyze the model complexity and find that, compared with the baselines, MPSTAN requires fewer neural network parameters due to its greater integration of domain knowledge.
Our model achieves state-of-the-art or competitive results for different epidemic trends, but performance can still be improved in several respects. First, graph construction has a significant impact on the entire learning model, as it affects both the propagation of spatial information and the inter-patch interactions of the physical model. A reasonable graph structure is therefore crucial. Currently, we use the gravity model to construct the graph structure; it relies on prior knowledge but may overlook some latent information, so the graph relationships between patches may be captured incompletely. In addition, the graph relationships between patches change over time rather than being fixed. In the future, we will therefore incorporate latent graph information to construct a dynamic graph structure that better describes the interaction graph of epidemics. Furthermore, in the model construction, we currently simply connect the neural network results with the domain knowledge from the physical model without considering their respective roles or weights, which may also reduce accuracy. We will therefore carefully analyze the roles of the neural network and the domain knowledge in epidemic forecasting and explore more effective ways to fuse the two, such as introducing gating mechanisms.
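As an illustration of how a gravity model can define a static graph between patches, the sketch below assigns each pair of patches a weight that grows with their populations and decays with their distance. The exact functional form and exponents used in MPSTAN are not given in this section, so the form `pop_i**a * pop_j**b / dist**c`, the default exponents, and the function name are assumptions for illustration only.

```python
def gravity_weights(populations, distances, a=1.0, b=1.0, c=1.0):
    """Pairwise gravity-model weights between patches (illustrative form).

    populations: list of patch populations.
    distances:   square matrix of pairwise distances (diagonal unused).
    a, b, c:     hypothetical exponents for origin population, destination
                 population, and distance decay.
    """
    n = len(populations)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loop weight in this sketch
            weights[i][j] = (
                populations[i] ** a * populations[j] ** b / distances[i][j] ** c
            )
    return weights


# Example with three patches: nearby, populous pairs receive larger weights.
pops = [100.0, 200.0, 50.0]
dist = [[0.0, 10.0, 20.0],
        [10.0, 0.0, 5.0],
        [20.0, 5.0, 0.0]]
W = gravity_weights(pops, dist)
```

Because such weights depend only on fixed populations and distances, the resulting graph is static, which is the limitation that motivates a dynamic graph structure in future work.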

Fig. 1: Illustration of active cases on the US and Japan datasets.

(1) MPSTAN w/o Phy-All: Remove epidemiological domain knowledge from both the model construction and the loss function. We use only the spatio-temporal module for epidemic forecasting.
(2) MPSTAN w/o Phy-Loss: Remove epidemiological domain knowledge from the loss function. We only incorporate the knowledge into the model construction.
(3) MPSTAN w/o Phy-Model: Remove epidemiological domain knowledge from the model construction. We predict the physical model parameters in the output layer and incorporate the knowledge into the loss function.
(4) MPSTAN w/o Mobility: Incorporate epidemiological domain knowledge into the model without considering population mobility, mainly by using the SIR model instead of the MP-SIR model.
(5) MPSTAN w/o MPG: Remove the multiple parameter generators (MPG).
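The role of the loss-function term that the w/o Phy-Loss and w/o Phy-All variants remove can be sketched as a data loss on the neural network forecast plus an optional term on the forecast produced by the physical model. This is a generic physics-augmented loss, not necessarily the loss used in MPSTAN; the `weight` parameter, the flag name, and the use of mean squared error for both terms are assumptions.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def combined_loss(y_true, y_nn, y_phys, use_phy_loss=True, weight=1.0):
    """Data loss plus an optional physics-based term (illustrative).

    y_nn:   forecast produced by the neural network.
    y_phys: forecast obtained by rolling out the physical (epidemic) model
            with parameters predicted by the network.
    weight: hypothetical trade-off coefficient for the physics term.
    """
    loss = mse(y_true, y_nn)
    if use_phy_loss:
        # Penalize physical-model forecasts that deviate from observations,
        # which pushes the predicted parameters toward plausible dynamics.
        loss += weight * mse(y_true, y_phys)
    return loss
```

Setting `use_phy_loss=False` corresponds in spirit to training without domain knowledge in the loss function.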

Fig. 3: Samples of typical cities in the US and Japan datasets.

Table 1: Statistical information of the datasets.

Table 2: Performance comparison with baselines on the US dataset.

Table 3: Performance comparison with baselines on the Japan dataset.
and Table 5, where bold indicates better performance for the ablation model or MPSTAN. First, we analyze the effectiveness of domain knowledge in epidemic forecasting by comparing the performance of MPSTAN with that of MPSTAN w/o Phy-All on the two datasets. The results show that MPSTAN w/o Phy-All, which lacks domain knowledge, performs extremely poorly, highlighting the crucial role of epidemiological domain knowledge in epidemic forecasting.

Table 4: Ablation study on the US dataset.

Table 5: Ablation study on the Japan dataset.

Table 4 .
In Table 5, for short-term forecasting on the Japan dataset, MPSTAN performs worse than MPSTAN w/o Phy-Loss, which applies domain knowledge only to model construction, but still provides competitive forecasting. In long-term forecasting, MPSTAN outperforms the other two models. Overall, incorporating domain knowledge into both the model construction and the loss function better helps the model learn the basic dynamics of epidemic transmission and improves forecasting accuracy. Comparing MPSTAN w/o Phy-Loss and MPSTAN w/o Phy-Model on the two datasets, we find that the former performs better in all forecasting tasks, indicating that applying domain knowledge to model construction is more beneficial for accurate epidemic forecasting than applying it to the loss function. In addition, comparing MPSTAN w/o Phy-All and MPSTAN w/o Phy-Model, we find that using domain knowledge only to constrain the loss function may lead to