Network Autoregressive Model for the Prediction of COVID-19 Considering the Disease Interaction in Neighboring Countries

Predicting the way diseases spread in different societies has been thus far documented as one of the most important tools for control strategies and policy-making during a pandemic. This study is to propose a network autoregressive (NAR) model to forecast the number of total currently infected cases with coronavirus disease 2019 (COVID-19) in Iran until the end of December 2021 in view of the disease interactions within the neighboring countries in the region. For this purpose, the COVID-19 data were initially collected for seven regional nations, including Iran, Turkey, Iraq, Azerbaijan, Armenia, Afghanistan, and Pakistan. Thenceforth, a network was established over these countries, and the correlation of the disease data was calculated. Upon introducing the main structure of the NAR model, a mathematical platform was subsequently provided to further incorporate the correlation matrix into the prediction process. In addition, the maximum likelihood estimation (MLE) was utilized to determine the model parameters and optimize the forecasting accuracy. Thereafter, the number of infected cases up to December 2021 in Iran was predicted by importing the correlation matrix into the NAR model formed to observe the impact of the disease interactions in the neighboring countries. In addition, the autoregressive integrated moving average (ARIMA) was used as a benchmark to compare and validate the NAR model outcomes. The results reveal that COVID-19 data in Iran have passed the fifth peak and continue on a downward trend to bring the number of total currently infected cases below 480,000 by the end of 2021. Additionally, 20%, 50%, 80% and 95% quantiles are provided along with the point estimation to model the uncertainty in the forecast.


Introduction
SARS-Cov-2 (COVID-19) is on the rise and it is quickly infecting new people every day. Currently, two years after the onset of this pandemic, this ascending trend has not yet stopped and it is even multiplying in some countries [1]. When a person is determined to be infected with the disease in a country, there may be two possibilities where he is infected: 1.
The first case concerns the situation where both carriers and recipients of the disease are in the same country. This type of disease transmission is considered "local"; 2.
The second is for cases infected in another country and transferred to a second country by travel. This type of disease transmission is called "imported".
Communication among nations is one of the main causes of disease transmission, and is called disease interaction between countries in this paper. In addition to disease progress in target communities when examining its spreading profile, it is also of the utmost importance to reflect on its prevalence rate in other countries, including those with a high volume of travel [2]. The number of cases infected with this health condition can be thus deemed as a timeseries, taking account of the related statistics in the form of data over time [3].
In this regard, numerous researchers have thus far attempted to utilize a wide range of statistical tools to predict the number of cases of COVID-19 in the future to guide health care officials to make informed decisions [4]. For example, Shadabfar et al. used a susceptibleexposed-infected-vaccinated-recovered (SEIVR) model combined with the Monte Carlo (MC) sampling method to probabilistically investigate the COVID-19 spreading profile in the United States (USA) [5,6].
In general, different stochastic computations [7,8] and numerical methods [9][10][11][12][13][14] are exploited to assess the various aspects of the COVID-19 outbreak. In this sense, Katoch et al. used the autoregressive integrated moving average (ARIMA) model to forecast the COVID-19 dynamics in India [15]. Kumar Sahai et al. also modeled and predicted this pandemic via the ARIMA model [16]. Using the same ARIMA model, Malki et al. further predicted the second rebound of this disease; they also projected the end of the pandemic based on the ARIMA model [17]. Chaurasia et al. additionally used ARIMA and a regression model to forecast mortality rates in this respect [18]. Furthermore, Kumar et al. employed timeseries methods to analyze the COVID-19 spreading profile in ten affected countries [19]. Using the α-Sutte indicator and ARIMA, Attanayake et al. modeled COVID-19 [20]. Hernandez et al. correspondingly forecasted COVID-19 per region using the ARIMA model and polynomial functions [21]. Moreover, Yang et al. defined the data as timeseries and predicted the COVID-19 spreading profile in Wuhan, China [22].
Even though these studies have been to take advantage of different regression and optimization techniques to obtain the best fit of the data and consequently provide reliable timeseries forecasting, they typically suffer from one limitation, that is, their prediction remains independent of the disease spreading profile in other nations in the region. In fact, concerning the development trends of the disease interactions in neighboring countries, it seems ideal to measure the relationship between the disease spreading profiles in relevant nations to consider its impact on predicting the disease timeseries in target countries and regions.
To fill this gap, this paper utilizes a Network Autoregressive (NAR) Model. For this purpose, the COVID-19 data are initially retrieved from the World Health Organization (WHO) and the Johns Hopkins University online official websites and databases for seven different countries, namely, Iran, Turkey, Iraq, Azerbaijan, Armenia, Afghanistan, and Pakistan [23]. Thereafter, by constructing a network in the region, in which each vertex corresponds to a country and each edge represents the correlation of the total number of currently infected cases, the correlation matrix of the area is established. After that, the timeseries forecasting for Iran is performed using the NAR model, providing the number of infected cases up to December 2021. Comparing the root mean square error (RMSE) and mean absolute percentage error (MAPE) between autoregressive integrated moving average (ARIMA) and NAR models demonstrate that a better fit is obtained over the data once interactions among neighboring countries are taken into account. The method proposed in this paper can thus be implemented systematically to provide a reference for the investigation of the disease spreading profile in other countries and regions.
The rest of this study is organized as follows. Section 2 introduces the study area and then reviews the disease progression across the countries in the region concerned, from the onset of the COVID-19 pandemic in February 2020. Section 3 sheds light on the details of both methods implemented in this study, namely, the ARIMA and the NAR models, and subsequently describes how to consider the disease interactions in the neighboring nations in the proposed formulation. Next, in Section 4, the ARIMA and the NAR models are fitted to the existing data. In addition, upon comparing both methods, it is settled that the consideration of the disease interactions in the neighboring countries can enhance the prediction accuracy. Thus, the NAR model is employed to forecast the number of cases infected in Iran until the end of December 2021, and the results are reported. Then, in Section 5, the criteria for choosing the threshold are clarified in more detail. Finally, the contents are summarized and concluded in Section 6.

Target Region and Data Description
To implement this work, the records of the COVID-19 data from the WHO and Johns Hopkins University official websites are used [24]. It should be noted that the data reported by the WHO contain some uncertainty and do not reflect the complete and accurate status of the disease in society [25,26]. However, the approach presented in the current research is implemented based on the disease statistics provided by the WHO as the reference dataset. The authors do not claim that the prediction made in the paper is the real state of the disease in society but acknowledge that it will be the disease's future according to WHO data. The data show confirmed cases, daily recovery, and death rates. The total of currently infected patients is accordingly calculated as follows: As mentioned earlier, the primary purpose of this study is to address the impact of COVID-19 interactions in the neighboring countries on the timeseries forecasting model of the number of cases infected in Iran. As a result, some neighboring nations, including Turkey, Iraq, Azerbaijan, Armenia, Afghanistan, and Pakistan, are considered the target region here. The COVID-19 data from Turkmenistan are not publicly available, so they are not reflected in this study. A comparison of the geographic locations of these countries with Iran is further depicted in Figure 1. The timeseries of the rate of infected cases and infected cases in these nations as of 10 September 2021 are shown in Figure 2. A closer look at Figure 2 also reveals that different countries have so far experienced similar trends of this condition at the same time, which reinforces the hypothesis that the nations located in this region interact with the spread of the disease. For example, Iran and Turkey simultaneously experienced three peaks in March 2020, December 2020, and April 2021.

NARI Model
In this section, the NAR model for total infected people, hereafter referred to as NARI, is provided, and its characteristics are explained. First, the data on the infected are transformed to a new timeseries, after that the correlations between countries are calculated and a network is created over the countries in which each edge denotes the correlation between a pair of countries. This network is then introduced into the NARI model to predict the time history of the target country given the correlation values. Assuming N is the number of countries, the difference of logs of the total infected people, s it , is defined as follows: where δ is a constant, which is a hyper parameter in the model. Using trial and error in the countries concerned, the optimal value of δ is computed as 0.5. Next, expanding this equation gives: wherein infected(i, t) refers to the number of total infected cases from country i at time t.
The NARI model is formulated as follows: where (i) N is the number of countries; (ii) s it represents the difference of logs of infected cases from country i at time t; (iii) A = (a ij ) N×N shows the adjacency matrix of the correlation between the log-returns of N countries; (iv) n i is the sum of the ith row at the adjacency matrix A; (v) it follows the normal distribution.
The assumption of a normal distribution for the error term in Equation (4) has also been adopted in other studies such as [27]. The main idea behind assuming a normal distribution for the noise term is that with this assumption we have the smallest variance between all of the estimators. This assumption helps the algorithm to approximate the MLE in a straightforward process and to facilitate the time prediction process. For more details, refer to [28].
The cross-correlations between the infected cases in different countries are similarly considered in terms of matrix C, whose elements are given through the following equation [29]: where the brackets mean the temporal average over the infected cases during the time considered. Then, c ij can vary between [−1, 1]. The case of c ij = 1(−1) also denotes that two countries i and j are completely correlated (anti-correlated), while c ij = 0 implies that they are uncorrelated. Suppose that A = (a ij ) (N×N) is the adjacency matrix of the correlations among N countries; by adjusting a threshold as θ, −1 ≤ θ ≤ 1, the matrix A is defined as follows [30]: and n i = ∑ N j=1 a ij refers to the ith row sum of the adjacency matrix, and E( it ) = 0 and var( it ) = σ 2 . This threshold value and its calculation details are described in Section 5. For convenience, Equation (4) can be rewritten in a matrix form as: or in a concise form as: in which S t = (s 1t , ..., s Nt ) ∈ R N , A 0 = α1 wherein 1 = (1, ..., 1) , G = α 1 W + α 2 I in which W = diag{n −1 1 , ..., n −1 N }A, and I is an identity matrix. Under the NARI model framework, the model might be based on three factors; first, s it might be affected by itself but from the previous time point, s i(t−1) , called the autoregressive effect; second, s it might be influenced by its neighbors, which are collected by {j : a ij = 1}, labeled the "neighborhood effect". The unexplained variation should also be attributed to an independent random noise, it . For example, for a country i = 1 at t = 3, s 13 is as follows: Therefore, α 2 s i(t−1) is not incorporated into the first term. It can be proved that S t has a stationary property (for more details see [31]). To estimate α = (α 0 , α 1 , α 2 ), maximum likelihood estimation (MLE) is also used as follows: For estimating the unknown parameter α, the NARI model (Equation (4)) is rewritten as: Then, it is written as: where , then the above model can be written as: Then, a maximum likelihood (ML) estimator in the logarithmic form can be obtained as follows: Differentiating this expression with respect to α, the ML estimates will be as follows: Substituting Equation (13) into the estimatorα in Equation (16), there is:

ARIMA Model
Box and Jenkins [32] published a technique to merge both autoregressive (AR) and moving average (MA) models, called the ARMA (p, q) model, as a union of AR (p) and MA (q) models, generally deployed for univariate timeseries modeling. The ARMA (p, q) model is thus presented as follows: where the θ 1 , . . . , θ q and ϕ 1 , . . . , ϕ p , are the parameters of the model and ε is the white noise. If the series is not stationary at the first level, there is a need to subtract it by d (d = 1, 2, 3, . . .) times to make it stationary. Such a timeseries model is called an ARIMA (p, d, q) model. There are three steps in ARIMA model creation, namely identification, parameter estimation, and diagnostic checking [33,34]. In this regard, for the identification process of the model, after checking the stationarity of the timeseries, the AR and MA terms are derived from the auto-correlation function (ACF) plot. ACF is a statistical metric of the correlation that is used to check if previous values in the timeseries analysis have a certain relationship with the latest values or not [35]. After that, ARIMA parameters, namely (p, d, q), are estimated by the least square method. The three main methods commonly used to select appropriate models are Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Second-order Akaike's Information Criterion (AICc), which are presented in Equations (19)- (21) for AIC, BIC and AICc, respectively [32,36].
where n refers to the size of the series and k denotes the number of the parameters of the ARIMA method. It is experimentally proved that the given model becomes efficient when the AIC value is smaller. According to [34], an optimal forecast model is selected based on the best fitting that has the minimum AIC value of the group.
To evaluate the prediction models, the following statistical measures are used for i(1, · · · , 7) as follows: where s it denotes the actual value andŝ it and T are the modeled values and the total number of days.

ARIMA Model Results
The ARIMA model is used in this study as a benchmark to compare the results with the proposed NAR method. To implement the ARIMA model on it, the data are split from 15 February 2020 to 10 September 2021 into two parts; the first part, the training dataset, from 15 February 2020 to 20 May 2021, and the second part, the testing dataset, from 21 May 2021 to 10 September 2021. This data division process is applied to the COVID-19 data of all countries. Therefore, it is shown once in Figure 3. Considering the training part, the ARIMA model is conducted; then, the results are validated with the testing dataset once the parameters are estimated.  After splitting the data into the training and test sets, they are transformed to s it for smoothing purposes as expressed in Equation (2). If the data are found to be non-stationary for each country; they are made so by subtracting them from the previous day. The number of times the timeseries data become stationary through difference disposal becomes the value of parameter d.
Upon stabilizing the data, the parameters are estimated. First, the ACF of the dorder difference timeseries is calculated. The order of the Auto-correlation Function (ACF) exceeding the confidence boundary lag also becomes the value of q. Second, the value of p is computed, which is the order of the Partial Auto-correlation Function (PACF) exceeding the confidence boundary lag. PACF gives the partial correlation of a stationary timeseries with its own lagged values. By observing the ACF and PACF of the residuals, it is determined whether they are white noise or not. Consequently, the fit of the model is assessed by checking the R 2 value. Ultimately, the model is validated and evaluated by applying the ARIMA method to predict the remaining 10% of the data. After that, the RMSE is used, as explained in Section 2, to evaluate the model. The whole process is depicted in Algorithm 1. The ARIMA models are fit for the datasets of each country to compare them with the proposed model. Both Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests also authenticate that the training timeseries are stationary at a 5% significance level. Accordingly, there is a need to apply the differencing method two times for Iran's dataset. Later, diverse models are designed by adjusting various parameters for the MA and AR components of the ARIMA model, as summarized in Table 1. The ARIMA (2,2,2) model additionally assumes AICc criteria. Therefore, the chosen model is checked for some assumptions. The residual analysis of the model is presented in Figure 4. The Ljung-Box test on the residuals, as well as the squared residuals, is also statistically significant at a 5% level with p-value = 0.9153. Therefore, the selected model with the minimum summary measures is appropriate with the lowest AIC, BIC, and AICc values.   The same approach is implemented in the training data series of other countries. The best model for each country is also selected, which has minimum AIC, BIC, and AICc measures, whose results are summarized in Table 2. The residual analysis of the best model in each country is depicted in Figures A1-A6 in the Appendix A.

NARI Model Results
In this section, the constructed NARI model is deployed on the COVID-19 dataset. As mentioned in the previous section, the data are split into two subsets of training and test datasets, as shown in Figure 3. The NARI model is then applied to the training dataset of each country. For this purpose, the training dataset is first transformed into a smooth timeseries, calculated by Equation (2). Then, the correlation matrix of countries is created by Equation (5), as reported in Figure 5. This allows the NARI model to be established for the whole network using Equation (4), given the parameter α as α = (0.014, 0.005, 0.79). The overall process of estimating α is presented in Algorithm 2.  Change the threshold and go to the previous step; end end Upon estimating the parameter θ, the constructed NARI model needs to be validated. For this purpose, the 50 days are predicted using the constructed ARIMA models in the previous section and are then compared with the 50 days of the test dataset through the RMSE and MAPE metrics. The same procedure is also repeated for the NARI model, and the results are subsequently compared. The outcomes for both models are summarized in Table 3, wherein the NARI model outperforms the ARIMA one. The graphs of the transformed data of the total infected cases in Iran and the modeled data using the NARI and ARIMA models are shown in Figure 6. Upon finding the best COVID-19 modeling, the next 110 days of the disease are predicted according to the infected cases. The forecast results are plotted in Figure 7. Timeseries forecasting for the spreading disease profile can be implemented considering disease interactions in different countries or without any correlation among them. Using the NARI model, considering these interactions, helps the constructed model to examine one of the most critical characteristics of disease transmission between societies and significantly enhances the accuracy of the timeseries prediction. This issue can be assessed by comparing the two models of ARIMA and NARI. It is shown in Table 3 that the RMSE for Iran, considering the disease interactions among neighboring countries, equals 3.06 for the NARI model and without considering it equals 5.42 for the ARIMA model. This means that the deliberation of the disease communications in neighboring countries promotes the prediction certainty substantially. Therefore, more reliable determinations can be made by policymakers to control the disease. It is of note that this prediction is based on the current pandemic situation. The results will also be altered in the case of events that change the present circumstances, such as enforcing stricter social distancing or wider public vaccination.
The forecast shows that the disease trend in Iran has passed the fifth peak, and the downward trend of the disease will continue after September 2021, so that the total number of infected cases per day will fall to less than 480,000 by the end of 2021. This is the point estimation prediction obtained from the NARI model, which is the most statistically possible case. However, Figure 7 also provides the quantiles of 20%, 50%, 80% and 95%, indicating the prediction uncertainty. The lower bounds of the 20%, 50%, 80% and 95% quantiles, respectively, indicate that the downward trend could be relatively more severe, bringing the total number of infected cases below 390,000, 320,000, 220,000 and 130,000 by the end of 2021. In addition, the upper limits of these quantiles indicate a possible new peak in Iran's new COVID-19 data. In this case, the total number of infected cases corresponding to 20%, 50%, 80% and 95% quantiles may reach 560,000, 640,000, 730,000 and 820,000, respectively, by the end of 2021.

Discussion
In Section 3, it was discussed that the adjacency matrix, representing the disease interaction among nations, is formed by adopting a threshold and comparing it to the correlation matrix. In order to determine this threshold and explain how to implement the process, additional analysis is required, which is discussed in this section.
To explain the approach adopted to compute the correlation threshold, it should be explained that there are two constraints to meet. First, none of the n i in Equation (7) should be zero; otherwise, an infinity term would appear in this equation. Besides, a value of n i , which can minimize the RMSE, is preferred as it helps the algorithm gain better accuracy.
To implement an algorithm that can satisfy these two conditions, the threshold value is defined as a decision variable. An external loop is then added to the main algorithm to change the value of θ and calculate the corresponding n i and RMSE. The results for different θ values from 0 to 1 with an increment of ∆θ = 0.1 are reported in Table 4. As seen, θ values greater than 0.6 give infinity values for n i , thus cannot be selected as a correlation threshold in the algorithm. Moreover, for the rest of the cases, θ = 0.5 gives the minimum amount of RMSE. Hence, it is selected as the optimal case and is utilized in the model implementation.

Summary and Conclusions
In this paper, the COVID-19 spreading profile in Iran is predicted in view of the influence of the severity and correlation of the disease in neighboring countries. To this end, the timeseries of COVID-19 infection among seven countries in the region, including Iran, Turkey, Iraq, Azerbaijan, Armenia, Afghanistan, and Pakistan, are downloaded from the online databases provided by the WHO and Johns Hopkins University. Then, a network is formed in the region to establish the correlation matrix among the countries concerned. Furthermore, by incorporating the correlation matrix into the proposed formula and calculating the model coefficients, the NARI model is used to predict the number of infected cases in Iran up until the end of September 2021, taking into account the impact of the disease in neighboring countries. The main results obtained in this study are as follows: 1. The correlation matrix obtained from the network of the countries in the region shows that the greatest impact of COVID-19 on Iran comes from Iraq, Turkey, Pakistan, Azerbaijan, Afghanistan and Armenia, with correlation coefficients of 0.86, 0.83, 0.64, 0.56, 0.55, 0.16, respectively. This result can also be seen in the trend of infected cases. The increasing/decreasing trend and the number of disease peaks in Iran, Iraq, and Turkey are very similar and have occurred within a short period of time. This indicates that the proposed correlation criterion is able to capture the similarity between infected data and disease peaks; 2. Timeseries predictions can be made with or without considering disease interactions in different countries. Incorporating the disease interaction not only helps the algorithm assess one of the most important components of disease transmission between societies but also significantly increases the accuracy of the timeseries prediction. This issue can be examined by comparing the two models of ARIMA and NARI. The RMSE with and without considering the disease interactions among neighboring countries is equal to 5.42 and 3.06 for ARIMA and NARI, respectively. This means that the consideration of the disease interactions in neighboring countries improves the prediction accuracy considerably. As the model's accuracy in predicting disease increases, more reliable tools are provided for policymakers to take informed controlling decisions; 3. The point estimation obtained from the NARI model indicates that the number of infected cases in Iran declines after September 2021, so the total currently infected cases will fall below 480,000 by the end of 2021. According to the prediction corresponding to the lower limit of 20%, 50%, 80%, and 95% quantiles, the total number of infected persons will fall below 390,000, 320,000, 220,000 and 130,000, respectively, by the end of 2021.
Iran's close neighbors, sharing common borders, and their impacts on the COVID-19 spreading profile in Iran are examined in this paper. However, ideally, more distant countries in the region that have direct or indirect demographic relationships with Iran can be also considered. Such a high volume of interactions between the countries requires the construction of a larger network to cover more countries and to subsequently provide a more reliable prediction. Such a model imposes more complexities on the problem, making the prediction results more accurate and reliable. Moreover, various factors, such as hospitalization, social distancing, quarantine, and so forth, can affect the number of people infected with COVID-19 in a society. However, the spreading profile of disease under the effects of the involved factors is not in the scope of the current research. Simulating the disease spread, taking into account the factors involved, requires establishing a system of differential equations in a so-called compartmental model and solving it incrementally to simulate the disease profile in the future. This topic is under investigation by the authors. Funding: The work described in this paper is supported by the Faculty of Economics and Business Administration of Yibin University. Data Availability Statement: Data, models, or code that support the findings of this study are available from the authors upon reasonable request.