Analysis of the Spread of COVID-19 in the USA with a Spatio-Temporal Multivariate Time Series Model

With the rapid spread of the pandemic due to the coronavirus disease 2019 (COVID-19), the virus has already led to considerable mortality and morbidity worldwide, as well as having a severe impact on economic development. In this article, we analyze the state-level correlation between COVID-19 risk and weather/climate factors in the USA. For this purpose, we consider a spatio-temporal multivariate time series model under a hierarchical framework, which is especially suitable for envisioning the virus transmission tendency across a geographic area over time. Briefly, our model decomposes the COVID-19 risk into: (i) an autoregressive component that describes the within-state COVID-19 risk effect; (ii) a spatiotemporal component that describes the across-state COVID-19 risk effect; (iii) an exogenous component that includes other factors (e.g., weather/climate) that could envision future epidemic development risk; and (iv) an endemic component that captures the function of time and other predictors mainly for individual states. Our results indicate that maximum temperature, minimum temperature, humidity, the percentage of cloud coverage, and the columnar density of total atmospheric ozone have a strong association with the COVID-19 pandemic in many states. In particular, the maximum temperature, minimum temperature, and the columnar density of total atmospheric ozone demonstrate statistically significant associations with the tendency of COVID-19 spreading in almost all states. Furthermore, our results from transmission tendency analysis suggest that the community-level transmission has been relatively mitigated in the USA, and the daily confirmed cases within a state are predominated by the earlier daily confirmed cases within that state compared to other factors, which implies that states such as Texas, California, and Florida with a large number of confirmed cases still need strategies like stay-at-home orders to prevent another outbreak.


Introduction
The pandemic due to the coronavirus disease 2019 , caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1], was the most disastrous incident in 2020, causing millions of deaths and resulting in economic activity worldwide falling sharply. According to the latest report of the World Health Organization (WHO), the cumulative cases around the world reached 28,637,952 and the cumulative deaths were 917,417 as of 13 September 2020 (https://www.who.int/docs/default-source/ coronaviruse/situation-reports/20200914-weekly-epi-update-5.pdf?sfvrsn=cf929d04_2). Furthermore, the World Bank suggested that most countries would be expected to suffer from economic recession in 2020 (https://www.worldbank.org/en/news/feature/2020/06/08 /the-global-economic-outlook-during-the-covid-19-pandemic-a-changed-world).
Although some countries and regions (e.g., North America, China, and Europe) are actively developing vaccines with some showing encouraging signs [2][3][4], it is almost impossible to provide sufficient effective vaccines to every person in the next few years. Hence, this pandemic will undeniably last several months or even a few years. As one of the most developed countries in the world, the United States declared a public health emergency on 31 January 2020, and preventive and proactive measures (e.g., suspending the entry and the quarantine of foreign nationals seeking entry) have been taken to control the spread of the virus and treat those affected. However, it has become one of the most severely affected nations as the respective numbers of confirmed cases and deaths approximately account for 1/5 (i.e., 6,386,832 and 191,809) of the whole global cases as of 13 September 2020. Even worse, as Chowell and Mizumoto [5] argued, states and territories with the largest proportions of older populations (such as Florida, Maine, and Puerto Rico) have become the places with the largest number of confirmed cases. The spread of this pandemic in the USA has become a global concern.
Typically, susceptible-infected-recovered (SIR) based models (e.g., SIRD [6]), first proposed by Kermack and McKendrick [7], are widely used models due to their simplicity and good performance. However, these models (such as susceptible-infected-recoveredsusceptible (SIRS), susceptible-exposed-infected-recovered (SEIR), and susceptible-exposedinfected-recovered-susceptible (SEIRS)) only take into account the tendency of the related epidemic transmission corresponding to one single region, and other useful information can hardly be uncovered, which includes the impacts derived from the place itself, other areas, and other exogenous variables. Some other models that are utilized to characterize epidemic pervasion are based on time series models. For instance, seasonal autoregressive integrated moving average (SARIMA) models were employed for modeling infectious disease count data in Helfenstein [8] and Trottier et al. [9]. Recently, different time series models (e.g., auto-regressive integrated moving average (ARIMA), the Holt-Winters additive model, and HWAAS) and machine learning approaches (e.g., Prophet, DeepAR, and N-Beats) have been adopted to analyze and compare the prediction accuracy of the percentage of active cases per population based on the COVID-19 data from ten countries with the highest number of confirmed cases as of 4 May 2020 [10].
Held et al. [11] proposed a space-time multivariate time series model (denoted as the HHHmodel) that can be applied to model multiple-unit cases where the "unit" can be different geographical regions, different age groups, or different epidemics caused by different pathogens. Motivated by the HHH model, Paul et al. [12] and Paul and Held [13] developed a spatio-temporal framework to jointly model several epidemics by considering the spatial interaction effect, as well as the time autoregressive effect. Their models have been applied to analyze the transmission of dengue fever in Guandong Province in China in 2014 [14], malaria and cutaneous leishmaniasis analysis in Afghanistan [15], hemorrhagic fever with renal syndrome in Zhejiang Province of China [16], and the effect of containment measures for COVID-19 in Italy [17].
One drawback of the model proposed by Paul et al. [12] and Paul and Held [13] is that it mainly takes care of the connection between the current number of infected cases and the previous numbers of infected cases and the adjacent units/areas, which may ignore other exogenous predictors. As a result, it is limited in its interpretability and applicability for infectious diseases such as COVID-19. In fact, recent studies (e.g., [18,19]) have pointed out that some weather/climate related variables show statistically significant associations with the transmission of COVID-19.
To overcome this limitation for the analysis of state-level time series of the COVID-19 contagion effect, we consider a spatio-temporal framework based on the multivariate time series model proposed by Paul et al. [12] and Paul and Held [13], which however decomposes the COVID-19 risk additively into autoregressive, spatiotemporal, exogenous, and endemic components. Briefly, the autoregressive and spatiotemporal components respectively describe the within-state and across-state COVID-19 risk effects . The exogenous component includes other factors that could affect future epidemic development risk,  while the endemic component captures the function of time and other predictors mainly  about individual states. Briefly, some weather/climate related variables are carefully selected as exogenous factors in our analyses. Indeed, some climate/weather related variables have been shown to be correlated with epidemic transmission in the relevant literature. For instance, in a study of the influence of weather on the foot-and-mouth disease epidemic spread from 1967 to 1968, Hugh-Jones and Wright [20] argued that wind and precipitation played a major role in the spread of the disease, especially wind. According to Tan et al. [21], the environmental temperature can influence the spread of SARS. Qi et al. [18] found that the daily average temperature and daily average relative humidity are significantly negatively associated with the daily confirmed cases of COVID-19 in Hubei, China. Similarly, Tosepu et al. [19] found that the average seasonal temperature was significantly correlated with COVID-19 in Jakarta, Indonesia.
The rest of this article is organized as follows. In Section 2, we elaborate our spatiotemporal multivariate time series model, including the sub-models of each coefficient in the model. In Section 3, we employ our model for analyzing the COVID-19 count data of the USA and show our main findings. In Sections 4 and 5, the discussion and conclusions are presented, respectively. We report the technical materials for parameter estimation in Appendix A.

Development of the Model Models
Let Y r,t denote the number of infected cases in state r at time point t with r = 1, · · · , R, t = 1, · · · , T. Usually, Y r,t is assumed to follow a Poisson distribution [22][23][24]. Since the number of infected cases in each state is hardly totally observed (i.e., the existence of heterogeneity for different states), employing the Poisson assumption could underestimate the underlying dispersion. Here, we adopt the negative binomial distribution [11,12,14]. That is, suppose Y r,t follows a conditional negative binomial distribution, i.e., Y r,t |Y ·,t−l , V ∼ NegBin(µ r,t , ε r ) for r = 1, · · · , R, t = 1, · · · , T, with conditional mean µ r,t and conditional variance: σ 2 r,t = µ r,t (1 + ε r µ r,t ), where Y ·,t−l indicates the vector consisting of the number of infected cases of all states at time point t − l, l is the time lag term satisfying l ∈ {1, · · · , T − 1}, ε r is the overdispersion parameter of state r, and V is a random effect vector with V ∼ N(0, Σ) with Σ = diag{σ 2 (λ) , σ 2 (ψ) , σ 2 (θ) , σ 2 (ζ) } ⊗ I R×R , ⊗ being the Kronecker product and I R×R an R × R identity matrix. It is easy to see that when ε r equals zero, the distribution of Y r,t reduces to a Poisson distribution, whereas the larger the value of ε r , the greater the overdispersion is. Thus, comparing with the Poisson assumption, the negative binomial assumption has wider applicability.
To embed other predictors in the distribution of Y r,t , a hierarchical modeling procedure is employed here. In the first layer, the conditional mean µ r,t is formulated as follows: where: Ψ r,t−l = ∑ j→r ω r,j Y j,t−l , Θ r,t−l = ∑ j η r,j x j,k,t−l , j = 1, · · · , R, k = 1, · · · , K.
Here, x j,k,t denotes the observation at time t of the k-th exogenous factor in state j, which could have an influence on state r. η r,j is an indicator with the value being 1/m j if x j,k has influence on state r and zero otherwise, where m j is the number of factors that have an influence on the number of cases of the j-th state. j → r indicates that states j and r are neighbors that share the same border. ω r,j is an indicator with the value being 1/n j if state r is adjacent to state j and zero otherwise, where n j is the number of states that have a common border with state j. Other choices of weights (i.e., ω r,j ) are also available in [12][13][14] and the references therein.
According to Giuliani et al. [25], λ r,t Y r,t−l , ψ r,t Ψ r,t−l , and ζ r,t are respectively called the epidemic-within component, epidemic-between component, and endemic component. In this paper, we adopt the terminology from Giuliani et al. [25], and we further call θ r,t Θ r,t−l the epidemic-boosted component. It is noteworthy that Ψ r,t−l , which is based on the spacetime dimension, mainly includes the interaction information between one state and other states neighboring that state, while Θ r,t−l contains the correlation information of other exogenous factors between one state and other states neighboring that state. Comparing with the model proposed in Paul et al. [12] and Paul and Held [13], our proposed model in (1) improves the interpretability, as well as the applicability.
In the second layer, for parameters λ r,t , ψ r,t , and ζ r,t , we adopt the same strategy as given in Paul et al. [12] and Paul and Held [13]; that is, each parameter assumes the following log-linear form: where V (·) is assumed to have a multivariate normal distribution with zero mean and covariance matrix σ 2 (·) I R×R , i.e., V (·) ∼ N(0, σ 2 (·) I R×R ). We further discuss the formulation of each parameter as follows.
We first consider autoregressive parameter λ r,t . As suggested in Paul et al. [12], Cheng et al. [14], Adegboye et al. [15], and Wu et al. [16], λ r,t is formulated by the following log-linear form: where α (λ) is related to the intercept term and V (λ) r ∼ N(0, σ 2 (λ) ). For ψ r,t , it satisfies: where α (λ) is related to the intercept term and V (λ) r ∼ N(0, σ 2 (λ) ). For the choice of Pu r,t , unlike Paul et al. [12] and Paul and Held [13], Giuliani et al. [25] argued that it should be a variable that reflects the possible heterogeneous influence for different regions, and their choice was the population of a state. However, we believe that people of different ages could be significantly divergent under the consideration of the infection effect from the population. In this regard, we define Pu r,t as the population size of people whose ages are under 65 in state r at time t based on the fact that this group of people is more likely to travel to other places.
For θ r,t , the log-linear formula may not be suitable as the influence of such exogenous variables could have a positive or negative effect on epidemic transmission, i.e., the sign of θ r,t could be "+" or "−". Here, we suppose that θ r,t follows a normal distribution with mean α (θ) and variance σ 2 (θ) , i.e., where V (θ) r ∼ N(0, σ 2 (θ) ). Giuliani et al. [25] employed a second-order polynomial log-linear regression to evaluate the fluctuation of the number of confirmed cases from the perspective based on the time dimension. This is reasonable when the epidemic is in the early stage (i.e., t is relatively small). However, with the development of the pandemic, the reproduction number will inevitably tend to be small as the population for a specific state is limited.
Thus, we suggest to use the s-shaped growth curve-logistic growth model-which was also employed to study age-specific case-fatality rates of COVID-19 in China and Italy [26]. That is, we consider: where: is a logistic function and Po r,t is defined as the population size of people whose ages are over 65 years old in state r at time t.

Data of Interest
Study area: Here, we consider the 50 states plus Washington, D.C. (DC), for our COVID-19 analyses. However, American Samoa, Guam, the Northern Mariana Islands, the Commonwealth of Puerto Rico, and the Virgin Islands are excluded from our study for simplicity.
COVID-19 data: We obtained the state-level confirmed cases data on COVID-19 in the USA from Kaggle, which are available from https://www.kaggle.com/sudalairajkumar/ covid19-in-usa. Here, we are mainly interested in the cumulative positive cases, as this will be used for the calculation of the number of daily increased COVID-19 cases in the USA. On 14 March 2020, the U.S. President held a coronavirus conference, one day after he declared the pandemic a national emergency (https://www.rev.com/blog/transcripts/ donald-trump-coronavirus-press-conference-transcript-march- 14). For this reason, we consider those data starting from March 15th, 2020.
Weather/climate data: The state-level weather and climate data we use in this paper are openly available from Kaggle, which is fully powered by Dark Sky and can be downloaded from https://www.kaggle.com/eeemonts/weatherclimate-data-covid19?select= csv. The factors included in our analyses are the maximum temperature (MaT), minimum temperature (MiT), humidity (Hu), the probability of precipitation appearance (PA), the percentage of cloud coverage (CC), sea-level air pressure (AP), wind speed (WS), and the columnar density of total atmospheric ozone (CDTAO). Consistent with the COVID-19 data, we only consider weather/climate data starting from March 15th, 2020.
Population data: Both the state-level population data of 2019 and the state-level population percentage of people over 65 years old were collected from Population Reference Bureau (PRB), which is available from (https://www.prb.org/usdata/indicator/age65/ snapshot). Since the direct accessibility of the population over 65 years old is denied, we simply used the state-level population of 2019 multiplying the related percentage over 65 years old to get the approximation of the state-level population over 65. Figure 1 depicts the daily confirmed cases and the cumulative confirmed cases in the USA. One can see that the cumulative confirmed cases and daily confirmed cases in some states such as Connecticut (CT), New Jersey (NJ), and New York (NY) have eased up; some states like North Dakota (ND) became more and more severe; and others tended to recur. Figure 2 shows state-level subplots of the cumulative confirmed cases on September 15th, the daily confirmed cases on September 15th, the total populations in 2019, the populations over 65, and the populations under 65. It appears that both the cumulative confirmed cases and the daily confirmed cases have a strong consistency with the populations under 65, as well as the populations over 65. The appearance of a high infection rate among individuals under 65 could mean that there is a trend of younger people having severe COVID-19 infections in the USA as warned by Kass et al. [27], which can cause much worse situations (e.g., infecting more older adults) [28].

Weather and environmental factors selection 167
To explore the association between weather/climate based factors and COVID-19 transmission, 168 Tosepu et al. [19] applied the Spearman-rank correlation test while Bashir et al. [29] utilized the Kendall 169 and Spearman-rank correlation tests. Here, we first use the Kendall and Spearman-rank correlation 170 tests to identify factors that have significant correlation with daily confirmed cases.

Weather and Environmental Factors' Selection
To explore the association between weather/climate based factors and COVID-19 transmission, Tosepu et al. [19] applied the Spearman-rank correlation test, while Bashir et al. [29] utilized the Kendall and Spearman-rank correlation tests. Here, we first use the Kendall and Spearman-rank correlation tests to identify factors that have a significant correlation with daily confirmed cases. Figure 3 shows the results from the Spearman and Kendall tests. According to the scatter plots, the p-values associated with MaT, MiT, Hu, CC, and CDTAO with respect to all states are mostly below 0.10, whereas those associated with AP and PA are mostly over 0.10. On the other hand, results from bar plots suggest that no particular factor has a remarkable association with the majority of the states. Briefly, AP and PA only have an influence on less than 25 states, while MaT, MiT, and CDTAO have a strong association with more than 40 states. According to Figure 4, it can be safely concluded that MaT, MiT, and CDTAO are the major factors that contribute to the strong associations in most of the states and are taken into account for further modeling analyses.

Optimal Parameters' Determination
For better fitting the confirmed cases in different states, the optimal time lag l needs to be determined. Considering that the desirable range for l is dynamically increased, point-wise optimization seems too tedious and less efficient. Thus, we adopted the possible range from one to 14, which is the time interval that the CDC suggested to stay at home after one's last contact with a person who has COVID-19 (https://www.cdc.gov/ coronavirus/2019-ncov/if-you-are-sick/quarantine.html). Besides, all weather/climate factors were respectively nondimensionalized by Studentization to mitigate the impact of the different units. Table 1 summarizes the correlation between different time lags and the related penalized log-likelihood values. It is noticed that when l = 2, the average estimate of l(π, v) has the largest value (i.e., 48,510,611.518) with the smallest sd (i.e., 301,043.787). Hence, l = 2 will be used in all subsequent analyses.   Table 2 shows the estimates of π and σ based on the configuration for time lag l = 2. From the estimates of the dispersion parameters ε's, we observe that the daily confirmed cases of COVID-19 show obvious overdispersion in almost every state (especially in New Mexico (NM)), which confirms that using the negative binomial distribution is a sensible choice for analyzing the transmission of COVID-19 in the USA.
The estimates of σ characterize the heterogeneity of COVID-19 transmission across states. According to the results in Table 2, there is spatial variation concerning the epidemicwithin component withσ λ = 0.851, the epidemic-between component withσ (ψ) = 0.836, the epidemic-boosted component withσ (θ) = 0.816, and the endemic component witĥ σ ζ = 0.853. Therefore, we believe that there is significant spatial heterogeneity in the epidemic-within, endemic-between, epidemic-boosted, and endemic component. Table 2. Optimal parameter estimates with time lag l = 2. One-hundred different initial processes are randomly implemented. The related mean and standard deviation (sd) are calculated based on 100× outcomes.  Estimates  Figure 5 shows the state-level estimated random effects with respect to epidemicwithin component, epidemic-between component, epidemic-boosted component, and endemic component. Clearly, heterogeneity appears in all components, and the random effects from the four components have a significant effect in most of the states. Here, we mainly focus on random effects from the epidemic-within component and epidemicboosted component. From Subplot (a) in Figure 5, the estimates of α (λ) + V (λ) for most states are smaller than zero, which implies that community-level spread in most states has considerably alleviated from the perspective of epidemic-with component. Similarly, the estimates of α (λ) + V (λ) are smaller than the negative estimate of α (θ) (see Subplot (c) in Figure 5), which indicates higher values of weather/climate factors corresponding to less daily confirmed cases. The estimates of σ characterize the heterogeneity of COVID-19 transmission across states.

196
According to the results in Table 2, there is spatial variation concerning the epidemic-within component   Figure 1, we find that with relatively less confirmed cases, the endemic component seems to be more dominant in these states, which is also shown in CT, NH, and RI.

Discussion
In this article, we find that MaT, MiT, and CDTAO have statistically significant associations with daily confirmed cases in almost all the states in America, based on both the Kendall and Spearman-rank correlation tests. Furthermore, from the estimated coefficients of the epidemic-boosted component, we identify that this association is negative. That is, higher MaT, MiT, and CDTAO correspond to smaller daily confirmed cases. However, further analysis uncovers that the previous daily confirmed cases in one state itself are generally predominant for the next confirmed cases, which suggests that states with a large number of confirmed cases tend to cause more infections.
Recent research [18,19] has shown some evidence of the correlation between weather/ climate factors and COVID-19 transmission. Our work, based on an extended multivariate time series model, further confirms and quantifies the existence of a similar relationship. Unlike some existing models (e.g., [10]), our model successfully facilitates the interpretability and practicability by an additional term that characterizes the degree of the influence from some other external factors. However, one obvious drawback of our model is that a large number of unknown parameters need to be estimated, and the computational cost is therefore high. As a result, the effective sample size also needs to be sufficiently large. Recently, Kimball et al. [30] investigated a COVID-19 outbreak in a long-term care skilled nursing facility (SNF) in King County, Washington, recognized on 28 February 2020, and discovered that screening the SNF residents based on the symptoms related to this epidemic could not to discover all SARS-CoV-2 infections since they found that 23 (30.3%) workers had SARS-CoV-2 positive tests even if they were asymptomatic or presymptomatic on the testing day. Such clustering based infections dramatically aggregate the transmission intensity, making the analysis based on not only our model, but others previously mentioned non-trivial challenges. More recently, Rader et al.'s analysis [31] unveiled that "epidemics in crowded cities are more spread over time, and crowded cities have larger total attack rates than less populated cities". Such phenomena further imply that an area with a higher population density could cause a much severer outbreak, which is also a situation that requires further investigation.

Conclusions
As COVID-19 has become the most disastrous health event in the world, especially in the USA by far, so understanding the transmission pattern of this pandemic has become more urgent. Our analyses of the COVID-19 surveillance data depict remarkably heterogeneous transmission across states during the COVID-19 outbreak in the USA from 15 March 2020 to 15 September 2020. The degree of heterogeneity is characterized by random effects parameter estimates. With the Kendall and Spearman-rank correlation tests, we explore the association between weather/climate factors and daily confirmed COVID-19 cases for each state, which is further used for the analysis of the spatial and temporal occurrence of COVID-19.
Some interesting findings are noteworthy. First, the heterogeneity of COVID-19 transmission across states is observed in all four components, which implies that there are different situations in different states and the same strategies may not work perfectly to contain this pandemic in all states.
Second, some weather/climate factors (i.e., CC, Hu, MaT, MiT, and CDTAO) demonstrate significant correlations with daily confirmed cases in many states. In particular, MaT, MiT, and CDTAO have a strong association with most states. Based on the estimated coefficients from the epidemic-boosted component, one can further find that these variables correspond to daily confirmed cases with a negative correlation in almost all states, i.e., higher MaT, MiT, and CDTAO correlate with less daily confirmed cases. This phenomenon suggests that climate change in the local and adjacent areas could affect the possibility of infection in this area.
Third, since the estimates of the epidemic-within component in most states are predominant, their corresponding values can represent the fluctuation of the daily confirmed cases. Since the estimated coefficients of the epidemic-within component in most states are smaller than one (exp α (λ) + V (λ) r < 1), the community-level spread of COVID-19 in most states is remarkably mitigated and the transmission intensity is decreased. Furthermore, we believe that the relatively large number of daily confirmed cases in the current stage are mainly due to the previous large number of infected cases, and the number of new confirmed cases per day will gradually decrease as time goes on.
Fourth, for the future tendency of the daily confirmed cases, the influence of the previous confirmed cases is the most important among the four components. This means that for states like Texas (TX), which has a large number of confirmed cases, the risk of a sharp increase of the daily confirmed cases is still higher than other states with less confirmed cases. Therefore, there is no doubt that regulations like social distancing or wearing masks in public places are clearly necessary.
Final, since the endemic components for some states (e.g., Vermont (VT)) show obvious predominance, other possible variables that could influence COVID-19 transmission need to be further determined. ( µ r,t ε r 1 + µ r,t ε r ) y r,t ( 1 1 + µ r,t ε r ) 1 εr · (2π) −2R |Σ| −1/2 exp(− 1 2 v Σ −1 v) ∝ ∏ r Γ(y r,t + 1 ε r ) Γ( 1 ε r ) ( µ r,t ε r 1 + µ r,t ε r ) y r,t ( 1 1 + µ r,t ε r ) 1 εr or the logarithm of the conditional likelihood function is: log Γ(y r,t + 1 ε r ) − log Γ( 1 ε r ) + y r,t log( µ r,t ε r 1 + µ r,t ε r ) where π = (α , β , ε ) , | · | denotes the determinant of a matrix and Γ(·) represents the gamma function. Since (A2) includes the random effects, maximizing the above equation with respect to π and v does not generate the log-likelihood, but the penalized log-likelihood or posterior mode estimates from a Bayesian viewpoint [32]. Furthermore, not only the sample observations v are unavailable, but also the values of parametersσ 2 (λ) ,σ 2 (ψ) ,σ 2 (θ) , andσ 2 (ζ) are unknown. To address this issue, Paul and Held [13] proposed a Laplace approximation based method to estimate the covariance by maximizing the marginal likelihood with respect to π and ε. It is noteworthy that their approach can be applied to general cases. Note that the covariance matrix (i.e., Σ) is assumed to be a diagonal matrix and V (·) r ∼ (0, σ 2 (·) I), r = 1, · · · , R. If R is sufficiently large, we can replace σ 2 (·) by its sample variance. In our case, R represents the number of states in the USA, which can be safely assumed to be sufficiently enough. Specifically, we have the sample variance as: The estimate of v Σ −1 v then satisfies: and the estimate of |Σ| is thus: Therefore, (A2) can be simplified as: Due to the difficulty of implementing the gamma function with a large input value (e.g., Γ(200) = ∞ in the R software), we replace the gamma function Γ(·) by Stirling's approximation [33], i.e., √ 2π exp(−u)u u−1/2 . One noticeable advantage for this replacement is that one can directly calculate the logarithm value first rather than the value of the gamma function first (and the logarithm value next), which can significantly mitigate the data overflow phenomenon (e.g., log(Γ(u))| u=200 = ∞, but log( √ 2π exp(−u)u u−1/2 )| u=200 = 1 2 log(2π) − u + (u − 1 2 ) log(u)| u=200 ≈ 857.933 in the R software). Accordingly, (A4) can be approximated as: l(π, v) ∝ ∑ r,t y r,t + 1 ε r log 1 + y r,t ε r 1 + µ r,t ε r + y r,t log(µ r,t ) − 1 2 log(1 + ε r y r,t ) To obtain estimates of π and v by maximizing (A5), we apply the Adam algorithm, which is sufficient and accessible for multi-parameter optimization and only requires first-order gradients with little memory requirement [34]. Its pseudo-code is shown in Algorithm A1 in Appendix A.3. Here, the initial values for v's, α's, and β's are generated from the standard normal distribution (N(0, 1)), while those for ε's are generated from the uniform distribution (U(0.01, 1)), which guarantees the non-negativity.