1. Introduction
The Coronavirus disease (COVID-19) was first reported in China at the end of 2019 [
1,
2,
3]. The rapid spread of this disease encouraged researchers to develop epidemiological models for the spread of this disease. Such models are useful for understanding the transmission patterns of the disease, which in turn, help us in formulating optimal strategies to curb the spread and lessen the impact of the outbreak [
4,
5].
The main objective of a standard epidemiological model is to estimate, at a certain time point, the number of individuals who are infected by the disease, suspected to have been infected, recovered from the illness, etc. We can model these quantities of interest, called compartment, deterministically or stochastically, i.e., as random variables. In the case of COVID-19, one category of approaches formulates the relationships between those quantities explicitly as a set of differential equations, which are then solved analytically, numerically, or through simulations. Depending on which compartments were being considered, these led to the SIR [
6,
7,
8], SEIR [
9,
10,
11], SLIR [
12,
13,
14], SIRD [
15,
16], SEIRU [
17], SLIAR [
12], and SIDARTHE [
18,
19] models. Meanwhile, another category of approaches employs machine learning models to capture the relationships between those compartments [
20,
21]. Such models are trained using data, which may be available as a time series. Some researchers even use news and social media data to model the trends in the growth of COVID-19 cases [
4].
Although these models can provide estimates of the compartments, they do not capture how the disease can be transmitted between neighboring regions through the movement of individuals between them. As a result, such estimates tend to be inaccurate. In fact, a study by Anirudh [
22] pointed out that the error estimate of such models ranges between 13% and 225%. One contributing factor is the lack of spatial elements in the models, and Bohner et al. [
23] argued that such spatial elements need to be incorporated to obtain a better model.
Traditional approaches for such spatially aware epidemiological models use a set of partial differential equations (PDEs) where each compartment variable is not only time-dependent but also spatial-dependent, as described by Murray [
24]. Therein, the spatial interaction is represented by a diffusion coefficient that indicates the dispersion rate of the disease along the spatial dimensions, and the resulting PDEs include the rate of changes with respect to both temporal and spatial dimension. A more fine-grained metapopulation model (also known as the patch occupancy model) for the epidemiological model is proposed by Arino et al. [
25], who introduce spatial patches, which correspond to regions in space, allowing a more explicit modeling of interaction between different regions in the epidemiological model. Instead of PDEs, Arino et al. uses ordinary differential equations (ODEs) and demonstrates the simulation of the model only in the case of one-way and two-way migration between neighboring regions that form a ring structure topologically—each region has only two neighbors on its “left” and “right”.
To accommodate a more complex spatial interaction within the epidemiological model, a more general framework is needed [
26]. To this end, we turn to cellular automata [
27]. Cellular automata (CA) is a dynamic system that is defined over a
d-dimensional lattice of cells. At each time step, a cell has a value that corresponds to one of the possible states. In its most basic form, there are only finitely many possible such states for each cell. However, it is possible to consider an infinite, or even uncountable, set of states. Interaction between a cell with another cell in its neighborhood is expressed in terms of transition rules that determine how the cell’s state can change due to that interaction. When a CA is simulated by running it for a certain number of time steps, we can capture a variety of spatiotemporal dynamics of the states. Unsurprisingly, this characteristic is appropriate for epidemiological models whose aim is to represent the spreading behavior of infectious diseases particularly during an outbreak, which is spatiotemporal in nature [
28,
29,
30,
31,
32,
33,
34].
When using CA for modeling disease outbreaks, we usually restrict our attention to a particular region as a scope in which the CA operates, e.g., a country, a district, or even a residential area. The CA cells are defined over such a region, and a straightforward way to represent dynamics in epidemiological modeling is to associate each cell’s state to the condition of one individual with respect to the disease being modeled, e.g., exposure, infection, recovery, etc. That is, the set of states of the CA is discrete. Here, each cell implicitly corresponds to a single individual only, and the transition rules govern how the disease can be transmitted from one individual to another. Simulating such local interactions between individuals in the population over a time period can lead to interesting patterns that can be analyzed at the global level [
35]. This approach differs from metapopulation models such as that from Arino et al. [
25], which allows only global interactions between spatial patches.
In the case of COVID-19, epidemiological models based on CA have been proposed by a number of researchers [
36,
37,
38,
39,
40]. All of them employ the so-called probabilistic CA, where the transition rules are probabilistically defined as a discrete set of states, as described earlier. These probabilistic CAs are also uniform in nature in the sense that the same probabilistic transition rules are applied to all cells.
The usage of these CA models for modeling the spread of COVID-19 follows a rather typical simulation-based approach. Specifically, such a CA model would have a fixed set of transition rules, which governs both the movement of individuals from one cell to another as well as the epidemiological dynamics due to disease transmission, mortality, and other related factors. To predict the size of the compartments, e.g., the number of infected individuals, at the
nth time step, we would start with an initial state with parameters initialized from data and background knowledge about the disease. Then, we run a simulation on the CA by executing transition rules at each time step until reaching the
nth time step. At the end of the simulation, a comparison could be made for predictions of solutions to the standard epidemiological modeling based on differential equations. However, such a comparison was not performed by all of the CA-based epidemiological models, except by Ghosh and Bhattacharya [
37], who reported a prediction accuracy in the range of 65–75% at the peak of the epidemic. Note that in all cases, the performance depends in principal on the initial parameter setting prior to simulation. Hence, we need to conduct a trial-and-error approach to obtain an optimal setting.
Our aim in this paper is to come up with a way to reduce a prediction error using CA-based modeling. Such an error can be caused by a number of unaccounted factors in the model such as complex interactions between neighboring regions due to movements of individuals in the population, the differing characteristics of crowds at different locations, level of vaccination in the population, or the prevalence of comorbidities. Among these factors, we focus on the influence of imperceptible interactions between neighboring regions that may influence disease spread in the epidemiological modeling for better modeling by proposing a way to represent them that improves the predictive power of CA-based models.
More precisely, we start from a sequence of data points
of compartment variable values obtained from real-world data, e.g., the number of infected individuals over consecutive days. Then, we generate a sequence
where each
is the prediction for the value of
obtained by solving the standard mathematical model of epidemiology such as the SIRD model using the value of
and the known epidemiological model parameters at the
th time step (transmission rate, mortality rate, and incubation period). In our case, the transmission rate follows that of Viceconte [
41]. The incubation period and mortality rate follow that of Li et al. [
2]. The difference between
and
constitutes an error due to unaccounted interactions between regions. We then formulate a multiple regression model to estimate for each cell the influence of each of its neighboring cells on the prediction. Here, instead of explicitly including movements of individuals as part of the model, we abstract them away as influence measures between neighboring regions. Once these influence measures are estimated, we can use them to predict the state of the epidemiological variable at the
nth time step with
by running a simulation on a CA initialized with parameters from data at the
kth time step.
Note that the CA we employ is continuous because the states of the CA can contain real numbers during simulation [
42]. That is, the simulation essentially computes
up to
where for
,
depends on the value of
(except
that depends on
). Here,
need not be an integer, and moreover, it is not estimated based on
, which is not present when
. Obviously, at the
nth time step, the predicted value of the considered compartment variable may need to be rounded to an integer if we wish to obtain an actual count. Note that this is in contrast to the existing CA-based models for the COVID-19 spread, where the states correspond to single individuals and thus have discrete values.
It is also non-uniform because the transition rules are different among different cells [
30]. The differences are due to the fact that the states of the CA are continuous and the transition rules for each cell depend on the influence measure from its neighbors, which may be different. In addition, in our CA model, each cell explicitly refers to an actual subregion part of the considered region (e.g., an area in a province of a country).
Overall our contribution is as follows. First, we propose a way to represent the influence between neighboring regions in COVID-19 epidemiological modeling that abstracts away from explicit inclusion of movement of individuals employed by existing CA-based models. It shows that a CA-based approach can be applied without requiring explicit individual movement. These influence measures are estimated via multiple regression models making use of the difference between compartment variable values according to the actual data and those values according to the standard epidemiological model. Second, we propose the use of non-uniform continuous CA (N-CCA) as a prediction engine through simulation where the initial parameters take into account the aforementioned influence measures between neighboring regions. We then demonstrate that this approach improves the prediction of accuracy on the number of infected individuals up to 14 days in the future using data points from the past 42 days, compared to a baseline model from Fong et al. [
43]. This baseline model uses a composite Monte-Carlo simulation to forecast the overall trend and propagation of the infected cases during the early period of the epidemic in China. The simulation is enhanced by a deep learning network and fuzzy rule induction applied to limited data and takes into account the spatiotemporal influence of nearby cities. In their evaluation, a 14-day forecast is created for all of China, and the result yields an RMSE of about 62,077.26.
This paper is organized as follows. After the introduction in this section,
Section 2 details relevant related works.
Section 3 provides a more formal definition of CA employed in our model. In addition, in this section, we describe the detailed formulation of our model and the data set used to build it. The simulation and prediction results are presented in
Section 4, followed by a discussion of the results in
Section 5. In
Section 6, we present the conclusion of our research and suggestions for future works.
3. Materials and Methods
Our overall workflow is shown in
Figure 1. To start the process, we took the values of the three epidemiological parameters,
,
, and
, for China from a pre-existing study and input them into the model. The model reads data from input tables consisting of region data, cell coordinate data, and time-series data cases.
In the next step, we simultaneously carried out a process for: computing the average cases of each cell for each region of time series cases, reading cells of neighborhood data, and calculating estimated cases, adding each cell with an epidemiological formula. In the next step, we compose a time series of cases in neighboring format for each cell by boundary detection. In the next block, we find the values of , , , , , and k for each cell using multiple regression and use it to predict cases for the next few days on the next block. In the last block, the prediction results are evaluated and displayed.
3.1. Cellular Automata
A cellular automaton (CA) is a 4-tuple
where
is simply a set of
D-tuples of positive integers called a cellular space [
28]. Elements of
C are called
cells. A cellular space is typically arranged as a
d-dimensional grid following a coordinate system based on matrix-indexing convention where the
D-tuple containing all ones,
lies on the origin position of the grid (i.e., the top-left corner in a 2-dimensional grid), and the cell
borders with
other cells (except when
c is at the edge of the grid). For example, in 2-dimensional grid, cell
borders with
at the top, with
to the right, with
at the bottom, and with
to the left [
27,
44,
45].
is a set of states for each cell, which encapsulates all possible conditions a cell may be in. Depending on the nature of the problem to be modeled, may consist of discrete scalar values, continuous scalar values, or even tuples of such values. A CA is continuous if is continuous. Otherwise, the CA is discrete, which is how CA is traditionally defined.
defines the neighborhood frame of a cell c. Here, m is fixed and each is a fixed function returning a neighbor cell of the given cell c. Thus, c can have up to m neighbors in the model and each of those neighbors contributes to the dynamics of c when the CA is simulated.
Two of the most commonly used neighborhood frames are von Neumann neighborhood and Moore neighborhood [
29,
30]. Von Neumann neighborhood with radius
of a cell
c consists of cells with a Manhattan distance of at most
r from
c. Meanwhile, a Moore neighborhood with radius
of a cell
c consists of cells with a Chebyshev distance of at most
r from
c. (A Manhattan distance of
and
is
, while their Chebyshev distance is
). Note that the indices for such neighborhood frames follow the matrix-index convention when, i.e., for cells
in a 2D cellular space,
i increases from “top” to “bottom”, while
j increases from “left” to “right”.
Figure 2 illustrates Von Neumann and Moore neighborhood frames in the 2-dimensional cellular space. The Von Neumann neighborhood of
with radius 1 is
. Meanwhile, the Moore neighborhood of
with radius 1 is
.
Next,
is a
local transition function that governs how the new state of a cell is obtained from the old state of that cell and the states of that cell’s neighbors [
28]. Therefore, if
are the current state of a cell
c and its neighbors, then
is the new state of the cell
c. A CA is traditionally
uniform in the sense that
f has the same definition across all cells. However, in this paper, we allow
f to be different across different cells, i.e., the CA to be
non-uniform.
The simulation of a CA can be understood as a sequence of applications of the local transition function on the state of all cells at consecutive time steps. Therefore, we use to denote the state of a cell c at time step t. Then, a configuration at time step t is the set . A one-step simulation of the CA is a function G that returns given defined as . Finally, a simulation of the CA starting from an initial configuration is simply an iterative application of G denoted by . Therefore, obviously corresponds to k steps of simulation of the CA starting from .
3.2. SIRD Epidemiological Model
Here, we introduce the SIRD model as the underlying mathematical model of the spread of infectious diseases based on the formulation by Siettos and Russo [
46], Nepomuceno et al. [
47]. The SIRD model is a variant of the basic SIR model where at any given time, we consider the population (in which the disease spreads) partitioned into four disjoint subsets or
compartments, namely the susceptibles (
S), the infected (
I), the recovered (
R), and the deaths (
D). The relationships between these compartments can be expressed by the following system of ordinary differential equations (ODEs) where
denotes the
mortality rate among the infected,
denotes the
transmission rate among the infected, i.e., how many healthy individuals can become infected per unit time by an already infected individual, and
denotes the incubation period, i.e., the duration in which the disease remains infectious—beyond this period, one assumes that an infected individual no longer spreads the disease to others.
The variables
, and
D are time-dependent and the parameters
, and
are specified in terms of a particular time unit. Intuitively, Equation (
1) states that the size of the susceptibles at a given time decreases due to a proportion of it becoming infected. Equation (
2) expresses the change in the size of the infected due to the addition of newly infected individuals as well as the removal of individuals due to fatalities as well as those who have passed the incubation period. Equation (
3) describes the addition to the recovered by individuals who are no longer infectious, while Equation (
4) expresses that a proportion of the infected do not survive and is thus added to the death compartment.
Note that
, and
D can also be viewed as states in which an individual in the population can belong. The relationships between them can also be illustrated as a state diagram as in
Figure 3.
Given the epidemiology model parameters
, and
, we can solve the ODE system to obtain an explicit form of
, and
D as a function of time. A comparison of data can then be made to see if the parameter setting is appropriate. If the parameters are unknown, the estimation can be performed via various approaches of nonlinear regression [
48]. However, note that this formulation does not take into account the spatial aspect of the disease spread. Our next aim is to accommodate this using a CA-based model, specifically to estimate the number of infected at a given time step
t. We apply our model to the spread of disease. In this matter, we limit our scope only to the infected compartment and focus our attention on Equation (
2).
3.3. Proposed CA-Based SIRD Epidemiological Model
In our CA-based approach, we work on cells in a fixed finite 2-dimensional grid that forms a cellular space. Each of these cells maintains a variable
that represents the number of infected in cell
at time step
t. Here, each cell should be viewed as a representation of an actual spatial region in which a number of infected/susceptible/healthy individuals may reside. Since the CA configuration has discrete characteristics, we could change Equation (
2) into discrete form. Discretizing Equation (
2), we obtain the following relation:
The left-hand side of Equation (
5) expresses the change of
between consecutive time steps according to the real data, while the right-hand side (except the
term) represents the approximation of that change according to the SIRD epidemiological model. The
term corresponds to the discrepancy between the two. Our idea is to model
as the result of the interaction between cell
and its neighbors that contributes to the change in the number of infected in cell
. Concretely, infected individuals from a neighboring cell can cause the disease to spread to individuals in cell
. This is represented by an
interaction coefficient q specific for that neighboring cell. Specifically, if a cell
has
m neighbors, then there are
different coefficients that determine the current level of infection
as well as at its neighboring cells added to the future level of infection at
.
To simplify our discussion, we focus on the Von Neumann neighborhood with radius 1 for the CA-based approach in this paper. Thus, a cell
has four neighboring cells, namely the “north” cell
, the “east” cell
, the “south” cell
, and the “west” cell
. The interaction coefficients for those neighboring cells are, respectively,
, and
. In addition,
represents the inner interaction coefficient of the “center” cell, i.e., cell
to itself. Note that all these interaction coefficients are dependent on
, i.e., their values are different when a different “center” cell is considered. Hence, as illustrated in
Figure 4, we can rewrite Equation (
5) into Equation (
6):
where
is a bias or offset term, while
, and
are the estimated number of
additional infected individuals in each cell
, the neighborhood frame of cell
, which are expressed as Equation (
7) below.
As stated in the introduction as well as
Figure 1, we work with a time series data representing the number of infected individuals during a certain period of time. Therefore, we have data points for each cell
as a sequence of values
, each representing the number of infected individuals in cell
at time step
. Based on these data, Equation (
6) yields the following system of
N linear equations for cell
in Equation (
8) (with subscript
omitted for
P’s and
q’s).
The above system can be expressed in a matrix form (Equation (
9)) for each cell
:
where
is an N-dimensional vector whose tth element () is for each ,
is an matrix whose t’th row is the vector
is a 6-dimensional vector with .
Our aim is to find the values of for all cells such that the changes on the number of infected individuals at time t predicted by accounting for the neighboring cell interactions as represented by accurately approximates the changes of the number of infected individuals in the actual data as represented by . Therefore, we view this as a multiple linear regression problem where we optimize .
More precisely, let
be a sum-of-squares loss function (expressed using L2 vector norm) for the aforementioned regression problem where we omit the subscript
:
Examining the shape of the matrices and vectors above, we notice that both
and
are scalars. Therefore,
. Thus, Equation (
10) becomes Equation (
11):
Therefore, finding an optimal
means we need to solve the following optimization problem in Equation (
12):
The gradient of the loss function is
, defined in Equation (
13):
Setting
yields Equation (
14):
Hence, the solution for
is:
By Cramer’s rule, the
k’th element of
in Equation (
15) can be computed using determinants as follows:
where
denotes the
matrix whose
k’th column is replaced by
.
Equation (
16) gives us the six interaction coefficients for a particular cell. In general, we compute them for each cell
separately. Next, having obtained those interaction coefficients for all cells, we can rewrite Equation (
6) into a linear approximation function in Equation (
17), where
is the approximate number of infected individuals at time step
with
N the number of time steps for which data on the number of infected individuals per cell are available.
To employ Equation (
17) as a prediction model, we use
as the starting value of
and proceed by computing the subsequent
up to the desired time step
K.
3.4. Prediction Model Using Aggregated Data
The CA-based model described in
Section 3.3 assumes that data on the number of infected individuals are available for each cell separately. That is,
is explicitly given for each cell
with
. However, we often do not have access to such data. Rather, data are given at the regional level, e.g., per district or province. Such regions have varying sizes and also possess rather complex border relationships with their neighboring regions, which are difficult to accommodate in a CA-based framework. To work around this, our approach is to simply pick a fixed spatial size for a cell previous to modeling, and then, each region is divided into varying numbers of cells. The number of infected individuals per such cell is approximated by averaging the number of infected individuals in the region.
Let C be the cellular space of the CA in our model, which is a finite 2-dimensional grid. Consider a set of regions where some regions may border each other spatially. For example, we can think of as a country, while are its provinces. We overlay C over such that each cell in C can be assigned to no more than one region. Thus, we define a region simply as a set of cells such that no cell belongs to more than one region, i.e., the regions are pairwise disjoint. Obviously, may be of irregular form, causing some cells to be spatially part of more than one region. In such cases, we make a simplifying assumption whereby we assign the cell to the region that covers it the most. Note that each cell in a region remains a particular cell in C and can be referred to as such, independent of the region to which that cell belongs.
Our model is then adjusted as follows. Let
be the number of infected individuals in region
at time step
, which corresponds to the data we have at hand. Denote the number of cells in region
by
. Then, we set the number of infected individuals for a cell
c via:
Equation (
18) allows us to obtain values of
for all cells
of the CA at all time steps
. This can then be used to construct the multiple regression model of Equation (
8) after which we proceed as described in
Section 3.3.
For predicting the number of infected individuals at time step
, we modify Equation (
17) by setting the following for each region
:
where, following Equation (
7):
Equation (
19) is the initialization for the prediction model for each cell. Then we calculated the average prediction of each cell for each region using Equation (
20). We rewrite Equation (
17) as Equation (
21) as the prediction model. The step in Equation (
20), followed by Equation (
21), is repeated along with the
days of prediction that we want.
To evaluate the prediction accuracy provided by our model, we use the usual Root Mean Square Error (RMSE) in Equation (
22). That is, if we have a ground truth number of infected individuals in region
at time step
, i.e.,
, then:
5. Discussion
From the results (
Figure 15,
Figure 16 and
Figure 17) in
Section 4.2, we can see that the error for predictions up to day 14 is relatively small (ranging between 0 and 160), except for Hubei. This study compares the baseline model from Fong et al. [
43] that used a Composite Monte-Carlo method simulation, enhanced by a deep learning network and fuzzy rule induction applying to limited data. The baseline model forecasts the overall trend and propagation of the infected cases during the early epidemic influenced by the temporal-spatial data of the nearby cities around China. The model uses the same number of future days to forecast, i.e., for the next 14 days, for China and obtains an RMSE of about 62,077.26. Meanwhile, in our study, the total RMSE in China was only 6631.42, about ten times smaller (
Table 2).
From the error analysis using RMSE (Equation (
22)) as shown in
Table 2, the data that caused the unexpected prediction results were outliers, such as the case in Hubei. The outlier was interesting since there was a crowded traditional market full of visitors in the city of Wuhan. We can propose the outlier as a separate study.
The results of visualizing the spreading pattern of COVID-19 for eight weeks, in
Figure 8, show that Hubei is constantly in the red area, which implies that this area remains the center of the disease spread. However, it also causes a high average of the training error, which means Hubei is outlier data. However, as a method with spatial characteristics, CA turned out to be pretty reasonable to model the spread of disease. It demonstrated that the CA model could enhance the accuracy of the prediction model by increasing the amount of training data or adding other related factors that influenced the spreading of disease.
Compared to previous studies in epidemiology, mainly related to COVID-19 at the beginning of the outbreak, our proposed model can show the visualization of the result on the level of vulnerability. Thus, the model can directly observe the position of clusters of regions. Until the end of 2020, the previous related study still used a mathematical model and statistical data processing for calculating the parameters of the epidemiological model of COVID-19 in Wuhan. It uses a graphical representation (chart) for showing the tendency of spreading without including spatial information [
9,
51]. Thus, the ability to represent the COVID-19 outbreak with spatial information can significantly contribute to the COVID-19 spreading model. We expect that the visualization based on spatial information could help the decision-maker determine a more precise policy.
The advantage of using CA is that the prediction results are more precise than standard forecasting methods for the whole data set. One of the reasons is that we can divide the COVID-19 data into small regions. Thus, we can analyze separately by considering the influence and behavior of its neighborhood, whereas standard forecasting methods override the influence of a neighborhood. However, the drawback of the model is the difficulty in defining cellular space. The configuration of cellular space and the definition of a neighborhood is not generic. Thus, we should redefine them to adjust the model if the characteristics of the problem and data set change. In addition, finding the optimal parameters is time-consuming and involves an array of variables that are sometimes more than two dimensions.
6. Conclusions
We succeeded in representing the influence of neighboring regions in COVID-19 epidemiological modeling, which we have proposed as the N-CCA model for our contribution. The N-CCA model has succeeded in modeling the influence of neighborhood interactions on the spread of COVID-19 in a region. In this study, we propose the N-CCA model, a non-uniform continuous CA-based approach combined with multiple regression to determine the coefficients representing the magnitude of the interaction influence.
We apply the N-CCA model to predict the spread of COVID-19 cases in China. We developed a prediction model using 42 days of cases as a training data set and using the next 14 days as a validation data set to evaluate the model’s prediction results. We used RMSE to calculate the prediction error on the average number of cases for each cell in a region. We obtain the RMSE values in the length range 0–42.32 and 0–67.43 per cell for 7 and 14 days of prediction except for Hubei. Meanwhile, the RMSE for all of China is about 6631.42. Thus, our overall model is better than the baseline.
The CA method has an opportunity to improve the predictive model of disease spread. We can improve the model by including other parameters, e.g., vaccination, comorbidity, and isolation factors. Moreover, we can also change the definition of cellular space, such as changing the number of cells and changing the membership definition of each cell. We can conduct another improvement by using the Moore neighborhood as a neighborhood frame model. Lastly, we can consider the population density when calculating the case spreading.
One of the challenges of using the CA approach in modeling disease spread is to define different cellular spaces for other geographic and demographic conditions. In addition, in this study, the number of regression problems that must be solved is 258, which is as many as the number of cells defined. If there are more cells and neighbors and the model has a more complex factor involvement, we can use machine learning or deep learning to solve the regression problems to obtain a predictive model with the CA approach.
The current study uses time-series data that are limited to 42 days. In future work, there are several things we can try to improve the accuracy, including adding time series data and an explicit movement factor, among others. For building a predictive model, adding more data can solve using machine learning for time series data. In addition, the consequence of adding an explicit movement factor needs modification of the design of the predictive model.