A Data-Driven Pandemic Simulator with Reinforcement Learning

Zhang, Yuting; Ma, Biyang; Cao, Langcai; Liu, Yanyu

doi:10.3390/electronics13132531

Open AccessArticle

A Data-Driven Pandemic Simulator with Reinforcement Learning

¹

Department of Automation, Xiamen University, Xiamen 361102, China

²

School of Computer Science, Minnan Normal University, Zhangzhou 363000, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(13), 2531; https://doi.org/10.3390/electronics13132531

Submission received: 26 April 2024 / Revised: 21 May 2024 / Accepted: 24 May 2024 / Published: 27 June 2024

(This article belongs to the Special Issue Data-Driven Intelligence in Autonomous Systems)

Download

Browse Figures

Versions Notes

Abstract

After the coronavirus disease 2019 (COVID-19) outbreak erupted, it swiftly spread globally and triggered a severe public health crisis in 2019. To contain the virus’s spread, several countries implemented various lockdown measures. As the governments faced this unprecedented challenge, understanding the impact of lockdown policies became paramount. The goal of addressing the pandemic crisis is to devise prudent policies that strike a balance between safeguarding lives and maintaining economic stability. Traditional mathematical and statistical models for studying virus transmission only offer macro-level predictions of epidemic development and often overlook individual variations’ impact, therefore failing to reflect the role of government decisions. To address this challenge, we propose an integrated framework that combines agent-based modeling (ABM) and deep Q-network (DQN) techniques. This framework enables a more comprehensive analysis and optimization of epidemic control strategies while considering real human behavior. We construct a pandemic simulator based on the ABM method, accurately simulating agents’ daily activities, interactions, and the dynamic spread of the virus. Additionally, we employ a data-driven approach and adjust the model through real statistical data to enhance its effectiveness. Subsequently, we integrated ABM into a decision-making framework using reinforcement learning techniques to explore the most effective strategies. In experiments, we validated the model’s effectiveness by simulating virus transmission across different countries globally. In this model, we obtained decision outcomes when governments focused on various factors. Our research findings indicate that our model serves as a valuable tool for decision-makers, enabling them to formulate prudent and rational policies.

Keywords:

COVID-19; agent-based model; deep Q-network; pandemic simulator; data-driven

1. Introduction

The coronavirus disease 2019 (COVID-19) is a contagious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease quickly spread worldwide, resulting in the COVID-19 pandemic. Due to the highly contagious virus, controlling COVID-19’s infection and fatality rates is critically dependent on the successful implementation of public health measures, such as social distancing, isolation, quarantine, and the usage of masks. Isolating symptomatic cases, enforcing lockdowns, maintaining social distancing, and wearing masks have proven to be the most effective measures to curtail viral transmission. While this has become common knowledge, in practical terms, adherence to these measures, due to economic and public compliance factors, is often challenging. Loosening lockdown policies can exacerbate the outbreak, and conversely, strict measures can have economic repercussions.

Different countries have taken various measures in response to the COVID-19 pandemic. The UK had implemented relatively lenient measures in response to the pandemic. While the government enforced restrictions on gatherings and closed certain venues, some individuals disregarded stay-at-home directives. Consequently, the government moderately altered the implemented measures: certain regions were allowed to reopen, leading to suboptimal pandemic control. This resulted in multiple virus mutations, such as variant B.1.1.7. In contrast, China imposed comprehensive lockdown measures, advocating for home isolation and reducing non-essential travel in the first stage. In severely affected areas, the lockdown was enforced to contain further spread. While effectively controlling the outbreak, these measures severely impacted the economy. Despite the varied strategies adopted by different governments, unfavorable outcomes emerged in terms of both the economy and public health. It is a challenging job of balancing the pandemic containment with fostering economic development.

Traditionally, statistical and mathematical models, such as the susceptible–infectious–recovered (SIR) model, have been used to simulate and forecast epidemic trends. Unfortunately, these models often oversimplify the problem domain, failing to account for human behavior, contact patterns, population diversity, and host variation. Additionally, the policy implementation tends to exhibit some lag, and immediate results may not be guaranteed. With the rapid advancement of AI, using Agent-Based Models (ABMs) provides a solution to overcome the limitations of traditional models. ABM is a modeling simulation approach that simulates and analyzes complex systems by modeling individual entities within the system. We built the pandemic simulator based on ABM simulation technologies. The pandemic simulator factors in individual behaviors, interactions, and environmental elements, simulating the spread of epidemics and the effectiveness of control strategies based on these factors. While many have adopted ABMs for the analysis, simplistic simulations often fall short of providing significant assistance in real-world decision making. To enhance the model’s utility, a data-driven approach is adopted that incorporates real-world data to augment the simulations. We employed the deep Q-Network (DQN) method to formulate decisions based on the simulation model. DQN is a form of deep reinforcement learning that navigates decision making in complex environments and optimizes learning through a feedback mechanism. DQN can learn to choose optimal control measures, such as vaccine distribution and social distancing strategies, to minimize infection or mortality rates in epidemic prevention and control.

We have proposed a method that combines ABM and DQN to combine individual intelligence with deep learning and provide more accurate and dynamic decision support for epidemic prevention and control. Within this model, the pandemic simulator refines the environmental and individual state information to DQN, which trains intelligent decision networks to select optimal control strategies. This fusion allows a deeper understanding of the impact of individual behaviors and environmental factors on epidemic spread and control, facilitating the formulation of more effective prevention and control strategies. This decision-support model aids health departments, governments, and policymakers in formulating scientific and viable epidemic prevention and control measures, aiming to minimize the impact of pandemics to the fullest extent.

This article mainly contains the following contributions: (a) We construct an epidemic simulator to model the virus transmission process, simulate real human work activities, and illustrate the impact of government decisions on outcomes. (b) We integrate ABM with DQN, enabling decision making within complex simulation environments, learning, and optimizing through feedback mechanisms, thereby assisting governments in obtaining rational decision solutions. (c) We adopt a data-driven approach based on real-world data. To enhance the model’s effectiveness, we conducted simulations based on real data and enabled the simulation model to aid governmental decision-making processes.

The remaining sections of this paper are outlined as follows. In Section 2, we present the current mainstream methods related to addressing pandemic issues. Section 3 briefs the methods, and a detailed exposition of the epidemic simulator used in the experiments is presented in Section 4. Section 5 introduces the design ideas of the DQN model. The simulation results are discussed in Section 6. Finally, Section 7 concludes the paper by summarizing its key points.

2. Related Work

Since the massive outbreak of COVID-19, scientists have been dedicating themselves to studying the dynamic transmission process of the virus to find targeted methods to control the spread of the pandemic. The mainstream research methods adopt traditional mathematical models, such as SIR [1,2,3] and susceptible-exposed-infectious-recovered (SEIR) models [4,5,6,7], to characterize the transmission process of the virus. Additionally, to consider more factors, researchers have added new states to improve the original models, such as quarantined (Q) [8], asymptomatic (A), symptomatic (I), hospitalized (C) [9].

Single mathematical models may not always be able to capture the complexity of dynamic processes. To address this, many researchers have employed hybrid modeling approaches by combining mathematical models with other types of models to enable more complex analyses: for instance, an integrated model for predicting the number of confirmed cases from the perspective of Bangladesh combines the SEIR epidemiological model and neural networks [10]; based on the SEIR model, the improved SEIR models were established considering the incubation period, the isolated population, and the genetic algorithm (GA) parameter optimization method [11]; a hybrid model constructed by the SEIR model based on the traditional infectious disease dynamics and the differential autoregressive integrated moving average (ARIMA) model is proposed to make predictions and analyze the novel coronavirus pneumonia epidemic in different periods and locations [12]; using SEIR and system dynamics as the base models in developing the designed model, making use of the causal loop diagram (CLD) to understand the factors which play a major role in the spread and containment of COVID-19 [13].

Similarly, a portion of researchers have employed the ABM approach to simulate the spread of the virus. ABM models can incorporate factors such as heterogeneous populations, mobility patterns, and social networks, to provide a more realistic and detailed representation of the disease transmission dynamics: for example, a framework to evaluate the effects of the pandemic by combining agent-based simulations—based on the SIR model—with a hybrid neural network [14]. ABMs can also help to identify the best policy interventions by simulating different scenarios and predicting their outcomes. For example, ABMs can be used to evaluate the effects of school closures, lockdowns, mask mandates, and vaccination campaigns on the spread of the virus. They can also help to optimize resource allocation, such as identifying the most effective testing and contact tracing strategies. Some researchers use agent-based models for estimating demand for hospital beds during the COVID-19 pandemic [15] and assessing the transmission dynamics and health systems burden for COVID-19 [16].

During the COVID-19 pandemic, governments around the world have had to make thoughtful decisions to address challenges such as protecting public health, maintaining economic operations, and mitigating the spread of the virus. Researchers have primarily utilized reinforcement learning methods to address these decision-making challenges. Padmanabhan et al. utilized a reinforcement learning framework to mitigate the impact of widespread viral transmission [17]. Khalilpourazari et al. applied Q-learning to predict the pandemic’s progression [18]. Furthermore, several studies have adopted DQN and its variants to explore optimal strategies during the pandemic [19,20,21].

3. Preliminaries

3.1. Agent-Based Model

ABM is a computational modeling technique that describes complex systems as collections of autonomous agents interacting with each other and with their environment. Each agent has its own set of rules, behaviors, and decision-making processes that govern their actions, and the interactions between agents and their environment produce emergent phenomena and patterns at a system level.

ABM has been used in various fields, including social sciences, economics, biology, ecology, and computer science. The development of ABM can be traced back to the 1940s with the work of von Neumann and Morgenstern on game theory and the development of cellular automata by Ulam and von Neumann. The first ABM was developed in the 1960s by Thomas Schelling, who used it to study racial segregation.

ABMs have been widely used in simulating the spread of COVID-19 and analyzing the effectiveness of government policies. ABMs can simulate the complex interactions between individuals, such as social contact patterns, adherence to preventive behaviors, and movement patterns. By incorporating real-world data and assumptions about individual behavior, ABMs can provide insights into the dynamics of the pandemic, including the spread of the virus, the impact of interventions, and the effectiveness of different policies. Some authors use it to evaluate the resource requirements during the peak of the pandemic [22] or determine the required medical resources and vaccine supplies over time [23].

Overall, ABMs have played a crucial role in understanding the spread of COVID-19 and analyzing the effectiveness of government policies. Their ability to capture the heterogeneity and complexity of individual behavior and interactions makes them a valuable tool for policymakers and researchers alike.

3.2. NetLogo

There are various methods for implementing ABM, allowing users to choose appropriate tools and techniques based on specific problems and preferences. In this context, we use the NetLogo simulation platform. NetLogo is a free and open-source platform designed for modeling and simulating complex systems, making it suitable for researching complex adaptive systems, collective behaviors, and distributed problems. NetLogo was developed by Dr. Uri Wilensky at Cornell University and first released in 1999. It offers an intuitive interface and a graphics-based programming language, enabling users to easily create and simulate a wide range of complex systems, including ecological, social, and economic systems. The simulation environment of NetLogo consists of both an interface and code. The interface provides a user-friendly graphical representation to display the model’s state and results, allowing users to interact with the model. The code is the critical component where users define the model’s behavior by specifying the rules, interactions, and simulation processes for the agents within the model. To date, numerous scientific studies have employed NetLogo to conduct simulation analyses of complex systems in natural, social, and engineering domains [24,25].

3.3. Q-Learning

Reinforcement Learning (RL) is a type of machine learning where an agent interacts with an environment to learn how to make decisions that maximize a cumulative reward signal. RL is based on the idea of trial and error, where the agent learns by exploring different actions in the environment and receiving feedback in the form of rewards or penalties. The goal of RL is to develop an optimal policy, which is a mapping from states to actions that maximize the expected cumulative reward.

To formalize this problem, we can use the mathematical framework of Markov decision processes (MDPs). A MDP is defined by a tuple (S, A, P, R,

γ

), where S is a set of states, A is a set of actions, P is the transition probability function, R is the reward function, and

γ

is the discount factor. The transition probability function

P (s^{'} | s, a)

gives the probability of transitioning to state

s^{'}

given that the agent takes action and in-state s. The reward function

R (s, a, s^{'})

gives the reward received by the agent for transitioning from state s to state

s^{'}

after taking action a. The discount factor

γ

determines the relative importance of immediate versus future rewards.

Q-learning is a popular RL algorithm for finding an optimal policy in a MDP. Q-learning learns a value function, which estimates the expected cumulative reward of taking an action in a given state and following the optimal policy thereafter. The value function is defined as follows:

Q (s, a) = E [R_{t + 1} + γ \max_{a^{'}} Q (s^{'}, a^{'}) | s, a],

(1)

where s is the current state, a is the action taken,

R_{t + 1}

is the reward received at the next time step,

γ

is a discount factor that determines the importance of future rewards,

s^{'}

is the next state, and

a^{'}

is the next action.

The Q-learning algorithm updates the value function using the following update rule:

Q (s, a) \leftarrow Q (s, a) + α (R_{t + 1} + γ \max_{a^{'}} Q (s^{'}, a^{'}) - Q (s, a)),

(2)

where

α

is the learning rate that determines the weight given to new experiences compared to past experiences.

3.4. Deep Q-Network

DQN is an extension of Q-learning that uses a neural network to approximate the value function. DQN addresses the problem of high-dimensional state spaces by using a deep neural network to approximate the Q-function.

The DQN algorithm updates the parameters of the neural network using stochastic gradient descent to minimize the following loss function:

L (θ) = E [{(R_{t + 1} + γ \max_{a^{'}} Q (s^{'}, a^{'}; θ^{-}) - Q (s, a; θ))}^{2}],

(3)

where

θ

represents the parameters of the network,

θ^{-}

represents the target network parameters that are updated less frequently, and

L (θ)

is the mean squared error between the target Q-value and the predicted Q-value.

DQN also uses experience replay, where the agent stores experience in a replay buffer and samples mini-batches of experiences to train the network. This improves the efficiency of learning by breaking the correlation between consecutive updates and reducing the variance of the updates.

4. Pandemic Simulator

We have developed a pandemic simulator (PS) to model the spread of the virus. The simulator employs an agent-based model to simulate disease transmission under various lockdown strategies. It delineates communities and diverse scenarios by customizing interpersonal interactions when different governmental strategies are employed, enabling the observation of viral spread outcomes. The simulator’s primary objective is to integrate with the DQN algorithm to devise rational policies that maximize both economic and health well-being. This PS can be defined by a triad <E, A, I>, where E represents the environment, A signifies the agents, and I denotes the infection mechanism. Figure 1 represents a schematic diagram of the pandemic simulator. Next, we will elaborate on the PS from these perspectives.

4.1. Environment

We use NetLogo to simulate the spread of viruses in the population. The environment used in the simulation is a standard NetLogo world, consisting of

64 \times 64

patches. Each patch represents a discrete location in the simulated world, and agents move between patches to travel. To measure the passage of time in the simulation, each day is divided into 82 ticks. The parameter settings are shown in Table 1:

Before the simulation begins, the world is in a default state. As shown in Figure 1, the world is divided into four regions, each containing various elements such as houses, points of interest, and hospitals. Points of interest represent facilities such as workplaces, schools, and shopping centers. Each element is randomly assigned to a location within the world.

4.2. Agent

In the PS environment, the agent simulates the behavior of real humans. An agent can be defined by the two-tuple <Action, State>, where action defines the action mode of the agent and state represents the health status of the agent. The behavior of the agent includes travel (the agent can choose to stay home, go to points of interest such as workplaces, schools, and shopping centers, or go to hospitals), whether to wear a mask (wearing a mask can effectively reduce the risk of infection [28]). The agents in the PS can exist in one of several states, including healthy, latent, asymptomatic, symptomatic, immune, and dead. The specific state transition process will be described in the infection mechanism section.

Once the simulation starts, a given number of agents is generated. Each agent mainly acts in a certain area and is randomly assigned to a house as their residence within the given area. Additionally, each agent is assigned a point of interest within the same region as their house as their primary destination when leaving home.

All agents follow the same travel process each day. From midnight until 8 a.m., the agents stay at home or at hospital. At 8 a.m., the agents decide whether to travel based on government policies or the personal health status with a given probability. If they choose to travel, they go to their point of interest or the hospital, depending on their health status. At 4 p.m., each agent decides whether to return home or stay at the hospital, depending on their health status, and remain there until 8 a.m. the next day.

Since the agents’ movements are limited when they are at home or work and more extensive when commuting, the probability of infection varies during different periods. To account for this, the simulation time is divided into unequal segments. Each day is divided into 82 ticks with each travel time lasting 30 ticks and each non-travel time lasting 1 tick per hour.

4.3. Infection Mechanism

As shown in Figure 1, the agents in the simulation can exist in one of several states, including healthy, latent, asymptomatic, symptomatic, immune, and dead. The infection process can be seen from Figure 2. In the figure, the meanings of different colored agents are shown in Figure 1. At the start of the simulation, a given number of agents are infected, and the remaining agents are set to a healthy state. At each tick, the simulation checks whether each healthy agent has any infected agents within a certain radius. If there are any infected agents nearby, the healthy agent has a probability of becoming infected. If the healthy agent becomes infected, they enter a latent period during which they are not contagious. After the latent period ends, the agent can transition to either an asymptomatic or symptomatic state with symptomatic agents developing symptoms after the incubation period and becoming infectious at that point. Once the infectious period ends, the agent either dies or recovers and becomes immune with a given probability. However, being immune does not provide complete protection against future infections, and agents in this state still face a lower risk of infection.

To accurately simulate the transmission dynamics of the novel coronavirus, this research employs a data-driven methodology, leveraging empirical data derived from statistical investigations of COVID-19 for simulation purposes. The infection parameter settings are shown in Table 2:

5. Decision-Making Based on DQN

5.1. Problem Statement

Mitigating the COVID-19 pandemic is a decision-making problem where governments determine policies based on the process of disease transmission. The simulation based on DQN is an agent-based epidemic simulator that can be used to analyze and improve mitigation policies for the economic and health impacts during pandemics. The purpose of developing this simulator is to train reinforcement learning algorithms by combining artificial intelligence with epidemiology to simulate disease transmission and improve pandemic mitigation policies.

In this model, our goal is to make rational decisions at each decision point within the given world and time to maximize the overall reward. The experiment is set for 50 days, taking into account the limited understanding of the virus in the early stages of the epidemic, making it difficult to make accurate judgments. By default, starting from the tenth day, a decision is made every five days. In real-life epidemic prevention and control, the spread of the virus is complex, and the effectiveness of the government’s containment policies is difficult to reflect in real time and accurately. Therefore, network learning is adopted to reflect the impact of policies. Through this approach, we can evaluate the effectiveness of different policies on the spread of the virus and make better decisions to control the pandemic.

5.2. Policy of Government Agent

The government agent will periodically select an appropriate strategy to mitigate the outbreak based on available data. The agents’ behavior is governed by the set of restrictions imposed by the various lockdown levels. The specific restrictions include the following:

Wearing masks (W). The primary route of transmission of COVID-19 is through the respiratory tract. There is substantial evidence that wearing a mask in laboratory and clinical settings reduces the transmission of infected respiratory particles, thereby reducing the transmission rate per exposure [28]. Therefore, during outbreaks, governments actively urge the public to wear masks. The acceptance of masks varies depending on the culture and lifestyle habits of different countries. The value range of W is 0 to 1, where 0 means there is no policy for wearing a mask, people are also not required to wear masks, and 1 means all people wear masks when they go outside as required.
Staying at home (S). People are required to isolate themselves at home to avoid the spread of the outbreak. For example, in the UK, home guidelines were issued in March 2020 to reduce the spread of the outbreak [37]. S is defined as a value between 0 and 1, where 0 can be considered as people are not restricted from going out, and 1 can be considered situations where any travel is forbidden and people have to be isolated at home.
Gathering limits (G). In public areas, it is forbidden to gather more than a certain number of people. For example, the UK has issued guidelines requiring people in medium local alert level areas not to socialize in groups of more than six [38]. The value range of G is not limited. The value of infinity means no upper limit on the number of people, and 0 means crowd gathering is strictly prohibited.
Area lockdown (A). Area blockade refers to some of the points of interest that were asked to close and prohibited from admitting crowds to prevent the spread of the outbreak. A takes the values of 0 or 1: 0 means that no lockdown in the area, allowing people to move freely within points of interest, and 1 means that points of interest are blocked and people are asked to stay home.
Isolation if Infected (I). The value range of I is 0 or 1, where 0 can be considered as no travel restrictions for infected people, and 1 can be considered as requiring isolation after infection.

5.3. Model Process

The specific model process is shown in Figure 3.

In DQN, the components include the training network, the target network, experience replay, and the environment, where the epidemic simulator represents the environment. The agent interacts with the environment through a series of actions, observations, and rewards. At each time step, the agent simulates the environment based on its current state and action, receiving the current step’s reward and the next state.

DQN maintains a replay buffer that stores the tuple of data (state, action, reward, next state) sampled from the environment at each step. During training, the Q-network is updated by randomly sampling batches of data from the replay buffer. The training network is updated by calculating the loss function between the training network and the target network, using gradient descent as indicated in Equation (3). The target network’s parameters are synchronized with the training network periodically to avoid instability caused by having identical parameters in both networks.

Finally, based on the state provided by the simulator and the trained network, the action with the highest estimated policy value is selected as the next action, proceeding to the next time step.

5.4. Environment Settings

5.4.1. State Space

We represent the model’s state using a five-tuple <healthy people rate, latent people rate, asymptomatic people rate, symptomatic people rate, death people rate>. This choice is informed by our focus on maximizing overall economic gains, which are closely tied to the distribution of different health statuses. Healthy individuals contribute stable and consistent income to the government, while those in the latent or infected stages are advised to work from home to avoid further spread, leading to reduced economic productivity. By incorporating this state representation, we can effectively capture the economic impacts of different health conditions in our decision-making process.

5.4.2. Action Space

In our research, we found that different regions have corresponding epidemic prevention policies, which can generally be classified into different levels based on the severity of the lockdown measures. By combining real-world epidemic prevention policies with the government policies provided in the model, we can define various levels of lockdown as bundles of all kinds of strategies in the model.

For example, in the early stages of the epidemic or when there is insufficient understanding of the virus, the government may choose to only appeal to the public to wear masks, and some people may choose to stay at home to reduce the risk of infection. Wearing masks can effectively reduce the probability of infection, which is reflected in the simulation program as a decrease in the probability of infection when people have close contact with each other. When the epidemic becomes more severe, the government may implement stricter measures, requiring people to wear masks and urging them to stay at home as much as possible instead of going out. This is reflected in the simulation program as an increase in the probability that people will not go out, and if they do go out, they will wear masks. In the most severe stage of the epidemic, the government may enforce strict epidemic prevention policies, requiring people only to go out if necessary, allowing only a small amount of daily outdoor activity, and implementing lockdown measures in severely affected areas, limiting gatherings, and closing non-essential venues to control the spread of the virus.

In the model, we divided these policies into six different levels, with level 0 representing the lowest level of lockdown and level 5 representing the highest level of lockdown, as shown in Table 3. Among them, W, S, G, A, and I, respectively, represent wearing masks, staying at home, circulation limits, gathering limits, and area lockdown in the policies mentioned above.

5.4.3. Reward Design

The reward function is a crucial component of Q-learning, as it determines the agent’s behavior by assigning a value to each state–action pair. The reward function reflects the goal of the reinforcement learning problem, and therefore, its design should align with the desired behavior of the agent.

In the early stages of the COVID-19 outbreak, governments lacked knowledge of the virus’s characteristics. Consequently, most governments adopted conservative observation measures to protect their economies from significant losses caused by lockdowns. However, as the number of COVID-19 cases and deaths increased rapidly, governments had to implement lockdown measures to prevent greater disasters. Policymakers faced the challenge of balancing the impact of factors such as the economy and public health.

To comprehensively consider economic, health, and psychological factors, we designed a multi-objective reward function. This function can comprehensively consider the impact of various factors on policy selection and reflect the degree of importance that policymakers attach to different factors by adjusting the weights. Policymakers can achieve a balance between economic and health factors by adjusting different parameters, thus achieving the goal of maintaining sound economic development while protecting the physical and mental health of the public.

For economic factor, in our simulation experiments, only agent individuals are included, which do not reflect the actual economic processes in the community. Therefore, we quantify the impact of the economy on policy making by measuring the contribution of each individual to the economy. We assume that a healthy individual in normal working conditions has a contribution value of 1 to the economy per day. When the government adopts certain measures such as working from home, the individual’s contribution to the economy will be substantially affected. Similarly, when an individual is in an unhealthy state, their ability to contribute to the economy is limited and varies with their physical condition. Therefore, we comprehensively consider their working and health status to quantify their potential economic contribution. The expression for economic contribution is

R_{c} = \frac{\sum_{i} p_{i} \times s_{i}}{n},

(4)

where i represents the i-th agent,

p_{i}

denotes the location of agent i,

s_{i}

denotes the life status of agent i, and n denotes the number of agents. The value of

p_{i}

and

s_{i}

is shown in Table 4; here, “POI” represents points of interest, referring to places other than home and hospital where agents might appear.

Similar to the economic factor, for the health factor, we assume that an individual in a healthy state has a contribution value of 1 per day. As the health status of an individual worsens, their contribution value decreases accordingly. For individuals who receive timely medical treatment, their symptoms will be alleviated; meanwhile, the duration of infection and the mortality rate will decrease correspondingly. As a result, their contribution value will increase. The specific expression for health contribution is as follows:

R_{h} = \frac{\sum_{i} h_{i}}{n},

(5)

here,

h_{i}

denotes the healthy parameter of agent i, the value of

h_{i}

is shown in Table 4, “other” refers to health states other than asymptomatic, symptomatic, and death, and it is assigned a value of 1 in these cases.

In regard to the psychological factor, it is important to consider that individuals tend to prefer freedom of movement and work. According to an Indian online survey, more than two-fifths of the people are experiencing common mental disorders due to lockdown and the prevailing COVID-19 pandemic. While lockdown can be a significant and effective strategy of social distancing to tackle the increasing spread of the highly infectious COVID-19 virus, at the same time, it can have some degree of psychological impact on the public [39]. Additionally, the more severe the level of lockdown measures, the stronger the resistance and negative psychological effects on individuals. Therefore, to reflect the psychological state of the public, we convert the level of lockdown measures into the psychological factor, which is expressed as follows:

R_{p} = 1 - \frac{l}{n_{l} - 1},

(6)

where l represents the lockdown level, and

n_{l}

denotes the number of the lockdown levels.

Using the formulation of

E_{c}

,

E_{h}

, and

E_{p}

above, our objective is to maximize the following multi-objective function:

R = \sum_{t} (α R_{c}^{t} + β R_{h}^{t} + γ R_{p}^{t}),

(7)

here, the superscript t indicates the contribution value on day t, and the parameters

[α, β, γ]

are used to adjust the weight, where

α + β + γ = 1

, and

α

,

β

, and

γ

are within the range [0, 1].

5.4.4. Training Parameters

In DQN, the parameter settings are crucial for the algorithm’s performance. Different parameter configurations can significantly impact the algorithm’s stability, convergence speed, and final performance. The settings for some of the training parameters in our experiments are as follows:

Learning Rate: The learning rate controls the step size for each parameter update. In our experiments, the learning rate is set to 0.001.
Discount Factor: The discount factor ( $γ$ ) measures the relative importance of current and future rewards. A higher discount factor means the agent places more emphasis on future rewards. In our experiments, the discount factor is set to 0.9, with $γ$ ranging [0, 1].
Target Network Update Frequency: In DQN, two neural networks are used: the evaluation network and the target network. The frequency of updating the target network’s parameters is crucial for learning stability. In our experiments, the evaluation network’s parameters are copied to the target network every 100 steps.
$ϵ$ -greedy Exploration Rate:The $ϵ$ -greedy strategy balances exploration and exploitation during training. The experiment set the initial exploration rate to 1.0 and lowered the exploration rate every 50 generations.

6. Experiments

In this section, we undertake an examination of the model’s sensitivity and efficacy while also evaluating whether the model can produce reasonable decision outcomes for the government based on its varying emphases on different factors.

6.1. Sensitivity Analysis

To investigate the impact of different strategies on the target variable, we maintain the simulation parameters relevant to the world fixed and control other strategies, using the method of univariate analysis.

In univariate sensitivity analysis, one policy is adopted at a time while relaxing other policies. Each run records the percentage of total infections achieved after the policy has been implemented for a certain period. For different independent variables, the experiments are conducted with a certain increment from the absolute minimum to the absolute maximum. To ensure the validity and statistical robustness of the experiments, 10 trials are conducted for each increment of each independent variable, and the experimental results are recorded.

The sensitivity index attributed to the total infections for each parameter is calculated using the formula below and compared with other variables:

s e n s i t i v i t y = \frac{Y_{m a x} - Y_{m i n}}{Y_{m a x}},

(8)

where

Y_{m a x}

and

Y_{m i n}

, respectively, represent the percentage of the maximum total number of infections and the minimum total number of infections.

After conducting multiple experiments, we obtained data related to the sensitivity under different policies and the peak of active cases under different policies, as shown in Table 5 and Table 6. Here, “Mean”, “Median”, “Range”, “Variance”, and “Standard Deviation” represent the average value, median, range (maximum value minus minimum value), variance, and standard deviation of the peak results obtained from multiple experiments, respectively. From Table 5 and Table 6, we have the following conclusions.

Wearing masks has a manifest impact on infection. Based on the results presented in Table 5 and Table 6, it can be seen that the sensitivity index for the mask usage rate is 0.94, indicating that an increase in mask usage has a significant impact on the number of active infections. Figure 4a plots the number of active cases for different proportions of the population wearing masks. As shown in Figure 4a and Table 6, as the mask usage rate increases, there is a notable reduction in the peak number of active infections. However, the time required for active cases to be cleared also increases proportionally. When the mask usage rate reaches 100%, the virus’s spread is restricted, leading to a significant reduction in the total number of infections and the peak. The effectiveness of preventing the spread of the virus becomes evident under such conditions.

The impact of staying at home on infections is also apparent. Based on the results presented in Table 5 and Table 6, it can be observed that the sensitivity index of staying at home is 0.93, implying that an increase in the number of people staying at home has a significant impact on reducing the number of active infections. Figure 4b plots the number of active cases for different proportions of the population staying at home. As can be seen from Figure 4b and Table 6, similar to wearing a mask, the more individuals who choose to stay at home, the lower the peak number of infections will be.

Limiting crowd gatherings can reduce the peak number of infections. Based on the results in Table 5 and Table 6, the sensitivity index for limiting the maximum number of people gathered was found to be 0.92. This implies that controlling the number of people gathering in one place has a significant impact on reducing the number of active cases. Figure 4c plots the number of active cases under conditions that limit the number of people. Figure 4c and Table 6 reveal that by restricting the maximum number of people gathered in public places, the likelihood of people coming into contact with each other is decreased, thereby limiting the spread of the virus. If stringent restrictions are put in place to control the maximum number of people who can gather at any given time, the spread of the virus can be effectively contained.

Regional restrictions have certain impact on infections. Based on the results presented in Table 5 and Table 6, the sensitivity index for area lockdown is 0.60, which is relatively lower compared to other epidemic prevention methods. Figure 4d plots the number of active cases with or without a regional lockdown required. Nevertheless, it can be observed from Figure 4d and Table 6 that the implementation of area lockdown can lead to a significant decrease in both the peak and total number of infections. The primary reason for this is that the area lockdown restricts the movement of infected individuals within a certain area, thereby reducing the spread of the virus to other regions and lowering the overall number of infections. However, it is important to note that an area lockdown alone may not be sufficient for controlling the spread of the virus. In the absence of other preventive measures, the region may eventually be gradually infected.

Isolation in case of infection can impact infection. Based on the results obtained in Table 5 and Table 6, it is evident that the sensitivity index for the isolation of infected individuals is 0.78, which is higher than that of area lockdown. Figure 4e plots the number of active cases with or without requiring people to quarantine if infected. The effectiveness of isolation in reducing the peak number of infections is clearly visible from Figure 4e and Table 6. This is because isolation, especially when immediately after the infection is detected, can effectively limit the spread of the virus to other individuals. Therefore, the isolation of infected individuals is an important measure for controlling the spread of the virus and should be implemented as part of a comprehensive strategy for combating the pandemic.

6.2. Effectiveness Analysis

To verify the effectiveness of the model, we compared it with the actual situation in different countries and explored the trend of the number of infections to analyze the model’s validity. We selected China, Poland, and the US as the countries for analysis by examining the infection data and policy measures implemented during the epidemic period. The relevant data for each country are listed in Table 7. The comparison was carried out by analyzing the model’s ability to simulate the epidemic’s trend in each country and comparing it with the actual data. The selected countries represent a range of different epidemic scenarios, including different transmission rates, levels of intervention, and control strategies, making the evaluation more comprehensive and universal. The results of this comparison will help validate the model’s ability to simulate real-world epidemics and inform future policy decisions.

6.2.1. Poland

According to the actual situation and the demonstration of Figure 5, at the beginning of the epidemic, the Polish government did not take strict measures. For instance, on 10–12 March 2020, lockdown-type control measures were implemented, closing schools, university classes, and offices as well as canceling mass events, and they were strengthened on 25 March, limiting non-family gatherings to two people and religious gatherings to six and forbidding non-essential travel. Lockdown restrictions were tightened on 31 March 2020 by government regulation, requiring individuals walking in streets to be separated by two meters, closing parks, boulevards, beaches, hairdressers, and beauty salons, and forbidding unaccompanied minors from exiting their homes [41].

In the simulation, a similar policy was implemented on the corresponding date. Since the Polish government measures were not strict enough and there was a lag regarding the policy itself, the number of infections still increased at a certain rate after the implementation of the strategy, and after 60 days, the number of active infections showed a decrease, and the results of the simulation were similar to the trend of the real data.

6.2.2. China

Based on the actual situation and the demonstration of Figure 6, the government took a wait-and-see attitude due to the first discovery of COVID-19 in China and the lack of knowledge about the virus, resulting in a very violent spread of the virus in the early stages. After the virus was found to be extremely transmissible and pathogenic, very strict measures were quickly taken: for example, on 23 January 2020, the central government of China imposed a lockdown in Wuhan and other cities in Hubei to quarantine the center of an outbreak of COVID-19 [42], resulting in a severe hindrance to the spread of the virus, a significant decrease in active cases, and a slow increase in the total number of infections. The simulated results are similar to the realistic data trends.

6.2.3. The US

In light of the actual situation and the findings from Figure 7, the U.S. government believes their path forward relies on giving schools and businesses the tools they need to prevent economic and educational shutdowns. Consequently, they have decided to keep schools and businesses largely open [43]. The decisions of the government led to the spread of the virus in the US without much resistance, and the number of active cases and the total number of infections are still in the exponential growth stage. The simulated results are similar to the realistic data trends.

6.3. Results under Different Factors

In this part, we mainly study whether our model can obtain a reasonable result through iteration to maximize benefits when considering different factors such as economic, health, or psychological factors. Due to the randomness of the experiment, we obtain decision-making sequence results with a certain tendency: that is, the government tends to block or open. Therefore, we derived the decision sequence for each case by averaging the results of multiple experiments, as illustrated in Figure 8.

6.3.1. Economic Factor

The results of model iteration considering economic factors are shown in Figure 9a. It can be observed from the figure that as the number of iterations increases, the mean reward of the model is gradually improving. However, the final reward may not stabilize at the maximum value due to the limitations of the model itself and other factors such as the number of iterations and parameters. Based on the decision sequence obtained from Figure 8, it can be inferred that considering the impact of economic factors, the government tends to implement lockdown measures until the number of infections decreases to a certain level. This is because incomplete lockdown measures would lead to a continuous increase in the number of infections, causing sustained economic consequences. On the other hand, continuously lifting restrictions would result in multiple waves of infections and also lead to economic decline.

6.3.2. Health Factor

The outcomes of model iteration, taking into account health factors, are illustrated in Figure 9b. Similar to the consideration of economic factors, the final reward, when considering health factors, does not exhibit a stable trend toward the maximum value. Analyzing the decision sequence derived from Figure 8, it can be observed that under the influence of health factors, the government also tends to opt for lockdown measures until the number of infections decreases to a certain threshold. This preference arises from the understanding that partial lockdowns would sustain the spread of the virus. Recognizing the potential consequences of reinfections and extensive virus transmission, only a comprehensive lockdown can effectively eliminate the virus, as an alternative approach would strain healthcare resources and compromise public health.

6.3.3. Psychological Factor

The outcomes of model iteration, considering psychological factors, are depicted in Figure 9c. It is evident from the graph that as the number of iterations increases, the average reward of the model gradually improves. Moreover, as the number of iterations increases, the final reward stabilizes toward the maximum value. Analyzing the decision sequence derived from Figure 8, it can be observed that under the influence of psychological factors, the government tends to lean toward implementing an open policy. This inclination stems from the recognition that strict lockdown measures can have a profound impact on individuals’ mental well-being. Imposing stringent lockdown policies can have detrimental effects, as mandatory confinement at home daily can negatively affect people’s mood, leading to feelings of depression.

6.3.4. All Factors

When making any decision, it is crucial to consider the impact of multiple factors rather than focusing solely on one. Figure 9d presents the results of model iteration, taking into account various factors such as economy, health, and psychology. In this analysis, we set the parameters

α = β = γ = 1 / 3

. The graph clearly illustrates that as the number of iterations increases, the average reward of the model gradually improves. Furthermore, with each iteration, the final reward tends to stabilize toward its maximum value. Analyzing the decision sequence derived from Figure 8, it becomes evident that when considering the collective impact of all factors, the government tends to lean toward implementing an open policy. By appropriately easing restrictions, the government aims to prevent the exacerbation of psychological issues while still implementing certain levels of lockdown measures to slow down the spread of the virus and maintain it within manageable limits.

7. Conclusions

The COVID-19 outbreak had a profound impact on the world. The pandemic irreversibly affected national economies, public health, and psychological well-being. Hence, it is essential to select appropriate policies based on the epidemic spread. Governments worldwide adopted different strategies to address the crisis. The impact of lockdown policies is uncertain, prompting us to employ an ABM to construct a pandemic simulator to simulate human behavior and virus transmission, demonstrating the effectiveness of government policies. To enhance the persuasiveness of the simulator, a data-driven approach was utilized to adjust the model based on real-world statistical data. We apply deep reinforcement learning on the top of the simulator to explore optimal lockdown strategies. The results indicate that through model training, it is possible to identify rational strategies tailored to the circumstances of different countries to mitigate the impact of the pandemic. The simulator provides important knowledge to understand the pandemic and its impact on policy. We will continue to improve the simulator by considering more factors and real-world data that are gradually available in the public domain. This will facilitate the generalization of the COVID-19 simulator in other pandemic control.

Author Contributions

Conceptualization, Y.Z., B.M. and L.C.; methodology, Y.Z., Y.L. and B.M.; validation, Y.Z.; formal analysis, Y.Z. and Y.L.; investigation, Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, B.M.; visualization, Y.Z.; supervision, B.M. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (Grants No. 62176225 and 62276168) and the Natural Science Foundation of Fujian Province, China (Grant No. 2022J05176).

Data Availability Statement

The real-world data can be obtained from https://github.com/owid/covid-19-data, and the simulation data is obtained from the simulator in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ABM	Agent-based modeling
DQN	Deep Q-network
MDP	Markov decision process
PS	Pandemic simulator
SIR	Susceptible–infectious–recovered
SEIR	Susceptible-exposed-infectious-recovered
GA	Genetic algorithm
ARIMA	Auto regressive integrated moving average
CLD	Causal loop diagram

References

Kudryashov, N.A.; Chmykhov, M.A.; Vigdorowitsch, M. Analytical features of the SIR model and their applications to COVID-19. Appl. Math. Model. 2021, 90, 466–473. [Google Scholar] [CrossRef] [PubMed]
Abdy, M.; Side, S.; Annas, S.; Nur, W.; Sanusi, W. An SIR epidemic model for COVID-19 spread with fuzzy parameter: The case of Indonesia. Adv. Differ. Equ. 2021, 2021, 105. [Google Scholar] [CrossRef] [PubMed]
Ramos, A.; Ferrández, M.; Vela-Pérez, M.; Kubik, A.; Ivorra, B. A simple but complex enough θ-SIR type model to be used with COVID-19 real data. Application to the case of Italy. Phys. D Nonlinear Phenom. 2021, 421, 132839. [Google Scholar] [CrossRef] [PubMed]
Annas, S.; Pratama, M.I.; Rifandi, M.; Sanusi, W.; Side, S. Stability analysis and numerical simulation of SEIR model for pandemic COVID-19 spread in Indonesia. Chaos Solitons Fractals 2020, 139, 110072. [Google Scholar] [CrossRef] [PubMed]
Cai, M.; Em Karniadakis, G.; Li, C. Fractional SEIR model and data-driven predictions of COVID-19 dynamics of Omicron variant. Chaos Interdiscip. J. Nonlinear Sci. 2022, 32, 071101. [Google Scholar] [CrossRef] [PubMed]
Kamrujjaman, M.; Saha, P.; Islam, M.S.; Ghosh, U. Dynamics of SEIR model: A case study of COVID-19 in Italy. Results Control Optim. 2022, 7, 100119. [Google Scholar] [CrossRef]
Efimov, D.; Ushirobira, R. On an interval prediction of COVID-19 development based on a SEIR epidemic model. Annu. Rev. Control 2021, 51, 477–487. [Google Scholar] [CrossRef] [PubMed]
López, L.; Rodo, X. A modified SEIR model to predict the COVID-19 outbreak in Spain and Italy: Simulating control scenarios and multi-scale epidemics. Results Phys. 2021, 21, 103746. [Google Scholar] [CrossRef] [PubMed]
Sardar, T.; Nadim, S.S.; Rana, S.; Chattopadhyay, J. Assessment of lockdown effect in some states and overall India: A predictive mathematical study on COVID-19 outbreak. Chaos Solitons Fractals 2020, 139, 110078. [Google Scholar] [CrossRef]
Zisad, S.N.; Hossain, M.S.; Hossain, M.S.; Andersson, K. An integrated neural network and SEIR model to predict COVID-19. Algorithms 2021, 14, 94. [Google Scholar] [CrossRef]
Qiu, Z.; Sun, Y.; He, X.; Wei, J.; Zhou, R.; Bai, J.; Du, S. Application of genetic algorithm combined with improved SEIR model in predicting the epidemic trend of COVID-19, China. Sci. Rep. 2022, 12, 8910. [Google Scholar] [CrossRef] [PubMed]
Dong, Z.G.; Song, B.; Meng, Y.X. Prediction of COVID-19 based on mixed SEIR-ARIMA model. Comput. Mod. 2022, 1. [Google Scholar]
Kumar, A.; Priya, B.; Srivastava, S.K. Response to the COVID-19: Understanding implications of government lockdown policies. J. Policy Model. 2021, 43, 76–94. [Google Scholar] [CrossRef] [PubMed]
Prajapati, S.P.; Bhaumik, R.; Kumar, T. An Intelligent ABM-based framework for developing pandemic-resilient urban spaces in post-COVID smart cities. Procedia Comput. Sci. 2023, 218, 2299–2308. [Google Scholar] [CrossRef] [PubMed]
Hadley, E.; Rhea, S.; Jones, K.; Li, L.; Stoner, M.; Bobashev, G. Enhancing the prediction of hospitalization from a COVID-19 agent-based model: A Bayesian method for model parameter estimation. PLoS ONE 2022, 17, e0264704. [Google Scholar] [CrossRef] [PubMed]
Narassima, M.; Anbuudayasankar, S.; Jammy, G.; Pant, R.; Choudhury, L.; Ramakrishnan, A.; John, D. An agent based model for assessing transmission dynamics and health systems burden for COVID-19. Indones J. Electr. Eng. Comput. Sci. 2021, 24, 1735–1743. [Google Scholar]
Padmanabhan, R.; Meskin, N.; Khattab, T.; Shraim, M.; Al-Hitmi, M. Reinforcement learning-based decision support system for COVID-19. Biomed. Signal Process. Control 2021, 68, 102676. [Google Scholar] [CrossRef]
Khalilpourazari, S.; Hashemi Doulabi, H. Designing a hybrid reinforcement learning based algorithm with application in prediction of the COVID-19 pandemic in Quebec. Ann. Oper. Res. 2022, 312, 1261–1305. [Google Scholar] [CrossRef]
Kwak, G.H.; Ling, L.; Hui, P. Deep reinforcement learning approaches for global public health strategies for COVID-19 pandemic. PLoS ONE 2021, 16, e0251550. [Google Scholar] [CrossRef]
Uddin, M.I.; Ali Shah, S.A.; Al-Khasawneh, M.A.; Alarood, A.A.; Alsolami, E. Optimal policy learning for COVID-19 prevention using reinforcement learning. J. Inf. Sci. 2022, 48, 336–348. [Google Scholar] [CrossRef]
Trad, F.; El Falou, S. Towards using deep reinforcement learning for better COVID-19 vaccine distribution strategies. In Proceedings of the 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA), Riyadh, Saudi Arabia, 1–3 March 2022; pp. 7–12. [Google Scholar]
Bovim, T.R.; Gullhav, A.N.; Andersson, H.; Dale, J.; Karlsen, K. Simulating emergency patient flow during the COVID-19 pandemic. J. Simul. 2023, 17, 407–421. [Google Scholar] [CrossRef]
Angelopoulou, A.; Mykoniatis, K. Hybrid modelling and simulation of the effect of vaccination on the COVID-19 transmission. J. Simul. 2022, 18, 88–99. [Google Scholar] [CrossRef]
Wilensky, U.; Rand, W. An Introduction to Agent-Based Modeling: Modeling Natural, Social, and Engineered Complex Systems with NetLogo; Mit Press: Cambridge, MA, USA, 2015. [Google Scholar]
Izquierdo, L.R.; Olaru, D.; Izquierdo, S.S.; Purchase, S.; Soutar, G.N. Fuzzy logic for social simulation using NetLogo. J. Artif. Soc. Soc. Simul. 2015, 18, 1. [Google Scholar] [CrossRef]
Indicator. Available online: https://data.worldbank.org/indicator (accessed on 15 February 2023).
Statista. Available online: https://www.statista.com/ (accessed on 15 February 2023).
Howard, J.; Huang, A.; Li, Z.; Tufekci, Z.; Zdimal, V.; van der Westhuizen, H.M.; von Delft, A.; Price, A.; Fridman, L.; Tang, L.H.; et al. An evidence review of face masks against COVID-19. Proc. Natl. Acad. Sci. USA 2021, 118, e2014564118. [Google Scholar] [CrossRef] [PubMed]
Lauer, S.A.; Grantz, K.H.; Bi, Q.; Jones, F.K.; Zheng, Q.; Meredith, H.R.; Azman, A.S.; Reich, N.G.; Lessler, J. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application. Ann. Intern. Med. 2020, 172, 577–582. [Google Scholar] [CrossRef] [PubMed]
Information, H.; Authority, Q. Evidence Summary for COVID-19 Viral Load over Course of Infection; Health Information and Quality Authority: Dublin, Ireland, 2020. [Google Scholar]
Xiang, Y.; Jia, Y.; Chen, L.; Guo, L.; Shu, B.; Long, E. COVID-19 epidemic prediction and the impact of public health interventions: A review of COVID-19 epidemic models. Infect. Dis. Model. 2021, 6, 324–342. [Google Scholar] [CrossRef]
Verity, R.; Okell, L.C.; Dorigatti, I.; Winskill, P.; Whittaker, C.; Imai, N.; Cuomo-Dannenburg, G.; Thompson, H.; Walker, P.G.; Fu, H.; et al. Estimates of the severity of COVID-19 disease. MedRxiv 2020. [Google Scholar] [CrossRef]
Setti, L.; Passarini, F.; De Gennaro, G.; Barbieri, P.; Perrone, M.G.; Borelli, M.; Palmisani, J.; Di Gilio, A.; Piscitelli, P.; Miani, A. Airborne transmission route of COVID-19: Why 2 meters/6 feet of inter-personal distance could not be enough. Int. J. Environ. Res. Public Health 2020, 17, 2932. [Google Scholar] [CrossRef] [PubMed]
Ma, Q.; Liu, J.; Liu, Q.; Kang, L.; Liu, R.; Jing, W.; Wu, Y.; Liu, M. Global percentage of asymptomatic SARS-CoV-2 infections among the tested population and individuals with confirmed COVID-19 diagnosis: A systematic review and meta-analysis. JAMA Netw. Open 2021, 4, e2137257. [Google Scholar] [CrossRef]
Eythorsson, E.; Runolfsdottir, H.L.; Ingvarsson, R.F.; Sigurdsson, M.I.; Palsson, R. Rate of SARS-CoV-2 reinfection during an omicron wave in Iceland. JAMA Netw. Open 2022, 5, e2225320. [Google Scholar] [CrossRef]
Sotoodeh Ghorbani, S.; Taherpour, N.; Bayat, S.; Ghajari, H.; Mohseni, P.; Hashemi Nazari, S.S. Epidemiologic characteristics of cases with reinfection, recurrence, and hospital readmission due to COVID-19: A systematic review and meta-analysis. J. Med. Virol. 2022, 94, 44–53. [Google Scholar] [CrossRef] [PubMed]
COVID-19: Guidance for Households with Possible Coronavirus Infection. Available online: https://www.gov.uk/government/publications/covid-19-stay-at-home-guidance (accessed on 15 February 2023).
Rule of Six Comes into Effect to Tackle Coronavirus. Available online: https://www.gov.uk/government/news/rule-of-six-comes-into-effect-to-tackle-coronavirus (accessed on 15 February 2023).
Grover, S.; Sahoo, S.; Mehra, A.; Avasthi, A.; Tripathi, A.; Subramanyan, A.; Pattojoshi, A.; Rao, G.P.; Saha, G.; Mishra, K.; et al. Psychological impact of COVID-19 lockdown: An online survey from India. Indian J. Psychiatry 2020, 62, 354–362. [Google Scholar] [CrossRef] [PubMed]
United Nations, Department of Economic and Social Affairs, Population Division. World Population Prospects 2022, Online Edition. Available online: https://population.un.org/wpp/DataSources/ (accessed on 15 February 2023).
The Republic of Poland. Coronavirus: Information and Recommendations. Available online: https://www.gov.pl/web/coronavirus (accessed on 15 February 2023).
The State Council, the People’s Republic of China. Epidemic Prevention and Control. Available online: http://www.gov.cn/fuwu/zt/yqfwzq/yqfkblt.htm (accessed on 15 February 2023).
The White House. Available online: https://www.whitehouse.gov/covidplan/ (accessed on 15 February 2023).
Raciborski, F.; Pinkas, J.; Jankowski, M.; Sierpiński, R.; Zgliczyński, W.S.; Szumowski, Ł.; Rakocy, K.; Wierzba, W.; Gujski, M. Dynamics of the coronavirus disease 2019 outbreak in Poland: An epidemiological analysis of the first 2 months of the epidemic. Pol. Arch. Intern. Med. Pol. Arch. Med. Wewn. 2020, 130, 615–621. [Google Scholar]
Wu, Z.; McGoogan, J.M. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: Summary of a report of 72·314 cases from the Chinese Center for Disease Control and Prevention. JAMA 2020, 323, 1239–1242. [Google Scholar] [CrossRef]
Holshue, M.L.; DeBolt, C.; Lindquist, S.; Lofy, K.H.; Wiesman, J.; Bruce, H.; Spitters, C.; Ericson, K.; Wilkerson, S.; Tural, A.; et al. First case of 2019 novel coronavirus in the United States. N. Engl. J. Med. 2020, 382, 929–936. [Google Scholar] [CrossRef]

Figure 1. Pandamic simulator.

Figure 2. Infection process.

Figure 3. Model process.

Figure 4. The results of different policies.

Figure 5. Number of actual and simulated cases in Poland.

Figure 6. Number of actual and simulated cases in China.

Figure 7. Number of actual and simulated cases in the US.

Figure 8. Decision sequence results under different factors.

Figure 9. The result of different factors.

Table 1. Simulator initial parameter settings.

Parameter	Value	Source/Justification
Size of world	64 × 64	Easy to visualize
Meters per patch	10	Easy to show the process of virus transmission
Population	500	Population density indicator from the World Bank [26]
Number of houses	200	Number of dwellings indicator from statista [27]
Number of interests	50	Arbitrary value for the ease of comparison
Hospital beds (per 1000 people)	8	Indicator from the World Bank [26]

Table 2. Parameter settings for the infection process.

Parameter	Value	Source
Incubation period (days)	$\ln (X) \sim N (5.8, {1.98}^{2})$	[29]
Infectious period (days)	$X \sim N (8.5, {0.75}^{2})$	[30]
Latent period (days)	3	[31]
Infection-fatality rate (%)	0.9	[32]
Initial infection	5	\
Infection radius (m)	2	[33]
Asymptomatic infection rate (%)	50	[34]
Reinfection rate (%)	11	[35,36]

Table 3. Levels of lockdown.

Levels	W	S	G	A	I
0	0.1	0	∞	0	0
1	0.3	0.2	∞	0	0
2	0.3	0.5	∞	0	0
3	0.3	0.5	8	0	0
4	0.3	0.5	4	0	1
5	0.5	0.8	0	1	1

Table 4. The states and values of variables

p_{i}

,

s_{i}

and

h_{i}

.

Table 4. The states and values of variables

p_{i}

,

s_{i}

and

h_{i}

.

Variables	States & Values
$p_{i}$	POI	Home	Hospital
$p_{i}$	1	0.8	0
$s_{i}$	Asymptomatic	Symptomatic	Death	Other
$s_{i}$	0.8	0.2	0	1
$h_{i}$	Asymptomatic	Latent	Healthy
	0.5	0.9	1
	Symptomatic	Treatment	Death
	0	0.9	−2

Table 5. Sensitivity of different policies.

Wear Masks	Stay at Home	People Gathered	Area Lockdown	Isolation If Infected
0.94	0.93	0.92	0.60	0.78

Table 6. Peak percentage of active cases under different measures.

Measures	Parameters	Mean	Median	Range	Variance	Standard Deviation
No Measures	∖	60.1	60.7	11.6	14.61	3.82
Wear Masks Rate (%)	25	51.8	53.1	7.8	8.82	2.97
	50	38.3	38.3	5.0	3.61	1.90
	75	27.3	26.4	8.8	11.10	3.33
	100	11.6	11.9	5.4	3.25	1.80
Stay at Home Rate (%)	30	47.2	48.4	8.4	9.39	3.06
	60	25.9	25.7	8.2	8.68	2.95
	90	11.7	11.1	6.0	5.52	2.35
Maximum Number of People Gathered	2	12.5	13.0	5.2	2.90	1.70
	4	26.1	26.1	1.6	0.33	0.57
	6	44.8	44.8	11.2	15.39	3.92
	8	55.1	53.2	9.6	14.46	3.80
Area Lockdown	∖	29.2	27.8	15.2	32.19	5.67
Isolation if Infected	∖	19.6	20.6	6.4	6.11	2.47

Table 7. Background of Poland, China, and the US.

Country	Poland	China	The US
Population (million)	37.9 [40]	1410 [40]	338 [40]
Population density (per km²)	124 [40]	153 [40]	37 [40]
Lockdown implemented	General [41]	Strict [42]	Lenient [43]
Date of first case	4 March 2020 [44]	8 December 2019 [45]	20 January 2020 [46]
Population ages 65 and above (% of total population)	19 [26]	13 [26]	17 [26]
Hospital beds per 1000 people	6.5 [26]	4.3 [26]	2.4 [27]
Physicians per 1000 people	2.4 [26]	2.0 [26]	2.6 [27]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, Y.; Ma, B.; Cao, L.; Liu, Y. A Data-Driven Pandemic Simulator with Reinforcement Learning. Electronics 2024, 13, 2531. https://doi.org/10.3390/electronics13132531

AMA Style

Zhang Y, Ma B, Cao L, Liu Y. A Data-Driven Pandemic Simulator with Reinforcement Learning. Electronics. 2024; 13(13):2531. https://doi.org/10.3390/electronics13132531

Chicago/Turabian Style

Zhang, Yuting, Biyang Ma, Langcai Cao, and Yanyu Liu. 2024. "A Data-Driven Pandemic Simulator with Reinforcement Learning" Electronics 13, no. 13: 2531. https://doi.org/10.3390/electronics13132531

APA Style

Zhang, Y., Ma, B., Cao, L., & Liu, Y. (2024). A Data-Driven Pandemic Simulator with Reinforcement Learning. Electronics, 13(13), 2531. https://doi.org/10.3390/electronics13132531

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Levels	W	S	G	A	I
0	0.1	0	∞	0	0
1	0.3	0.2	∞	0	0
2	0.3	0.5	∞	0	0
3	0.3	0.5	8	0	0
4	0.3	0.5	4	0	1
5	0.5	0.8	0	1	1

Levels	W	S	G	A	I
0	0.1	0	∞	0	0
1	0.3	0.2	∞	0	0
2	0.3	0.5	∞	0	0
3	0.3	0.5	8	0	0
4	0.3	0.5	4	0	1
5	0.5	0.8	0	1	1

Article Menu

A Data-Driven Pandemic Simulator with Reinforcement Learning

Abstract

1. Introduction

2. Related Work

3. Preliminaries

3.1. Agent-Based Model

3.2. NetLogo

3.3. Q-Learning

3.4. Deep Q-Network

4. Pandemic Simulator

4.1. Environment

4.2. Agent

4.3. Infection Mechanism

5. Decision-Making Based on DQN

5.1. Problem Statement

5.2. Policy of Government Agent

5.3. Model Process

5.4. Environment Settings

5.4.1. State Space

5.4.2. Action Space

5.4.3. Reward Design

5.4.4. Training Parameters

6. Experiments

6.1. Sensitivity Analysis

6.2. Effectiveness Analysis

6.2.1. Poland

6.2.2. China

6.2.3. The US

6.3. Results under Different Factors

6.3.1. Economic Factor

6.3.2. Health Factor

6.3.3. Psychological Factor

6.3.4. All Factors

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Levels	W	S	G	A	I
0	0.1	0	∞	0	0
1	0.3	0.2	∞	0	0
2	0.3	0.5	∞	0	0
3	0.3	0.5	8	0	0
4	0.3	0.5	4	0	1
5	0.5	0.8	0	1	1