With the advent of the first pandemic wave of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), the question arose as to whether the spread of the virus would be controlled by preventive measures or would follow a different course, regardless of the pattern of spread already recorded. The conditions created by this unprecedented pandemic have highlighted the importance of reliable data from official sources, their complete recording and analysis, and the accurate investigation of epidemiological indicators in near real time. There is an ongoing research demand both for reliable and effective modeling of the disease and for substantiated views that allow those responsible for public health policy to make optimal decisions when designing preventive or repressive measures. The main objective of this study is to present an innovative data-analysis system for COVID-19 disease progression in Greece and its neighboring countries, based on real-time statistics of the epidemiological indicators. The system uses visualized data produced by an automated information system developed during the study, which analyzes large pandemic-related datasets and makes extensive use of advanced machine learning methods. Finally, the aim is to support optimal decisions in near real time with up-to-date technological means and to develop medium-term forecasts of disease progression, thus assisting the competent bodies in taking appropriate measures for the effective management of the available health resources.
The health crisis caused by the SARS-CoV-2 pandemic, combined with its economic consequences and the shock to health systems, has raised serious concerns about how to make timely and valid decisions on prevention and social distancing measures. The COVID-19 pandemic has created a rapidly changing environment in which a huge amount of data on the spread of the virus is presented daily. Using these data effectively, and analyzing the most up-to-date information both thoroughly and quickly enough to support the best decisions, requires intelligent processing in near real time.
The analysis of the spread rate of COVID-19 is directly related to the general concerns and challenges of large-scale, near real-time data analysis. Specifically, it depends on the high velocity with which the relevant information arrives; on how this information is collected and stored (its volume); on the variety of unstructured or semi-structured data forms that can be collected; on their variability, as epidemiological data change in importance over time; on their visualization; on the diagnosis of whether the information is accurate or incomplete (its veracity); and finally on the determination of their operational value. Understanding how these data parameters are linked can help civil protection organizations identify clearly which capabilities they need to develop or acquire to make full use of the data they hold, in order to strengthen public safety and health and, consequently, to safeguard the state’s health system.
Beyond their management, the biggest modern challenge for large-scale data such as those related to COVID-19 is to analyze them so as to reveal the hidden knowledge they contain. For example, pattern recognition methods make it possible to identify trends or patterns, unknown correlations, and other useful information, in order to predict behavior and make optimal decisions. It is important to note that this analysis can be used not only to implement appropriate policies for preventing and dealing with future epidemics, by giving a retrospective picture of the pace and ways of spread, but also to make optimal decisions and take action in near real time.
This ability to process huge amounts of data using advanced algorithms and intelligent analysis and processing tools is a very promising solution for the effective detection and tracing of active cases. It also creates the background for developing spatio-temporal solutions adapted to real needs, as well as methods for the timely forecasting of potential threats to public health.
Because of the urgent need to act to reduce the spread of the disease, civil and health protection mechanisms require appropriate algorithms for fast to instantaneous processing of large volumes of data with high complexity and possibly high inhomogeneity. In general, the approaches chosen to shield the public health system should meet specific requirements covering multiple design aspects, such as:
Ability to securely exchange data between distributed systems.
The above requirements have led to the parallel development of both the infrastructure that supports large-scale data and the algorithmic standards that must be followed to ensure public health. In this spirit, the study of how to record, analyze, and model the spread of the disease is extremely important from both an epidemiological and a mathematical point of view.
This paper proposes a novel model for the near real-time analysis of COVID-19 disease data, as well as an intelligent machine learning system for predicting disease progression, in order to assist in deciding on preventive or suppressive social distancing measures and on appropriate measures for the management of the health system. The proposed system is based on automated data collection and analysis, while the medium-term forecast is based on advanced machine learning methods. Within this context, the proposed method can be applied to different aspects of the temporal spread of COVID-19 in Greece and its neighboring countries to present an exploratory study of disease progression (real-time statistics on the cumulative number of infections, deaths, ICU patients, and epidemiological indicators). In practice, the proposed methodology offers an active method for modeling and forecasting the pandemic that can remove disconnected past data from the time-series structure, providing a modeling and forecasting tool that facilitates decision making and resource management in epidemiology and can contribute to the ongoing fight against the COVID-19 pandemic.
The rest of the paper is structured as follows. The second section presents relevant research on how to record, analyze, and model the spread of a pandemic. The third section presents the mathematical modeling and analysis of epidemiological data using non-spatial causal models and indicators. The fourth section presents the time-series forecasting methodology, while the fifth section presents the data used and the results obtained. Finally, the last section gives an extensive analysis and discussion of the general methodology, and the study closes with proposals for future research.
2. Related Work
Methodologies for the mathematical modeling of the spread of the disease, and especially techniques for predicting the future variation of the epidemic curve, are in constant demand by the research community, with remarkable findings already recorded that offer an important legacy of knowledge [14,15,16].
For example, the detailed research of Sarkodie et al. temporally models the evolution of the pandemic while constructing conceptual tools for linking confirmed cases and deaths, based on four characteristic health indicators. The final assessment of this research is based on cross-sectional dependence, endogeneity, and unobserved heterogeneity. Although a linear relationship between deaths and confirmed cases is revealed, as well as a non-linear correlation between recovered and confirmed cases, the study does not provide a final model that generalizes well, as it uses small-scale, non-critical data that cannot be used for extensive identification of the phenomenon.
On the other hand, the purpose of this work  is to give a contribution to the understanding of the COVID-19 contagion in Italy. To this end, the authors developed a modified Susceptible–Infected–Recovered–Deceased (SIRD) model for the contagion, and they used official data of the pandemic for identifying the parameters of this model. Their approach features two main non-standard aspects. The first one is that model parameters can be time-varying, allowing them to capture possible changes of the epidemic behavior, due for example to containment measures enforced by authorities or modifications of the epidemic characteristics and to the effect of advanced antiviral treatments. The time-varying parameters are written as linear combinations of basis functions and are then inferred from data using sparse identification techniques. The second non-standard aspect resides in the fact that they consider as model parameters also the initial number of susceptible individuals, as well as the proportionality factor relating the detected number of positives with the actual (and unknown) number of infected individuals. Identifying the model parameters amounts to a non-convex identification problem that they solve by means of a nested approach, consisting of a one-dimensional grid search in the outer loop, with a Lasso optimization problem in the inner step.
In contrast, Anastassopoulou et al., using more complete datasets and a heuristic methodology for estimating epidemiological parameters, model the rates of disease spread with a much more complete and substantial contribution to the way the pandemic is assessed. However, the reverse prediction process based on spread scenarios, which reproduces the confirmed hypotheses, creates a directed trend within a very specific framework that is suitable only for verifying simulation techniques.
A fully technical prototype of high research interest was presented by Fong et al., who proposed an optimized prediction model of polynomial neural networks with corrective feedback that can generalize even when samples are minimal. Although the methodology is very robust, it needs to be compared with competing algorithms, taking into account process evaluation criteria beyond those describing the level of accuracy or error.
Differently from the related literature, where modeling and controlling the pandemic contagion is typically addressed on a national basis, this paper  proposes an optimal control approach that supports governments in defining the most effective strategies to be adopted during post-lockdown mitigation phases in a multi-region scenario. Based on the joint use of a non-linear Model Predictive Control scheme and a modified Susceptible–Infected–Recovered (SIR)-based epidemiological model, the approach is aimed at minimizing the cost of the so-called non-pharmaceutical interventions (that is, mitigation strategies), while ensuring that the capacity of the network of regional healthcare systems is not violated. In addition, the proposed approach supports policymakers in taking targeted intervention decisions on different regions by an integrated and structured model, thus both respecting the specific regional health systems characteristics and improving the system-wide performance by avoiding uncoordinated actions of the regions. The methodology is tested on the COVID-19 outbreak data related to the network of Italian regions, showing its effectiveness in properly supporting the definition of effective regional strategies for managing the COVID-19 diffusion.
Given the scale of the pandemic in different countries, many researchers have focused on local analyses based on officially available data. For example, Mahase et al. present the statistical data of the United Kingdom after the implementation of social distancing. A particularly detailed research effort to localize the phenomenon is presented in a study that explores the spatio-temporal trend of the epidemic in Italy. That study is based solely on statistical modeling, without the statistical significance tests needed to test its initial scientific hypothesis. The severity of this weakness is magnified by the fact that the object of epidemiological studies is an occurrence function, and more specifically a measure of association that quantifies the relationship between the determinant studied and the outcome, for which it must be decided whether the relationship is statistically significant or not.
Similarly, focusing on the peculiarities of the spread of COVID-19 in Greece, a related study offers an exploratory temporal analysis of the course of the disease while proposing a realistic, highly reliable prediction model. Specifically, a statistical analysis of the evolution of epidemiological data in Greece is presented, in which the rate of spread and the perceived spread of the disease are approximated and standardized mathematically. In addition, a highly reliable methodology for predicting total cases, deaths, and intensive care unit beds is proposed, based on the Regression Splines algorithm. The important innovation of the proposed model is that it builds on a prior Complex Network model of the social distancing measures taken in Greece, thus implementing a fully functional and realistic system for evaluating and interpreting disease-related events.
Extending the above investigation, a follow-up study attempts to anticipate the “Flattening of the Curve” in order to support optimal decisions on strengthening the health system and on adjusting the measures in place, such as a reduction of social distancing. The proposed approaches are realistic in the way they are evaluated while offering a powerful mechanism for modeling the spread of the pandemic.
Local evaluation of the phenomenon, while an essential basis for assessment, also has serious weaknesses if it does not rest on solid assumptions. For example, one paper presents a subjective approach to predicting disease spread based on exponential smoothing models; there, the trend index, which is calculated from the past pattern of the disease in local data and from the smoothing of the curve, is predicted from similar case studies of other countries that were further ahead in the pandemic.
Focusing on the specifics of the spread of the disease both epidemiologically and in terms of the implementation of preventive and repressive measures, this paper presents an exploratory study for the near real-time analysis of large-scale disease data with advanced intelligent machine learning techniques, which uses the visualized material that can be produced by the corresponding information system. The aim is to reveal the knowledge hidden in the epidemiological data, deciphering, and capturing the mathematics of the pandemic and specifically the indicators that can model the spatio-temporal evolution and the spread of the disease.
3. Mathematical Modeling and Pandemic Analytics
Spatio-temporal modeling of the circulation of pathogens between hosts and through vectors is used to simplify the reality or the complex correlations associated with a chaotic phenomenon such as the pathogen–host interaction. In particular, mathematical modeling, especially when performed in real time, is a powerful tool for studying the dynamic transmission of infectious diseases using non-spatial causal models (Susceptible–Infectious–Recovered, SIR) and, more generally, for assisting optimal decision making.
Decision making in epidemiology is based on predicting or simulating the behaviors and properties of complex systems through mathematical modeling. Epidemiology is the study of the distribution and evolution of various diseases in the human population (descriptive epidemiology) and of the factors that shape or can influence them (analytical epidemiology).
3.1. Real-Time Statistics
At the time of completing the study (17 June 2021), Greece had 417,253 coronavirus cases, 12,488 deaths, and 396,317 recoveries, with the daily variation shown in Figure 1.
Correspondingly, Figure 2, Figure 3, Figure 4 and Figure 5 show the daily variation of cases in Greece’s neighboring countries (Albania, Bulgaria, Turkey, and North Macedonia) to assist the decision-making system and the corresponding social distancing mechanisms.
For the most complete and effective decision making, real-time statistical analysis of the pandemic is required at a level where the technical characteristics of the problem can be captured. Detailed statistical analysis for Greece is presented in Table 1, Table 2, Table 3 and Table 4.
It should be noted that the stringency index is provided by the Oxford COVID-19 Government Response Tracker, whose team of one hundred experts constantly updates a database of 17 government response indicators covering containment policies such as school and workplace closures, restrictions on public events and public transportation, stay-at-home policies, etc. Essentially, it is a number ranging from 0 to 100 that reflects the 17 indicators, with a higher score indicating a higher level of stringency. The graphical representation of the statistical analysis of the pandemic in Greece is presented in Figure 6.
The correlation between the variables examined in Table 1, Table 2, Table 3 and Table 4 is presented in the following figure, and the table of their Pearson correlation coefficients is given in Figure 7:
Essentially, the above table shows the degree of linear correlation between the variables X and Y, with variances $\sigma_X^2$ and $\sigma_Y^2$, respectively, and covariance $\sigma_{XY} = \mathrm{Cov}(X, Y)$, through the correlation coefficient $R = \sigma_{XY} / (\sigma_X \sigma_Y)$. The correlation coefficient R, like the covariance $\sigma_{XY}$, expresses the degree and the way in which the two variables are correlated, that is, how one random variable varies with respect to the other. The covariance $\sigma_{XY}$ takes values that depend on the value range of X and Y, while the coefficient R takes values in the interval [−1, 1]: if R = 1, there is a perfect positive correlation between X and Y; if R = 0, there is no linear correlation between X and Y; and if R = −1, there is a perfect negative correlation between X and Y. When R = ±1, the relation is deterministic rather than probabilistic, because knowing the value of one random variable gives the exact value of the other. When the correlation coefficient is close to −1 or 1, the linear correlation of the two variables is strong (|R| > 0.9), while when it is close to 0, the variables are practically linearly unrelated.
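As an illustration of the Pearson coefficient described above, the following minimal sketch computes R as the covariance divided by the product of the standard deviations. The function name `pearson_r` and the toy series are assumptions made for this example; they are not the study's code or data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation R = Cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variances cancel in the ratio.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov_xy / math.sqrt(var_x * var_y)

# Hypothetical toy series (not data from the study): deaths proportional to cases.
new_cases = [10, 20, 30, 40, 50]
new_deaths = [1, 2, 3, 4, 5]
print(pearson_r(new_cases, new_deaths))
```

A perfectly proportional pair of series gives R = 1, and reversing one of them gives R = −1, matching the limiting cases discussed in the text.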
3.2. Near Real-Time Analytics
From the moment the epidemic was identified as the result of the new coronavirus SARS-CoV-2, the main priority of the scientific community was to collect appropriate data for developing the most important parameters of descriptive epidemiology, which can model the evolution and spread of the disease, in order to make optimal decisions and ensure public health.
These data must be combined with epidemiological indicators related to the spread of COVID-19, with analyses of areas of interest that are directly related to the spread of the pandemic, and with systems for recording and describing data such as tables, diagrams, etc. It should be emphasized that these mechanisms should depend not only on the logical results of the calculations performed but also on the time at which these results become available, because timing is fundamental in a time-critical real-time system such as the one under examination. Violating the time constraints implies an inability to make timely decisions and therefore leads to incomplete measures that cannot work in a pandemic.
In this study, a thorough description of how the pandemic spread in Greece is presented through a data analysis system with machine learning methods, which was developed to capture in real time, subject to data availability, the statistics, correlations, charts, and comparative tables provided by official health agencies, plus any other relevant information related to the pandemic. Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13 show comparative diagrams with Greece’s neighboring countries (Bulgaria, Albania, Turkey, and North Macedonia), aiming to assist the decision-making system and the corresponding social distancing mechanisms.
In addition to a thorough analysis of the data provided, this system can calculate in real time the most important epidemiological indicators, which are presented below.
3.2.1. Basic Reproduction Number (R0)
In epidemiology, R0 can be thought of as the expected number of cases generated directly by one case at the beginning of an epidemic, in a population where all individuals are susceptible to infection, that is, when there is no immunity in the population (natural or from vaccination) and no restrictive measures have yet been implemented [27,28,32].
If, for example, R0 = 3, each case infects another three people on average, and these, in turn, another three each, and so on. As a result, the number of cases keeps growing, and there is extensive spread. If R0 < 1, there is no risk of an epidemic, because each case infects fewer than one other person on average, and therefore transmission gradually declines. In general, the higher the value of R0, the more difficult it is to control the epidemic. For simple models, the percentage of the population that must be immunized to prevent the sustained spread of the infectious disease must be greater than $1 - 1/R_0$. On the other hand, the percentage of the population that remains susceptible to infection at the endemic equilibrium is $1/R_0$.
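The relation between R0 and the immunization threshold can be made concrete with a short sketch. It assumes the standard simple-model threshold 1 − 1/R0; the function names are hypothetical and introduced only for illustration.

```python
def herd_immunity_threshold(r0):
    """Minimum immune fraction 1 - 1/R0 under a simple homogeneous-mixing model."""
    if r0 <= 0:
        raise ValueError("R0 must be positive")
    if r0 <= 1:
        return 0.0  # transmission declines on its own, no immunization needed
    return 1.0 - 1.0 / r0

def expected_cases_in_generation(r0, generation):
    """Expected new cases in a given generation, starting from a single case."""
    return r0 ** generation

for r0 in (0.8, 1.5, 3.0):
    print(r0, round(herd_immunity_threshold(r0), 3), expected_cases_in_generation(r0, 4))
```

With R0 = 3, two thirds of the population must be immune, whereas for R0 < 1 the expected number of cases per generation shrinks toward zero.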
It is important to note that R0 is not a biological constant for a pathogen, as it is also influenced by other factors, such as environmental conditions and the behavior of the infected population. In addition, R0 does not in itself assess how quickly an infection is spreading in the population but should be considered in a broader research horizon. In addition, the estimated values of R0 depend on the model used and the values of other parameters, which suggests that the estimated values only make sense in the given space-time frame, and it is recommended not to use outdated values or to compare values based on different models .
3.2.2. Effective Reproduction Number (Rt)
When restrictive measures are implemented to reduce transmission, such as social distancing, the interest shifts from R0 to Rt. This indicator expresses the number of people that one case can infect under the restrictions imposed by these measures [6,27,32].
This value may change over time as the gradual introduction of measures and the change in the behavior of the population (e.g., hand hygiene, contact restriction, etc.) make transmission increasingly difficult. The aim is to reduce it to Rt < 1, as this indicates that control of the epidemic has been achieved.
Monitoring the course of Rt is extremely important, and its assessment should be updated at regular intervals based on the data collected from epidemiological surveillance (diagnosed cases per day) with the application of an appropriate methodology. In this way, the course of the epidemic and the effectiveness of the measures in real time can be approximated, since there is inevitably a delay from the moment a person becomes infected until he is diagnosed. Consequently, a possible increase in infections today could be reflected in the diagnosed cases of the coming days.
It is important to note that even if the epidemic has been reduced and the Rt reduced to low levels, the stopping of the measures may lead to an increase of cases, which is a typical example we have seen in Greece. Therefore, in the phase of gradual phasing out of the measures, the monitoring of Rt is very important as it will allow decisions to be taken for corrective actions if Rt is approaching or exceeding the value of 1.
The first step in modeling the Rt index is the arrival process of the recorded cases. A popular choice for the distribution of these arrivals is the Poisson distribution, a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed period, if these events occur at a known mean rate and independently of the time since the last event, as in the case under investigation. The Poisson distribution has a parameter λ, the average rate of infections per day, and the probability of observing k new cases in a day is given by the following function [26,28]:
$$P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
Given the Poisson distribution, we can construct the probability distribution of new cases for a set of values of λ. Conversely, for a fixed observed number of new cases k, the distribution of λ over k is called the likelihood function, which is evaluated from the same expression over a range of values of λ.
Under this relation, we can look for the likelihood L(Rt|kt), which requires a link between the Poisson parameter λ and the index Rt; this link is expressed by the following relation [33,34]:
$$\lambda = k_{t-1}\, e^{\gamma (R_t - 1)}$$
where γ is the inverse of the serial interval (about 4 days for COVID-19) and $k_{t-1}$ is the number of new cases observed at time t − 1.
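Since we know the exact number of cases per day, we can reformulate the likelihood function as a Poisson distribution that is parameterized by fixing k and varying Rt, specifically as follows (Figure 14):
$$L(R_t \mid k_t) = \frac{\lambda^{k_t} e^{-\lambda}}{k_t!}, \qquad \lambda = k_{t-1}\, e^{\gamma (R_t - 1)}$$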
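A minimal sketch of this likelihood, evaluated on a grid of candidate Rt values, is given here for illustration. It assumes the relation λ = k_{t−1}·exp(γ(Rt − 1)) with γ = 1/4 (a 4-day serial interval). The grid, the function names, and the example counts are hypothetical and are not taken from the study's system.

```python
import math

GAMMA = 1 / 4  # assumed: inverse of a 4-day serial interval

def poisson_log_pmf(k, lam):
    """log P(k | lambda) for the Poisson distribution (lam must be > 0)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def rt_likelihood(k_t, k_prev, rt_grid, gamma=GAMMA):
    """Likelihood of each candidate Rt, normalised to sum to 1 over the grid.

    Requires k_prev > 0, since lambda = k_prev * exp(gamma * (Rt - 1)).
    """
    log_l = [poisson_log_pmf(k_t, k_prev * math.exp(gamma * (rt - 1)))
             for rt in rt_grid]
    peak = max(log_l)  # subtract the peak before exponentiating to avoid underflow
    weights = [math.exp(v - peak) for v in log_l]
    total = sum(weights)
    return [w / total for w in weights]

rt_grid = [i / 100 for i in range(0, 601)]  # candidate Rt values 0.00 .. 6.00
likelihood = rt_likelihood(k_t=120, k_prev=100, rt_grid=rt_grid)
most_likely_rt = rt_grid[likelihood.index(max(likelihood))]
print(most_likely_rt)
```

Working in log space keeps the computation stable for large daily counts, where the factorial and the exponential would otherwise overflow.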
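For each day, there is an independent estimate of Rt. To combine the information from the previous days with the current day, Bayes’ theorem is used to update the beliefs about the true value of Rt based on the number of new cases reported daily. Bayes’ theorem is applied as follows:
$$P(R_t \mid k_t) = \frac{P(k_t \mid R_t)\, P(R_t)}{P(k_t)}$$
Using the posterior of the previous period, $P(R_{t-1} \mid k_{t-1})$, as the prior $P(R_t)$, and writing $P(k_t \mid R_t) = L(R_t \mid k_t)$, the previous equation is written as follows:
$$P(R_t \mid k_t) \propto P(R_{t-1} \mid k_{t-1})\, L(R_t \mid k_t)$$
Iterating back to t = 0, the relation becomes:
$$P(R_t \mid k_t) \propto P(R_0) \prod_{\tau=0}^{t} L(R_\tau \mid k_\tau)$$
With a uniform prior $P(R_0)$, this reduces to:
$$P(R_t \mid k_t) \propto \prod_{\tau=0}^{t} L(R_\tau \mid k_\tau)$$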
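The Bayesian update with a uniform prior can be sketched as a running product of daily Poisson likelihoods over a grid of candidate Rt values. The sketch makes several simplifying assumptions that are not confirmed by the source: Rt is treated as constant over the window, γ = 1/4, and the case counts are invented for illustration.

```python
import math

def log_likelihood(k_t, k_prev, rt, gamma=0.25):
    """Poisson log-likelihood of Rt with lambda = k_prev * exp(gamma * (Rt - 1))."""
    lam = k_prev * math.exp(gamma * (rt - 1))
    return k_t * math.log(lam) - lam - math.lgamma(k_t + 1)

def posterior_rt(daily_cases, rt_grid, gamma=0.25):
    """Posterior over Rt on a grid: uniform prior times the product of daily likelihoods."""
    log_post = [0.0] * len(rt_grid)  # log of a uniform prior, up to a constant
    for k_prev, k_t in zip(daily_cases, daily_cases[1:]):
        for i, rt in enumerate(rt_grid):
            log_post[i] += log_likelihood(k_t, k_prev, rt, gamma)
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

rt_grid = [i / 100 for i in range(0, 601)]  # candidate Rt values 0.00 .. 6.00
daily_cases = [100, 128, 165, 212, 272]     # hypothetical counts, not study data
posterior = posterior_rt(daily_cases, rt_grid)
mode = rt_grid[posterior.index(max(posterior))]
print(mode)
```

Each additional day multiplies in one more likelihood term, so the posterior concentrates around the most probable value as more data arrive.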
Taking the posterior probability, there is a significant change in the variance, as shown graphically in Figure 15 below.
When estimating a quantity, it is very important to convey the error surrounding the estimate. A popular way to do this is to use highest density intervals, computed with the highest density interval (HDI) algorithm on the posterior distributions. The HDI can be used to summarize the uncertainty of posterior distributions as a credible interval (CI), where all points within the interval have a higher probability density than points outside it. With this parameterization, both the most probable values of the Rt index and the variation of the HDI over time can be plotted (Figure 16).
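For a posterior stored on a discrete grid, one simple way to obtain a highest density interval is to search for the narrowest contiguous interval that holds the requested probability mass. The sketch below assumes a unimodal posterior. The function name and the toy distribution are illustrative, and this is not necessarily the algorithm used by the study's system.

```python
def highest_density_interval(grid, probs, mass=0.9):
    """Narrowest contiguous grid interval whose total probability is at least `mass`.

    Assumes `grid` is sorted and `probs` sums to (about) 1; returns None if
    no interval reaches the requested mass.
    """
    n = len(grid)
    cumulative = [0.0]
    for p in probs:
        cumulative.append(cumulative[-1] + p)
    best = None
    hi = 0
    for lo in range(n):
        hi = max(hi, lo)
        # Extend the right edge until the interval [lo, hi] holds enough mass.
        while hi < n and cumulative[hi + 1] - cumulative[lo] < mass:
            hi += 1
        if hi == n:
            break  # no interval starting at lo (or later) can reach the mass
        if best is None or grid[hi] - grid[lo] < best[1] - best[0]:
            best = (grid[lo], grid[hi])
    return best

# Toy posterior over five candidate values (hypothetical numbers).
grid = [0, 1, 2, 3, 4]
probs = [0.05, 0.2, 0.5, 0.2, 0.05]
print(highest_density_interval(grid, probs, mass=0.8))
```

The two-pointer scan visits each grid point at most twice, so it is cheap enough to run for every day of the series.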
This is a very useful representation, as it shows how the components change every day. In essence, this view gives the most probable value of Rt together with the certainty of the estimate over time, where the highest density interval narrows as the daily recorded cases increase. Below, the posterior distribution for each day (row) is plotted simultaneously. The posterior distributions start without much confidence (wide) and gradually become more confident (narrower) about the true value of Rt (Figure 17, Figure 18, Figure 19, Figure 20 and Figure 21).
Since the results include uncertainty, it is desirable to show the most probable value of Rt along with the highest density interval. In addition, taking into account the direct relationship that may exist between the spread of the virus and the opening of the borders, especially with the neighboring countries that share land borders with Greece, this study includes similar analyses for Albania, Bulgaria, Turkey, and North Macedonia, as shown in Figure 22 below [11,28,32].
Correspondingly, the following diagrams present detailed data on the variation of Rt for the examined countries and the probabilities related to this index (Figure 23 and Figure 24) [6,11,33].
Table 5 presents the Rt index by country based on the statistical analysis, giving the most probable values as well as the respective Low and Max.
3.2.3. Case Fatality Rate (CFR)
The CFR is the ratio of deaths from the virus to the total number of people diagnosed with the disease over a given period of time. It is essentially an assessment of the risk of death from the disease, is usually expressed as a percentage, and is an indicator of the severity of the disease. It is important to note that the CFR is not constant: it varies between populations and over time, owing to the interaction between the causative agent of the disease, the host, and the environment, as well as to the available treatment infrastructure and the quality of medical care provided by the health system [27,32].
Reliable CFRs that can be used to assess deaths and evaluate any public health measures taken are calculated at the end of an epidemic, after all cases have been resolved, that is, after the affected individuals have either died or recovered. Figure 25 below shows the CFR index for Greece and its neighboring countries.
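As a worked example of the definition above, the sketch below computes a crude CFR from the cumulative figures for Greece quoted in Section 3.1 (17 June 2021). Because not all cases had been resolved at that date, the result is only an interim estimate. The function name is hypothetical.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Crude CFR as a percentage of confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return 100.0 * deaths / confirmed_cases

# Cumulative figures for Greece on 17 June 2021, as quoted in Section 3.1.
cfr_greece = case_fatality_rate(deaths=12_488, confirmed_cases=417_253)
print(f"{cfr_greece:.2f}%")
```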
3.2.4. Mortality Rate (MR)
Mortality, or the mortality rate, is a measure of the number of deaths (either in general or due to a specific cause) in a given population, relative to the population size, per unit of time. As a rule, the unit of mortality is the number of deaths per 1000 people per year. The general form of the mortality calculation formula is $\frac{d}{p} \times 10^{n}$, where d is the number of deaths from the cause being studied, p is the size of the population from which the deaths came, and $10^{n}$ is a conversion factor that determines the size of the denominator. Specifically, the MR index is calculated as follows [6,32]:
Figure 26 below shows the mortality rate index for the countries under study.
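The general form (d / p) × 10^n can be sketched as follows. The population figure is a hypothetical round number chosen only for illustration and does not come from the study; the death count is the cumulative figure quoted in Section 3.1.

```python
def mortality_rate(deaths, population, n=3):
    """Deaths per 10**n people over the observation period: (d / p) * 10**n."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 10 ** n

ASSUMED_POPULATION = 10_700_000  # hypothetical round figure, not from the study
# n=5 expresses the rate per 100,000 people.
print(round(mortality_rate(12_488, ASSUMED_POPULATION, n=5), 2))
```

Choosing n only rescales the result (per 1000 with n=3, per 100,000 with n=5); the observation period must be stated alongside the rate.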
3.2.5. Recovery Rate (RR) or Discharge Rate (DR)
In its simplest form, the RR is calculated by dividing the number of recoveries by the number of confirmed cases. Specifically, the RR index is calculated as follows [27,28,32,33]:
$$RR = \frac{\text{number of recoveries}}{\text{number of confirmed cases}}$$
Figure 27 below shows the RR index for the countries under study.
3.2.6. Infection Rate (IR)
IR is the apparent rate of infection, which is an estimation of the rate of disease progression, based on proportional measures of the extent of infection at different times.
First, a proportional measure of the extent of the infection is chosen. Measurements of this extent are then taken over time and fitted with an appropriate mathematical model. The model assumes that the progression of the infection is limited by the share of the population that remains to be infected; without this limit, the infection would grow exponentially. Under this assumption, the IR can be calculated using the following formula [26,28,32]:
$$IR = \frac{1}{t_2 - t_1} \ln\left(\frac{x_2 (1 - x_1)}{x_1 (1 - x_2)}\right)$$
where $t_1$ is the time of the first measurement, $t_2$ is the time of the second measurement, $x_1$ is the proportion of infection measured at time $t_1$, and $x_2$ is the proportion of infection measured at time $t_2$. The values for the maximum infection rate of the study countries are presented in Table 6 below [6,28,32,33].
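As an illustration of the two-measurement calculation, the following Python sketch assumes the standard logistic form of the apparent infection rate, $\frac{1}{t_2 - t_1}\ln\frac{x_2(1 - x_1)}{x_1(1 - x_2)}$. The function name and the sample proportions are hypothetical and are not values from the study.

```python
import math


def apparent_infection_rate(t1, x1, t2, x2):
    """Apparent infection rate between two measurements (logistic assumption).

    x1 and x2 are proportions of infection, strictly between 0 and 1,
    measured at times t1 < t2.
    """
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    for x in (x1, x2):
        if not 0.0 < x < 1.0:
            raise ValueError("proportions must lie strictly between 0 and 1")
    return math.log((x2 * (1.0 - x1)) / (x1 * (1.0 - x2))) / (t2 - t1)


# Hypothetical example: 1% of the population infected on day 0, 5% on day 10.
rate = apparent_infection_rate(0, 0.01, 10, 0.05)
print(rate)
```

Because the formula is the difference of the log-odds of infection divided by the elapsed time, it returns the growth parameter exactly when the proportions lie on a logistic curve.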
Prevalence is the proportion of a specific population that is found to be affected by the epidemic, and it essentially expresses the actual number of patients in the population. It is obtained by comparing the number of people found to have the disease with the total number of people studied and is usually expressed as a fraction, a percentage, or the number of cases per 10,000 or 100,000 people. Point prevalence is the proportion of a population that has the disease at a given point in time, while period prevalence is the proportion of a population that has the disease at any time during a given period (e.g., twelve-month prevalence). Lifetime prevalence is the proportion of a population that at some point in its life (up to the time of assessment) has been affected by the disease (Figure 28) [26,27,32].
4. Prediction Model
Making a decision is a complex process, which must take into account many different factors. As part of an ideal process, information should be gathered on all the possible factors involved, the weight and influence of each factor should be understood, an exhaustive list and meticulous study of all possible solutions should be made, and the benefits and costs for each of them should be assessed. Such an ideal process yields the optimal solution [6,14].
The ability to accurately predict the course of the pandemic is an extremely important but difficult task. Given the limited knowledge of the new COVID-19 disease, the high uncertainty, and the complex socio-political factors that affect the spread of the new virus, continuous information and any scientifically substantiated methodology for analyzing or predicting the phenomenon are an important asset.
Focusing on the specifics of the spread of the disease, both epidemiologically and in terms of implementation of preventive and repressive measures, this paper conducts an exploratory study, which is based on the analysis of time-series data related to COVID-19 disease and the prediction of the future development of the pandemic for Greece but also for the border countries.
To accurately approach the problem, the goal is to find the mathematical relationship that can model the data on the spread of the disease and how the cases increase over time. Facebook’s Prophet, an innovative and highly reliable time series prediction model, was used as the forecasting methodology.
Prophet is based on the general methodology of Generalized Additive Models (GAM) [36,37,38], which is a modeling method that uses non-parametric techniques offering significant advantages over conventional regression methods. That is, it offers an opportunity to overcome the statistical problems associated with the normality and linearity assumptions that are necessary for linear regression.
The name Additive refers to the multivariate assumption of the underlying model, according to which the effects of the predictors are added together. Such models are attractive when they fit the data because they are easier to interpret. In general, an additive regression model estimates the contribution of each predictor with a smoothing method, so the researcher is not required to look for the correct transformation of each variable.
More specifically, the estimation of the dependent variable Y in this case for a single independent variable can be given by the following equation [37,38]:
$$Y = s(X) + \text{error}$$
where s(X) is an unspecified smoothing function, while error is the error term, which usually has zero mean and constant variance. For example, the smoothing function can be determined by the running mean, the running median, the local least squares method, the kernel method, the Loess method, or the spline method. The term running means the serial calculation of a statistic applied to overlapping intervals of values of the independent variable, as in the running mean. In GAM modeling, the classical linear assumption is extended so that the error can follow any probability distribution of the exponential family (Poisson, Gamma, Gaussian, Binomial, and Inverse Gaussian).
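To make the idea of a running statistic concrete, the following Python sketch implements a centered running mean, one of the simplest choices for the smoothing function s(X). It assumes that the observations are already ordered by the independent variable; the function name, the window size, and the sample values are hypothetical.

```python
def running_mean(values, window=3):
    """Smooth `values` with a centered running mean over overlapping windows.

    Windows are truncated at both ends of the series, so the output has the
    same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical observations of Y, ordered by X.
y = [1, 2, 3, 4, 5]
print(running_mean(y, window=3))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

A wider window yields a smoother but less responsive estimate, which is the same bias and variance trade-off that the kernel, Loess, and spline smoothers control through their own parameters.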
Similar to a GAM with time as a regressor, Prophet can fit many linear and non-linear functions of time as components. In its simplest form, three basic elements are used, namely trend, seasonality, and holidays, together with an error term [39,40]:
$g(t)$, trend models non-periodic changes (i.e., growth over time)
$s(t)$, seasonality presents periodic changes (e.g., weekly, monthly, yearly)
$h(t)$, ties in effects of holidays (on potentially irregular schedules ≥1 day(s))
$\varepsilon_t$, covers idiosyncratic changes not accommodated by the model
In general, the whole equation can be written as follows:
$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$
In a more thorough analysis, the test variables can be structured as follows:
1. Trend. The process includes two possible trend models for g(t), namely a Saturating Growth Model and a piecewise linear model as follows [39,40]:
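a. Saturating Growth Model. If the data suggest that growth will saturate:
$$g(t) = \frac{C}{1 + \exp(-k(t - m))}$$
where $C$ is the carrying capacity, $k$ is the growth rate, and $m$ is an offset parameter.
It is possible to incorporate trend changes in the model by explicitly specifying the changepoints at which the growth rate is allowed to change. Assuming that there are $S$ changepoints at times $s_j$, $j = 1, \ldots, S$, Prophet defines a vector of rate adjustments $\delta \in \mathbb{R}^{S}$, where $\delta_j$ is the change in rate that occurs at time $s_j$. So, at any time $t$, the rate can be formulated as $k + \sum_{j:\, t \ge s_j} \delta_j$. If the vector $a(t) \in \{0, 1\}^{S}$ is also defined, so that:
$$a_j(t) = \begin{cases} 1, & \text{if } t \ge s_j \\ 0, & \text{otherwise} \end{cases}$$
then the rate at time $t$ is $k + a(t)^{\top}\delta$. When the rate $k$ is adjusted, the offset parameter $m$ must also be adjusted to connect the endpoints of the segments. The correct adjustment at changepoint $j$ is easily calculated as:
$$\gamma_j = \left(s_j - m - \sum_{l < j} \gamma_l\right)\left(1 - \frac{k + \sum_{l < j} \delta_l}{k + \sum_{l \le j} \delta_l}\right)$$
The final function is completed as follows:
$$g(t) = \frac{C}{1 + \exp\left(-\left(k + a(t)^{\top}\delta\right)\left(t - \left(m + a(t)^{\top}\gamma\right)\right)\right)}$$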
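The following Python sketch evaluates a saturating (logistic) trend with changepoints, assuming the standard Prophet formulation in which the rate changes by $\delta_j$ at changepoint $s_j$ and the offset is shifted by $\gamma_j$ so that the curve stays continuous. The function name and all parameter values are hypothetical; this is an illustration and not the study's implementation.

```python
import math


def logistic_trend(t, cap, k, m, changepoints, deltas):
    """Saturating trend at time t with rate changes at sorted changepoints.

    cap: carrying capacity C, k: base growth rate, m: base offset,
    deltas[j]: rate adjustment applied from changepoints[j] onward.
    """
    rate, offset = k, m
    for s_j, d_j in zip(changepoints, deltas):
        if t < s_j:  # changepoints are sorted, so later ones are inactive too
            break
        new_rate = rate + d_j
        # gamma_j keeps the curve continuous at s_j after the rate change
        gamma_j = (s_j - offset) * (1.0 - rate / new_rate)
        rate, offset = new_rate, offset + gamma_j
    return cap / (1.0 + math.exp(-rate * (t - offset)))


# Hypothetical parameters: capacity 100, two changepoints at t = 30 and t = 60.
params = dict(cap=100.0, k=0.1, m=50.0, changepoints=[30.0, 60.0], deltas=[0.05, -0.08])
print(logistic_trend(30.0, **params), logistic_trend(1000.0, **params))
```

The sketch requires every adjusted rate to be non-zero, since the offset adjustment divides by the new rate.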
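b. Linear Trend with Changepoints. This is a Piecewise Linear Model with a constant growth rate between changepoints, which is calculated as follows:
$$g(t) = \left(k + a(t)^{\top}\delta\right) t + \left(m + a(t)^{\top}\gamma\right)$$
where $k$ is the growth rate, $\delta$ has the rate adjustments, $m$ is the offset parameter, $a(t)$ is the changepoint indicator vector (with $a_j(t) = 1$ if $t \ge s_j$ and 0 otherwise), and, to make the function continuous, $\gamma_j$ is set to $-s_j \delta_j$.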
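A short Python sketch of the piecewise linear trend is given below, assuming the standard Prophet form $g(t) = (k + a(t)^{\top}\delta)\,t + (m + a(t)^{\top}\gamma)$ with $\gamma_j = -s_j\delta_j$. The function name and the parameter values are hypothetical.

```python
def linear_trend(t, k, m, changepoints, deltas):
    """Piecewise linear trend at time t.

    k: base growth rate, m: base offset,
    deltas[j]: rate adjustment applied from changepoints[j] onward.
    """
    rate, offset = k, m
    for s_j, d_j in zip(changepoints, deltas):
        if t >= s_j:                # a_j(t) = 1
            rate += d_j             # k + a(t)^T delta
            offset += -s_j * d_j    # gamma_j = -s_j * delta_j keeps g continuous
    return rate * t + offset


# Hypothetical parameters: slope 2 until t = 10, slope 3 until t = 20, then 0.5.
args = dict(k=2.0, m=1.0, changepoints=[10.0, 20.0], deltas=[1.0, -2.5])
print([linear_trend(t, **args) for t in (5.0, 10.0, 15.0, 20.0, 30.0)])
# [11.0, 21.0, 36.0, 51.0, 56.0]
```

The values at t = 10 and t = 20 coincide with what the preceding segment would give (2·10 + 1 = 21 and 3·20 − 9 = 51), which is the continuity that the choice of $\gamma_j$ is meant to guarantee.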
c. Automatic Changepoint Selection. Rather than selecting changepoints by hand, it is recommended to specify a large number of candidate changepoints and to place a sparse prior on the rate adjustments, as follows:
$$\delta_j \sim \text{Laplace}(0, \tau)$$
where $\tau$ directly controls the flexibility of the model in altering its rate. It should be noted that a sparse prior on the adjustments $\delta$ has no effect on the primary growth rate $k$, so as $\tau$ goes to 0, the fit reduces to standard (not piecewise) logistic or linear growth.
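d. Trend Forecast Uncertainty.
When the model is extrapolated beyond the history to make a forecast, the trend has a constant rate. Uncertainty in the forecast trend is assessed by extending the generative model forward. The generative model for the trend assumes $S$ changepoints over a history of $T$ points, each of which has a rate change $\delta_j \sim \text{Laplace}(0, \tau)$. Future rate changes that emulate those of the past are simulated by replacing $\tau$ with a scale inferred from the data, which is achieved by the maximum likelihood estimate of the rate scale parameter as follows:
$$\lambda = \frac{1}{S} \sum_{j=1}^{S} |\delta_j|$$
Future changepoints are randomly sampled in such a way that the mean frequency of changepoints matches that of the history, as follows:
$$\forall j > T, \quad \delta_j = \begin{cases} 0 & \text{with probability } \frac{T - S}{T} \\ \sim \text{Laplace}(0, \lambda) & \text{with probability } \frac{S}{T} \end{cases}$$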
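The sampling scheme for future rate changes can be sketched in Python as follows. The sketch assumes the standard Prophet scheme, in which the rate scale is estimated as the mean absolute historical adjustment and each future time step is a changepoint with probability $S/T$. The function names and the input values are hypothetical.

```python
import random


def rate_scale_mle(deltas):
    """Maximum likelihood estimate of the Laplace scale: mean of |delta_j|."""
    if not deltas:
        raise ValueError("at least one historical rate adjustment is required")
    return sum(abs(d) for d in deltas) / len(deltas)


def sample_future_rate_changes(deltas, history_length, horizon, rng):
    """Draw one rate change per future step.

    With probability S/T a step is a changepoint whose adjustment follows
    Laplace(0, lambda); otherwise the rate is left unchanged (0.0).
    """
    scale = rate_scale_mle(deltas)
    p_change = len(deltas) / history_length
    changes = []
    for _ in range(horizon):
        if scale > 0 and rng.random() < p_change:
            # |Laplace(0, scale)| is exponential with mean `scale`
            magnitude = rng.expovariate(1.0 / scale)
            sign = 1.0 if rng.random() < 0.5 else -1.0
            changes.append(sign * magnitude)
        else:
            changes.append(0.0)
    return changes


# Hypothetical history: S = 3 adjustments observed over T = 30 points.
rng = random.Random(0)
future = sample_future_rate_changes([0.1, -0.3, 0.2], history_length=30, horizon=10000, rng=rng)
print(sum(1 for c in future if c != 0.0) / len(future))
```

Repeating the draw many times and propagating each sampled path through the trend function gives a spread of possible trends, from which forecast intervals can be read off.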
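2. Seasonality. The seasonal component $s(t)$ provides adaptability to the model by allowing periodic changes based on daily, weekly, and annual seasonality. Prophet relies on Fourier series to provide a flexible model of periodic effects, in which arbitrary smooth seasonal effects are approximated with a standard Fourier series:
$$s(t) = \sum_{n=1}^{N} \left(a_n \cos\left(\frac{2\pi n t}{P}\right) + b_n \sin\left(\frac{2\pi n t}{P}\right)\right)$$
where $P$ is the regular period of the time series (e.g., $P = 7$ for weekly seasonality when time is measured in days) and $N$ is the number of terms in the series.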
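The Fourier representation of the seasonal component can be sketched in Python as follows, assuming a truncated series $\sum_{n=1}^{N}\left(a_n\cos(2\pi n t/P) + b_n\sin(2\pi n t/P)\right)$. The function names and the coefficient values are hypothetical; in practice the coefficients are estimated when the model is fitted.

```python
import math


def fourier_features(t, period, order):
    """Return [cos(2*pi*1*t/P), sin(2*pi*1*t/P), ..., cos(2*pi*N*t/P), sin(2*pi*N*t/P)]."""
    features = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t / period
        features.append(math.cos(angle))
        features.append(math.sin(angle))
    return features


def seasonality(t, period, coefficients):
    """Seasonal effect s(t); coefficients are ordered [a_1, b_1, ..., a_N, b_N]."""
    order = len(coefficients) // 2
    return sum(c * x for c, x in zip(coefficients, fourier_features(t, period, order)))


# Hypothetical weekly seasonality (P = 7 days) with N = 2 terms.
coeffs = [1.0, 0.0, 0.5, 0.0]
print(seasonality(0.0, 7.0, coeffs), seasonality(2.0, 7.0, coeffs), seasonality(9.0, 7.0, coeffs))
```

A larger number of terms lets the seasonal curve change more quickly within one period, at the cost of a higher risk of overfitting.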
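3. Holidays and Events. The component $h(t)$ reflects predictable events of the year, including those on irregular schedules, which, however, create serious bias in the model. Assuming that the effects of holidays are independent, they are incorporated into the model by generating a matrix of regressors:
$$Z(t) = \left[\mathbf{1}(t \in D_1), \ldots, \mathbf{1}(t \in D_L)\right], \qquad h(t) = Z(t)\,\kappa$$
where $D_i$ is the set of past and future dates of holiday $i$, $L$ is the number of holidays, and $\kappa$ is the vector of the corresponding changes in the forecast, with prior $\kappa \sim \text{Normal}(0, \nu^2)$.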
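The holiday regressors can be sketched in Python as a row of 0/1 indicators, one per holiday, multiplied by per-holiday effects. The holiday names, the dates, and the effect values in `kappa` are hypothetical placeholders used only to show the mechanics.

```python
from datetime import date


def holiday_indicators(day, holiday_dates):
    """Z(t): one 0/1 entry per holiday, equal to 1 if `day` is one of its dates."""
    return [1 if day in dates else 0 for dates in holiday_dates.values()]


def holiday_effect(day, holiday_dates, kappa):
    """h(t) = Z(t) . kappa, with kappa given as a mapping from holiday name to effect."""
    z = holiday_indicators(day, holiday_dates)
    return sum(z_i * kappa[name] for z_i, name in zip(z, holiday_dates))


# Hypothetical holiday calendar and effects (e.g., change in reported daily cases).
holiday_dates = {
    "independence_day": {date(2020, 3, 25), date(2021, 3, 25)},
    "orthodox_easter": {date(2020, 4, 19), date(2021, 5, 2)},
}
kappa = {"independence_day": -120.0, "orthodox_easter": -200.0}

print(holiday_indicators(date(2021, 5, 2), holiday_dates))       # [0, 1]
print(holiday_effect(date(2021, 5, 2), holiday_dates, kappa))    # -200.0
print(holiday_effect(date(2021, 5, 3), holiday_dates, kappa))    # 0
```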
5. Data and Results
The data used to mathematically model and predict the spread of the disease are freely available from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, and they include the daily measurements of the total recorded cases for the period from 26 February 2020 to 31 May 2021.
A first look at the measurements related to the spread of COVID-19 disease shows that this is a dataset collected over time that expresses the evolution of values over equal successive periods (daily measurements). In particular, it is a continuous time series in which the trend of the values is initially upward, while there are intervals that show signs of stability.
Likewise, no time-dependent fluctuations of the values were found, as the time series does not show periodic fluctuations or changes caused by exogenous factors during specific periods. Although the test sample is not large enough, the above two observations confirm that the evolution of COVID-19 disease over time is recorded in data that form a stationary time series.
In a more thorough analysis, we look, on the one hand, for the characteristics that help estimate the system that produces the time series and, on the other hand, for the characteristics that contribute to understanding the historical behavior of the disease, thus allowing the prediction of its future values.
To predict the spread of the disease in Greece, the Prophet algorithm was applied. Specifically, considering all the pairs of dates and recorded values that describe the spread of the disease in Greece, the proposed forecasting system aims to compute an optimal approximation of the spread of the pandemic, so that the estimated values are as close as possible to the real ones. The main objective of the process is for this approximation to generalize, i.e., to yield a realistic model that is not completely guided by the historical data, which are its reference point.
Given that the time series under consideration has a constant rate of change, the Prophet algorithm was used for the daily forecast from 1 June to 1 September 2021, using as training data the daily cases from 26 February 2020 to 28 February 2021 (369 days) and as the testing and validation set of the model the period from 1 March 2021 to 31 May 2021 (92 days).
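The split described above can be sketched in Python as follows. The date arithmetic uses only the standard library, while `fit_and_forecast` shows how the training part could be passed to Prophet through its documented `fit`, `make_future_dataframe`, and `predict` calls. The helper names, the placeholder counts, and the use of Prophet's default settings are assumptions; the study's actual configuration is not reproduced here.

```python
from datetime import date, timedelta

TRAIN_START = date(2020, 2, 26)
TRAIN_END = date(2021, 2, 28)
TEST_END = date(2021, 5, 31)


def daterange(start, end):
    """Inclusive list of calendar days from start to end."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def split_series(days, values):
    """Split (day, value) pairs into training and testing parts by date."""
    pairs = list(zip(days, values))
    train = [(d, v) for d, v in pairs if d <= TRAIN_END]
    test = [(d, v) for d, v in pairs if TRAIN_END < d <= TEST_END]
    return train, test


def fit_and_forecast(train, horizon_days):
    """Fit Prophet on the training pairs and forecast `horizon_days` ahead.

    Imports are local so that the date logic above runs without pandas or
    Prophet installed.
    """
    import pandas as pd
    from prophet import Prophet

    df = pd.DataFrame({
        "ds": pd.to_datetime([d for d, _ in train]),
        "y": [v for _, v in train],
    })
    model = Prophet()  # default settings (assumption)
    model.fit(df)
    future = model.make_future_dataframe(periods=horizon_days)
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]


days = daterange(TRAIN_START, TEST_END)
values = list(range(len(days)))  # placeholder counts, not real data
train, test = split_series(days, values)
print(len(train), len(test))  # 369 92
```

Calling `fit_and_forecast(train, horizon_days=len(test))` would produce predictions for the 92 test days, which can then be compared with the held-out values.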
The following metrics were used to confirm the result:
1. Coefficient of Determination ($R^2$). $R^2$ is used to express the correlation of two random variables and is reported as a percentage (%). It gives the percentage of the variability of the Y values that is explained by X, and vice versa, and is a useful way to quantify the correlation of two random variables. $R^2$ is defined as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ represents the observed values of the dependent variable, $\hat{y}_i$ represents the estimated values of the dependent variable, $\bar{y}$ represents the arithmetic mean of the observed values, and n represents the number of observations. $R^2$ expresses the percentage of the variability of the dependent variable that is explained by the independent variables in the model and takes values in the interval [0, 1]. Performance is optimal when its value approaches one, which indicates that the regression model fits the data well.
2. Root Mean Squared Error (RMSE). The RMSE is directly related to the Standard Error of the Regression (SER) and measures the average error of the predicted values relative to the actual values. It is calculated based on the following formula:
$$RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left(P_{(ij)} - T_j\right)^2}$$
where $P_{(ij)}$ is the value predicted by the program i for a sample case j, $T_j$ is the target value for the sample case j, and n is the number of sample cases. A successful regression model requires very small values of the root mean squared error, while the best case, which implies perfect agreement between actual and predicted values and therefore the absolute success of the model, is achieved when $RMSE = 0$.
3. Mean Absolute Error (MAE). The MAE quantifies the error between the estimated or forecast values and the observed values. It is calculated by the formula:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left|\hat{y}_i - y_i\right|$$
where $\hat{y}_i$ is the estimated value and $y_i$ is the true value. Each term $|\hat{y}_i - y_i|$ is the absolute error of a single estimate, and the MAE is the average of these absolute errors.
4. Mean Absolute Percentage Error (MAPE). The mean absolute percentage error provides an objective measure of the forecast error as a percentage of the actual value (e.g., the forecast error is on average 10% of the actual value), without depending on the order of magnitude of the values. It is calculated by the formula:
$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
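The four metrics can be computed with a few lines of plain Python, as in the sketch below. It assumes the textbook definitions of $R^2$, RMSE, MAE, and MAPE; the function names and the sample values are hypothetical and are not results of the study.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - y) for y, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent (undefined if an actual value is 0)."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed and predicted daily values.
observed = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 37.0]
print(r_squared(observed, forecast), rmse(observed, forecast),
      mae(observed, forecast), mape(observed, forecast))
```

For these sample values, the squared errors sum to 26, so $R^2 = 1 - 26/500 = 0.948$, the RMSE is $\sqrt{6.5} \approx 2.55$, the MAE is 2.5, and the MAPE is 11.875%.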
The diagram of the process including the trend changes is presented in the following image.
Finally, the complete table of detailed forecasts of the methodology from 1 June 2021 to 1 September 2021 is presented in Table 8 below.
6. Discussion and Conclusions
Given the ongoing and deadly pandemic, analyzing the spread of the disease, both epidemiologically and in terms of the implementation of preventive and repressive measures, is an extremely urgent and important task. Its aim is to reveal the knowledge hidden in the epidemiological data and to identify indicators that can model the spatio-temporal evolution and spread of the disease.
In this paper, an exploratory study was conducted for the near-real-time analysis of COVID-19 disease data, together with an intelligent model for predicting disease progression, to assist in deciding on preventive or suppressive social-distancing measures or on appropriate measures related to the management of the health system. The study was based on an automated system of data collection and analysis, while the medium-term forecast was based on advanced machine learning methods.
The ability to process data in real time, using the tools of intelligent analysis, visualization, and analytical processing, is the basis for methods of dealing with the pandemic and in particular for the effective detection and tracking of active cases. Likewise, the development and use of spatio-temporal forecasts adapted to real data and needs allow issues related to public health to be handled in a timely and methodical way.
Given the urgency of the issue, civil protection mechanisms need to incorporate into their technological arsenal systems capable of fast to near-instantaneous processing of data that are highly complex and possibly very heterogeneous.
Turning to an evaluation of the results of the forecasting method, the tests indicate that the proposed method is a particularly valuable decision support system, as it provides a robust and reliable system of intelligent inference. Its reliability reflects how the method handles the available data, its mathematical background, and the completeness with which it handles special cases that may create noise in the model. In addition, one of its key advantages is the high reliability indicated by the very low error values obtained in the tests and in the subsequent predictions.
It is also important to note that the proposed methodology models the spread of the disease in the timeliest way, taking into account the actual variation of the recorded cases, which adds complexity to the methodology but also realism. The tests obtained should be considered statistically and semantically significant compared to any other methodology, as they are an indicator of how to study the pandemic at a broader level.
In addition, the proposed model can be used in other scenarios where data are less accurate because Prophet can easily detect the trend of long-term growth with an annual cycle. The prediction result also includes an uncertainty interval derived from the complete posterior distribution; that is, Prophet provides a data-driven risk estimate. Changepoints (points where the trend changes significantly) can be identified automatically or defined manually to gain more control over forecasting, and outliers are handled well by the model itself without any requirement for imputation. If the forecast goes beyond a limit that is known from an understanding of the case study, this can be fixed by setting a forecasting cap and modeling with logistic growth instead of linear growth. In this study, the time-series data were used in their natural temporal ordering without explicitly taking the pandemic waves into account; the changepoints that correspond to these waves can be identified automatically by Prophet.
Finally, the use of the Prophet algorithm is a serious proposal for managing chronological data of high complexity and uncertainty such as the data under consideration, which also show variability that can be attributed to several unspecified parameters. As the test results indicate, this technique offers highly accurate and stable predictions, as the overall behavior of the method minimizes noise and at the same time reduces the overall risk of a particularly poor choice that can result from poor sampling or from arbitrary settings of the hyperparameters. This view is also supported by the fact that the spread of the prediction error is small, which indicates the reliability of the system and its ability to generalize to new data.
Summarizing, we have frequently used Prophet as a replacement for the forecast package in many settings because of two main advantages:
1. Prophet makes it much more straightforward to create a reasonable, accurate forecast. The forecast package includes many different forecasting techniques (ARIMA, exponential smoothing, etc.), each with its own strengths, weaknesses, and tuning parameters. We have found that choosing the wrong model or parameters can often yield poor results, and it is unlikely that even experienced analysts can choose the correct model and parameters efficiently given this array of choices.
2. Prophet forecasts are customizable in ways that are intuitive to non-experts. There are smoothing parameters for seasonality that allow us to adjust how closely to fit historical cycles, as well as smoothing parameters for trends that allow us to adjust how aggressively to follow historical trend changes. For growth curves, we can manually specify "capacities", i.e., the upper limit of the growth curve, allowing us to inject our own prior information about how the forecast will grow (or decline). Finally, we can specify irregular holidays to model, such as the dates of local holidays.
However, an important issue at the moment is that modeling a problem with methods such as the proposed one generally requires a lot of historical data, which are not yet available. Moreover, even if a system based solely on historical data were available, it could contribute to only one aspect of the decisions. A more detailed methodology would be useful in linking technical forecasts to other decision-making factors and to study processes that are more complex and potentially more complete. At the same time, no prediction is certain, as the future seldom repeats the past in the same way. In addition, it should be noted that forecasts are affected by the reliability of the data and by the variables that make up the problem over time. Psychological factors also play an important role in the way people perceive and react to the risk of illness and the fear that it may affect them personally.
Therefore, it is important to keep in mind that these models do not simulate nature itself, which often surprises us, but mathematically represent our perceptions of it and help conditionally explain the epidemiological data, reducing them to a small number of variable factors. In this sense, it is very important to have scientific methodologies and appropriate technical or modeling tools such as the proposed one, which can realistically explain similar phenomena and offer valuable assistance in making optimal decisions. It is also important to note that, given the limited knowledge of the new COVID-19 disease, the high level of uncertainty, and the complex socio-political factors influencing the spread of the new virus, any scientifically substantiated methodology for analyzing or predicting the phenomenon is an important asset. Nevertheless, accurately predicting the course of the pandemic remains an extremely difficult and complex task.
Proposals for the development and future improvements of this methodology should focus on further optimizing the parameters of the forecasting system used to achieve an even more efficient, accurate, and realistic process of approaching the spread of the disease. It would also be important to study the extension of this system by implementing a broader spatio-temporal study at the pan-European or world level to verify the generalization of the method in more complex environments. Finally, an additional element that could be studied in the direction of future expansion concerns the implementation of a hybrid learning system based on the proposed architecture, which with methods of redefining its parameters automatically and in real time can fully automate the forecasting process.
Conceptualization, K.D., D.T. (Dimitrios Taketzis), D.T. (Dimitrios Tsiotas), L.M., L.I., P.K.; methodology, K.D.; software, K.D.; validation, K.D., D.T. (Dimitrios Taketzis), D.T. (Dimitrios Tsiotas), L.M., L.I., P.K.; formal analysis, D.T. (Dimitrios Tsiotas), L.M., L.I.; investigation, K.D.; resources, L.M., L.I., P.K.; data curation, K.D., D.T. (Dimitrios Taketzis), D.T. (Dimitrios Tsiotas), L.M., L.I., P.K.; writing—original draft preparation, K.D., D.T. (Dimitrios Taketzis); writing—review and editing, K.D., D.T. (Dimitrios Taketzis), D.T. (Dimitrios Tsiotas), L.M., L.I., P.K.; visualization, K.D., D.T. (Dimitrios Taketzis), D.T. (Dimitrios Tsiotas); supervision, L.M., L.I., P.K.; project administration, L.M. All authors have read and agreed to the published version of the manuscript.
Bragazzi, N.L.; Dai, H.; Damiani, G.; Behzadifar, M.; Martini, M.; Wu, J. How Big Data and Artificial Intelligence Can Help Better Manage the COVID-19 Pandemic. Int. J. Environ. Res. Public Health 2020, 17, 3176.
Brodeur, A.; Gray, D.M.; Islam, A.; Bhuiyan, S. A Literature Review of the Economics of Covid-19 (SSRN Scholarly Paper ID 3636640). Social Science Research Network. 2020. Available online: https://papers.ssrn.com/abstract=3636640 (accessed on 14 June 2021).
Calafiore, G.C.; Novara, C.; Possieri, C. A time-varying SIRD model for the COVID-19 contagion in Italy. Annu. Rev. Control 2020, 50, 361–372.
Carli, R.; Cavone, G.; Epicoco, N.; Scarabaggio, P.; Dotoli, M. Model predictive control to mitigate the COVID-19 outbreak in a multi-region scenario. Annu. Rev. Control 2020, 50, 373–393.
Chowdhury, S.D.; Oommen, A.M. Epidemiology of COVID-19. J. Dig. Endosc. 2020, 11, 3–7.
Du, Z.; Xu, X.; Wu, Y.; Wang, L.; Cowling, B.J.; Meyers, L. Serial Interval of COVID-19 among Publicly Reported Confirmed Cases. Emerg. Infect. Dis. 2020, 26, 1341–1343.
Fong, S.J.; Li, G.; Dey, N.; Gonzalez-Crespo, R.; Herrera-Viedma, E. Finding an Accurate Early Forecasting Model from Small Dataset: A Case of 2019-nCoV Novel Coronavirus Outbreak. Int. J. Interact. Multimed. Artif. Intell. 2020, 6, 132.
Petropoulos, F.; Makridakis, S. Forecasting the novel coronavirus COVID-19. PLoS ONE 2020, 15, e0231236.
Ganesan, S.; Subramani, D. Spatio-temporal predictive modeling framework for infectious disease spread. Sci. Rep. 2021, 11, 1–8.
Ganyani, T.; Kremer, C.; Chen, D.; Torneri, A.; Faes, C.; Wallinga, J.; Hens, N. Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data, March 2020. Eurosurveillance 2020, 25, 2000257.
Gao, Y.; Cai, G.-Y.; Fang, W.; Li, H.-Y.; Wang, S.-Y.; Chen, L.; Yu, Y.; Liu, D.; Xu, S.; Cui, P.-F.; et al. Machine learning based early warning system enables accurate mortality risk prediction for COVID-19. Nat. Commun. 2020, 11, 1–10.
Giuliani, D.; Dickson, M.M.; Espa, G.; Santi, F. Modelling and predicting the spatio-temporal spread of COVID-19 in Italy. BMC Infect. Dis. 2020, 20, 1–10.
Hadi, A.G.; Kadhom, M.; Hairunisa, N.; Yousif, E.; Mohammed, S.A. A Review on COVID-19: Origin, Spread, Symptoms, Treatment, and Prevention. Biointerface Res. Appl. Chem. 2020, 10, 7234–7242.
Hamed, S.M.; Elkhatib, W.F.; Khairalla, A.S.; Noreddin, A.M. Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology. Sci. Rep. 2021, 11, 1–8.