Expert System to Model and Forecast Time Series of Epidemiological Counts with Applications to COVID-19

Abstract: We developed two models for real-time monitoring and forecasting of the evolution of the COVID-19 pandemic: a non-linear regression model and an error correction model. Our strategy allows us to detect pandemic peaks and make short- and long-term forecasts of the number of infected, deaths and people requiring hospitalization and intensive care. The non-linear regression model is implemented in an expert system that allows the user to fit and forecast automatically through a graphical interface. This system is equipped with a control procedure to detect trend changes and define the end of one wave and the beginning of another. Moreover, it depends on only four parameters per series that are easy to interpret and monitor over time for each variable. This feature enables us to study the effect of interventions over time in order to advise how to proceed in future outbreaks. The error correction model works with cointegration between series and has great forecast capacity. Our system is prepared to work in parallel in all the Autonomous Communities of Spain. Moreover, our models are compared with a SIR model extension (SCIR) and several artificial intelligence models.


Introduction
The coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, has spread throughout the world, leading to a terrible pandemic. Starting in China in December 2019, the next two countries severely affected were Italy and Spain. The infection's high transmissibility led some regions to suffer a particularly strong impact. Such is the case of the Autonomous Region of Madrid in Spain. On March 11, the World Health Organization (WHO) declared COVID-19 a pandemic. On the same date, Madrid was already in an extremely serious situation and all educational centers were closed. Three days later, on March 14, the state of alarm was declared throughout the country. On March 30, activity outside the home was restricted to essential services. These conditions were relaxed on April 13, when some non-essential activities were permitted. From April 26, children could go outside for 1 h every day, and one week later this measure was extended to the general population. Measures of varying magnitude have been taken in the three subsequent waves that have followed one after another so far.
Mathematical models to track changes in the behavior and patterns of infection appear to be essential tools for making future decisions. Some authors (see, e.g., [1,2]) were pioneers in collecting solutions based on artificial intelligence and expert knowledge, among the thousands of articles published on the subject.
Artificial intelligence is useful to monitor the evolution of the pandemic even in real time, either through expert systems or with a predictive approach based on machine learning. Researchers have been able to validate the effectiveness of these models on different illnesses. For example, through a dynamic neural network, it was possible to understand the evolution of Zika (see [3]). The same has happened with Ebola and the common flu. Currently, models are being retrained with new data related to COVID-19 (see [4]). Abhari et al. [5] used a previously developed agent-based artificial intelligence simulation platform (EnerPol) coupled with big data.
It is worth highlighting the need for artificial intelligence tools to be easy to use by those who want to operate with them. This is why, around the idea of monitoring and forecasting, projects have been generated to visualize the information collected. With this approach, in [6], we can find an ordered list of the most interesting sites with dashboards: UpCode, NextStrain, CSSE (Johns Hopkins), Thebaselab, the BBC, the New York Times, HealthMap and COVID-19 Tracker (Microsoft).
The SIR model ("Susceptible", "Infected", "Recovered") and its extensions are traditional epidemiological models. They are compartmental models of differential equations that relate the variations of different population groups (compartments) through the infection rate and the average infectious period. Most recent studies are based on modifications of the SIR model (see, e.g., [7][8][9]). The underlying idea is to model the waves of a pandemic as exponential increases and decreases to the left and right of a peak of maximum incidence. In Spain, it is worth highlighting the work carried out by the Interdisciplinary Group of Complex Systems at UC3M [10] and the work carried out by the MOMAT Group at UCM [11]. For comparison purposes, we implement the SIR model extension developed by Castro et al. [10], SCIR: a SIR model with "confinement".
In this paper, we approach the problem from another point of view: non-linear regression predictive models. Researchers from the Andalusian School of Public Health of Granada have developed a predictive model of the COVID-19 epidemic in Spain with an adjustment to a Gompertz curve [12]. For the adjustment of the Gompertz curve to the observed accumulated data of cases and deaths, they used the Nelder-Mead algorithm [13] implemented by Nash [14]. The software used for the calculations was R via the drc package. Our strategy extends this approach by allowing greater flexibility in fitting to Gompertz curves, especially in the distribution tails. Another Gompertz approximation was proposed by Català et al. [15]. Our expert system automatically chooses the best fit from a variety of models, including the Gaussian, double exponential and double Pareto curves. In addition, all programming, the optimization algorithm and the heuristic are original.
Moreover, we develop an Error Correction Model (ECM). This approximation belongs to a category of multiple time series models for data where the underlying variables have a long-run common stochastic trend.
Our research group registered in the "Mathematical action against coronavirus", a cooperative prediction initiative of the Spanish Mathematics Committee (CEMat). (A meta-predictor has been built to provide authorities with information on the short-term behavior of variables of great interest in the spread of the COVID-19 virus. The method uses the predictions from different models/algorithms, provided by the participating researchers, and constructs optimized combinations of them, disaggregated by Autonomous Communities.) Within this initiative, we have participated together with other research groups in the "Cooperative Prediction" action [16], providing daily predictions with our preliminary model since March 2020, during the entire first wave of the pandemic. All models that participate in the construction of the metapredictor developed by the cooperative prediction action promoted by the CEMat have been validated continuously since the beginning of the pandemic.
The results obtained in this paper are reproducible using the code from our public repository. The code for the developed graphical interface that allows the user to interact with our system is also included in: https://github.com/mikiNadal/covid19_article_ reproducible (accessed on 22 June 2021).
Section 2 introduces the non-linear regression model. Section 3 introduces the error correction model. Section 4 introduces the SCIR model. Section 5 compares the three models with different metrics. Finally, the conclusions are presented in Section 6.

Non-Linear Regression Model
We aim to develop a theoretical framework that allows us to detect peaks and perform short- and long-term monitoring and forecasting of the number of people infected, people requiring hospitalization and deaths during an infectious disease. By short-term prediction, we refer to the task that we performed for the CEMat during the first wave, consisting of giving predictions every day with a horizon of 8 days. From the second wave, we were asked for predictions every week with a horizon of 14 days. By long-term forecast, we refer to predicting the peak and the total number of infected at the end of the wave under study, and to giving commitment dates by which only a small percentage of the area under the model remains. These values are monitored day by day and indicate, for example, when a wave is exhausted.
This model is implemented with an expert system of artificial intelligence based on non-linear regression and is extremely useful to estimate the effectiveness of the interventions prompted by the governments and to advise on how to proceed in future outbreaks. Furthermore, the machine learning algorithm developed allows parallel running and introduction of new data in real time, and it is scalable.
Our model is based on directly estimating the distribution function of each of the series under study and on the duality between the distribution function and the density function. Since those two functions fully characterize the probability distribution of a continuous variable, our model is able to capture the main characteristics of epidemic outbreaks. To this, we can add its simplicity, since it is formulated only through three parameters. Hereinafter, we refer to our first proposed epidemiological model as the MATGEN model in honor of our group enrolled in the Mathematical action against coronavirus [16], an initiative of CEMat (Spanish Mathematics Committee).

The Model
The notation employed in this work is as follows. Let f be the well-known density function of a normal variable of mean µ and variance σ², N(µ, σ) (see Figure 1a), given by

f(t) = (1 / (σ√(2π))) exp(−(t − µ)² / (2σ²)),

and let both its distribution function and its right tail be

F(t) = ∫_{−∞}^{t} f(x) dx   and   1 − F(t) = ∫_{t}^{∞} f(x) dx.

Note that the density peak of N(µ, σ) is reached at µ, which is a point of inflection of its distribution function. Furthermore, it verifies that F(µ) = 0.5. For a review of the properties of the distribution function, the density function, the characteristic function of a random variable and the relationships between them, see the work of Quesada and Pardo [17]. Data series of COVID-19 in Spain include, day by day, the cumulative number of people infected, people requiring hospitalization and deaths. These data can be downloaded from ISCIII [18].
We denote the relative frequencies at time t by n_t (new cases on day t) and the cumulative frequencies by N_t = n_1 + … + n_t, where n is the total number of cases at the end of the pandemic. Furthermore, we introduce the average of the cumulative frequencies at time t, given by F_t = N_t / n. In this context, we work with the following non-linear regression model:

F_t = F(t) + ε_t,

where F is the distribution function of N(µ, σ) and the parameters n, µ and σ are estimated by the least squares method.
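The least squares fit just described can be sketched numerically. The following self-contained example (an illustration on synthetic data, not the implementation in our repository) recovers n, µ and σ by brute-force least squares, computing the normal distribution function through the error function:

```python
import math
import numpy as np

def norm_cdf(x):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def model_cdf(t, n, mu, sigma):
    # Expected cumulative count at day t: n * F(t) with F the N(mu, sigma) cdf.
    return n * norm_cdf((t - mu) / sigma)

def fit_least_squares(days, cum_counts, mus, sigmas, ns):
    # Brute-force least squares over a grid of (n, mu, sigma).
    best, best_sse = None, math.inf
    for n in ns:
        for mu in mus:
            for sigma in sigmas:
                pred = np.array([model_cdf(t, n, mu, sigma) for t in days])
                sse = float(np.sum((cum_counts - pred) ** 2))
                if sse < best_sse:
                    best, best_sse = (n, mu, sigma), sse
    return best, best_sse

# Synthetic wave with true parameters n = 1000, mu = 40, sigma = 10.
days = np.arange(1, 61)
true = np.array([model_cdf(t, 1000, 40, 10) for t in days])
rng = np.random.default_rng(0)
observed = true + rng.normal(0, 5, size=days.size)

(n_hat, mu_hat, sigma_hat), sse = fit_least_squares(
    days, observed,
    mus=range(30, 51), sigmas=range(5, 16), ns=range(800, 1201, 25))
```

In practice, the parameter space is explored with the heuristic described later in this section; the grid here is only for illustration.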
For an introduction to frequentist and Bayesian regression, see the work of Gómez Villegas [19].

Other Wave Models
The next subsections detail a basic guide for the correct implementation of the least squares method and the algorithm designed for the detection of pandemic peaks.
In statistics, the coefficient of determination R² is the proportion of the variance in the dependent variable F_t that is predictable from the model F(t). It is used in the context of goodness of fit and measures how well observed outcomes are replicated by the model, based on the proportion of the total variation of outcomes explained by the model. This coefficient takes values between 0 and 1 and, between two models, the one with the highest coefficient of determination is preferred. Therefore, with this criterion, the best model is the one that maximizes the coefficient of determination within a plausible family of models:

max_{n,µ,σ} R²(t, n, µ, σ | n_1, …, n_t).
On the other hand, we simultaneously control the adjustment of the observed frequencies by means of the theoretical density function. To this end, we perform a linear regression of the relative errors of the density fit and introduce the constraint

p-value(F_obs) = P(F_{1,t−2} > F_obs) > 0.1,

where F_obs is the observed value of the test statistic for testing H_0: b = 0 vs. H_1: b ≠ 0 and F_{1,t−2} is its theoretical distribution under H_0, that is, a Snedecor's F distribution with 1 and t − 2 degrees of freedom. We propose to solve the multicriteria optimization problem by obtaining the values of n(t), µ(t) and σ(t) so that

max_{n,µ,σ} R²(t, n, µ, σ | n_1, …, n_t)

under the constraint p-value(F_obs) > 0.1. Now, stop if F(t) = F_t = 0.5; otherwise, incorporate the data of day t + 1, set t = t + 1 and repeat. In parallel, fit a model for each of the series, namely the number of new positive cases per day, the number of new deaths per day and the number of new ICU admissions per day, and choose the model that simultaneously maximizes the three values of R².
Stop when F(t) = F t = 0.5 is accomplished in all three series. It is important to note that the algorithm allows the introduction of new data in real time, and it is scalable.
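The feasibility check and model choice can be read as follows: a candidate is feasible when the regression of its density-fit residuals on time gives p-value(F_obs) > 0.1, and among feasible candidates the one with the largest R² is kept. This is an illustrative sketch using SciPy; the exact definition of the residual series in the expert system may differ.

```python
import numpy as np
from scipy import stats

def r_squared(observed, fitted):
    # Coefficient of determination of the fit to the distribution function.
    ss_res = float(np.sum((observed - fitted) ** 2))
    ss_tot = float(np.sum((observed - observed.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

def whiteness_pvalue(residuals):
    # Regress the density-fit residuals on time and test H0: b = 0.
    # The squared t statistic of the slope is the F statistic with
    # 1 and t - 2 degrees of freedom.
    t = np.arange(residuals.size, dtype=float)
    fit = stats.linregress(t, residuals)
    f_obs = (fit.slope / fit.stderr) ** 2
    return float(stats.f.sf(f_obs, 1, residuals.size - 2))

# Trend-free residuals should pass the constraint far more easily
# than residuals with a clear trend.
rng = np.random.default_rng(1)
flat = rng.normal(0, 1, 200)              # no trend
trended = flat + 0.05 * np.arange(200)    # clear trend
p_flat, p_trend = whiteness_pvalue(flat), whiteness_pvalue(trended)
```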

The Algorithm: Commitment Dates
Let t_p be the first day such that F(t_p) = F_{t_p} = 0.5, and let n_{t_p}, µ_{t_p} and σ_{t_p} be the optimal values of the parameters at that time point.
If f_{t_p+1} ≥ f_{t_p}, set µ = t_p + 1 and compute n and σ so that F(t_p + 1) = F_{t_p+1} = 0.5. Otherwise, determine the value of t_max such that f_{t_max} = max_{t≤t_p} f_t, set µ = t_max, compute n and σ_left so that F(t_max) = F_{t_max} = 0.5, and fit σ_right to the series for t ≥ t_max.
Finally, the quantiles qnorm(0.99) and qnorm(0.999) of the normal probability distribution N(t_max, σ_right) are the commitment dates to lift the restrictions, from least to most conservative.
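Using the standard library's NormalDist, the commitment dates follow directly from the normal quantiles. The values of t_max, σ_right and the start date below are illustrative, not the fitted values for Madrid:

```python
from datetime import date, timedelta
from statistics import NormalDist

def commitment_dates(t_max_day, sigma_right, start_date):
    # The 0.99 and 0.999 quantiles of N(t_max, sigma_right) give the
    # day numbers by which only 1% / 0.1% of the wave remains;
    # day 1 of the series corresponds to start_date.
    dist = NormalDist(mu=t_max_day, sigma=sigma_right)
    return tuple(
        start_date + timedelta(days=round(dist.inv_cdf(p)) - 1)
        for p in (0.99, 0.999))

# Hypothetical values: peak at day 60, sigma_right = 12,
# with day 1 = 25 February 2020.
d99, d999 = commitment_dates(60, 12, date(2020, 2, 25))
```

The first date is the less conservative commitment date; the second leaves only 0.1% of the modeled wave to the right.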

The Heuristic
To perform an effective optimization, we opt for an ambitious heuristic that we detail below.
Let σ = σ_0, starting at σ_0 = 15. Move σ between σ_0 − 14 and σ_0 + 14. At this point, it is important to note that the incubation period of the disease is between 2 and 14 days (see [20]). In addition, the delay between the time of infection and the report as a positive case is considered.
Move µ between the first day of each of the series and t_0 + 2σ_0. For example, the first day of the series of the number of people infected in the Region of Madrid is Day 26, which corresponds to February 25.
Generate k = 10,000 values of a uniform random variable between 0 and 1. Compute n = N·p for each value p generated in the last step, where N ≈ 6,550,000 is the population of the Region of Madrid.
Discard the values of n < N_{t_0}.
Find the feasible models with p-value > 0.1 for the noise and select the one with the largest R 2 of the fit in the cumulative frequencies.
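A minimal random-search version of this heuristic, run here on a synthetic wave, might look as follows (the production system also enforces the p-value constraint of the previous subsection, omitted here for brevity):

```python
from statistics import NormalDist
import numpy as np

def heuristic_search(days, cum_counts, N_pop, sigma0=15, k=10_000, seed=0):
    # Random search mirroring the heuristic: sigma in [sigma0-14, sigma0+14],
    # mu between the first observed day and t0 + 2*sigma0, and n = N_pop * p
    # for k uniform draws p; draws with n below the observed total are discarded.
    rng = np.random.default_rng(seed)
    last_cum = float(cum_counts[-1])
    ss_tot = float(np.sum((cum_counts - cum_counts.mean()) ** 2))
    best, best_r2 = None, -np.inf
    for _ in range(k):
        sigma = rng.uniform(sigma0 - 14, sigma0 + 14)
        mu = rng.uniform(days[0], days[0] + 2 * sigma0)
        n = N_pop * rng.uniform()
        if n < last_cum:          # infeasible total: discard
            continue
        dist = NormalDist(mu, sigma)
        pred = np.array([n * dist.cdf(float(t)) for t in days])
        r2 = 1.0 - float(np.sum((cum_counts - pred) ** 2)) / ss_tot
        if r2 > best_r2:
            best, best_r2 = (n, mu, sigma), r2
    return best, best_r2

# Synthetic wave in a region of 6.55 million people (first data day 26);
# true parameters: n = 600,000, mu = 40, sigma = 10.
days = np.arange(26, 81)
true_dist = NormalDist(40, 10)
cum = np.array([600_000 * true_dist.cdf(float(t)) for t in days])
(n_hat, mu_hat, sigma_hat), r2 = heuristic_search(days, cum, 6_550_000)
```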
In practice, running the heuristic generates a .csv file with several columns: the fitted parameters µ, σ and n, the coefficient of determination and the p-value. Moreover, two columns register, for every day, the moment of the real peak, which corresponds to the day with the highest frequency observed to date, and the day when the cutoff between the models fitted to both sides is observed, i.e., when the distribution becomes positively skewed. The algorithm tries to match the value of the real peak, the cutoff and the parameter µ. It also allows fitting a different σ to the left and right of the cutoff. The last two columns include the commitment dates corresponding to the quantiles qnorm(0.99) and qnorm(0.999) of the model fitted to the right.
In the next subsections, we present the results that are obtained from the run of the previous algorithm programmed through our expert system. To do this, we consider the data series of COVID-19 in Spain, which are published in [18]. Specifically, we study the case of the Region of Madrid.

COVID-19 Data Sets
It is necessary to consider the time required to test for the presence of the infection and obtain a report confirming a positive case. This is especially relevant when there are problems with access to care and bottlenecks in laboratory testing. At some moments, this led the health system in Madrid to test only people with severe symptoms. In addition, the delay of up to several weeks in the notification of positive cases by the laboratories led to changes in the data history depending on the day the data were downloaded (see Tables 1 and 2). Even when the data come from official sources, they may present inconsistencies that must be taken into account. The portal to access the European Union open data [21] publishes data on the evolution of COVID-19 by continent and broken down by country. It can be verified that only positive PCRs are counted in the series of cases on this portal. On the other hand, Spain (see [18]) and Italy (see [22][23][24]) offer more detailed information through their national institutional portals. For example, on April 17, Spain introduced two columns, PCR and TestAc, and TestAc was empty until April 18, when the government introduced this type of test into the count. On the recommendation of the Spanish Mathematics Committee, we chose to consider confirmed cases as PCR+TestAc. This situation has been remedied since the second wave of the pandemic.
Another controversial point is the difference between the notified deaths due to COVID-19 and the real deaths registered by the funeral services of the Autonomous Communities of Spain. This suggests that only a percentage of the real deaths due to COVID-19 were recorded.
As a matter of fact, many elderly people died in nursing homes before being tested for COVID-19 and in most cases they were not included in the number of deaths.
One more problem that we have had to face is related to the series of ICUs in the Region of Madrid, in which we found an anomaly. Since April 28, the Region of Madrid offers cumulative data on the number of people with coronavirus who have gone through the ICU. Before this date, the data were those of daily occupation. Furthermore, the number of ICU beds varied throughout the course of the epidemic. The maximum number was changing due to the increase in ICU beds in the large hospitals in Madrid and especially the provision of new hospital beds in the IFEMA hospital.

Expert System
Our expert system was designed to facilitate obtaining the results and it consists of several parts. First, it allows downloading and updating the data file in real time every day. Once the data file has been updated, the expert system allows us to run the algorithm for all the series in parallel or one in particular, as well as for all the provinces of Spain or a specific one. Once the algorithm has been run, our expert system returns three reports:

1. A .pdf file with three graphs for the current time: one of the fitted distribution function, one of the fitted density function and one of the adjustment to white noise. Figure 2a,b shows these reports for cases and deaths in the Region of Madrid.
2. A .csv file (see Tables 3 and 5) with the results of the entire day-to-day history of the process, from which the following can be extracted: (i) the optimal parameters; (ii) the coefficient of determination of the fit to the distribution function; and (iii) the p-value of the fit to white noise of the relative errors of the fit to the density function. In addition, this .csv file also contains the day-to-day commitment dates.
3. A .csv file (see Table 6) with an 8-day horizon of the forecast made with the fitted model and a graph with the future model.
Figure 2. (a) Report of cases; (b) report of deaths; (c) report of ICUs; (d) parameter variation per day until May 11 (from top to bottom: peak, sigma left, sigma right and n). These reports were automatically generated by our expert system.

The real peak of the series is one of the most difficult values to predict. Some days after it has taken place, it is easy to know that the peak was already reached on March 26. Table 3 and Figure 2 indicate that µ matches the value of the real peak between April 17 and 21. However, as the curve of µ variation per day (see Figure 2) begins to flatten from March 20, the model alerts us in advance of the possibility that the real peak appears at any time after that date.

New and Cumulative Confirmed Cases per Day Series
In a virus-free transmission situation, the model would fit to a perfect Gaussian distribution and µ would be equal to the real peak. Therefore, small deviations from the model to the left of the real peak may indicate a change in the evolution of the virus. For example, between March 8 and 14, the curve of the relative frequencies (blue) is above the model fitted for the density function (orange) (see Figure 2a), which may indicate the dangerous situation present in Madrid before March 8. This dangerous situation could be a consequence of individual events increasing close contact between people. This was already evident in the fitted models until this date. The period prior to March 11 can be considered free of disease transmission because interventions have not been applied. Nevertheless, that virus-free model is observed several days after that date. In fact, on March 17, there is a small peak of cases, which could be due to a high number of contagions during different massive events in Madrid on March 8. On March 22, the measures imposed by the government started to be noticed, since the observed data lie below the fitted model, and that situation continues until the global peak of cases is reached around March 26. On March 30, after the peak of cases, freedom of activity outside the home was reduced to essential services. The effect of this intervention is noted on April 9, when the real data lie below the fitted model, and that trend remains until April 12. It seems that interventions take around 11 days to be noted.
It is important to note that the parameter σ of the model changes to the right of the real peak of cases. This indicates that the containment measures are in fact effective. This results in an increase in the variance of the model to the right of the turning point µ, which indicates a slowdown in infections (see Figure 2). On April 18, with the addition of the column TestAc to the dataset, an explosion in the graph of the density of cases is observed (see Figure 2a). Except for this incident, the model remains quite stable to the right of the peak of cases and commitment dates for the progressive lifting of mobility restrictions can be proposed, as Figure 2 and Table 4 show. For example, May 11 shows commitment dates between May 19 and June 5 on the basis of 0.01 and 0.001 for the right tail area of the model (with a forecast of 72,200 for the total of confirmed cases at the end of the pandemic).
This report concludes with a forecast for cumulative cases over the 8-day horizon. For example, Table 6 shows the forecast from May 11 to 18 generated on May 11. A similar analysis to the one explained for cases can be done for deaths. Some days after the real peak of deaths has taken place, it is easy to know that it was already reached on March 28. The peak of the model, µ, was reached around April 1 (see Figure 2b).

New and Cumulative Deaths per Day Series
The update on May 11 shows two early peaks on March 14 and 16. This situation was followed by an increase with respect to the fitted model between March 22 and 28 and then between April 2 and 9. Between these two periods of time, the situation of ICUs is dramatic, as explained in the following subsection.
It is important to highlight that the model changes on the right side due to the effectiveness of the containment measures. This translates into an increase in the variance of the model on the right side, which indicates a slowdown in deaths. On May 11, the fitted model forecasts a total of around 9000 deaths at the end of the pandemic. Although the official data are confusing, the fitted model for new and cumulative ICUs (to fit a suitable model to the series of ICUs, it was necessary to modify the algorithm, considering two uniform models and one exponential model to adequately describe the consecutive plateau situations to the right of the normal model) reveals that the median of the model occurred between March 24 and 25 (see Figure 2c). It is important to note that the cumulative frequencies of new ICUs evolve very slowly, as the first graph in Figure 2c shows. It can be seen that 10% of the probability distribution of the model remains after May 11.

New and Cumulative ICUs per Day Series
Taking into account that the duration of ICU stay depends on each patient and it usually ranges between 8 and 28 days, one can understand the saturation of the ICUs in the Region of Madrid.
The real peak of the series is one of the most difficult features to forecast. It may have occurred after the peak of deaths on March 28. The date when more new cases were incorporated into ICUs was March 20 with 205. This situation is detected with the value of the parameter µ of the normal model fitted to the left, which corresponds to March 21 (see the second graph of Figure 2c). However, that date does not correspond to the real peak because ICUs became saturated. On April 2, the reported number of ICUs reached the highest value: 1528. The two dates cited are around the worst moments in terms of numbers of deaths.
On May 11, the fitted model forecasts that a total of 4000 people will have gone through the ICU at the end of the pandemic.

Second Wave
After the first wave, the format in which the data were provided changed and their quality increased, although the series continue to change from one day to the next. Figure 3 shows the model fitted on December 12 for the second wave of all the series for Madrid: cases, deaths, hospitalizations and ICU admissions. Figure 4 shows the monitoring of the parameters µ, σ_left, σ_right and n. Note how the effect of the interventions is manifested in an advance of the peak, in the jump that σ_left experiences with respect to σ_right and in the stabilization of n after the peak is reached. In addition, from the end of August, the peak is predicted in the future. Figures 5 and 6 show the fitted models of the second wave for Asturias. These figures make evident the usefulness of testing different regression functions. Unlike in Madrid, the expert system selects as best fits those made with a double exponential or double Pareto model.
During the third and fourth waves, we incorporated the error correction model into our expert system that increased our predictive capacity. This is the model with which we currently give our 14-day predictions to the Spanish Mathematics Committee. We found in the comparison tool implemented on the website of this initiative that this new model remains among the top three over time with respect to all error metrics. Another of its advantages is that it adjusts the four data series at the same time: cases, deaths, hospitalized and ICUs.

Model Definition
It is intuitive to think that a peak in the confirmed series will increase the other three data series with some delay. This type of relationship is usually modeled by including lagged values of the confirmed series up to p periods in the past and using them to predict the present of the other series. For one series, this is called an auto-regressive model, which is generalized to more than one series in the vector auto-regressive model. This type of model requires the time series to be stationary, meaning that the first two moments of the process are constant over time. This restriction is a problem in our case, where we want to predict future changes in trends.
When one has a set of stationary time series, y_t = (y_1t, …, y_Kt)′, one can define the stable auto-regressive vector model of order p as

y_t = ν + A_1 y_{t−1} + … + A_p y_{t−p} + u_t,

where A_i are matrices of parameters, ν = (ν_1, …, ν_K)′ represents the mean level of each time series and u_t represents the random error term of the model, with u_t ∼ N(0, Σ_u).
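As a quick numerical check of this definition, one can simulate a stable bivariate VAR(1) and recover ν and A_1 by multivariate least squares (NumPy only; the parameter values below are arbitrary but stable):

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])      # eigenvalues 0.6 and 0.3: stable VAR(1)
nu = np.array([1.0, -0.5])

# Simulate y_t = nu + A1 @ y_{t-1} + u_t with small Gaussian noise.
T = 20_000
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = nu + A1 @ y[t - 1] + rng.normal(0.0, 0.1, 2)

# Multivariate least squares: regress y_t on (1, y_{t-1}).
X = np.hstack([np.ones((T - 1, 1)), y[:-1]])
B, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
nu_hat, A1_hat = B[0], B[1:].T
```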
The usual procedure when working with this model on non-stationary series is to difference them until they become stationary, but differencing clouds the inference about the model.
When the series share a common stochastic trend, however, there is a long-term relationship: the series go down and up in the same way, keeping a certain distance between them.
We say that the set of time series y_t has a long-run equilibrium if there exists a vector β such that β′y_t = β_1 y_1t + … + β_K y_Kt = 0, and we define the process z_t = β′y_t as the deviations from this relation.
Thus, the series in the set y_t are said to be cointegrated if y_it is integrated of order 1 for all i = 1, …, K and there exists a vector β such that z_t = β′y_t is a stationary process.
With this relationship in mind, it is possible to define a model based on the vector auto-regressive model for this type of data, the error correction model, given by

Δy_t = Π y_{t−1} + Γ_1 Δy_{t−1} + … + Γ_{p−1} Δy_{t−p+1} + C d_t + u_t,

where Γ_i are matrices of parameters, d_t is the deterministic part of the model (with coefficient matrix C), u_t ∼ N(0, Σ_u) and Π = αβ′, with α the loading matrix and β the cointegration matrix. Integrated and cointegrated systems must be interpreted cautiously. A tool to make inferences about the model is the impulse response function, which describes the evolution of the model variables in reaction to a shock in one or more of them. The impulse response function is developed for VAR models, but we can express the error correction model as a VAR model, as mentioned in [25].
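The two-step Engle-Granger logic behind this kind of model can be sketched on a synthetic cointegrated pair: estimate the cointegrating vector by regressing one series on the other, then regress the differenced series on the lagged equilibrium deviations. A negative estimated loading confirms the error-correcting behavior (all series and coefficients here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
T = 4000
trend = np.cumsum(rng.normal(0.0, 1.0, T))   # shared stochastic trend, I(1)
y1 = trend + rng.normal(0.0, 0.5, T)
y2 = 2.0 * trend + rng.normal(0.0, 0.5, T)   # y2 - 2*y1 is stationary

# Step 1: estimate the cointegrating relation by regressing y2 on y1.
X = np.column_stack([np.ones(T), y1])
(b0, b1), *_ = np.linalg.lstsq(X, y2, rcond=None)
z = y2 - b0 - b1 * y1                        # deviations from equilibrium

# Step 2: error correction regression for y2:
#   dy2_t = alpha * z_{t-1} + gamma * dy1_t + u_t
dy1, dy2 = np.diff(y1), np.diff(y2)
W = np.column_stack([z[:-1], dy1])
(alpha, gamma), *_ = np.linalg.lstsq(W, dy2, rcond=None)
```

The cointegration slope b1 is recovered very precisely (superconsistency), and the negative loading alpha shows that deviations from the equilibrium are corrected over time.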
For the application of the error correction model to the COVID-19 data, the following modifications are made:
• It is first necessary to apply the logarithm to each of the series. This is because the seasonality is multiplicative while the model is additive, so it is necessary to transform this relationship so that the proposed model can capture it.
• The data series show a strong seasonality due to the data collection and publication policy of each region. To capture this seasonality in the model, the deterministic component d_t introduces a dummy variable for each day of the week except one (to avoid collinearity).
• A last change in the model is carried out using a dummy variable for a change of scenario: that of the test policy, since tests were not available at the beginning of the pandemic. While at the beginning of the first wave only some suspected cases could be tested for SARS-CoV-2, later the scenario changed and diagnostic testing was extended to all suspected cases and close contacts, and even several mass screenings were performed.
With these changes, the final model formula is given by Equations (3) and (4). Once the model is defined, there is only one parameter to decide, the regression order p, that is, the number of lags of each of the series in the equation. To decide this parameter, and given the ease of computing this model, a cross-validation procedure is carried out with different values of p, taking the order p with the smallest mean absolute percentage error.
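The lag-order selection can be sketched as follows; for brevity, a univariate AR(p) stands in for the full vector model, and the data are synthetic (true order 2):

```python
import numpy as np

def mape(actual, pred):
    # Mean absolute percentage error, in percent.
    return float(np.mean(np.abs((actual - pred) / actual))) * 100

def fit_ar(series, p):
    # OLS fit of a univariate AR(p) with intercept.
    Y = series[p:]
    X = np.column_stack([np.ones(Y.size)] +
                        [series[p - i:series.size - i] for i in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

def cv_mape(series, p, train_frac=0.7):
    # Rolling one-step-ahead forecasts on the held-out tail of the series.
    split = int(series.size * train_frac)
    preds, actuals = [], []
    for t in range(split, series.size):
        beta = fit_ar(series[:t], p)
        x = np.concatenate([[1.0], series[t - 1::-1][:p]])  # 1, lag1..lagp
        preds.append(float(x @ beta))
        actuals.append(float(series[t]))
    return mape(np.array(actuals), np.array(preds))

# Synthetic AR(2) data: cross-validated MAPE should favor p >= 2.
rng = np.random.default_rng(3)
s = np.zeros(600)
for t in range(2, 600):
    s[t] = 10 + 0.5 * s[t - 1] - 0.5 * s[t - 2] + rng.normal(0, 1)
scores = {p: cv_mape(s, p) for p in range(1, 6)}
best_p = min(scores, key=scores.get)
```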

Application to the Case Study
To assemble the model, we proceed as described in the previous section. A logarithmic transformation is applied to the data, and the pairwise cointegration of the series under study is checked with the Engle and Granger [26] procedure and the Phillips-Ouliaris test [26]. Both tests work with the null hypothesis of no cointegration, and the pairwise p-values are shown in Table 7. The cointegration relationship shows that there is a long-term equilibrium between the series. The series are also related through the auto-regressive part. To confirm it, a cross-correlation study was carried out between the series of ICU, hospitalized and deaths and the series of confirmed patients. To avoid spurious correlations, each series is first made stationary individually by applying the logarithm and differencing once.
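The stationarizing transformation and the cross-correlation study can be sketched on synthetic series in which hospitalizations follow confirmed cases with a 7-day delay (an assumed, illustrative lag):

```python
import numpy as np

def stationarize(series):
    # Log-transform and difference once, as done for the case study.
    return np.diff(np.log(series))

def cross_correlation(x, y, max_lag):
    # Correlation of x_t with y_{t-lag} for lag = 0..max_lag.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return {lag: float(np.mean(x[lag:] * y[:y.size - lag]))
            for lag in range(max_lag + 1)}

# Synthetic exponential-growth series; hospitalizations track confirmed
# cases with a 7-day delay plus multiplicative noise.
rng = np.random.default_rng(7)
confirmed = np.exp(np.cumsum(rng.normal(0.02, 0.05, 300)) + 5)
hosp = 0.1 * np.roll(confirmed, 7) * np.exp(rng.normal(0, 0.02, 300))
hosp[:7] = hosp[7]                       # pad the first week

ccf = cross_correlation(stationarize(hosp), stationarize(confirmed), 14)
best_lag = max(ccf, key=ccf.get)
```

On the differenced log series, the spurious level correlation disappears and the cross-correlation peaks at the true delay.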
Figure 7a shows significant lagged relationships between the confirmed series and the others (as well as among all of them, outside the scope of this article), indicated by the cross-correlations entering the rejection region of the test for zero correlation.
To obtain the optimal order for all of them simultaneously, the auto-regressive order p is optimized, obtaining in this case that the optimal p is 9.
To ensure the adequacy of the model, the independence of the errors is checked. The p-values for the augmented Dickey-Fuller test [26] and the Phillips-Perron test [26] are included in the auto-correlation plot of the residuals in Figure 7b.
No pattern is observed in this plot, and the p-values of both tests allow us to reject the hypothesis of a unit root, so we conclude that the errors are stationary.
Another good check in the case study is to see how the peak of the wave is predicted. To this end, Figure 7c presents the current date on the x-axis and the peak date on the y-axis. The black line marks the date of the maximum value observed so far, while the blue line marks the peak date predicted by the model.
It can be observed in the graph that, while in the first half of July the peak was predicted for mid-August, in the second half of July the predicted peak shifts sharply to mid-September, remaining stable until the actual peak date (18 September 2020).
In addition, the model is capable of detecting when the peak has passed: from the real peak date (18 September 2020) onwards, Figure 7c shows that the model identifies the peak as already behind.
To check the adequacy of the model for prediction, error metrics are calculated [27] using cross-validation techniques [28] for fitted and forecast values, presented for frequency data (Table 8) and cumulative data (Table 9).
Finally, the results of applying the impulse response function with two different histories are presented. The first represents the data from the beginning of the pandemic to before the second wave. The second represents the data from the beginning of the pandemic to before the third wave. This graph can be observed in Figure 8a.
In this figure, the maximum influence and the moment at which its effects fade can be observed, with the curves becoming less pronounced. It should be noted that this model was not initially developed for the prediction of confirmed cases, but for the prediction of the other three series that depend closely on it. Even so, the model performs well on the confirmed series, and even better on the hospitalized series.
The implementation of the model allows quick execution of the fit and forecast for all the Autonomous Communities (Figure 8b).
For the comparisons between models, we considered the following extension of the SIR model proposed by Castro et al. [10] and an automatic implementation of the ARIMA forecasting model by Hyndman and Khandakar [29]. We chose the SCIR model because it incorporates a compartment for deaths into its definition and is formulated through only five parameters. Different extensions of SIR models can be found in [30,31]. The automatic ARIMA model was chosen for its widespread use in time series forecasting; some examples can be found in [32,33].

SCIR Model: A SIR Model with Confinement
The SCIR model includes the usual states of an SIR model plus a class C for individuals sent to confinement who are susceptible, but not infected. Susceptible individuals (S) can enter and exit confinement (C) or become infected (I). Infected individuals can recover (R) or die (D). Figure 9b shows the diagram of the equations of the SCIR model.
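Since the exact equations appear only in Figure 9b, the following is a plausible sketch of an SCIR system with illustrative (not fitted) flow rates: susceptibles enter confinement at rate q and leave it at rate p, and infected individuals recover at rate gamma or die at rate mu. All parameter values and initial conditions are assumptions for demonstration only:

```python
import numpy as np
from scipy.integrate import solve_ivp

def scir(t, y, beta, q, p, gamma, mu):
    """Hypothetical SCIR right-hand side; the exact equations are those
    of Figure 9b, and the flow structure here is an assumption."""
    S, C, I, R, D = y
    N = S + C + I + R
    dS = -beta * S * I / N - q * S + p * C  # infection + confinement flows
    dC = q * S - p * C                      # enter / exit confinement
    dI = beta * S * I / N - gamma * I - mu * I
    dR = gamma * I                          # recoveries
    dD = mu * I                             # deaths
    return [dS, dC, dI, dR, dD]

# Illustrative parameter values and initial state (not fitted to data).
params = (0.3, 0.05, 0.02, 0.1, 0.01)  # beta, q, p, gamma, mu
y0 = [1e6 - 10, 0.0, 10.0, 0.0, 0.0]
sol = solve_ivp(scir, (0, 120), y0, args=params)
infected_curve = sol.y[2]  # I(t)
deaths_curve = sol.y[4]    # cumulative D(t)
```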
For the optimization of the parameters, we included in the objective function an accuracy measure that combines the determination coefficients of the fits of both the cases and the deaths.
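A minimal version of such a combined accuracy measure might look as follows; the equal weighting of the two coefficients of determination is an assumption, not necessarily the weighting used in the original objective:

```python
import numpy as np

def r2(actual, fitted):
    """Coefficient of determination of a fit."""
    actual, fitted = np.asarray(actual, float), np.asarray(fitted, float)
    ss_res = np.sum((actual - fitted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def combined_accuracy(cases, cases_fit, deaths, deaths_fit):
    """Equal-weight mean of the cases and deaths R^2 (weighting assumed)."""
    return 0.5 * (r2(cases, cases_fit) + r2(deaths, deaths_fit))
```

Maximizing this quantity over the five SCIR parameters then balances the quality of the fit on both observed series.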
In the next section, the comparison between the three models is illustrated with the second wave for the Region of Madrid.

Automatic SARIMA Model
The procedure followed to apply the automatic ARIMA adjustment method of [29] is: • Automatic model selection of the hyper-parameters (p, d, q)(P, D, Q) and of a Box-Cox transformation [34] with parameter λ. On our data, this step yields p = 1, d = 1, q = 2, P = 0, D = 1, Q = 1 and λ = 0.2086024, i.e., the model (1 − φ1 B)(1 − B)(1 − B^7) y_t^(λ) = (1 + θ1 B + θ2 B^2)(1 + Θ1 B^7) ε_t, where B is the backshift operator and the seasonal period is 7 days.
• In the validation phase, the regression parameters are recalculated with these same hyper-parameters.

Other Models
To verify the contribution of the two models developed, other algorithms such as neural networks [35] and machine learning algorithms were tested. Some of the resulting predictions are shown in Figure 10, which presents, over the same dates, implementations of random forest and xgboost with a direct forecasting strategy [36].
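The direct forecasting strategy mentioned above (one model trained per forecast horizon) can be sketched with a random forest as follows; the lag window of 7 observations is an illustrative choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def direct_forecast(y, horizon=14, n_lags=7):
    """Direct strategy: train one model per horizon h, each predicting
    y[t + h] from the n_lags most recent observations at time t."""
    y = np.asarray(y, float)
    last_window = y[-n_lags:]
    preds = []
    for h in range(1, horizon + 1):
        # Build the supervised dataset for this specific horizon.
        rows = [y[t - n_lags + 1:t + 1] for t in range(n_lags - 1, len(y) - h)]
        targets = [y[t + h] for t in range(n_lags - 1, len(y) - h)]
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(np.array(rows), np.array(targets))
        preds.append(model.predict(last_window.reshape(1, -1))[0])
    return np.array(preds)
```

Replacing `RandomForestRegressor` with an xgboost regressor gives the other model compared in Figure 10.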
In [37], a deep long short-term memory (LSTM) network is used very successfully for financial time series forecasting. We tried to apply this neural network to our problem, but the results were not satisfactory. This is undoubtedly because the number of samples in each wave is insufficient to train a deep learning network. Figures 11 and 12 show the fit and 14-day forecasts with the three models for the time interval corresponding to the second wave, for eight different endings of the historical series (all starting on June 24th), respectively, for the cumulative and non-cumulative absolute frequencies of the cases. Figure 14 shows the monitoring of peak detection achieved with the three models; the model that best detects the peak is the non-linear regression model.

Conclusions
A non-linear regression model for count data in time series is developed and implemented by means of an artificial-intelligence expert system. It is based on directly estimating the distribution function of each of the series under study and on the duality between the distribution function and the density function. Since those two functions fully characterize the probability distribution of a variable, our model is able to capture the main characteristics of epidemic outbreaks. The simplicity of MATGEN must also be noted, since it is formulated through only four parameters: μ, σ_left, σ_right and n. Monitoring all model parameters makes it possible to easily quantify and detect the effect of interventions over time. Furthermore, the machine learning algorithm developed is scalable, allows different data series to run in parallel and is capable of incorporating new data in real time.
We applied this model to the COVID-19 series of the Region of Madrid (Spain) during the first and second waves to give an eight-day forecast for the Spanish Mathematical Initiative [16]. This theoretical framework allows us to detect pandemic peaks and to carry out short- and long-term monitoring and forecasting of the number of people infected, people requiring hospitalization and deaths. The expert system proved very useful for estimating the effectiveness of the interventions prompted by the government, which seem to have had an impact 11 days after their implementation during the first wave. Moreover, it is useful for proposing commitment dates to lift mobility restrictions and for advising on how to proceed in future outbreaks. On May 25, the Region of Madrid entered Phase 1 of the de-escalation. The MATGEN update on May 11 showed commitment dates between May 19 and June 5 (with forecasts of 72,200, 9000 and 4000 for the total numbers of confirmed cases, deaths and ICU admissions at the end of the pandemic, respectively).
The flexibility of our theoretical framework allows us to fit different regression functions (Gaussian, Gompertz, double Pareto, double exponential and uniform) between different dates. During the first wave, the series of new cases per day and deaths in the Region of Madrid only needed one cutoff and normal models in each part. The ICU series adjustment was more difficult: three cutoffs and different models in a wide family of probability distributions were needed. The computational cost of the last situation is higher than that of the first. In this case, the algorithm needed to detect the appropriate number of cutoffs and tried all the possible combinations of the models belonging to the family considered.
At present, in order to provide 14-day forecasts for the Spanish Mathematical Initiative [16], we implement another MATGEN model based on error correction models, useful for estimating both short- and long-term effects of one time series on another. This model is based on the cointegration of the four series in the study, namely cases, deaths, hospitalizations and ICUs, and it was incorporated into the expert system too.
Finally, the comparison from different points of view of our two models with the SCIR model [10] yields the following conclusions:
• Among the advantages of the SCIR model, we can highlight its simplicity, since it is formulated using only five parameters, and their interpretability from an epidemiological point of view.
• The MATGEN non-linear regression model (formulated with four parameters per series that are easy to monitor) is the most explanatory for studying peak detection and the effect of interventions. In addition, it is equipped with a control procedure that detects trend changes in the tails indicating the start of a new wave. Unlike the other two models, it fits the four series in parallel.

• The MATGEN error correction model depends on a greater number of parameters but allows us to approximate the four series simultaneously, and it is the best in all the metrics, both in fit and in forecast. In addition, this model incorporates the impulse response function, a method for making inferences about the impact of one series on the others.
Therefore, MATGEN combines our two proposed models in one expert system: a new epidemiological tool that can prove extremely useful in new COVID-19 outbreaks and in future epidemics of infectious diseases.

Acknowledgments: We thank the reviewers for their suggestions, which have helped to improve the initial draft of this article.

Conflicts of Interest: The authors declare no conflict of interest.