Article

A Two-Step Polynomial and Nonlinear Growth Approach for Modeling COVID-19 Cases in Mexico

by Rafael Pérez Abreu C. 1, Samantha Estrada 2 and Héctor de-la-Torre-Gutiérrez 1,*

1 Aguascalientes Campus, Centro de Investigación en Matemáticas, A. C., Calzada de la Plenitud 103, José Vasconcelos Calderón, Aguascalientes 20200, Mexico
2 Department of Psychology and Counseling, University of Texas at Tyler, 3900 University Blvd, Tyler, TX 75799, USA
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(18), 2180; https://doi.org/10.3390/math9182180
Submission received: 29 July 2021 / Revised: 1 September 2021 / Accepted: 2 September 2021 / Published: 7 September 2021

Abstract

Since December 2019, the novel coronavirus (SARS-CoV-2) and its associated illness COVID-19 have rapidly spread worldwide. The Mexican government has implemented public safety measures to minimize the spread of the virus. In this paper, we used statistical models in two stages to estimate the total number of coronavirus (COVID-19) cases per day at the state and national levels in Mexico. We propose two types of models: first, a polynomial model of the growth for the first part of the outbreak, up to the inflection point of the pandemic curve, and second, a nonlinear growth model used to estimate the middle and the end of the outbreak. Model selection was performed using Vuong’s test. The proposed models showed overall fit similar to predictive models (e.g., time series and machine learning); however, the interpretation of parameters is simpler for decisionmakers, and the residuals of the fitted models follow the expected distribution, without autocorrelation being an issue.

1. Introduction

The world is currently experiencing a pandemic caused by the novel coronavirus, formally named COVID-19 by the World Health Organization (WHO). Development of a vaccine and antiviral drugs to treat COVID-19 is still ongoing, resulting in hospitalization and intensive care unit management as the only option in treating COVID-19. Thus, there is a dire need for research on modeling the outbreak of COVID-19 to help officials in their decision-making processes regarding interventions and allocation of resources [1]. At the time this manuscript was being written, the pandemic was ongoing, and most of the epidemiological models developed focused on short-term predictions, identifying the daily peak of COVID-19 cases, predicting the duration of the pandemic, and estimating the possible impact of the measures implemented for minimizing exposure to the virus and decreasing the fatality rate [2,3,4,5,6,7,8,9,10].
As of 24 September 2020, the cumulative number of COVID-19 cases in Mexico was reported as 715,457 [5], and 32,245,122 cases were reported worldwide [11]. Thus, the main objective of this paper was to model the total number of COVID-19 cases per day at the national and the state level in Mexico while simultaneously providing straightforward information to decisionmakers; additionally, we sought to determine which model provides the most stable short-term predictions. Figure 1 shows the accumulated cases and new cases at the national level in Mexico. This figure shows the first wave peak of the pandemic until the data cut-off of 24 September. Until the 24 September date, only a single wave of infections had been observed (on 1 August). The models developed in this research facilitate the obtaining of information to support decisionmakers in the strategic planning activities of the Mexican states, metropolitan areas, municipalities, or cities with high population density. Mexican officials can use these models to aid in the management process involving the needs and resources of the health services such as available hospital beds, intensive care units, and respirators, as well as personal protective equipment (PPE) for health personnel. For decisionmakers, such as public health officials, having access to daily and permanent monitoring at the center of the pandemic allows them to anticipate the purchase of the necessary medical equipment in advance. Further, the authors would like to share these models so that officials and statisticians outside of Mexico can make use of them for their own decision-making procedures during the length of the COVID-19 pandemic. The proposed methodology in this paper can easily be applied to COVID-19 worldwide.
In this work we refer to 1–2–3 models as two-step models due to the method used to estimate their parameters [12]. The method is performed in two steps that will combine information from time series models with non-linear growth models and polynomial models. The two-step estimation is a process, also known as the Cochrane–Orcutt procedure, which is defined as:
“A two-step estimation of a linear regression model with first-order serial correlation in the errors. In the first step the first-order autocorrelation coefficient is estimated using the ordinary least squares residuals from the main regression equation. In the second step this estimate is used to rescale the variables so that the regression in terms of rescaled variables has no serial correlation in the errors. This is an example of feasible generalized least squares estimation” [13].
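The two steps in this definition can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation (the analyses in this paper were run in R and STATA): it assumes a single regressor and a single, non-iterated pass, and the helper names `ols` and `cochrane_orcutt` are hypothetical.

```python
def ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def cochrane_orcutt(x, y):
    """One pass of the two-step procedure; returns (a, b, rho)."""
    # Step 1: OLS on the original data, then estimate the first-order
    # autocorrelation coefficient rho from the OLS residuals.
    a, b = ols(x, y)
    u = [yi - a - b * xi for xi, yi in zip(x, y)]
    rho = (sum(u[t] * u[t - 1] for t in range(1, len(u)))
           / sum(ut ** 2 for ut in u[:-1]))

    # Step 2: rescale (quasi-difference) the variables with rho and rerun OLS.
    # The first observation is dropped by this transformation.
    x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    a_star, b_hat = ols(x_star, y_star)

    # The transformed intercept equals a * (1 - rho), so undo the scaling.
    return a_star / (1.0 - rho), b_hat, rho
```

The Prais–Winsten variant mentioned later in the text differs in that it keeps the first observation, with a separate rescaling, instead of dropping it.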
Several machine learning (ML) and artificial intelligence (AI) models have demonstrated acceptable performance in the modeling of the COVID-19 pandemic; our proposed methodology meets this expectation in addition to a simple estimation of the parameters. Unlike the susceptible–infected–recovered–deceased (SIRD) models, the proposed models do not require setting or assuming the value of any parameter to obtain the estimates [14,15]. Finally, another advantage of our models is the interpretability of their parameters, that is, estimates of some parameters directly linked to the pandemic can be obtained.
The article is structured as follows: In Section 2, we summarize the most relevant literature regarding the modeling of the COVID-19 pandemic. Next, in Section 3, the data used are presented and the proposed methodology is described. Section 4 shows the main results of the investigation. Finally, in Section 5, the main conclusions are presented, and the limitations of this research are discussed.

2. Literature Review

Research exists with data-driven approaches such as autoregressive (AR) and autoregressive integrated moving average (ARIMA), ranging from simple models (exponential smoothing) to more complex models such as ARIMAX, ARCH, GARCH, and ARFIMA [2,7,8,10,16,17]. For example, we used an ARIMA model with data compiled by Johns Hopkins University to predict models for the daily confirmed cases in countries where the pandemic was peaking and to predict and anticipate the resources of the healthcare systems [18]. Unfortunately, these data-driven models fail to fit the data and often lack accuracy [6,19]. Additionally, the parameters of these models cannot be interpreted according to the reality of the pandemic. This interpretability barrier causes statisticians and officials to make their decisions on the basis of predictive models instead of the peak of the pandemic or the growth of the pandemic. A useful model for policy and public health decisionmakers during the COVID-19 pandemic would be a model that, in addition to obtaining accurate predictions, provides insights on the evolution or current behavior of the pandemic. Another approach is real-time forecasting using a generalized logistic growth model. This method has been previously used in China to generate short-term forecasting of COVID-19 cases [20,21], as well as with data from Canada, France, India, South Korea, and the UK to forecast daily cases [22,23]. These models are incredibly useful in that they provide information on the current state of the pandemic. However, in this study, we had two aims regarding logistic growth models: (1) to demonstrate that their assumption of independence of residuals is not met and (2) to show that their modeling performance at the earlier stages of the pandemic is not optimal but can be improved by the incorporation of an autoregressive component [4,9].
The models used in this paper are based on statistical linear models, classic time series, and restricted growth models (also called limited growth or nonlinear growth models) [1,2,3,13,24], as well as real-time forecasting using a generalized logistic growth model [4,25]. In this paper, we propose estimations in two stages of the pandemic utilizing polynomial and nonlinear growth functions while incorporating an autoregressive component with the purpose of meeting the assumption of independence of residuals. First, we propose using a polynomial function estimated using the Prais–Winsten methodology to estimate the first stage of the pandemic (when exponential growth of COVID-19 cases was observed). Our rationale for choosing a third-degree polynomial model was the following: under certain scenarios, it can be converted into an increasing monotonic function, which is essential when modeling the total of accumulated cases; degree three keeps the model simple relative to higher-order polynomials; and its curve has the shape of an “S”.
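The monotonicity requirement can be made explicit. Writing the cubic trend as p(t), with the coefficients α0 to α3 of the polynomial model in Section 3.3, the following sketch gives a sufficient condition stated over all real t for simplicity; on a restricted range such as t ≥ 0, weaker conditions can suffice.

```latex
p(t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \alpha_3 t^3, \qquad
p'(t) = \alpha_1 + 2\alpha_2 t + 3\alpha_3 t^2 .

% p'(t) is an upward-opening parabola that never becomes negative when
\alpha_3 > 0 \quad \text{and} \quad (2\alpha_2)^2 - 4\,(3\alpha_3)\,\alpha_1 \le 0
\;\Longleftrightarrow\; \alpha_2^2 \le 3\,\alpha_1 \alpha_3 .

% The single inflection point of the curve is where p''(t) = 0:
p''(t) = 2\alpha_2 + 6\alpha_3 t = 0 \;\Longrightarrow\; t^{*} = -\frac{\alpha_2}{3\alpha_3} .
```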
Next, in the second stage of the pandemic (when the peak of COVID-19 cases was reached), we propose utilization of nonlinear growth functions, logistic, and Gompertz in order to predict the total cumulative number of cases and the growth rate of the spread of COVID-19. In each of the models estimated before and after the peak of the pandemic, at the second stage of modeling, we added an autoregressive component of order one (AR (1)) to compare the results to the models that do not account for the violation of independence of residuals. This approach has been successfully used to model plant and animal growth where measuring the same unit can lead to violation of independence of residuals [9,25,26]. Next, we selected the best estimate equation to model the pandemic by using Vuong’s test criterion [27]. We anticipated the proposed model to have good performance similar to neural networks (NN) or support vector machines (SVM). A disadvantage of NN is that it is difficult to generate a day-to-day prediction in addition to finding the growth rate and finding the maximum number of cases. These are not a problem for the functions we propose. Furthermore, artificial intelligence (AI) has been used to identify, track, and forecast COVID-19 cases. However, AI models are difficult to interpret—the process by which they arrive at a decision is often referred to as a “black box” due to the complexity in understanding how AI models arrive at certain conclusions [28]. The complex interpretation may create a barrier for decisionmakers looking for straightforward solutions in the middle of a pandemic. Moreover, we anticipate our proposed modeling of the pandemic will meet the assumption of independence of residuals. 
In contrast to NN, our proposed model can provide predictions and will facilitate interpreting parameters such as the highest number of infected individuals, growth speed, the initial number of infected individuals, and the autoregressive parameter to measure the lag in reporting the new daily cases of infections.
In summary, we present these two models (polynomial and nonlinear growth models) because of ease of interpretation for non-statistician decisionmakers, their stable and useful predictions while accounting for the autocorrelation of the data, and their insights regarding the current state of the pandemic. Therefore, the objectives of this paper were to:
(1)
specify the polynomial function and nonlinear growth models (logistic and Gompertz) that include an autoregression component for dealing with the autocorrelated observations in the growth data in two stages: before and after the inflection point of the pandemic is reached, and
(2)
compare the different polynomial and nonlinear growth functions in their ability to describe the number of cases of the COVID-19 pandemic.

3. Methods

3.1. Dataset

The COVID-19 data used in this research were obtained from the publicly available data of the Mexican Secretaria de Salud Federal, which contained the number of cases confirmed from 27 February to 24 September 2020. The dataset also included the number of recovered patients and fatalities and can be downloaded in .csv format from the government’s website (https://www.gob.mx/salud/documentos/datos-abiertos-152127, accessed on 27 August 2021) [5]. Data to test the models focused on the national level and three Mexican states: Campeche, Quintana Roo, and Tamaulipas. To demonstrate the benefit of utilizing the autoregressive term, we used data from the state of Aguascalientes. The starting point of each time series (t = 1) was set to the last day on which no positive COVID-19 case had yet been reported, that is, the day before the first positive case. For example, in the state of Tamaulipas, the first positive COVID-19 case was reported on 17 March 2020, and thus 16 March 2020 was taken as t = 1. The aforementioned data are noisy due to the dynamics of registration of new cases, that is, there is a known lag in the registration of new cases that depends on the site in the country. Another source of noise linked to the dynamics of registration concerns the cases detected on weekends, some of which are not reported until the following Monday. Therefore, in order to reduce the effect of noise caused by the dynamics of registering new cases, we pre-processed the data by means of a moving average of two observations, that is, Y_i = (X_i + X_{i−1})/2, where X_i and X_{i−1} correspond to the total number of COVID-19 cases reported up to time i and i − 1, respectively. For the case of i = 1, Y_1 = X_1.
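As a minimal sketch of the pre-processing step described above, the two-observation moving average can be written as follows. The function name `smooth_two_point` and the example counts are hypothetical and are not taken from the dataset.

```python
def smooth_two_point(cumulative_cases):
    """Two-observation moving average of a cumulative series.

    The first value is kept as is (Y_1 = X_1); every later value is the
    mean of the current and the previous observation.
    """
    smoothed = []
    for i, x in enumerate(cumulative_cases):
        if i == 0:
            smoothed.append(float(x))
        else:
            smoothed.append((x + cumulative_cases[i - 1]) / 2)
    return smoothed


# Made-up cumulative counts, for illustration only
print(smooth_two_point([0, 2, 5, 9]))  # [0.0, 1.0, 3.5, 7.0]
```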
Analyses were conducted in R (version 3.6.2) and STATA (version 15.1) [29,30].

3.2. Model Selection

The current study proposes the utilization of a two-stage approach: First, Stage I model was fit throughout the pandemic and before reaching the peak of cases (before the inflection point), and in the second stage, once the peak of cases was reached, a different type of model was used. We utilized Vuong’s test for model selection [27]. Figure 2 shows a graphical abstract of the proposed methodology for this study.

3.2.1. Stage I: Before the Inflection Point

For the current study, we examined a variety of models that successfully model the behavior of the pandemic (e.g., the maximum number of cases, growth rate) while simultaneously meeting the assumption of independence of residuals in the data. Using the Akaike information criteria (AIC) and the root mean square error (RMSE) criterions, we found that the models that best described the total accumulated cases of COVID-19 in the four data examples utilized were:
  • Polynomial model of order three with an autoregressive error component of order one Equation (1), known as Prais–Winsten or Cochrane–Orcutt estimation, and
  • Nonlinear growth models, including logistic and Gompertz, which had the best fit of the models examined (Equations (2) and (3)).
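For reference, the two screening criteria named above can be computed from observed and fitted values as in the sketch below. It assumes independent Gaussian errors, in which case the AIC equals n·ln(RSS/n) + 2k up to an additive constant that is the same for all models fitted to the same data. The function names are hypothetical, and the statistical packages used in the paper may report AIC on a different additive scale.

```python
import math


def rmse(observed, fitted):
    """Root mean square error between observed and fitted values."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / n)


def gaussian_aic(observed, fitted, n_params):
    """AIC up to an additive constant, assuming i.i.d. Gaussian errors.

    Requires a nonzero residual sum of squares (RSS).
    """
    n = len(observed)
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return n * math.log(rss / n) + 2 * n_params
```

Lower values of either criterion indicate a better fit; AIC additionally penalizes the number of estimated parameters, `n_params`.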

3.2.2. Stage II: After the Inflection Point

To examine which models were the best once the peak of positive COVID-19 cases had been reached, we focused on the same three state data that showed good fit up to the peak of the pandemic. We sampled two data periods after the inflection point: 20 days after the inflection point (19 October) and the day before we began writing the results of this paper (24 September 2020). The results obtained from this stage of the pandemic were compared with those obtained before the inflection point of the pandemic. We hypothesized that one of the nonlinear growth models would have a better fit for the stage before the peak of cases.

3.3. Statistical Procedures

The complete two-step polynomial model used in this work has the form:
$$Y_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \alpha_3 t^3 + \rho u_{t-1} + \varepsilon_t \qquad (1)$$
where $Y_t$ is the number of total positive COVID-19 cases reported at time $t$; the coefficients $\alpha_0, \alpha_1, \alpha_2, \alpha_3$ were the parameters of the polynomial component of the model; $\rho$ is the coefficient of the autoregressive component $u_{t-1}$ obtained in the second step of the estimation; and $\varepsilon_t$ is a random error term, with $t = 0, 1, 2, \ldots, i$. The value $t = 1$ is selected as equal to the day of the first positive case.
The two-step logistic model used the following form:
$$Y_t = \frac{\beta_1}{1 + \beta_2 e^{-\beta_3 t}} + \rho u_{t-1} + \varepsilon_t \qquad (2)$$
where $Y_t$ is the number of total positive COVID-19 cases reported at time $t$; the coefficients $\beta_1, \beta_2, \beta_3$ correspond to the logistic component; $\rho$ is the coefficient of the autoregressive component $u_{t-1}$ obtained in the second step of the estimation; and $\varepsilon_t$ is a random error term, with $t = 0, 1, 2, \ldots, i$. $\beta_1$ models the highest number of infected, $\beta_2$ growth speed, $\beta_3$ is the initial number of infected individuals, and $\rho$ models the autoregressive process to incorporate the delay in the process of reporting the new cases of infection as well as the inherent pandemic dynamic. Recall that the value $t = 1$ is set to the day of the first positive case observed.
The two-step Gompertz model used the following form:
$$Y_t = \beta_1 e^{\left(-e^{\left(-\beta_2 (t - \beta_3)\right)}\right)} + \rho u_{t-1} + \varepsilon_t \qquad (3)$$
where $Y_t$ is the number of total positive COVID-19 cases reported at time $t$; the coefficients $\beta_1, \beta_2, \beta_3$ correspond to the Gompertz component; $\rho$ is the coefficient of the autoregressive component $u_{t-1}$ obtained in the second step of the estimation; and $\varepsilon_t$ is a random error term, with $t = 0, 1, 2, \ldots, i$. The $\beta_1$ parameter estimates the top number of infected, $\beta_2$ is growth speed, $\beta_3$ is the initial number of infected individuals, and $\rho$ models the autoregressive component. Similarly, the value $t = 1$ is set to the day of the first positive case observed. More information on two-step procedures can be found in [12].
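To make the deterministic parts of the two nonlinear models concrete, the sketch below evaluates the logistic and Gompertz curves and adds the autoregressive correction. It assumes the standard parameterizations, with negative exponents: β1 / (1 + β2·exp(−β3·t)) for the logistic and β1·exp(−exp(−β2·(t − β3))) for the Gompertz. The function names and the parameter values in the example are hypothetical and are not estimates from the paper.

```python
import math


def logistic_curve(t, b1, b2, b3):
    """Logistic growth curve: b1 / (1 + b2 * exp(-b3 * t))."""
    return b1 / (1.0 + b2 * math.exp(-b3 * t))


def gompertz_curve(t, b1, b2, b3):
    """Gompertz growth curve: b1 * exp(-exp(-b2 * (t - b3)))."""
    return b1 * math.exp(-math.exp(-b2 * (t - b3)))


def two_step_fitted_value(curve_value, rho, previous_residual):
    """Add the AR(1) correction rho * u_{t-1} to the growth-curve value."""
    return curve_value + rho * previous_residual


# Hypothetical parameters, for illustration only
for day in (0, 50, 100, 200):
    print(day,
          round(logistic_curve(day, 1000.0, 99.0, 0.1), 1),
          round(gompertz_curve(day, 1000.0, 0.05, 50.0), 1))
```

Under these forms both curves level off at β1 as t grows, which is why β1 is read as the final cumulative number of cases.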
The models were fitted in six time periods: the end date of the study (24 September 2020); 20 days before the peak of the pandemic for each of the Mexican states examined; peak day; and 10, 20, and 30 days after the peak of cases was reached. The models were compared in pairs utilizing Vuong’s test, a classical likelihood ratio approach to model selection for nested and non-nested models, which uses the Kullback–Leibler information criterion [27]. The hypotheses for Vuong’s model selection test are:
Hypothesis 1 (H1). Model fits are equal for the focal population.
Hypothesis 2A (H2A). Model 1 fits better than Model 2.
Hypothesis 2B (H2B). Model 2 fits better than Model 1.
Thus, the results will provide a p-value from Vuong’s test that can be compared to a significance level α and aid in selecting the model that best fits the data. Graphically, we can observe in Figure 3 the approximate modeling of the pandemic, as well as identify the peak (when there is a change in the growth rate) and the steps where the models of interest in the project will be used.
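The mechanics behind these p-values can be illustrated with a short sketch. It implements only the basic, uncorrected Vuong statistic for non-nested models, computed from the log-likelihood contribution of each observation under the two models; the function name `vuong_test` and its inputs are hypothetical, and the software used in the paper may apply corrections that are not shown here.

```python
import math


def vuong_test(loglik_model1, loglik_model2):
    """Uncorrected Vuong z statistic from pointwise log-likelihoods.

    Returns (z, p_h2a, p_h2b), where p_h2a is the p-value for
    'Model 1 fits better' and p_h2b for 'Model 2 fits better'.
    """
    # Pointwise log-likelihood ratios between the two models
    m = [l1 - l2 for l1, l2 in zip(loglik_model1, loglik_model2)]
    n = len(m)
    mean_m = sum(m) / n
    var_m = sum((mi - mean_m) ** 2 for mi in m) / n

    # Under H1 the statistic is asymptotically standard normal
    z = math.sqrt(n) * mean_m / math.sqrt(var_m)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

    return z, 1.0 - phi, phi
```

Because the two one-sided p-values are 1 − Φ(z) and Φ(z), they always sum to one, so a large p-value for H2A corresponds to a small p-value for H2B.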

4. Results

Table 1 displays the dates when COVID-19 cases peaked at the national level and in the three Mexican states selected for the study.
As previously mentioned, the proposed models were adjusted to six time periods. We used Vuong’s test for each time point to see which model fit best. The p-values obtained via Vuong’s test for the alternative Hypothesis 2A (H2A) can be seen in Table 2. Note that the p-values for the alternative Hypothesis 2B (H2B) were the complement of the p-values for the alternative Hypothesis 2A (H2A). For example, comparing the polynomial and logistic models, in Table 2, we can see the p-value obtained for the state of Tamaulipas on 24 September was 0.9650 for the H2A and its complement 0.0350 was the p-value for the H2B. At the significance level α = 0.05, we rejected the null hypothesis in favor of H2B, suggesting that the logistic model fits the data better than the polynomial model.
Table 3 summarizes which models fit better for each region at each of the six time periods examined. In general, the nonlinear growth models did not fit better than the polynomial models on dates before and at the peak of the first wave of the pandemic. On the other hand, when examining the dates prior to the first wave peak of the pandemic, we found a negligible difference between the models. More importantly, examinations of model comparisons between dates before and during the peak of the pandemic revealed that no model worked better than any other. That is, in terms of modeling the growth rate of COVID-19 cases, there was no difference between the models. However, in the late phase of the pandemic, after the inflection point, the nonlinear growth models fit better than the polynomial model.
Polynomial and nonlinear growth models were useful for modeling the beginning of the epidemic (see Figure 1) until reaching the maximum peak of daily cases for the pandemic [5]. On the other hand, nonlinear growth models were more accurate and effective when more information was available, and the maximum peak of daily cases was reached. Furthermore, time series models allowed for practical real-time monitoring of when (a) exponential growth was beginning, (b) exponential growth was in effect, and (c) exponential growth was about to end, which indicated that the epidemic was reaching its end. Finally, the nonlinear growth models allowed for describing the behavior at the end of the pandemic and monitoring and detecting a possible second wave of the epidemic. In general, the logistic and Gompertz models had the better fit. For example, for the state of Tamaulipas, we used the Gompertz model, which was one of the models with better fit for the peak point + 20 days. We could estimate the maximum number of COVID-19 cases β 1 = 63,640, the cases’ growth speed β 2 = 0.0156, and the initial number of cases β 3 = 162.

4.1. Model Performance

Once the models were estimated, it was possible to predict the total cases and the most recent information on rates (or percentage) of positive active COVID-19, outpatients, stable hospitalized patients, seriously hospitalized patients, and intubated hospitalized patients. Likewise, with the information from the SENTINEL Prevention Model, we estimated the total number of asymptomatic COVID-19 positive cases. Figure 4, Figure 5, Figure 6 and Figure 7 show the cumulative total cases of COVID-19 through 15 October (21 days out of the initial sample which was 24 September) for the four case studies (solid black line). These figures also show, with a red line, the point predictions made by each of the three models, as well as the area covered by the prediction intervals of said estimates with gray shading. In Figure 4, we can observe that the predictions made by the logistic and Gompertz models were relatively good, but not so in the case of the polynomial model, which even predicted a decrease in the total accumulated cases (which was not possible in the context studied). Figure 5 corresponds to the state of Tamaulipas—in this figure, we can see that the worst predictions were also made by the polynomial model. In Figure 6, we can see the predictions for the state of Quintana Roo, wherein the model that best made predictions was the logistic one, and the total number of cases was overestimated by the two other models. Regarding the predictions made for the country, in Figure 7, we can see that the best predictions, by far, were made by the Gompertz model. Table 4 shows the root mean square error (RMSE) of each model proposed for the three state case studies and at the national level at the six time periods of interest.
As a result of the 21-day predictions mentioned above, the number of new cases of COVID-19 can be obtained, and the results for the four case studies and the three models are shown in Table 5, Table 6, Table 7 and Table 8. These results are based on the official data source of the daily releases issued by the Mexican Secretaria de Salud on its website https://coronavirus.gob.mx/ as of 15 October 2020 [5].
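Since the models predict cumulative totals, the predicted new cases per day follow by differencing consecutive predictions. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def daily_new_cases(cumulative_predictions):
    """Difference consecutive cumulative totals to get new cases per day."""
    return [later - earlier
            for earlier, later in zip(cumulative_predictions,
                                      cumulative_predictions[1:])]


# Made-up cumulative predictions, for illustration only
print(daily_new_cases([100, 130, 170, 165]))  # [30, 40, -5]
```

A negative difference, as in the last entry, flags a predicted decrease in the cumulative total, which is not possible in this context.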

4.2. Autocorrelation

An autoregressive model was fitted to the residuals of the logistic and Gompertz models estimated in a single stage to eliminate the residual autocorrelation. The analysis revealed that the inclusion of an autoregressive component of the second order improved the fit of the data. For example, for the state of Aguascalientes, Figure 8a,b shows the autocorrelation residual and partial autocorrelation plots for models without the autoregressive terms. In these figures, it is clear that the residuals show a pattern of dependency. Furthermore, Figure 8c,d demonstrates a good fit, and there is no evidence of linear relationship. The patterns in the data revealed that nonlinear models with autoregressive terms met assumptions of independence. The models showed good fit while accounting for autocorrelation of residuals, and they provided better interpretability of the model coefficients in terms of growth rate, maximum number of cases, and initial number of cases. Thus, the model proposed in this paper produces a substantial improvement of the predictions.
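The residual diagnostics described here can be sketched as follows: compute the sample autocorrelation of the residuals, compare it with the approximate 95% white-noise band ±1.96/√n, and, if dependence remains, obtain second-order autoregressive coefficients. The paper does not state which estimator was used for the autoregressive terms; moment (Yule–Walker) estimates are used below purely for simplicity, and all function names are hypothetical.

```python
import math


def sample_acf(residuals, max_lag):
    """Sample autocorrelations of the residuals for lags 0..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def white_noise_band(n):
    """Approximate 95% bound for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)


def yule_walker_ar2(r1, r2):
    """AR(2) coefficients from the lag-1 and lag-2 autocorrelations."""
    phi1 = r1 * (1.0 - r2) / (1.0 - r1 ** 2)
    phi2 = (r2 - r1 ** 2) / (1.0 - r1 ** 2)
    return phi1, phi2
```

Autocorrelations at lags 1 and above that fall outside the band suggest remaining dependence; if the lag-2 coefficient from `yule_walker_ar2` is close to zero, a first-order component may already be enough.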

5. Conclusions

The modeling of the COVID-19 cases is a unique challenge for statisticians, considering the fact that the data are limited and there are often delays in updating the data. Our approach to modeling the pandemic can provide assistance to decision-making officials for containing and anticipating a “second wave” of the COVID-19 pandemic in Mexico. We divided the pandemic data into two stages, and at each stage, three two-step models were adjusted. In Stage I, as can be seen in Table 3, the polynomial model of order three with an autoregressive error component of order one was computed through applying the Prais–Winsten or Cochrane–Orcutt estimation, which had a better performance than nonlinear growth models. In Stage II, after the peak of COVID-19 cases, two-step nonlinear growth models outperformed the polynomial model. Further, the models used in this paper were different than those used in predictive models using time series or machine learning; however, our model met the assumption of independence of residuals, and the interpretability of our model was superior to those of machine learning—particularly for government officials without a statistics or machine learning background.
Although it is not the objective of this research, another purpose for the proposed models concerns the identification of the peak of the pandemic. That is, an analytical way to know if the pandemic is currently in a period of growth, peak, or decline of sustained cases is through the adjustments of the models. For example, at the moment that any of the non-linear growth models fits significantly better than the other models, we will be at the point of sustained reduction of daily cases, which would mean that the peak has already passed, and that said model must be used to make predictions of total cases and obtain insights of the pandemic at that point.
In summary, this paper showed the efficacy of utilizing two different types of models estimated in two stages (stages depending on the state of the pandemic). The majority of the efforts to model the COVID-19 pandemic are nonlinear models, such as logistic and Gompertz [4,5]. However, these models do not take into account autoregression, thus possibly skewing short-, medium-, and long-term predictions. In contrast with the SIR models, where it is necessary to fix or assume a few initial parameters, our proposed models do not require initial parameter assumptions. Our first recommendation, due to simplicity in fitting the model, is that in the early stages of the pandemic where there was an exponential growth (during or before the peak), one should utilize a polynomial model of the third order estimated with an autoregressive component. Our second recommendation is that for the later, more advanced stages when the peak of the pandemic was reached, one should utilize a nonlinear growth model (logistic or Gompertz) estimated with an autoregressive component. In the event that two models tie for the best fit and one of them is the polynomial model, it will be necessary to question whether it is in the public health decisionmakers’ interest to have insights (provided by the βs of the non-linear growth models) of the current state of the pandemic; if yes, then we recommend using the non-linear growth model. In the opposite case, where it is not in the interest of the decisionmakers to know insights of the pandemic, the use of the polynomial model is recommended since it does not require initial values for its estimation. Another scenario could be that the tie is between the two nonlinear growth models, in which case either of the two can be used.
This research is not without limitations. The projections resulting from the models were estimated without considering any type of intervention. In other words, the intervention effects—such as mask mandates, lockdowns, social distancing, or vaccines—are not considered in our models. Any of the aforementioned variables should be considered with care. Regarding vaccination, as this article was being written, there was no official site of the Mexican government where the total number of people vaccinated could be retrieved, only unofficial sites where progress is reported; however, these numbers are not trustworthy. Regarding the measures of lockdowns and use of face masks, given the federal nature of the country, the states of the republic are free to take or not take actions on their population; thus, to analyze any variable of this nature, an exhaustive study must be carried out on the states that took similar measures, and the effect of these measures must be evaluated by means of an effect or additive or multiplicative variable in the statistical model.
Due to the characteristics of the models used, the atypical characteristics of the COVID-19 pandemic, and the results and information derived from the monitoring strategy—well known in epidemiology science as SENTINEL Prevention Model—the following should be considered:
The data show a lot of variability from one day to another. Part of the noise caused by the dynamics of new case records was smoothed out by pre-processing the data (moving average) and the AR (1) component of the model; in future work, to further reduce the noise, we could add complex structures in the residuals, such as ARMA, ARIMA, or SARIMA. Furthermore, at the time of the data cut-off date, the second wave of the pandemic had not yet occurred. In the event that it is required to use this model for a second wave of the pandemic, care must be taken with the estimation of the components of the non-linear models (specifically the autoregressive and nonlinear component). That is, the nonlinear models used are not designed to model spikes in cases (second or third waves), and this will affect the estimation of parameters such as growth speed, maximum cases, and the autoregressive component, possibly yielding estimates that are implausible in the context of the pandemic (very high or low numbers) or unit root problems in the time series component.
As previously mentioned, the residual autocorrelation analysis was performed, looking for linear autocorrelations. Regarding the precision of the predictions, future research will focus on comparing our proposed models against machine learning and artificial intelligence models. In some cases, such as the state of Aguascalientes, it was necessary to add a second order autoregressive component.
The benefit of using a second-order autoregressive component is shown in Table 9 for the state of Aguascalientes (before 20 August 2020). For the Gompertz nonlinear growth model, we corrected for the dependency between observations by including a second-order autoregressive term in the model, which considerably improved the overall fit criteria. Thus, we recommend including a second-order autoregressive term when modeling COVID-19 case growth.
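
As an illustration of the residual checks discussed above, the following minimal Python sketch computes the lag-k sample autocorrelation and the Durbin–Watson statistic from a vector of residuals. It is not the authors' code; the function names and the toy residual series are assumptions made only for this example. A Durbin–Watson value near 2 suggests no first-order autocorrelation, values near 0 suggest strong positive autocorrelation, and values near 4 suggest strong negative autocorrelation.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag (lag >= 1)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    denom = sum(c * c for c in centered)
    numer = sum(centered[t] * centered[t - lag] for t in range(lag, n))
    return numer / denom


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numer = sum((residuals[t] - residuals[t - 1]) ** 2
                for t in range(1, len(residuals)))
    denom = sum(e * e for e in residuals)
    return numer / denom


# Toy residuals (hypothetical, not taken from the paper):
# a slowly drifting series is positively autocorrelated,
# an alternating series is negatively autocorrelated.
drifting = [1.0, 1.2, 1.1, 0.9, 0.5, 0.1, -0.4, -0.8, -1.1, -1.0, -0.9, -0.6]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(round(durbin_watson(drifting), 3))      # well below 2
print(round(durbin_watson(alternating), 3))   # well above 2
print(round(autocorrelation(drifting, 1), 3)) # clearly positive
```

In a workflow like the one summarized in Table 9, these quantities would be recomputed on the residuals after each autoregressive term is added, to check whether the remaining dependence has been absorbed.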

Author Contributions

Conceptualization, R.P.A.C.; methodology, R.P.A.C., H.d.-l.-T.-G.; writing—original draft preparation, methodology, coding, S.E.; writing—review and editing, S.E.; visualization, S.E., H.d.-l.-T.-G.; supervision, R.P.A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the Centro de Investigación en Matemáticas, A. C. and The University of Texas at Tyler. H.T.G. would like to thank the Catedras CONACyT fellowship program (project number 720) and the Sistema Nacional de Investigadores (548421). S.E. would like to acknowledge the Office of Research and Scholarship, the Robert R. Muntz Library, and the College of Education and Psychology.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available through the Gobierno de Mexico, Secretaria de Salud: https://www.gob.mx/salud/documentos/datos-abiertos-152127.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Villela, D.A. Discrete time forecasting of epidemics. Infect. Dis. Model. 2020, 5, 189–196.
  2. Ribeiro, M.H.D.M.; da Silva, R.G.; Mariani, V.C.; dos Santos Coelho, L. Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil. Chaos Solitons Fractals 2020, 135, 109853.
  3. De Pinho, S.Z.; de Carvalho, L.R.; Mischan, M.M.; Passos, J.R.d.S. Critical points on growth curves in autoregressive and mixed models. Sci. Agric. 2014, 71, 30–37.
  4. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2021.
  5. Gobierno de Mexico, Secretaria de Salud. Datos Abiertos Dirección General de Epidemiología. Available online: https://www.gob.mx/salud/documentos/datos-abiertos-152127 (accessed on 18 August 2021).
  6. Zhang, X.; Ma, R.; Wang, L. Predicting turning point, duration and attack rate of COVID-19 outbreaks in major Western countries. Chaos Solitons Fractals 2020, 135, 109829.
  7. Mazurek, J.; Nenickova, Z. Predicting the Number of Total COVID-19 Cases and Deaths in the USA by the Gompertz Curve; Elsevier: Amsterdam, The Netherlands, 2020; submitted.
  8. Batista, M. Estimation of the final size of the COVID-19 epidemic. medRxiv 2020.
  9. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis, Control, and Forecasting; John Wiley & Sons: Hoboken, NJ, USA, 2015.
  10. Lin, Q.; Zhao, S.; Gao, D.; Lou, Y.; Yang, S.; Musa, S.S.; Wang, M.H.; Cai, Y.; Wang, W.; Yang, L.; et al. A conceptual model for the coronavirus disease 2019 (COVID-19) outbreak in Wuhan, China with individual reaction and governmental action. Int. J. Infect. Dis. 2020, 93, 211–216.
  11. World Health Organization. WHO Coronavirus (COVID-19) Dashboard. 2020. Available online: https://covid19.who.int (accessed on 18 August 2021).
  12. Murphy, K.M.; Topel, R.H. Estimation and Inference in Two-Step Econometric Models. J. Bus. Econ. Stat. 1985, 3, 370–379.
  13. Oxford. Cochrane–Orcutt procedure. Available online: https://www.oxfordreference.com/view/10.1093/oi/authority.20110803095620898 (accessed on 18 August 2021).
  14. Calafiore, G.C.; Novara, C.; Possieri, C. A time-varying SIRD model for the COVID-19 contagion in Italy. Annu. Rev. Control 2020, 50, 361–372.
  15. Carli, R.; Cavone, G.; Epicoco, N.; Scarabaggio, P.; Dotoli, M. Model predictive control to mitigate the COVID-19 outbreak in a multi-region scenario. Annu. Rev. Control 2020, 50, 373–393.
  16. Coutin Marie, G. Utilización de modelos ARIMA para la vigilancia de enfermedades transmisibles. Rev. Cuba. Salud Pública 2007, 33.
  17. Benvenuto, D.; Giovanetti, M.; Vassallo, L.; Angeletti, S.; Ciccozzi, M. Application of the ARIMA model on the COVID-2019 epidemic dataset. Data Brief 2020, 29, 105340.
  18. Chakraborty, T.; Ghosh, I. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis. Chaos Solitons Fractals 2020, 135, 109850.
  19. Grzegorczyk, A. Application of the Richards function to the description of leaf area growth in maize (Zea mays L.). Acta Soc. Bot. Pol. 1994, 63, 5–7.
  20. Roosa, K.; Lee, Y.; Luo, R.; Kirpich, A.; Rothenberg, R.; Hyman, J.; Yan, P.; Chowell, G. Real-time forecasts of the COVID-19 epidemic in China from February 5th to February 24th, 2020. Infect. Dis. Model. 2020, 5, 256–263.
  21. Shen, C.Y. Logistic growth modelling of COVID-19 proliferation in China and its international implications. Int. J. Infect. Dis. 2020, 96, 582–589.
  22. Chimmula, V.K.R.; Zhang, L. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos Solitons Fractals 2020, 135, 109864.
  23. Menon, V.K. Prediction of number of cases expected and estimation of the final size of coronavirus epidemic in India using the logistic model and genetic algorithm. arXiv 2020, arXiv:2003.12017. preprint.
  24. Choi, S.; Jung, E.; Choi, B.Y.; Hur, Y.J.; Ki, M. High reproduction number of Middle East respiratory syndrome coronavirus in nosocomial outbreaks: Mathematical modelling in Saudi Arabia and South Korea. J. Hosp. Infect. 2018, 99, 162–168.
  25. Porter, T.; Kebreab, E.; Kuhi, H.D.; Lopez, S.; Strathe, A.B.; France, J. Flexible alternatives to the Gompertz equation for describing growth with age in turkey hens. Poult. Sci. 2010, 89, 371–378.
  26. Tariq, M.M.; Iqbal, F.; Eyduran, E.; Bajwa, M.A.; Huma, Z.E.; Waheed, A. Comparison of non-linear functions to describe the growth in Mengali sheep breed of Balochistan. Pak. J. Zool. 2013, 45, 661–665.
  27. Vuong, Q.H. Likelihood ratio tests for model selection and non-nested hypotheses. Econom. J. Econom. Soc. 1989, 57, 307–333.
  28. Zhou, J.; Tse, G.; Lee, S.; Liu, T.; Wu, W.K.; Zeng, D.; Wong, I.C.K.; Zhang, Q.; Cheung, B.M.Y. Identifying main and interaction effects of risk factors to predict intensive care admission in patients hospitalized with COVID-19: A retrospective cohort study in Hong Kong. medRxiv 2020.
  29. R Core Team. R: A Language and Environment for Statistical Computing (Version 3.0.2); R Foundation for Statistical Computing: Vienna, Austria, 2019.
  30. Statacorp. Stata Statistical Software: Release 15; StataCorp LP: College Station, TX, USA, 2017.
Figure 1. (a) Accumulated cases and (b) new daily cases of COVID-19 in Mexico.
Figure 2. Graphical abstract of the methodology.
Figure 3. Polynomial versus a nonlinear growth model.
Figure 4. Forecast for total accumulated cases at 21 days for the state of Campeche.
Figure 5. Forecast for total accumulated cases at 21 days for the state of Tamaulipas.
Figure 6. Forecast for total accumulated cases at 21 days for the state of Quintana Roo.
Figure 7. Forecast for total accumulated cases at 21 days at the national level in Mexico.
Figure 8. Analysis of autocorrelation in residuals.
Table 1. COVID-19 peak dates.

| | Peak Date |
|---|---|
| Mexico (national level) | 1 August |
| Campeche | 25 July |
| Quintana Roo | 23 July |
| Tamaulipas | 1 August |

Table 2. Vuong’s model selection test p-values corresponding to the H2A for the two-step proposed models.

| Date | Models (Model 1–Model 2) | Mexico | Campeche | Quintana Roo | Tamaulipas |
|---|---|---|---|---|---|
| 24 September | Polynomial–Logistic | 0.9650 | 1.0000 | 1.0000 | 0.9743 |
| 24 September | Polynomial–Gompertz | 0.9864 | 1.0000 | 0.9741 | 0.9803 |
| 24 September | Logistic–Gompertz | 0.9610 | 0.0000 | 0.0000 | 0.0957 |
| 20 days after the peak day | Polynomial–Logistic | 0.0637 | 0.9748 | 0.9997 | 0.9277 |
| 20 days after the peak day | Polynomial–Gompertz | 0.5343 | 0.9912 | 0.9994 | 0.9309 |
| 20 days after the peak day | Logistic–Gompertz | 0.9799 | 0.0777 | 0.0003 | 0.1960 |
| Peak day | Polynomial–Logistic | 0.3932 | 0.6168 | 0.0149 | 0.8275 |
| Peak day | Polynomial–Gompertz | 0.8866 | 0.6665 | 0.0003 | 0.7908 |
| Peak day | Logistic–Gompertz | 0.8212 | 0.4220 | 0.4723 | 0.1727 |
| 10 days before peak day | Polynomial–Logistic | 0.0682 | 0.8993 | 0.1991 | 0.3957 |
| 10 days before peak day | Polynomial–Gompertz | 0.9541 | 0.6958 | 0.0500 | 0.7888 |
| 10 days before peak day | Logistic–Gompertz | 0.9948 | 0.0560 | 0.2450 | 0.6750 |
| 20 days before peak day | Polynomial–Logistic | 0.2028 | 0.8738 | 0.9384 | 0.9199 |
| 20 days before peak day | Polynomial–Gompertz | 0.8593 | 0.9521 | 0.9321 | 0.8577 |
| 20 days before peak day | Logistic–Gompertz | 0.9641 | 0.7659 | 0.1079 | 0.0749 |
| 30 days before peak day | Polynomial–Logistic | 0.5677 | 0.8874 | 0.3875 | 0.9423 |
| 30 days before peak day | Polynomial–Gompertz | 0.9256 | 0.8919 | 0.6584 | 0.7892 |
| 30 days before peak day | Logistic–Gompertz | 0.9737 | 0.0730 | 0.8777 | 0.3109 |

H2A: Model 1 fits better than Model 2. If p < 0.05, Model 1 has the better fit; if p > 0.95, Model 2 has the better fit. If 0.05 < p < 0.95, both models fit equally well.
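
The decision rule in the footnote to Table 2 can be written as a small function. The sketch below is only an illustration of that rule; the function name and the `alpha` argument are assumptions, and it does not compute Vuong's statistic itself. The p-values are copied from the Mexico column for 24 September.

```python
def vuong_decision(p_value, model_1, model_2, alpha=0.05):
    """Apply the decision rule stated in the footnote of Table 2.

    p < alpha      -> Model 1 fits better
    p > 1 - alpha  -> Model 2 fits better
    otherwise      -> neither model is preferred
    """
    if p_value < alpha:
        return model_1
    if p_value > 1 - alpha:
        return model_2
    return "either"


# p-values for Mexico on 24 September, copied from Table 2.
comparisons = [
    ("Polynomial", "Logistic", 0.9650),
    ("Polynomial", "Gompertz", 0.9864),
    ("Logistic", "Gompertz", 0.9610),
]
for model_1, model_2, p in comparisons:
    print(f"{model_1} vs {model_2}: {vuong_decision(p, model_1, model_2)}")
```

For these three comparisons the rule selects Logistic, Gompertz, and Gompertz, respectively, which agrees with the Gompertz entry for Mexico on 24 September in Table 3.
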
Table 3. Model comparison summary according to Vuong’s test.

| Date | Mexico | Campeche | Quintana Roo | Tamaulipas |
|---|---|---|---|---|
| 24 September | Gompertz | Logistic | Logistic | Gompertz, logistic |
| 20 days after the peak day | Any | Gompertz, logistic | Logistic | Any |
| Peak day | Any | Any | Polynomial | Any |
| 10 days before peak day | Gompertz | Any | Any | Any |
| 20 days before peak day | Any | Any | Any | Any |
| 30 days before peak day | Any | Any | Any | Any |

Table 4. Root mean square error (RMSE) values for the four case studies for the three models examined.

| Date | Model | Mexico | Campeche | Quintana Roo | Tamaulipas |
|---|---|---|---|---|---|
| 24 September | Polynomial | 631.68 | 17.68 | 27.29 | 120.98 |
| 24 September | Logistic | 608.40 | 13.15 | 24.76 | 113.62 |
| 24 September | Gompertz | 593.07 | 14.75 | 26.70 | 118.01 |
| 20 days after the peak day | Polynomial | 557.91 | 15.13 | 27.21 | 127.13 |
| 20 days after the peak day | Logistic | 577.16 | 13.81 | 25.19 | 121.63 |
| 20 days after the peak day | Gompertz | 556.36 | 14.37 | 26.25 | 123.51 |
| Peak day | Polynomial | 495.67 | 11.00 | 21.93 | 88.45 |
| Peak day | Logistic | 498.42 | 10.87 | 22.44 | 81.66 |
| Peak day | Gompertz | 482.74 | 10.92 | 22.45 | 85.92 |
| 10 days before peak day | Polynomial | 427.87 | 8.07 | 16.01 | 50.46 |
| 10 days before peak day | Logistic | 436.88 | 7.77 | 16.20 | 50.73 |
| 10 days before peak day | Gompertz | 406.15 | 8.02 | 16.26 | 50.20 |
| 20 days before peak day | Polynomial | 386.63 | 6.33 | 13.79 | 42.74 |
| 20 days before peak day | Logistic | 392.59 | 6.27 | 13.46 | 41.11 |
| 20 days before peak day | Gompertz | 374.12 | 6.25 | 13.60 | 42.39 |
| 30 days before peak day | Polynomial | 326.50 | 5.72 | 11.45 | 30.72 |
| 30 days before peak day | Logistic | 325.06 | 5.43 | 11.55 | 30.21 |
| 30 days before peak day | Gompertz | 303.46 | 5.51 | 11.37 | 30.48 |

Smaller RMSE values indicate better fit.
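
For reference, the RMSE criterion used in Table 4 can be computed as in the following minimal sketch. It assumes the usual definition with the number of observations as the divisor (the source does not state whether a degrees-of-freedom correction was applied), and the data are hypothetical values chosen only to show the calculation.

```python
import math


def rmse(observed, fitted):
    """Root mean square error between observed and fitted values."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    squared_errors = [(o - f) ** 2 for o, f in zip(observed, fitted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical daily counts and fitted values, for illustration only.
observed = [100, 120, 150, 170]
fitted = [98, 123, 149, 174]
print(round(rmse(observed, fitted), 3))  # 2.739
```

Because the RMSE is expressed in the units of the response (cases), values are comparable across models within a column of Table 4 but not across columns, since the states differ greatly in case counts.
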
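Table 5. Daily new COVID-19 cases predicted from total cumulative cases in Mexico.

| Date | Cases | New cases | Polynomial: accumulated (% prediction error) | Polynomial: new cases | Logistic: accumulated (% prediction error) | Logistic: new cases | Gompertz: accumulated (% prediction error) | Gompertz: new cases |
|---|---|---|---|---|---|---|---|---|
| 9/24/20 | 715,457 | | 3.40% | | 3.60% | | −0.11% | |
| 9/25/20 | 720,858 | 5401 | 3.48% | 6001 | −3.72% | 2502 | −0.16% | 4285 |
| 9/26/20 | 726,431 | 5573 | 3.50% | 5938 | −4.16% | 2407 | −0.35% | 4227 |
| 9/27/20 | 730,317 | 3886 | 3.74% | 5924 | −4.37% | 2336 | −0.30% | 4182 |
| 9/28/20 | 733,717 | 3400 | 4.04% | 5910 | −4.51% | 2267 | −0.20% | 4137 |
| 9/29/20 | 738,163 | 4446 | 4.19% | 5896 | −4.82% | 2200 | −0.25% | 4092 |
| 9/30/20 | 743,216 | 5053 | 4.27% | 5881 | −5.22% | 2134 | −0.38% | 4047 |
| 10/1/20 | 748,315 | 5099 | 4.34% | 5866 | −5.63% | 2070 | −0.53% | 4003 |
| 10/2/20 | 753,090 | 4775 | 4.44% | 5850 | −6.00% | 2007 | −0.63% | 3958 |
| 10/3/20 | 757,953 | 4863 | 4.53% | 5833 | −6.40% | 1946 | −0.76% | 3914 |
| 10/4/20 | 761,665 | 3712 | 4.76% | 5816 | −6.63% | 1886 | −0.73% | 3870 |
| 10/5/20 | 765,082 | 3417 | 5.02% | 5798 | −6.84% | 1828 | −0.67% | 3826 |
| 10/6/20 | 769,558 | 4476 | 5.15% | 5780 | −7.20% | 1772 | −0.76% | 3782 |
| 10/7/20 | 774,020 | 4462 | 5.27% | 5762 | −7.56% | 1716 | −0.85% | 3739 |
| 10/8/20 | 779,127 | 5107 | 5.31% | 5742 | −8.02% | 1663 | −1.03% | 3696 |
| 10/9/20 | 784,580 | 5453 | 5.31% | 5723 | −8.54% | 1610 | −1.26% | 3652 |
| 10/10/20 | 789,779 | 5199 | 5.33% | 5702 | −9.02% | 1559 | −1.46% | 3610 |
| 10/11/20 | 792,920 | 3141 | 5.60% | 5682 | −9.23% | 1510 | −1.40% | 3567 |
| 10/12/20 | 796,399 | 3479 | 5.82% | 5660 | −9.49% | 1462 | −1.38% | 3524 |
| 10/13/20 | 800,474 | 4075 | 5.96% | 5639 | −9.83% | 1415 | −1.45% | 3482 |
| 10/14/20 | 805,512 | 5038 | 5.99% | 5616 | −10.32% | 1369 | −1.65% | 3440 |
| 10/15/20 | 810,883 | 5371 | 5.98% | 5593 | −10.85% | 1325 | −1.89% | 3399 |
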
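
The "New Cases" column in Table 5 is the day-to-day difference of the accumulated cases, and the percentage columns compare each model's accumulated prediction with the observed total. The sketch below shows both calculations. The sign convention (positive when the model overpredicts) is an assumption inferred from the table, where the polynomial model predicts more new cases than observed and has positive errors; the function names and the predicted total in the last line are hypothetical.

```python
def daily_new_cases(cumulative):
    """First differences of a cumulative series."""
    return [cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))]


def percent_prediction_error(predicted, observed):
    """Signed error of a prediction relative to the observed value, in percent.

    Assumed convention: positive means the model overpredicts.
    """
    return 100.0 * (predicted - observed) / observed


# Observed accumulated cases for Mexico, 24-27 September 2020 (Table 5).
observed_cumulative = [715457, 720858, 726431, 730317]
print(daily_new_cases(observed_cumulative))  # [5401, 5573, 3886]

# Hypothetical predicted total, used only to show the sign convention.
print(round(percent_prediction_error(750000, 720858), 2))
```
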
Table 6. Daily new COVID-19 cases predicted from total cumulative cases in the state of Campeche.

| Date | Cases | New cases | Polynomial: accumulated (% prediction error) | Polynomial: new cases | Logistic: accumulated (% prediction error) | Logistic: new cases | Gompertz: accumulated (% prediction error) | Gompertz: new cases |
|---|---|---|---|---|---|---|---|---|
| 9/24/20 | 6027 | | 2.64% | | −0.40% | | 2.90% | |
| 9/25/20 | 6033 | 6 | 2.57% | 5 | −0.48% | 5 | 2.92% | 15 |
| 9/26/20 | 6046 | 13 | 2.41% | 3 | −0.60% | 5 | 2.93% | 14 |
| 9/27/20 | 6056 | 10 | 2.27% | 1 | −0.68% | 5 | 2.99% | 14 |
| 9/28/20 | 6072 | 16 | 2.01% | 0 | −0.87% | 5 | 2.95% | 14 |
| 9/29/20 | 6076 | 4 | 1.92% | −2 * | −0.86% | 5 | 3.09% | 13 |
| 9/30/20 | 6089 | 13 | 1.66% | −3 * | −1.00% | 4 | 3.08% | 13 |
| 10/1/20 | 6106 | 17 | 1.31% | −5 * | −1.21% | 4 | 3.00% | 13 |
| 10/2/20 | 6116 | 10 | 1.05% | −6 * | −1.31% | 4 | 3.03% | 12 |
| 10/3/20 | 6128 | 12 | 0.73% | −8 * | −1.45% | 4 | 3.02% | 12 |
| 10/4/20 | 6143 | 15 | 0.33% | −10 * | −1.64% | 4 | 2.96% | 11 |
| 10/5/20 | 6155 | 12 | −0.04% | −11 * | −1.78% | 3 | 2.94% | 11 |
| 10/6/20 | 6165 | 10 | −0.41% | −13 * | −1.89% | 3 | 2.95% | 11 |
| 10/7/20 | 6173 | 8 | −0.78% | −15 * | −1.97% | 3 | 2.99% | 11 |
| 10/8/20 | 6182 | 9 | −1.20% | −16 * | −2.07% | 3 | 3.00% | 10 |
| 10/9/20 | 6191 | 9 | −1.64% | −18 * | −2.17% | 3 | 3.01% | 10 |
| 10/10/20 | 6194 | 3 | −2.02% | −20 * | −2.18% | 3 | 3.11% | 10 |
| 10/11/20 | 6209 | 15 | −2.63% | −21 * | −2.39% | 2 | 3.02% | 9 |
| 10/12/20 | 6224 | 15 | −3.28% | −23 * | −2.59% | 2 | 2.92% | 9 |
| 10/13/20 | 6230 | 6 | −3.81% | −25 * | −2.66% | 2 | 2.96% | 9 |
| 10/14/20 | 6237 | 7 | −4.39% | −27 * | −2.74% | 2 | 2.98% | 9 |
| 10/15/20 | 6246 | 9 | −5.05% | −29 * | −2.85% | 2 | 2.97% | 8 |

* The polynomial model was intended only for Stage I; beyond that stage it predicted a decrease in the total number of accumulated cases that did not match reality, as the negative numbers show. The table above reflects this disadvantage of this type of model.
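
The behaviour described in the footnote to Table 6 follows from the shape of the curves: a fitted polynomial can turn downward once it is extrapolated past its maximum, whereas a sigmoidal growth curve is non-decreasing. The sketch below illustrates this with a second-degree polynomial and a Gompertz curve. All coefficients are hypothetical and are not estimates from the paper; the Gompertz form k·exp(−b·exp(−c·t)) is one common parametrization and may differ from the one used by the authors, as may the degree of the polynomial.

```python
import math


def quadratic(t, a, b, c):
    """Second-degree polynomial a + b*t + c*t**2."""
    return a + b * t + c * t * t


def gompertz(t, k, b, c):
    """Gompertz curve k * exp(-b * exp(-c * t)); one common parametrization."""
    return k * math.exp(-b * math.exp(-c * t))


def increments(values):
    """Day-to-day changes of a cumulative series."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]


days = range(0, 200)
# Hypothetical coefficients, not estimates from the paper.
poly_cum = [quadratic(t, 0.0, 100.0, -0.4) for t in days]  # maximum at t = 125
gomp_cum = [gompertz(t, 6500.0, 5.0, 0.03) for t in days]

poly_new = increments(poly_cum)
gomp_new = increments(gomp_cum)

# First day on which the polynomial implies a negative number of new cases.
first_negative_day = next(t for t, d in enumerate(poly_new, start=1) if d < 0)
print(first_negative_day)  # 126
print(min(gomp_new) > 0)   # True
```
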
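Table 7. Daily new COVID-19 cases predicted from total cumulative cases in the state of Quintana Roo.

| Date | Cases | New cases | Polynomial: accumulated (% prediction error) | Polynomial: new cases | Logistic: accumulated (% prediction error) | Logistic: new cases | Gompertz: accumulated (% prediction error) | Gompertz: new cases |
|---|---|---|---|---|---|---|---|---|
| 9/24/20 | 11,455 | | 5.13% | | 0.30% | | 3.80% | |
| 9/25/20 | 11,500 | 45 | 5.03% | 77 | 0.02% | 39 | 3.96% | 66 |
| 9/26/20 | 11,583 | 83 | 4.93% | 75 | −0.41% | 34 | 3.79% | 65 |
| 9/27/20 | 11,621 | 38 | 5.19% | 74 | −0.46% | 32 | 3.99% | 65 |
| 9/28/20 | 11,653 | 32 | 5.49% | 73 | −0.46% | 31 | 4.23% | 64 |
| 9/29/20 | 11,693 | 40 | 5.72% | 72 | −0.54% | 31 | 4.40% | 63 |
| 9/30/20 | 11,742 | 49 | 5.87% | 72 | −0.71% | 30 | 4.49% | 63 |
| 10/1/20 | 11,832 | 90 | 5.68% | 71 | −1.23% | 29 | 4.24% | 62 |
| 10/2/20 | 11,888 | 56 | 5.76% | 70 | −1.47% | 28 | 4.26% | 61 |
| 10/3/20 | 11,956 | 68 | 5.74% | 69 | −1.82% | 27 | 4.18% | 61 |
| 10/4/20 | 12,013 | 57 | 5.79% | 68 | −2.08% | 26 | 4.18% | 60 |
| 10/5/20 | 12,048 | 35 | 6.01% | 67 | −2.16% | 25 | 4.36% | 59 |
| 10/6/20 | 12,055 | 7 | 6.44% | 66 | −2.01% | 24 | 4.74% | 58 |
| 10/7/20 | 12,146 | 91 | 6.21% | 65 | −2.57% | 24 | 4.46% | 58 |
| 10/8/20 | 12,172 | 26 | 6.48% | 65 | −2.59% | 23 | 4.68% | 57 |
| 10/9/20 | 12,179 | 7 | 6.88% | 64 | −2.46% | 22 | 5.05% | 56 |
| 10/10/20 | 12,189 | 10 | 7.25% | 63 | −2.36% | 21 | 5.38% | 56 |
| 10/11/20 | 12,241 | 52 | 7.29% | 62 | −2.62% | 21 | 5.38% | 55 |
| 10/12/20 | 12,347 | 106 | 6.91% | 61 | −3.34% | 20 | 4.96% | 54 |
| 10/13/20 | 12,366 | 19 | 7.19% | 60 | −3.33% | 19 | 5.21% | 54 |
| 10/14/20 | 12,405 | 39 | 7.30% | 59 | −3.50% | 19 | 5.30% | 53 |
| 10/15/20 | 12,447 | 42 | 7.39% | 58 | −3.69% | 18 | 5.35% | 52 |
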
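Table 8. Daily new COVID-19 cases predicted from total cumulative cases in the state of Tamaulipas.

| Date | Cases | New cases | Polynomial: accumulated (% prediction error) | Polynomial: new cases | Logistic: accumulated (% prediction error) | Logistic: new cases | Gompertz: accumulated (% prediction error) | Gompertz: new cases |
|---|---|---|---|---|---|---|---|---|
| 9/24/20 | 28,159 | | 7.60% | | −1.20% | | 2.30% | |
| 9/25/20 | 28,320 | 161 | 7.85% | 295 | −1.30% | 85 | 2.35% | 165 |
| 9/26/20 | 28,454 | 134 | 8.29% | 294 | −1.48% | 82 | 2.44% | 162 |
| 9/27/20 | 28,534 | 80 | 8.90% | 294 | −1.48% | 79 | 2.69% | 159 |
| 9/28/20 | 28,606 | 72 | 9.52% | 294 | −1.46% | 76 | 2.96% | 156 |
| 9/29/20 | 28,764 | 158 | 9.86% | 293 | −1.76% | 73 | 2.93% | 153 |
| 9/30/20 | 28,847 | 83 | 10.42% | 293 | −1.80% | 70 | 3.14% | 151 |
| 10/1/20 | 28,946 | 99 | 10.92% | 293 | −1.91% | 67 | 3.29% | 148 |
| 10/2/20 | 29,085 | 139 | 11.29% | 292 | −2.17% | 64 | 3.29% | 145 |
| 10/3/20 | 29,224 | 139 | 11.65% | 292 | −2.44% | 61 | 3.29% | 143 |
| 10/4/20 | 29,266 | 42 | 12.30% | 292 | −2.37% | 59 | 3.60% | 140 |
| 10/5/20 | 29,319 | 53 | 12.90% | 291 | −2.36% | 56 | 3.86% | 138 |
| 10/6/20 | 29,364 | 45 | 13.51% | 291 | −2.32% | 54 | 4.14% | 135 |
| 10/7/20 | 29,496 | 132 | 13.86% | 290 | −2.60% | 52 | 4.12% | 133 |
| 10/8/20 | 29,617 | 121 | 14.23% | 289 | −2.84% | 49 | 4.13% | 130 |
| 10/9/20 | 29,719 | 102 | 14.65% | 289 | −3.03% | 47 | 4.20% | 128 |
| 10/10/20 | 29,813 | 94 | 15.08% | 288 | −3.19% | 45 | 4.28% | 125 |
| 10/11/20 | 29,873 | 60 | 15.60% | 288 | −3.24% | 43 | 4.47% | 123 |
| 10/12/20 | 29,877 | 4 | 16.27% | 287 | −3.11% | 41 | 4.82% | 121 |
| 10/13/20 | 30,073 | 196 | 16.39% | 286 | −3.64% | 40 | 4.56% | 118 |
| 10/14/20 | 30,104 | 31 | 16.97% | 286 | −3.62% | 38 | 4.81% | 116 |
| 10/15/20 | 30,224 | 120 | 17.28% | 285 | −3.90% | 36 | 4.78% | 114 |
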
Table 9. Comparison of models by goodness-of-fit criteria for the state of Aguascalientes.

| Model Type | Durbin–Watson | AIC | BIC |
|---|---|---|---|
| Gompertz | (3, 139) = 0.0569 | 1509.95 | 1519.08 |
| Gompertz + AR(1) | (4, 139) = 1.1040 | 968.83 | 980.56 |
| Gompertz + AR(1) + AR(2) | (5, 139) = 2.2986 * | 937.6 | 952.27 |

* p-value > 0.05 taken from Durbin–Watson statistical table values.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
