A Two-Step Polynomial and Nonlinear Growth Approach for Modeling COVID-19 Cases in Mexico

: Since December 2019, the novel coronavirus (SARS-CoV-2) and its associated illness COVID-19 have rapidly spread worldwide. The Mexican government has implemented public safety measures to minimize the spread of the virus. In this paper, we used statistical models in two stages to estimate the total number of coronavirus (COVID-19) cases per day at the state and national levels in Mexico. In this paper, we propose two types of models. First, a polynomial model of the growth for the ﬁrst part of the outbreak until the inﬂection point of the pandemic curve and then a second nonlinear growth model used to estimate the middle and the end of the outbreak. Model selection was performed using Vuong’s test. The proposed models showed overall ﬁt similar to predictive models (e.g., time series and machine learning); however, the interpretation of parameters is simpler for decisionmakers, and the residuals follow the expected distribution when ﬁtting the models without autocorrelation being an issue.


Introduction
The world is currently experiencing a pandemic caused by the novel coronavirus, formally named COVID-19 by the World Health Organization (WHO). Development of a vaccine and antiviral drugs to treat COVID-19 is still ongoing, resulting in hospitalization and intensive care unit management as the only option in treating COVID-19. Thus, there is a dire need for research on modeling the outbreak of COVID-19 to help officials in their decision-making processes regarding interventions and allocation of resources [1]. At the time this manuscript was being written, the pandemic was ongoing, and most of the epidemiological models developed focused on short-term predictions, identifying the daily peak of COVID-19 cases, predicting the duration of the pandemic, and estimating the possible impact of the measures implemented for minimizing exposure to the virus and decrease the fatality rate [2][3][4][5][6][7][8][9][10].
As of 24 September 2020, the cumulative number of COVID-19 cases in Mexico was reported as 715,457 [5], and 32,245,122 cases were reported worldwide [11]. Thus, the main objective of this paper was to model the total number of COVID-19 cases per day at the national and the state level in Mexico while simultaneously providing straightforward information to decisionmakers; additionally, we sought to determine which model provides the most stable short-term predictions. Figure 1 shows the accumulated cases and new cases at the national level in Mexico. This figure shows the first wave peak of the pandemic until the data cut-off of 24 September. Until the 24 September date, only a single wave of infections had been observed (on 1 August). The models developed in this research facilitate the obtaining of information to support decisionmakers in the strategic planning activities of the Mexican states, metropolitan areas, municipalities, or cities with high In this work we refer to 1-2-3 models as two-step models due to the method used to estimate their parameters [12]. The method is performed in two steps that will combine information from time series models with non-linear growth models and polynomial models. The two-step estimation is a process, also known as the Cochrane-Orcutt procedure, which is defined as: "A two-step estimation of a linear regression model with first-order serial correlation in the errors. In the first step the first-order autocorrelation coefficient is estimated using the ordinary least squares residuals from the main regression equation. In the second step this estimate is used to rescale the variables so that the regression in terms of rescaled variables has no serial correlation in the errors. This is an example of feasible generalized least squares estimation" [13].
Several machine learning (ML) and artificial intelligence (AI) models have demonstrated acceptable performance in the modeling of the COVID-19 pandemic; our proposed methodology meets this expectation in addition to a simple estimation of the parameters. Unlike the susceptible-infected-recovered-deceased (SIRD) models, the proposed models do not require setting or assuming the value of any parameter to obtain the estimates [14,15]. Finally, another advantage of our models is the interpretability of their parameters, that is, estimates of some parameters directly linked to the pandemic can be obtained. In this work we refer to 1-2-3 models as two-step models due to the method used to estimate their parameters [12]. The method is performed in two steps that will combine information from time series models with non-linear growth models and polynomial models. The two-step estimation is a process, also known as the Cochrane-Orcutt procedure, which is defined as: "A two-step estimation of a linear regression model with first-order serial correlation in the errors. In the first step the first-order autocorrelation coefficient is estimated using the ordinary least squares residuals from the main regression equation. In the second step this estimate is used to rescale the variables so that the regression in terms of rescaled variables has no serial correlation in the errors. This is an example of feasible generalized least squares estimation" [13].
Several machine learning (ML) and artificial intelligence (AI) models have demonstrated acceptable performance in the modeling of the COVID-19 pandemic; our proposed methodology meets this expectation in addition to a simple estimation of the parameters. Unlike the susceptible-infected-recovered-deceased (SIRD) models, the proposed models do not require setting or assuming the value of any parameter to obtain the estimates [14,15]. Finally, another advantage of our models is the interpretability of their parameters, that is, estimates of some parameters directly linked to the pandemic can be obtained.
The article is structured as follows: In Section 2, we summarize the most relevant literature regarding the modeling of the COVID-19 pandemic. Next, in Section 3, the data used are presented and the proposed methodology is described. Section 4 shows the main results of the investigation. Finally, in Section 5, the main conclusions are presented, and the limitations of this research are discussed.

Literature Review
Research exists with data-driven approaches such as autoregressive (AR) and autoregressive integrated moving average (ARIMA), ranging from simple models (exponential smoothing) to more complex models such as ARIMAX, ARCH, GARCH, and ARFIRMA [2,7,8,10,16,17]. For example, we used an ARIMA model with data compiled by Johns Hopkins University to predict models for the daily confirmed cases in countries where the pandemic was peaking and to predict and anticipate the resources of the healthcare systems [18]. Unfortunately, these data-driven models fail to fit the data and often lack accuracy [6,19]. Additionally, the parameters of these models cannot be interpreted according to the reality of the pandemic. This interpretability barrier causes statisticians and officials to make their decisions on the basis of predictive models instead of the peak of the pandemic or the growth of the pandemic. A useful model for policy and public health decisionmakers during the COVID-19 pandemic would be a model that, in addition to obtaining accurate predictions, provides insights on the evolution or current behavior of the pandemic. Another approach is real-time forecasting using a generalized logistic growth model. This method has been previously used in China to generate short-term forecasting of COVID-19 cases [20,21], as well as with data from Canada, France, India, South Korea, and the UK to forecast daily cases [22,23]. These models are incredibly useful in that they provide information on the current state of the pandemic. However, in this study, we had two aims regarding logistic growth models: (1) to demonstrate that their assumption of independence of and (2) that their modeling performance at the earlier stages of the pandemic is not optimal but can be improved by the incorporation of an autoregressive component [4,9].
The models used in this paper are based on statistical linear models, classic time series, and restricted growth-called limited growth or nonlinear growth models [1][2][3]13,24]as well as real-time forecasting using generalized logistic growth model [4,25]. In this paper, we propose estimations in two stages of the pandemic utilizing polynomial and nonlinear growth functions while incorporating an autoregressive component with the purpose of meeting the assumption of independence of residuals. First, we propose using a polynomial function estimated using the Prais-Winsten methodology to estimate the first stage of the pandemic (when exponential growth of COVID-19 cases was observed). Our rationale for choosing a third-degree polynomial model was the following: under certain scenarios: it can be converted into an increasing monotonic function, which is essential when modeling the total of accumulated cases; the degree is three since it shows simplicity with respect to higher-order polynomials; and when its behavior is observed, it has the shape of an "S".
Next, in the second stage of the pandemic (when the peak of COVID-19 cases was reached), we propose utilization of nonlinear growth functions, logistic, and Gompertz in order to predict the total cumulative number of cases and the growth rate of the spread of COVID-19. In each of the models estimated before and after the peak of the pandemic, at the second stage of modeling, we added an autoregressive component of order one (AR (1)) to compare the results to the models that do not account for the violation of independence of residuals. This approach has been successfully used to model plant and animal growth where measuring the same unit can lead to violation of independence of residuals [9,25,26]. Next, we selected the best estimate equation to model the pandemic by using Vuong's test criterion [27]. We anticipated the proposed model to have good performance similar to neural networks (NN) or support vector machines (SVM). A disadvantage of NN is that it is difficult to generate a day-to-day prediction in addition to finding the growth rate and finding the maximum number of cases. These are not a problem for the functions we propose. Furthermore, artificial intelligence (AI) has been used to identify, track, and forecast COVID-19 cases. However, AI models are difficult to interpret-the process by which they arrive at a decision is often referred to as a "black box" due to the complexity in understanding how AI models arrive at certain conclusions [28]. The complex interpretation may create a barrier for decisionmakers looking for straightforward solutions in the middle of a pandemic. Moreover, we anticipate our proposed modeling of the pandemic will meet the assumption of independence of residuals. In contrast to NN, our proposed model can provide predictions and will facilitate interpreting parameters such as the highest number of infected individuals, growth speed, the initial number of infected individuals, and the autoregressive parameter to measure the lag in reporting the new daily cases of infections.
In summary, we present these two models (polynomial and nonlinear growth models) because of ease of interpretation for non-statistician decisionmakers, their stable and useful predictions while accounting for the autocorrelation of the data, and their insights regarding the current state of the pandemic. Therefore, the objectives of this paper were to: (1) specify the polynomial function and nonlinear growth models (logistic and Gompertz) that include an autoregression component for dealing with the autocorrelated observations in the growth data in two stages: before and after the inflection point of the pandemic is reached, and (2) compare the different polynomial and nonlinear growth functions in their ability to describe the number of cases of the COVID-19 pandemic.

Dataset
The COVID-19 data used in this research was obtained from the publicly available data of the Mexican Secretaria de Salud Federal that contained the number of cases confirmed from 27 February to 24 September 2020. The dataset also included the number of recovered patients and fatalities and can be downloaded in .csv format from the government's website (https://www.gob.mx/salud/documentos/datos-abiertos-152127, accessed on 27 August 2021) [5]. Data to test the models focused on the national level and three Mexican states: Campeche, Quintana Roo, and Tamaulipas. To demonstrate the benefit of utilizing the autoregressive term, we used data from the state of Aguascalientes. The time series beginning point (t = 1) established was the day in which a positive COVID-19 case had not been reported. For example, in the state of Tamaulipas, the first positive COVID-19 case was 17 March 2020, and thus 16 March 2020 was considered (t = 1). The aforementioned data are noisy due to the dynamics of registration of new cases, that is, there is a known lag in the registration of new cases that depends on the site in the country. Another source of noise linked to the dynamics of registration is with respect to the cases detected on weekends, some of which are not reported until the following Monday. Therefore, and in order to reduce the effect of noise caused by the dynamics of registering new cases, we pre-processed the data by means of a moving average of two observations, that is, Y i = (X i + X i−1 )/2, where X i and X i−1 correspond to the total number of COVID-19 cases reported up to time i and i − 1, respectively. For the case of i = 1, Y 1 = X 1 .

Model Selection
The current study proposes the utilization of a two-stage approach: First, Stage I model was fit throughout the pandemic and before reaching the peak of cases (before the inflection point), and in the second stage, once the peak of cases was reached, a different type of model was used. We utilized Vuong's test for model selection [27]. Figure 2 shows a graphical abstract of the proposed methodology for this study.

Stage I: Before the Inflection Point
For the current study, we examined a variety of models that successfully model the behavior of the pandemic (e.g., the maximum number of cases, growth rate) while simultaneously meeting the assumption of independence of residuals in the data. Using the Akaike information criteria (AIC) and the root mean square error (RMSE) criterions, we found that the models that best described the total accumulated cases of COVID-19 in the four data examples utilized were: Polynomial model of order three with an autoregressive error component of order one Equation (1), known as Prais-Winsten or Cochrane-Orcutt estimation, and b.
Nonlinear growth models, including logistic and Gompertz, which had the best fit of the models examined (Equations (2) and (3)).
Mathematics 2021, 9, x FOR PEER REVIEW 5 of 17 Figure 2. Graphical abstract of the methodology.

Stage I: Before the Inflection Point
For the current study, we examined a variety of models that successfully model the behavior of the pandemic (e.g., the maximum number of cases, growth rate) while simultaneously meeting the assumption of independence of residuals in the data. Using the Akaike information criteria (AIC) and the root mean square error (RMSE) criterions, we found that the models that best described the total accumulated cases of COVID-19 in the four data examples utilized were: a. Polynomial model of order three with an autoregressive error component of order one Equation (1), known as Prais-Winsten or Cochrane-Orcutt estimation, and b. Nonlinear growth models, including logistic and Gompertz, which had the best fit of the models examined (Equations (2) and (3)).

Stage II: After the Inflection Point
To examine which models were the best once the peak of positive COVID-19 cases had been reached, we focused on the same three state data that showed good fit up to the peak of the pandemic. We sampled two data periods after the inflection point: 20 days after the inflection point (19 October) and the day before we began writing the results of this paper (24 September 2020). The results obtained from this stage of the pandemic were compared with those obtained before the inflection point of the pandemic. We hypothesized that one of the nonlinear growth models would have a better fit for the stage before the peak of cases.

Statistical Procedures
The complete two-step polynomial model used in this work has the form: where is the number of total positive COVID-19 cases reported at time ; the coefficients 0 , 1 , 2 , 3 were the parameters of the polynomial component of the model; is the coefficient of the autoregressive component −1 obtained in the second step of the estimation; and is a random error term, with = 0, 1, 2, … , . The value = 1 is selected as equal to the day of the first positive case.
The two-step logistic model used the following form:

Stage II: After the Inflection Point
To examine which models were the best once the peak of positive COVID-19 cases had been reached, we focused on the same three state data that showed good fit up to the peak of the pandemic. We sampled two data periods after the inflection point: 20 days after the inflection point (19 October) and the day before we began writing the results of this paper (24 September 2020). The results obtained from this stage of the pandemic were compared with those obtained before the inflection point of the pandemic. We hypothesized that one of the nonlinear growth models would have a better fit for the stage before the peak of cases.

Statistical Procedures
The complete two-step polynomial model used in this work has the form: where Y t is the number of total positive COVID-19 cases reported at time t; the coefficients α 0 , α 1 , α 2 , α 3 were the parameters of the polynomial component of the model; ρ is the coefficient of the autoregressive component u t−1 obtained in the second step of the estimation; and ε t is a random error term, with t = 0, 1, 2, . . . , i. The value t = 1 is selected as equal to the day of the first positive case. The two-step logistic model used the following form: where Y t is the number of total positive COVID-19 cases reported at time t; the coefficients β 1 , β 2 , β 3 correspond to the logistic component; ρ is the coefficient of the autoregressive component u t−1 obtained in the second step of the estimation; and ε t is a random error term, with t = 0, 1, 2, . . . , i. β 1 models the highest number of infected, β 2 growth speed, β 3 is the initial number of infected individuals, and ρ models the autoregressive process to incorporate the delay in the process of reporting the new cases of infection as well as the inherent pandemic dynamic. Recall that the value t = 1 is set to the day of the first positive case observed. The two-step Gompertz model used the following form: where Y t is the number of total positive COVID-19 cases reported at time t; the coefficients β 1 , β 2 , β 3 correspond to the Gompertz component; ρ is the coefficient of the autoregressive component u t−1 obtained in the second step of the estimation; and ε t is a random error term, with t = 0, 1, 2, . . . , i. The β 1 parameter estimates the top number of infected, β 2 is growth speed, β 3 is the initial number of infected individuals, and ρ models the autoregressive component. Similarly, the value t = 1 is set to the day of the first positive case observed. More information on two-step procedures can be found in [12]. The models were fitted in six time periods: the end date of the study (24 September 2020); 20 days before the peak of the pandemic for each of the Mexican states examined; peak day; and 10, 20, and 30 days after the peak of cases was reached. The models were compared in pairs utilizing Vuong's test, a classical likelihood ratio approach to model selection for nested and non-nested models, which uses the Kulback-Leiber information criterion [27]. The hypotheses for Vuong's model selection test are: Thus, the results will provide a p-value from Vuong's test that can be compared to a significance level α and aid in selecting the model that best fits the data. Graphically, we can observe in Figure 3 the approximate modeling of the pandemic, as well as identify the peak (when there is a change in the growth rate) and the steps where the models of interest in the project will be used.   Table 1 displays the dates when COVID-19 cases peaked at the national level and in the three Mexican states selected for the study.   Table 1 displays the dates when COVID-19 cases peaked at the national level and in the three Mexican states selected for the study.  Table 2, we can see the p-value obtained for the state of Tamaulipas on 24 September was 0.9650 for the H 2A and its complement 0.0350 was the p-value for the H 2B . At the significance level α = 0.05, we rejected the null hypothesis in favor of H 2B , suggesting that the logistic model fits the data better than the polynomial model.  Table 3 summarizes which models fit better for each region at each of the six time periods examined. In general, it is easy to see that the nonlinear growth models did not fit better than the polynomial models on dates before and during the peak of the first wave of the pandemic was reached. On the other hand, when examining the dates prior to the first wave peak of the pandemic, we found a negligible difference between the models. More importantly, examinations of model comparisons between dates before and during the peak of the pandemic revealed that no model worked better than any other. That is, in terms of modeling the growth rate of COVID-19 cases, there was no difference between the models. However, in the late phase of the pandemic, after the inflection point, the nonlinear growth models performed better than the polynomial model fit. Polynomial and nonlinear growth models were useful for modeling the beginning of the epidemic (see Figure 1) until reaching the maximum peak of daily cases for the pandemic [5]. On the other hand, nonlinear growth models were more accurate and effective when more information was available, and the maximum peak of daily cases was reached. Furthermore, time series models allowed for practical real-time monitoring of when (a) exponential growth was beginning, (b) exponential growth was in effect, and (c) exponential growth was about to end, which indicated that the epidemic was reaching its end. Finally, the nonlinear growth models allowed for describing the behavior at the end of the pandemic and monitoring and detecting a possible second wave of the epidemic. In general, the logistic and Gompertz models had the better fit. For example, for the state of Tamaulipas, we used the Gompertz model, which was one of the models with better fit for the peak point + 20 days. We could estimate the maximum number of COVID-19 cases β 1 = 63,640, the cases' growth speed β 2 = 0.0156, and the initial number of cases β 3 = 162.

Model Performance
Once the models were estimated, it was possible to predict the total cases and the most recent information on rates (or percentage) of positive active COVID-19, outpatients, stable hospitalized patients, seriously hospitalized patients, and intubated hospitalized patients. Likewise, with the information from the SENTINEL Prevention Model, we estimated the total number of asymptomatic COVID-19 positive cases. Figures 4-7 show the cumulative total cases of COVID-19 through 15 October (21 days out of the initial sample which was 24 September) for the four case studies (solid black line). These figures also show, with a red line, the point predictions made by each of the three models, as well as the area covered by the prediction intervals of said estimates with gray shading. In Figure 4, we can observe that the predictions made by the logistic and Gompertz models were relatively good, but not so in the case of the polynomial model, which even predicted a decrease in the total accumulated cases (which was not possible in the context studied). Figure 5 corresponds to the state of Tamaulipas-in this figure, we can see that the worst predictions were also made by the polynomial model. In Figure 6, we can see the predictions for the state of Quintana Roo, wherein the model that best made predictions was the logistic one, and the total number of cases was overestimated by the two other models. Regarding the predictions made for the country, in Figure 7, we can see that the best predictions, by far, were made by the Gompertz model. Table 4 shows the root mean square error (RMSE) of each model proposed for the three state case studies and at the national level at the six time periods of interest.
predictions were also made by the polynomial model. In Figure 6, we can see the predictions for the state of Quintana Roo, wherein the model that best made predictions was the logistic one, and the total number of cases was overestimated by the two other models. Regarding the predictions made for the country, in Figure 7, we can see that the best predictions, by far, were made by the Gompertz model. Table 4 shows the root mean square error (RMSE) of each model proposed for the three state case studies and at the national level at the six time periods of interest.         As a result of the 21-day predictions mentioned above, the number of new cases of COVID-19 can be obtained, and the results for the four case studies and the three models are shown in Tables 5-8. These results are based on the official data source of the daily releases issued by the Mexican Secretaria de Salud on its website https://coronavirus.gob.mx/as of 15 October 2020 [5].

Autocorrelation
An autoregressive model was fitted to the residuals of the logistics and Gompertz models estimated in a single stage to eliminate the autocorrelation. We fit an autoregressive model to the logistic and Gompertz functions to eliminate the residual autocorrelation. The analysis revealed that the inclusion of an autoregressive component of the second order improved the fit of the data. For example, for the state of Aguascalientes, Figure 8a,b shows the autocorrelation residual and partial autocorrelation plots for models without the autoregressive terms. In these figures, it is clear that the residuals show a pattern of dependency. Furthermore, Figure 8c,d demonstrates a good fit, and there is no evidence of linear relationship. The patterns in the data revealed that nonlinear models with autoregressive terms met assumptions of independence. The models showed good fit while accounting for autocorrelation of residuals while providing better interpretability of the model coefficients in terms of growth rate, maximum number of cases, and initial number of cases. Thus, the model proposed in this paper produces a substantial improvement of the predictions. of dependency. Furthermore, Figure 8c,d demonstrates a good fit, and there is no evidence of linear relationship. The patterns in the data revealed that nonlinear models with autoregressive terms met assumptions of independence. The models showed good fit while accounting for autocorrelation of residuals while providing better interpretability of the model coefficients in terms of growth rate, maximum number of cases, and initial number of cases. Thus, the model proposed in this paper produces a substantial improvement of the predictions.
(a) Autocorrelation in residuals without an autoregressive term (b) Partial autocorrelation in residuals without an autoregressive term (c) Autocorrelation in residuals when AR(1) term is added (d) Partial autocorrelation in residuals when AR(1) term is added Figure 8. Analysis of autocorrelation in residuals

Conclusions
The modeling of the COVID-19 cases is a unique challenge for statisticians, considering the fact that the data are limited and there are often delays in updating the data. Our approach to modeling the pandemic can provide assistance to decision-making officials for containing and anticipating a "second wave" of the COVID-19 pandemic in Mexico. We divided the pandemic data into two stages, and at each stage, three two-step models were adjusted. In Stage I, as can be seen in Table 3, the polynomial model of order three with an autoregressive error component of order one was computed through applying the Prais-Winsten or Cochrane-Orcutt estimation, which had a better performance than

Conclusions
The modeling of the COVID-19 cases is a unique challenge for statisticians, considering the fact that the data are limited and there are often delays in updating the data. Our approach to modeling the pandemic can provide assistance to decision-making officials for containing and anticipating a "second wave" of the COVID-19 pandemic in Mexico. We divided the pandemic data into two stages, and at each stage, three two-step models were adjusted. In Stage I, as can be seen in Table 3, the polynomial model of order three with an autoregressive error component of order one was computed through applying the Prais-Winsten or Cochrane-Orcutt estimation, which had a better performance than nonlinear growth models. In Stage II, after the peak of COVID-19 cases, two-step nonlinear growth models outperformed the polynomial model. Further, the models used in this paper were different than those used in predictive models using time series or machine learning; however, our model met the assumption of independence of residuals, and the interpretability of our model was superior to those of machine learning-particularly for government officials without a statistics or machine learning background.
Although it is not the objective of this research, another purpose for the proposed models concerns the identification of the peak of the pandemic. That is, an analytical way to know if the pandemic is currently in a period of growth, peak, or decline of sustained cases is through the adjustments of the models. For example, at the moment that any of the non-linear growth models fits significantly better than the other models, we will be at the point of sustained reduction of daily cases, which would mean that the peak has already passed, and that said model must be used to make predictions of total cases and obtain insights of the pandemic at that point.
In summary, this paper showed the efficacy of utilizing two different types of models estimated in two stages (stages depending on the state of the pandemic). The majority of the efforts to model the COVID-19 pandemic are nonlinear models, such as logistic and Gompertz [4,5]. However, these models do not take into account autoregression, thus possibly skewing short-, medium-, and long-term predictions. In contrast with the SIR models, where it is necessary to fix or assume a few initial parameters, our proposed models do not require initial parameter assumptions. Our first recommendation, due to simplicity in fitting the model, is that in the early stages of the pandemic where there was an exponential growth (during or before the peak), one should utilize a polynomial model of the third order estimated with an autoregressive component. Our second recommendation is that for the later, more advanced stages when the peak of the pandemic was reached, one should utilize a nonlinear growth model (logistic or Gompertz) estimated with an autoregressive component. In the event that two models show the best fit, if any of these is the polynomial model, it will be necessary to question whether it is in the public health decisionmakers' interest to have insights (provided by the βs of the non-linear growth models) of the current state of the pandemic; if yes, then we recommend using the non-linear growth model. In the opposite case, where it is not in the interest of the decisionmakers to know insights of the pandemic, the use of the polynomial model is recommended since it does not require initial values for its estimation. Another scenario could be that within the tie, there are two linear growth models, wherein either of the two can be used.
This research is not without limitations. The projections resulting from the models were estimated without considering any type of intervention. In other words, the intervention effects-such as mask mandates, lockdowns, social distancing, or vaccines-are not considered in our models. Any of the aforementioned variables should be considered with care. Regarding vaccination, as this article was being written, there was no official site of the Mexican government where the total number of people vaccinated could be retrieved, only unofficial sites where progress is reported; however, these numbers are not trustworthy. Regarding the measures of lockdowns and use of face masks, given the federal nature of the country, the states of the republic are free to take or not take actions on their population; thus, to analyze any variable of this nature, an exhaustive study must be carried out on the states that took similar measures, and the effect of these measures must be evaluated by means of an effect or additive or multiplicative variable in the statistical model.
Due to the characteristics of the models used, the atypical characteristics of the COVID-19 pandemic, and the results and information derived from the monitoring strategy-well known in epidemiology science as SENTINEL Prevention Model-the following should be considered: The data show a lot of variability from one day to another. Part of the noise caused by the dynamics of new case records was smoothed out by pre-processing the data (moving average) and the AR (1) component of the model; in future work, to further reduce the noise, we could add complex structures in the residuals, such as ARMA, ARIMA, or SARIMA. Furthermore, at the time of the data cut-off date, the second wave of the pandemic had not yet occurred. In the event that it is required to use this model for a second wave of the pandemic, care must be taken with the estimation of the components of the non-linear models (specifically the autoregressive and nonlinear component). That is, the nonlinear models used are not designed to model spikes in cases (second wave or third waves), and this will affect the estimation of parameters-such as growth speed, maximum cases, and the autoregressive component-being able to obtain estimates out of context of the pandemic (very high or low numbers), or uniroot problems in the time series component.
As previously mentioned, the residual autocorrelation analysis was performed, looking for linear autocorrelations. Regarding the precision of the predictions, future research will focus on comparing our proposed models against machine learning and artificial intelligence models. In some cases, such as the state of Aguascalientes, it was necessary to add a second order autoregressive component.
The benefit of utilizing a second-order autoregressive component is shown in Table 9 for the state of Aguascalientes (before 20 August 2020). For the Gompertz nonlinear growth model, we corrected for the dependency between observations by including an autocorrelation term of the second order in the model that considerably improved the overall fit criteria. Thus, we recommend the inclusion of an autoregressive term of the second order when modeling COVID-19 case growth.