# Forecasting Daily COVID-19 Case Counts Using Aggregate Mobility Statistics


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Background and Preliminary Analysis

#### 3.1. Data Sources

**Confirmed COVID-19 Daily Case Counts:** The World Health Organization (WHO) states that a confirmed COVID-19 case is an individual who received a positive COVID-19 laboratory test [39]. We obtained the daily confirmed COVID-19 case counts for several countries from [40]. To show the generalizability of our forecasting approach, we considered multiple countries in our study: Argentina, Austria, Canada, Denmark, India, Italy, Japan, Netherlands, Norway, Poland, Portugal, Turkey, and the United Kingdom. Each country is treated independently of the others, and a separate forecasting model is built for each country. The data in [40] rely on information from Johns Hopkins University, which in turn is sourced from governments and from national and subnational agencies [41]. We downloaded and used data spanning the period from 25 February 2020 to 18 December 2021. The same time period was used when obtaining COVID-19 test counts and mobility statistics.

**COVID-19 Daily Test Counts:** The number of COVID-19 laboratory tests performed in a country (i.e., the COVID-19 test count) is a key factor that impacts the case counts in that country. Typically, more testing reveals more COVID-19 cases. We obtained the daily COVID-19 test counts per country from [40]. We note that there can be discrepancies in how daily test counts are computed in different countries. For example, in some countries, the reported daily test counts correspond to how many individuals were tested, regardless of which test or how many tests they took on the same day. In other countries, multiple tests from the same individual on the same day may be counted independently. Another discrepancy concerns which tests are accepted as “official” tests. Some countries counted only PCR tests, whereas other countries also counted other tests such as antigen tests. These discrepancies are one reason why we opted to build a separate forecasting model for each country instead of aggregating all data and building one model for all countries.

**Aggregate Mobility Statistics from Google CMRs:** Since the early days of the COVID-19 pandemic, Google has been publicly releasing mobility statistics called Community Mobility Reports (CMRs). These reports are built by collecting data from users who access Google services and have the “location history” feature enabled [3,4]. Users’ GPS presence and time spent at different location categories are recorded. Then, CMRs are constructed according to six categories: Transit Stations $\left(TS\right)$, Retail and Recreation $\left(RR\right)$, Parks $\left(PR\right)$, Grocery and Pharmacy $\left(GP\right)$, Workplaces $\left(WP\right)$, and Residential $\left(RS\right)$. Each category comprises a range of related and representative places. For example, the Parks category contains time spent in public gardens, castles, national forests, campgrounds, and observation decks, whereas the Transit Stations category contains time spent in subway stations, seaports, taxi stands, highway rest stops, and car rental agencies.

#### 3.2. Time-Lagged Cross-Correlation Analysis

## 4. Forecasting Methodology

**Difference with ARIMA-Based Methods:** ARIMA-based methods are popular for time-series analysis and have also been applied to COVID-19 forecasting [26,27]. Typically, however, they have been used when only a single time series is available (e.g., only $DC$). In contrast, our method fuses features from multiple time series: $DC$, $DT$, and multiple mobility time series. To take advantage of features from multiple time series and to build the most accurate regression model using the best feature set, our method utilizes a custom algorithm for optimized feature extraction, feature selection, and regression-type selection.

#### 4.1. Moving Average for $DC$

#### 4.2. Time Intervals

#### 4.3. Feature Extraction and Selection

Algorithm 1: Custom search algorithm to construct best model

#### 4.4. Regression

**Linear Regression:** Linear regression is one of the most commonly used regression types. It fits a linear function $f$ to the underlying data. The coefficients of the linear function are chosen to minimize the residual sum of squares between the observed ${Y}_{n}$ and the values approximated by the output of the regression function.
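Written out, the least-squares criterion described above can be expressed as follows. The symbols ${X}_{n}$ (feature vector of the $n$-th training sample), ${\beta}_{0}$, $\beta$, and $N$ (number of training samples) are notation assumed here for illustration and are not taken from the original text.

$$\left({\hat{\beta}}_{0},\hat{\beta}\right)=\arg\min_{{\beta}_{0},\,\beta}\sum_{n=1}^{N}{\left({Y}_{n}-f\left({X}_{n}\right)\right)}^{2},\qquad f\left({X}_{n}\right)={\beta}_{0}+{\beta}^{\top}{X}_{n}$$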

**Decision Tree Regression:** Decision trees can be used in both classification and regression problems [42]. They create models which predict outcomes by learning decision rules from the underlying training data. These decision rules are stored and queried in a tree structure (hierarchically), starting from the root and moving toward the leaves in each step. Deeper trees imply a higher number of decision rules.

**Random Forest Regression:** Random forest regression is an ensemble learning method [43]. It fits multiple decision trees on different subsets of the training data, where each subset is constructed by drawing samples from the training data with replacement. The collection of these decision trees constitutes the random forest ensemble. Afterwards, in the prediction (forecasting) phase, each tree is used to make a prediction and the predictions are then combined, e.g., by averaging.

**Extra Trees Regression:** “Extra trees” stands for “extremely randomized trees”, an ensemble learning method similar to random forest regression. In random forests, when splitting each node during the construction of a decision tree, the best split rule (i.e., the most discriminative threshold) is found either from all features or from a random subset of features. In extremely randomized trees, instead of searching for the most discriminative threshold, thresholds are drawn randomly for each feature and the best of the randomly generated thresholds is selected as the split rule. This adds another level of randomness to the overall regression model, which can reduce overfitting.

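**KNN (K-Nearest Neighbors) Regression:** KNN is a popular algorithm for both classification and regression problems [44,45]. In KNN regression, the label assigned to a query point is computed as the mean of the labels of its $k$ nearest neighbors in the feature space. That is, given the features of a test sample as $\overline{X}$, KNN regression predicts $\overline{Y}$ as:

$$\overline{Y}=\frac{1}{k}\sum_{{X}_{i}\in {N}_{k}\left(\overline{X}\right)}{Y}_{i}$$

where ${N}_{k}\left(\overline{X}\right)$ denotes the set of the $k$ training samples nearest to $\overline{X}$, and ${Y}_{i}$ is the label of training sample ${X}_{i}$.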
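The KNN prediction rule (the mean label of the $k$ nearest training samples) can be illustrated with a minimal sketch. It assumes Euclidean distance and uses only the Python standard library; the function name `knn_regress` and the toy data are hypothetical and are not part of the paper's implementation.

```python
import math

def knn_regress(train_X, train_Y, query, k=3):
    """Predict a value for `query` as the mean label of its k nearest training samples."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training samples")
    # Euclidean distance from the query to every training sample (an assumed metric)
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_Y)]
    dists.sort(key=lambda pair: pair[0])
    nearest = dists[:k]
    return sum(y for _, y in nearest) / k

# Toy data: three samples close to the origin and one far away
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_Y = [10.0, 20.0, 30.0, 100.0]
prediction = knn_regress(train_X, train_Y, query=(0.1, 0.1), k=3)
```

With $k=3$, the distant sample is ignored and the prediction is the mean of the three nearby labels.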

**AdaBoost Regression:** AdaBoost is a popular boosting algorithm introduced by Freund and Schapire [46]. Its main idea is to fit a sequence of models on iteratively modified versions of the training data. Samples in the data are given weights, and in each iteration, the weights are updated so that samples which were incorrectly predicted in the previous iteration have their weights increased in the next iteration. Consequently, the ensemble improves as iterations proceed, since later models focus on addressing the weaknesses of earlier ones [47].

**Gradient Boosting Regression:** Gradient boosting combines the intuition of boosting with the optimization of a differentiable loss function [48]; for regression, the loss function can be, e.g., squared error. In each step of iterative boosting, a regression tree is fit on the negative gradient of the loss function. The goal is to arrive at a model which minimizes the loss.
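To make the iterative procedure concrete, the following is a minimal sketch of gradient boosting for squared-error loss on a single feature. It assumes one-split regression trees (stumps) as base learners and a fixed learning rate; all names (`fit_stump`, `gradient_boost`) and the toy data are hypothetical and do not reproduce the library implementation used in the paper.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive model: initial mean plus a sum of scaled stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient equals the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data with a jump between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
model = gradient_boost(xs, ys)

baseline = sum(ys) / len(ys)
mse_baseline = sum((y - baseline) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each round fits a new stump to what the current model still gets wrong, so the training error of the boosted model falls below that of the constant baseline.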

**XGB Regression:** XGB, also known as XGBoost (short for Extreme Gradient Boosting), is an efficient and optimized implementation of gradient boosting [49]. Following its inception, it quickly became popular among practitioners due to its speed and accuracy, e.g., it yielded the most accurate results in many Kaggle competitions. Therefore, we incorporated it in our framework in addition to the original gradient boosting algorithm.

**Ridge Regression:** Ridge regression improves linear regression (with ordinary least squares) in cases with correlated independent variables. It imposes an ${l}_{2}$-norm penalty on the size of the regression coefficients. Ridge regression has been successfully applied in many diverse fields; our empirical results show that it also performs well in our COVID-19 forecasting application.

**Lasso Regression:** Like Ridge regression, Lasso is a linear regression type. The intuition of Lasso (least absolute shrinkage and selection operator) was discussed in various domains such as geophysics and signal processing [50], but it became popular in regression analysis after its introduction by Tibshirani [51]. The idea of Lasso is to shrink the size of the regression coefficients, which is similar to Ridge regression. However, instead of imposing an ${l}_{2}$-norm penalty, Lasso imposes an ${l}_{1}$-norm penalty on the regression coefficients. As such, Lasso is suitable for datasets which have high collinearity.
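The difference between the two penalties can be seen by writing both objectives side by side. The notation is assumed here for illustration: ${X}_{n}$ is the feature vector of the $n$-th training sample, ${\beta}_{0}$ and $\beta$ are the intercept and coefficient vector, and $\alpha \ge 0$ is a regularization strength. Software libraries may scale the terms differently.

$$\text{Ridge:}\quad \min_{{\beta}_{0},\,\beta}\ \sum_{n=1}^{N}{\left({Y}_{n}-{\beta}_{0}-{\beta}^{\top}{X}_{n}\right)}^{2}+\alpha {\left\Vert \beta \right\Vert}_{2}^{2}$$

$$\text{Lasso:}\quad \min_{{\beta}_{0},\,\beta}\ \sum_{n=1}^{N}{\left({Y}_{n}-{\beta}_{0}-{\beta}^{\top}{X}_{n}\right)}^{2}+\alpha {\left\Vert \beta \right\Vert}_{1}$$

Setting $\alpha =0$ in either objective recovers ordinary least squares.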

**Huber Regression:** Huber regression uses a piecewise loss function which combines squared loss (${l}_{2}$-norm penalty) for non-outliers and absolute value loss (${l}_{1}$-norm penalty) for outliers [45,52]. It is motivated by the fact that squared loss tends to be dominated by outliers, i.e., samples with error higher than a certain threshold. Thus, the ${l}_{1}$-norm penalty is applied to outliers to reduce their effect without completely ignoring them, while the ${l}_{2}$-norm penalty is applied to non-outliers.
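The piecewise loss can be sketched in a few lines. The threshold `delta` that separates non-outliers from outliers, and its default value of 1.35, are assumptions made for this illustration; the source does not state which threshold was used.

```python
def huber_loss(residual, delta=1.35):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    magnitude = abs(residual)
    if magnitude <= delta:
        # Non-outlier: squared loss
        return 0.5 * residual ** 2
    # Outlier: absolute-value loss, shifted so both pieces meet at |residual| = delta
    return delta * (magnitude - 0.5 * delta)

small = huber_loss(1.0)          # quadratic branch
large = huber_loss(10.0)         # linear branch
squared_large = 0.5 * 10.0 ** 2  # what pure squared loss would assign
```

For the large residual, the Huber loss grows linearly and stays well below the squared loss, which is how the influence of outliers is limited.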

**RANSAC (Random Sample Consensus) Regression:** The RANSAC algorithm was first published in [53]. It classifies samples in the training dataset into two groups: inliers, which should be taken into account when building a regression model, and outliers, which should not be considered when determining the regression coefficients. RANSAC iteratively selects a random subset of the data, fits a model to that subset, classifies the data as inliers vs. outliers with respect to the fitted model, and prefers the fitted model that yields the largest number of inliers.
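The iterative procedure can be sketched for a single-feature line fit. This is a simplified illustration under stated assumptions: two-point random subsets, a fixed residual threshold for the inlier test, and a final least-squares refit on the best consensus set. The names `fit_line` and `ransac_line`, the parameter values, and the toy data are hypothetical.

```python
import random

def fit_line(points):
    """Ordinary least-squares line y = a*x + b through the given points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def ransac_line(points, n_trials=200, threshold=1.0, seed=0):
    """Return the line refit on the largest consensus set, and that set."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # identical x values cannot define y = a*x + b
        a, b = fit_line(sample)
        # Points whose residual is within the threshold count as inliers
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_line(best_inliers), best_inliers

# Ten points on y = 2x + 1 plus two gross outliers
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(2.5, 40.0), (6.5, -30.0)]
(slope, intercept), inliers = ransac_line(points)
```

Because the two outliers never fall within the threshold of a line drawn through two clean points, they are excluded from the consensus set and do not affect the final coefficients.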

#### 4.5. Cross-Validation and Evaluation Setup

## 5. Results and Discussion

#### 5.1. Comparison of Actual versus Predicted Case Counts

#### 5.2. Forecasting Accuracy

#### 5.3. Impact of Window Size w

#### 5.4. Comparison of Regression Types

#### 5.5. Analysis of Feature Sets

**Analysis with respect to $\mathcal{F}$:** First, we analyze which mobility time series are selected by Algorithm 1 for inclusion in the best $\mathcal{F}$. The results are provided in Table 5, where each mobility time series is given one column. (We do not include the non-mobility time series $DC$ and $DT$ in this table, since they are always part of $\mathcal{F}$.) A checkmark in a cell indicates that the corresponding mobility time series was selected as part of the best $\mathcal{F}$ for the corresponding country. As observed from Table 5, Transit Stations ($TS$) and Residential ($RS$) are included in the best $\mathcal{F}$ of almost all countries, which is intuitive. Since $TS$ consists of places such as subway stations, seaports, taxi stands, and highway rest stops, the mobility of individuals in such places is a strong indicator of the spread of COVID-19. For example, if many people are spending time in subway stations or at taxi stands, this indicates a large amount of mobility in public or private transport, which can cause the virus to spread faster. In contrast, $RS$ covers residential locations: if many people are staying at home, this will slow down the spread of COVID-19. It is therefore intuitive that both $TS$ and $RS$ are typically included in the best $\mathcal{F}$. On the other hand, Workplaces ($WP$) and Retail and Recreation ($RR$) are less commonly included in the best $\mathcal{F}$. One reason could be that individuals spend time in recreational locations or shopping for necessities regardless of the status of the pandemic, which weakens the predictive power of the corresponding mobility time series.

**Analysis with respect to t:** Recall that Algorithm 1 searches for the optimal time period $t$ in the range $t\in [1,{t}_{max}]$. Here, we analyze which value of $t$ was selected by Algorithm 1 as optimal for each country. The results are provided in Figure 6. We had previously found in Section 3.2 that, according to TLCC, the highest correlations are reached when the time lag is around 12–15 days. The results in Figure 6, and therefore the choices made by Algorithm 1, agree with the findings from Section 3.2: for nine out of 13 countries, the optimal $t$ was found to be between 12 and 15 days. In addition, since $t>18$ was never found to be optimal for any country, we can conclude that including unnecessarily old readings hurts the accuracy of the regression model rather than improving it.
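To illustrate what a time period $t$ means for the regression input, the sketch below builds one training sample per day from the previous $t$ readings of each time series. This is a hypothetical construction, not the feature extraction of Algorithm 1, whose details are not reproduced in this text; the function name `build_lagged_samples` and the toy series are assumptions.

```python
def build_lagged_samples(series, target, t):
    """For each day n >= t, use the previous t readings of every series as features.

    series: dict mapping a name (e.g. "DC", "TS") to a list of daily values
    target: list of daily values to predict, same length as each series
    """
    names = sorted(series)  # fixed column order
    X, Y = [], []
    for n in range(t, len(target)):
        row = []
        for name in names:
            row.extend(series[name][n - t:n])  # readings from days n-t .. n-1
        X.append(row)
        Y.append(target[n])
    return X, Y

# Toy example with five days of data and t = 2
series = {"DC": [1, 2, 3, 4, 5], "TS": [10, 20, 30, 40, 50]}
X, Y = build_lagged_samples(series, target=series["DC"], t=2)
```

A larger $t$ lengthens every feature vector and shortens the training set, which is one way that unnecessarily old readings can cost accuracy.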

#### 5.6. Analysis of Forecasting Bias

#### 5.7. Runtime Performance and Overhead

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Notes

1. https://movement.uber.com/?lang=en-US (accessed on 15 September 2022).
2. https://covid19.apple.com/mobility (accessed on 15 September 2022).

## References

1. WHO. WHO Director-General’s Opening Remarks at the Media Briefing on COVID-19. 2020. Available online: https://www.who.int/director-general/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020 (accessed on 15 September 2022).
2. WHO. WHO Coronavirus (COVID-19) Dashboard 2022. Available online: https://covid19.who.int/ (accessed on 15 September 2022).
3. Google. COVID-19 Community Mobility Reports. 2020. Available online: https://www.google.com/covid19/mobility/ (accessed on 15 September 2022).
4. Aktay, A.; Bavadekar, S.; Cossoul, G.; Davis, J.; Desfontaines, D.; Fabrikant, A.; Gabrilovich, E.; Gadepalli, K.; Gipson, B.; Guevara, M.; et al. Google COVID-19 community mobility reports: Anonymization process description (version 1.1). arXiv **2020**, arXiv:2004.04145.
5. Alessandretti, L. What human mobility data tell us about COVID-19 spread. Nat. Rev. Phys. **2022**, 4, 12–13.
6. Zhang, C.; Qian, L.X.; Hu, J.Q. COVID-19 pandemic with human mobility across countries. J. Oper. Res. Soc. China **2021**, 9, 229–244.
7. Du, B.; Zhao, Z.; Zhao, J.; Yu, L.; Sun, L.; Lv, W. Modelling the epidemic dynamics of COVID-19 with consideration of human mobility. Int. J. Data Sci. Anal. **2021**, 12, 369–382.
8. Sulyok, M.; Walker, M. Community movement and COVID-19: A global study using Google’s Community Mobility Reports. Epidemiol. Infect. **2020**, 148, 1–9.
9. Xiong, C.; Hu, S.; Yang, M.; Luo, W.; Zhang, L. Mobile device data reveal the dynamics in a positive relationship between human mobility and COVID-19 infections. Proc. Natl. Acad. Sci. USA **2020**, 117, 27087–27089.
10. Yilmazkuday, H. Stay-at-home works to fight against COVID-19: International evidence from Google mobility data. J. Hum. Behav. Soc. Environ. **2021**, 31, 210–220.
11. Tian, H.; Liu, Y.; Li, Y.; Wu, C.H.; Chen, B.; Kraemer, M.U.; Li, B.; Cai, J.; Xu, B.; Yang, Q.; et al. An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China. Science **2020**, 368, 638–642.
12. Kraemer, M.U.; Yang, C.H.; Gutierrez, B.; Wu, C.H.; Klein, B.; Pigott, D.M.; Du Plessis, L.; Faria, N.R.; Li, R.; Hanage, W.P.; et al. The effect of human mobility and control measures on the COVID-19 epidemic in China. Science **2020**, 368, 493–497.
13. Chang, S.L.; Harding, N.; Zachreson, C.; Cliff, O.M.; Prokopenko, M. Modelling transmission and control of the COVID-19 pandemic in Australia. Nat. Commun. **2020**, 11, 5710.
14. Li, Y.; Li, M.; Rice, M.; Zhang, H.; Sha, D.; Li, M.; Su, Y.; Yang, C. The impact of policy measures on human mobility, COVID-19 cases, and mortality in the US: A spatiotemporal perspective. Int. J. Environ. Res. Public Health **2021**, 18, 996.
15. Wellenius, G.A.; Vispute, S.; Espinosa, V.; Fabrikant, A.; Tsai, T.C.; Hennessy, J.; Dai, A.; Williams, B.; Gadepalli, K.; Boulanger, A.; et al. Impacts of social distancing policies on mobility and COVID-19 case growth in the US. Nat. Commun. **2021**, 12, 3118.
16. Zhou, Y.; Xu, R.; Hu, D.; Yue, Y.; Li, Q.; Xia, J. Effects of human mobility restrictions on the spread of COVID-19 in Shenzhen, China: A modelling study using mobile phone data. Lancet Digit. Health **2020**, 2, e417–e424.
17. Nouvellet, P.; Bhatia, S.; Cori, A.; Ainslie, K.E.; Baguelin, M.; Bhatt, S.; Boonyasiri, A.; Brazeau, N.F.; Cattarino, L.; Cooper, L.V.; et al. Reduction in mobility and COVID-19 transmission. Nat. Commun. **2021**, 12, 1090.
18. Xi, W.; Pei, T.; Liu, Q.; Song, C.; Liu, Y.; Chen, X.; Ma, J.; Zhang, Z. Quantifying the time-lag effects of human mobility on the COVID-19 transmission: A multi-city study in China. IEEE Access **2020**, 8, 216752–216761.
19. Ilin, C.; Annan-Phan, S.; Tai, X.H.; Mehra, S.; Hsiang, S.; Blumenstock, J.E. Public mobility data enables COVID-19 forecasting and management at local and global scales. Sci. Rep. **2021**, 11, 13531.
20. Rostami-Tabar, B.; Rendon-Sanchez, J.F. Forecasting COVID-19 daily cases using phone call data. Appl. Soft Comput. **2021**, 100, 106932.
21. Liu, D.; Clemente, L.; Poirier, C.; Ding, X.; Chinazzi, M.; Davis, J.T.; Vespignani, A.; Santillana, M. A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models. arXiv **2020**, arXiv:2004.04019.
22. Athanasios, A.; Irini, F.; Tasioulis, T.; Konstantinos, K. Prediction of the effective reproduction number of COVID-19 in Greece: A machine learning approach using Google mobility data. medRxiv **2021**.
23. Wang, P.; Zheng, X.; Ai, G.; Liu, D.; Zhu, B. Time series prediction for the epidemic trends of COVID-19 using the improved LSTM deep learning method: Case studies in Russia, Peru and Iran. Chaos Solitons Fractals **2020**, 140, 110214.
24. Luo, J.; Zhang, Z.; Fu, Y.; Rao, F. Time series prediction of COVID-19 transmission in America using LSTM and XGBoost algorithms. Results Phys. **2021**, 27, 104462.
25. Auliya, S.F.; Wulandari, N. The Impact of Mobility Patterns on the Spread of the COVID-19 in Indonesia. J. Inf. Syst. Eng. Bus. Intell. **2021**, 7, 31–41.
26. Awwad, F.A.; Mohamoud, M.A.; Abonazel, M.R. Estimating COVID-19 cases in Makkah region of Saudi Arabia: Space-time ARIMA modeling. PLoS ONE **2021**, 16, e0250149.
27. de Araujo Morais, L.R.; da Silva Gomes, G.S. Forecasting daily Covid-19 cases in the world with a hybrid ARIMA and neural network model. Appl. Soft Comput. **2022**, 126, 109315.
28. Schwabe, A.; Persson, J.; Feuerriegel, S. Predicting COVID-19 spread from large-scale mobility data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 3531–3539.
29. Wang, H.; Yamamoto, N. Using a partial differential equation with Google Mobility data to predict COVID-19 in Arizona. arXiv **2020**, arXiv:2006.16928.
30. Li, R.Q.; Song, Y.R.; Jiang, G.P. Prediction of epidemics dynamics on networks with partial differential equations: A case study for COVID-19 in China. Chin. Phys. B **2021**, 30, 120202.
31. Sun, D.; Duan, L.; Xiong, J.; Wang, D. Modeling and forecasting the spread tendency of the COVID-19 in China. Adv. Differ. Equ. **2020**, 2020, 1–16.
32. Sarkar, K.; Khajanchi, S.; Nieto, J.J. Modeling and forecasting the COVID-19 pandemic in India. Chaos Solitons Fractals **2020**, 139, 110049.
33. Zeng, Y.; Guo, X.; Deng, Q.; Luo, S.; Zhang, H. Forecasting of COVID-19: Spread with dynamic transmission rate. J. Saf. Sci. Resil. **2020**, 1, 91–96.
34. Harjule, P.; Tiwari, V.; Kumar, A. Mathematical models to predict COVID-19 outbreak: An interim review. J. Interdiscip. Math. **2021**, 24, 259–284.
35. Kumar, N.; Susan, S. Particle swarm optimization of partitions and fuzzy order for fuzzy time series forecasting of COVID-19. Appl. Soft Comput. **2021**, 110, 107611.
36. Gomes, D.C.D.S.; Serra, G.L.D.O. Machine learning model for computational tracking and forecasting the COVID-19 dynamic propagation. IEEE J. Biomed. Health Inform. **2021**, 25, 615–622.
37. Mileu, N.; Costa, N.M.; Costa, E.M.; Alves, A. Mobility and Dissemination of COVID-19 in Portugal: Correlations and Estimates from Google’s Mobility Data. Data **2022**, 7, 107.
38. Kishore, K.; Jaswal, V.; Verma, M.; Koushal, V. Exploring the utility of Google mobility data during the COVID-19 pandemic in India: Digital epidemiological analysis. JMIR Public Health Surveill. **2021**, 7, e29957.
39. World Health Organization. Coronavirus Disease 2019 (COVID-19) Situation Report 50. 2020. Available online: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200310-sitrep-50-covid-19.pdf?sfvrsn=55e904fb_2 (accessed on 15 September 2022).
40. Ritchie, H.; Mathieu, E.; Rodes-Guirao, L.; Appel, C.; Giattino, C.; Ortiz-Ospina, E.; Hasell, J.; Macdonald, B.; Beltekian, D.; Roser, M. Coronavirus Pandemic (COVID-19). Our World Data **2020**. Available online: https://ourworldindata.org/coronavirus (accessed on 15 September 2022).
41. Dong, E.; Du, H.; Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. **2020**, 20, 533–534.
42. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: Abingdon-on-Thames, UK, 2017.
43. Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32.
44. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory **1967**, 13, 21–27.
45. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2.
46. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. **1997**, 55, 119–139.
47. Drucker, H. Improving Regressors Using Boosting Techniques. In Proceedings of the International Conference on Machine Learning, Nashville, TN, USA, 8–12 July 1997; Volume 97, pp. 107–115.
48. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. **2001**, 29, 1189–1232.
49. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
50. Chen, S.S.; Donoho, D.L.; Saunders, M.A. Atomic decomposition by basis pursuit. SIAM Rev. **2001**, 43, 129–159.
51. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. **1996**, 58, 267–288.
52. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. **1964**, 35, 73–101.
53. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM **1981**, 24, 381–395.

**Figure 1.** Original time series for daily case counts ($DC$), transit station mobility ($TS$), and a 15-day shifted version of $TS$. All time series are from the United Kingdom. Since the ranges of $DC$ and $TS$ are different, two y-axes are constructed. The axis on the left is for $DC$ (black curve), and the axis on the right is for $TS$ (blue and red curves).

**Figure 2.** TLCC between COVID-19 daily case counts and different mobility time series (workplaces, residential, transit stations, retail and recreation), for varying t between 0 and 29. Higher correlations are shown in red, and lower correlations are shown in blue.
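A time-lagged cross-correlation of the kind summarized in Figure 2 can be sketched as follows. The sketch assumes Pearson correlation between mobility on day $n$ and the case count on day $n+t$; the exact procedure of Section 3.2 is not reproduced in this text, and the function names and synthetic data are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def tlcc(mobility, cases, max_lag):
    """Correlate mobility on day n with cases on day n + lag, for lag = 0..max_lag."""
    scores = {}
    for lag in range(max_lag + 1):
        shifted_mobility = mobility[:len(mobility) - lag]
        later_cases = cases[lag:]
        scores[lag] = pearson(shifted_mobility, later_cases)
    return scores

# Synthetic data: cases follow mobility with a delay of exactly 3 days
mobility = [math.sin(0.3 * i) for i in range(60)]
cases = [0.0, 0.0, 0.0] + mobility[:-3]
scores = tlcc(mobility, cases, max_lag=5)
best_lag = max(scores, key=scores.get)
```

On this synthetic input the correlation peaks at the built-in delay of 3 days, which mirrors how the lag with the highest correlation is read off for each mobility time series.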

**Figure 4.** Comparison of actual vs. predicted case counts. Results show that the predictions made by our forecasting methodology are highly accurate and closely resemble actual case counts.

**Figure 6.** Results of searching for the best time period t using Algorithm 1: what was the best value of t for each country?

**Table 1.** Final time intervals (start date–end date) for each country. Dates are given in format: dd/mm/yyyy.

Country | Time Interval | Country | Time Interval |
---|---|---|---|
Argentina | 03/04/2020–18/12/2021 | Netherlands | 01/06/2020–17/12/2021 |
Austria | 03/03/2020–14/12/2021 | Norway | 01/04/2020–12/12/2021 |
Canada | 12/03/2020–17/12/2021 | Poland | 29/04/2020–18/12/2021 |
Denmark | 27/02/2020–15/12/2021 | Portugal | 02/03/2020–18/12/2021 |
India | 24/04/2020–18/12/2021 | Turkey | 25/11/2020–08/11/2021 |
Italy | 25/02/2020–18/12/2021 | United Kingdom | 21/04/2020–28/11/2021 |
Japan | 02/04/2020–18/12/2021 | - | - |

**Table 2.** Mean Absolute Error (MAE) and its ratio to the maximum number of daily cases ($\max(DC)$) per country.

Country | MAE ($w=1$) | $\max(DC)$ | $\frac{MAE}{\max(DC)}$ |
---|---|---|---|
Argentina | 1401.15 | 41,080 | 3.41% |
Austria | 389.06 | 15,809 | 2.46% |
Canada | 719.59 | 11,381 | 6.32% |
Denmark | 207.02 | 8773 | 2.36% |
India | 3492.39 | 414,188 | 0.84% |
Italy | 1108.03 | 40,902 | 2.71% |
Japan | 270.15 | 25,992 | 1.04% |
Netherlands | 515.97 | 23,714 | 2.18% |
Norway | 193.20 | 7631 | 2.53% |
Poland | 1457.33 | 35,253 | 4.13% |
Portugal | 378.06 | 16,432 | 2.30% |
Turkey | 369.41 | 63,082 | 1.99% |
United Kingdom | 1929.96 | 68,053 | 2.84% |
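The ratio in the last column of Table 2 is the MAE expressed as a percentage of the largest daily case count. The sketch below shows the arithmetic under that reading; the function names and the three-day toy series are hypothetical, and only the Argentina figures are taken from the table.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted daily counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mae_ratio_percent(mae, max_dc):
    """MAE as a percentage of the maximum daily case count."""
    return 100.0 * mae / max_dc

# Toy three-day example
toy_mae = mean_absolute_error([100, 200, 300], [110, 190, 330])

# Argentina row of Table 2: MAE = 1401.15, max(DC) = 41,080
argentina_ratio = mae_ratio_percent(1401.15, 41080)  # about 3.41
```

Normalizing by $\max(DC)$ makes errors comparable across countries whose case counts differ by orders of magnitude.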

Country | $w=3$ | $w=5$ | $w=7$ |
---|---|---|---|
Argentina | 4.18 | 2.99 | 2.09 |
Austria | 5.09 | 4.12 | 2.53 |
Canada | 7.13 | 4.49 | 3.51 |
Denmark | 6.24 | 4.58 | 3.52 |
India | 2.15 | 1.43 | 1.00 |
Italy | 3.85 | 2.55 | 2.26 |
Japan | 4.05 | 2.99 | 1.78 |
Netherlands | 3.50 | 2.58 | 2.26 |
Norway | 8.57 | 6.99 | 5.13 |
Poland | 5.61 | 4.20 | 2.83 |
Portugal | 6.57 | 4.93 | 3.68 |
Turkey | 1.94 | 1.44 | 1.39 |
United Kingdom | 3.72 | 2.58 | 2.21 |

Regression Type | Countries |
---|---|
RANSAC | Argentina, Austria, Canada, Denmark, India, Italy, Japan, Norway, Poland |
Ridge | Netherlands, Portugal, Turkey, United Kingdom |

**Table 5.** Results of searching for the best feature set using Algorithm 1: which mobility time series were included in the best $\mathcal{F}$ for each country?

Country | Workplaces (WP) | Transit Stations (TS) | Residential (RS) | Retail and Recreation (RR) |
---|---|---|---|---|

Argentina | ✓ | ✓ | ||

Austria | ✓ | ✓ | ||

Canada | ✓ | ✓ | ||

Denmark | ✓ | ✓ | ||

India | ✓ | ✓ | ||

Italy | ✓ | ✓ | ||

Japan | ✓ | ✓ | ||

Netherlands | ✓ | ✓ | ||

Norway | ✓ | ✓ | ||

Poland | ✓ | ✓ | ||

Portugal | ✓ | ✓ | ||

Turkey | ✓ | ✓ | ||

United Kingdom | ✓ | ✓ |

**Table 6.** Execution times of each regression type (individually) and Algorithm 1 in total, for three different countries (Turkey, Netherlands, Italy). All values are reported in seconds.

Regression Type or Method | Turkey | Netherlands | Italy |
---|---|---|---|
Linear | 0.15 | 0.20 | 0.25 |
XGB | 1.72 | 3.16 | 3.59 |
AdaBoost | 1.74 | 2.42 | 3.05 |
Decision Tree | 0.19 | 0.20 | 0.29 |
Gradient Boosting | 1.63 | 1.69 | 2.18 |
Random Forest | 4.97 | 7.03 | 9.02 |
Extra Trees | 4.01 | 5.64 | 7.47 |
KNN | 0.26 | 0.38 | 0.45 |
Ridge | 0.15 | 0.19 | 0.23 |
Lasso | 0.24 | 0.26 | 0.31 |
Huber | 0.34 | 0.36 | 0.39 |
RANSAC | 0.51 | 0.63 | 0.71 |
Algorithm 1 (total) | 2252.95 | 6293.88 | 7949.11 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Boru, B.; Gursoy, M.E. Forecasting Daily COVID-19 Case Counts Using Aggregate Mobility Statistics. *Data* **2022**, *7*, 166.
https://doi.org/10.3390/data7110166
