Mobility and Dissemination of COVID-19 in Portugal: Correlations and Estimates from Google’s Mobility Data

Abstract: The spread of the coronavirus disease 2019 (COVID-19) has important links with population mobility. Social interaction is a known determinant of human-to-human transmission of infectious diseases and, in turn, population mobility as a proxy of interaction is of paramount importance to analyze COVID-19 diffusion. Using mobility data from Google’s Community Reports, this paper captures the association between changes in mobility patterns through time and the corresponding COVID-19 incidence, following a multi-scalar approach applied to mainland Portugal. Results demonstrate a strong relationship between mobility data and COVID-19 incidence, suggesting that more mobility is associated with more COVID-19 cases. The methodological procedure can be summarized as a multiple linear regression with a moving time window. Model validation demonstrates good forecast accuracy, particularly when the cumulative number of cases is considered. Based on this premise, it is possible to estimate and predict the future evolution of the number of COVID-19 cases using near real-time information on population mobility.


Introduction
The coronavirus disease 2019 (COVID-19) has spread across the world; one year after the first confirmed case, more than 100 million cases had been recorded worldwide, more than 2 million of them resulting in death. By March 2022, two years after the pandemic declaration, the number of confirmed cases had risen to nearly 500 million and fatalities had surpassed 6 million.
The spread of the disease, caused by the transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), occurred more quickly and to a greater extent than in previous coronavirus epidemics [1], forcing countries to adopt epidemic containment measures [2]. In an increasingly globalized world, mobility is responsible for reducing epidemiological differences between regions [3], with travel restrictions being among the first non-therapeutic strategies to contain the spread of pathogens [4]. Because mobility is the main driver of human contact and is intrinsically associated with social interaction [5], which is essential to SARS-CoV-2 transmission, the main non-therapeutic measures widely adopted to stop human-to-human transmission have been travel restrictions that force the population to stay at home and reduce contacts [1,6]. Thus, reduced mobility is associated with decreasing transmission of COVID-19 [7][8][9][10].
Several studies show a strong relationship between COVID-19 cases and population mobility, identifying commuting behaviors as a spatial determinant of COVID-19 patterns [11][12][13][14][15][16]. Other authors have specifically analysed mobility based on near real-time information. The use of this kind of data is not new to epidemiological studies, where it has performed well in identifying trends and spatial patterns [17,18]. Regarding this pandemic, the work of Jia et al. [19] pioneered the finding that information on population mobility is epidemiologically informative of COVID-19 diffusion. Using mobile phone data, the authors found that population movement within China significantly explains the geographical patterns of COVID-19 transmission. The study of Badr et al. [20] demonstrates a strong correlation between population mobility patterns and variation of COVID-19 incidence in 25 counties of the United States (US), and Kraemer et al. [6] showed the importance of mobility and travel restrictions in reducing COVID-19 transmission in China. In addition, the authors assessed that epidemic magnitude is also predictable from the volume of human movement. Yilmazkuday [21] quantified the importance of travel between counties in the US, concluding that where the population tends to stay at home more, the evolution of the pandemic is less uncontrolled and the chance of infection is lower. In the European context, Cartenì et al. [22] related mobility habits to the evolution of the number of new infections over a 21-day horizon, identifying a direct relationship between road traffic in Italy and the incidence of COVID-19. For Portugal, Mourão and Bento [23] and Alves [11] identified a positive relationship between the contiguity of territorial units and the progression of pandemic spread, highlighting the importance of intermunicipal and interparish commuting for contagion, in line with the conclusions of Sousa et al. [15], and Casa Nova et al. [24] assessed the dynamic correlations between Google's community reports and COVID-19 cases.
In this sense, increased mobility is related to a positive variation in incidence and is a good predictor of the number of COVID-19 cases, with several authors developing methodologies to predict the number of cases from population mobility data. Some support their analysis on Google's "COVID-19 Community Mobility Reports" [7,24,25,26,27], because there is no other repository with a comparable volume of mobility information accessible as open data. The vast number of published papers that used this data is indicative of the quality of Google's information, even though it is acquired from a biased population sample. Other authors use alternative information on public transport use and road traffic [22], from social networks such as Facebook [28] and from cell phone geolocation [29,30]. Another important repository of near real-time changes in mobility was Apple's Mobility Trend Reports [31], which categorized mobility according to the mode of travel (walking, driving and transit) and was used to assess similar associations [5,32]. However, this repository was discontinued and has not been available online since April 2022.
The pandemic has prompted discussion about the conditions and determinants that explain spatial inequalities in the dissemination of COVID-19, highlighting the need for relevant information to understand the trends, processes and patterns of spatial diffusion in order to support public health decision making to contain this disease.
This article, part of the COMPRI_MOv project (FCT-ID:613765655), investigates the association between changes in mobility and the number of COVID-19 cases in Portugal. In addition to investigating the linear correlations between mobility and the number of new cases, it seeks to assess whether it is possible to estimate the number of cases, for different geographical scales, from mobility patterns, laying the foundation for a predictive model. The distinctive aspect of this article lies in the exploration of human mobility and the confirmed cases of COVID-19 through multiple linear regressions using a rolling time window, generating a prediction of the near-future number of cases based on open data and allowing a more effective preparation of the health services response.
The diffusion of COVID-19 in Portugal reveals heterogeneous spatial-temporal patterns, although with a geography consolidated in the metropolitan areas, along the most urbanized municipalities of the coast and in the regional district capitals [11,33]. For several periods Portugal recorded a higher incidence rate than the European average, which makes it a relevant case study for understanding which specific situations and contexts favoured high transmission in the country. For this reason, it is of paramount importance to assess to what extent mobility patterns had local effects at multiple scales as determinants of COVID-19 spread.
This article is organized in four parts. The first corresponds to this introduction, followed by materials and methods, where study area, data and methodologies are presented. The study used three levels of geographical disaggregation: mainland Portugal, the district regions and 4 municipalities (Lisbon, Oporto, Amadora and Vila Nova de Gaia).

Estimation Methodology
The methodological approach to estimate the number of COVID-19 cases was based on a multiple linear regression. For this purpose, the epidemiological data and the six mobility variables made available by Google [34] (variation from the reference value in retail and leisure places, grocery stores and pharmacies, parks, public transport stations, workplaces and homes) were considered as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i$$

where $Y_i$ represents the estimated number of COVID-19 cases for the date $i$, $\beta_0$ is the constant term, $X_{i1}, \dots, X_{ip}$ are the Google mobility explanatory variables (here $p = 6$), $\beta_1, \dots, \beta_p$ are the slope coefficients for each variable and $\varepsilon_i$ is the model's error term.
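As a minimal sketch of the least-squares fit behind such a regression, the following pure-Python function solves the normal equations for a small design matrix. The function name `fit_ols` and the two-variable toy data are hypothetical and serve only as illustration; this is not the script used in the study.

```python
def fit_ols(X, y):
    """Return [b0, b1, ..., bp] minimising the residual sum of squares.

    X is a list of rows (one row per day, one column per mobility variable),
    y is the list of case counts for the same days.
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Normal equations (A^T A) beta = A^T y, stored as an augmented matrix.
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("singular design matrix (collinear variables?)")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]


# Hypothetical toy data: two mobility variables (percentage changes)
# and case counts generated exactly as 50 + 2*x1 - 3*x2.
X = [[-10, 5], [-20, 8], [-5, 2], [0, 1], [-30, 12], [-15, 4]]
y = [50 + 2 * a - 3 * b for a, b in X]
beta = fit_ols(X, y)  # approximately [50, 2, -3]
```

With only 14 daily observations and six explanatory variables per window, the design matrix can easily become close to singular, which is why the sketch raises an error instead of returning unstable coefficients.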
The time between changes in mobility and the change in the number of new confirmed cases was considered in this work with a lag of 14 days. The time lag must accommodate the incubation period and the time needed for official reporting and communication of cases. In similar approaches with daily mobility data, authors have considered a lag of 7 to 28 days [7,18,20,25] in relation to the date of the cases. This means that, for example, with a 14-day lag the number of infected people on September 1 was related to the mobility recorded 14 days earlier, on August 18. Different lags were considered in an exploratory test, but a 14-day lag was the best fit for the Portuguese case. To choose the 14-day block used in the modeling, blocks of different lengths (7 and 21 days) were also tested (Table 1). Considering the high volume of daily data for the different geographic aggregations (country, districts and municipalities), a script (Appendix A) was implemented in Python [35] to reduce data analysis processing time (Figure 2). The scikit-learn library [36] was used to fit a linear model with coefficients w = (w1, ..., wp) that minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
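The lag alignment described above can be sketched in a few lines of Python. This is an illustration only, not the script of Appendix A; the names `mobility_date_for` and `pair_with_lag`, the dictionary layout and the numbers in the example are assumptions.

```python
from datetime import date, timedelta

LAG_DAYS = 14  # lag between mobility and reported cases, as used in the text


def mobility_date_for(case_date, lag_days=LAG_DAYS):
    """Date whose mobility is paired with the cases reported on case_date."""
    return case_date - timedelta(days=lag_days)


def pair_with_lag(cases, mobility, lag_days=LAG_DAYS):
    """Pair each case count with the mobility recorded lag_days earlier.

    cases: dict mapping date -> number of new cases
    mobility: dict mapping date -> list of mobility variables
    Days without a matching mobility record are skipped.
    """
    pairs = []
    for d in sorted(cases):
        src = d - timedelta(days=lag_days)
        if src in mobility:
            pairs.append((d, mobility[src], cases[d]))
    return pairs


# Hypothetical values, for illustration only.
cases = {date(2020, 9, 1): 231, date(2020, 9, 2): 390}
mobility = {date(2020, 8, 18): [-20, 5]}
pairs = pair_with_lag(cases, mobility)
```

Here the cases of 1 September 2020 are paired with the mobility of 18 August 2020, while 2 September is dropped because no mobility record exists for 19 August in the toy data.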
The regression analysis was performed with two different procedures: one based on 14-day blocks, advancing at a step of 14 days (p14), and another based on the immediately preceding 14 days, with a window rolling forward at a step of 1 day (p1) (Figure 3).
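The difference between the two procedures can be illustrated by the index ranges over which each regression would be fitted. The sketch below assumes a series of `n_days` consecutive daily records indexed from 0 and half-open ranges `(start, end)`; the function names are hypothetical.

```python
def block_windows(n_days, size=14):
    """p14: consecutive, non-overlapping blocks of `size` days."""
    return [(s, s + size) for s in range(0, n_days - size + 1, size)]


def rolling_windows(n_days, size=14):
    """p1: the `size` days immediately preceding each day, moving 1 day at a time."""
    return [(e - size, e) for e in range(size, n_days + 1)]


p14 = block_windows(42)   # [(0, 14), (14, 28), (28, 42)]
p1 = rolling_windows(42)  # (0, 14), (1, 15), ..., (28, 42)
```

For the same 42 days, p14 yields three sets of regression parameters while p1 yields 29, one per day, which is consistent with p1 following the observed series more closely.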
Using the regression parameters of a 14-day block, it is possible to predict the evolution of the number of infected people up to at least 14 days into the future. The regression parameters used for forecasting are also of two different types: those resulting from the 14-day block (14-day step) and those from the 14 days immediately preceding (1-day step).
For mainland Portugal, the period analyzed was between March 2020 and March 2021, while for the geographical level of the districts and municipalities the period was between September 2020 and March 2021, due to epidemiological data availability.

Dependent Variable
The epidemiological information was acquired from Directorate-General of Health (Direção-Geral da Saúde, DGS) reports [37]. The way official national epidemiological information is made available by this source greatly limits its use: the information must either be collected manually, which is error-prone and requires manual data editing before any analysis or modelling, or be acquired automatically through scripts, which requires good programming skills. This is not one of the best examples of data-sharing policy compared with other European countries. An example of excellence is the Italian Civil Protection data repository [38], which provides information ready to be processed in CSV/Excel/JSON file formats (https://github.com/pcm-dpc/COVID-19, accessed on 14 March 2022).
Another limitation of DGS data, already explored by Marques da Costa et al. [39], is data inconsistency: for example, loss of synchronization (the sum of new cases differs from the cumulative total), incorrect allocation of cases to territorial units, breaks in the periodicity of disclosure, temporally overlapping series or interruption of the disclosure of certain indicators. Specifically, information at the municipal scale has experienced several problems with the maintenance of the data series. Initially it was made available as the daily number of cumulative confirmed cases. Later, the periodicity changed to weekly and, in November 2020, this indicator was replaced by the 14-day incidence per 100 thousand inhabitants. This change required calculations to determine the actual number of new cases. However, because the indicator is released weekly but represents a period of 14 days, consecutive values overlap, which introduces uncertainty into the calculated data that cannot be validated objectively. Since March 2022 municipal data has no longer been available through DGS reports. Although the information presents quality issues for this study, it is still official information, and it is the only source that the authors considered.
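The conversion and the overlap problem can be made concrete with a small sketch. It assumes that each weekly release reports the sum of the two most recent weeks and that the count of the very first week is known from elsewhere; the function names and all numbers are hypothetical and do not reproduce the calculations actually used in the study.

```python
def cases_from_incidence(incidence_per_100k, population):
    """Number of cases in the 14-day period implied by an incidence value."""
    return round(incidence_per_100k * population / 100_000)


def weekly_from_totals(totals_14d, first_week):
    """Recover weekly counts from overlapping 14-day totals.

    totals_14d[k] is assumed to equal week[k] + week[k + 1], so each new
    week follows from the previous one once a starting value is fixed.
    """
    weeks = [first_week]
    for total in totals_14d:
        weeks.append(total - weeks[-1])
    return weeks


period_cases = cases_from_incidence(480, 500_000)  # 2400 cases in 14 days

totals = [250, 350, 320]                    # overlapping 14-day totals
exact = weekly_from_totals(totals, 100)     # [100, 150, 200, 120]
shifted = weekly_from_totals(totals, 110)   # [110, 140, 210, 110]
```

An error of 10 cases in the assumed first week is carried through the whole series with alternating sign, which illustrates why values derived from overlapping releases cannot be validated from the releases alone.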
The dependent variable is the number of new COVID-19 cases per day. It refers to the daily variation of newly detected confirmed cases of COVID-19, which normally relate to the previous two to three days, unless testing or reporting took longer.

Number of Cases
During the first year of incidence of COVID-19 in Portugal, there were three waves of differing magnitudes (Figure 4), with unequal territorial expressions (Figure 5). In the following subsections, the evolution is described by period.

First wave
As in most European countries, the first wave presented low severity because of early general lockdowns motivated by the uncertainty and lack of knowledge about the disease. It is important to note that testing at this stage was still very limited. Mobility levels reached their minimum at lockdown (Figure 6). Geographically, the diffusion process

Summer
Between June and August 2020, after the first lockdown, the number of infections was low and controlled at the national level. In terms of mobility, working from home was still in place and it was a period of school holidays. However, the municipalities around Lisbon behaved differently, with an incidence about two times higher than that of the remaining regions combined, even though they maintained stricter restrictions than the rest of the country. Mobility patterns came close to pre-pandemic levels.

Second wave
The second wave started in the north of the country but reached the rest of the territory, with special emphasis on the municipalities of the Lisbon metropolitan area and other major cities. The rise in cases coincided with increased mobility associated with the return to school and face-to-face work after the summer holidays. It also coincided with the relaxation of the non-pharmaceutical interventions that limited social interaction.

Third wave
The third wave started immediately after the end of the second wave, and the number of new cases reached the highest values recorded to date in Portugal. This was the wave of greatest magnitude during the first year, with record cases in every municipality. In this context

Independent Variables
Google mobility data [34] from the COVID-19 Community Mobility Reports (https://www.google.com/covid19/mobility/, accessed on 11 April 2021) represents the percentage change in mobility relative to a baseline given by the median of the first 5 weeks of 2020 (3 January to 6 February 2020), considered representative of pre-pandemic mobility patterns. The statistics are created from aggregated and anonymized datasets of users who have enabled the Location History setting in Google applications.
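The meaning of a percentage change relative to a median baseline can be illustrated as follows. Google publishes only the resulting percentages, not the underlying visit counts, so the counts below are hypothetical; the sketch also assumes one baseline per day of the week, as described in Google's documentation of the reports, and the function name `percent_change` is illustrative.

```python
from statistics import median


def percent_change(value, baseline_values):
    """Percentage change of `value` relative to the median of the baseline."""
    base = median(baseline_values)
    return 100.0 * (value - base) / base


# Hypothetical visit counts for the five Mondays of the baseline period.
baseline_mondays = [100, 110, 90, 105, 95]

# A Monday during lockdown with 60 visits: 40% below the baseline.
change = percent_change(60, baseline_mondays)  # -40.0
```

A value of -40 in the reports therefore means 40% fewer visits (or less time spent, for the residential category) than on a typical pre-pandemic day of the same weekday.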
The variation is determined from people's visits to, and length of stay in, six categories of places: retail and recreation, grocery and pharmacy, parks, transit stations, workplaces, and residential places. Retail and recreation covers spaces such as restaurants, cafes, shopping centers, theme parks, museums, libraries and movie theatres. Grocery and pharmacy concerns places selling essential goods, such as grocery markets, food warehouses, drug stores, pharmacies and similar. Parks data considers national parks, beaches, plazas and public gardens. Transit data comes from public transport hubs. Workplaces represents places of work and residential represents residential areas (Figure 6).
Residential-related mobility recorded its largest differences from pre-pandemic levels during the lockdown periods but remained above the reference value throughout most of the data series, with some peaks associated with mobility restrictions. Attendance at workplaces had the opposite evolution, with abrupt negative changes during lockdowns and an almost permanent negative variation throughout the period represented. The use of public transport hubs never recovered to pre-pandemic values, but the reduction was smaller in the period coinciding with the beginning of the second wave, which is associated with greater mobility due to the return to work and school after the summer holidays. Parks recorded attendance well above the reference value, especially in the summer (note the different axis scale), and negative variations coincide with periods of lockdown or very high incidence. Grocery and pharmacy and retail and recreation appear to have a similar evolution, although the former recorded values above the reference for a longer time in the summer and the latter had a more abrupt negative change during lockdowns.
Comparing mobility behavior with the evolution of the number of cases in mainland Portugal, it is visible that the increase of the number of new cases is related to changes in mobility variables.

Model Adjustment
Our results confirm that mobility is positively associated with COVID-19 infection. The relationship between mobility variables and the occurrence of new cases was established according to two methods: by 14-day blocks (step 14, p14) and by the sequence of the 14 days immediately preceding (step 1, p1).
The model adjustment was tested with data from new cases from 1 September 2020 and the first mobility data in the corresponding previous 14 days, that is, from 18 August 2020. The results point to a strong relationship between observed and projected data according to the mobility change values of the previous 14 days.
For the national context (Figure 7), the cumulative curve of estimated cases fits the observed cumulative cases of COVID-19 closely and, as expected, the fit is closest when the estimated values result from the p1 model. The daily estimates confirm the better fit of the p1 model when compared with the real number of new cases (Figure 8).
The data aggregation level and the size of the territorial units are important for the estimation process. The adjustment for Lisbon municipality (Figure 9a-d) is very strong for both the p1 and p14 models.
The results for the municipalities of Oporto and Vila Nova de Gaia (Figure 10a-d) also show the robustness of the method when more extreme behaviors occur, such as those of 3 November and 18 January in these two municipalities, revealing a good sensitivity of the model to abrupt changes in case counts.
The results are significant and follow the evolution trend, except at moments when the incidence reached extreme values. It is more important to predict the trend than the exact number of new cases. The ability to project a given volume of COVID-19 incidence provides a sufficient degree of knowledge about the future evolution of the pandemic, which is essential for its management in terms of the implementation of public health measures. It is also noted that the adjustments vary with the spatial scale under analysis (data aggregation).
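One reason why cumulative curves tend to fit better than daily values is that daily over- and under-estimates partly cancel when summed. The sketch below illustrates this with hypothetical numbers; it is not derived from the data of the study.

```python
from itertools import accumulate

# Hypothetical daily observed and estimated new cases.
observed = [100, 120, 90, 130]
estimated = [110, 105, 100, 125]

# Daily absolute errors: 10, 15, 10, 5.
daily_abs_errors = [abs(o - e) for o, e in zip(observed, estimated)]
mean_abs_daily = sum(daily_abs_errors) / len(daily_abs_errors)

# Cumulative curves, as compared in the national-scale figures.
cum_observed = list(accumulate(observed))    # [100, 220, 310, 440]
cum_estimated = list(accumulate(estimated))  # [110, 215, 315, 440]
final_gap = abs(cum_observed[-1] - cum_estimated[-1])
```

In this toy series the mean daily error is 10 cases, yet both cumulative curves end at 440, so the trend is reproduced even though no single day is estimated exactly.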

Linear Correlations
To analyze the correlations between mobility and the number of cases, 3 dates were chosen, corresponding to the phases before, during and after the 3 pandemic waves. In general, the correlations on all dates appear significant (>0.5), with special emphasis on 15 March 2020 and 20 February 2021 (Table 2). The day with the lowest correlation (0.26) corresponds to 11 October 2020, obtained using the p14 model. In order to verify regional differences in the correlations, the analysis was disaggregated at the district level (Table 3). The first aspect to be highlighted is the absence of mobility data for some municipalities, which made it impossible to analyze all municipalities; as an alternative, the 18 Portuguese districts were considered. On all dates the correlations are significant, with emphasis on 25 January 2021, which in both models shows correlations greater than 0.6. In geographic terms, although there is no significant differentiation, the districts of Aveiro, Braga, Santarém and Porto have higher average correlations than the other districts. The less populated district of Bragança stands out for its weak correlations on all dates. The p1 model, as expected, shows higher correlations than the p14 model on four of the six days analyzed. Correlations appear lower when there were information gaps in the mobility data series at the time of processing. The procedures for filling these gaps may have contributed to this, since the samples are smaller in these regions and may therefore be far from representing the real patterns.
According to Table 3, Braga presents a higher average correlation in the third wave, which is the moment when the models performed best in most municipalities, with Évora reaching a correlation coefficient of 0.92.
The model's adjustment improves as the pandemic progresses, since in the first two waves not all municipalities had confirmed cases yet and the spatial patterns revealed "coastalization" and "bipolarization" [11]. High disparities within districts may be responsible for weak associations in the first moments. The improving fit of regression models as the disease spreads over space and time is also evident in the study by Sousa et al. [15] for mainland Portugal.
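For reference, a linear correlation coefficient of the kind reported in this section can be computed as sketched below. The text does not state which coefficient or which pairs of series were used, so the Pearson coefficient between two equally long series is an assumption, as is the function name `pearson`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical series: a perfectly linear and a perfectly inverse relationship.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
```

With windows of only 14 observations, as in the p14 and p1 procedures, a single atypical day can change the coefficient noticeably, which may help explain the low values on isolated dates.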

Forecast of Values for the Following 14 Days
A 14-day forecast was produced with both methods: blocks of 14 days and the 14 immediately preceding days. That is, the last known regression parameters were applied to mobility values to forecast the values of the following 14 days. The fit is good, as can be seen for mainland Portugal and the municipalities of Lisbon, Porto and Amadora (Figure 11a-d). Although the prediction method has some difficulty adjusting to sudden changes in epidemiological data (number of new cases), it yields a meaningful 14-day forecast that fits the observed data well.
Negative values were predicted with step14. They result from the model's need to adjust after periods of high incidence, given that there is no daily moving window. The projection of negative values is not necessarily a problem, since this is how the model adjusts to peak variations.
Figure 12a,b are histograms of the absolute differences between the observed and estimated number of cases, calculated for models p1 and p14. The absolute errors for the p1 (Figure 12a) and p14 (Figure 12b) models applied to mainland Portugal show a wider error interval in the p14 model than in the p1 model. In both cases the highest frequency of errors is associated with low error values. Although p1 fits better, as the correlations show, absolute errors are less frequent in p14, even though their amplitude is larger in that model.
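The forecasting step described in this section can be sketched as follows, under one reading of the procedure: a multiple linear regression is fitted on the most recent pairs of mobility on day t and cases on day t + 14, and the fitted coefficients are then applied to the last 14 days of mobility. The names `solve`, `fit_ols`, `predict` and `forecast_next`, the 28-day training window and the synthetic data are assumptions made for illustration; they are not the authors' implementation.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Ordinary least squares with intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Apply intercept and slopes to one day of mobility values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def forecast_next(mobility, cases, lag=14, window=28):
    """Fit on the latest `window` (mobility[t], cases[t + lag]) pairs, then
    apply the coefficients to the last `lag` days of mobility."""
    n = len(cases)
    train_days = range(n - lag - window, n - lag)
    coefs = fit_ols([mobility[t] for t in train_days],
                    [cases[t + lag] for t in train_days])
    return [predict(coefs, mobility[t]) for t in range(n - lag, n)]


# Synthetic data: two hypothetical mobility categories per day, and cases
# that depend exactly linearly on mobility 14 days earlier.
mobility = [[float(t % 7), float((3 * t) % 11)] for t in range(60)]
cases = [0.0] * 14 + [50 + 2 * m[0] - m[1] for m in mobility[:-14]]
forecast = forecast_next(mobility, cases)
```

Nothing in a linear model of this kind constrains the output to be non-negative, which is consistent with the negative projections noted above. For real data, a library least-squares routine would be more numerically robust than the normal equations used here.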

Validation of Models
To quantitatively evaluate the accuracy of the case estimates, three accuracy measures were calculated: Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) (Table 4). The p1 model has lower errors than the p14 model in most districts. However, in some districts (Bragança, Castelo Branco, Coimbra, Évora, Porto and Santarém) the results of the p14 model are superior. The districts of Lisbon and Porto stand out for their high absolute error (119 and 116) and the district of Santarém for its high absolute and relative error (622 and 0.795). The RMSE is higher in the districts of Lisbon and Porto, influenced by the major cities of these metropolitan regions, which have socio-territorial specificities and a dynamism that tends to aggravate the transmission of COVID-19 [11,40,41], and where mobility, despite being relevant, may not be the most relevant determinant.
The Santarém district presents, by all metrics, the highest residual values, followed by Lisbon and Porto. While the case of the latter two was already explained, the atypical and extreme case of Santarém may be explained by gaps in the mobility data. Better results were obtained in Castelo Branco, Évora and Bragança, where the number of new cases maintained a stable trend and never reached high incidence values.
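For reference, the three accuracy measures can be computed as in the sketch below. The function names and the sample values are illustrative, and skipping days with zero observed cases in the MAPE is an assumption, since the text does not say how such days were handled.

```python
from math import sqrt


def mae(observed, estimated):
    """Mean Absolute Error."""
    return sum(abs(o - e) for o, e in zip(observed, estimated)) / len(observed)


def rmse(observed, estimated):
    """Root Mean Square Error."""
    squared = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return sqrt(squared / len(observed))


def mape(observed, estimated):
    """Mean Absolute Percentage Error, returned as a fraction.

    Days with zero observed cases are skipped because the relative
    error is undefined there (an assumption of this sketch).
    """
    pairs = [(o, e) for o, e in zip(observed, estimated) if o != 0]
    return sum(abs((o - e) / o) for o, e in pairs) / len(pairs)


# Made-up values, for illustration only.
observed = [100, 200, 400]
estimated = [110, 180, 400]
errors = (mae(observed, estimated),
          rmse(observed, estimated),
          mape(observed, estimated))
```

The RMSE penalizes large individual errors more heavily than the MAE, which is one reason why districts with sharp incidence peaks can rank differently under the two measures.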

Discussion and Conclusions
This article has explored the effects of mobility (measured by Google) on COVID-19 cases by using daily data across several geographic scales in Portugal, covering the period between March 2020 and March 2021.
Here we have demonstrated, using freely available mobility data and official epidemiological data, that linear regression models make it possible to estimate the number of cases. Using two models (p1 and p14), implemented in Python scripts to automate the calculations, we found that results can be obtained in a simple way for extended data ranges and for different levels of geographic aggregation. It is important to mention that the procedures refer to a situation in which vaccination coverage was still low in Portugal.
The results of this work point to the existence of a proportional relationship between changes in mobility patterns and the propagation of the SARS-CoV-2 virus. This association was established between the number of new cases and changes in mobility volumes with a lag of 14 days. Based on this relationship, the projection of values for the near future, a 14-day horizon, is possible with margins of accuracy acceptable for precautionary public health decisions.
The use of mobility data in relation to the incidence of COVID-19 in Portugal demonstrates a strong relationship, suggesting that more mobility is associated with more COVID-19 cases. However, this relationship does not imply that mobility is the only cause of transmission. The association is supported by studies claiming that reduced mobility in Portugal contributes to lowering the disease's effective reproduction number [5,42], so that incidence falls in the following weeks [43]. Other factors associated with transmission of the virus, such as the use of masks, social distancing or vaccination, are not part of this model. Nevertheless, the advantage of the developed models lies in their ease of implementation and use when compared with more complex epidemiological models, and in the possibility of applying them in contexts with data gaps.
A key insight from our work is a strong capacity to forecast COVID-19 cases 14 days ahead, which addresses a lack of studies for Portugal projecting the future evolution of COVID-19 cases at multiple scales of analysis with mobility data. The methodology can be used to develop an epidemiological surveillance system that predicts the evolution of the pandemic using "near real-time" mobility data, supporting decision-making processes related to public health and non-pharmaceutical measures to contain the spread of COVID-19. This approach does not have to be limited to COVID-19 and can be replicated for other infectious diseases, as in other studies for influenza [17,18], provided that the optimal lag is properly determined considering the epidemiological information and the type of mobility data.
The results achieved are in line with similar works. For example, Ilin et al. [28] used statistical models to generate 10-day forecasts of COVID-19 cases supported by Google mobility data, having verified a good adjustment of the models to local data. Kishore et al. [44] explored the use of Google data to assess the role of mobility in spreading COVID-19 infection in India. The authors observed a high correlation coefficient between epidemiological and mobility indicators for the lockdown and unlock phases.
Better epidemiological information, in terms of dissemination format, periodicity and spatial resolution, is necessary to achieve more detailed results and scientific evidence in the study of epidemiological phenomena, today associated with SARS-CoV-2 and in the future with other infectious diseases that will certainly occur sooner or later.
The work developed depends on the existence of epidemiological and mobility data. Regarding epidemiological data, a critical aspect in the development of models is data quality. As noted by Tamagusko and Ferreira [42], the number of infected individuals confirmed daily may not correspond to the real state of the disease, because the number of confirmed infections depends on the number of tests performed, and the criteria adopted to test the population were not well explained. The number of cases from official sources is highly dependent on the degree of testing performed, often with severe territorial disparities influenced by context factors [45]. Lack of information quality control can be responsible for biases that make results subject to ecological fallacy [46], and modifiable areal unit issues are common with epidemiological data [47], especially for the distribution of COVID-19 cases [48]. This bias does not allow cause-effect relationships to be identified. However, since mobility is a proxy of social interaction, which is the real driver of the spread of contagious diseases, we believe that the correlations identified are a first step for future studies exploring causal inference. Another aspect is that the data are not made available in a format that is easily manipulated (human-readable and machine-readable) and that can be submitted to analysis and modelling tasks or integrated in a geographic information system, which limits fast and accurate data usage [39].
Some delay in Google's updates of the latest data can also prevent information from being obtained in time to forecast for the Portuguese territory as a whole. One aspect that deserves special attention in works that use these data is the possible bias arising from the users who generate mobility data not representing the total population, because the sample depends on users of Google services consenting to location sharing. Naturally, this issue is critical in regions where the use of mobile phones is not common practice. Another possible limitation is that the mobility data were used in raw form. In contrast with other studies [5] that use techniques to smooth weekly patterns in the series (influenced by weekends, holidays, etc.), no transformation or standardization was performed, which could change the results.
A last potential limitation is the linear approach, since there are studies that use other types of regression, such as polynomial [49,50], to predict the trend of new cases.
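To illustrate the kind of smoothing that was not applied here, the sketch below computes a trailing 7-day moving average, one common way of damping weekly patterns. It is not necessarily the technique used in [5]; the function name `rolling_mean` and the synthetic series are hypothetical.

```python
def rolling_mean(series, window=7):
    """Trailing moving average; the first window - 1 positions are None."""
    out = [None] * len(series)
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            # Drop the value that has just left the window.
            running -= series[i - window]
        if i >= window - 1:
            out[i] = running / window
    return out


# Synthetic mobility change (%) with a weekend dip repeating every 7 days.
raw = [-10, -10, -10, -10, -10, -30, -40] * 3
smooth = rolling_mean(raw)
```

With a window equal to the weekly period, every complete window contains exactly one weekend, so the smoothed series is flat for this synthetic input while the raw series oscillates.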
Having found high correlations between mobility and the number of cases, future research should explore the effect of different degrees of vaccination coverage on the evolution of the number of cases, as well as additional sources of near real-time mobility data, such as mobile phone or car data. Open data was indispensable in this work, and the institutions that produce and disseminate it should invest in better data sharing policies.

Data Availability Statement:
The epidemiological data used in the study are available at https://covid19.min-saude.pt/relatorio-de-situacao/ (accessed on 28 January 2022). Google mobility data were downloaded from https://www.google.com/covid19/mobility/ (accessed on 11 April 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The code developed in Python and the processed data can be accessed in the repository https://github.com/nmileu/compri_mov (accessed on 1 May 2022).