Detecting Destroyed Communities in Remote Areas with Personal Electronic Device Data: A Case Study of the 2017 Puebla Earthquake

: Large-scale humanitarian disasters often disproportionately damage poor communities. This e ﬀ ect is compounded when communities are remote with limited connectivity and response is slow. While humanitarian response organizations are increasingly using a wide range of satellites to detect damaged areas, these images can be delayed days or weeks and may not tell the story of how many or where people are a ﬀ ected. In order to address the need of identifying severely damaged communities due to humanitarian disasters, we present an algorithmic approach to leverage pseudonymization locational data collected from personal cell phones to detect the depopulation of localities severely a ﬀ ected by the 2017 Puebla earthquake in Mexico. This algorithm capitalizes on building a pattern of life for these localities, ﬁrst establishing which pseudonymous IDs are a resident of the locality and then establishing what percent of those residents leave those localities after the earthquake. Using a study of 15 localities severely damaged and 15 control localities una ﬀ ected by the earthquake, this approach successfully identiﬁed 73% of severely damaged localities. This individual-focused system provides a promising approach for organizations to understand the size and severity of a humanitarian disaster, detect which localities are most severely damaged, and aid them in prioritizing response and reconstruction e ﬀ orts.


Introduction
On September 19, 2017, a 7.1 magnitude (M) earthquake occurred in Puebla, Mexico [1]. Mexico City and the states of Puebla and Morelos sustained significant damage to the infrastructure and population due to the densely populated region [1]. The United States Agency for International Development (USAID) and the Pan American Health Organization (PAHO) estimated that at least 43,000 buildings were destroyed or sustained significant damage, 6100 people were injured, 366 people were killed, and hundreds missing [2]. The most impacted areas were in the rural parts of the region, making it difficult for response agencies to identify and respond to communities with the most need. The scale of the destruction was significant and distributed geographically, with thousands of residents left homeless and others sleeping on the streets in fear of aftershocks [2].
Following large-scale humanitarian disasters, challenges emerge regarding prioritizing emergency response and aid provisions. With natural disasters, there is generally a large displaced population seeking shelter and in need of assistance. It is often instinctive for people to flee the affected area after some time; however, this can complicate relief efforts and increase the chance of mortalities [3]. Traditional methodologies like witnesses' interviews or satellite imagery are commonly used among relief organizations to determine the post-destruction and estimate the number of mortalities or missing people [3]. However, they are often slow, potentially biased, and unreliable [3].
With the occurrence of natural disasters increasing over time, new spatial and geostatistical perspectives have been adopted. Earthquake risk and damage modeling have specifically seen a large growth. In 2010, Sahar, Muthukumar, and French used geographic information systems (GIS) and algorithms to extract two-dimensional building shapes from aerial imagery for better earthquake risk assessment modeling [4]. Feng et al. showed the combined use of remote sensing with GIS building data to detect the three-dimensional destruction of buildings and estimate the number of potential casualties [5]. Newer efforts have shown the ability to preemptively assess expected damage in urban areas using earthquake building codes, which can lead to more efficiently placed assistance camps or evacuation routes [6]. Recently, there has been a shift towards more geo-computational approaches. Hossain et al. leveraged smart watch data and GIS technologies to algorithmically capture the heart rate of the earthquake victims in order to identify critical areas for search and rescue [7]. Machine learning and neural network approaches are being tested to improve previous earthquake prediction models [8]. Other approaches have incorporated crowd-sourced data, such as emergency volunteer mapping and social media geotags to detect where the largest damage has occurred in humanitarian emergencies or constellations of small satellites to detect significantly damaged villages on a daily basis [9][10][11]. The National Geospatial Agency is attempting to automate the extraction of areas in need of humanitarian assistance by leveraging artificial intelligence on high-resolution imagery [12]. All of these methods use overarching rather than personal location data to provide insight into humanitarian disasters.
The literature shows that call detail records (CDRs) have been useful in measuring broad spatial population characteristics and migration in the private and public sectors. Using triangulation techniques, CDR data can produce a geolocation at the time a call or text is made and evaluate a population mobility and social network, sometimes down to an individual scale [13]. Telecommunication companies are constantly analyzing CDRs to monitor their market penetration and economic success. Researchers continue to explore how governments and private organizations can benefit from more timely population and migration estimates by using CDRs, especially in developing nations where such knowledge can inform policy, but the costs of collecting data may be significant. For example, Salat, Smoreda, and Schläpfer developed methods for extrapolating population densities from CDRs from tracing weekly, monthly, and yearly patterns of mobile phone use in Senegal [14]. Zufiria et al. also utilized mobile phone data from Senegal and found that aggregated mobility profiles based on likely livelihoods can shed light on economic activity, agricultural cycles, and precipitation, and thus, seasonal migration [15]. Lai et al. further directly applied CDR data and found that it can complement national statistics, especially in countries with high rates of internal migration, to ensure that public services are appropriately deployed [16]. Cell phone data is effective in detecting and depicting established patterns of life within a country.
Call detail records have also been extensively employed to complement various organizations' understandings of population movements when these patterns of life are disturbed in the wake of specific environmental and epidemiological crises. In a particularly relevant example, Bengtsson et al. (2011) analyzed locational data from CDRs to find that 630,000 people left Port-au-Prince within a 19-day period after the 2010 Haiti earthquake [3]. Similarly, Wilson et al. (2016) utilized deidentified mobile CDRs and algorithmic transition matrices to identify population flows in and out of Kathmandu Valley in the first few weeks after the 2015 Gorkha earthquake in Nepal [17]. Pastor-Escuredo et al. showed the viability in using CDRs to characterize impacts of flooding in Tabasco, Mexico in 2009 [18]. Andrade et al. demonstrated in their analysis of a 2016 earthquake centered in Manabi, Ecuador that using aggregated activity through call towers can protect individual users from privacy concerns while assessing the extent of urban infrastructure damage and providing insight into patterns of mobility depending on the user's proximity to the epicenter of the earthquake [19]. Horanont et al. (2013) used 9.2 billion location records, derived from an auto-global navigation satellite system (GNSS) service from a telecommunications company in Japan, to analyze human mobility patterns after the 2011 Great Japan Earthquake [20]. CDR data has also been utilized in applications of epidemics and disease control. Peak et al. (2018) used CDRs to algorithmically investigate the decrease of travel during the Ebola epidemic intervention in Sierra Leone [21]. These studies were able to detect the mobility and behavior patterns of the population after a large-scale natural disaster, showing a promising method to prepare for damage assessments and post-disaster responses.
Another type of telecommunications data that can be used in similar applications to CDRs is personal electronic device (PED) data. With the expansion of mobile phone accessibility in the global South, PED data has been leveraged to assist in the challenges faced in large-scale humanitarian emergencies. The use of PED data has developed in parallel with CDR data and is gathered instead from global navigation satellite systems (GNSSs) in smart devices (i.e., cell phones, smart phones, smart watches, smart tablets, etc.) [22]. PED data differs from the previously used CDR data, as it collects information about a location without depending on the transmission of communications (calls and/or texts) and, therefore, yields a higher level of accuracy of location precision. Yabe et al. (2019) utilized PED data of one million users following the Kumamoto earthquake to estimate the evacuation rates relative to seismic intensities [23]. Chen et al. (2020) utilized PED data from Baidu Map to track urban flow changes in Shenzhen during Typhoon Mangkhut [24]. More recently, with the COVID-19 pandemic, PED data is being used to track the mobility, transmission, and success of social-distancing guidelines. Liautaud, Huybers, and Santillana (2020) leveraged PED data to analyze the decrease in mobility with fever incidences from thermometers connected to smartphones [25]. It confirmed that social distancing has reduced transmission of the virus and could help identify potential outbreaks in the future.
PED data provides extremely rich spatiotemporal data on human mobility and can be used in many multidisciplinary applications, such as natural disasters, public health, credit fraud, human rights violations, etc. [26]. Companies like LocationSmart (www.locationsmart.com), Foursquare (www. foursquare.com), or Cuebiq (www.cuebiq.com) sell offline location analytics for businesses to provide consumer insights and marketing. Organizations like UNICEF and the World Bank also leverage this locational data for real-time humanitarian responses [27] (Figure 1). ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 3 of 20 applications of epidemics and disease control. Peak et al. (2018) used CDRs to algorithmically investigate the decrease of travel during the Ebola epidemic intervention in Sierra Leone [21]. These studies were able to detect the mobility and behavior patterns of the population after a large-scale natural disaster, showing a promising method to prepare for damage assessments and post-disaster responses.
Another type of telecommunications data that can be used in similar applications to CDRs is personal electronic device (PED) data. With the expansion of mobile phone accessibility in the global South, PED data has been leveraged to assist in the challenges faced in large-scale humanitarian emergencies. The use of PED data has developed in parallel with CDR data and is gathered instead from global navigation satellite systems (GNSSs) in smart devices (i.e., cell phones, smart phones, smart watches, smart tablets, etc.) [22]. PED data differs from the previously used CDR data, as it collects information about a location without depending on the transmission of communications (calls and/or texts) and, therefore, yields a higher level of accuracy of location precision. Yabe et al. (2019) utilized PED data of one million users following the Kumamoto earthquake to estimate the evacuation rates relative to seismic intensities [23]. Chen et al. (2020) utilized PED data from Baidu Map to track urban flow changes in Shenzhen during Typhoon Mangkhut [24]. More recently, with the COVID-19 pandemic, PED data is being used to track the mobility, transmission, and success of social-distancing guidelines. Liautaud, Huybers, and Santillana (2020) leveraged PED data to analyze the decrease in mobility with fever incidences from thermometers connected to smartphones [25]. It confirmed that social distancing has reduced transmission of the virus and could help identify potential outbreaks in the future.
PED data provides extremely rich spatiotemporal data on human mobility and can be used in many multidisciplinary applications, such as natural disasters, public health, credit fraud, human rights violations, etc. [26]. Companies like LocationSmart (www.locationsmart.com), Foursquare (www.foursquare.com), or Cuebiq (www.cuebiq.com) sell offline location analytics for businesses to provide consumer insights and marketing. Organizations like UNICEF and the World Bank also leverage this locational data for real-time humanitarian responses [27] (Figure 1). This study details an approach using pseudonymization PED data to detect locality depopulation from the 2017 large-scale earthquake in Puebla, Mexico. The data was preprocessed This study details an approach using pseudonymization PED data to detect locality depopulation from the 2017 large-scale earthquake in Puebla, Mexico. The data was preprocessed using Python algorithms and loaded into a PostgreSQL database. The approach then used a system of algorithms to detect when the residents left the locality. The algorithms first identified the residents of the localities, compared the number of residents each day, and then analyzed how the population average changed over time to detect communities that were depopulating. By using communities close to the earthquake's epicenter, as well as similarly sized communities away from the epicenter, this approach accurately showed that communities near the earthquake rapidly depopulated as a result of the earthquake. Currently, there is limited research on the use of PED data tracking mobility during a natural disaster. However, this study seeks to address the gap that exists and describes an approach that provides humanitarian response organizations an affordable, accurate, and automated approach to detect which communities are impacted the hardest by a large-scale humanitarian disaster. Such an approach could find would likely be more valuable in areas that lack other means of reporting (lack of capacity) or where the local government/authorities do not want to share information with the international community (lack of transparency).

Personal Electronic Device (PED) Data
The location analytics company Cuebiq Inc. provided access to a sample of Central Mexico's pseudonymization and privacy-enhanced PED locational data between 4 September to 10 October, 2017. Data was collected in both online and offline modes, so if the connection was lost with the proximate cellular towers, locations would still be recorded and included later within the dataset. Individual devices, based on their international mobile equipment identity (IMEI), were pseudonymized, and their locations can be plotted for one day, indicating patterns like traveling over time ( Figure 2). The positioning data was collected by the individual app's location services using a variety of methods to collect the IMEI's location. The data was housed in a physically secured computing research facility behind firewalls. The analysis was conducted on this remote server by logging in with personal laptops. The first-party data collected by Cuebiq contained 8 different columns: ID, device type, noise type, latitude, longitude, distance from previous data point, timestamp, and accuracy. With potential privacy information in the PED data, Cuebiq applies procedures to ensure privacy and different layers of protection for all users. For this dataset, data was deidentified by hashing and encrypting the ID, and a noise of 600m was added for home locations (within a geohash grid) and between 20-100 m for all other locations. This noise was added to the dataset to further anonymize specific users. For each type of location, a different privacy methodology was applied. Home and work locations were randomized within census blocks, allowing for the estimation of demographics without actually revealing the locations of the users; sensitive Points of Interest (POIs), such as primary schools, sexual/reproductive health clinics, places of worship, etc., were removed from the dataset completely; whitelisted POIs (commercial and public points of interest) were unchanged; and no-match POIs (all other data points) had a noise of 20-100 m based on the density of data points within the area.

Earthquake Layer
Following the earthquake, the European Commission's Emergency Response Coordination Centre (ERCC) produced a detailed map depicting the relative intensity of the earthquake across and beyond the Puebla state [29]. Using a modified Mercalli scale, the ERCC highlighted areas that experienced "Very Strong" (VII), "Strong" (VI), "Moderate" (V), and "Light" (IV) shaking. From this information, we selected the 15 experimental or damaged localities that fell within the "Very Strong" (VII) or "Strong" (VI) areas. For the control localities, we selected 15 localities that were entirely outside of even ISPRS Int. J. Geo-Inf. 2020, 9, 643 5 of 19 the most expansive "Light" (IV) areas ( Figure 3). This intensity map served as our ground reference data, delineating villages severely affected and those unaffected.

Earthquake Layer
Following the earthquake, the European Commission's Emergency Response Coordination Centre (ERCC) produced a detailed map depicting the relative intensity of the earthquake across and beyond the Puebla state [29]. Using a modified Mercalli scale, the ERCC highlighted areas that experienced "Very Strong" (VII), "Strong" (VI), "Moderate" (V), and "Light" (IV) shaking. From this information, we selected the 15 experimental or damaged localities that fell within the "Very Strong" (VII) or "Strong" (VI) areas. For the control localities, we selected 15 localities that were entirely outside of even the most expansive "Light" (IV) areas ( Figure 3). This intensity map served as our ground reference data, delineating villages severely affected and those unaffected.

Study Area
Puebla State is approximately 13,000 square miles, surrounded by the states of Morelos, Veracruz, Guerrero, Oaxaca, Hidalgo, and Tlaxcala. The state is known for its mountainous geography and has benefited from rich volcanic soil for agriculture. While the capital of Puebla, Puebla City (Ciudad de Puebla), is more developed, wealth disparities exist in rural localities within the state. This analytic approach relies on both control and experimental localities having a significant number of residents consistently using apps where their locational data is shared and accessible by Cuebiq. Mexico's active smartphone penetration, limited to individuals who use their phone at least once a month, extends to just 40.7 percent of the population [30]. Since the usage of location-sharing apps was likely less prevalent in 2017, especially in rural areas, 38 experimental locality candidates were selected from the study area to be possibly used in the study. In the selection of the experimental

Study Area
Puebla State is approximately 13,000 square miles, surrounded by the states of Morelos, Veracruz, Guerrero, Oaxaca, Hidalgo, and Tlaxcala. The state is known for its mountainous geography and has benefited from rich volcanic soil for agriculture. While the capital of Puebla, Puebla City (Ciudad de Puebla), is more developed, wealth disparities exist in rural localities within the state. This analytic approach relies on both control and experimental localities having a significant number of residents consistently using apps where their locational data is shared and accessible by Cuebiq. Mexico's active smartphone penetration, limited to individuals who use their phone at least once a month, extends to just 40.7 percent of the population [30]. Since the usage of location-sharing apps was likely less prevalent in 2017, especially in rural areas, 38 experimental locality candidates were selected from the study area to be possibly used in the study. In the selection of the experimental localities, candidates were chosen based on a population size between 2000 to 6000 from the 2010 Census and their proximity to the epicenter of the earthquake [29,31].
Eighty-two control candidate localities were also based on the same population criteria as the experimental dataset but chosen from other Mexican states that were not within the range of the earthquake (Puebla, Oaxaca, and Guerrero). In order to establish the residents of a particular locality, as well as determine the percentage of residents who left that locality on a given day, a geometric buffer was manually created around each locality. These "geofences" were created visually from satellite imagery on Google Earth to capture nearly all of the inhabited area while excluding as much uninhabited hinterland as possible. The size of the village was not considered, only the visually developed area. Given the heterogeneity of the population size and locality shape, each locality was assigned a unique buffer, ranging from approximately 500 to 4000 in diameter ( Figure 3). Of these candidates, 15 control and 15 experimental localities were used in the final analysis ( Figure 4) once they were assessed for market penetration.
localities, candidates were chosen based on a population size between 2000 to 6000 from the 2010 Census and their proximity to the epicenter of the earthquake [29,31].
Eighty-two control candidate localities were also based on the same population criteria as the experimental dataset but chosen from other Mexican states that were not within the range of the earthquake (Puebla, Oaxaca, and Guerrero). In order to establish the residents of a particular locality, as well as determine the percentage of residents who left that locality on a given day, a geometric buffer was manually created around each locality. These "geofences" were created visually from satellite imagery on Google Earth to capture nearly all of the inhabited area while excluding as much uninhabited hinterland as possible. The size of the village was not considered, only the visually developed area. Given the heterogeneity of the population size and locality shape, each locality was assigned a unique buffer, ranging from approximately 500 to 4000 in diameter (Figure 3). Of these candidates, 15 control and 15 experimental localities were used in the final analysis ( Figure 4) once they were assessed for market penetration.

Methodology
The analytic workflow processed each candidate locality through a series of scripts. If that locality met the requirements detailed below, it was used in the final dataset ( Figure 5).

Database Preparation
For 14.4 GB of locational data, PostgreSQL's suitability for more structured, robust, and tabular data made for our database of choice. The preparation of the data consisted of: (1) Converting the provided data into a readable csv file by running a Python conversion script. This code unzipped the daily csv files, inserted the corresponding column names, and then saved the data into a new csv file. Each day of data for the 37 days consisted of more than a hundred csv files, depending on the amount of data for that day.
(2) Loading the data into the database using a custom-scripted Python algorithm. The script connected to the PostgreSQL database, iterated through the csv files, and then imported the data into the table, utilizing the pyscopg2 Anaconda package. It took approximately 6.5 h to load an initial 225,962,016 rows that represented 37 days' worth of data.
(3) Creating a Python script that connected the database to a Jupyter Notebook in order to conduct large-scale operations on the data.

Methodology
The analytic workflow processed each candidate locality through a series of scripts. If that locality met the requirements detailed below, it was used in the final dataset ( Figure 5).

Methodology
The analytic workflow processed each candidate locality through a series of scripts. If that locality met the requirements detailed below, it was used in the final dataset ( Figure 5).

Preprocessing
In order to improve the code runtime and discard noise in the dataset, a temporal preprocessing (TP) algorithm was developed and implemented. The initial dataset contained 774,343 unique IDs. Preprocessing was established to only analyze those IDs where a pattern-of-life analysis could be conducted. This algorithm was designed to remove IDs and all their associated "hits" or records that did not have a significant presence throughout the dataset ( Figure 6). First, the algorithm converted each record's epoch timestamp into a YYYY-MM-DD HH:MM format (ex: 2018-10-12 11:33), which was then truncated to display just the day of the month (ex: 12). The algorithm then checked for how many distinct days each ID was present in the dataset. If an ID was detected for at least 4 distinct days of the 37 complete days (from 4 September to 11 October) present in the dataset) then the records associated with the ID were kept. If not, the records associated with that ID were removed from the dataset. Four was chosen as the minimum number of days to establish that an ID lived in a specific village but kept as many unique IDs as possible. This reduced the dataset by 8,271,655 rows to a total of 217,690,361 rows-a total reduction of 3.6%. Running this preprocessing algorithm took 5 h on a PowerEdge R530 Server with two Intel Xeon E5-2695 2.1GHz processors. Removing IDs without a sustained presence in the dataset ensured that our study could derive meaningful trends from a reasonable number of data points. Before preprocessing, the IDs had an average of 7.86 days with a record. Preprocessing eliminated many IDs with 1 or 2 total records, bringing up the average days that an ID was present to 13.48 (Table A1 (Figure 7). Removing IDs without a sustained presence in the dataset ensured that our study could derive meaningful trends from a reasonable number of data points. Before preprocessing, the IDs had an average of 7.86 days with a record. Preprocessing eliminated many IDs with 1 or 2 total records, bringing up the average days that an ID was present to 13.48 (Table A1 (Figure 7).

Resident Determination Algorithm
The resident determination algorithm established an indicator for residency in a locality. A resident of a particular locality was defined as a unique ID that was present within that locality's buffer at least 3 different pre-earthquake days (4 September to 18 September). This processing script first augmented the previously temporally preprocessed data with 3 additional columns. The first created geography type points corresponding to the latitude and longitude originally provided with each record. The second and third, respectively, extracted the day of the year and the hour of the day associated with each record. The subset of the data recorded before the earthquake was then copied into a second table.
For each control and experimental locality, new copies of the pre-earthquake and the temporally preprocessed data were created. All points within both subsets that did not intersect with a spatial buffer manually calibrated for the physical size of the associated town were discarded. This geofence was adjusted for the scrambled nature of the coordinates provided by Cuebiq, adding a 610-m margin to account for points thrown outside of the buffer that would have been detected inside the buffer otherwise. Removing IDs without a sustained presence in the dataset ensured that our study could derive meaningful trends from a reasonable number of data points. Before preprocessing, the IDs had an average of 7.86 days with a record. Preprocessing eliminated many IDs with 1 or 2 total records, bringing up the average days that an ID was present to 13.48 (Table A1 (Figure 7). The algorithm attempted to balance suppressing false positives and extracting as many residents from the data as possible. With the temporal restraints of the data preventing the establishment of a baseline residency, the algorithm aimed to capture as many individuals with high and consistent levels of activity throughout the dataset as possible by defining an ID as a resident if they were recorded within the geofence for at least 20 percent of the unique pre-earthquake days (3 days). Since residents may leave town for short periods of time, the algorithm's 20 percent threshold ensures the detection of those with strong geospatial connections to the locality in question. The IDs defined as residents of a locality were then stored in a separate table.

Market Penetration Criteria
Once each locality's residents were established, the workflow next ensured that each experimental and control locality met a market penetration standard for further analysis. This threshold was set to filter out localities with less than 3 identified residents.
In order to obtain 15 control and 15 experimental localities, this study examined 82 control candidates and 38 experimental candidate localities. Since smartphone penetration in Mexico is relatively lower in rural areas, we expect that a similar analysis run on more current data will yield a larger number of qualifying localities.

Resident Slope Analysis
For the 15 experimental localities and 15 control localities, our approach sought to identify localities that experienced a decrease of residents after the earthquake. In response, a slope analysis of the residents before and after the earthquake was conducted. A slope analysis showed the decrease of residents in each locality through the difference between slopes and was chosen because the method provided a better means of accounting for a more gradual migration out of a given locality. This slope difference for each locality was calculated using the LINEST function in Microsoft Excel. Localities that experienced a decrease in population of greater than 24% from before the earthquake to after the earthquake were "alerted" as being rapidly depopulated (Table A3). This threshold of 24% was set to maximize the overall accuracy for the 30 villages.

Validation
From the 15 experimental localities, any locality with a decrease in population of greater than 24% was "alerted" as being rapidly depopulated due to the earthquake. Any locality that did not fit this criterion was considered an omission error. Likewise, 15 localities outside of the earthquake zone and between the same sizes as the experimental dataset were chosen at random to assess the commission errors. If any of the control localities also recorded a decrease in population of greater than 24% (−0.24), they were considered "alerted" or a commission error (Figure 8). this criterion was considered an omission error. Likewise, 15 localities outside of the earthquake zone and between the same sizes as the experimental dataset were chosen at random to assess the commission errors. If any of the control localities also recorded a decrease in population of greater than 24% (−0.24), they were considered "alerted" or a commission error (Figure 8). Figure 8. This approach detected a locality as depopulating if the pre-earthquake slope of residents was 24% lower than the post-earthquake slope of residents. In this figure, the experiment locality Huaquechula had a consistent percent of residents (6% slope) before the earthquake but saw a steady decline in residents after the earthquake (−19%), while the control locality El Terrero maintained a steady percent of its residents both before (0%) and after (−5%).

Results
This approach resulted in an overall accuracy of 73% in detecting the depopulation of localities following the earthquake. Out of the 15 control localities, 12 were detected as not depopulated or an insignificant decrease in residents after the earthquake, yielding a commission error of 20% (Table  A3). Of the 15 experimental localities, 10 were detected with a decrease greater than 24% or more of the population after the earthquake, yielding an omission error of 33% (Table A3).
All locality candidates for the controls were randomly selected from 21 out of the 32 states in Mexico; at least two localities from each state in between the population sizes of 1000 -6000 people were chosen. All 15 experimental localities were in the state of Puebla, geographically proximate to Figure 8. This approach detected a locality as depopulating if the pre-earthquake slope of residents was 24% lower than the post-earthquake slope of residents. In this figure, the experiment locality Huaquechula had a consistent percent of residents (6% slope) before the earthquake but saw a steady decline in residents after the earthquake (−19%), while the control locality El Terrero maintained a steady percent of its residents both before (0%) and after (−5%).

Results
This approach resulted in an overall accuracy of 73% in detecting the depopulation of localities following the earthquake. Out of the 15 control localities, 12 were detected as not depopulated or an insignificant decrease in residents after the earthquake, yielding a commission error of 20% (Table A3).
Of the 15 experimental localities, 10 were detected with a decrease greater than 24% or more of the population after the earthquake, yielding an omission error of 33% (Table A3).
All locality candidates for the controls were randomly selected from 21 out of the 32 states in Mexico; at least two localities from each state in between the population sizes of 1000 -6000 people were chosen. All 15 experimental localities were in the state of Puebla, geographically proximate to the epicenter of the earthquake (Figure 9). There appeared to be no spatial correlation between the false negatives or experimental localities that failed to alert. The five experimental localities that did not alert were in the same geographic proximity to the 10 experimental localities that correctly alerted. Six out of the 15 of the control localities were from the state of Mexico, two from Hidalgo, two from Yucatan, two from Guerrero, one from Michoacán, one from Chiapias, and one from Colima. Additionally, while there appeared to be no definitive spatial correlation of the controls, most false positives were in states near the central region of the country, west of the earthquake. The overall population of a locality did not correlate with the accuracy for the control or experimental localities. For example, control locality San Miguel Ixtapan had a population of 1251 people with a slight decrease of 1.36% of residents, whereas control locality Huamuxtitlán had a population of 6063 but an increase of 3.15% (Table A2). Similarly, experimental locality Domingas Arenas had a population of 5864 people and a decrease of 47.86% of residents after the earthquake, but experimental locality San Felix Hidalgo had a population of 1628 people and a decrease of 53.73% (Table A3. Lastly, of the experimental localities, there was not a strong correlation between earthquake zone severity and the accuracy of the algorithm. Of the seven localities located in Zone VI, two did not alert, while of the eight localities in Zone VII, three did not alert (Figure 4) (Table A4) (Table A5).

Discussion
This approach leveraged Cuebiq pseudonymized PED data to identify and analyze patterns of movement after a natural disaster. The methods used in this study provided a workflow that can be The overall population of a locality did not correlate with the accuracy for the control or experimental localities. For example, control locality San Miguel Ixtapan had a population of 1251 people with a slight decrease of 1.36% of residents, whereas control locality Huamuxtitlán had a population of 6063 but an increase of 3.15% (Table A2). Similarly, experimental locality Domingas Arenas had a population of 5864 people and a decrease of 47.86% of residents after the earthquake, but experimental locality San Felix Hidalgo had a population of 1628 people and a decrease of 53.73% (Table A3).
Lastly, of the experimental localities, there was not a strong correlation between earthquake zone severity and the accuracy of the algorithm. Of the seven localities located in Zone VI, two did not alert, while of the eight localities in Zone VII, three did not alert (Figure 4) (Table A4) (Table A5).

Discussion
This approach leveraged Cuebiq pseudonymized PED data to identify and analyze patterns of movement after a natural disaster. The methods used in this study provided a workflow that can be potentially used as a framework for other natural disasters or similar incidences (such as violence) that prompt migration. The spatial distribution and market penetration of the 2017 data posed a challenge when working with the Cuebiq PED data; it was difficult to strike a balance between examining localities with a tighter, more reliable geofence and ones that had enough residents to analyze longitudinally. The 15 control and 15 experimental localities could only be selected after examining 82 control and 38 experimental candidates and checking their market penetrations to see if their impacts could be measured and analyzed. Working with smaller, more rural localities (generally with a population less than 2000) meant using smaller buffers with greater specificity in identifying residents-false positives were less likely to be seen from individuals moving through the area while they commuted to work, etc. However, there were fewer residents in those localities overall, because there was low cell phone penetration in rural areas. This will likely not be as pressing of an issue when replicating the experiment on more recent data, since PED penetration has shown to consistently increase over time; in 2019, Mexico's smartphone penetration increased 10 percentage points in three years to capture 49.5% of the population [32].
There are also data collection issues presented in this approach. If a disaster is severe, it will likely destroy the telecom and power networks. Since current PEDs rely on the terrestrial telecom network, they will have no way to upload their information. With the advent of low-Earth constellations of communication satellites, like SpaceX's Starlink (starlink.com), this could be alleviated, but the devices will still lose their charge without a power network within a few days. This could be addressed with dispersed power sources such as diesel generators, solar panels, or small-scale wind generators, but these are not widely available in some less-developed areas.
Another drawback of this study is the temporal distribution of the data. Since the data spanned a little more than a month-15 days before the earthquake and 21 days after the earthquake-using a dataset with larger temporal range will likely improve the results of the pattern of life analysis used to correctly identify the residents, including those who may be traveling, working temporarily elsewhere, or only using their smartphones intermittently. Using the approach described here, an average of only 2% of a locality's 2010 population were tracked as residents [31].
After some consideration, future elements of this research could include, but are not limited to, examining the geological and geographic factors that could have contributed to the lack of market penetration, analyzing the usage of different types of smart device applications that are partnered with Cuebiq and are commonly used in Mexico, understanding the aggregate demographics of the region in correlation to smartphone usage, and other situational circumstances that affected the population's mobility but were not previously identified. These reasons were the top hypotheses as for why the algorithm performed better on some localities than others, and further investigation could give potential insight on the lack of market penetration and more accurate adjustments for the algorithm.
The approach presented here could be extended to the present day, monitoring currently intact localities in areas at risk of humanitarian disasters. While downloading, loading into the database and preprocessing this month of data in our study region took approximately three days on a Windows 10 server; automating the process to remove the inefficiencies incurred by optimizing the workflow to output parallel datasets differing in experimental parameters could make this process feasible on a weekly basis. During the aftermath of a humanitarian disaster, data could pinpoint localities incurring the greatest damage. Additionally, this approach could be modified to function from a period of a few hours as opposed to a 24-h period to serve as an early warning mechanism of a humanitarian disaster and the mass migration that may result. While there are hurdles, this approach is scalable. A cloud-based effort could easily monitor all localities in an area, such as this one in Mexico, alerting users any time a spatial grouping of localities experiences rapid depopulation.
Organizations interested in this approach could modify the algorithm's sensitivity to suit their objectives. For example, in situations such as monitoring a specific region at risk of disaster, the −24% slope difference could be modified to −15% percent, reducing the chance that a locality's depopulation would miss detection. This lowers the omission rate but increases the commission rate. Conversely if an organization is concerned with the total number of people fleeing, the alert level could be changed to 30% to reduce commission errors.

Conclusions
In future studies, building a better baseline of locality patterns of life could account for weekly, monthly, and annual changes and significantly improve the accuracy. This is possible with the significant increase in market penetration of location PED and scalable computing. Additionally, with a longer period of data, a custom buffer could be created for each ID based on their usual movements. When the ID passes out of that buffer, it would be recorded as an abnormal movement. Enough IDs with abnormal movements on a specific day could indicate an anomalous environmental or political event. The advantage that this algorithm has is that it could be used in larger cities for when one neighborhood might be destroyed and the residents shift to a neighboring one. A disadvantage is that a false positive could occur for other reasons, such as large-scale sporting events.
As more years of data become available, a greater number of incidents like humanitarian emergencies will provide "labels" for training artificial intelligence or machine-learning algorithms. These algorithms will more accurately detect anomalous situations of interest, such as the localities most severely affected by a humanitarian disaster. The operationalization of this approach will likely be possible in the near future with increased data collection and availability, improved computer processing, and near-real time. Operationalization will, however, require that the location data from PEDs are transferred, requiring a functioning telecom network or a space-based communication network such as Starlink. Operationalization also requires that PEDs can continue to be powered through either a functioning power grid or, if that is destroyed in a severe disaster, dispersed power sources such as diesel generators, solar panels, or small-scale wind generators.
With increases in computing capacity and the growing database of PED on a global scale, approaches like this have the possibility of providing researchers and practitioners a way to monitor large areas at-risk of humanitarian disaster. It is understood that this approach will never replace on-the-ground witnesses but serve as a low-cost alert system capable of providing additional information in areas lacking connectivity. Such an approach will be even more valuable in areas that lack other means of reporting (lack of capacity) or the local government/authorities do not want to share information with the international community (lack of transparency). Our hope is that this research helps provide organizations committed to providing emergency humanitarian responses a way to act in a decisive manner through informed, data-driven insights and, ultimately, reduce the suffering of those affected by a disaster. code for database extraction and loading for the study and Kurtis Weatherford for his contributions in creating the spatial buffers and maps used in this study.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A Table A1. Before preprocessing, the IDs had an average of 7.86 days with a record. Preprocessing eliminated many IDs with one or two records total, bringing the average days the IDs had a hit up to 13.48.  Table A2. Fifteen control localities were used in the final analysis. In processing, an additional 610m was added to this buffer radius to account for privacy data screening by the data provider.  Table A3. Fifteen experimental localities were used in the final analysis all from earthquake class areas of VI/VII or higher. In processing, an additional 610m was added to this buffer radius to account for privacy data screening by the data provider.