Understanding the Representativeness of Mobile Phone Location Data in Characterizing Human Mobility Indicators

Abstract: The advent of big data has aided understanding of the driving forces of human mobility, which is beneficial for many fields, such as mobility prediction, urban planning, and traffic management. However, the data sources used in many studies, such as mobile phone location and geo-tagged social media data, are sparsely sampled on the temporal scale. An individual's records can be distributed over a few hours a day, or a week, or over just a few hours a month. Thus, the representativeness of sparse mobile phone location data in characterizing human mobility requires analysis before the data are used to derive human mobility patterns. This paper investigates this issue using subscriber mobile phone location data collected by a major carrier in Shenzhen, China. A dataset of over 5 million mobile phone subscribers that covers 24 h a day is used as a benchmark to test the representativeness of mobile phone location data on human mobility indicators, such as total travel distance, movement entropy, and radius of gyration. This study divides this dataset by hour, using 2- to 23-h segments to evaluate how representativeness varies with the availability of mobile phone location data. The results show that different numbers of hourly segments affect estimations of human mobility indicators and can cause overestimations or underestimations from the individual perspective. On average, the total travel distance and movement entropy tend to be underestimated. The underestimation coefficient for total travel distance declines approximately linearly as the number of time segments increases, and the underestimation coefficient for movement entropy declines logarithmically as the number of time segments increases, whereas the radius of gyration tends to be more ambiguous due to the loss of isolated locations. This paper suggests that researchers should carefully interpret results derived from this type of sparse data in the era of big data.


Introduction
Understanding human mobility is of crucial importance [1,2], with potential benefits for various fields such as mobility prediction [3,4], urban planning [5–7], transportation research [8,9], and human health research [10]. With the rapid development of information and communication technology [11] in the past two decades, various types of massive digital footprints generated by humans, such as smart card data, call detail records (CDRs), geo-tagged social media data, GPS tracking data, WiFi data, and credit-card records, together with their concomitant analytics, are used for human mobility research [2,12–18]. However, there is debate regarding the representativeness or inherent biases of the data. For example, previous studies demonstrate that mobile phone users are unevenly distributed in age, gender, and geography [19,20]. This type of bias also exists in social media data [21,22].
Unlike GPS tracking data, which can have multiple records per minute [23], a main disadvantage of the data used in previous research, such as mobile phone location and social media check-in data, is that it is very sparsely sampled on the temporal scale. An individual's records can be distributed over a few hours a day, or a week, or over just a few hours a month, due to the uneven distribution of people's phone activities in space and time, an issue that requires attention [24]. Previous researchers have discussed how CDRs can introduce biases in human mobility research [25,26] and how the level of deviation is closely related to the ratio of sampled phone communication records in an individual's trajectory [26]. In addition, Sagarra et al. [27] proposed a supersampled model to assess the sampling biases of reduced data. However, the representativeness of different time segments has not been investigated comprehensively due to the lack of ground truth for trajectories. How representative are sparse mobile phone location data for estimating human mobility? This question must be addressed before using such data to study human mobility patterns and derive results.
In this paper, we quantitatively analyze the representativeness of mobile phone location data on estimations of individual human mobility patterns. CDRs usually capture individual footprints only during phone communication, whereas actively tracked mobile phone location data contain both phone communication records and location records triggered by location update strategies such as periodic updates, regular updates, and cellular handover. This study uses active tracking data to conduct the investigation. Figure 1 shows an individual's complete trajectory from our mobile phone location dataset over an entire day. Voronoi tessellations were used to represent the service areas of cell phone towers. It is difficult to determine the real path because most cell phone towers along it were not recorded, even under active updating strategies. Therefore, the main research question of this study is to determine the effects of sparsely temporally sampled mobile phone location data on the evaluation of human mobility indicators.
ISPRS Int. J. Geo-Inf. 2017, 6, 7
This paper investigates this question and provides several suggestions for selecting an appropriate dataset to analyze human mobility. The findings of this research can also be used to evaluate the representativeness of other types of sparsely sampled data, such as geo-tagged social media data.
This paper is organized as follows: Section 2 reviews studies related to this research. Section 3 introduces the active tracking mobile phone location dataset and the study area. Section 4 describes the method for evaluating the representativeness of sparse mobile phone data for measurement of human mobility indicators. Section 5 discusses the analysis results. The last section summarizes our findings and discusses future research directions.

Literature Review
This section presents relevant research in two areas: big data for human mobility research and representativeness issues of big data.

Mobile Phone Location Data for Human Mobility Research
Many valuable findings related to human mobility and interaction with urban environments have been reported in recent years with the advent of big data. These studies can be used for mobility prediction [3,4], urban planning [5–7], transportation research [8,9,28], and other fields [10,29]. Among the available datasets, mobile phone location data are of particular interest because mobile phones have an extremely high penetration rate and people usually take their cell phones with them, especially in Asian countries such as China. Some researchers view this type of data as a reasonable source to describe human mobility [30].
Using sparsely sampled mobile phone location data, Kung et al. [31] explored the home-work commuting patterns of several cities in different countries and discovered that the commute time and average value distributions are independent of commute distance or country. Diao et al. [32] discovered the common laws governing an individual's activity participation and extracted the embedded information by presenting an activity detection model with travel diary surveys. Human footprints can also be used to analyze the spatial-temporal patterns of convergence and divergence in urban areas [33]. For transportation research, trip chain segments derived from mobile phone location data can be used to estimate the dynamic potential demand of bicycle trips in public transportation planning [34]. By estimating dynamic origin-destination matrices, weekday and weekend travel patterns have been portrayed to analyze differences in travel demand over time [35]. However, how well do such subsample datasets estimate mobility patterns? The answer is not simply yes or no, but investigating the representativeness of sparsely sampled location data may help to find some answers.
In addition, the human activity space and the mobility heterogeneity in this space are topics of many studies on human mobility [2,36–39]. For instance, González et al. [2] found that the radius of gyration for all individuals can be approximated with a truncated power law. Yuan et al. [37] explored the relationships between phone usage and indicators of travel behavior characterized by movement entropy and radius. The absence of some outlying location points in sparse mobile phone records may influence the calculation of movement radius in practice. Calabrese et al. [30] compared the total trip length between mobile phone and vehicle data and demonstrated that using the Euclidean distance between cell phone towers to measure individual mobility could introduce a downward bias, but whether the sparse distribution of location records is also one of the reasons for this bias needs to be validated. Song and Barabási [39] and Gallotti [40] used entropy to predict individual mobility patterns. Moreover, Cuttone et al. [41] found relationships between the spatial and temporal resolution of mobile phone data and the accuracy of predicting human mobility. The effectiveness of sparse location data in characterizing individual human mobility therefore deserves more attention.
Moreover, in the literature reviewed above, many indicators are used to characterize human mobility patterns, such as the radius of gyration [2,38], movement entropy [37,39,40], and travel distance [26,30]. These indicators are usually used to characterize the travel distance, range of activity space, and heterogeneity of visitation patterns, which are three fundamental aspects of human mobility. However, few studies have reported how representative sparse location data are in the characterization of individual human mobility.

Representative Issues of Big Data
Despite the enthusiasm for big data, there are also debates regarding privacy [42–44], data quality [45–48], and representativeness issues [25,26]. Previous studies demonstrate that mobile phone users are unevenly distributed in gender and geography [19,20] and in population component [49]. This type of bias also exists in social media data [21,22]. The effects of spatial sampling and the granularity of sparse location data have also been studied [24,50].
Temporal sampling issues are of critical importance in using data to investigate human mobility patterns. GPS tracking data can have relatively fine granularity from both temporal and spatial perspectives [23,51]; however, the mobile phone location data and geo-tagged social media data used in previous studies are very sparsely temporally sampled due to the uneven distribution of people's phone activities in space and time, which is the main issue that requires attention [24]. An individual's mobile phone records or social media check-in data can be distributed over a few hours a day, or a week, or just a few hours a month. Goodchild [52] indicated that losses in quality control and rigorous sampling are characteristics of big data that distinguish it from small data. Although previous studies have demonstrated that sparsely sampled CDRs introduce some biases to human mobility research [25,26] and that the level of deviation is closely related to the ratio of CDRs in an individual's complete trajectory [26], they do not describe how to obtain a more representative dataset if the complete trajectory is not available for comparison.
The incompleteness of temporally or spatially sampled location data is also a considerable factor leading to uncertainty issues in GIScience [53,54], raising concerns regarding how uncertainties could affect the findings [55,56]. Some researchers think that long observation periods help increase sample size; however, Jacobs [57] notes that such data are large numbers of repeated observations over time and/or space and may not resolve the sparsity issue. The critical question of 'how good are mobile phone location data at providing an accurate estimation of individual mobility indicators?' remains to be addressed before using such data to investigate human mobility patterns and derive reasonable results.
Thus, this paper quantitatively evaluates the representativeness of sparse mobile phone location data in estimations of individual human mobility indicators. We focus not only on determining the effects of different time segments on human mobility characterization but also on providing a clear quantitative understanding of the representativeness of the data.

Study Area and Dataset
The study area of this research is Shenzhen, one of the largest cities in China. This section provides background information on Shenzhen and the active tracking mobile phone location dataset collected there.

Study Area
Shenzhen has a population of more than 15 million in an area of approximately 2000 square kilometers, reflecting the highest population density among Chinese cities. Its annual gross domestic product (GDP) ranked fourth among all cities in China [58], after Shanghai, Beijing, and Guangzhou. Located on the south coast of China, Shenzhen is across the border from Hong Kong (Figure 2). Shenzhen has developed into an influential international city. The prosperous socioeconomic status of Shenzhen makes it a good choice for human mobility and business area analyses.

Data
The mobile phone location data used in our research was collected by a very large mobile phone company that holds approximately 60% of the entire mobile phone market in Shenzhen. Approximately 16 million subscribers' location records were collected during a single workday. Table 1 shows the attributes of the mobile phone location data. For privacy reasons, the user ID is encrypted. Mobile communication carriers record the closest cell phone tower each time the subscriber uses his or her phone. Unlike call detail records, the mobile phone location records in this paper contain the following connection types: (1) making and receiving calls; (2) sending and receiving text messages; (3) regular location updates (triggered by moving from one cell phone tower to another); and (4) periodic location updates (triggered by tower pinging if a subscriber has no phone activities for a specified time period).
Types (3) and (4) are the two active update strategies of this dataset. The connection types were not given in this dataset. Even under the active update strategies, we cannot determine the actual path because most of the cell phone towers along it were not recorded (Figure 1). There are 5940 unique cell phone towers in this dataset. Figure 3 shows the spatial kernel density of the cell phone towers. The cell phone towers are unevenly distributed in the urban space.
Overall, the cell phone towers are densely distributed in the center of the city and in highly populated areas, whereas they are sparsely distributed in suburban areas, resulting in a lower positioning accuracy there. The average and maximum distances between adjacent cell phone towers are about 0.21 and 2.6 km, respectively.
Since the focus of this paper is to investigate the representativeness of sparse mobile phone location data in characterizing human mobility patterns, the uneven distribution of people's phone activities in space and time is the main concern with regard to our research goal [24], while the dense distribution of cell phone towers across the urban area indicates that the spatial granularity at the cell phone tower level may not be a major drawback in this study area.

Methodology
This section first introduces the frequently used human mobility indicators. The method for evaluating the representativeness of mobile phone location data then comprises three main steps. First, we divided the day into 24 hourly segments, extracted the subscribers whose records covered all 24 hourly segments into a new dataset, and calculated their complete human mobility indicators as the benchmarks of this study. Then, we calculated the sampled human mobility indicators by selecting different numbers of time segments from the new dataset under random rules. Finally, a linear regression model was used to quantify the aggregated underestimation level between the sampled and complete human mobility indicators for each random selection.

Frequently used Human Mobility Indicators
Many indicators are frequently used to measure activity space, such as maximum travel distance, radius of gyration, movement radius, total travel distance, movement entropy, and visitation frequency. These indicators can be classified into three categories: the range of activity space, the travel distance, and the heterogeneity of visitation patterns within the activity space. For instance, both movement entropy and visitation frequency are used to characterize the heterogeneity of visitation patterns. Thus, this paper uses three indicators, one from each category, to characterize human activity behavior. They are calculated based on a working day and defined as follows:
Total travel distance: The sum of the Euclidean distances between each pair of consecutive records [26], which is a basic measure of individual mobility.
Movement entropy: A characterization of the heterogeneity of visitation patterns [37,38], calculated as

$$E = -\sum_{i=1}^{n} p_i \log p_i$$

where $n$ is the number of distinct cell phone towers visited by a subscriber and $p_i$ is the probability that location $i$ is visited.
Radius of gyration: Describes how widely the subscriber travelled; one of the most frequently used measures to characterize the range of activity space [2,38], calculated as

$$R_g = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left| p_j - p_{cm} \right|^2}$$

where $N$ is the number of time-sequenced cell phone towers visited by a subscriber, $p_j$ is the $j$th tower that the subscriber visited, and $p_{cm}$ is the center of all time-sequenced locations.
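As an illustration only, the three indicators can be sketched in a few lines of Python. The sketch makes several assumptions that are not stated in the text: tower locations are given as planar coordinates in kilometres, the visit probability of a tower is estimated as its share of the subscriber's records, the entropy uses a base-2 logarithm, and the function names and the toy trajectory are hypothetical.

```python
import math
from collections import Counter


def total_travel_distance(points):
    """Sum of Euclidean distances between consecutive records."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def movement_entropy(points):
    """Shannon entropy of tower visits (base 2 is an assumption here)."""
    counts = Counter(points)
    total = len(points)
    # p_i is estimated as the share of records at tower i (assumption).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def radius_of_gyration(points):
    """Root-mean-square distance of the records from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


# Hypothetical trajectory: time-ordered tower coordinates in km.
trajectory = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 4.0)]
print(total_travel_distance(trajectory))  # 15.0
print(movement_entropy(trajectory))       # 1.0
print(radius_of_gyration(trajectory))     # 2.5
```

In the toy trajectory the subscriber alternates between two towers 5 km apart, so the three hops sum to 15 km, the two equally visited towers give one bit of entropy, and every record lies 2.5 km from the centroid.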

Extracting Valid Subscribers
After introducing the frequently used human mobility indicators, the method used to evaluate the representativeness of mobile phone location data in characterizing these indicators included three main steps, described below.
Clearly, the mobile phone location records of different subscribers are distributed over different numbers of time segments. The fewer time segments a subscriber's records fall in, the sparser the records are from a temporal perspective. For example, about 3.37% of subscribers had records in just one hour of the day, and the percentage of users with records in all 24 temporal segments was 35.70%, which means that the records of almost 65% of users were distributed over fewer than 24 segments, as shown in Figure 4. Moreover, the records of approximately 13.18% of users were in 6 segments or fewer. Hence, it is questionable whether the mobility patterns of users can be properly characterized without covering enough temporal intervals.
The data of the 5.8 million subscribers whose records covered all 24 hourly segments were included in this research, so that they could be used to investigate the effects of different time segments in characterizing human mobility patterns. These subscribers may habitually use their mobile phones more frequently than others. In addition, previous studies have demonstrated that mobile phone users are heterogeneously distributed in age, gender, and space [19,20]. Thus, mobile phone users in our subsample dataset may have different biases in these aspects, which need to be further explored in future work.
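The extraction of subscribers with records in all 24 hourly segments can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout (user ID, seconds since midnight, tower ID) is a hypothetical stand-in for the attributes in Table 1, and the function names are invented for the example.

```python
from collections import defaultdict


def hours_covered(records):
    """Map each subscriber ID to the set of hourly segments (0-23) with records."""
    covered = defaultdict(set)
    for user_id, seconds_since_midnight, _tower_id in records:
        covered[user_id].add(int(seconds_since_midnight // 3600))
    return covered


def valid_subscribers(records, n_segments=24):
    """Subscribers whose records fall in every one of the n_segments hours."""
    covered = hours_covered(records)
    return {user for user, hours in covered.items() if len(hours) == n_segments}


# Hypothetical records: (user_id, seconds since midnight, tower_id).
records = [("a", h * 3600 + 10, "t1") for h in range(24)]
records += [("b", 8 * 3600, "t2"), ("b", 9 * 3600 + 30, "t3")]
print(valid_subscribers(records))  # {'a'}
```

Subscriber "a" has one record in each hour and is kept as a benchmark subscriber, while "b" appears in only two hourly segments and is excluded.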

Random Rules
After the 5.8 million subscribers were extracted, we divided each subscriber's records into 24 time segments, as shown in Figure 5. To investigate the representativeness of sparse mobile phone location data for estimating individual human mobility indicators, this study varied the number of selected time segments from 2 to 23. For each number of time segments, the selection was randomized 100 times so that every time segment could be selected. For example, when two time segments were selected, the segments (#2, #5) or (#3, #9) could be drawn; when three were selected, the segments (#4, #5, #21) or (#2, #6, #17) could be drawn, as shown in Figure 6. In addition, no combination of time segments was repeated for a given number of time segments. For instance, when three time segments were selected, the combination (#4, #9, #22) was drawn only once among the 100 random selections. Moreover, for a given number of time segments, each segment had to be selected at least five times in every 100 selections. For example, in selecting two time segments, if the combinations (#2, #5), (#2, #7), (#2, #16), (#2, #19), and (#2, #23) were drawn, segment #2 had appeared five times, segments #5, #7, #16, #19, and #23 had each appeared only once, and the other 18 segments had not appeared at all. Thus, in the remaining 95 selections, priority was given to the other 23 segments. This rule is designed to reduce the inequality in how often each time segment is selected.
When 23 time segments are selected, there are only 24 possible combinations (one for each omitted segment), so fewer than 100 selections are available. For each random selection, the sampled mobility indicators of each of the 5.8 million subscribers were calculated using all of the mobile phone records in the selected time segments.
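The two random rules can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function name `draw_segment_sets`, the labeling of segments as 1 to 24, and the greedy way of prioritizing under-represented segments are all assumptions made for this example.

```python
import random
from collections import Counter
from math import comb

N_SEGMENTS = 24  # hourly segments, labeled 1..24 here (an assumed labeling)

def draw_segment_sets(k, n_draws=100, min_count=5, seed=0):
    """Draw distinct k-segment combinations (hypothetical helper).

    Rule 1: no combination is drawn twice.
    Rule 2: segments seen fewer than min_count times get priority
            (a greedy approximation of the rule described in the text).
    """
    rng = random.Random(seed)
    all_segments = list(range(1, N_SEGMENTS + 1))
    # Only comb(24, k) combinations exist, e.g. 24 when k = 23.
    n_draws = min(n_draws, comb(N_SEGMENTS, k))
    counts = Counter({s: 0 for s in all_segments})
    chosen, draws = set(), []
    while len(draws) < n_draws:
        needy = [s for s in all_segments if counts[s] < min_count]
        attempt = 0
        while True:
            # Force in as many under-represented segments as possible,
            # relaxing by one per rejected attempt to avoid getting stuck.
            m = max(0, min(k, len(needy)) - attempt)
            forced = rng.sample(needy, m)
            pool = [s for s in all_segments if s not in forced]
            candidate = tuple(sorted(forced + rng.sample(pool, k - m)))
            if candidate not in chosen:
                break
            attempt += 1
        chosen.add(candidate)
        draws.append(candidate)
        counts.update(candidate)
    return draws, counts

draws, counts = draw_segment_sets(k=3)
print(draws[:3], min(counts.values()))
```

The prioritization makes the draws non-uniform by design, which mirrors the stated goal of reducing inequality between segments rather than sampling combinations uniformly.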

Evaluating the Aggregated Underestimation Coefficient
For each random selection, we calculated a set of sampled indicator values (the sampled total travel distance, the sampled movement entropy, and the sampled radius of gyration) using the records in the randomly selected time segments for all 5.8 million subscribers.
To quantify the aggregated underestimation level of the sampled time segments in characterizing the human mobility indicators, a linear regression model was used [26]:

$$ y = ax + b $$

Here, for each random selection, each mobility indicator calculated from the complete records in all time segments is the independent variable x, and the corresponding sampled mobility indicator, calculated from the records in the randomly selected time segments, is the dependent variable y. The coefficient a measures the relationship between the sampled and complete indicators across all 5.8 million subscribers. The intercept b was set to 0 because, when a mobility indicator in the complete benchmark dataset is 0, the same indicator in the selected dataset should also be 0. The coefficient a is calibrated by the least squares regression method [60].
Thus, the aggregated underestimation coefficient (uc) is defined as follows:

$$ uc = 1 - a $$

Clearly, the lower uc is, the lower the level of underestimation and the more representative the randomly selected time segments are for characterizing human mobility indicators. For instance, when the selected time segments are (#3, #6, #7, #13, #19, and #21) and the coefficient between the sampled total travel distance and the complete total travel distance is 0.25, the aggregated underestimation coefficient is 0.75. This uc is relatively high, which means the representativeness of these six time segments is low: the total travel distance calculated from the records in these six time segments may be about 75% shorter than the subscribers' total footprints in the study area.
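As a small sketch of this calculation: with b fixed at 0, the least squares slope has the closed form a = Σxy / Σx², and uc follows as one minus that slope, in line with the example above where a coefficient of 0.25 gives uc = 0.75. The function names and the toy values below are assumptions for illustration only.

```python
def fit_through_origin(complete, sampled):
    """Least squares slope a of y = a * x (intercept fixed at 0)."""
    sxx = sum(x * x for x in complete)
    if sxx == 0:
        raise ValueError("all complete indicator values are zero")
    return sum(x * y for x, y in zip(complete, sampled)) / sxx

def underestimation_coefficient(complete, sampled):
    """Aggregated underestimation coefficient, read as uc = 1 - a."""
    return 1.0 - fit_through_origin(complete, sampled)

# Toy values (hypothetical): every sampled distance is a quarter of the
# complete distance, so the slope is 0.25 and uc is 0.75.
complete_km = [10.0, 20.0, 40.0]
sampled_km = [2.5, 5.0, 10.0]
print(underestimation_coefficient(complete_km, sampled_km))
```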

Measuring Mobility Indicators by Randomly Selecting Time Segments
This section analyzes the differences between sampled and complete mobility indicators from the individual perspective. The aggregated underestimation effects are then explored quantitatively from the average perspective.

Individual Perspective
This section focuses on evaluating the representativeness of sparse mobile phone location data in individual daily mobility pattern analysis. Examples of random mobility indicators and complete mobility indicators are shown in Table 2 and Figures 7-9.
The horizontal and vertical axes represent the mobility indicators from complete and random time segments, respectively. If the random time segments are representative of the complete time segments, the points in Figures 7-9 should lie close to the light blue diagonal line running from lower left to upper right. As the gray dots show, the representativeness of different time segments for estimating individual mobility indicators varies considerably. For example, when 10 time segments are used, the individual movement entropy is overestimated for 32.79% of subscribers and the individual radius of gyration is overestimated for 19.42% of subscribers.
The random total travel distance cannot be overestimated because, by the triangle inequality, fewer records lead to a shorter (or equal) total travel distance. However, the movement entropy and radius of gyration can be overestimated or underestimated for different individuals because records from different time segments are used in the calculation. The average level is often used to characterize the distribution within the corresponding bandwidth [4,26,37,38]. The average level of the estimation is examined below.
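A toy example can make this asymmetry concrete. The sketch below assumes common textbook forms of the three indicators (path length between consecutive records, Shannon entropy of visit frequencies, and root-mean-square distance from the centroid) on planar coordinates in kilometres; the exact definitions used in this study may differ, and the subscriber records are invented.

```python
from collections import Counter
from math import dist, log, sqrt

def total_travel_distance(points):
    """Sum of straight-line distances between consecutive records."""
    return sum(dist(p, q) for p, q in zip(points, points[1:]))

def movement_entropy(points):
    """Shannon entropy (natural log) of location visit frequencies."""
    n = len(points)
    return -sum(c / n * log(c / n) for c in Counter(points).values())

def radius_of_gyration(points):
    """Root-mean-square distance of the records from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sqrt(sum(dist(p, (cx, cy)) ** 2 for p in points) / n)

def subsample(records, hours):
    """Keep only the records whose hourly segment was selected."""
    return [xy for hour, xy in records if hour in hours]

# Hypothetical subscriber: (hour, (x_km, y_km)) records, mostly at home.
records = [(0, (0.0, 0.0)), (1, (0.0, 0.0)), (2, (0.0, 0.0)),
           (9, (3.0, 4.0)), (20, (0.0, 0.0))]
complete = [xy for _, xy in records]
sampled = subsample(records, hours={2, 9})

for name, fn in [("distance", total_travel_distance),
                 ("entropy", movement_entropy),
                 ("radius", radius_of_gyration)]:
    print(name, fn(complete), fn(sampled))
```

In this example the sampled distance is shorter than the complete one, while both the sampled entropy and the sampled radius of gyration are larger, because dropping the repeated home records gives the single distant location more weight.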

Average Perspective
In Figures 7-9, there are deviations between the red dots and the blue diagonal line, which indicates that using fewer time segments of mobile phone location data tends to underestimate the total travel distance, movement entropy, and radius of gyration from an average perspective; this can also be seen in Table 2.
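The red dots are bin-wise averages: subscribers are grouped by their complete indicator value into bins of fixed bandwidth, and the sampled values are averaged within each bin. A minimal sketch of that aggregation follows; the function name, the use of integer bin indices as keys, and the toy values are assumptions for illustration.

```python
from collections import defaultdict

def binned_average(complete, sampled, bandwidth):
    """Average the sampled values within bins of the complete values.

    Returns {bin_index: mean_sampled}, where bin_index k covers complete
    values in [k * bandwidth, (k + 1) * bandwidth).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for x, y in zip(complete, sampled):
        k = int(x // bandwidth)
        totals[k][0] += y
        totals[k][1] += 1
    return {k: s / n for k, (s, n) in sorted(totals.items())}

# Toy values (hypothetical), binned with a 1 km bandwidth for readability.
print(binned_average([1.0, 1.2, 3.0], [0.5, 0.7, 2.0], bandwidth=1.0))
```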
From an average perspective, the underestimation coefficient of the total travel distance is 0.86 (R² = 0.291, goodness of fit [61]) when there are 3 time segments. When 10 time segments are used, the underestimation coefficient is 0.52 (R² = 0.894). The sampled total travel distance is not typically overestimated because fewer records lead to a shorter total travel distance. As the number of time segments increases, the deviations from the blue diagonal line become smaller for the total travel distance. Conversely, the variation in the average total travel distance grows as the complete total travel distance increases. For example, with 10 time segments, when the complete total travel distance is 100 km, the random total travel distance is approximately 65 km, but when the complete total travel distance is 200 km, the random total travel distance is between 70 km and 140 km. This is likely because the number of subscribers decreases rapidly as the total travel distance increases and because the location records in some time segments are distant from those in other time segments.
The total travel distance can exceed 70 km for subscribers such as taxi or bus drivers, package deliverers, and tourists. These subscribers account for less than 2.0% of the 5.8 million subscribers. Another interesting pattern is that the range of the average total travel distance tends to be narrower when 23 time segments are used. This is mainly because there are fewer random selections and the selected records are very close to the complete records for each individual.
The movement entropy can be overestimated or underestimated for different individuals because it is calculated using records from different time segments. However, from an average perspective, a declining trend in the underestimation coefficient for movement entropy can be observed in Figures 7-9. When 3 time segments are used, the underestimation coefficient of movement entropy is 0.49 (R² = 0.943), but when 10 time segments are used, it is 0.18 (R² = 0.986), which is close to 0. Moreover, when 23 time segments are used, the points are close to the blue diagonal line and the underestimation coefficient is only 0.01, which means the records in these 23 time segments represent the complete movement entropy almost entirely.
Unlike the total travel distance and movement entropy, the random average radius of gyration does not always increase with the complete radius of gyration. As shown in Figures 7-9, the random average radius of gyration increases until the complete radius of gyration is approximately 9 km. Beyond that point, although the complete radius of gyration increases, the average random radius of gyration declines. Therefore, the linear regression model is applied only within 9 km. For estimating the radius of gyration, incomplete mobile phone location records are probably good enough in most cases for analyzing subscriber travel within a normal daily activity range, i.e., less than 9 km.
In addition, for subscribers whose complete radius of gyration is greater than 9 km, the average random radius of gyration is often zero or very close to zero due to the loss of some long-distance locations. These subscribers account for less than 7.0% of all valid subscribers, usually travel in many different directions, and are likely to travel over a wide range. Thus, the lack of any time segments between 8 am and 8 pm may significantly affect the radius of gyration. These subscribers may be taxi or Uber/Didi drivers, package deliverers, or tourists. Therefore, mobile phone location data might significantly underestimate the radius of gyration of subscribers whose activity range is very wide (i.e., greater than 9 km). Moreover, the range of the random radius of gyration tends to widen as the complete radius of gyration increases.
Most importantly, even the use of many time segments can generate a much smaller radius of gyration, which indicates that an incomplete trajectory remains questionable for deriving the range of daily activity space.
Using mobile phone location records from different numbers of time segments can generate very different results in the distribution of total travel distance, movement entropy, and radius of gyration, which indicate the distance, heterogeneity, and range of individual mobility patterns, respectively. Therefore, the representativeness of mobile phone location data should be addressed before using it to answer different research questions. Next, we provide a comprehensive comparison of the representativeness of different numbers of time segments, and of the same number of time segments with different time slots, using the underestimation coefficient from an average perspective.

Quantitative Analysis of the Total Travel Distance Underestimation Coefficient
To evaluate the representativeness of different numbers of time segments and of the same number of time segments with different slots, we varied the selected number of time segments from 2 to 23. For each number of time segments, we randomized the selection 100 times, except when 23 time segments were used. For each random selection, we calculated the aggregated total travel distance underestimation coefficient. The distribution of the underestimation coefficients for estimating total travel distance is shown in Figure 10.
First, even with the same number of time segments, the underestimation coefficient can differ considerably. For instance, when 4 time segments are used, the underestimation coefficient varies from 0.77 to 0.90, and when 18 time segments are used, it is between 0.19 and 0.26. These patterns indicate that location records in different time segments differ in how well they represent total travel distance in human mobility research. This is relatively easy to understand in the context of human activities: if the selected time segments are mainly related to home activity, the total travel distance tends to be shorter and the underestimation coefficient tends to be higher, but if the selected time segments cover both home and work activity, the total travel distance tends to be larger, which leads to a lower underestimation coefficient.
Second, as the number of selected time segments increases, the underestimation coefficient declines markedly. The average underestimation coefficient is 0.93 when 2 time segments are used and declines to 0.04 when 23 time segments are used. Fitting another linear regression model, this time with an intercept, shows that the declining trend is nearly linear (R² = 0.99), where n is the number of time segments:

$$ uc_d(n) = -0.04n + 0.92 \qquad (5) $$

This model makes it easy to determine how representative mobile phone location data is for estimating total travel distance. For example, if each individual's records cover only eight time segments, the total travel distance may be approximately 60% shorter than their total footprint in the study area.
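The eight-segment example can be written out as a short worked calculation. It reads uc as one minus the fitted through-origin slope, consistent with the earlier example in which a coefficient of 0.25 corresponds to uc = 0.75; the symbols $\bar{d}_{\text{sampled}}$ and $\bar{d}_{\text{complete}}$ are notation introduced here for the aggregate sampled and complete total travel distances.

```latex
\begin{align*}
uc_d(8) &= -0.04 \times 8 + 0.92 = 0.60, \\
\bar{d}_{\text{sampled}} &\approx \bigl(1 - uc_d(8)\bigr)\,\bar{d}_{\text{complete}}
  = 0.40\,\bar{d}_{\text{complete}}.
\end{align*}
```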

Quantitative Analysis of the Movement Entropy Underestimation Coefficient
Similarly, we can calculate a movement entropy underestimation coefficient for each random selection. The distribution of the aggregated underestimation coefficients for estimating movement entropy is shown in Figure 11. As with the underestimation coefficient distribution for total travel distance, even with the same number of time segments, the underestimation coefficient can differ considerably. For example, when 7 time segments are used, the underestimation coefficient varies from 0.24 to 0.39. This pattern is easy to understand, as different time segments may contain new locations or change the visiting frequency of some locations. Moreover, the range of the underestimation coefficient tends to narrow as the number of time segments increases. For example, when 4 time segments are used, the underestimation coefficient varies from 0.39 to 0.68, and when 18 time segments are used, it is between 0.02 and 0.10. The average underestimation coefficient drops significantly from 0.64 to 0.20 as the number of selected time segments increases from 2 to 10.
Another interesting pattern is that, as the number of selected time segments increases, the underestimation coefficient tends to decline. The trend can be fitted by a logarithmic regression model with an intercept (R² = 0.99), where n is the number of time segments.
We can easily determine the representativeness of the mobile phone location data for estimating movement entropy using this model. For example, if each individual's records cover only eight time segments, the underestimation coefficient is approximately 0.25, so the average movement entropy may be approximately 25% lower than the value derived from their total footprint in the study area.
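As a rough illustration of how a logarithmic model with an intercept can be fitted, the sketch below regresses uc on ln(n) by ordinary least squares. It assumes the natural logarithm and uses synthetic data generated from arbitrary placeholder coefficients (-0.25 and 0.80); these are not values from this study, and the function name is hypothetical.

```python
from math import log

def fit_log_model(ns, ucs):
    """Least squares fit of uc = alpha * ln(n) + beta."""
    t = [log(n) for n in ns]
    m = len(t)
    t_mean = sum(t) / m
    u_mean = sum(ucs) / m
    stt = sum((ti - t_mean) ** 2 for ti in t)
    stu = sum((ti - t_mean) * (ui - u_mean) for ti, ui in zip(t, ucs))
    alpha = stu / stt
    beta = u_mean - alpha * t_mean
    return alpha, beta

# Synthetic, noise-free data from placeholder coefficients (not the paper's).
ns = list(range(2, 24))
synthetic_uc = [-0.25 * log(n) + 0.80 for n in ns]
alpha, beta = fit_log_model(ns, synthetic_uc)
print(round(alpha, 6), round(beta, 6))
```

Because the synthetic data contain no noise, the fit recovers the placeholder coefficients; with real underestimation coefficients the residual scatter would determine the goodness of fit.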

Quantitative Analysis of the Radius of Gyration Underestimation Coefficient
Similarly, we can calculate an underestimation coefficient of the radius of gyration for each random selection. The distribution of the aggregated underestimation coefficients for estimating the radius of gyration is shown in Figure 12. As discussed in Section 5.1.2, for estimating the radius of gyration, incomplete mobile phone location records are probably good enough in most cases for analyzing subscribers' travel within a normal daily activity range, such as less than 9 km. Therefore, this section focuses mainly on radii of gyration of less than 9 km.
Even with the same number of time segments, the underestimation coefficient can differ considerably. For example, when 4 time segments are used, the underestimation coefficient varies from 0.31 to 0.97. This pattern is easy to understand, as different time segments may contain new locations. Moreover, the range of the underestimation coefficient tends to narrow as the number of time segments increases. For example, when 3 time segments are used, the underestimation coefficient varies from 0.36 to 0.77, and when more than 15 time segments are used, it is between 0.28 and 0.35.
The declining trend can be fitted by a linear regression model with an intercept (R² = 0.63), where n is the number of time segments:

$$ uc_r(n) = -0.009n + 0.44 \qquad (7) $$

This model can be used to gauge how representative the mobile phone location data is for estimating the radius of gyration within 9 km. Unlike for the total travel distance and movement entropy, however, the goodness of fit (R²) is only 0.63.
The radius of gyration is likely to be more uncertain with fewer selected time segments. As can be seen from Figure 12, the average underestimation coefficient is greater than 0.29 even when 23 time segments are used, which means that any number of sampled time segments could depict the range of daily travel as at least 29% shorter than the subscribers' total footprint in the study area. In addition, as discussed in Section 5.1.2, for subscribers whose activity range is greater than 9 km, the sampled radius of gyration is often much lower because outlying location points are absent. This also indicates that the radius of gyration may not be the most appropriate measure for characterizing the range of human mobility from sparsely sampled location data, such as mobile phone location data. Thus, we suggest that researchers use indicators cautiously when interpreting results derived from sparsely sampled location data.
Finally, based on the results and distribution of subscribers in Figure 4, if given a real sample of mobile-tracked individuals and supposing that the uc for 24 time segments is 0, the weighted underestimation levels of the total travel distance, the movement entropy, and the radius of gyration are about 23%, 11% and 21%, respectively, in this study area.
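One plausible reading of this weighting is a share-weighted mean of the underestimation coefficient over the number of time segments covered by each subscriber's records, with uc taken as 0 at 24 segments. The sketch below illustrates that reading for the total travel distance using Equation (5); the shares are invented placeholders, since the distribution in Figure 4 is not reproduced here, and both function names are hypothetical.

```python
def uc_distance(n):
    """Equation (5), with uc taken as 0 when all 24 segments are covered."""
    return 0.0 if n >= 24 else -0.04 * n + 0.92

def weighted_underestimation(shares, uc):
    """Share-weighted mean of uc(n); shares maps n to a subscriber share."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    return sum(share * uc(n) for n, share in shares.items()) / total

# Placeholder shares (hypothetical): half of the subscribers cover 8
# segments, the other half cover all 24.
example_shares = {8: 0.5, 24: 0.5}
print(weighted_underestimation(example_shares, uc_distance))
```

With these placeholder shares the result is about 0.30, i.e., the mean of 0.60 and 0; the figures reported above depend on the actual subscriber distribution.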

Conclusions
In this paper, we investigated the representativeness of sparse mobile phone location data in characterizing mobility indicators, which are used for measuring the range of activity space, the travel distance, and the heterogeneity of visitation patterns within activity space. The contribution of this study is threefold. First, the case study shows that, for individual subscribers, sparse data can lead to either overestimation or underestimation of human mobility indicators. However, from an average perspective [4,26,37,38], when compared with all of the records, incomplete mobile phone location data tends to underestimate mobility indicators, such as average total travel distance and movement entropy. Moreover, the underestimation of the average radius of gyration becomes more significant. The representativeness of mobile phone data also depends on the records in different time segments.
Second, this study quantitatively assesses the representativeness of randomly selected time segments from the benchmark dataset in characterizing human mobility indicators. The aggregated underestimation coefficient for estimating the total travel distance declines linearly as the number of time segments increases. For example, if each individual's records cover only 33% of the trajectory, the total travel distance may be approximately 60% shorter on average than their total footprints in the study area. The aggregated underestimation coefficient for estimating movement entropy declines logarithmically as the number of time segments increases. For instance, if each individual's records cover only 33% of the trajectory, the aggregated underestimation coefficient is approximately 0.25, so the movement entropy may be approximately 25% lower on average than the value derived from their total footprint in the study area.
Lastly, the underestimation effects can be very significant for the radius of gyration: the average underestimation coefficient is greater than 0.29 even when 23 time segments are used, which means incomplete mobile phone location data could depict an average range of daily travel approximately 29% shorter than the subscribers' total footprints in the study area. This may indicate that the radius of gyration should be used cautiously, because it is easily underestimated when using sparsely sampled location data, such as mobile phone location data. However, our findings may or may not be applicable to other cities due to different urban environments and mobile phone usage habits.
This study presents an alternative way to evaluate the representativeness of mobile phone location data for human mobility research. The method proposed in this paper can also be used for coarse data such as geo-tagged social media check-in data. Using the investigative approach presented here, researchers can understand the strengths and limitations of their data to help derive reasonable results. However, we note several limitations and challenges specific to sparsely sampled location data. (1) Mobile phone usage habits: Figure 4 shows that the temporal coverage of subscribers' records is mostly relatively low, which may be related to subscribers' mobile phone usage habits. The underestimation coefficient may therefore be higher in non-randomly sampled mobile phone location data if subscribers travel a lot but rarely carry their mobile phones. (2) The bias of using subsamples instead of whole datasets: mobile phone users in subsample datasets may have different biases in gender, age, or geography [19,20]. We will further explore the effects of this bias in characterizing human mobility patterns in a future study.

Figure 1. An example of one user's trajectory. Clearly, most of the Voronoi tessellations that intersect the trajectory are not recorded.

Figure 3. Cell phone tower spatial kernel density. Due to a dataset provider requirement, the point-based cell phone tower spatial distribution cannot be shown. The administrative boundary of Shenzhen is vectorized into a shapefile according to the whitepaper of the Urban Planning, Land & Resources Commission of Shenzhen Municipality [59].

Figure 6. Random rules for selecting different numbers of time segments.

Figure 7 .
Figure 7. Human mobility indicators in 3 random segments (segment #2, #14 and #20).The light gray dots are the three random and complete mobility indicators for each subscriber.For the total travel distance and radius of gyration, the horizontal axis is 0.1 km bandwidth.For movement entropy, the horizontal axis is 0.01 bandwidth.The red dots are the average value of the gray dots in their corresponding bandwidth.

Figure 8. Human mobility indicators in 10 random segments (segments #5, #6, #7, #9, #11, #12, #14, #16, #17, and #19). The light gray dots are the three random and complete mobility indicators of each subscriber. For the total travel distance and radius of gyration, the horizontal axis is 0.1 km bandwidth. For movement entropy, the horizontal axis is 0.01 bandwidth. The red dots are the average value of the gray dots in their corresponding bandwidth.

Figure 10. Distribution of aggregated underestimation coefficients for estimating total travel distance using different numbers of time segments.

Figure 11. Distribution of aggregated underestimation coefficients for estimating movement entropy using different numbers of time segments.

Figure 12. Distribution of aggregated underestimation coefficients for estimating radius of gyration using different numbers of time segments.

Table 1. Example of individuals' cell phone records during a day.

Note: The sign ***** masks the minutes of a longitude or latitude, and the sign **/** masks the exact month and day, for privacy protection.

Table 2. Mobility indicator statistics for different random time segments.
