1. Introduction
Knowledge of urban traffic patterns and the population distribution of urban residents is the basis for urban transport planning. As a result of rapid urbanization and the spread of intelligent technologies, the population distribution, travel demand, and travel characteristics of cities are changing quickly. However, traditional survey methods can only reflect the traffic characteristics of a city within a fixed period of time. Their long survey cycles and small sample sizes make them difficult to adapt to the rapid development of cities, which brings challenges to urban transport planning and management. To fully understand urban traffic characteristics, urban management departments and service agencies need to acquire travel characteristics data at low cost and high frequency to support urban management and planning. Meanwhile, with the rapid development of information collection technology, large-sample, multi-dimensional, and fine-grained information is becoming increasingly easy to acquire.
New traffic data collection methods are generally divided into four categories [1]:
- (1) Location-based traffic information acquisition techniques, such as floating cars with global positioning systems (GPSs). The data are collected by vehicles equipped with GPSs and communication devices as they drive on the road.
- (2) Radio frequency-based traffic information acquisition techniques, such as radio frequency identification (RFID). RFID generally consists of readers, tags, and back-end systems. Each tag has a unique identifier and is usually attached to the vehicle body for tracking and management.
- (3) Video-based traffic information acquisition techniques, such as license plate recognition (LPR) systems. LPR data are collected through the smart checkpoint (bayonet) system, which is composed of front-end equipment, a communication transmission network, and a back-end monitoring and management platform. When a vehicle passes through the system, it is photographed and recorded.
- (4) Sensor-based traffic information acquisition techniques, such as microwave radar, inductive loop detectors, and magnetometers. For inductive loop detectors, a set of induction coils is buried under the road, and the driving state of the vehicle is obtained by detecting the change in the coil inductance.
However, data acquired by location-based techniques, such as GPS data, can only be collected from vehicles equipped with GPSs and communication devices, such as taxis and online car-hailing vehicles. Likewise, RFID data can only be collected from vehicles equipped with RFID tags. Both therefore yield relatively small samples. Sensor-based techniques, such as inductive loop detectors, can only record that a vehicle has passed and cannot capture other characteristics, such as the vehicle color or license plate number.
In recent years, automatic license plate recognition technology has been actively developed and promoted, and a huge amount of LPR data has been obtained. An LPR system uses advanced photoelectric technology, image processing technology, and pattern recognition technology to take images of each passing vehicle and automatically recognize vehicle license plates. The LPR system is usually installed a few meters away from the highway. When a vehicle is detected at a certain distance in front of the device, the system begins to capture images of the vehicle. The image recognition algorithm is used to identify the vehicle license plate number, vehicle type, color, and other vehicle information. The data are then stored in the database and finally uploaded to the traffic administration data center [2].
Scholars have conducted extensive research based on LPR data. The LPR system is essentially a network of cameras that photographs each passing vehicle and automatically converts the image into a detailed spatiotemporal record, capturing vehicle data in real time with high precision and wide coverage. Each LPR record includes the detector ID (representing different camera mounts), the license plate number (vehicle ID), the direction of the vehicle, and the timestamp at which the camera captured the vehicle. Therefore, the LPR system collects the spatiotemporal information of each vehicle, and the travel track of each vehicle can be reconstructed by linking a series of spatiotemporal records [3]. As an important input for traffic demand management, obtaining the origin–destination (OD) matrix from LPR data has become a popular research area. In 2000, Dixon et al. proposed a model to estimate the OD information of vehicles based on LPR data. The trajectory data left by vehicles in the LPR system were used to analyze travel characteristics and obtain the OD matrix, which was verified on a highway [4]. Antoniou et al. directly estimated the dynamic OD using the OD matrix obtained from LPR data [5]. Sun et al. obtained each vehicle’s path nodes and travel times by recognizing its license plate, filled in missing path information with a Bayesian estimation model, obtained an initial OD matrix from the LPR data, and then corrected it using the road traffic flow to obtain the final OD matrix [6]. Zhou et al. used the vehicle trajectory data obtained by the Chongqing RFID system to obtain vehicle travel OD information and compared the data with resident travel survey data. Vehicle stay point recognition, OD segmentation, and vehicle behavior portraits were used to obtain the vehicle OD [7]. Apart from obtaining the OD, LPR also plays an important role in estimating the traffic flow [8], which is essential for a wide range of intelligent transportation system (ITS) applications, such as carpooling [9], travel behavior clustering [10], queue length estimation [11,12], traffic state estimation [13], and trajectory reconstruction [14]. Thus, LPR has become an indispensable data resource in transportation research and plays a crucial role in promoting the development of intelligent transportation.
Another emerging and promising source of information for urban transport planning is cellular signaling (CS) data. CS data are communication data exchanged between mobile phone users and the transmitting base station or microstation. As soon as the mobile phone is turned on, CS data begin to be generated. Owing to their large sample size, long observation period, short sampling interval, and strong traceability, CS data have attracted widespread attention from researchers. A CS data record includes the data record number, base station location area code, traveler ID (unique identity), base station identification code, communication time, GPS longitude, GPS latitude, traveler gender, and traveler age. Therefore, CS data contain the location information of each user, allowing the trajectory of each traveler to be reconstructed by linking a series of spatiotemporal records. Apart from individual mobility research, CS data play an important role in calculating the regional population, estimating OD flows, constructing traveler profiles, and analyzing the spatiotemporal distribution of an urban population.
CS data have stimulated researchers to review the conventional research questions about human mobility at an unprecedented spatiotemporal scale, with contents including, but not limited to, traffic demand analysis and control, dynamic population spatial distribution analysis, road spatiotemporal uniformity management [15,16,17], occupation and residence commuter channel analysis, external passenger flow channel analysis, transport mode detection [18], large passenger flow analysis [19], and urban arterial traffic status detection [20]. An in-depth understanding of human spatiotemporal flow patterns and their interactions with the urban environment can be beneficial for various applications, from urban planning and transportation to public health [21]. Human mobility patterns are closely related to the population distribution and urban traffic patterns. Some attempts have been made to analyze urban traffic. Jiang et al. used CS data to obtain the daily activities and travel characteristics of Singapore residents [22]. Alexander et al. obtained the travel matrix of residents in Boston based on CS data and inferred the travel purpose of residents based on historical information [23]. Gao et al. used CS data to extract the characteristics of the travel time and the space distribution of Beijing residents [24]. Liu et al. analyzed how urban land use influences commuting flows in Wuhan from the perspective of CS data [25].
Based on the above studies, LPR and CS data have become indispensable forms of data in transportation research. However, the existing research has the following main shortcomings: (1) Owing to the cost or difficulty of data acquisition, many studies on traffic planning, OD analysis, and other traffic topics were based on a single dataset, which has certain limitations in analyzing traffic patterns and population distributions. (2) Although many scholars have conducted extensive research using CS and LPR data, they did not distinguish between the stay population and the move population. In transportation research, the move population and the stay population often have different spatial and temporal patterns, so identifying them is of great importance. (3) Owing to differences in the collection rules, data density, and acquisition methods of different data sources, it is difficult to determine whether urban traffic analysis results obtained using different data sources differ significantly.
To deal with these problems, the CS and LPR datasets from five working days in Foshan, China, were used to analyze the population distribution and traffic patterns that are of most concern for urban transportation planning. First, based on the different characteristics of CS and LPR data, algorithms were designed to identify the move population and stay population. Then, the move population and stay population distribution were obtained. Finally, the correlation degrees of the results were analyzed in terms of the correlation coefficient and the significance level.
The remainder of this article is organized as follows. The LPR and CS datasets are described in Section 2, which also introduces the data preprocessing. The LPR- and CS-based stay point recognition methods are proposed in Section 3, together with the correlation indices. The results and discussion are presented in Section 4. Finally, Section 5 presents conclusions and future research directions.
3. Methodology
To study the dynamic distribution of urban residents, mobile phone users and vehicles were divided into move and stay states. The move state means that the user was in the process of a trip, while the stay state indicates that the user was staying in a certain location to engage in work, study, leisure, or entertainment activities. In contrast to the moving state, people are usually in the staying state for most of the day. The move population was mainly used to describe the characteristics of population mobility in the region, while the staying population expressed the common characteristics of crowd gathering. By differentiating the spatial and temporal information, we can further understand the differences in the functional use of the city by the population.
3.1. Stay Point Recognition Algorithm Based on Different Speed Thresholds
The identification of the stay points of the vehicle using the LPR data can be divided into two cases: (1) the end point of the previous trip (EPPT) and the start point of the next trip (SPNT) are in different LPR systems; (2) the end point of the previous trip and the start point of the next trip are in the same LPR system. The identification of the staying locations in these two cases should be handled separately. The basic idea of stay point recognition is to judge whether the vehicle stays between two track points based on its speed [26].
In the first case, the steps to identify the stay point are as follows:
Step 1: First, we use the ArcGIS path matching algorithm to match each LPR system to the road network; ArcGIS is a scalable, comprehensive geographic information system platform. We then obtain the shortest path between LPR systems by requesting the Baidu Map API, a set of free application interfaces for developers based on the Baidu Map service, and use it to establish the shortest distance matrix of the LPR systems.
Step 2: Based on the time series of the passing records of a vehicle, the timestamps and locations of adjacent records are collected, and the driving speed is calculated from the shortest distance and the time difference:
$$v = \frac{d}{t_{d} - t_{u}} \quad (1)$$
where $v$ represents the driving speed of the vehicle, $d$ denotes the shortest distance obtained from Step 1, and $t_{d}$ and $t_{u}$ are the timestamps of the downstream and the upstream LPR systems, respectively.
Step 3: Based on the travel times between different LPR systems in the statistical time window T, the lower speed limit $v_{\min}$ is calculated, and the lower speed limit matrix is obtained. As the volatility of the traffic flow within a day leads to fluctuations in the travel time, a value that would be abnormal during off-peak hours may be normal during peak hours. Therefore, a statistical time window was set, and the travel time within the same time window was considered to be stable. We set T to 60 min from 00:00 to 06:00 and to 30 min from 06:00 to 24:00. $v_{\min}$ was the maximum of the lower 5% quantile of the travel speeds between two LPR systems, denoted $v_{5\%}$, and 5 km/h:
$$v_{\min} = \max\left(v_{5\%},\ 5\ \text{km/h}\right) \quad (2)$$
Step 4: If the driving speed between two LPR systems was smaller than $v_{\min}$, we considered the vehicle to have stopped between the two LPR systems and marked the previous LPR system as the stay point; otherwise, it was considered a moving point.
In the latter case, i.e., when the end point of the previous trip and the start point of the next trip are in the same LPR system, if the time difference (TD) between two consecutive records was greater than 20 min, we defined it as a stay point.
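Steps 2 and 3 can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not confirm: the function names are hypothetical, "the lower 5% of the travel speeds" is read as a nearest-rank 5th percentile, and a record is assigned to a statistical time window by a single timestamp.

```python
import math
from datetime import datetime

def driving_speed_kmh(distance_km, t_up, t_down):
    """Step 2: shortest distance divided by the time difference of two records."""
    hours = (t_down - t_up).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("downstream timestamp must be later than upstream")
    return distance_km / hours

def window_index(t):
    """Statistical time window: 60 min wide before 06:00, 30 min wide afterwards."""
    minutes = t.hour * 60 + t.minute
    if minutes < 360:
        return minutes // 60              # windows 0..5
    return 6 + (minutes - 360) // 30      # windows 6..41

def lower_speed_limit(speeds_kmh, floor_kmh=5.0, quantile=0.05):
    """Step 3 (assumed reading): max(5th-percentile speed, 5 km/h)."""
    ordered = sorted(speeds_kmh)
    # Nearest-rank percentile; other percentile definitions are equally plausible.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return max(ordered[rank - 1], floor_kmh)
```

In this reading, one lower limit would be computed per pair of LPR systems and per time window, from the speeds of all vehicles observed on that pair within the window.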
The above algorithm can be summarized by Algorithm 1.
Algorithm 1. Stay point recognition algorithm based on different speed thresholds
Input: LPR data of a vehicle, denoted as $P$
Output: LPR stay points ($P_{sp}$) and move points ($P_{mv}$)
For each point in $P$ do:
    If EPPT == SPNT:
        If TD > 20 min: $P_{sp}$ ← SPNT
        Else: $P_{mv}$ ← SPNT
    Else:
        If $v_{\min}$ > $v$: $P_{sp}$ ← SPNT
        Else: $P_{mv}$ ← SPNT
End for
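As an illustration only, the pseudocode might be sketched in Python as follows. The `LprRecord` type and the two lookup tables are hypothetical, the lower speed limit is taken as already computed per pair of LPR systems (time windows are ignored for brevity), and, following the pseudocode, the later record of each consecutive pair is the one that gets labeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class LprRecord:
    detector_id: str      # LPR system that captured the vehicle
    timestamp: datetime   # capture time

def classify_lpr_points(records, distance_km, v_min_kmh,
                        same_site_gap=timedelta(minutes=20)):
    """Split one vehicle's time-sorted records into (stay, move) lists.

    distance_km and v_min_kmh are hypothetical dicts keyed by
    (upstream_id, downstream_id).
    """
    stay, move = [], []
    for prev, nxt in zip(records, records[1:]):
        gap = nxt.timestamp - prev.timestamp
        if prev.detector_id == nxt.detector_id:
            # Same LPR system: decide on the time difference alone.
            is_stay = gap > same_site_gap
        else:
            # Different LPR systems: compare the speed with the lower limit.
            key = (prev.detector_id, nxt.detector_id)
            hours = gap.total_seconds() / 3600.0
            # Identical timestamps are treated as moving (assumption).
            speed = distance_km[key] / hours if hours > 0 else float("inf")
            is_stay = speed < v_min_kmh[key]
        (stay if is_stay else move).append(nxt)
    return stay, move
```

The first record of a vehicle has no predecessor and is therefore left unlabeled in this sketch; how the original method treats it is not stated.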
3.2. Spatiotemporal Clustering Algorithm Based on Time Allocation
Unlike the traditional k-means and density-based spatial clustering of applications with noise (DBSCAN) clustering algorithms, our approach first calculates the stay time of a user at a given base station, because different mobile phone users have different spatiotemporal travel patterns. If the stay time exceeded the threshold $t_{\min}$, the user was considered to have stayed at the base station; otherwise, the user may simply have passed by, and an algorithm is needed to determine the potential location of the user. Hence, we proposed a spatiotemporal clustering algorithm based on time allocation to calculate the user’s actual stay position.
When the stay time of a user at a certain base station exceeded the threshold $t_{\min}$, the user was considered to have stayed at the base station, that is, the position of the base station was the stay point of the user. When the user’s location switched back and forth between different base stations in a time less than $t_{\min}$, it was very likely that the user was slowly moving or staying near the switched base stations, which was a potential stay space–time mode. In this case, we need to calculate the user’s actual stay position, and the calculation formula is as follows:
$$lng_{new} = \frac{\sum_{i=1}^{n} lng_i \, \Delta t_i}{t_n - t_1}, \qquad lat_{new} = \frac{\sum_{i=1}^{n} lat_i \, \Delta t_i}{t_n - t_1} \quad (3)$$
In the formula, $lng_{new}$ and $lat_{new}$ represent the longitude and latitude of the stay point, $t_1$ and $t_n$ represent the timestamps of the first and last points, respectively, and $\Delta t_i$ denotes the stay time at the $i$th base station, located at ($lng_i$, $lat_i$).
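To make the idea of time allocation concrete, the stay position can be read as a stay-time-weighted average of the base-station coordinates. The sketch below assumes that reading; the function name and the input layout are hypothetical, and the denominator is the sum of the stay times, which equals the span between the first and last timestamps when the stay times partition that span.

```python
def time_weighted_centroid(points):
    """points: list of (lng, lat, stay_seconds), one entry per base station.

    Returns the (lng, lat) of the potential stay position, weighting each
    base station by the time the user spent connected to it.
    """
    total = sum(seconds for _, _, seconds in points)
    if total <= 0:
        raise ValueError("total stay time must be positive")
    lng = sum(lng_i * seconds for lng_i, _, seconds in points) / total
    lat = sum(lat_i * seconds for _, lat_i, seconds in points) / total
    return lng, lat
```

A base station where the user lingered three times as long pulls the estimated position three times as strongly, which is why brief handovers to a neighboring station barely move the result.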
Based on Equation (3), we proposed a spatiotemporal clustering algorithm based on time allocation, described as follows:
Step 1: The pre-processed CS data were input, with each user as the processing unit. All the records of user i in one day were selected and sorted by time. The base stations passed by user i were named O1, O2, …, On in time order.
Step 2: Points O1 and O2 were selected in sequence, and the user’s potential stay point O was calculated using Equation (3), giving a new point with longitude $lng_{new}$ and latitude $lat_{new}$. If the distances from O1 to O and from O2 to O were both less than the distance threshold $d_{\min}$, O1 and O2 may have constituted a stay. Three consecutive points O1, O2, and O3 were then selected, and the potential stay point O was recalculated, again giving a new point with longitude $lng_{new}$ and latitude $lat_{new}$. If the distances from O1, O2, and O3 to O were all less than $d_{\min}$, then O1, O2, and O3 may have constituted a stay. By analogy, points were added one at a time; when n consecutive points O1, O2, …, On had been selected and there was a point whose distance from O was greater than $d_{\min}$, the loop stopped.
Step 3: The time interval from O1 to On−1 was calculated. If the time interval was greater than the time threshold $t_{\min}$, the points were considered to constitute a stay. The longitudes and latitudes of the points O1, O2, …, On−1 were replaced with the longitude and latitude of O, and the points were marked as a stay position.
Step 4: If the time interval was smaller than the time threshold $t_{\min}$, the points could not constitute a stay. Point O1 was marked as a moving point, points O2 and O3 were selected in turn, and the process returned to Step 2. The loop continued until all points of user i had been traversed.
The above algorithm can be summarized by the following flowchart (see Figure 2).
In Figure 2, $O_i$ and $O_{j+1}$ denote the $i$th and $(j+1)$th points of the user, respectively. $d_{oi}$ represents the distance from $O_i$ to the potential stay point $O$. $O_{mv}$ is the move point dataset and $O_{sp}$ is the stay point dataset. The variables $t_{\min}$ and $d_{\min}$ are the time threshold and distance threshold, respectively. $lng_{new}$ and $lat_{new}$ are the longitude and latitude of the potential stay point calculated by Equation (3).
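The four steps can be sketched in Python as below. This is a rough illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, distances use the haversine formula, each point in a window is weighted by the time until the next point in that window (so the last point has zero weight), and the threshold values are left to the caller because they are not given in this section.

```python
import math

def haversine_m(lng1, lat1, lng2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def centroid(window):
    """Time-weighted centre of a window of (t_seconds, lng, lat) points."""
    span = window[-1][0] - window[0][0]
    if span <= 0:  # single point or identical timestamps: plain average
        n = len(window)
        return sum(p[1] for p in window) / n, sum(p[2] for p in window) / n
    pairs = list(zip(window, window[1:]))
    lng = sum(a[1] * (b[0] - a[0]) for a, b in pairs) / span
    lat = sum(a[2] * (b[0] - a[0]) for a, b in pairs) / span
    return lng, lat

def is_compact(window, d_min):
    """True if every point lies within d_min metres of the window centre."""
    c_lng, c_lat = centroid(window)
    return all(haversine_m(p[1], p[2], c_lng, c_lat) < d_min for p in window)

def cluster_stays(points, t_min, d_min):
    """points: time-sorted (t_seconds, lng, lat) records of one user.

    Returns a list of (label, t_seconds, lng, lat), label in {"stay", "move"}.
    """
    labelled, i, n = [], 0, len(points)
    while i < n:
        j = i + 1  # window is points[i:j]
        # Step 2: grow the window while all points stay close to its centre.
        while j < n and is_compact(points[i:j + 1], d_min):
            j += 1
        window = points[i:j]
        if window[-1][0] - window[0][0] > t_min:
            # Step 3: long enough, so replace coordinates with the centre.
            c_lng, c_lat = centroid(window)
            labelled.extend(("stay", p[0], c_lng, c_lat) for p in window)
            i = j
        else:
            # Step 4: too short, so mark the first point as moving and restart.
            labelled.append(("move",) + tuple(points[i]))
            i += 1
    return labelled
```

Restarting from the next point after a failed window keeps the scan simple at the cost of recomputing centres; for a single user's daily records this is usually cheap.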
3.3. Correlation Index
In traffic analysis, the commonly used spatial analysis unit is the traffic analysis zone (TAZ). Thus, it is necessary to count the number of users/vehicles of two types (move and stay) in the TAZ throughout a day. To compare the calculation results of different datasets, the proportion of TAZ users/vehicles was used instead of the absolute number. The calculation method was as follows:
$$P_i^{Type} = \frac{N_i^{Type}}{\sum_{j=1}^{n} N_j^{Type}} \quad (4)$$
where $P_i^{Type}$ represents the proportion of users of each type in TAZ $i$, $Type$ is either stay or move, $N_i^{Type}$ represents the number of users of each type in TAZ $i$, and $n$ is the number of TAZs.
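The normalization amounts to dividing each TAZ count by the total over all TAZs, as in the short sketch below. The function name and the dictionary layout (TAZ identifier mapped to a count for one type) are assumptions made for illustration.

```python
def taz_proportions(counts):
    """counts: dict mapping TAZ id -> number of users/vehicles of one type.

    Returns a dict mapping TAZ id -> share of that type's total.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no users or vehicles counted")
    return {taz: count / total for taz, count in counts.items()}
```

Because the shares of each dataset sum to 1, LPR vehicle counts and CS user counts become directly comparable despite their very different absolute sizes.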
3.3.1. Correlation Coefficient
To quantitatively describe the differences in the population distribution and urban traffic pattern calculated using the two datasets, the correlation coefficients and significance levels were used as indicators. Correlation is a non-deterministic relation, and the correlation coefficient is one of the indicators used to measure linear correlations between the variables. The Pearson correlation coefficient (defined as $r$) is used to measure the degree of correlation between two variables. It is generally believed that when 0 < |$r$| ≤ 0.3, the two variables are weakly correlated; when 0.3 < |$r$| ≤ 0.5, the two variables are slightly correlated; when 0.5 < |$r$| ≤ 0.8, the two variables are significantly correlated; when 0.8 < |$r$| ≤ 1, the two variables are highly correlated. $r$ can be calculated by the following formula:
$$r = \frac{\sum_{i=1}^{n}\left(P_i^{A} - \bar{P}^{A}\right)\left(P_i^{B} - \bar{P}^{B}\right)}{\sqrt{\sum_{i=1}^{n}\left(P_i^{A} - \bar{P}^{A}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(P_i^{B} - \bar{P}^{B}\right)^2}} \quad (5)$$
where $r$ is the Pearson correlation coefficient of the two datasets, $n$ denotes the number of TAZs, $P_i^{A}$ and $P_i^{B}$ are the proportion of user types in the TAZs in datasets A and B, respectively, and $\bar{P}^{A}$ and $\bar{P}^{B}$ denote the average proportion of user types in the TAZs in datasets A and B, respectively.
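A plain-Python sketch of the coefficient and of the strength bands quoted above may help fix ideas. The function names are hypothetical, the inputs are the per-TAZ proportions of the two datasets in the same TAZ order, and the label returned for exactly zero is an assumption since the text defines no band for it.

```python
import math

def pearson_r(a, b):
    """Pearson correlation of two equally long sequences of TAZ proportions."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)

def correlation_strength(r):
    """Map |r| to the verbal bands used in the text."""
    size = abs(r)
    if size == 0:
        return "no linear correlation"   # not defined in the text (assumption)
    if size <= 0.3:
        return "weakly correlated"
    if size <= 0.5:
        return "slightly correlated"
    if size <= 0.8:
        return "significantly correlated"
    return "highly correlated"
```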
3.3.2. Significance Test
The correlation coefficient only shows that the LPR and CS data are correlated within these five working days. Since this is only a sample, the correlation coefficient obtained may be affected by sampling error: even when the population correlation coefficient is 0, the sample correlation coefficient may differ from 0. Therefore, to judge whether the correlation coefficient is meaningful, it must be compared with the population correlation coefficient. This requires a hypothesis test on $r$ to determine whether the observed value was caused by sampling error or reflects a genuine correlation between the two variables. A significance test starts from a hypothesis, formed in advance, about the parameters or the distribution form of the population (random variables), and uses the sample information to judge whether the hypothesis is reasonable, i.e., whether the true status of the population differs significantly from the null hypothesis. The significance of the Pearson correlation coefficient is tested with the t-test. Therefore, the t-test was used in this analysis, and the steps were as follows:
Step 1: A hypothesis was established.
$H_0$: $\rho = 0$, the LPR and CS data are not linearly correlated;
$H_1$: $\rho \neq 0$, the LPR and CS data are linearly correlated.
Step 2: A significance level was set.
The significance level was established by convention. Generally, α = 0.05 or 0.01, which limits the probability of wrongly rejecting a true null hypothesis to 5% or 1%, respectively (in statistics, events with a probability of less than 5% are usually regarded as “practically impossible events” in a single trial). We defined a significance level of α = 0.05 in this study.
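Step 3: The t-test statistic was calculated.
The $t$-test statistic, which follows a t-distribution with $n - 2$ degrees of freedom under $H_0$, can be calculated using the following formula:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \quad (6)$$
By calculating $t$, we can obtain the $p$-value. If the observed value of the statistic $t$ obtained from the sample is $t_0$, then the $p$-value can be obtained by the following formula:
$$p = P\left(|t| \geq |t_0| \mid H_0\right) \quad (7)$$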
The $p$-value is the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed. Generally, $p$ < 0.05 is considered to indicate a statistical difference, $p$ < 0.01 is considered to be statistically significant, and $p$ < 0.001 is considered to be extremely statistically significant. That is, if there were no true correlation, sampling error alone would produce a result this extreme with a probability of less than 0.05, 0.01, and 0.001, respectively. The $p$-value does not measure the strength or practical importance of the correlation; it only quantifies this probability. Since the calculation of the $p$-value is not the focus of this article, readers can refer to the relevant literature [27].
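For readers who want to see the test end to end, the following standard-library sketch computes the statistic and a two-sided p-value. It assumes the usual form of the test (t-distribution with n − 2 degrees of freedom) and approximates the tail probability by Simpson's rule; the function names are hypothetical, and in practice a statistics library would normally be used instead.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for a sample of n TAZs."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t0, df, steps=2000):
    """P(|t| >= |t0|) under H0, integrating the density on [0, |t0|]."""
    upper = abs(t0)
    if upper == 0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3                # P(0 <= t <= |t0|)
    return max(0.0, 1.0 - 2.0 * central)
```

With these helpers, the null hypothesis would be rejected at α = 0.05 whenever `two_sided_p(t_statistic(r, n), n - 2)` falls below 0.05.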
5. Conclusions
Compared with traditional survey methods, LPR and CS data can quickly reveal the population distribution and travel patterns of urban residents. However, most analyses of the population distribution and urban traffic patterns are based on a single dataset, so it is unknown whether the results differ between datasets. In addition, such analyses usually do not distinguish between the move population and the stay population, although this distinction is critical in transportation research. As a result, there is no guarantee that results calculated from a single dataset can meet the accuracy requirements of urban traffic analysis. To address this problem, five working days of LPR and CS data from Foshan City were examined, and different stay point recognition methods were used to calculate the population distribution and traffic patterns commonly used in urban traffic planning. For LPR data, different stay point recognition methods were designed according to whether the end point of the previous trip and the start point of the next trip were in the same LPR system. For CS data, a spatiotemporal clustering algorithm based on time allocation was proposed to recognize stay points. Then, the correlations between the two datasets were analyzed. The results showed high similarity between the population distributions and traffic patterns obtained from the two datasets. However, due to the different collection mechanisms of the two datasets, the identification of stay and move users/vehicles differed slightly. CS data are collected regardless of whether the mobile phone user is traveling, while LPR data are collected only when the vehicle is traveling. In addition, vehicle users make up only a minority of mobile phone users. Therefore, although there was a noon peak in the identification of vehicle users, there was no noon peak in the CS data, since most mobile phone users do not travel at noon. Moreover, the coverage of the LPR systems in some TAZs is low, so travel identification there is strongly affected by individual trips, resulting in low correlations, for example in TAZs No. 2, 9, 11, and 31.
In most cases, the population distributions and traffic patterns calculated from the different traffic datasets, and the conclusions drawn from them, were highly similar. The experimental results confirm that the proposed method can capture the temporal and spatial correlation among human mobility datasets and thereby support urban transport planning. Our main contributions can be summarized as follows:
- (1) A spatiotemporal clustering algorithm based on time allocation was proposed to identify stay points using cellular signaling data.
- (2) Urban traffic patterns and population distribution were obtained from the perspective of multisource traffic data.
- (3) The correlation between cellular signaling data and license plate recognition data was analyzed.
- (4) The results revealed that cellular signaling data and license plate recognition data were significantly correlated in population distribution and urban traffic patterns.
This data mining and analysis shows that, provided the data quality is fully guaranteed, fusing different traffic datasets better reflects the regularities of urban traffic when analyzing population distribution and traffic patterns. It also shows that the human travel data obtained through different data collection devices are not independent but are significantly correlated. In our future work, we will integrate more kinds of human mobility data for urban transport planning to support the development of intelligent transportation systems. Meanwhile, because integrating more datasets increases the risk of user privacy leakage, data privacy protection is one of our next research directions. In addition, the same TAZ may include both office areas and residential areas, whose population distributions will differ. Therefore, the population distribution characteristics of residential and office areas are another direction for future research.