Identification and Prediction of Large Pedestrian Flow in Urban Areas Based on a Hybrid Detection Approach

Abstract: Population density has grown quickly with accelerating urbanization. At the same time, overcrowding is more likely to occur in populous urban areas, which increases the risk of accidents. This paper proposes a synthetic approach to recognize and identify large pedestrian flows. In particular, a hybrid pedestrian flow detection model was constructed by analyzing real data from major mobile phone operators in China, including information from smartphones and base stations (BS). Within the hybrid model, the Log Distance Path Loss (LDPL) model was used to estimate the pedestrian density from raw network data, and information was retrieved with the Gaussian Process (GP) through supervised learning. Temporal-spatial prediction of the pedestrian data was carried out with Machine Learning (ML) approaches. Finally, a case study of a real Central Business District (CBD) scenario in Shanghai, China, was conducted using records of millions of cell phone users. The results showed that the new approach significantly increases the utility and capacity of the mobile network. A more reasonable overcrowding detection and alert system can be developed to improve safety in subway lines and other hotspot landmark areas, such as the Bund, People's Square or Disneyland, where a large passenger flow generally exists.


Introduction
Rapid urbanization leads directly to crowding in megacities. Major sports events and holiday gatherings are likely to cause accidents at hot spots through overcrowding, such as the stampede on New Year's Eve of 2015 in Shanghai. Since then, the Shanghai municipal government has analyzed the causes of such incidents and organized an alliance of mobile operators and information technology (IT) experts to set up an information platform that alerts and pre-controls pedestrians against overcrowding accidents. Using wireless data to model and analyze crowd density, and to give early warning of overcrowding, is an important step with significant real and potential applications.
Traditional pedestrian detection adopted video-based methods, which extract the pedestrian flow and individuals from video sequences. Typically, the algorithms for pedestrian flow analysis can be summarized in three consecutive parts: pedestrian detection, tracking and counting. During the detection of pedestrian flow, individual detection and trajectory tracking require background difference [1][2][3][4] and time difference algorithms [5,6], as well as the expectation maximization (EM) technique [7,8]. To obtain the density of pedestrians, studies tend to focus on pixel texture and individual characteristics. As for motion tracking, bottom-up data-driven methods outperform top-down model-driven ones in obtaining trajectories [9][10][11]. However, many problems remain for video-based detection, including color constancy problems caused by lighting conditions and weather, inaccurate counting caused by individual occlusion and coarse segmentation, real-time requirements limited by the high complexity of tracking algorithms, and the high cost of deployment and maintenance. From this point of view, video-based pedestrian detection is not appropriate for large numbers of people over large outdoor areas.
Positioning methods play an important role in pedestrian detection, since the number of people and their locations can be obtained by tracking through localization systems. Localization based on a Received Signal Strength (RSS) fingerprinting approach has attracted a large amount of research effort during the past few decades. The basic idea is to construct an RSS fingerprint database during the training phase, and then perform location estimation by matching the users' reported fingerprints against the database during the localization phase [12]. Indoor localization systems based on this approach have been developed in different flavors. Embedded sensors of mobile devices are exploited to improve the accuracy of location estimation [13,14], and the crowdsourcing paradigm is introduced to reduce the cost of the site survey in the training phase [15]. Machine learning algorithms are also leveraged to shorten the delay of the localization process [16][17][18].
Recently, more research institutes and IT companies have analyzed pedestrians using their large data resources from applications or operators. Mobile communication operators and providers, such as Telecom (China) and Huawei (Shenzhen, China), have access to the network data interface connected to individual smartphones and to the signal data of base stations, which can be fully exploited for pedestrian flow detection. IT companies providing search, social communication and map applications, such as Baidu (Beijing, China) and Tencent (Shenzhen, China), own search request and GPS data, which provide position information, and can extract more from the GPS and sensor data of crowdsourcing users' smartphones.
In this paper, we leverage the large set of real-time data of network information platforms from operators, together with other auxiliary sources, to analyze crowd density and speed. Prediction analyses have been carried out to raise an alert when pedestrian density reaches a certain threshold, so that overcrowding can be avoided in advance. The main contributions of this study can be summarized as follows:
(1) Various sources of information have been integrated to build a multilayer and reliable information platform. Specifically, pedestrian localization by matching a cell ID gives a big picture of the participant density distribution. Then, the multiple base station (MultiBS) information and the Wi-Fi Application Programming Interface (API) were leveraged to improve the accuracy. The embedded sensors help obtain the trajectory information, and the Integrated Circuit (IC) card counts provide the total number of participants within a certain area. Furthermore, video processing equipment may be installed at certain important spots.
(2) Raw data from a real mobile network, which recorded interaction details of more than 3 million users over 20 days, were processed and some important analysis results are presented. In particular, the basic temporal and spatial properties were analyzed and presented in figures. Based on a refined Log Distance Path Loss (LDPL) model, pedestrian density was analyzed through mobile data at different precisions with corresponding processing methods.
(3) A positioning algorithm was implemented to leverage the temporal correlation of wireless signal propagation. A generalized model of the temporal correlation of signal strength was used. The probabilistic method, mainly the fingerprinting method, can help improve the positioning accuracy with additional effort. The fingerprinting method can be applied to both base station signals and Wi-Fi signals, which helps achieve a multi-level pedestrian localization system.
(4) Finally, the data were modelled both temporally and spatially through the Gaussian Process (GP) and regressed to a multivariate Gaussian distribution. This modeling and regression help to recover missing values, which is useful for pedestrian detection systems since real-time data are generally incomplete. To alert the crowd in advance, the overcrowded time is predicted, and the predictions of the Gaussian process and machine learning methods are compared.
The remainder of the paper is organized as follows: Section 2 presents the related works. Section 3 illustrates the hybrid pedestrian detection model. Section 4 presents the analysis of the operators' network data and describes methods for mobile tracking and positioning design. Section 5 provides the data regression and prediction through Gaussian Process analysis, together with the Central Business District (CBD) case analysis. Finally, concluding remarks and future research directions are provided in Section 6.

LAN Mobile Localization
Pedestrian detection can be achieved through localization approaches. While outdoor localization is well served by the Global Positioning System (GPS), many localization techniques and sophisticated schemes have been developed using deployed sensors [19], Radio Frequency Identification (RFID) [20], Bluetooth [21], infrared transceivers [22], etc. However, considering deployment and maintenance costs, these methods are inferior to Local Area Network (LAN) positioning approaches for the task of large-scale pedestrian detection. Such LAN localization technologies can be roughly categorized into two groups: deterministic methods, including Angle of Arrival (AOA) [23], Time of Arrival (TOA) [24] and Time Difference of Arrival (TDOA) [25], and probabilistic methods, mainly the fingerprinting method [17,18,26].
Among the aforementioned positioning schemes, the fingerprinting method has attracted more attention. The basic idea of the fingerprinting approach can be described as two phases, shown in Figure 1. First, in the offline training phase, the physical area is divided into grids whose points are marked as location Reference Points (RPs). Then, a number of RSS values from all detected Wireless Local Area Network (WLAN) access points (APs) are collected at each reference point to build the RSS profiles known as radio maps. The online phase then estimates the user location by matching the requested measurement against the radio map. Many positioning algorithms may be used for the online localization phase, such as k-Nearest Neighbor (kNN), Machine Learning (ML), Compressive Sensing (CS), etc.
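The two fingerprinting phases can be illustrated with a minimal kNN sketch in Python. It is not the implementation used in this study: the radio map values, the AP names and the floor value for unheard APs are hypothetical, and the location estimate is simply the centroid of the k reference points closest in signal space.

```python
import math

# Hypothetical radio map from the offline phase:
# reference point (x, y) in metres -> mean RSS in dBm per access point.
RADIO_MAP = {
    (0.0, 0.0): {"ap1": -40.0, "ap2": -70.0, "ap3": -65.0},
    (10.0, 0.0): {"ap1": -55.0, "ap2": -60.0, "ap3": -70.0},
    (0.0, 10.0): {"ap1": -60.0, "ap2": -75.0, "ap3": -50.0},
    (10.0, 10.0): {"ap1": -70.0, "ap2": -55.0, "ap3": -55.0},
}

MISSING_RSS = -100.0  # assumed floor for APs that were not heard


def rss_distance(a, b):
    """Euclidean distance in signal space over the union of AP ids."""
    aps = set(a) | set(b)
    return math.sqrt(
        sum((a.get(ap, MISSING_RSS) - b.get(ap, MISSING_RSS)) ** 2 for ap in aps)
    )


def knn_locate(measurement, radio_map, k=3):
    """Online phase: average the k reference points nearest in signal space."""
    ranked = sorted(radio_map, key=lambda rp: rss_distance(measurement, radio_map[rp]))
    nearest = ranked[:k]
    x = sum(p[0] for p in nearest) / len(nearest)
    y = sum(p[1] for p in nearest) / len(nearest)
    return (x, y)


if __name__ == "__main__":
    online = {"ap1": -45.0, "ap2": -68.0, "ap3": -64.0}
    print(knn_locate(online, RADIO_MAP, k=3))
```

An unweighted centroid keeps the sketch short; weighting each neighbor by the inverse of its signal-space distance is a common refinement.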

Temporal Spatial Tracking Prediction
Kaemarungsi et al. investigated properties of the RSS for fingerprinting-based localization using Wi-Fi [27]. Comprehensive experimental results reveal two important features of the RSS: first, the mean and variance of the RSS at one location basically remain the same over time; second, the auto-covariance function of the RSS at one location has the same shape for separate time series. Based on these two observations, this study models the RSS observed at one location as a stationary process. Fang et al. propose a localization approach based on a dynamic system and machine learning technique [17], which estimates users' location from a state consisting of RSSes observed at different times and locations. However, this simple combination of spatial and temporal information, in which the RSS observed at different times is treated as multiple measurements of fingerprints, does not reveal the essence of how temporal information can be utilized for localization.

Hybrid Pedestrian Flow Model
In this section, all information is integrated to achieve a comprehensive pedestrian flow detection platform. As presented in Figure 2, a hybrid pedestrian detection model (HPDM) is proposed that consists of five levels. Initially, a coarse person density may be obtained from a single base station with an accuracy of 200 to 300 m, which is the first level of detection. For a large pedestrian flow control architecture at the megacity level, this level is accurate enough for an overall picture.
Each mobile phone generally receives signals from more than three base stations, from which LAN localization can be obtained through signal modeling. Based on the Log Distance Path Loss (LDPL) model and the AOA and DOA positioning methods, triangulation localization can achieve an accuracy of better than one hundred meters, which is the approach for the second level of the hybrid pedestrian density detection model. As all wireless signal strengths can be obtained in real time from an operator's network interface, users are usually unaware of this passive localization procedure.
For more accurate pedestrian density, fingerprint localization is introduced. The probabilistic fingerprint method provides more accuracy with a rather complex algorithm, which brings the pedestrian detection radius to around 10 m, forming the third level of the hybrid pedestrian flow detection model. Furthermore, additional sensors and auxiliary information for pedestrian flow, such as Wi-Fi, Bluetooth and RFID, may achieve higher accuracy. Additionally, some map applications can access accelerometer data, which may assist with predicting the pedestrian direction. Video-based methods may also be introduced at the most important sights. These further improve the pedestrian detection and constitute the fifth and final level.
The hybrid localization can be summarized as follows. We first integrated sources to build a multilayer and reliable information platform. Specifically, pedestrian localization by matching a cell ID gives an overall picture of the person density distribution. Furthermore, the Multi-BS information and the Wi-Fi API can be leveraged to improve the accuracy. In addition, the embedded sensors can help obtain the trajectory information, the IC card counting provides the total number of persons within a certain area, and video processing may also be set up at certain important spots. Finally, after integrating all of the technical platforms, a multilayer pedestrian information architecture can be built.
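The second-level idea of turning signal strength into range and then into position can be sketched as follows. This is an illustration only, not the refined LDPL model of this study: the reference power `p0_dbm`, the path loss exponent `n` and the reference distance `d0_m` are hypothetical values that would have to be calibrated for a real urban network, and the position is obtained by a linearized least-squares trilateration that is swapped in for the AOA/DOA processing.

```python
import math


def ldpl_rss(distance_m, p0_dbm=-40.0, n=3.0, d0_m=1.0):
    """Log-distance path loss: RSS(d) = P0 - 10 * n * log10(d / d0)."""
    return p0_dbm - 10.0 * n * math.log10(distance_m / d0_m)


def ldpl_distance(rss_dbm, p0_dbm=-40.0, n=3.0, d0_m=1.0):
    """Invert the LDPL model to estimate the range from a measured RSS."""
    return d0_m * 10.0 ** ((p0_dbm - rss_dbm) / (10.0 * n))


def trilaterate(stations, distances):
    """Least-squares position from three or more (x, y) stations and ranges.

    Subtracting the last circle equation from the others gives linear
    equations a*x + b*y = c, solved here through the 2x2 normal equations.
    """
    (xn, yn), dn = stations[-1], distances[-1]
    s_aa = s_ab = s_bb = s_ac = s_bc = 0.0
    for (xi, yi), di in zip(stations[:-1], distances[:-1]):
        a = 2.0 * (xi - xn)
        b = 2.0 * (yi - yn)
        c = dn**2 - di**2 + xi**2 - xn**2 + yi**2 - yn**2
        s_aa += a * a
        s_ab += a * b
        s_bb += b * b
        s_ac += a * c
        s_bc += b * c
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("stations are collinear; position is not unique")
    x = (s_ac * s_bb - s_ab * s_bc) / det
    y = (s_aa * s_bc - s_ab * s_ac) / det
    return (x, y)


if __name__ == "__main__":
    # Hypothetical base station coordinates in metres and measured RSS in dBm.
    bs = [(0.0, 0.0), (500.0, 0.0), (0.0, 500.0)]
    measured = [-111.0, -119.0, -115.0]
    print(trilaterate(bs, [ldpl_distance(v) for v in measured]))
```

With noisy measurements the circles do not meet in one point, which is why a least-squares solution is used rather than an exact intersection.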

Cell Data Process and Analysis
This section introduces the entire data access procedure, including data preprocessing, data structure, user temporal properties, and pedestrian density spatial distribution. Information from a single base station was incorporated to detect pedestrian flow as the first level. To detect pedestrian flow from actual mobile networks, real datasets from an operator's data center were examined, which contain detailed signaling and application records of 3,384,521 active users in 65,482 active cells within the first 20 days of January 2016.
Several kinds of records can be tracked and saved in both the packet service (PS) and circuit service (CS) domains, including the call detail record (CDR), exchange detailed record (XDR), and user field detailed record (UFDR). An overview of the evolved packet core (EPC) architecture is presented in Figure 3 to make the paper self-contained and to facilitate understanding of the information resources for pedestrian detection: the raw CDR data, including signaling CDR intercepted from the Gn interface in the third generation (3G) network, Internet CDR from web services and data traffic CDR for charging. The XDR is pre-processed from the CDR. The UFDR provides detailed categorized records for HyperText Transfer Protocol (HTTP)/Wireless Application Protocol (WAP) browsing, email, Multimedia Messaging Service (MMS), Domain Name System (DNS), File Transfer Protocol (FTP), and streaming. Both the CDR from the Gn interface and the sequence data of the UFDR were collected for HTTP browsing actions. The format of the collected data is presented in Table 1, which shows a sample simplified from 36 kinds of parameters for privacy reasons. The datasets were obtained from China Telecom, which recorded and stored data following the best industry practices. Moreover, the researchers signed a commitment to respect privacy, which limited access to the data to a few authorized partners.

Pedestrian Temporal Properties
Information from CDR data can be categorized as time info, position info, event info, user info, throughput info and application info. The temporal properties of the data are summarized as follows.
User activities counting. Each calling action is treated as an activity, and after summing the counts for each hour, the result for a typical week was obtained and is shown in Figure 4a. Users make on average around 140 million calls per hour from 9 a.m. to 9 p.m., and fewer than 60 million per hour from late evening to early morning. The daytime user activities show stable patterns. After that, they decrease in the first half of the night and ramp up at dawn. The lines of daily activity counts are similar to each other, except for holidays. The red line for New Year's Eve is much higher than on common days at midnight. The total number of activities differs between weekdays and weekends. Figure 4b shows that, on average, the amount of user activity on weekends is about 15% lower than on weekdays. The activity counts over time give the first indication of the resource demand on the communication system. There should be four typical periods: day and night on weekdays and weekends. The total amount of user activity indicates the scale of the system.
Calling response time. The response time of each activity was examined, and the mean and maximum values within each hour were computed, as presented in Figure 4c. The red lines are the average response time, which shows a relatively quicker response during the daytime and a slower one at night, due to the power and frequency control of the systems. As for the maximum response time, marked as blue lines, some extremely bad values were found every day, which indicates the need for performance improvement. The average response times are relatively stable at around 150 ms, while bad cases still exist every day (Figure 4d). The finding shows that the current telecommunication system does have a power and frequency control policy; however, this also causes a longer average response time at night and some extremely bad performance. It is essential to provide a control strategy with differentiated power and frequency control to guarantee a stable and reliable network service.
Traffic throughput. The traffic throughput in a mobile network contains control signals and content data. Figure 4e shows the content data throughput over the 24 h of one day. The curve is similar to that of user activities. Based on the throughput, two patterns were identified: first, the data transmissions of individual users differ from each other at different times; second, with respect to the overall picture of a system with a large number of users, the throughput in a mobile network can be simulated by scaling the throughput of activities. The remaining findings include the average size of data content and of upload and download signaling messages.
Application characteristics. The data from the Gn interface contain URL information. After approximate string matching of URLs, the hosts and domains of the URLs were obtained, and the payload, duration and number of users within each minute were analyzed in detail. Among dozens of applications matched with further subdomains, the properties of nine typical applications are presented in Figure 4f-h, and the corresponding data analyses can be used to build payload profiles. Since the user behavior model considers the processes in the mobile network, reasonable profiles for the payload sending stage were generated. The applications show different patterns in total data size (Figure 4f), duration (Figure 4g) and user groups (Figure 4h) over the 24 h of one day.
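The hourly activity counts and the hourly mean and maximum response times described above amount to a group-by over the record timestamps. The following Python sketch shows one way to compute them; the record layout (an ISO timestamp and a response time in milliseconds) and the sample values are hypothetical and much simpler than the 36-parameter records of the dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical simplified records: (ISO timestamp, response time in ms).
RECORDS = [
    ("2016-01-04T09:15:00", 140.0),  # Monday
    ("2016-01-04T09:45:00", 160.0),
    ("2016-01-04T23:30:00", 900.0),
    ("2016-01-09T09:05:00", 150.0),  # Saturday
]


def hourly_profile(records):
    """Group records by (day type, hour) and summarise each bucket.

    Returns a dict mapping ("weekday" | "weekend", hour) to the activity
    count and the mean and maximum response time in that hour.
    """
    buckets = defaultdict(list)
    for stamp, response_ms in records:
        t = datetime.fromisoformat(stamp)
        day_type = "weekend" if t.weekday() >= 5 else "weekday"
        buckets[(day_type, t.hour)].append(response_ms)
    return {
        key: {
            "count": len(values),
            "mean_ms": sum(values) / len(values),
            "max_ms": max(values),
        }
        for key, values in buckets.items()
    }


if __name__ == "__main__":
    for key, stats in sorted(hourly_profile(RECORDS).items()):
        print(key, stats)
```

Splitting the key into day type and hour mirrors the four typical periods noted above (day and night on weekdays and weekends).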

Spatial Distribution Figures
The spatial properties can be further explored in Figure 5a-d. Figure 5a shows the cumulative distribution function (CDF) of traffic within a week, demonstrating a rather stable pattern with little difference. Figure 5b provides the log distribution of response time in urban areas. Figure 5c,d are two heat maps of the spatial distribution of pedestrian density in two different situations. Situation 1 represents a sparse condition, while situation 2 indicates a crowded one. The user info, response time, and pedestrian temporal and spatial properties were all used as input data by a machine learning method for modeling, regression and prediction.
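A density heat map of this kind can be built by binning estimated user positions into grid cells and comparing the per-cell density with an alert threshold. The sketch below is only an assumption about how such a map could be produced: the cell size, the threshold and the sample positions are hypothetical, and positions are taken as planar coordinates in meters.

```python
from collections import Counter


def density_grid(positions, cell_size_m=100.0):
    """Count users per square grid cell; keys are (column, row) indices."""
    counts = Counter()
    for x, y in positions:
        counts[(int(x // cell_size_m), int(y // cell_size_m))] += 1
    return counts


def crowded_cells(counts, cell_size_m=100.0, threshold_per_m2=0.0005):
    """Return the cells whose density (users per square metre) exceeds the threshold."""
    area = cell_size_m**2
    return sorted(cell for cell, n in counts.items() if n / area > threshold_per_m2)


if __name__ == "__main__":
    # Hypothetical estimated user positions in metres.
    users = [(10.0, 10.0), (20.0, 30.0), (50.0, 90.0), (150.0, 20.0)]
    grid = density_grid(users)
    print(dict(grid))
    print(crowded_cells(grid, threshold_per_m2=0.0002))
```

The cell size would normally follow the accuracy of the localization level in use, for example a few hundred meters for cell ID matching and around 10 m for fingerprinting.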

Model Implementation and Case Study
The method proposed in this section depends on a supervised algorithm. The details of the improved algorithm and learning process were provided in previous work [28], which shows theoretically how the temporal correlation of RSS improves fingerprinting localization and evaluates how this correlation influences the reliability of location estimation.

Pedestrian Tracking Accuracy Improved by RSS Temporal Series
In this section, temporal series are utilized to improve the localization accuracy. Suppose that m RSS values were sampled from AP $i$. Then, an intermediate matrix $A$ can be constructed before computing the covariance matrix $\Sigma_i(\vec{r})$: where we assume that only D-dimensional temporal correlation is considered. The second subscript of each entry of the matrix denotes the jth measurement with respect to AP $i$. It is easy to find that the mean value $\mu_k$ of each vector $A_k^T$ can be calculated as: With Maximum Likelihood Estimation (MLE), the correlation matrix was estimated as: where k and j are equal to 1, 2, ..., D, respectively. Notice that we can approximate all $\Sigma_{k,k+\delta}$ with $\rho_\delta$, and then the covariance matrix $\Sigma$ has the following form: Algorithm 1 illustrates how to utilize temporal correlation for better location estimation, which is, in essence, a synthetic approach integrating the information of both the mean value and the temporal correlation of the RSS.

Input parameters:
The training data set for each location $\vec{r}$, $x_{i,j}$ (i = 1, ..., n; j = 1, ..., w);
The reported RSS sequence $t_{i,j}$ (i = 1, ..., a; j = 1, ..., b) from a user;
The indoor space L, the set of all identified locations recorded in the database;
The threshold Th, the critical value for choosing mean vectors.
(1) For each location i in L, calculate the mean vector $\mu_i = (\mu_1, \mu_2, \ldots, \mu_m)$ with Equation (2) and the correlation vector $\rho_i = (\rho_1, \rho_2, \ldots, \rho_m)$ as Equation (3) shows.
(2) For the reported data $t_{i,j}$ from a user, also calculate the target mean and correlation vectors, $\mu_t$ and $\rho_t$.
(3) Find the Euclidean distance between $\mu_t$ and every $\mu_i$. Find all vectors $\mu_k$ among the $\mu_i$ whose distance to $\mu_t$ is within Th in the sample space, i.e., $|\mu_k - \mu_t| \le Th$. The locations associated with those $\mu_k$ are denoted by the set $\{I_{kmin}\}$.
(4) Compare the Euclidean distance between $\rho_t$ and each $\rho_i$ in $\{I_{kmin}\}$, and find the vector nearest to $\rho_t$. The location with the nearest correlation vector in the correlation sample space is where the user is localized.
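The two-stage matching in steps (1)-(4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [28]: each location is assumed to carry one RSS time series per AP, the per-AP lag-1 sample autocorrelation stands in for the correlation vector, and the fallback to the nearest mean vector when no candidate lies within the threshold is an added assumption. The names `features`, `localize` and `lag1_autocorr` are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def lag1_autocorr(xs):
    """Lag-1 sample autocorrelation (assumed stand-in for the correlation entry)."""
    mu = mean(xs)
    var = sum((x - mu) ** 2 for x in xs)
    if var == 0:
        return 0.0  # constant series: no temporal structure to exploit
    cov = sum((xs[k] - mu) * (xs[k + 1] - mu) for k in range(len(xs) - 1))
    return cov / var


def features(rss_by_ap):
    """Steps (1)/(2): mean vector and correlation vector, one entry per AP."""
    return ([mean(s) for s in rss_by_ap], [lag1_autocorr(s) for s in rss_by_ap])


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def localize(training, reported, th):
    """training: {location: [series per AP]}; reported: [series per AP]."""
    mu_t, rho_t = features(reported)
    fp = {loc: features(series) for loc, series in training.items()}
    # Step (3): candidate locations whose mean vector is within th of the target.
    candidates = [loc for loc, (mu, _) in fp.items() if euclid(mu, mu_t) <= th]
    if not candidates:
        # Assumed fallback: plain nearest-mean matching.
        return min(fp, key=lambda loc: euclid(fp[loc][0], mu_t))
    # Step (4): among the candidates, pick the nearest correlation vector.
    return min(candidates, key=lambda loc: euclid(fp[loc][1], rho_t))
```

With two locations whose mean RSS is identical, the mean comparison alone cannot separate them, whereas the correlation comparison in step (4) can; this is the case the second stage is meant to handle.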
The basic idea of the algorithm is to first find a list of candidate locations of the user by comparing mean values, as in most of the literature, and then to find the most likely location by comparing temporal correlations. The experimental results are illustrated in Figure 6. The radius was set to 30 cm, 60 cm, 120 cm and 180 cm, respectively. In each case, the threshold H was increased from 0 to 5. Note that the unit of the threshold is not important, as the normalized distance was considered in the sample space. As shown in Figure 6, the localization reliability first increases and then decreases as the threshold increases in all scenarios.
If the threshold is 0, the user's location is essentially estimated using the mean of the RSS values, and the temporal correlation information is not utilized. As the threshold increases, the system can cross-check the candidate locations and find the best match; therefore, the reliability is improved. When the threshold is large enough, more candidate locations appear on the list. Since these locations are picked according to their corresponding fingerprints in the sample space, they may be far away from each other in the physical space, and their observed temporal correlation information is unable to effectively tell one location from another. The greater the number of candidate locations, the higher the probability that a location is far from the APs, where the temporal correlation becomes indistinguishable. This is why the reliability becomes worse when the threshold becomes rather large. The experimental data were further analyzed, and Figure 7 shows that the reliability of the fingerprinting localization could be improved by up to 13% when the radius and threshold were chosen as 0.3 and 1, respectively.

Pedestrian Prediction and Case Study
In large cities, large crowds form unpredictably in certain areas and at particular moments, so accurate prediction is almost impossible with traditional methods. However, it is possible to identify and pre-control the crowd with mobile phone signals. With reliable models and simulations, the flow can be regulated and conflicts prevented; case studies can be conducted on subway lines, at hot spots, during peak travel times, at major sports and cultural activities, and in entertainment areas such as Disneyland, where a large crowd flow often exists. When there is a large passenger flow, the density can be shown on mobile phones or apps, which helps individuals and governments to make judgments and plans in advance.
Figure 7a-c shows the heat maps of pedestrian flow at a CBD (Shanghai Expo Site) from 7 a.m. to 9 a.m. The heat maps provide the spatial characteristics, and the three maps at different times clearly show the temporal evolution. As the random pedestrian process can be modeled as a Gaussian Process (GP), the data were modeled both temporally and spatially through the GP and regressed to a multivariate Gaussian distribution, as shown in Figure 8a,b. For the temporal Gaussian process, a special kernel function was used; for the spatial Gaussian process, a combination of Gaussian process models was used to make the prediction.
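The temporal Gaussian process regression mentioned above can be sketched with plain NumPy. The kernel actually used is not specified in the text, so a squared-exponential (RBF) kernel is assumed here purely for illustration; the hyperparameters `length`, `amp` and `noise`, the function names, and the synthetic hourly counts are likewise assumptions and not taken from the case study.

```python
import numpy as np


def rbf(a, b, length=1.0, amp=1.0):
    """Squared-exponential kernel between two 1-D arrays of time stamps."""
    d = a[:, None] - b[None, :]
    return amp**2 * np.exp(-0.5 * (d / length) ** 2)


def gp_predict(t_train, y_train, t_test, length=1.0, amp=1.0, noise=0.1):
    """GP posterior mean and approximate 95% bounds at t_test."""
    y_mean = y_train.mean()  # constant mean, estimated from the data
    K = rbf(t_train, t_train, length, amp) + noise**2 * np.eye(len(t_train))
    Ks = rbf(t_train, t_test, length, amp)
    Kss = rbf(t_test, t_test, length, amp)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = y_mean + Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(Kss) - np.sum(v**2, axis=0), 0.0, None)
    std = np.sqrt(var + noise**2)  # include observation noise in the bounds
    return mean, mean - 1.96 * std, mean + 1.96 * std


# Synthetic hourly counts (illustrative only), with hours 7 and 8 missing.
hours = np.array([0, 1, 2, 3, 4, 5, 6, 9, 10, 11], dtype=float)
counts = 50 + 30 * np.sin(hours / 3.0)
mean, lower, upper = gp_predict(
    hours, counts, np.array([7.0, 8.0]), length=2.0, amp=30.0, noise=1.0
)
```

Evaluating the posterior at time stamps with no observation, as done for hours 7 and 8 here, is also how such a model can fill in missing samples; the lower and upper arrays give a band around the predicted mean.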
Finally, the prediction result is presented in Figure 8c. The red line is the predicted upper bound and the blue line is the lower bound. The gray area is the real data with added variance. By the end of 2015, China Telecom had 197.9 million users, a penetration rate of roughly 15%. Taking the crowd density threshold as 5/m² when scheduling early warnings for over-crowded areas, and scaling it by this penetration rate, the alarm threshold can be set at an observed outdoor pedestrian density baseline of 0.75/m², based on the data resource [16]. The real data with added variance lie inside the predicted bounds, and the variation trends are similar, which supports the accuracy of the proposed method. Moreover, the peak values above the alarm value can be readily identified, providing early warning.
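The relation between the two density figures can be made explicit with a short sketch. It assumes, as one reading of the numbers above, that the observed baseline is the physical crowd density limit multiplied by the operator's penetration rate (5/m² × 15% = 0.75/m²); the function names and the sample densities are hypothetical.

```python
def observed_alarm_threshold(true_density_limit, penetration_rate):
    """Scale a physical density limit (persons per square meter) to the density
    visible through one operator's users (assumed proportional scaling)."""
    return true_density_limit * penetration_rate


def alarm_times(times, predicted_density, threshold):
    """Return the time stamps whose predicted observed density reaches the threshold."""
    return [t for t, d in zip(times, predicted_density) if d >= threshold]


# 5 persons per square meter at roughly 15% penetration.
threshold = observed_alarm_threshold(5.0, 0.15)
# Illustrative predicted densities only, not data from the case study.
alerts = alarm_times(["07:00", "08:00", "09:00"], [0.40, 0.80, 0.60], threshold)
```

Under this reading, an early warning would be raised for any predicted time slot whose observed density reaches the scaled threshold.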
The modeling and regression can also recover certain missing values, which is helpful for pedestrian detection systems, since real-time data are usually incomplete. To alert the crowd in advance, overcrowded times have to be predicted; these predictions were compared between the Gaussian process model and machine learning methods.

Conclusion and Future Works
In this paper, a synthetic approach based on the analysis of both network processing procedures and users' behavior was proposed. A practical user behavior model was then constructed by analyzing actual data from a cellular operator, China Telecom. With this model, a matrix-mapping-based dynamic resource allocation mechanism was proposed for the virtualized core, integrating the information of both network processing procedures and users' applications. To demonstrate the effectiveness of the approach, a testbed was developed to perform resource allocation over a simplified UMTS core network. Various experiments were carried out on the testbed using application and signaling records of millions of users. The results indicate that the proposed approach significantly increases the utility and capacity of the mobile core network. Future work may explore in more detail whether the bandwidth resource allocation could be differentiated, as well as the influence of the spatial properties of users on resource allocation. Furthermore, the usage rate of users varies widely across situations and times (Figure 5a), which may be considered when constructing the models for higher accuracy. Finally, multi-scenario analysis of the algorithm was not carried out in this study because of data scarcity. In the future, additional datasets may be obtained to test the wider applicability of the proposed algorithm.

Figure 1. Two phases of fingerprinting localization: (a) offline training phase of the fingerprinting localization system; and (b) online localization phase of the fingerprinting localization system.

Figure 5. Spatial properties of CDR data: (a) traffic CDF; (b) response by area; (c) heat map for a sparse condition (situation 1); (d) heat map for a crowded condition (situation 2).

Figure 6. Reliability with different threshold H and error tolerance radius.

Table 1. Sample data of detailed calling records from the Gn interface in a 3G cellular network.