Moving Point Density Estimation Algorithm Based on a Generated Bayesian Prior

To improve decision making, real-time population density must be known. However, calculating the point density of a huge dataset in real time is impractical in terms of processing time. Accordingly, a fast algorithm for estimating the distribution of the density of moving points is proposed. The algorithm, which is based on variational Bayesian estimation, takes a parametric approach to speed up the estimation process. Although the parametric approach has a drawback, that is the processes to be carried out on the server are very slow, the proposed algorithm overcomes the drawback by using the result of an estimation of an adjacent past density distribution.


Introduction
Population density, namely the number of people per unit area, is one of the most effective factors in decision making involved in tasks, like infrastructure management, city development planning and merchandising [1].For example, police officers can optimize their patrols in a city on the basis of high-population density locations, because accidents often occur at those locations in cities.A dataset of people's positions, consisting of "moving points", as the example in Figure 1a, can be used to calculate the population density.Figure 1a was drawn by plotting the results of questionnaires about people's journeys (namely, data pairs consisting of origination and destination pairs of the travels) during a whole day on a base map [2].The drawing thus represents people's positions at the moment in the past.In contrast, the people's positions measured in real time through mobile phones with GPS-like positioning functions [3,4] reflect people's movements.In that case, the people's positions are thus recorded in real time as moving points.Such massive data about people's positions should be commonly used, because they should be taken into account for many decisions, such as devising patrol plans for police officers.The OGC (Open Geospatial Consortium [5]), which is an international standard organization concerned with geospatial information systems, publishes standard specifications applicable to such data.A standard for encoding such massive moving-point data is defined as "OGC Moving Features encoding" [6], which is defined for exchanging massive amounts of moving-point data easily.As mentioned above, real-time moving-point data are useful for many applications; especially, the distribution function of the density of moving points (hereafter "point density"), expressed as a function of position (x, y) and time t, is the most effective feature for describing the tendency of the moving points.However, handling a massive point dataset itself is not so easy, even if it is exchangeable.Therefore, a conversion from massive moving-point data to easily handleable data is needed.
"Coverage data," which describes the point density, are one of the most easily handleable kinds of data.The coverage can be encoded by various encoding standards, like OGC NetCDF [7] and GML [8]. Figure 1b shows a population density distribution visualized from the points shown in (a).Such an intelligible image colored according to density, often called a "heatmap," can be generated automatically (Figure 1b is generated by a quadratic kernel density estimator shown in Section 2.) After visualization, the heatmaps can be encoded by general image formats, like PNG and JPG.For distributing such images, an OGC WMS (Web Map Service) [9] interface is commonly used.Accordingly, a conversion from point data to a coverage would contribute to system interoperability.
Counting the number of the points is the most naive technique for estimating the density of stationary points.Namely, a space in which points exist is separated into small cells, and the number of points in each cell is counted.Such a method is called a histogram method [10].However, the histogram method has several problems.One of the problems is the trade-off between the size of the cell ("cell size") and the number of data.Cell size can be adjusted in the case that the hierarchical cell structure, like a "quadtree" [11], is adapted.It should be smaller if the point density is required to be finer; however, statistical dispersion of the estimation results becomes higher if the cells are small, because smaller cells have fewer points.To estimate a finer point density, a huge dataset is therefore required.For that reason, several conventional algorithms for accurately estimating the point density, even with small cells, have been proposed.The reciprocal of the cell size is called "resolution"; that is, the resolution of a point density using small cells is "high." Another problem with histograms arises in practical data-providing systems.An example of a client-server system using point density is shown in Figure 2. The data source periodically provides a real-time point dataset to the analysis server.The server estimates the point density and sends it to user-client systems A and B. The user clients use the point density by associating it with other information, such as temperature distribution and map illustrations.In this case, the server is basically a "high-spec" computer; in contrast, the user clients are generally "low-spec" computers.For such systems, OGC standards are available.Namely, OGC WMS [9] is a service interface for distributing map images, and OGC WCS (Web Coverage Service) [12] is a service interface for distributing coverage data.They provide web APIs to send the point density data.WMS-request data sent from user-client systems include the area boundary of the required map and the map-image size, which determine the image resolution.The resolution required by user-client A might be different from that required by user-client B; besides, the required resolution dynamically changes according to the user requirements.The point density must thus be dynamically calculated.The computational power of the user client system is generally weaker than that of the server.Furthermore, the point dataset itself may not be distributed, because the amount of data provided by the server should be small enough to be sent via a network with low throughput.Many calculations of point density with various resolutions should therefore be carried out on the server in real time.Accordingly, a fast on-line algorithm for calculating point densities is needed.
To satisfy the requirements mentioned above, in this work, a "point-density-estimation algorithm" for handling a moving-point dataset (described with OGC Moving Features) is proposed.The proposed algorithm, which is based on Bayesian estimation, can estimate point density (distributed with OGC WMS or WCS) by using only a partial dataset; consequently, it works faster than conventional algorithms.
These trajectory analyses have been developed for extracting information from stored trajectories, not points.Basically, the density of moving points does not depend on the movement of points; that is, the density can be calculated by using only a "snapshot" of the moving points.With the proposed algorithm, however, the density estimation is accelerated by using the "adjacent past density distribution", which is based on the similarity between the current point density and the most recent one.
Kernel-density estimators [24,25] are popular methods for estimating a point density.The density calculated by a kernel-density estimator is the summation of distributions around the points.The Gaussian kernel-density estimator, which is the most general type, estimates the point densities as a summation of Gaussians.The density P (x) is given as: where N points are denoted as {x n } and h corresponds to the radius of each Gaussian.A quadratic kernel-density estimator, which estimates points densities as a summation of quadratic functions as: is often used.A quadratic kernel estimator works faster than a Gaussian kernel estimator.On the other hand, the accuracy of a quadratic kernel estimator is generally less than that of a Gaussian kernel estimator.Accordingly, a quadratic kernel estimator is often applied to visualize density in the form of a heatmap.Several kernel-density estimators for summarizing the tendency of moving points have been proposed; as examples, the dual kernel-density estimator [26] is designed to show differences between two different point densities; and the directional kernel-density estimator [27] visualizes the tendency of the movement itself.These kernel-density estimators use all points to estimate the point density.Thus, the calculation of P (x) in the density estimation becomes more complex in proportion to the number of points.Consequently, it is difficult to carry out the calculation in real time.This kind of estimator is called "nonparametric", because the parameters used in the model are not determined by the estimation process.Another approach, called "parametric", is based on a specific functional formula.The parameters of the functional formula are optimized to make them consistent with the dataset.The calculation time taken by the parametric approach is proportional to the complexity of the function.Thus, even if the number of points is large, the density estimation does not take a long time.However, optimization of the parameters takes a very long time.For example, a standard variational Bayesian estimation, which is based on a Gaussian mixture model (GMM), is often applied as a probability distribution of the point dataset.In the case of a variational Bayesian estimation, the distribution modeling is a convergence that makes parameters consistent with all point data; that is, the calculations are iterated.Although the distribution modeling can be carried out by the server, the processing time is too long for real-time calculation.

Contributions
Even though the proposed algorithm takes a parametric approach, its parameter optimization is very fast, because the point density is generated by updating the adjacent past point density.Measurement by GPS-like positioning functions is generally periodic, so the proposed algorithm handles discrete time (a period of which equals a time discretization unit).Indices of the discrete times are denoted by integers Under the assumption that the point density at time t is expressed as the sum of Gaussian components, the point density at t + 1 is considered as the sum of the Gaussian components similar to the Gaussian components at time t.The difference between the point density at t and that at t + 1 is predictable even if the point dataset is partially used.
The proposed algorithm is based on the update of the Gaussian components by using a partial point dataset.As the mechanism for updating the Gaussian components, Bayesian estimation is applied.In the case of Bayesian estimation, a probability-distribution function is assumed a priori and revised by using the point dataset.Bayesian estimation thus seems applicable for updating the probability-distribution function; however, to apply Bayesian estimation, a trick to determine parameters used by the assumed probability-distribution function is needed.In the following section, the trick named "responsibility reduction" is described.

Variational Bayesian Estimation
Variational Bayesian estimation is often applied to estimate a multivariate probability-distribution function.An argument of the multivariate probability-distribution function is replaced by a pair of coordinate values of a point as (x, y).The N -points density, which is obtained from point probability distribution P (x, y), is calculated as N P (x, y).
A probability distribution of parameters used in the probability-distribution function is defined in the variational Bayesian estimation.The probability-distribution function of the parameter set is called the "prior probability-distribution function" ("prior").The prior and additional observed points ("observation") derive the estimated "posterior probability distribution" ("posterior") in the variational Bayesian estimation.The probability-distribution function of the GMM is denoted by: where x is a vector consisting of an observation's coordinate values (i.e., (x, y)), N (x|Λ, µ) is a two-dimensional Gaussian function with a correlation matrix denoted by Λ k and an average denoted by µ k , and π k is the mixing ratio of the k-th component to the sum of all components.The prior for {π k }, {µ k } and {Λ k } is then defined as: Here, α 0 , W 0 , ν 0 and β 0 are parameters, called "hyperparameters," of the prior.W(•|W, ν) is the Wishart distribution with deviation W and degrees of freedom ν, and Dir(•|α) is the Dirichlet distribution with parameter α.The prior is conjugate to the Gaussian mixture; that is, the posterior formula updated by observations is the same as that of the prior.Estimation of the posterior is therefore implemented as the update of hyperparameters α 0 , β 0 , m 0 and W 0 by using observations {x n } as: N k , x k and S k are respectively calculated as: r n,k , "responsibility", is calculated as: where ψ(•) is a digamma function.According to Equation ( 9), r n,k is interpreted as the contribution coming from the n-th observation to each component labeled with k.
The calculation of the responsibility requires the hyperparameters; however, the estimation of the hyperparameters also requires the responsibility.Accordingly, to optimize the hyperparameters, the above-described processes are iterated; first, the hyperparameters are randomly initialized; second, the responsibilities are calculated with the initial hyperparameters; third, the hyperparameters are updated with the calculated responsibility.After iterating the calculation, the optimized hyperparameters are obtained.

KL Divergence as a Criterion for Convergence
In the variational Bayesian estimation, the initial hyperparameters α 0 , β 0 , m 0 and W 0 are given a priori.In other words, the initial hyperparameters, which specify the prior, can be determined to make the convergence of the estimation results earlier.If the prior is set to a similar function as the posterior, most of the processes of the point density estimation are unnecessary.Accordingly, most of the point dataset has little effect on the point density; that is, most of the point dataset can be ignored.Moreover, the prior is accurately predictable as the adjacent past posterior, which was estimated at the previous iteration, because the point density of moving points does not change so much in a short period.The prior should therefore be set to the same function as the adjacent past posterior in order to stop the point density estimation early.
Kullback-Leibler (KL) divergence, which behaves like the distance between distribution functions, is applicable as a criterion for judging whether the calculation can be stopped or not.
Figure 3 illustrates such a judgment process.The point dataset is separated into small blocks in advance, where the block size is a constant parameter during the process.Note that the block size should be large enough to detect statistical significance.In Step 1, the prior is generated from the adjacent past posterior.In Step 2, to estimate the posterior, one of the small blocks (labeled "1") is input into the prior.In Step 3, the next prior, which is generated by the estimated posterior in Step 2, is updated with the block labeled "2".In this step, KL divergence from the posterior generated in Step 2 to the posterior generated in Step 3 is calculated.If the KL divergence is small enough, additional blocks are judged unnecessary.If not, the posterior is similarly calculated, and so on.In the evaluation of KL-divergence, the number of the input data is controlled; that is, few data are used when the point density changes a little, but many data are used when the point density changes drastically.Step 3 Step 4 Figure 3. Serial intromission.
The KL divergence from p(x) to q(x) KL(p(x)||q(x)) is defined as: The proposed algorithm requires the KL-divergence from the prior to the posterior.That is, where {θ i } = {{π k }, {Λ k }, {µ k }}, and {x n } m is the m-th data block.If K is small enough, the estimation result is considered to be converged.p({θ i }|{x n } m ) can be replaced by q , because it is the result of the variational Bayesian estimation.K is thus transformed to: where is the set of hyperparameters updated by {x n } m+1 .Here, B(•) is defined as:

Responsibility Reduction
With the proposed method, a trick named "responsibility reduction" is applied to determine the hyperparameters.For example, as shown in Equation ( 5), β t,k = β 0 + N t,k , where N t,k is the sum of the responsibilities, which are contributed by the k-th component of the GMMat time t.Thus, N t,k corresponds to the number of points belonging to the k-th component, because responsibility r n,k indicates the i-th point's "belongingness" to the k-th component.Then, β t,k is updated with N t+1,k as β t+1,k = β 0 + N t,k + N t+1,k , where β t+1,k is a hyperparameter of a posterior at t + 1.Similarly, The hyperparameter at time t + n thus gets larger if n gets larger.Consequently, the contribution of the newest point dataset, N t+n,k , is relatively smaller than the contribution of the prior, β 0 , because the prior includes the sum of the all past responsibilities.However, the contribution of the newest point dataset should be greater than the others.The problem described in the preceding paragraph comes from confusion between point datasets.The new point density is different from the past point density; therefore, the past points and new points should not be equally used.Accordingly, the prior should be generated from the past posterior, but the prior should not be equal to the posterior.The effects of past points on the hyperparameters should thus be removed.If the effects from the hyperparameters are equal to the effects produced by the new points, the effects of points in the hyperparameters are kept constant.The dataset at time t is separated into M blocks containing enough points for inputting sequentially, where the number of points in the blocks are denoted by n . These blocks are input sequentially to update the posterior.When the j-th block is input, the ratio of the number of input points to the total number of points is calculated as . The reduction rate of responsibilities γ (j) (t) is then defined as: The sum of the past responsibility N t,k is multiplied by the reduction rate.By this process, the total number of points contributing to the hyperparameters is adjusted and kept constant.
The updating mechanism is figuratively drawn as an "evaporating pot" in Figure 4.In Step (a), the pot is filled by a liquid, labeled t.The content of the pot is then evaporated so that half of its capacity remains.Next, a new liquid, labeled t + 1, is poured into the pot.The volume of the poured liquid is half of the pot capacity.In step (b), the contents of the pot are again evaporated to half its capacity.The liquid labeled t and the liquid labeled t + 1 are reduced equally to half of their initial volumes.That is, the volumes of the liquids labeled t and t + 1 become 1/4 of the pot capacity.Then, repeatedly, a new liquid, labeled by t + 2, is poured into the pot, which is half full of the liquid.After repeating the process, as shown in Steps (c) and (d), the ingredients in the pot are a mixture of liquids labeled t, t + 1, • • • .The volume of liquid labeled t is therefore 1 2 t −t+1 V , where V is the pot capacity.Thus, the volume of liquid will be decreased exponentially.The liquid in Figure 4 corresponds to the sum of responsibilities; in other words, evaporation corresponds to multiplication to the sum of responsibilities by the reduction rate.Accordingly, the contribution from the old dataset is decreased exponentially by multiplying it by the reduction rate, while the sum of all responsibilities is kept constant.By the multiplication, the hyperparameters in the prior at time t + 1 given by the posterior at time t are derived as follows.
Here, N t,k , xt,k and S t,k at time t are given by Equation ( 9).The hyperparameters in the posterior at time t + 1 are given by: where: x (0) t,n and r (0) t,n,k are points in the n-th block and their responsibilities, respectively.Another block of the new point dataset is input if the KL-divergence between the posteriors is large.The prior is updated with the i-th block by the following formulas: To obtain the recursive formula, each side of the i + 1-th formula is subtracted by each side of the i-th formula as follows.
δW is defined as: In comparison to Equation ( 9), the formulas for N and x are natural; that is, the stack of the contributions from the past data is reduced and added to the current contribution.However, S seems similar to the other hyperparameters, except δW .δW represents the error caused by sequential estimation, so the effect of δW on estimation accuracy is quite small if N and n are large enough.Accordingly, in the case of the proposed algorithm, δW is ignored.The proposed algorithm is shown in Figure 5.A normal variational Bayesian estimation is carried out first.The estimated posterior at time t is used to generate the prior at the next time t + 1.Each dataset is separated into many blocks, and the blocks are sequentially input to update the prior generated by the last estimated posterior.The block intromission is stopped when the KL divergence between the prior and the posterior is small enough.When that condition is met, the estimation of the point density at time t is finished.After the estimation, time t is incremented into t + 1, and the estimation of the point density at time t + 1 is started.These processes are carried out iteratively to estimate the point density.The parameters used for this algorithm are hyperparameters used by the variational Bayesian estimation and the number of blocks.They are determined manually in the same manner as a general variational Bayesian estimation.Note that the number of the blocks should be much smaller than the size of the entire dataset in order to obtain blocks large enough for detecting statistical significance.Satisfying this condition is generally easy because the entire data size is extremely large.

Experimental Settings and Criteria
The proposed algorithm was evaluated experimentally.The features of the point dataset used for this evaluation are listed in Table 1.The point datasets, named "people flow data" [28], describe the positions of people during a day.The point datasets were generated from the result of questionnaires about people's travels during a day, namely data pairs of the origination and destination of the journeys.Interpolations between the origination and the destination were carried out in order to generate point datasets every minute.Therefore, the interpolation points at the same time were collected as one dataset.The point density from 0:00 a.m. to 12:00 a.m. was estimated by the proposed algorithm and by two conventional kernel density estimators.The points, which are expressed as longitude and latitude coordinates, were converted to an orthogonal coordinate system of which the origin is the centroid of points at 0:00 a.m..Moreover, the coordinate values of the points in the orthogonal coordinate system were scaled to adjust their maximal values to 10.0.The dataset was separated into two blocks.One block was used to estimate the point density; the other was used for evaluating the accuracy criteria.This is generally called "holdout validation".Five thousand points included in the second block were actually used, because it would take too much time to carry out the evaluation if all of the points in the evaluation block were used.
Two criteria were evaluated: accuracy and processing time.As the accuracy criterion, Ev based on logarithmic likelihood, defined as: was applied.Here, N is the number of points denoted by x n .Ev indicates the reproducibility of the point density of the entire dataset.As the processing-time criterion, the time for generating a 150-pixel × 100-pixel heat-map image was used; that is, the calculation was carried out 15,000 times.
To evaluate the criteria accurately, the measurement of Ev and the processing time was carried out three times, and the average of the results was used as the criterion.
As the parameters used by the proposed algorithm, the number of mixed Gaussian distributions was set to 200; the block size for sequential intromission was 1,500; the initial hyperparameters were α 0 = 2.1 and β 0 = 1.0; diagonal components of W 0 were set to 1.5; and non-diagonal components were set to 0.0 and ν 0 = 2.2.In addition, 200 points were randomly selected from the initial point dataset as the m 0 .These parameters were determined manually to obtain the highest Ev.
As conventional methods, the Gaussian kernel method and the quadratic kernel method were similarly evaluated.Standard deviation h was set to 0.2 for the Gaussian kernel method; radius h was set to 3.0 for the quadratic kernel method in order to avoid the P (x n ) = 0 (which gives Ev = −∞).
Incidentally, the program used for the evaluations was implemented with Java on a computer with an Intel Core i7-3980K 3.2 GHz and 32.0 GB RAM.

Results
The processing times for each method are plotted in Figure 6a.The people's positions at the time indicated by "dataset time" on the horizontal axis are included in the dataset used for the evaluation.Hereafter, "dataset" time is used in the same meaning.The processing time for the proposed algorithm was less than 1.0 s, that for the Gaussian kernel method around 250 s and that for the quadratic kernel method 80 s.The proposed algorithm is thus the fastest of the three methods.The accuracy criterion for each method is plotted in Figure 6b.The accuracy criterion for the proposed algorithm is close to that for the Gaussian-kernel density estimator, and that of the quadratic-kernel density estimator is lowest.These results indicate that the proposed algorithm is the fastest and accurate enough.Distributions obtained by the three methods are compared in Figure 7.The red regions in the images indicate high-density areas, the yellow regions medium density and the green regions sparse areas.Image (a), which represents the density distribution given by the proposed algorithm, is similar to that given by the Gaussian-kernel estimator, namely Image (b), but it is diffused a bit.Image (c), which shows the result of the quadratic kernel estimation, is diffused too much to describe the point density.This diffuseness is caused by large h; therefore, the image generated with the result of the quadratic kernel with h = 1.0 is similar to Images (a) and (b), as shown in Image (d).However, the accuracy criteria for the quadratic-kernel method with h = 1.0 is −∞; that is, the distribution is not accurate at all.Accordingly, the proposed algorithm generates the most unblurred accurate heatmap in these three algorithms.

Discussion
Sub-sampling is one of the most popular methods for speeding up density estimation processes.Accordingly, a sub-sampled dataset was also evaluated.Figure 8a plots a series of processing times after the point dataset was randomly sub-sampled to 2300 points for the Gaussian-kernel estimator (h = 3.0) and 17,000 points for the quadratic-kernel estimator.Both the Gaussian-kernel estimator and the quadratic-kernel estimator took 0.6 s under this condition.These times are comparable with that taken by the proposed algorithm; however, the accuracies of the two former estimators is lower, as shown in Figure 8b.It is thus concluded that the proposed algorithm maintains high accuracy even when the processing time was reduced.Heatmaps generated from sub-sampled datasets are shown in Figure 9a,b.The heatmap given by the quadratic-kernel estimator (Figure 9a) is very similar to the heatmap obtained without sub-sampling, which is shown in Figure 7c, because the sub-sampled dataset is large enough.However, the heatmap given by the Gaussian-kernel estimator (Figure 9b) is quite different from Figure 7b.This implies that the sub-sampled dataset used by the Gaussian-kernel estimator is too small to generate a heatmap.As shown in Figure 10, to compare heatmaps at 7:01 a.m., 7:02 a.m., 7:03 a.m. and 7:04 a.m., a generated heatmap depends on points chosen at random, that is the heatmap is changed randomly.Because the Gaussian-kernel estimator is slower than the quadratic-kernel estimator, fewer points were used.That means the quadratic-kernel estimator is more feasible for generating heatmaps quickly than the Gaussian-kernel estimator.However, as shown in Figure 8, the proposed algorithm can generate more accurate heatmaps than that generated by the quadratic kernel estimator.9b are comparable with these Ev.In addition, the proposed algorithm achieved Ev = −3.79,which is also comparable with Ev shown in Figure 9b.That implies that the evaluation setting like h and the number of point data are not significant for the accuracy evaluation.
According to the results of the experiment described above, the proposed algorithm achieved faster performance than two existing algorithms, namely a quadratic-kernel density estimator and a Gaussian-kernel density estimator.Moreover, by subsampling, the kernel density estimators can be accelerated to speeds comparable with that of the proposed algorithm; however, they show lower accuracy.Therefore, it is confirmed that the proposed algorithm has advantages over the existing kernel-density estimators.

Concluding Remarks
A method for estimating the density of moving points, based on a variational Bayesian estimation, was proposed.The prior in the estimation is generated from the posterior given by the previous estimation.Responsibilities used in the posterior formula should be reduced in order to generate the prior; therefore, a trick named "responsibility reduction" was introduced.Moreover, the processing time and the estimation accuracy of the proposed algorithm were experimentally evaluated.The proposed algorithm was faster than two conventional methods; moreover, the estimation accuracy of the proposed algorithm is comparable with those of the conventional methods.These results demonstrate that the proposed algorithm is fast without loss of accuracy.Therefore, it is concluded that a massive moving-point dataset, which can be encoded with OGC Moving Features, can be converted to a heatmap, which is distributable with OGC WMS and WCS, in real time.
As future work, the proposed algorithm will be extended to handle the non-GMM distributions.The responsibility reduction is applicable to other distributions.The the extended algorithm will thus be further developed to speed up the point density estimation.

Figure 2 .
Figure 2. A system using point density.

Figure 10 .
Figure 10.Visualized distributions by the Gaussian-kernel estimator after subsampling.