Unlike existing approaches, which rely heavily on clustering and therefore depend on the choice of clustering parameters, our method employs a Bayesian network to estimate the most probable POI from a user’s location history. The Bayesian network models and computes the joint probability that a specific location is a POI, taking into account the day of the week, time of day, category of the place, frequency of visits, and duration of each stay. With its ability to integrate a priori knowledge and manage uncertainty, the Bayesian network is well suited to this task. It considers how these factors interact and how they jointly influence the probability of a location being a POI. For instance, it can recognize that certain categories of places are more likely to be POIs at specific times of the day or on certain days of the week, and that more frequent visits or longer stays indicate a POI with higher probability. The approach starts by constructing a Bayesian model structured around the causal and conditional relationships between these variables. Then, the network is fed historical location data to learn the conditional probabilities that underpin the model. Once trained, the Bayesian network can make real-time inferences to predict new POIs from incoming observations. Our method thus identifies not only the most visited locations but also the context of these visits, which provides a richer understanding of users’ habits and preferences. Because the approach is probabilistic, it accommodates uncertainty and variation in user behavior when predicting POIs. In what follows, we recall the basic concepts of Bayesian networks before detailing how we build ours.
4.5.1. Overview of Bayesian Networks
A Bayesian Network (BN) is defined over a directed acyclic graph (DAG), that is, a directed graph in which directed cycles are not allowed. The random variables are represented as vertices, and edges between nodes capture dependence or causal relations. In addition, BNs can include special types of nodes known as decision nodes and utility nodes, which are primarily used in influence diagrams, an extension of BNs for decision-making processes. Decision nodes represent choices available to a decision maker, while utility nodes quantify the desirability of outcomes, allowing different decision scenarios to be evaluated. Let $X = \{X_1, \ldots, X_n\}$ be a set of variables represented as nodes in the network, which may include both chance nodes (random variables) and decision nodes. The edges between these nodes denote directional influence, making node $X_i$ a parent of $X_j$ if there is a direct edge from $X_i$ to $X_j$. Utility nodes do not directly influence other nodes but are influenced by combinations of chance and decision nodes to represent the utility associated with different outcomes. The acyclic structure plays a crucial role in simplifying the joint probability distribution of the variables involved. Each node $X_i$ within such a network is conditionally independent of its non-descendants when the values of its parent nodes $Pa(X_i)$ are given. This conditional independence is foundational for defining the probability of observing any particular state of $X_i$, which is calculated as follows:
$$P(X_i \mid Pa(X_i))$$
where $Pa(X_i)$ denotes the parents of $X_i$. Building on this principle, the joint probability distribution for all the variables in the network is the product of these individual conditional probabilities, which is formally expressed as follows:
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
BNs are adept at handling a variety of variable types, including discrete, continuous, or a mixture of both, which allows them to be applied in diverse fields. Discrete variables in these networks are defined with a finite number of states, and their relationships are quantified using Conditional Probability Tables (CPTs), which detail the probabilities of one state occurring given the states of parent variables. Continuous variables, on the other hand, require a parametric form or a piecewise representation to define their conditional distributions accurately, reflecting the complexities involved in modeling real-world processes that change over a continuum. A primary application of BNs is belief propagation, which is a powerful method for updating the marginal probabilities of the variables based on new observations or evidence. This process calculates the marginal probability of a variable
$X_i$, given evidence $e$, using the following formula:
$$P(X_i \mid e) = \frac{P(X_i, e)}{P(e)}$$
where $P(e)$ is the probability of the evidence. Belief propagation leverages the network’s modular structure and the conditional independencies it represents to efficiently compute these probabilities, facilitating rapid updates to beliefs as new data become available.
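As a minimal sketch of this formula, the snippet below computes a posterior by direct enumeration on a hypothetical two-node network (place category → visit frequency). The variable names and all probability values are illustrative assumptions and are not learned from any dataset.

```python
# Hypothetical two-node network: category -> frequency (illustrative numbers only).
p_category = {"restaurant": 0.3, "office": 0.7}
p_freq_given_category = {
    "restaurant": {"low": 0.8, "high": 0.2},
    "office": {"low": 0.1, "high": 0.9},
}


def posterior_category(freq_evidence):
    """Return P(category | frequency = freq_evidence) by enumeration."""
    # Joint P(category, e), using the factorization P(category) * P(e | category).
    joint = {
        c: p_category[c] * p_freq_given_category[c][freq_evidence]
        for c in p_category
    }
    p_evidence = sum(joint.values())  # P(e), the normalizing constant
    return {c: joint[c] / p_evidence for c in joint}


posterior = posterior_category("high")
```

Enumeration is only practical for very small networks; belief propagation reaches the same posterior while exploiting the graph structure.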
4.5.2. Building Our Bayesian Network
The structure of our Bayesian network, depicted in Figure 5, is designed to encapsulate and analyze user movement nuances from historical location data. It incorporates five variables: Day Index, Time Index, Category of Place, Visit Frequency, and Stay Duration.
The first two variables are crucial for revealing patterns in the frequency and timing of visits. They allow the network to discern daily and hourly trends that affect where and when visits occur. The Day Index represents the day of the week and is configured at one of three levels of distinction (1, 2, or 3): individual days, weekday/weekend/holiday groups, or no day distinction. The number of possible values varies with the level (seven for individual days and three for grouped days). The Time Index represents the time of day and partitions the day into equal intervals, so that a specific time such as 08:31 falls into a predetermined slot, for example the fifth interval (08:00–10:00) when each interval spans two hours, i.e., index 4 when counting from zero. The Day Index and the Time Index are recorded in the vector $W$, where they correspond to $td$ and $ts$, respectively. The variable Category of Place classifies locations into types such as restaurants, parks, or offices, evaluating their potential as POIs. The Visit Frequency measures how often a user returns to a particular location, providing insights into user preferences and routines, while Stay Duration gauges the length of each visit, offering clues about the site’s appeal or significance.
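The two indices can be sketched as follows. This is an illustration under stated assumptions: the function names, the zero-based numbering, the encoding of grouped days (0 = weekday, 1 = weekend, 2 = holiday), and the `holidays` parameter are all hypothetical choices, not taken from our implementation.

```python
from datetime import date, datetime


def time_index(ts, interval_hours=2):
    """Zero-based index of the equal-width interval that contains ts."""
    if interval_hours <= 0 or 24 % interval_hours != 0:
        raise ValueError("interval_hours must be a positive divisor of 24")
    minutes_since_midnight = ts.hour * 60 + ts.minute
    return minutes_since_midnight // (interval_hours * 60)


def day_index(ts, level=1, holidays=frozenset()):
    """Day index at one of three levels of distinction (assumed encoding)."""
    if level == 1:  # individual days: Monday = 0 ... Sunday = 6
        return ts.weekday()
    if level == 2:  # grouped days: 0 = weekday, 1 = weekend, 2 = holiday
        if ts.date() in holidays:
            return 2
        return 1 if ts.weekday() >= 5 else 0
    if level == 3:  # no day distinction: a single value
        return 0
    raise ValueError("level must be 1, 2, or 3")


# 08:31 with two-hour intervals lies in 08:00-10:00, i.e., zero-based index 4.
slot = time_index(datetime(2024, 1, 1, 8, 31))
```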
To categorize visit frequencies and average stay durations within a user’s historical location graph $G$, we utilize Algorithm 3. We specify the number of categories $k$, such as three for low, medium, and high. In line 1, the input graph $G$ is defined. In line 2, the number of categories $k$ is set. The algorithm initializes dictionaries $F$ and $S$ to store visit frequencies and stay durations for each vertex (line 3). It iterates through each vertex $v$ in the graph’s set of vertices (line 4) and through each edge $e$ that has $v$ as its source node (line 5), retrieving the edge weights $W$ (line 6). For each weight tuple $w \in W$, the visit frequency $fs$ and stay duration $st$ are added to the dictionaries $F$ and $S$ for the vertex $v$ (lines 7–9). After processing all vertices and edges, the algorithm computes quantiles for visit frequencies $Q_f$ and stay durations $Q_s$ using the function computeQuantiles (lines 13–14). It then iterates through each vertex $v$ again (line 15) and each edge $e$ where $v$ is the source node (line 16). The weights for each edge are retrieved again (line 17). For each weight tuple, the algorithm determines the quantile category for the visit frequency and stay duration using getQuantileCategory (lines 19–20). Finally, the weight tuples are updated with these categories (line 21). The result is a categorized representation of locations based on visit frequencies and stay durations.
Algorithm 3. Categorization of locations according to visit frequencies and stay durations in G.
1: Input: $G$, user’s mobility graph
2: Input: $k$, number of categories (e.g., 3 for low, medium, high)
3: Initialize dictionaries $F$ and $S$ to store frequency and stay duration for each vertex
4: for each vertex $v$ of $G$ do
5:   for each edge $e$ of $G$ where $e$ is connected to $v$ as the source node do ▹ For each edge connected to the vertex
6:     Retrieve $W$ for edge $e$
7:     for each $w \in W$ do
8:       Append $fs$ to $F[v]$ ▹ Add visit frequency
9:       Append $st$ to $S[v]$ ▹ Add stay duration
10:     end for
11:   end for
12: end for
13: $Q_f \leftarrow$ computeQuantiles($F$, $k$) ▹ Compute quantiles for frequency
14: $Q_s \leftarrow$ computeQuantiles($S$, $k$) ▹ Compute quantiles for stay duration
15: for each vertex $v$ of $G$ do
16:   for each edge $e$ of $G$ where $e$ is connected to $v$ as the source node do
17:     Retrieve $W$ for edge $e$
18:     for each $w \in W$ do
19:       Set the frequency category to getQuantileCategory($fs$, $Q_f$) ▹ Determine the position of the frequency $fs$ relative to the quantiles $Q_f$
20:       Set the stay-duration category to getQuantileCategory($st$, $Q_s$) ▹ Determine the position of the stay duration $st$ relative to the quantiles $Q_s$
21:       Update $w$ with the frequency category and the stay-duration category
22:     end for
23:   end for
24: end for
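The categorization step can be sketched in Python as shown below. This is a minimal illustration and not our implementation: the graph is assumed to be a dictionary mapping `(source, target)` pairs to lists of weight dictionaries with keys `fs` and `st`, quantiles are assumed to be computed over the values pooled across all vertices, and the output keys `fs_cat` and `st_cat` are hypothetical names.

```python
from bisect import bisect_right
from statistics import quantiles


def compute_quantiles(values_by_vertex, k):
    """Return the k - 1 cut points that split the pooled values into k groups."""
    pooled = [x for values in values_by_vertex.values() for x in values]
    return quantiles(pooled, n=k)


def get_quantile_category(value, cuts):
    """Return a category in 0 .. k-1 (0 = lowest) from the value's position."""
    return bisect_right(cuts, value)


def categorize(edges, k=3):
    """Annotate every weight tuple with frequency and stay-duration categories."""
    F, S = {}, {}
    # First pass: collect frequencies and stay durations per source vertex.
    for (src, _dst), weights in edges.items():
        for w in weights:
            F.setdefault(src, []).append(w["fs"])
            S.setdefault(src, []).append(w["st"])
    q_f, q_s = compute_quantiles(F, k), compute_quantiles(S, k)
    # Second pass: replace raw values by their quantile category.
    for weights in edges.values():
        for w in weights:
            w["fs_cat"] = get_quantile_category(w["fs"], q_f)
            w["st_cat"] = get_quantile_category(w["st"], q_s)
    return edges


edges = {
    ("home", "work"): [{"fs": 1, "st": 10}, {"fs": 2, "st": 20}, {"fs": 3, "st": 30}],
    ("work", "gym"): [{"fs": 4, "st": 40}, {"fs": 5, "st": 50}, {"fs": 6, "st": 60}],
}
categorized = categorize(edges, k=3)
```

With $k = 3$, categories 0, 1, and 2 play the roles of low, medium, and high.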
These variables are interlinked within the network, where the Day Index and Time Index influence the Category of Place, which in turn affects both the Visit Frequency and the Stay Duration. This causality is represented in the network’s directed acyclic graph, which illustrates the directional influence of one variable over another, thereby structuring clear dynamics and dependencies within the model.
After defining this network structure, the next critical step involves configuring the network using historical data, which includes collecting comprehensive datasets and learning the CPTs for each node, except for the root nodes. These CPTs quantify how the states of parent nodes influence the states of their child nodes. The decision-making process within the network is facilitated by special nodes such as the Utility Node, which calculates the utility value based on the outputs from the
Visit Frequency and Stay Duration. This utility reflects the desirability or value derived from visiting a specific category of place at certain times and frequencies, effectively guiding the decision on whether a location qualifies as a POI or not. The decision node (POI) evaluates whether the calculated utility exceeds a predefined threshold, indicating significant importance or interest, to decide whether to consider a location as a point of interest. The joint probability distribution of the network is given by the following:
$$P(D, T, C, V, S) = P(D)\, P(T \mid D)\, P(C \mid D, T)\, P(V \mid D, T, C)\, P(S \mid D, T, C, V)$$
where $P(D)$ is the probability of the day; $P(T \mid D)$ is the conditional probability of the time given the day; $P(C \mid D, T)$ is the conditional probability of the category of place given the day and time; $P(V \mid D, T, C)$ is the conditional probability of the visit frequency given the day, time, and category of place; and $P(S \mid D, T, C, V)$ is the conditional probability of the stay duration given the day, time, category of place, and visit frequency.
Using the conditional independence relations encoded in the network structure, we make the following assumptions:
The day of the week $D$ and the time $T$ are independent of each other: $P(T \mid D) = P(T)$.
The visit frequency $V$ depends only on the category of place $C$: $P(V \mid D, T, C) = P(V \mid C)$.
The stay duration $S$, given the category of place, is independent of $D$, $T$, and $V$: $P(S \mid D, T, C, V) = P(S \mid C)$.
Thus, the joint probability for this model, given the conditional independence assumptions, is as follows:
$$P(D, T, C, V, S) = P(D)\, P(T)\, P(C \mid D, T)\, P(V \mid C)\, P(S \mid C)$$
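To make the simplified factorization concrete, the sketch below evaluates it on a toy network. All state names and probability values are invented for illustration. The final sum checks that the five factors define a proper distribution.

```python
from itertools import product

# Illustrative CPTs (assumed values, not learned from data).
p_d = {"weekday": 0.7, "weekend": 0.3}
p_t = {"morning": 0.5, "evening": 0.5}
p_c_given_dt = {
    ("weekday", "morning"): {"office": 0.8, "restaurant": 0.2},
    ("weekday", "evening"): {"office": 0.3, "restaurant": 0.7},
    ("weekend", "morning"): {"office": 0.1, "restaurant": 0.9},
    ("weekend", "evening"): {"office": 0.05, "restaurant": 0.95},
}
p_v_given_c = {
    "office": {"low": 0.2, "high": 0.8},
    "restaurant": {"low": 0.6, "high": 0.4},
}
p_s_given_c = {
    "office": {"short": 0.1, "long": 0.9},
    "restaurant": {"short": 0.7, "long": 0.3},
}


def joint(d, t, c, v, s):
    """P(D, T, C, V, S) = P(D) P(T) P(C | D, T) P(V | C) P(S | C)."""
    return (p_d[d] * p_t[t] * p_c_given_dt[(d, t)][c]
            * p_v_given_c[c][v] * p_s_given_c[c][s])


# Summing over every assignment should give 1 for a valid factorization.
total = sum(
    joint(d, t, c, v, s)
    for d, t, c, v, s in product(
        p_d, p_t, ("office", "restaurant"), ("low", "high"), ("short", "long")
    )
)
example = joint("weekday", "morning", "office", "high", "long")
```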
Algorithm 4 describes the construction of a Bayesian network from historical data. It takes as input a user’s mobility graph $G$ and produces a Bayesian network with the constructed CPTs. In the first step, it calculates the basic probabilities for the day $D$ and time $T$ indices. It initializes counts for these indices and iterates through the nodes and edges of the graph $G$. For each edge connected to a source node, it retrieves the movement data $W$ and increments the counts for the day and time indices based on the extracted values. In the second step, the algorithm calculates the conditional probabilities of each place category $C$ given the day and time indices. This involves aggregating the occurrences of each category and normalizing them to obtain the required probabilities. It also computes the conditional probabilities of the variables $V$ and $S$ given $C$. In the third step, the algorithm constructs the CPTs for each node using the previously calculated probabilities: $P(D)$, $P(T)$, $P(C \mid D, T)$, $P(V \mid C)$, and $P(S \mid C)$. This construction organizes the probabilities into a tabular format that represents the conditional dependencies between variables. Finally, it integrates the calculated CPTs into the corresponding nodes of the Bayesian network and returns the completed network, so that each node is equipped with the probabilistic information needed to perform inference.
Algorithm 4. Construction of a Bayesian Network from Historical Data.
1: Input: $G$ - user’s mobility graph
2: Output: Bayesian network with constructed CPTs
3: Step 1: Calculation of Basic Probabilities ▹ Calculate the basic probabilities for day and time indices
4: Initialize counts for day index $D$ and time index $T$
5: for each vertex $v$ of $G$ do
6:   for each edge $e$ of $G$ where $e$ is connected to $v$ as the source node do
7:     Retrieve $W$ for edge $e$
8:     for each $w \in W$ do
9:       Increment count of $td$ and store in the day-index counts
10:       Increment count of $ts$ and store in the time-index counts
11:     end for
12:   end for
13: end for
14: $P(D) \leftarrow$ normalized day-index counts; $P(T) \leftarrow$ normalized time-index counts ▹ Compute the probabilities for day and time indices
15: Step 2: Calculation of Conditional Probabilities
16: $P(C \mid D, T) \leftarrow$ normalized counts of $C$ for each pair $(D, T)$ ▹ The probability of C given D and T
17: $P(V \mid C) \leftarrow$ normalized counts of $V$ for each $C$ ▹ Compute the probability of V given C
18: $P(S \mid C) \leftarrow$ normalized counts of $S$ for each $C$ ▹ Compute the probability of S given C
19: Step 3: Construction of the CPTs
20: Construct the CPTs for each node:
21: $P(D)$, $P(T)$, $P(C \mid D, T)$, $P(V \mid C)$, $P(S \mid C)$ ▹ Construct the CPTs for the nodes in the Bayesian Network
22: Step 4: Integration of the CPTs into the Network
23: Integrate the calculated CPTs into the corresponding nodes of the network
24: Return the Bayesian network
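The counting and normalization steps can be sketched as follows. This is a simplified illustration under stated assumptions: the observations are taken to be already flattened into `(d, t, c, v, s)` tuples rather than read from the graph, the estimates are plain relative frequencies without smoothing, and the function names and dictionary keys are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {state: n / total for state, n in counter.items()}


def learn_cpts(observations):
    """Estimate P(D), P(T), P(C|D,T), P(V|C), P(S|C) from (d, t, c, v, s) tuples."""
    count_d, count_t = Counter(), Counter()
    count_c = defaultdict(Counter)  # keyed by (d, t)
    count_v = defaultdict(Counter)  # keyed by c
    count_s = defaultdict(Counter)  # keyed by c
    for d, t, c, v, s in observations:
        count_d[d] += 1
        count_t[t] += 1
        count_c[(d, t)][c] += 1
        count_v[c][v] += 1
        count_s[c][s] += 1
    return {
        "P(D)": normalize(count_d),
        "P(T)": normalize(count_t),
        "P(C|D,T)": {key: normalize(cnt) for key, cnt in count_c.items()},
        "P(V|C)": {key: normalize(cnt) for key, cnt in count_v.items()},
        "P(S|C)": {key: normalize(cnt) for key, cnt in count_s.items()},
    }


observations = [
    ("mon", 4, "office", "high", "long"),
    ("mon", 4, "office", "high", "short"),
    ("mon", 4, "restaurant", "low", "short"),
    ("sat", 6, "restaurant", "low", "short"),
]
cpts = learn_cpts(observations)
```

Relative frequencies assign probability zero to combinations never observed; a smoothing prior would be one way to avoid this, but that choice is outside what the text specifies.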
4.5.3. POI Extraction Using the Bayesian Network
For the extraction of the POI, we have developed and implemented an algorithm based on a Bayesian network. This algorithm utilizes a probabilistic model to analyze and evaluate locations based on various attributes derived from users’ location histories. The goal is to determine whether a specific location can be classified as a POI, which is essential for constructing our mobility graph.
Algorithm 5 extracts and evaluates the POIs using decision analysis and returns a reduced graph. The process begins by iterating through each vertex $v$ in graph $G$. For each vertex, it retrieves the location metadata, specifically the “category”. It then examines each edge $e$ connected to $v$ as the source node and retrieves the weight $W$ for the edge. For each tuple $w$ in $W$, the algorithm sets evidence in the Bayesian network with the elements of $w$. Using the Bayesian network, it identifies the posterior probabilities for the category, $V$, and $S$, and computes the utility $U$ based on the Bayesian network evidence and utilities. If the computed utility $U$ meets or exceeds the threshold $t$, the vertex $v$ is selected for the reduced graph. Subsequently, for each edge connected to a selected vertex $v$, if the target node is also selected and the edge is not already in the reduced graph, the algorithm assigns a weight to the edge. If the edge has contextual metadata, it associates these metadata with the edge. The edge is then added to the reduced graph. The algorithm ensures that the weight of a relation is updated if it already exists. Finally, the reduced graph is returned.
Algorithm 5. Reduced Graph Generation and Extraction of Points of Interest.
1: Input: $G$ - user’s mobility graph
2: $t$ - utility threshold
3: Bayesian network (with decision node for POI)
4: Output: Reduced graph
5: Initialize an empty list to store selected vertices
6: for each $v$ in $G$ do
7:   Retrieve the category of $v$ using RetrieveLocationMetadataByName
8:   for each edge $e$ where $e$ is connected to $v$ as the source node do
9:     Retrieve $W$ for edge $e$
10:     for each $w$ in $W$ do
11:       Set evidence in the Bayesian network with the elements of $w$
12:       Identify posterior probabilities for category, $V$, and $S$ using the Bayesian network
13:       Compute utility $U$ based on evidence and utilities
14:       if $U \geq t$ then ▹ Check if the computed utility meets or exceeds the threshold
15:         Add $v$ to the list of selected vertices
16:       end if
17:     end for
18:   end for
19: end for ▹ Construct the reduced graph by verifying relations between the selected vertices
20: for each $v$ in the list of selected vertices do
21:   Add $v$ to the reduced graph
22:   for each edge $e$ where $e$ is connected to $v$ do
23:     if target node of $e$ is in the list of selected vertices and $e$ is not in the reduced graph then
24:       Assign weight to $e$
25:       if $e$ has contextual metadata then
26:         Associate $e$ with its contextual metadata
27:       end if
28:       Add $e$ to the reduced graph
29:     end if
30:   end for
31: end for
32: return the reduced graph ▹ Return the reduced graph
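The second phase of the algorithm, which builds the reduced graph from the selected vertices, can be sketched as below. The data layout is an assumption made for illustration: edges are a dictionary keyed by `(source, target)`, the reduced graph is a plain dictionary, and contextual metadata handling is omitted.

```python
def reduce_graph(edges, selected):
    """Keep the selected vertices and the edges whose two endpoints are selected."""
    selected = set(selected)
    reduced = {"vertices": set(selected), "edges": {}}
    for (src, dst), weight in edges.items():
        if src in selected and dst in selected:
            # Keying by (src, dst) means an existing relation has its weight updated.
            reduced["edges"][(src, dst)] = weight
    return reduced


mobility_edges = {
    ("home", "work"): 12,
    ("work", "gym"): 3,
    ("gym", "home"): 3,
    ("work", "kiosk"): 1,
}
reduced = reduce_graph(mobility_edges, selected=["home", "work", "gym"])
```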
The utility function used to determine the desirability of a location as a POI is defined as follows:
$$U = \log P(C \mid D, T) + \log P(V \mid C) + \log P(S \mid C)$$
This utility function is built from the logarithms of the conditional probabilities, which compresses the range of probability values and avoids the extremely small values that result from multiplying small probabilities directly. By summing the logarithms of the probabilities, we effectively combine the influences of the Category of Place, Visit Frequency, and Stay Duration, capturing their combined impact on the decision-making process.
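A minimal sketch of a log-sum utility and the threshold test is given below. It assumes the utility is the unweighted sum of the three log-probabilities; the clamping constant `eps`, which guards against taking the logarithm of zero, and the function names are illustrative additions rather than part of the method.

```python
import math


def utility(p_category, p_frequency, p_stay, eps=1e-12):
    """Sum of log-probabilities; eps (an assumed guard) avoids log(0)."""
    return sum(math.log(max(p, eps)) for p in (p_category, p_frequency, p_stay))


def is_poi(p_category, p_frequency, p_stay, threshold):
    """A location qualifies as a POI when its utility meets the threshold."""
    return utility(p_category, p_frequency, p_stay) >= threshold


u = utility(0.8, 0.8, 0.9)
```

Because each probability is at most 1, a utility of this form is never positive, so a meaningful threshold $t$ is zero or negative; values closer to zero correspond to stricter selection.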