Article

An Efficient and Effective Model for Preserving Privacy Data in Location-Based Graphs

by Surapon Riyana 1 and Nattapon Harnsamut 2,*
1 School of Renewable Energy, Maejo University, Chiang Mai 50290, Thailand
2 School of Information and Communication Technology, University of Phayao, Phayao 56000, Thailand
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(10), 1772; https://doi.org/10.3390/sym17101772
Submission received: 25 August 2025 / Revised: 10 October 2025 / Accepted: 14 October 2025 / Published: 21 October 2025

Abstract

Location-based services (LBSs), which are used for navigation, tracking, and mapping across digital devices and social platforms, establish a user's position and deliver tailored experiences. Collecting and sharing such trajectory datasets with analysts for business purposes raises critical privacy concerns, as both the symmetry of recurring mobility patterns and the asymmetry of irregular movements to sensitive locations collectively expose highly identifiable information, resulting in re-identification risks, trajectory disclosure, and location inference. In response, several privacy preservation models have been proposed, including k-anonymity, l-diversity, t-closeness, LKC-privacy, differential privacy, and location-based approaches. However, these models still exhibit privacy issues, including sensitive location inference (e.g., hospitals, pawnshops, prisons, safe houses), disclosure from duplicate trajectories revealing sensitive places, and the re-identification of unique locations such as homes, condominiums, and offices. Efforts to address these issues often lead to utility loss and computational complexity. To overcome these limitations, we propose a new $(\xi, \epsilon)$-privacy model that combines data generalization and suppression with sliding windows and R-Tree structures, where sliding windows partition large trajectory graphs into simplified subgraphs, R-Trees provide hierarchical indexing for spatial generalization, and suppression removes highly identifiable locations. The model addresses both symmetry and asymmetry in mobility patterns by balancing generalization and suppression to protect privacy while maintaining data utility. Symmetry-driven mechanisms that enhance resistance to inference attacks and support data confidentiality, integrity, and availability address core requirements of cryptography and information security. An experimental evaluation on the City80k and Metro100k datasets confirms that the $(\xi, \epsilon)$-privacy model addresses privacy issues with reduced utility loss and efficient scalability, while validating robustness through relative error across query types in diverse analytical scenarios. The findings provide evidence of the model's practicality for large-scale location data, confirming its relevance to secure computation, data protection, and information security applications.

1. Introduction

Global positioning systems (GPSs) [1,2] are central to location-based applications. They are generally used for getting from one location to another, tracking (i.e., monitoring objects or personal movements), mapping (i.e., creating world maps), and timing (i.e., making it possible to take precise time measurements). They achieve these objectives using global navigation satellite system (GNSS) technology, e.g., the United States Global Positioning System (USA GPS) [3], Russia's Global Navigation Satellite System (GLONASS) [4], China's BeiDou Navigation Satellite System (BDS) [5,6], and so on. They can generally be separated into three groups according to the characteristics of the services they provide, i.e., personal, commercial, and military GPS.
Examples of the use of GPS technologies in personal and commercial real-life applications are maps, trackers, transportation, and delivery services. Well-known map applications include Google Maps [7,8], Google Earth [9,10], and OpenStreetMap [11,12]; well-known tracker applications include portable GPS trackers [13,14,15], Find My iPhone [16], and Android's Find My Device [17]; well-known transportation applications include logistics [18,19] and express services [20,21,22]; and well-known delivery service applications include DoorDash [23,27], Uber Eats [24], Zomato [25], Deliveroo [26], and FoodPanda [28]. Aside from the applications mentioned above, GPS technologies collect data about a user's visited locations. Such data is called the trajectory dataset (or sometimes the location-based dataset) [29,30,31,32]. It is generally used to show the history of locations that users visit. However, some trajectory datasets are made accessible to data analysts for business reasons, such as improving marketing and service strategies, analyzing human behaviors and traffic, or providing valuable insights for related applications and urban planning. In such data analysis situations, as the authors of [33,34,35,36,37,38] demonstrate, there are concerns regarding privacy violation issues. An example of a privacy violation issue in trajectory datasets is explained in Example 1.
Example 1
(A privacy violation issue in trajectory datasets). We suppose that Table 1 is the specified trajectory dataset provided to the data analyst. We assume that the adversary receives Table 1 and knows that one of its trajectory paths (the sequences of users' visited locations) is the sequence of Bob's visited locations. Moreover, we suppose that the adversary knows that Bob visited $a_2$ and $e_5$, and that they want to reveal Bob's diagnosis using Table 1. In this situation, the adversary can conclude that Bob's diagnosis is HIV, because only $t_1$ matches the adversary's background knowledge about Bob.
With Example 1, we can conclude that a unique subsequence of a user's visited locations in a trajectory dataset may raise concerns about privacy violation issues. To address these, LKC-privacy [33] and its extended models [39,40,41,42,43] have been proposed. They assume that the adversary's background knowledge about the target user's visited locations is limited to subsequences of at most L locations, enforced via data suppression [44].
That is, a trajectory dataset does not raise any concerns in terms of privacy violation issues when every unique L-size subsequence of the users' visited locations is suppressed until it is shared by at least K indistinguishable paths and, for every sensitive value related to each such indistinguishable L-size subsequence, the confidence of data re-identification is at most C.
An example of privacy preservation in trajectory datasets with LKC-privacy is explained in Example 2.
Example 2
(Privacy preservation with LKC-privacy). We suppose that Table 1 is the specified trajectory dataset provided to the analyst. For privacy preservation, let the values of L and K be 2 and the value of C be 0.70. Within these privacy constraints, every unique 2-size subsequence of the users' visited locations in Table 1 is suppressed until it matches at least two indistinguishable subsequences, and the confidence of re-identifying any diagnosis is reduced to at most 0.70. Therefore, the resulting version of Table 1, shown in Table 2, does not raise any concerns about privacy violation issues: the confidence of data re-identification for every diagnosis, considering each 2-size subsequence of the users' visited locations, is at most 0.70 (or 70%).
In addition, we can see that Table 2 is more secure in terms of privacy preservation than the original Table 1, but it loses some meaning in terms of data utilization; data utility and data privacy generally present a trade-off. However, we found that LKC-privacy has serious vulnerabilities that must be considered when it is used to address privacy violation issues in trajectory datasets. The vulnerabilities of LKC-privacy will be explained in Section 2.
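To make the roles of L, K, and C concrete, the following minimal Python sketch checks the constraint on a flat table of location sequences. The table layout, function name, and toy values are illustrative assumptions, not the data structures used later in this paper.

from itertools import combinations
from collections import defaultdict

def satisfies_lkc(paths, sensitive, L, K, C):
    """Check LKC-privacy on a toy trajectory table.

    paths     : list of visited-location sequences, e.g. [["a2", "e5"], ...]
    sensitive : the sensitive value (e.g., diagnosis) of each path
    Every subsequence of at most L locations must match at least K paths,
    and the confidence of inferring any sensitive value from it must be
    at most C.  Returns (True, None) or (False, offending_subsequence)."""
    groups = defaultdict(list)
    for path, value in zip(paths, sensitive):
        seen = set()
        for n in range(1, L + 1):
            # combinations() preserves order, so these are subsequences
            for sub in combinations(path, n):
                if sub not in seen:
                    seen.add(sub)
                    groups[sub].append(value)
    for sub, values in groups.items():
        if len(values) < K:
            return False, sub                      # violates the K constraint
        if max(values.count(v) for v in set(values)) / len(values) > C:
            return False, sub                      # violates the C constraint
    return True, None

# Toy usage: the unique prefixes ("a2",) and ("d1",) match only one path
# each, so this table violates K = 2 and would need suppression.
ok, bad = satisfies_lkc([["a2", "b2", "c3"], ["d1", "b2", "c3"]],
                        ["HIV", "Flu"], L=2, K=2, C=0.7)   # ok == False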
This work is organized as follows. The motivation for this work is presented in Section 2. Then, this work’s model and notation are presented in Section 3. Subsequently, the experimental results are discussed in Section 4 before, finally, conclusions and future research directions are discussed in Section 5 and Section 6, respectively.

2. Motivation

When the data holder allows the analyst to access datasets, privacy violation is a serious issue. To address this, several privacy preservation models have been proposed, such as k-anonymity [45], l-diversity [46], t-closeness [47], and their extended models, as presented in [48,49,50,51,52,53,54]. For these models, the idea of privacy preservation is as follows.
  • Dataset attributes are separated into explicit identifier, quasi-identifier, and sensitive attribute(s).
  • All values in every explicit identifier attribute must be removed.
  • The re-identifiable quasi-identifier values are suppressed or generalized by their less specific values to be indistinguishable.
  • In addition, some privacy preservation models (e.g., l-diversity and t-closeness) further consider the characteristics of sensitive values in terms of their privacy preservation constraints.
Although these preservation models can be used to address privacy violation issues when datasets are released, they still have several issues that must be addressed, e.g., data utility issues and high data transformation complexity. To address the vulnerabilities of these models, the differential privacy model [55] has been proposed. This privacy preservation model is based on a data query framework in conjunction with data noise and the re-identification probability. That is, the data holder does not allow the analyst to utilize the dataset directly. The dataset can be utilized by the analyst only via the data query framework, such that the query result is dictated by the data re-identification probability: if an arbitrary query result does not accord with the given data re-identification probability, it is returned with appropriate noise. Recent studies have further enhanced the effectiveness of differential privacy in location-based services. For example, a recent study [56] proposed an efficient differential privacy-based clustering mechanism that adds Laplace noise to cluster centroids, aiming to balance data utility and privacy protection. In the same direction, the work in [57] developed a semantic-aware differential privacy framework that preserves both spatial coordinates and the semantic meaning of locations, thereby enhancing personalized location privacy.
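As an illustration of the noisy-query idea, the sketch below applies the standard Laplace mechanism to a count query; the function name and the example count are illustrative choices of ours, not part of the cited works.

import numpy as np

def private_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism: return the true count perturbed with noise of
    scale sensitivity / epsilon.  A counting query has sensitivity 1,
    since adding or removing one user changes the count by at most 1."""
    return true_count + np.random.laplace(0.0, sensitivity / epsilon)

# Smaller epsilon means stronger privacy but noisier answers:
# private_count(1289, epsilon=0.5) deviates more, on average,
# than private_count(1289, epsilon=1.5).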
Unfortunately, the privacy preservation models mentioned above are often insufficient for addressing privacy violation issues in trajectory (or location-based) datasets. This is because they were proposed for datasets in which each quasi-identifier attribute is a different data domain, e.g., Sex, Age, Education, and Position, whereas the quasi-identifier of a trajectory dataset is the sequence of a user's visited locations, e.g., $\langle a_2\ b_2\ c_3\ x_4\ y_5 \rangle$. To remove this vulnerability, privacy preservation models for trajectory datasets have been proposed [33,34,35,36,37,38]. One of the most well-known of these is LKC-privacy [33], which uses three privacy parameters (i.e., L, K, and C) to limit privacy violation issues when the data holder allows the data analyst to access trajectory datasets; an example of privacy preservation in trajectory datasets with LKC-privacy was explained in Example 2. Although LKC-privacy can be used to address such issues, we found that it still has serious vulnerabilities that must be addressed, i.e., data utility issues, complexity, and data streaming issues. Moreover, LKC-privacy is only appropriate for addressing privacy violation issues in static trajectory datasets whose attributes are strictly separated into the user's visited location sequence and the related sensitive values. In addition, sensitive locations (e.g., specialized hospitals, pawnshops, prisons, and safe houses) are not considered in the privacy preservation constraints of LKC-privacy. For this reason, even trajectory datasets that satisfy LKC-privacy constraints still raise concerns about privacy violation issues in terms of sensitive location inference. An example of such privacy violation issues is shown in Example 3.
Example 3
(Privacy violation issues in terms of sensitive location inference). We suppose that Figure 1 is the specified location-based graph representing five sequences of the users' visited locations. Moreover, we assume that the location $x_4$ is a serious or sensitive location, i.e., a cancer (specialized) hospital. Therefore, if the adversary can establish that a user visited the location $x_4$, the adversary can infer that this user has symptoms of cancer.
Moreover, we found that LKC-privacy [33] and the trajectory privacy preservation models presented in [34,35,36,37,38,39,40,41,42,43] are also vulnerable to privacy violation issues when considering duplicate trajectory paths, as demonstrated in Example 4.
Example 4
(Privacy violation issues considering duplicate trajectory paths). We assume that John is the target user of the adversary. Moreover, we assume that the adversary knows that one of the sequences of locations shown in Figure 1 is that of John's visited locations, and further knows that John visited both locations $b_2$ and $c_3$. In this situation, the adversary finds two location subsequences (i.e., $\langle a_2\ b_2\ c_3\ x_4\ y_5 \rangle$ and $\langle d_1\ b_2\ c_3\ x_4 \rangle$) that match their background knowledge about John. However, both subsequences pass through the location $x_4$, so the adversary can still be certain that the user who visited the location $x_4$ is John. Therefore, we can conclude that the number of matching subsequences of the user's visited locations has no effect on the reliability of inferring who visited the location $x_4$.
Another vulnerability of these privacy preservation models is that they do not consider location types, such as identifier, relationship, and sensitive locations. Relationship locations are those visited between the initially specified location and the specific final location.
Sensitive locations are those that the user does not want other people to know they visited, because they can lead to privacy violations, such as prisons, specialized hospitals, pawnshops, or safe houses. Starting locations are often (unique) private locations of users, such as their homes, condominiums, or offices; they can be used as explicit identifier values to re-identify the data owner. An example of privacy violation issues considering unique locations is demonstrated in Example 5.
Example 5
(Privacy violation issues considering unique locations). We suppose that the adversary knows that Bob's family lives at location $a_1$. Moreover, we assume that location $x_4$ is a pawnshop, that location $s_3$ is an elementary school, and that Bob lives with his 5-year-old daughter. In this situation, the adversary can establish that one of Bob's family members goes to the pawnshop and another goes to the elementary school. Therefore, the adversary can infer that Bob's family has financial problems, and can confidently assume that it is Bob who goes to the pawnshop, because his daughter, as a child, is prohibited by law from doing so.
With Examples 3–5, we can conclude that LKC-privacy [33] and the trajectory privacy preservation models presented in [34,35,36,37,38,39,40,41,42,43] still present privacy violation issue vulnerabilities that must be addressed. To address these, we propose a new privacy preservation model, which we present in Section 3.
In brief, as shown in Table 3, classical models such as k-anonymity, l-diversity, and t-closeness were designed for relational data and therefore cannot address trajectory-specific risks, including sensitive location inference, duplicate trajectories, and unique location re-identification. These models follow a non-interactive paradigm in which an anonymized dataset is released. In contrast, differential privacy adopts an interactive black-box model, providing strong theoretical guarantees through noisy query responses controlled by the privacy budget $\varepsilon$, which quantitatively bounds the probability of distinguishing any individual's record within the output. However, differential privacy operates without releasing any dataset and provides no explicit mechanism to handle uniqueness or spatially correlated locations.
The proposed $(\xi, \epsilon)$-privacy framework is conceptually distinct from the $\varepsilon$ parameter used in differential privacy. In our model, $\xi$ represents the suppression depth and $\epsilon$ denotes the spatial generalization level. Together, these parameters form a dual privacy-preserving mechanism specifically tailored to location-based graph data, focusing on mitigating trajectory disclosure and location inference rather than noise-based perturbation. The proposed $(\xi, \epsilon)$-privacy framework follows the data-release paradigm and is designed to address all three trajectory scenarios, thereby enhancing privacy protection while maintaining scalability and data utility. We present the model in Section 3.

3. Model and Notation

3.1. The Graph of Users’ Visited Sequence Locations

In this section, we present the graph characteristics proposed to represent the users' visited sequence locations. Let $U = \{u_1, u_2, \ldots, u_n\}$ be the set of users, and let $T = \{t_1, t_2, \ldots, t_o\}$ be the set of all possible timestamps at which the users visited each location. Let $LOC = \{loc_1, loc_2, \ldots, loc_m\}$ be the set of all possible locations, and let $u_p[LOC] = \langle u_p[loc_\beta^{t_\gamma}] \cdots u_p[loc_\alpha^{t_\zeta}] \rangle$ be the sequence of locations visited by the user $u_p \in U$, i.e., the user $u_p$ visited the locations $loc_\beta, \ldots, loc_\alpha \in LOC$ at timestamps $t_\gamma, \ldots, t_\zeta \in T$, respectively. Let $G(V,E)$ be a directed graph proposed to represent the sequence of locations visited by every user $u_p \in U$. That is, $G(V)$ is the set of vertices, such that each vertex represents an element $u_p[loc_\alpha^{t_\zeta}]$ of $u_p[LOC]$ without user identifier data; every element $u_p[loc_\alpha^{t_\zeta}]$ of $u_p[LOC]$ in $G(V,E)$ is presented in the form $loc_\alpha^{t_\zeta}$. Let $G(E) = \{(loc_\beta^{t_\gamma}, loc_\alpha^{t_\zeta}) \mid loc_\beta^{t_\gamma}, loc_\alpha^{t_\zeta} \in V \ \mathrm{and}\ loc_\beta^{t_\gamma} \neq loc_\alpha^{t_\zeta}\}$ be the set of edges. Every vertex $loc_\alpha^{t_\zeta}$ that only has indegree(s) must satisfy $t_\gamma < t_\zeta$, where $t_\gamma$ is the timestamp of each connected indegree vertex of $loc_\alpha^{t_\zeta}$; every vertex $loc_\alpha^{t_\zeta}$ that only has outdegree(s) must satisfy $t_\gamma > t_\zeta$, where $t_\gamma$ is the timestamp of each connected outdegree vertex of $loc_\alpha^{t_\zeta}$. Every vertex $loc_\alpha^{t_\zeta}$ that has both in- and outdegrees must satisfy the following properties:
  • Let vertex $loc_\psi^{t_\varphi}$ connect to vertex $loc_\alpha^{t_\zeta}$.
  • Moreover, let vertex $loc_\alpha^{t_\zeta}$ connect to vertex $loc_\beta^{t_\gamma}$.
  • Then, the timestamps of the vertices $loc_\psi^{t_\varphi}$, $loc_\alpha^{t_\zeta}$, and $loc_\beta^{t_\gamma}$ must satisfy $t_\varphi < t_\zeta < t_\gamma$.
We found that each vertex of $G(V,E)$ generally has a different ability to re-identify the data owner. For this reason, we can assign each vertex to an appropriate level by considering its data re-identification ability when provided to the data analyst. Typically, the first vertex (the starting point) of each path in $G(V,E)$ has a better data re-identification ability than the other vertices in the path because it usually represents a private user location, e.g., a house or a condominium. Moreover, the endpoint of each path in $G(V,E)$ is often weaker than the other vertices in terms of data re-identification.
Definition 1
(The level of data re-identification). Let $loc_\beta^{t_\gamma}$ be a vertex without indegree and $loc_\alpha^{t_\zeta}$ be a vertex without outdegree. Let $loc_\beta^{t_\gamma} \rightarrow loc_\varphi^{t_{\gamma+1}} \rightarrow \cdots \rightarrow loc_\alpha^{t_\zeta}$ be a sequence of vertices in $G(V,E)$ from $loc_\beta^{t_\gamma}$ to $loc_\alpha^{t_\zeta}$. Let $d_G(loc_\beta^{t_\gamma}, loc_\beta^{t_\gamma})$, $d_G(loc_\beta^{t_\gamma}, loc_\varphi^{t_{\gamma+1}})$, …, $d_G(loc_\beta^{t_\gamma}, loc_\alpha^{t_\zeta})$ be the distances between $loc_\beta^{t_\gamma}$ and each vertex of the sequence, respectively. The levels of $loc_\beta^{t_\gamma}, loc_\varphi^{t_{\gamma+1}}, \ldots, loc_\alpha^{t_\zeta}$ are then denoted as $L_{d_G(loc_\beta^{t_\gamma}, loc_\beta^{t_\gamma})}$, $L_{d_G(loc_\beta^{t_\gamma}, loc_\varphi^{t_{\gamma+1}})}$, …, $L_{d_G(loc_\beta^{t_\gamma}, loc_\alpha^{t_\zeta})}$, respectively.
For example, let Figure 1 be the specified location-based graph of the users' visited locations. The level of data re-identification for each vertex of Figure 1 under Definition 1 is shown in Figure 2. That is, the vertices $loc_5^{t_1}$, $loc_1^{t_1}$, $loc_2^{t_1}$, and $loc_6^{t_1}$ do not have any indegree; thus, they are available at level 0. Moreover, the vertices $loc_3^{t_2}$, $loc_5^{t_2}$, and $loc_1^{t_2}$ are at distance 1 from their related vertices at level 0; thus, they are available at level 1. The vertices at distance 2 from their related level-0 vertices are $loc_2^{t_3}$, $loc_1^{t_3}$, and $loc_5^{t_3}$; thus, these vertices are available at level 2. Only the vertex $loc_4^{t_4}$ is at distance 3 from its related vertices at level 0; therefore, only this vertex is available at level 3. Finally, the remaining vertices (i.e., $loc_6^{t_5}$ and $loc_5^{t_5}$) are available at level 4.
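The leveling of Definition 1 can be computed with a breadth-first traversal from the vertices without indegree. The sketch below assumes the graph is given as an edge list, with vertex names such as "loc1.t1" chosen purely for illustration.

from collections import deque

def reidentification_levels(edges):
    """Assign each vertex its level from Definition 1: the distance from
    a starting vertex (one with no incoming edge)."""
    succ, indeg = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)
    # starting points (no indegree) sit at level 0
    level = {v: 0 for v, d in indeg.items() if d == 0}
    queue = deque(level)
    while queue:
        u = queue.popleft()
        for v in succ.get(u, []):
            if v not in level:            # shortest distance reached first
                level[v] = level[u] + 1
                queue.append(v)
    return level

# e.g., reidentification_levels([("loc1.t1", "loc3.t2"), ("loc3.t2", "loc2.t3")])
# -> {"loc1.t1": 0, "loc3.t2": 1, "loc2.t3": 2}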

3.2. The Type of Vertices

In this section, we describe the characteristics of vertices in $G(V,E)$ that are considered in the proposed privacy preservation constraint, i.e., sensitive and unique vertices. A sensitive vertex is one that represents a sensitive location (e.g., a specialized hospital, pawnshop, safe house, or prison) visited by the user(s). Such vertices may raise concerns about privacy violation issues when utilized outside the scope of the data-collecting organization. Thus, when $G(V,E)$ is provided to the data analyst, the data holder must ensure that every sensitive vertex is protected by an appropriate privacy preservation technique. An example of privacy violation issues in $G(V,E)$ considering sensitive vertices is explained in Example 6.
Example 6
(Privacy violation issues considering sensitive vertices). We assume that Jennifer has a cancer diagnosis, and that the history of Jennifer's visited locations is one of the sequences of users' visited locations in $G(V,E)$. We further assume that Jennifer is a target user for the adversary, who wants to disclose Jennifer's disease from $G(V,E)$. Moreover, the adversary knows that a location in the sequence of Jennifer's visited locations is a specialized hospital for treating cancer. In this situation, the adversary can infer that Jennifer has cancer-related health problems.
Another type of vertex is also considered in the proposed privacy preservation constraints: the unique vertex. This is an arbitrary vertex of $G(V,E)$ that represents the user's house, office, or another unique location. The adversary can use this vertex to identify the sequence of the target user's visited locations in $G(V,E)$. An example of privacy violation issues in $G(V,E)$ considering unique vertices is explained in Example 7.
Example 7
(Privacy violation issues from considering unique vertices). Let Emma be the target user of the adversary, and let Figure 2 be the $G(V,E)$ provided to the data analyst. We assume that the adversary strongly believes that the provided $G(V,E)$ contains the sequence of Emma's visited locations. Let location $loc_4^{t_4}$ be a pawnshop (a private or sensitive location). Moreover, we assume that the adversary knows that the location of Emma's house is $loc_1$. In this situation, the adversary can confirm that Emma goes to a pawnshop and, therefore, can infer that she has financial problems.
With Examples 6 and 7, we can conclude that sensitive and unique vertices can lead to privacy violation issues when $G(V,E)$ is provided to a data analyst.

3.3. Data Sliding Windows [58,59,60,61]

The location graph $G(V,E)$ is generally very large and complex; it thus often requires considerable execution time for data processing. However, to the best of our knowledge, it is usually processed (or utilized) from the newest to the oldest data, or according to a specified period of time. For this reason, we can use data sliding windows to increase the efficiency of data processing on the location graph $G(V,E)$, the idea being that the graph is separated into smaller parts.
Definition 2
(Data sliding windows). Let $G(V,E)$ be the specified location graph, and let $\tau_b$ and $\tau_e$ be the specified time bounds, such that $\tau_b$ is the initial time and $\tau_e$ is the end time, where $\tau_b < \tau_e$. Let $f_{DSW}(G(V,E), \tau_b, \tau_e) : G(V,E) \xrightarrow{\tau_b, \tau_e} SUB(G(V,E))_1, \ldots, SUB(G(V,E))_g$ be the data sliding window function proposed for sliding $G(V,E)$ into $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_g$. That is, $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_g$ are subgraphs of $G(V,E)$, i.e., $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_g \subseteq G(V,E)$, that only collect the vertices and edges of $G(V,E)$ whose timestamps lie between $\tau_b$ and $\tau_e$.
An example of utilizing $G(V,E)$ from the newest to the oldest data is a data holder creating a dynamic report on the sequences of users' visited locations over the last ten months in order to design appropriate tourist travel paths. An example of generating reports from $G(V,E)$ by specifying time periods is a data holder building a report on the frequency of users who visited each location between September 2024 and December 2024.
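A minimal sketch of the sliding window function $f_{DSW}$ follows, assuming each vertex carries an extractable timestamp; repeated calls with successive time bounds yield the sequence of subgraphs of Definition 2.

def f_dsw(edges, tau_b, tau_e, timestamp):
    """Data sliding window (Definition 2): keep only the edges whose two
    endpoints were both visited between tau_b and tau_e.  `timestamp(v)`
    extracts the visit time of a vertex; dropping out-of-window vertices
    naturally restricts G(V,E) to the requested subgraphs."""
    inside = lambda v: tau_b <= timestamp(v) <= tau_e
    return [(u, v) for u, v in edges if inside(u) and inside(v)]

# e.g., keep September-December 2024 only, with vertices named "loc.time":
# window = f_dsw(edges, 202409, 202412, lambda v: int(v.split(".")[1]))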

3.4. Location Hierarchy

3.4.1. Dynamic Location Hierarchy

R-Trees are tree data structures proposed to present and index multidimensional information [62,63], such as geographical coordinates, rectangles, and polygons. First proposed in 1984 by Antonin Guttman [64], they are often used in real-world map applications for storing spatial objects, such as restaurant locations and the polygons that typical maps are made of, such as streets, buildings, lakes, and coastlines. They can be used to answer specified queries, e.g., find all universities within one kilometer of the currently visited location, retrieve all road segments within one kilometer of the considered location, or find the nearest hospital. R-Trees can further accelerate nearest-neighbor search for various distance metrics, including great-circle distance. The key idea is that all objects of interest lie within a bounding rectangle; a query that does not intersect the bounding rectangle cannot intersect any of the contained objects.
Definition 3
(Non-overlapped R-Tree). Let $R = \{r_0, r_1, \ldots, r_s\}$ be all possible rectangles that form the boundaries of $G(V)$ in $G(V,E)$. Every $r_x$, where $0 \le x \le s$, includes two pieces of information: $LABEL(r_x)$, the label of $r_x$, and the set of the specified locations it bounds. Let the R-Tree be a tree data structure constructed from $R$ according to the following conditions.
  • The bounding rectangle that is not covered by any other is the root of the R-Tree.
  • The children of $r_x$ are all rectangles $r_y$ that are covered only by $r_x$ and not covered by any other rectangle.
  • The label of each vertex $r_x$ in the tree is represented by $LABEL(r_x)$.
An example of an R-Tree constructed from the location graph $G(V,E)$ is as follows. Let each blue circle in Figure 3a represent a location that was visited by the user(s). Let all rectangles (i.e., red, blue, brown, purple, and black) be the boundings of the specified locations, such that the red rectangle is the largest and the black rectangle is the smallest. The R-Tree version of these location boundings is shown in Figure 3b.
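The construction of Definition 3 can be sketched as follows for strictly nested rectangles: each rectangle becomes a child of the tightest rectangle that covers it. The Rect class and coordinate layout are illustrative assumptions.

class Rect:
    def __init__(self, label, x1, y1, x2, y2):
        self.label, self.box, self.children = label, (x1, y1, x2, y2), []

    def covers(self, other):
        """True if this rectangle fully contains `other` (and is not it)."""
        ax1, ay1, ax2, ay2 = self.box
        bx1, by1, bx2, by2 = other.box
        return (self is not other and ax1 <= bx1 and ay1 <= by1
                and ax2 >= bx2 and ay2 >= by2)

def build_rtree(rects):
    """Arrange strictly nested rectangles into the tree of Definition 3:
    each rectangle becomes a child of the tightest rectangle covering it,
    and uncovered rectangles become roots."""
    def area(r):
        x1, y1, x2, y2 = r.box
        return (x2 - x1) * (y2 - y1)
    for r in rects:
        parents = [p for p in rects if p.covers(r)]
        if parents:
            min(parents, key=area).children.append(r)
    return [r for r in rects if not any(p.covers(r) for p in rects)]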

3.4.2. Manual Location Hierarchy

A manual location hierarchy is another way of constructing the R-Tree of the location graph $G(V,E)$. It can be built by a location data expert; e.g., the R-Tree of locations can be derived from urban zoning, roads, or other data utilization applications.
All locations available at a lower level must be more specific than those available at a higher level. Alternatively, we can say that each most specific location is presented by a vertex at level 0 (a leaf vertex), while the least specific location is presented by the hierarchy root.
Definition 4
(Manual Location Hierarchy). Let $f_{MLH}(G(V)_l) : G(V)_l \rightarrow G(V)_{l+1}$ be a manual location function mapping the locations $G(V)$ from level $l$ to level $l+1$, such that all locations at level $l$ are more specific than those at level $l+1$. Moreover, the locations within every level do not overlap, i.e., $\bigcap_{v \in G(V)_l} v = \emptyset$, and together they cover all locations, i.e., $\bigcup_{v \in G(V)_l} v = G(V)$. With the manual location hierarchy function, the location hierarchy of $G(V)$ can be presented as a sequence from level 0 to level $l$: $G(V)_0 \xrightarrow{f_{MLH}(G(V)_0)} G(V)_1 \xrightarrow{f_{MLH}(G(V)_1)} \cdots \xrightarrow{f_{MLH}(G(V)_{l-2})} G(V)_{l-1} \xrightarrow{f_{MLH}(G(V)_{l-1})} G(V)_l$. Hereafter, we call this location hierarchy of $G(V)$ from level 0 to level $l$ $MLH_{G(V)}$.
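A manual location hierarchy can be represented as a list of per-level lookup tables. The toy hierarchy below (raw locations, blocks, districts) is an illustrative assumption, not taken from the datasets used later.

# A toy manual location hierarchy (Definition 4) designed by hand:
# level 0 (raw locations) -> level 1 (blocks) -> level 2 (districts).
F_MLH = [
    {"loc1": "blockA", "loc2": "blockA", "loc3": "blockB", "loc4": "blockB"},
    {"blockA": "district1", "blockB": "district1"},
]

def mlh_generalize(loc, level):
    """Walk a location up the hierarchy `level` steps (level 0 = raw data)."""
    for step in range(level):
        loc = F_MLH[step][loc]
    return loc

# mlh_generalize("loc1", 1) -> "blockA"; mlh_generalize("loc1", 2) -> "district1"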

3.5. Data Suppression

In this section, we propose a data suppression technique that can be used to eliminate the data re-identification ability in $G(V,E)$. As noted, the identifiability level of the vertices in $G(V,E)$ can be defined by their order within each path; an example of $G(V,E)$ vertex leveling is shown in Figure 2. That is, the level of each vertex can be defined from the first visited location (the starting point) to the last (the endpoint). This is because the first visited location of each sequence is generally far more identifiable than the others, while the last visited location often has the lowest identifiability. The first visited location is highly identifiable because it is generally unique and private; it is often the location of the user's house or office. For this reason, we can define the identifiable data level for every sequence of the user $u_p$'s visited locations in $G(V,E)$ from $u_p[loc_\beta^{t_\gamma}]$ to $u_p[loc_\alpha^{t_\zeta}]$ in order, i.e., $u_p[loc_\beta^{t_\gamma}] \cdots u_p[loc_\alpha^{t_\zeta}]$, where $loc_\beta^{t_\gamma}$ is the starting point and $loc_\alpha^{t_\zeta}$ is the endpoint. With this property of the users' visited locations, we can address privacy violation issues (or decrease the data re-identification ability) in $G(V,E)$: before $G(V,E)$ is provided to the data analyst, the users' data privacy is maintained by suppressing the unique vertices, referred to as $\xi$-Suppression.
Definition 5
($\xi$-Suppression). Let $G(V,E)$ be the specified graph of the users' visited sequence locations, such that its vertices are separated into $l$ levels. Let $L_0(G(V)), \ldots, L_l(G(V))$ represent the vertices available at levels $L_0$, …, and $L_l$, respectively. Let $\xi$ be a positive integer, the suppression constraint of $G(V,E)$. Let $SUB(G(V,E))_z$, where $1 \le z \le g$, be each specified subgraph of $G(V,E)$. Let $f_{Supp}(SUB(G(V,E))_z, \xi) : SUB(G(V,E))_z \xrightarrow{\xi} SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$ be the function for suppressing the unique vertices of $SUB(G(V,E))_z$, producing $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$. That is, $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$ are forest-graph versions of $SUB(G(V,E))_z$, such that $SUB(G(V))_1, \ldots, SUB(G(V))_q$ satisfy the following properties.
  • $SUB(G(V))_1 \cap \cdots \cap SUB(G(V))_q = \emptyset$.
  • $(L_0(SUB(G(V))) \cup \cdots \cup L_{\xi-1}(SUB(G(V)))) \cap (SUB(G(V))_1 \cup \cdots \cup SUB(G(V))_q) = \emptyset$, where $L_l(SUB(G(V)))$ is the set of vertices at level $l$ of $SUB(G(V))$, $0 \le l \le \xi-1$.
  • $(L_0(SUB(G(V))) \cup \cdots \cup L_{\xi-1}(SUB(G(V)))) \cup (SUB(G(V))_1 \cup \cdots \cup SUB(G(V))_q) = SUB(G(V))$.
For example, let Figure 2 be the specified $G(V,E)$. Setting the value of $\xi$ to 1, 2, 3, and 4 gives the results shown in Figure 4a, Figure 4b, Figure 4c, and Figure 4d, respectively.
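Following Definition 5, a minimal sketch of $\xi$-suppression over the edge-list representation used in the earlier sketches is as follows: every vertex at a level below $\xi$ is removed, which splits the graph into a forest.

def xi_suppression(edges, level, xi):
    """xi-Suppression (Definition 5): remove every vertex whose
    re-identification level is below xi, together with its edges.  What
    remains is the forest of subgraphs; `level` is a vertex-to-level map
    such as the output of reidentification_levels above."""
    return [(u, v) for u, v in edges if level[u] >= xi and level[v] >= xi]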

3.6. Data Generalization

Aside from the identifiable data level of vertices, the proposed privacy preservation model is further based on another major assumption about privacy violation issues: the privacy of the target user in $G(V,E)$ can be violated via a sensitive vertex (a sensitive location), even when more than one user has visited this vertex. Examples of privacy violation issues considering a specified sensitive vertex and duplicate paths are illustrated in Examples 3 and 4, respectively. To address these privacy violation issues, before $G(V,E)$ is released, the sensitive vertices are generalized by their less specific values to be indistinguishable. The less specific values of the sensitive vertices in $G(V,E)$ are presented by a non-overlapped R-Tree that satisfies Definition 3 or by a split-halves R-Tree that satisfies Definition 6.
Definition 6
(Split-Halves R-Tree). Let $W$ be the width and $H$ the height of the area in which the locations of $G(V,E)$ are available.
Let $R = \{r_0, r_1, \ldots, r_s\}$ be all possible bounding rectangles that can be constructed from $G(V,E)$. First, $r_0$ is the bounding rectangle of size $W \times H$, i.e., it covers all locations of $G(V,E)$. After that, each $r_{x-1}$ is divided in half into $r_x$ and $r_y$, where $1 \le x \le s$, $1 \le y \le s$, and $x \neq y$, alternating between the width and the height, until each rectangle covers only one location. Finally, all bounding rectangles $r_0, r_1, \ldots, r_s$ are presented in the form of a tree data structure, denoted $RT_{SH}$, such that each bounding rectangle is a vertex and the label of each vertex is the label of the bounding rectangle. That is, $r_0$ is the root of $RT_{SH}$. The children of every $r_x$ are the rectangles $r_y$ covered by $r_x$ but not covered by any other rectangle. Each leaf vertex of $RT_{SH}$ is a bounding rectangle that covers a single location. Let $L_0, \ldots, L_l$ be the possible levels of $RT_{SH}$, arranged according to the data specification: the root of $RT_{SH}$ is available at level $L_l$, and all leaf vertices are available at level $L_0$.
An example of creating the bounding rectangles of $G(V,E)$ with Definition 6 is shown in Figure 5. In Figure 5a, the first bounding rectangle is created such that it covers all locations available in $G(V,E)$. In Figure 5b–d, the location areas are divided by alternately considering the width and then the height, such that $R_1, R_2, \ldots, R_{15}$ are the labels (the names of the specified areas) of bounding rectangles 1 to 15, respectively.
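A compact recursive sketch of the split-halves construction follows; the labels are generated in depth-first order, so they need not match the numbering in Figure 5.

import itertools

def split_halves(box, depth, split_width=True, labels=None):
    """Split-halves R-Tree (Definition 6): start from the bounding
    rectangle of the whole W x H area and halve it recursively,
    alternating between the width and the height, for `depth` levels."""
    labels = labels if labels is not None else itertools.count(1)
    x1, y1, x2, y2 = box
    node = {"label": f"R{next(labels)}", "box": box, "children": []}
    if depth > 0:
        if split_width:
            mid = (x1 + x2) / 2
            halves = [(x1, y1, mid, y2), (mid, y1, x2, y2)]
        else:
            mid = (y1 + y2) / 2
            halves = [(x1, y1, x2, mid), (x1, mid, x2, y2)]
        node["children"] = [split_halves(h, depth - 1, not split_width, labels)
                            for h in halves]
    return node

# A 3-level tree over a 100 x 100 area: rt_sh = split_halves((0, 0, 100, 100), 3)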
Definition 7
($\epsilon$-Generalization). Let $\epsilon$ be a positive integer, the generalization constraint, in the range between 0 and $l$. Let $RT_{SH}$ or $MLH_{G(V)}$ be the data structure that presents the generalized values for each specified vertex $v$ in $G(V,E)$. Let $loc_\beta^{t_\gamma}$ be the specified vertex. The generalized data version of $loc_\beta^{t_\gamma}$ consists of the visited time $t_\gamma$ and the label $LABEL(r_x)$. With $RT_{SH}$, $LABEL(r_x)^{t_\gamma}$ is given by the bounding rectangle $r_x$ of $loc_\beta^{t_\gamma}$ at level $\epsilon$ of $RT_{SH}$; with $MLH_{G(V)}$, $LABEL(r_x)^{t_\gamma}$ is given by the label of $r_x$ at level $\epsilon$ of $MLH_{G(V)}$.

3.7. The Proposed Privacy Preservation Model

3.7.1. Problem Statement

Let $G(V,E)$ be a directed graph that represents the sequences of users' visited locations, and let $SEN \subseteq G(V)$ be the set of sensitive locations available in $G(V,E)$. Let $\epsilon$ be the data generalization constraint, and let $RT_{SH}$ or $MLH_{G(V)}$ be the data structure proposed to present the level of locations in $G(V,E)$; it is the reference of specific location levels that can be used to generalize the unique locations in $G(V,E)$ to be indistinguishable. Let $\xi$ be the data suppression constraint for suppressing the unique locations in $G(V,E)$ to be indistinguishable, i.e., another data distortion technique that is also used to distort the unique locations in $G(V,E)$. Let $\tau_b$ and $\tau_e$ be the time bounds of the specified locations in $G(V,E)$, such that $\tau_b$ is the initial time and $\tau_e$ is the end time, where $\tau_b < \tau_e$. Let $f_{DG}(f_{DS}(f_{DSW}(G(V,E), \tau_b, \tau_e), \xi), SEN, RT_{SH}, \epsilon) : G(V,E) \rightarrow SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$ be a privacy preservation function proposed for transforming $G(V,E)$ into $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$. That is, the vertices of $G(V,E)$ are restricted by $\tau_b$ and $\tau_e$ to become $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$.
All unique locations of $G(V,E)$ are suppressed by $\xi$. Moreover, each sensitive location $sen \in SEN$ is generalized by its less specific value available at level $L_\epsilon$ of $RT_{SH}$ or $MLH_{G(V)}$.

3.7.2. The Privacy Preservation Algorithm

In this section, we present an algorithm that transforms $G(V,E)$ to satisfy the proposed privacy preservation constraints, as shown in Algorithm 1. This algorithm has seven inputs, i.e., $G(V,E)$, $SEN(G(V))$, $DST$, $\tau_b$, $\tau_e$, $\xi$, and $\epsilon$. $G(V,E)$ is the specified graph that represents the sequences of users' visited locations. $SEN$ is the set of sensitive vertices, the sensitive locations available in $G(V,E)$. $DST$ is $RT_{SH}$ or $MLH_{G(V)}$, which represents the data specification level of each sensitive vertex in $SEN$. $\tau_b$ and $\tau_e$ are the time bounds of the specified vertices in $G(V,E)$, such that $\tau_b$ is the initial time and $\tau_e$ is the end time, where $\tau_b < \tau_e$. $\xi$ is the data suppression constraint for suppressing the unique vertices in $G(V,E)$, and $\epsilon$ is the data generalization constraint for generalizing each sensitive location in $G(V,E)$. The output of this algorithm is $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$, which satisfies $\tau_b$, $\tau_e$, $\xi$, and $\epsilon$.
Algorithm 1 $(G(V,E), SEN, DST, \tau_b, \tau_e, \xi, \epsilon)$-privacy
  • Require: $G(V,E) \neq NULL$, $DST \neq NULL$, $\tau_b < \tau_e$, $\xi \ge 0$, and $\epsilon \ge 0$
  • Ensure: $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$ are satisfied by $\tau_b$, $\tau_e$, $\xi$, and $\epsilon$
  •    $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q \leftarrow f_{DS}(f_{DSW}(G(V,E), \tau_b, \tau_e), \xi)$
  •    for $\varrho \leftarrow 1$ to $q$ do
  •        for $path \in SUB(G(V,E))_\varrho$ do
  •            for $sen \in SEN$ do
  •                for $v \in path$ do
  •                    if $sen$ is equal to $v$ then
  •                        $v$ in $path$ is generalized by $f_{DG}(v, DST, \epsilon)$
  •                    end if
  •                end for
  •            end for
  •        end for
  •    end for
  •    Return $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$
To achieve the proposed privacy preservation constraints in $G(V,E)$, the algorithm first slides (or splits) the vertices of $G(V,E)$ into $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$ by $f_{DSW}(G(V,E), \tau_b, \tau_e)$. That is, vertices of $G(V,E)$ whose timestamps do not fall between $\tau_b$ and $\tau_e$ are ignored because they lie outside the scope of the data collected for publishing purposes. Then, the unique vertices available at levels $0, \ldots, \xi-1$, and $\xi$ of $G(V,E)$ are suppressed. Thus, the output of this step is a forest of $SUB(G(V,E))$, i.e., $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$. Subsequently, $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$ are iterated, as is every sequence of vertices in each $SUB(G(V,E))_\varrho$, where $1 \le \varrho \le q$, and every vertex of each sequence.
If the algorithm finds a sensitive vertex $sen \in SEN$, it is generalized by its less specific value available at level $\epsilon$ of $DST$. Finally, the algorithm returns $SUB(G(V,E))_1, \ldots, SUB(G(V,E))_q$, which satisfy $\xi$ and $\epsilon$.
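Composing the earlier sketches gives a compact approximation of Algorithm 1. The vertex naming convention ("loc.time") and the generalize callback (e.g., mlh_generalize from Section 3.4.2) are illustrative assumptions rather than the authors' implementation.

def xi_epsilon_privacy(edges, sen, tau_b, tau_e, xi, eps, timestamp, generalize):
    """A sketch of Algorithm 1 composed from the earlier sketches
    (f_dsw, reidentification_levels, xi_suppression).  `sen` is the set
    of sensitive locations and `generalize(loc, eps)` returns the
    level-eps label from DST (an MLH or RT_SH)."""
    window = f_dsw(edges, tau_b, tau_e, timestamp)               # sliding window
    forest = xi_suppression(window, reidentification_levels(window), xi)

    def g(vertex):                                               # eps-generalization
        loc, t = vertex.split(".")
        return f"{generalize(loc, eps)}.{t}" if loc in sen else vertex

    return [(g(u), g(v)) for u, v in forest]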
In addition, regarding the complexity of the proposed privacy preservation algorithm, if we only consider suppression-based data distortion, we can see that only the number of paths and the level of the suppressed vertices affect the data suppression processes. Therefore, the complexity of these processes can be defined using Equation (1):
$O(f_{DS}(G(V,E))) = \xi \cdot |PATH|$ (1)
where
  • $\xi$ is the level of the suppressed vertices.
  • $|PATH|$ is the number of paths that are available in $G(V,E)$.
The data generalization process is computed after the data suppression process, as suppression is given first priority in our framework. Generally, the result of data suppression on $G(V,E)$ is a forest of graphs $G(V,E)_1, \ldots, G(V,E)_q$. For example, the original graph is illustrated in Figure 2, while the corresponding output after the suppression process, the forest of graphs of $G(V,E)$, is shown in Figure 4.
Therefore, the data generalization complexity is based on the number of forest graphs of $G(V,E)$, the number of sensitive locations, the number of vertices in each forest graph, and the height of $RT_{SH}$ or $MLH$. The complexity of data generalization for each forest graph can be defined using Equation (2), and the data generalization complexity of the whole proposed algorithm can be defined by Equation (3).
$O(f_{DG}(G(V,E)_\varrho)) = n \cdot |SEN| \cdot |PATH| \cdot (l-1)$ (2)
where
  • $n$ is the number of locations that each user visits in $G(V,E)$.
  • $|SEN|$ is the number of sensitive locations that must be protected in $G(V,E)$.
  • $|PATH|$ is the number of paths that are available in $G(V,E)_\varrho$ of $G(V,E)$.
  • $l$ is the height of $RT_{SH}$ or $MLH$.
$O(G(V,E)_1, \ldots, G(V,E)_q) = \sum_{\varrho=1}^{q} f_{DG}(G(V,E)_\varrho)$ (3)
where
  • $G(V,E)_1, \ldots, G(V,E)_q$ are the forest graphs of $G(V,E)$.
Therefore, the complexity of the proposed privacy preservation algorithm is the sum of the complexities of its data suppression and generalization processes, as given by Equation (4).
$O(G(V,E)) = O(f_{DS}(G(V,E))) + O(G(V,E)_1, \ldots, G(V,E)_q)$ (4)

3.7.3. Utility Measurement

With the proposed privacy preservation algorithm, presented in Section 3.7.2, we can see that $G(V,E)$ can achieve the privacy preservation constraints by suppressing and generalizing the unique and sensitive vertices to be indistinguishable. For this reason, a metric is needed to measure data utility.
With data suppression, the unique vertices in $G(V,E)$ are removed until the proposed privacy preservation constraint is satisfied. Each removed vertex directly affects the data utility of $G(V,E)$. Therefore, the penalty cost (or data loss) of suppression can be defined using Equation (5); a higher penalty cost implies lower data utility in $G(V,E)$.
$SuppLoss(G(V,E)) = \frac{|PATH| \cdot (\xi + 1)}{|G(V)| - |SUPP(G(V))| + (|PATH| \cdot (\xi + 1))}$ (5)
where
  • $|PATH|$ is the number of paths in $G(V,E)$.
  • $|G(V)|$ is the number of vertices in $G(V,E)$.
  • $|SUPP(G(V))|$ is the number of suppressed vertices in $G(V,E)$.
With data generalization, sensitive vertices are distorted by their less specific values to satisfy the proposed privacy preservation constraint. Generalized vertices also affect the data utility of $G(V,E)$; an appropriate metric for measuring this utility is shown in Equation (6). A higher penalty cost in Equation (6) implies lower data utility in $G(V,E)$.
$GenLoss(G(V,E)) = \frac{\sum_{\alpha=1}^{|G(V)|} \frac{L(v_\alpha)}{H(RT_{SH})}}{|G(V)|}$ (6)
where
  • $L(v_\alpha)$ represents the generalization level of the vertex $v_\alpha$.
  • $H(RT_{SH})$ is the height of $RT_{SH}$.
Therefore, the overall penalty cost of $G(V,E)$ can be defined using Equation (7); a low penalty cost is desirable.
$TotalLoss(G(V,E)) = SuppLoss(G(V,E)) + GenLoss(G(V,E))$ (7)
In addition to Equation (7), the utility of the $G(V,E)$ data can be measured using a relative error metric [65]. With this metric, the penalty cost of $G(V,E)$ is based on the difference between the original query result and that of the related experimental query; a higher relative error means that $G(V,E)$ has lower data utility. The relative error is defined in Equation (8).
$RelativeError(v, v_0) = \frac{|v - v_0|}{v}$ (8)
where
  • $v$ is the original query result.
  • $v_0$ is the result of the related experimental query.
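For completeness, the sketch below translates the utility metrics into Python, following the reconstructions of Equations (5), (6), and (8) given above; the total loss of Equation (7) is simply the sum of the first two.

def supp_loss(num_paths, num_vertices, num_suppressed, xi):
    """Suppression penalty, following the reconstruction of Equation (5)."""
    removed = num_paths * (xi + 1)
    return removed / (num_vertices - num_suppressed + removed)

def gen_loss(levels, height):
    """Generalization penalty (Equation (6)): the mean of each vertex's
    generalization level normalized by the hierarchy height H(RT_SH)."""
    return sum(l / height for l in levels) / len(levels)

def relative_error(v, v0):
    """Relative error (Equation (8)) between the original query result v
    and the experimental query result v0."""
    return abs(v - v0) / v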

4. Experiment

In this section, we present the experiments conducted to evaluate the proposed algorithm in terms of both effectiveness and efficiency. Effectiveness is assessed using two measures: utility loss and relative error. Utility loss, measured by $TotalLoss$, evaluates data quality after anonymization. We evaluated the relative error between the original and experimental query results for different query types, including full, partial, and range scans. Efficiency is evaluated based on the total execution time required for the anonymization process.

4.1. Experimental Setup

We evaluated our proposed framework using trajectory datasets, which are inherently high-dimensional and sparse [66,67], since positions are only recorded when changes occur, leaving most time–location cells empty. This sparsity often makes trajectories highly specific and unique, thereby increasing the risk of re-identification. For our experiments, we employed two well-established trajectory datasets—City80k and Metro100k—which have been widely adopted in trajectory data privacy research [66,68,69]. In this context, Metro100k exemplifies sparse high-dimensional data, while City80k provides a comparatively denser representation of mobility. Together, they enable the evaluation of the proposed framework under both high-dimensional and sparse trajectory conditions.
The City80k dataset simulates the movement trajectories of 80,000 citizens navigating through a metropolitan area. It records movements across 26 city blocks over a 24 h period, reflecting realistic urban mobility patterns. Each trajectory is represented as a sequence of visited locations in the format location_id, where locations are denoted by alphanumeric codes such as $f_1$, $e_2$, and $c_3$. The dataset also contains five disease categories as sensitive attributes, namely HIV, Cancer, SARS, Flu, and Diabetes, which were not utilized in this research. The Metro100k dataset is designed to represent the transit patterns of 100,000 passengers traveling within the Montreal subway system. Passenger movements are recorded across 65 stations within a 60 min time window, resulting in 3900 possible spatio-temporal combinations derived from 65 locations and 60 time units. Each point in a trajectory is presented in the form $L28.T1$, indicating location 28 at timestamp $T_1$. Five employment status categories are included as sensitive attributes, namely Welfare, Full-time, Retired, Part-time, and Self-employed; however, these sensitive attributes were not used in this investigation.
Our experiments were carried out on an Intel Core i7-6700 3.40 GHz desktop computer (Intel Corp., Santa Clara, CA, USA) with 16 GB RAM, following the experimental setup described in related trajectory privacy research. Our experimental evaluation focused on the trade-off between privacy protection and data utility preservation under different parameter configurations, including the number of suppressed timestamps ($\xi$), the number of generalized locations ($\epsilon$), the level of the Manual Location Hierarchy ($MLH$), and the dataset size.

4.1.1. Effectiveness

In the first part of the experiment, we examined the effectiveness of the algorithms by varying $\xi$ and $\epsilon$, as well as the level of $MLH$. Effectiveness is evaluated using utility loss, with results presented on a logarithmic scale. To enhance robustness, each data point in the plots represents the average value obtained from three independent random trials. In addition, our evaluation primarily focuses on the re-identification risks at starting points, since these locations are typically more unique and therefore more vulnerable to privacy attacks. Endpoint locations, while included in the datasets, were observed to be less distinctive due to user convergence at common destinations, and thus were not the main focus of our effectiveness analysis.
In Figure 6a,b, we examine how increasing the privacy-preserving parameters affects utility loss. In Figure 6a, we vary $\xi$ from 1 to 5 timestamps, while keeping $\epsilon$ at 1 and the level of $MLH$ at 1, using the complete dataset. In Figure 6b, we vary $\epsilon$ from 1 to 5 locations while fixing $\xi$ at 1 and the level of $MLH$ at 1, using the complete dataset (100%).
Both experiments demonstrate a consistent pattern: utility loss increases as the privacy-preserving parameters increase. This occurs because suppressing multiple timestamps and generalizing locations with near-uniform frequency distributions, where each location exhibits similar occurrence counts, lead to increased data distortion and reduced analytical utility of the dataset. These findings confirm the expected trade-off between privacy protection and data utility.
In Figure 6c,d, we analyze the impact of $MLH$ on utility loss under different privacy configurations, where $MLH$ is varied from level 1 to the maximum level 4. In Figure 6c, the parameters are set to $\xi = 0$ and $\epsilon = 1$, whereas in Figure 6d, $\xi = 1$ and $\epsilon = 1$ are applied. $MLH$ denotes the Manual Location Hierarchy, another way of constructing the R-Tree of the location graph $G(V,E)$, typically designed by location data experts. It can be organized according to criteria such as urban zoning, road networks, or other application-specific considerations, and is structured into multiple levels, where the lower levels represent more specific locations and the higher levels represent more general locations, with level 0 corresponding to the raw data. More general locations improve privacy preservation but reduce spatial granularity.
The results show that utility loss consistently increases as the $MLH$ level increases, as can be seen in Figure 6c,d for both datasets (City80k and Metro100k). This occurs because generalizing locations to higher levels reduces the data resolution, making it less useful for fine-grained analysis. Moreover, the City80k dataset yields consistently higher utility loss compared with Metro100k, likely due to its more diverse individual movement distribution. An additional observation is that, for City80k, the utility loss in Figure 6d is lower than in Figure 6f, because suppressing the first timestamp in City80k increases distortion, as the dataset is relatively dense at the first timestamp. In contrast, the Metro100k dataset shows almost no difference between Figure 6d,f. Although suppression is applied, it has little effect, since the first timestamp in Metro100k contains only 359 non-null records out of 100k in total, leading to negligible utility loss differences.
In Figure 6e,f, we investigate how dataset size affects utility loss under different privacy configurations. In Figure 6e, we vary the dataset size from 20 to 100% while keeping $\xi$ at 0, $\epsilon$ at 1, and $MLH$ at 1. In Figure 6f, we conduct a similar experiment but with $\xi$ increased to 1, while maintaining $\epsilon$ at 1 and $MLH$ at 1. A comparison shows that, when $\xi$ increases from 0 to 1 in Figure 6f, a higher utility loss is observed compared with Figure 6e, due to the suppression of one timestamp. However, both experiments show relatively stable utility loss as the dataset size increases. This stability can be explained by the stratified sampling approach, which preserves the dataset's proportional structure. Despite applying privacy-preserving techniques such as timestamp suppression or location generalization, the uniform frequency distributions across different locations are preserved across all dataset sizes.

4.1.2. Efficiency

Having demonstrated the effectiveness of the proposed algorithm, we subsequently consider its efficiency, i.e., the execution time.
We investigated efficiency with regard to $\xi$ and $\epsilon$, the level of $MLH$, and the dataset size. To ensure robustness, each plotted data point represents the mean value computed from three independent random trials.
In Figure 7a, we vary $\xi$ from 1 to 5 timestamps, while fixing $\epsilon$ at 1 and the level of $MLH$ at 1, using the complete dataset. The results indicate that the execution time increases as $\xi$ increases, because each additional timestamp requires separate suppression processing. The algorithm must identify and remove all data entries for each suppressed timestamp, leading to increased computational overhead. The effect is more evident in the Metro100k dataset, in which the larger scale and higher dimensionality substantially increase the computational burden compared with the smaller City80k dataset.
In Figure 7b, we vary $\epsilon$ from 1 to 5 locations, while fixing the number of suppressed timestamps at 1 and using the complete dataset (100%). The results show that the execution time gradually increases as the value of $\epsilon$ increases. This behavior occurs because each additional generalized location introduces further computational requirements for the generalization algorithm. As a consequence, the overall processing overhead rises and the execution time increases.
In Figure 7c,d, we analyze the impact of $MLH$ on execution time under different privacy configurations, where the $MLH$ level is varied from 1 to 4 using the complete dataset. In Figure 7c, the parameters are set to $\xi = 0$ and $\epsilon = 1$, whereas in Figure 7d, $\xi = 1$ and $\epsilon = 1$ are applied. The results demonstrate that the execution time remains relatively stable as the level of $MLH$ increases in both settings. This stability occurs because our algorithm performs a direct mapping from original values to the specified target $MLH$ level, without requiring sequential traversal through intermediate levels. As a result, transforming the data from the original values to level 1 requires the same computational effort as transforming directly to level 4, leading to a consistent processing time regardless of the chosen $MLH$ level.
A further observation is that the execution time in Figure 7d is slightly higher than in Figure 7c. This additional cost arises from the timestamp suppression when $\xi = 1$, as the algorithm needs to process the removal of the first timestamp before applying $MLH$ generalization. This slight difference confirms that the proposed framework maintains efficient scalability under stronger privacy configurations.
In Figure 7e,f, the impact of dataset size on execution time is examined under different privacy configurations. In Figure 7e, we vary the dataset size from 20 to 100%, while keeping $\xi$ at 0, $\epsilon$ at 1, and the level of $MLH$ at 1. Similarly, in Figure 7f, we employ the same experimental setup, except that $\xi$ is fixed at 1 while $\epsilon$ and $MLH$ remain unchanged.
The results from both experiments show that the execution time increases as the dataset size grows. This outcome arises because larger datasets contain a greater number of records that must be individually processed by our algorithm, leading to proportionally longer execution times. The introduction of timestamp suppression does not substantially affect this trend. Whether no suppression is applied ($\xi = 0$), as in Figure 7e, or a single timestamp is suppressed ($\xi = 1$), as in Figure 7f, the difference in execution time is negligible. This observation indicates that the computational complexity is primarily determined by dataset size rather than by the presence or absence of timestamp suppression.
Moreover, the observed behavior is consistent with the theoretical complexity, which can be expressed as O(n · |SEN| · |PATH| · (l − 1)). This alignment between the empirical measurements and the analytical bound supports the framework's scalability to large-scale trajectory data.

4.2. Relative Error Across Query Types

Our experimental evaluation compared the relative error of differential privacy, used as the baseline, with that of our proposed algorithm across three query types: full scan, partial scan, and multi-timestamp scan queries. The experiments were conducted on two real-world location datasets, City80k and Metro100k, and the results are reported in Figure 8 and Figure 9, respectively.
In the differential privacy baseline, the privacy budget ε is varied over 0.5, 1.0, and 1.5. A smaller ε corresponds to stronger privacy protection but a higher relative error, while a larger ε implies weaker protection but a lower relative error. In our proposed algorithm, ξ denotes the number of suppressed timestamps and ϵ the number of generalized locations, with the MLH level fixed at 1. To evaluate the algorithm comprehensively across query types and against the baseline, we varied ξ and ϵ over five configurations. In the first set of experiments, ξ was varied from 0 to 2 with ϵ = 1 fixed, corresponding to bars 4 through 6 in the figures for each query type. In the second set, ϵ was varied from 1 to 3 with ξ = 1 fixed, corresponding to bars 5, 7, and 8. For both the baseline and the proposed algorithm, all experiments were conducted on the complete dataset.
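The relative error metric is not restated in this section; assuming the usual definition, the absolute deviation of the anonymized answer from the original answer, normalized by the original answer, a short Python sketch reads:

def relative_error(original_count: float, anonymized_count: float) -> float:
    """Relative error of a count query: |Q(D) - Q(D')| / Q(D)."""
    if original_count == 0:
        return 0.0 if anonymized_count == 0 else 1.0
    return abs(original_count - anonymized_count) / original_count

# Toy numbers: a full scan that loses 24.9% of its cells to suppression.
print(relative_error(80_000, 80_000 * (1 - 0.249)))  # ~0.249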

4.2.1. Full Scan Queries

For full-scan operations, which count the total number of cells in the dataset, we evaluate the performance of different configurations of our proposed algorithm. The results reveal performance differences that reflect the underlying characteristics of each privacy preservation approach. The configuration ξ = 0 and ϵ = 1 achieves perfect accuracy with zero relative error (0.0000) on both City80k (Figure 8) and Metro100k (Figure 9). This is because the method only generalizes specific locations (for example, L7 becomes L7*) without removing any timestamp data, thereby preserving the complete cell structure required for accurate full-scan counts. Since the counting operation measures the number of cells containing any data, generalization does not affect the result. However, if counting were instead limited to cells containing only original (non-generalized) values, the generalization process would reduce the count and consequently increase the relative error. Furthermore, the proportion of anonymized data is extremely small relative to the total number of cells in the dataset, which minimizes its impact on query results.
Differential privacy also performs exceptionally well on full-scan queries, maintaining extremely low relative errors across all ε values. For City80k, the errors range from 0.0000011 to 0.0000149, while Metro100k performs even better, with errors between 0.00000058 and 0.0000078. This strong performance arises because differential privacy introduces statistical noise into query results, and for the large aggregate counts typical of full-scan operations the relative effect of this noise is negligible.
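The specific mechanism of the baseline is not detailed here; assuming the standard Laplace mechanism for count queries (sensitivity 1, noise scale 1/ε), the following sketch illustrates why the noise is negligible relative to large full-scan counts:

import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Count query answered with the Laplace mechanism (sensitivity 1)."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# On a large full-scan count the relative noise is tiny: the expected
# absolute noise at epsilon = 0.5 is 1/epsilon = 2, i.e., ~2.5e-5 of 80,000.
noisy = laplace_count(80_000, epsilon=0.5)
print(abs(noisy - 80_000) / 80_000)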
In the set of experiments with varying ξ, the relative error of our proposed method increases as ξ increases, and the suppression-based variants exhibit markedly degraded performance. In the City80k dataset, suppressing the first two timestamps (ξ = 2) results in an error of 0.249, whereas in the Metro100k dataset the error is only 0.0013. This large gap is explained by the distribution of data within the first two timestamps. In City80k, these timestamps are highly dense, containing approximately 24.9% of all cells in the dataset; removing them therefore substantially reduces the total cell count and produces a large relative error. In contrast, the first two timestamps in Metro100k contain only about 0.13% of the total dataset cells, so their removal has minimal impact on the full-scan count. Eliminating complete timestamps reduces the total cell count in either case, but the effect is magnified when those timestamps hold a significant share of the dataset.
In the set of experiments with varying ϵ, corresponding to bars 5, 7, and 8, the relative error remains essentially unchanged as ϵ increases. This is consistent with the earlier explanation: the counting operation considers the total number of cells containing any data, so generalization does not affect the result.

4.2.2. Partial Scan Queries

For partial scan operations, which retrieve counts for specific locations within a subset of timestamps, the results reveal more nuanced performance patterns. These queries are more sensitive to both timestamp suppression (ξ) and location generalization (ϵ), as they focus on localized subsets of the dataset rather than global aggregates. In this experiment, partial scans are performed using queries such as the following:
Query 1: SELECT COUNT(*) FROM city80k WHERE Timestamp1 = L14;
Query 2: SELECT COUNT(*) FROM city80k WHERE Timestamp1 = L3;
Query 3: SELECT COUNT(*) FROM city80k WHERE Timestamp1 = L24;
These are example queries for the City80k dataset, but the same procedure is applied to the Metro100k dataset with its corresponding location identifiers. One location (L14) is randomly selected for generalization, while the other two locations (L3 and L24) remain in their original form. The relative errors from these three queries are then averaged to produce the final metric for each configuration. This setup allows us to observe the effect of generalizing a single location on query accuracy while keeping other locations unchanged.
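To make the mismatch induced by generalization concrete, the following toy sketch (synthetic rows, not the actual City80k data) runs the partial-scan count before and after L14 is generalized to L14*:

import sqlite3

# Toy stand-in for City80k: a single timestamp column with synthetic rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city80k (Timestamp1 TEXT)")
con.executemany("INSERT INTO city80k VALUES (?)",
                [("L14",), ("L14",), ("L3",), ("L24",)])

query = "SELECT COUNT(*) FROM city80k WHERE Timestamp1 = 'L14'"
print(con.execute(query).fetchone()[0])  # 2 on the original data

# Generalize L14 -> L14*, as the (xi = 0, eps = 1) configuration would.
con.execute("UPDATE city80k SET Timestamp1 = 'L14*' WHERE Timestamp1 = 'L14'")

print(con.execute(query).fetchone()[0])  # 0: the literal 'L14' no longer
# matches, so this query's error is 1.0 while queries on L3 and L24 stay exact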
Differential privacy achieves the best performance across both datasets, with accuracy improving as the privacy budget ε increases. For the City80k dataset (Figure 8), relative errors range from 0.600 at ε = 0.5 to 0.067 at ε = 1.5. The Metro100k dataset (Figure 9) shows an even better performance, with errors decreasing from 0.111 at ε = 0.5 to 0.014 at ε = 1.5. This pattern occurs because differential privacy introduces carefully calibrated noise that, when applied to location-specific queries, still allows for reasonably accurate results, provided there is sufficient supporting data. The black-box nature of differential privacy ensures that users only interact through predefined query interfaces and receive statistical summaries, which helps maintain accuracy while protecting sensitive location information.
Our configuration ϵ = 1 and ξ = 0 delivers moderate performance, producing a consistent relative error of 0.333 for both City80k and Metro100k. This uniform value has a simple explanation. When a location is generalized (for example, L14 becomes L14*), the query targeting the original value no longer matches its generalized equivalent and returns zero, a relative error of 1.0 for that query, while the two queries on non-generalized locations match exactly, an error of 0. Averaging over the three queries yields 1/3 ≈ 0.333, a predictable, systematic accuracy loss for partial scan queries.
While the white-box nature of this approach provides transparency by allowing users to directly examine and query the transformed dataset, it also exposes the altered structure, introducing a clear trade-off between transparency and accuracy. We focus on protecting locations that could lead to re-identification or privacy breaches, so in such cases a higher relative error reflects intentional privacy preservation. Conversely, if a query targets a location without privacy risk, and thus not generalized, the relative error is 0, indicating high query accuracy. For example, if the queries target L22, L3, and L24 instead of L14, which was generalized to preserve privacy, the resulting relative error is 0 because all queried locations exactly match the original dataset.
In the set of experiments with varying ξ, the relative error of our proposed method stays at the maximum value of 1.0, except in the case ξ = 0 explained above. The suppression-based variants perform worst, producing relative errors of exactly 1.0 for both datasets and both suppression levels. For ξ = 1, suppression removes the first timestamp (T1), and for ξ = 2, both T1 and T2 are removed. Since Query 1 searches specifically for data in timestamp 1 and both suppression settings remove all data from the targeted timestamp(s), the query returns zero results, and the relative error reaches 1.0 in all cases. This behavior intentionally preserves privacy, as timestamp T1 often contains sensitive information such as home locations; our algorithm deletes this data to prevent potential re-identification while still allowing the level of privacy to be adjusted to requirements. Moreover, analyses that use data from timestamps other than T1 or T2 return results identical to the original dataset, providing full data utility for non-sensitive temporal segments.
In the second set, ϵ is varied from 1 to 3 with ξ = 1 fixed, corresponding to bars 5, 7, and 8. The relative error remains at the maximum value of 1.0 as ϵ increases, because it is already dominated by the suppression step at ξ = 1, as discussed in the experiments with varying ξ.

4.2.3. Multi-Timestamp Scan Queries

For range scan operations, which count the records for specific locations across multiple consecutive timestamps, the results fall between those of full and partial scan queries. These queries are affected by both location generalization (ϵ) and timestamp suppression (ξ); however, the magnitude of the impact depends on how much of the scanned range overlaps with the generalized or suppressed data. In this experiment, the range scan for the City80k dataset was performed using queries such as the following:
Query 4: SELECT COUNT(*) FROM city80k WHERE T1 = ’L14’ OR T2 = ’L14’ OR T3 = ’L14’;
Query 5: SELECT COUNT(*) FROM city80k WHERE T1 = ’L3’ OR T2 = ’L3’ OR T3 = ’L3’;
Query 6: SELECT COUNT(*) FROM city80k WHERE T1 = ’L24’ OR T2 = ’L24’ OR T3 = ’L24’;
These are example queries for City80k, but the same procedure was applied to Metro100k using its corresponding location identifiers. In each run, one location (L14) is randomly selected for generalization, while the other two locations (L3 and L24) remain in their original form. The relative errors from these three queries are averaged to obtain the final metric for each configuration.
Differential privacy achieves the best performance across both datasets, with accuracy improving as the privacy budget ε increases. For the City80k dataset (Figure 8), relative errors decrease from 0.745 at ε = 0.5 to 0.089 at ε = 1.5. In the Metro100k dataset (Figure 9), performance is even better, with errors ranging from 0.122 at ε = 0.5 to 0.016 at ε = 1.5. This is because the noise introduced by differential privacy has a proportionally smaller impact when results are aggregated over multiple timestamps, as the larger aggregated counts dilute the effect of the added noise.
Our configuration ξ = 0 and ϵ = 1 produces moderate performance, with relative errors of 0.333 for both City80k and Metro100k. This mirrors the partial scan case: generalizing one location, such as L14 becoming L14*, causes the query targeting it to miss the generalized records, and averaging over the three queries again yields an undercount of about one-third. We focus on protecting locations with high privacy risk, so a higher relative error in these cases reflects intentional privacy preservation. If the query targets only non-generalized locations, for example L22, L3, and L24 instead of L14, the relative error is 0 and the results match the original dataset.
When the configuration is ξ = 1 and ϵ = 1, all data in timestamp 1 are removed; with ξ = 2 and ϵ = 1, both timestamps 1 and 2 are removed. This approach protects sensitive periods, such as those containing home location data, and can be adjusted according to the desired privacy level. The suppression-based variants show notable accuracy degradation, with the extent of loss depending on how many suppressed timestamps fall within the query range. For ξ = 1, suppression removes T1, and for ξ = 2, both T1 and T2 are removed. When the scanned range includes these suppressed timestamps, the absence of data significantly lowers the counts, pushing the relative error higher. It does not, however, reach 1.0, as it does in partial scans, because T3 remains unsuppressed and still contributes to the counts. If the range scan targets only timestamps outside the suppressed set, the results exactly match the original dataset. In range scans, the SQL statements cover timestamps 1 through 3, so removing earlier timestamps directly lowers the query results.
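As a rough sanity check on this behavior, a sketch under the simplifying assumption that each record matches the queried location in only one timestamp of the scanned range (the match counts below are hypothetical): the relative error of a range scan under suppression is simply the fraction of matches that fall in the suppressed timestamps.

# Hypothetical per-timestamp match counts for one location over the range T1-T3.
matches = {"T1": 120, "T2": 100, "T3": 80}

def range_scan_error(suppressed: set) -> float:
    """Relative error when the suppressed timestamps drop out of the count."""
    total = sum(matches.values())
    kept = sum(v for t, v in matches.items() if t not in suppressed)
    return (total - kept) / total

print(range_scan_error({"T1"}))        # xi = 1: 120/300 = 0.40
print(range_scan_error({"T1", "T2"}))  # xi = 2: 220/300 ~= 0.733, below 1.0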
In the second set, ϵ is varied from 1 to 3 with ξ = 1 fixed, corresponding to bars 5, 7, and 8. As ϵ increases, the relative error also increases, because the generalization targets L14, L3, and L24, the very locations the queries search for. When all three locations are generalized (ϵ = 3), Queries 4, 5, and 6 return no results, since the original values have been fully generalized, and the relative error reaches the maximum value of 1.0. If the generalized locations did not overlap with the queried positions, the relative error would be lower because the queries would still retrieve valid results.
The query accuracy results for full, partial, and range scans illustrate the balance between privacy protection and data utility in our generalization–suppression method compared with differential privacy. Full scans show minimal impact from anonymization, while partial scans reveal that, by design, our method achieves perfect accuracy for non-sensitive locations but lower accuracy for sensitive ones. Range scans yield intermediate results, with suppression reducing accuracy based on the number of removed timestamps. Unlike differential privacy, which only returns statistical outputs from allowed queries, our approach provides full access to the transformed dataset, enabling flexible analysis while still protecting sensitive data.
In addition, the privacy domain includes several well-known models, such as k-anonymity, l-diversity, t-closeness, and LKC-privacy, which are commonly used as benchmarks. However, the strategy and data characteristics addressed in this research differ in terms of privacy leak conditions and structural properties. Specifically, LKC-privacy is designed for datasets that contain explicit sensitive attributes, whereas the proposed method operates on location-based graphs that do not include such attributes. Instead, privacy risks in our model arise from spatial and structural inferences, such as unique trajectory patterns or visits to sensitive locations. Consequently, the mechanisms of ξ-suppression and ϵ-generalization in our (ξ, ϵ)-privacy aim to mitigate trajectory disclosure and location inference rather than attribute disclosure. Therefore, while models such as k-anonymity, l-diversity, t-closeness, and LKC-privacy remain valuable in their respective contexts, they are not directly aligned with the objectives and constraints of the approach proposed in this study.

5. Conclusions

This work enumerated and explained the vulnerabilities of existing privacy preservation models (k-anonymity, l-diversity, t-closeness, LKC-privacy, differential privacy, and location-based privacy preservation models) to privacy violations arising from sensitive location inference, duplicate trajectory paths, and unique location attacks when location-based data are independently released. To address these vulnerabilities, we proposed a new (ξ, ϵ)-privacy model that counters such privacy violations while also reducing utility loss and transformation complexity.
Our experimental results indicate that location-based data released under the proposed model satisfy its privacy constraints; compared with other models, the proposed model is more secure in terms of privacy preservation and better maintains data utility. We also observed that endpoint re-identification rates were consistently lower than those of starting points, owing to higher overlap at common destinations, which highlights the importance of addressing both unique and overlapping trajectory locations in privacy preservation.

6. Future Work

Although the proposed model effectively addresses major privacy risks, such as sensitive location inference, duplicate trajectory disclosure, and unique location re-identification in independently released location-based data, adversaries are likely to develop new strategies to compromise mobility privacy. Future work should therefore explore enhanced models capable of mitigating emerging threats while ensuring scalability to much larger datasets (e.g., millions of trajectories or real-time mobility streams). Extending the framework to such settings would strengthen its applicability in large-scale deployments such as smart cities and IoT-based mobility systems.
In addition, future studies should construct or collect datasets that better reflect the three risk scenarios. While City80k and Metro100k offer useful insights, they do not fully capture sensitive locations, duplicate trajectories, or unique location risks. Moreover, our empirical findings suggest that endpoint re-identification rates are generally lower than those of starting points due to the high overlap at common destinations. This observation, however, requires further validation using diverse datasets, larger-scale adversarial models, and varying contextual factors (e.g., time-of-day effects). Future research may thus combine real-world mobility datasets with carefully designed synthetic datasets, not only to capture the three main risk scenarios more comprehensively, but also to investigate endpoint re-identification risks in greater depth, thereby strengthening the overall robustness of the proposed framework.

Author Contributions

Conceptualization, S.R. and N.H.; methodology and formal analysis, S.R.; validation and investigation, N.H.; writing—original draft, S.R.; writing—review and editing, N.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used and analyzed during the current study (City80k and Metro100k) are not publicly available due to privacy and confidentiality agreements. Data supporting the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank Maejo University and the University of Phayao for their academic and institutional support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cummins, C.; Orr, R.; O’Connor, H.; West, C. Global positioning systems (GPS) and microtechnology sensors in team sports: A systematic review. Sports Med. 2013, 43, 1025–1042. [Google Scholar] [CrossRef] [PubMed]
  2. Enge, P.K. The global positioning system: Signals, measurements, and performance. Int. J. Wirel. Inf. Netw. 1994, 1, 83–105. [Google Scholar] [CrossRef]
  3. Grewal, M.S.; Weill, L.R.; Andrews, A.P. Global Positioning Systems, Inertial Navigation, and Integration; John Wiley & Sons: Hoboken, NJ, USA, 2007. [Google Scholar]
  4. Lee, H.-S.; Kim, G.-H.; Ju, H.-S.; Mun, H.-S.; Oh, J.-H.; Shin, B.-S. Global Navigation Satellite System/Inertial Navigation System-Based Autonomous Driving Control System for Forestry Forwarders. Forests 2025, 4, 647. [Google Scholar] [CrossRef]
  5. Li, R.; Zheng, S.; Wang, E.; Chen, J.; Feng, S.; Wang, D.; Dai, L. Advances in BeiDou Navigation Satellite System (BDS) and satellite navigation augmentation technologies. Satell. Navig. 2020, 1, 12, Erratum in Satell. Navig. 2020, 1, 13. https://doi.org/10.1186/s43020-020-00015-x. [Google Scholar] [CrossRef]
  6. Yang, Y.; Gao, W.; Guo, S.; Mao, Y.; Yang, Y. Introduction to BeiDou-3 navigation satellite system. Navigation 2019, 66, 7–18. [Google Scholar] [CrossRef]
  7. Noone, R. Location Awareness in the Age of Google Maps; Taylor & Francis: Abingdon, UK, 2024. [Google Scholar]
  8. Pramanik, M.A.; Rahman, M.M.; Anam, A.I.; Ali, A.A.; Amin, M.A.; Rahman, A.M. Modeling traffic congestion in developing countries using google maps data. In Proceedings of the Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Virtual, 29–30 April 2021; Springer: Berlin/Heidelberg, Germany, 2021; Volume 1, pp. 513–531. [Google Scholar]
  9. Zhao, Q.; Yu, L.; Li, X.; Peng, D.; Zhang, Y.; Gong, P. Progress and trends in the application of Google Earth and Google Earth Engine. Remote Sens. 2021, 13, 3778. [Google Scholar] [CrossRef]
  10. Yang, L.; Driscol, J.; Sarigai, S.; Wu, Q.; Chen, H.; Lippitt, C.D. Google Earth Engine and artificial intelligence (AI): A comprehensive review. Remote Sens. 2022, 14, 3253. [Google Scholar] [CrossRef]
  11. Herfort, B.; Lautenbach, S.; Porto de Albuquerque, J.; Anderson, J.; Zipf, A. A spatio-temporal analysis investigating completeness and inequalities of global urban building data in OpenStreetMap. Nat. Commun. 2023, 14, 3985. [Google Scholar] [CrossRef]
  12. Biljecki, F.; Chow, Y.S.; Lee, K. Quality of crowdsourced geospatial building information: A global assessment of OpenStreetMap attributes. Build. Environ. 2023, 237, 110295. [Google Scholar] [CrossRef]
  13. Wojtusiak, J.; Nia, R.M. Location prediction using GPS trackers: Can machine learning help locate the missing people with dementia? Internet Things 2021, 13, 100035. [Google Scholar] [CrossRef]
  14. Cullen, A.; Mazhar, M.K.A.; Smith, M.D.; Lithander, F.E.; Ó Breasail, M.; Henderson, E.J. Wearable and portable GPS solutions for monitoring mobility in dementia: A systematic review. Sensors 2022, 22, 3336. [Google Scholar] [CrossRef]
  15. Yadav, S.P.; Zaidi, S.; Nascimento, C.D.S.; de Albuquerque, V.H.C.; Chauhan, S.S. Analysis and Design of automatically generating for GPS Based Moving Object Tracking System. In Proceedings of the 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), Greater Noida, 27–29 January 2023; pp. 1–5. [Google Scholar]
  16. McFedries, P.; McFedries, P. Protecting Your Device. In Troubleshooting iOS: Solving iPhone and iPad Problems; Apress: New York, NY, USA, 2017; pp. 91–109. [Google Scholar]
  17. Heinrich, A.; Bittner, N.; Hollick, M. AirGuard-protecting android users from stalking attacks by apple find my devices. In Proceedings of the 15th ACM Conference on Security and Privacy in Wireless and Mobile Networks, Washington DC, USA, 16–19 May 2022; pp. 26–38. [Google Scholar]
  18. Chen, Y.; Huang, Z.; Ai, H.; Guo, X.; Luo, F. The impact of GIS/GPS network information systems on the logistics distribution cost of tobacco enterprises. Transp. Res. Part Logist. Transp. Rev. 2021, 149, 102299. [Google Scholar] [CrossRef]
  19. Feng, Z.; Li, G.; Wang, W.; Zhang, L.; Xiang, W.; He, X.; Zhang, M.; Wei, N. Emergency logistics centers site selection by multi-criteria decision-making and GIS. Int. J. Disaster Risk Reduct. 2023, 96, 103921. [Google Scholar] [CrossRef]
  20. Zhang, X.; Liu, X. A two-stage robust model for express service network design with surging demand. Eur. J. Oper. Res. 2022, 299, 154–167. [Google Scholar] [CrossRef]
  21. Wang, L.; Garg, H.; Li, N. Pythagorean fuzzy interactive Hamacher power aggregation operators for assessment of express service quality with entropy weight. Soft Comput. 2021, 25, 973–993. [Google Scholar] [CrossRef]
  22. Xu, S.X.; Guo, R.Y.; Zhai, Y.; Feng, J.; Ning, Y. Toward a positive compensation policy for rail transport via mechanism design: The case of China Railway Express. Transp. Policy 2024, 146, 322–342. [Google Scholar] [CrossRef]
  23. Dalal, S.; Chiem, N.; Karbassi, N.; Liu, Y.; Monroy-Hernández, A. Understanding Human Intervention in the Platform Economy: A Case Study of an Indie Food Delivery Service; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar] [CrossRef]
  24. Hasegawa, Y.; Ido, K.; Kawai, S.; Kuroda, S. Who took gig jobs during the COVID-19 recession? Evidence from Uber Eats in Japan. Transp. Res. Interdiscip. Perspect. 2022, 13, 100543. [Google Scholar] [CrossRef]
  25. Panigrahi, A.K. A case study on Zomato–The online Foodking of India. J. Manag. Res. Anal. 2020, 7, 25–33. [Google Scholar] [CrossRef]
  26. Galati, A.; Crescimanno, M.; Vrontis, D.; Siggia, D. Contribution to the sustainability challenges of the food-delivery sector: Finding from the Deliveroo Italy case study. Sustainability 2020, 12, 7045. [Google Scholar] [CrossRef]
  27. Pourrahmani, E.; Jaller, M.; Fitch-Polse, D.T. Modeling the online food delivery pricing and waiting time: Evidence from Davis, Sacramento, and San Francisco. Transp. Res. Interdiscip. Perspect. 2023, 21, 100891. [Google Scholar] [CrossRef]
  28. Yeo, S.F.; Tan, C.L.; Teo, S.L.; Tan, K.H. The role of food apps servitization on repurchase intention: A study of FoodPanda. Int. J. Prod. Econ. 2021, 234, 108063. [Google Scholar] [CrossRef]
  29. Coifman, B.; Li, L. A critical evaluation of the Next Generation Simulation (NGSIM) vehicle trajectory dataset. Transp. Res. Part Methodol. 2017, 105, 362–377. [Google Scholar] [CrossRef]
  30. Ivanovic, B.; Song, G.; Gilitschenski, I.; Pavone, M. trajdata: A unified interface to multiple human trajectory datasets. Adv. Neural Inf. Process. Syst. 2024, 36, 27582–27593. [Google Scholar]
  31. Huang, X.; Yin, Y.; Lim, S.; Wang, G.; Hu, B.; Varadarajan, J.; Zheng, S.; Bulusu, A.; Zimmermann, R. Grab-posisi: An extensive real-life gps trajectory dataset in southeast asia. In Proceedings of the 3rd ACM SIGSPATIAL international workshop on prediction of human mobility, Chicago, IL, USA, 5 November 2019; pp. 1–10. [Google Scholar]
  32. Jiang, W.; Zhu, J.; Xu, J.; Li, Z.; Zhao, P.; Zhao, L. A feature based method for trajectory dataset segmentation and profiling. World Wide Web 2017, 20, 5–22. [Google Scholar] [CrossRef]
  33. Mohammed, N.; Fung, B.C.M.; Hung, P.C.K.; Debbabi, M. Anonymizing RFID Data: Preserving Privacy and Utility. In Privacy and Anonymity in Information Society; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5479, pp. 373–389. [Google Scholar]
  34. Peng, T.; Liu, Q.; Wang, G.; Xiang, Y.; Chen, S. Multidimensional privacy preservation in location-based services. Future Gener. Comput. Syst. 2019, 93, 312–326. [Google Scholar] [CrossRef]
  35. Shaham, S.; Ding, M.; Liu, B.; Dang, S.; Lin, Z.; Li, J. Privacy preservation in location-based services: A novel metric and attack model. IEEE Trans. Mob. Comput. 2020, 20, 3006–3019. [Google Scholar] [CrossRef]
  36. Sun, G.; Cai, S.; Yu, H.; Maharjan, S.; Chang, V.; Du, X.; Guizani, M. Location privacy preservation for mobile users in location-based services. IEEE Access 2019, 7, 87425–87438. [Google Scholar] [CrossRef]
  37. Yang, X.; Gao, L.; Zheng, J.; Wei, W. Location privacy preservation mechanism for location-based service with incomplete location data. IEEE Access 2020, 8, 95843–95854. [Google Scholar] [CrossRef]
  38. Liao, D.; Huang, X.; Anand, V.; Sun, G.; Yu, H. k-DLCA: An efficient approach for location privacy preservation in location-based services. In Proceedings of the 2016 IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 22–27 May 2016; pp. 1–6. [Google Scholar]
  39. Riyana, S.; Riyana, N. A privacy preservation model for rfid data-collections is highly secure and more efficient than lkc-privacy. In Proceedings of the 12th International Conference on Advances in Information Technology, Bangkok, Thailand, 29 June–1 July 2021; pp. 1–11. [Google Scholar]
  40. Rafiei, M.; Wagner, M.; van der Aalst, W.M. TLKC-privacy model for process mining. In Proceedings of the International Conference on Research Challenges in Information Science, Limassol, Cyprus, 12–14 May 2021; Springer: Berlin/Heidelberg, Germany, 2020; pp. 398–416. [Google Scholar]
  41. Liu, P.; Wu, D.; Shen, Z.; Wang, H. Trajectory privacy data publishing scheme based on local optimisation and R-tree. Connect. Sci. 2023, 35, 2203880. [Google Scholar] [CrossRef]
  42. Hemkumar, D.; Ravichandra, S.; Somayajulu, D. Impact of prior knowledge on privacy leakage in trajectory data publishing. Eng. Sci. Technol. Int. J. 2020, 23, 1291–1300. [Google Scholar] [CrossRef]
  43. Aïmeur, E.; Brassard, G.; Rioux, J. CLiKC: A privacy-mindful approach when sharing data. In Proceedings of the Risks and Security of Internet and Systems: 11th International Conference, CRiSIS 2016, Roscoff, France, 5–7 September 2016; Revised Selected Papers 11. Springer: Berlin/Heidelberg, Germany, 2017; pp. 3–10. [Google Scholar]
  44. Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 571–588. [Google Scholar] [CrossRef]
  45. Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
  46. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. l-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data (TKDD) 2007, 1, 3-es. [Google Scholar] [CrossRef]
  47. Li, N.; Li, T.; Venkatasubramanian, S. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 15 April 2006–20 April 2007; pp. 106–115. [Google Scholar]
  48. Wang, R.; Zhu, Y.; Chen, T.S.; Chang, C.C. Privacy-preserving algorithms for multiple sensitive attributes satisfying t-closeness. J. Comput. Sci. Technol. 2018, 33, 1231–1242. [Google Scholar] [CrossRef]
  49. Casas-Roma, J.; Herrera-Joancomartí, J.; Torra, V. k-Degree anonymity and edge selection: Improving data utility in large networks. Knowl. Inf. Syst. 2017, 50, 447–474. [Google Scholar] [CrossRef]
  50. Lu, D.; Kate, A. Rpm: Robust anonymity at scale. Proc. Priv. Enhancing Technol. 2023, 2023, 347–360. [Google Scholar] [CrossRef]
  51. Dorri, A.; Kanhere, S.S.; Jurdak, R.; Gauravaram, P. LSB: A Lightweight Scalable Blockchain for IoT security and anonymity. J. Parallel Distrib. Comput. 2019, 134, 180–197. [Google Scholar] [CrossRef]
  52. Bojja Venkatakrishnan, S.; Fanti, G.; Viswanath, P. Dandelion: Redesigning the bitcoin network for anonymity. Proc. ACM Meas. Anal. Comput. Syst. 2017, 1, 1–34. [Google Scholar] [CrossRef]
  53. Temuujin, O.; Ahn, J.; Im, D.H. Efficient L-diversity algorithm for preserving privacy of dynamically published datasets. IEEE Access 2019, 7, 122878–122888. [Google Scholar] [CrossRef]
  54. Parameshwarappa, P.; Chen, Z.; Koru, G. Anonymization of daily activity data by using l-diversity privacy model. ACM Trans. Manag. Inf. Syst. (TMIS) 2021, 12, 1–21. [Google Scholar] [CrossRef]
  55. Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12. [Google Scholar]
  56. Wang, B.; Li, H.; Ren, X.; Guo, Y. An Efficient Differential Privacy-Based Method for Location Privacy Protection in Location-Based Services. Sensors 2023, 23, 5219. [Google Scholar] [CrossRef]
  57. Yan, L.; Li, L.; Mu, X.; Wang, H.; Chen, X.; Shin, H. Differential Privacy Preservation for Location Semantics. Sensors 2023, 23, 2121. [Google Scholar] [CrossRef] [PubMed]
  58. Tao, Y.; Papadias, D. Maintaining sliding window skylines on data streams. IEEE Trans. Knowl. Data Eng. 2006, 18, 377–391. [Google Scholar] [CrossRef]
  59. Braverman, V.; Ostrovsky, R.; Zaniolo, C. Optimal sampling from sliding windows. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Providence, RI, USA, 29 June–1 July 2009; pp. 147–156. [Google Scholar]
  60. Yang, L.; Shami, A. A lightweight concept drift detection and adaptation framework for IoT data streams. IEEE Internet Things Mag. 2021, 4, 96–101. [Google Scholar] [CrossRef]
  61. Nguyen, T.D.; Shih, M.H.; Srivastava, D.; Tirthapura, S.; Xu, B. Stratified random sampling from streaming and stored data. Distrib. Parallel Databases 2021, 39, 665–710. [Google Scholar] [CrossRef]
  62. Qiao, J.; Feng, G.; Yao, G.; Li, C.; Tang, Y.; Fang, B.; Zhao, T.; Hong, Z.; Jing, X. Research progress on the principle and application of multi-dimensional information encryption based on metasurface. Opt. Laser Technol. 2024, 179, 111263. [Google Scholar] [CrossRef]
  63. McCabe, M.C.; Lee, J.; Chowdhury, A.; Grossman, D.; Frieder, O. On the design and evaluation of a multi-dimensional approach to information retrieval. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 363–365. [Google Scholar]
  64. Guttman, A. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Boston, MA, USA, 18–21 June 1984; pp. 47–57. [Google Scholar]
  65. Riyana, S.; Natwichai, J. Privacy preservation for recommendation databases. Serv. Oriented Comput. Appl. 2018, 12, 259–273. [Google Scholar] [CrossRef]
  66. Chen, R.; Fung, B.C.; Mohammed, N.; Desai, B.C.; Wang, K. Privacy-preserving trajectory data publishing by local suppression. Inf. Sci. 2013, 231, 83–97. [Google Scholar] [CrossRef]
  67. Jin, F.; Hua, W.; Francia, M.; Chao, P.; Orlowska, M.E.; Zhou, X. A Survey and Experimental Study on Privacy-Preserving Trajectory Data Publishing. IEEE Trans. Knowl. Data Eng. 2023, 35, 5577–5596. [Google Scholar] [CrossRef]
  68. Harnsamut, N.; Natwichai, J.; Riyana, S. Privacy preservation for trajectory data publishing by look-up table generalization. In Proceedings of the Australasian Database Conference. Springer International Publishing Cham, Gold Coast, QLD, Australia, 24–27 May 2018; pp. 15–27. [Google Scholar]
  69. Harnsamut, N.; Natwichai, J. Privacy-aware trajectory data publishing: An optimal efficient generalisation algorithm. Int. J. Grid Util. Comput. 2023, 14, 632–643. [Google Scholar] [CrossRef]
Figure 1. An example of a user’s visited locations. The red circle is a sensitive location.
Figure 2. The level of data re-identification for each vertex in Figure 1.
Figure 3. An example of R-Trees. (a) The rectangles of locations. (b) The tree corresponding to the rectangles.
Figure 4. Four data versions of Figure 2 after suppressing the vertices of levels 0, 1, 2, and 3, respectively. (a) Suppressing level 0. (b) Suppressing levels 0 and 1. (c) Suppressing levels 0, 1, and 2. (d) Suppressing levels 0, 1, 2, and 3.
Figure 5. Location bounding with split-halves R-Trees. (a) The focused area. (b) The first round of splitting the focused area. (c) The second round of splitting the focused area. (d) The third round of splitting the focused area.
Figure 6. Effect of ϵ, ξ, MLH, and dataset size on utility loss.
Figure 7. Effect of ϵ, ξ, MLH, and dataset size on execution time.
Figure 8. Effect of relative error across query types for the City80k dataset.
Figure 9. Effect of relative error across query types for the Metro100k dataset.
Table 1. An example of a trajectory dataset used in this study.

Path | Diagnosis
t1: ⟨a2, c4, e5⟩ | HIV
t2: ⟨a1, c4, e5⟩ | Food poisoning
t3: ⟨c4⟩ | Leukemia
t4: ⟨a3, c4, e6⟩ | GERD
t5: ⟨a3, c4, e6, a7⟩ | Cancer
t6: ⟨a1, b2, e5, a8⟩ | Flu
t7: ⟨a1, b2, e5, a8, a9⟩ | Diabetes
t8: ⟨a1, b2⟩ | Tuberculosis
t9: ⟨c4, e5⟩ | Conjunctivitis
t10: ⟨a3, c4, e5⟩ | Flu
Table 2. A data version of Table 1 that satisfies the LKC-privacy constraints, where L = 2, K = 2, and C = 0.70.

Path | Diagnosis
t1: ⟨c4, e5⟩ | HIV
t2: ⟨c4, e5⟩ | Food poisoning
t3: ⟨c4⟩ | Leukemia
t4: ⟨a3, c4, e6⟩ | GERD
t5: ⟨a3, c4, e6⟩ | Cancer
t6: ⟨a1, b2, e5, a8⟩ | Flu
t7: ⟨a1, b2, e5, a8⟩ | Diabetes
t8: ⟨a1, b2⟩ | Tuberculosis
t9: ⟨c4, e5⟩ | Conjunctivitis
t10: ⟨a3, c4⟩ | Flu
Table 3. Comparison of classical privacy-preserving models and the proposed (ξ, ϵ)-privacy in terms of sensitive location handling, duplicate trajectories, unique locations, and released data version. A check mark (✓) denotes that the model supports the feature; a cross (✗) denotes that it does not.

Model | Sensitive Location Handling | Duplicate Trajectories | Unique Location Handling | Released Data Version
k-anonymity | ✗ | ✗ | ✗ | ✓
l-diversity | ✗ | ✗ | ✗ | ✓
t-closeness | ✗ | ✗ | ✗ | ✓
Differential privacy | ✓ | ✗ | ✓ | ✗
(ξ, ϵ)-privacy (the proposed model) | ✓ | ✓ | ✓ | ✓