# Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation

## Abstract

## 1. Introduction

## 2. Previous Work

## 3. Concept

#### 3.1. System Model

#### 3.2. Analytics Service

## 4. Material and Methods

#### 4.1. Dataset

#### 4.2. Software Architecture

#### 4.3. First Component: HyperLogLog (HLL)

^{12} items to a single set, a number that is difficult to express in non-scientific notation. For comparison, a regwidth of 4 and a log2m of 10 already reduce the maximum number of items that can be estimated to 12 million, with a relative error of ±3.25% (for references to the above, see the online documentation). From a privacy perspective, it is recommended to use the smallest possible parameter settings, which depend on the expected maximum size of the HLL sets. In our case, the Flickr YFCC100M dataset encompasses 100 million post IDs in total, which is why we used the default settings of log2m = 11 and regwidth = 5. For many other datasets, smaller parameter settings will be possible.
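The ±3.25% figure quoted for log2m = 10 is consistent with the commonly cited HyperLogLog error bound of roughly 1.04/√m, where m = 2^log2m is the number of registers. The following is a minimal sketch under the assumption that this bound applies to the implementation used here; the function name `hll_relative_error` is hypothetical and not part of any HLL library.

```python
import math


def hll_relative_error(log2m: int) -> float:
    """Approximate relative standard error for an HLL set.

    Assumes the standard bound 1.04 / sqrt(m) with m = 2**log2m registers.
    """
    registers = 2 ** log2m
    return 1.04 / math.sqrt(registers)


# Compare the smaller setting discussed in the text with the default used here.
for log2m in (10, 11):
    print(f"log2m = {log2m}: ±{hll_relative_error(log2m) * 100:.2f}%")
```

Under this assumption, log2m = 10 reproduces the ±3.25% stated above, and the default of log2m = 11 gives about ±2.30%. In this bound the error depends only on log2m; regwidth does not enter it.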

#### 4.4. Second Component: Location

## 5. Case Study: Alex, “Sandy”, and “Robert”

## 6. Results

#### 6.1. Worldwide Visitation Patterns

#### 6.2. Utility of Published Benchmark Data

#### 6.3. Privacy Trade-Off

## 7. Discussion

## 8. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A

**Figure A2.** Figure 7 generated with a 50 km grid size parameter, with corresponding error rates.

**Figure 1.** Illustration of the system model and the two cases of possible adversaries discussed in this work.

**Figure 2.** Transformation steps applied to a single character string, such as a user ID, to generate a HyperLogLog (HLL) set, and the final estimation of cardinality (example values were generated with real data; other parameter settings may produce different values).

**Figure 3.** Percentage of global spatial outlier volume (k = 1) in the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset, for decreasing precision levels (GeoHash) and different metrics used in this paper (to reproduce this graphic, see Supplementary Materials, S5).

**Figure 5.** Screenshot of the map of user counts per 100 km grid bin, allowing interactive comparison of estimated values (HLL) and exact counts (raw) (see Figure A1 for a static, worldwide view of the map, and Supplementary Materials, S8, for the interactive version).

**Figure 7.** Analyzing spatial relationships with HLL intersection, based on incremental union of user sets from benchmark data (100 km grid) for France, Germany and the United Kingdom (**left**). The Venn diagram (**right**) shows the estimated common user counts for different groups and the percentage of error compared to raw data. The same graphic, generated for a 50 km grid size, is available in Figure A2.

**Table 1.** Total counts for different metrics based on raw and HLL data with default parameters (to reproduce these numbers, see Supplementary Materials, S5).

Metric | Exact (Raw) | Estimated (HLL) |
---|---|---|
Coordinate count | 12,764,268 | 12,756,691 |
User count | 581,099 | 589,475 |
Post count | 100,000,000 | 98,553,392 |
User days | 17,662,780 | 17,678,373 |
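The totals in Table 1 can be turned into relative deviations with simple arithmetic. The sketch below copies the four value pairs from the table; the names `TABLE_1` and `relative_deviation_percent` are hypothetical and chosen only for illustration.

```python
# (exact, estimated) totals copied from Table 1
TABLE_1 = {
    "Coordinate count": (12_764_268, 12_756_691),
    "User count": (581_099, 589_475),
    "Post count": (100_000_000, 98_553_392),
    "User days": (17_662_780, 17_678_373),
}


def relative_deviation_percent(exact: int, estimated: int) -> float:
    """Signed deviation of the HLL estimate from the exact count, in percent."""
    return (estimated - exact) / exact * 100.0


deviations = {
    metric: relative_deviation_percent(exact, estimated)
    for metric, (exact, estimated) in TABLE_1.items()
}

for metric, deviation in deviations.items():
    print(f"{metric}: {deviation:+.2f}%")
```

For these four totals, the estimates deviate from the exact counts by less than 1.5% in absolute terms. The largest deviations are for user count (overestimated) and post count (underestimated).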

Context | Raw Data | HLL Data |
---|---|---|
Input data size of comma-separated values (CSV) | 2.5 GB | Explicit: 281 MB; Sparse: 134 MB; Full: 3.3 GB |
Output data size, 100 km grid (CSV) | 182.46 MB | 19.80 MB |
Processing time (Worldmap) | Post count: 7 min 13 s; User count: 8 min 55 s; User days: 12 min 8 s | 54.1 s (post count, user count, user days) |
Memory peak (Worldmap) | Post count: 15.4 GB; User count: 15.5 GB; User days: 19.3 GB | 1.4 GB (post count, user count, user days) |
Benchmark data size (CSV) | / | 10.61 MB (bins with user count ≥ 100) |

