Peer-Review Record

Using HyperLogLog to Prevent Data Retention in Social Media Streaming Data Analytics

ISPRS Int. J. Geo-Inf. 2023, 12(2), 60; https://doi.org/10.3390/ijgi12020060
by Marc Löchner * and Dirk Burghardt
Submission received: 22 December 2022 / Revised: 30 January 2023 / Accepted: 6 February 2023 / Published: 9 February 2023
(This article belongs to the Special Issue Trustful and Ethical Use of Geospatial Data)

Round 1

Reviewer 1 Report

The presented manuscript describes the application of the HyperLogLog algorithm to contributors' privacy in social media data streaming by geohash. It is an exciting subject that significantly impacts a better understanding of social media data and analysis, with a focus on data privacy. The manuscript is generally well structured. General remarks are:

 

  • I suggest adding more background, as there is a lot of research dealing with the use of social media in disaster situations. It would be interesting to see where yours fits in and why it is more applicable than other approaches already integrated into various early-warning tools.
  • The references are not shown according to the instructions for authors. Some are repeated, and some are missing data (Twitter, Google, etc.).

 

This manuscript is suitable for publishing after addressing the stated remarks in a minor revision. However, as you mentioned yourself, this approach has its limitations. Still, it can serve as an excellent basis for further work in ensuring the use of volunteer data, which would increase the motivation to share and contribute to usability.

 

Author Response

> * I suggest adding more background, as there is a lot of research dealing with the use of social media in disaster situations. It would be interesting to see where yours fits in and why it is more applicable than other approaches already integrated into various early-warning tools.
Thank you for your remark. While the article is not intended to highlight the use of social media in disaster situations explicitly, we have added some exemplary articles about research on social media data usage in the context of flooding events. The VOST article is the source of the exemplary scenario for applying our proposed method, which uses HLL to prevent data retention. We have also restructured the article in multiple places to better emphasize its intention.
> * The references are not shown according to the instructions for authors. Some are repeated, and some are missing data (Twitter, Google, etc.).
Thank you for pointing this out. We have cleaned up the BibTeX library behind our reference list to meet the required quality.
Along with the new manuscript version, we have added a diff document for you that explicitly highlights the changes between the versions.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper addresses an important and interesting topic. I only have a few minor comments.

1. Are there any metrics that could measure the privacy-preserving effect?

2. What is the generalizability of this framework, such as in different regions and for different topics other than disaster management?

3. According to Figure 3, the unit of geohash region is pretty large. Can such a framework be applied for higher-resolution regions?

 

 

Author Response


> 1. Are there any metrics that could measure the privacy-preserving effect?
While we have not compared the captured "omickron" cardinalities with their original values, we have analyzed the accuracy of the computed cardinalities, and thereby the privacy impact, in earlier publications (Löchner et al., 2020; Dunkel et al., 2020). We have added a remark and references accordingly.
> 2. What is the generalizability of this framework, such as in different regions and for different topics other than disaster management?
The concept of using HLL to preserve privacy, rather than to improve performance, is useful in any analysis scenario in which results are based only on statistical values. We have restructured the article in multiple places and emphasized this especially in the conclusion chapter.
> 3. According to Figure 3, the unit of geohash region is pretty large. Can such a framework be applied for higher-resolution regions?
Yes, the use of geohash precision values other than those presented is possible and encouraged. We have emphasized this in the corresponding concept chapter 3.2 and revisited the topic in the discussion chapter 5.
Along with the new manuscript version, we have added a Diff document for you that highlights the changes between the versions explicitly.
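
To illustrate the geohash precision question raised in comment 3, the following is a minimal, self-contained sketch of standard geohash encoding. It is not taken from the manuscript; the function names `geohash_encode` and `cell_size_degrees` are hypothetical and chosen only for illustration. Each additional character adds five bits, so the cell shrinks with every step in precision.

```python
# Illustrative sketch only: standard geohash encoding, not the authors' code.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int) -> str:
    """Encode a WGS84 coordinate as a geohash with `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def cell_size_degrees(precision: int) -> tuple:
    """Return (longitude extent, latitude extent) of one cell in degrees."""
    total_bits = 5 * precision
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit when odd
    lat_bits = total_bits // 2
    return 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits


if __name__ == "__main__":
    for p in range(1, 7):
        print(p, geohash_encode(57.64911, 10.40744, p), cell_size_degrees(p))
```

Because a shorter geohash is always a prefix of the longer one for the same point, counts collected at a fine precision can later be aggregated to coarser cells, whereas the reverse is not possible.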

Author Response File: Author Response.pdf

Reviewer 3 Report

Using HyperLogLog to prevent data retention in social media streaming data analytics

 

The topic is very interesting, because investigations using social media as a data source have increased remarkably and numerous concerns about the protection of personal data have arisen, but several aspects of the manuscript are not clear. In a high-impact indexed journal like IJGI, most references should be articles published in prestigious journals; however, this manuscript contains numerous references to web pages and conference proceedings.

 

Keywords: privacy; social media; data retention; hyperloglog

 

MDPI: “Three to ten pertinent keywords need to be added after the abstract.”

To increase the visibility of the manuscript through bibliographic search engines, it is necessary to broaden the key terms. For example:

 

Keywords: social media; privacy protection; data retention; disaster management; geocode systems; privacy-aware data storage; cardinality estimation; hyperloglog algorithm; Twitter

 

Background

 

“Social media data is being defined as personal data by Commission [15].”

 

This statement is not clear. First, Commission should be European Commission (EC). Second, which article of the EC directive establishes that social media is personal data? Third, “Personal data is any information that relates to an identified or identifiable living individual (EC)”; in this sense, under a number of state and federal laws of the United States (USA), personal information broadly includes any information that identifies or is linked or reasonably linkable to an individual. Finally, according to the EC, "anonymised data" is an example of data not considered personal data. Therefore, it is necessary to distinguish between social media that allow user identification and social media with anonymous users. For example, TripAdvisor hosts more than a billion anonymous user comments shared through a pseudonym known only to the user.

 

According to the EU GDPR Directive, it is also necessary to relate “The processing of personal data for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes” with “the principle of data minimisation”.

 

Materials and Methods

 

The limitations of the HyperLogLog algorithm are not clear. You should inform readers about the concerns of other authors about the use of such an algorithm. For example:

 

  • New cardinality estimation methods for HyperLogLog sketches (arXiv, 2017)
  • Security of HyperLogLog (HLL) cardinality estimation: vulnerabilities and protection (IEEE, 2020)
  • HyperLogLog: Exponentially bad in adversarial settings (ETH, 2022)

 

It is also interesting to inform readers about other privacy-aware data storage systems.

 

Results and Discussion

 

How accurate is the data represented in Figures 3 and 4? Have you verified the correspondence between the data estimated by HyperLogLog and the real data?

 

“like Prometheus or InfluxDB”: Please include references.

 

“Each area is converted from the geohash to a Postgis geometry and then returned as a standard GeoJSON compliant coordinate”: Please include any reference to PostGIS and GeoJSON.

 

“Fig. 3 shows an example implementation using Leaflet.”: Please include any reference to Leaflet.

 

The Discussion should be based on the results of this research and its relationship with other related research.

 

Conclusion

 

The organisation of this section fails to highlight the interesting implications of this study.

 

Concluding remarks section should have a summary of the research and findings, and three subsections:

 

Theoretical contribution: Does the study contribute to the body of knowledge on privacy-aware data retrieval from a theoretical or academic perspective?

 

Managerial implications: Can the study be useful for government organisations or other parties interested in privacy-aware social media analytics?

 

Limitations and future research: What limitations does the study have derived from the use of the HyperLogLog algorithm?

 

Reference list

 

Please ensure you check all your citations for completeness, accuracy and consistency: Multidisciplinary Digital Publishing Institute (MDPI) style. Please complete the references according to the MDPI template:

 

3. Twitter.

28. Google.

37. Löchner, M. VGIsink.

39. Twitter.

40. Data, C.

Etc.

44. What does “In Proceedings of the Proceedings” mean?

Author Response

> Keywords
Thank you for your suggestions, I have gratefully adopted them.
> European Commission
You are right, this statement is unclear and might even be misleading. It was not my intention to discuss whether (all) social media data is by definition personal data, but only to pick the term up again so that the next sentence would make sense. On review, however, I found that the point fits better in the paragraph above, so I have removed that statement completely. Thanks for pointing that out.
> concerns about the algorithm
Thank you for suggesting these articles about security issues and improvements. I have extended the related-work section on HLL as well as the discussion, where I took the chance to address the fact that social media data is publicly available per se.
> inform readers about other privacy-aware data storage systems
Thanks for pointing that out. While I had already discussed differential privacy as an alternative approach, your remark encouraged me to add another paragraph with some exemplary articles about alternative storage methods and an argument for why they lack utility in our scenario.
> accuracy of the figures
While we have not compared the captured "omickron" cardinalities with their original values, we have analyzed the accuracy of the computed cardinalities in earlier publications (Löchner et al., 2020; Dunkel et al., 2020). We have added a remark and references accordingly. Nevertheless, your question has encouraged me to update figure 3 to zoom in on obvious differences in the spatial data. The figure is intended to demonstrate the usefulness of unifying multiple data sets rather than the accuracy of the data.
Figure 4 has no relation to figure 3. Its values are hypothetical, as its intention is to support the understanding of the scenario in section 3.4.
> conclusion
Thank you for outlining the expected structure of the conclusion. I have updated the chapter accordingly.
Along with the new manuscript version, I have added a diff document for you that highlights the changes between the versions explicitly.
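
The points above about accuracy and about unifying multiple data sets can be illustrated with a minimal HyperLogLog sketch. This is an assumption-laden illustration, not the implementation used in the manuscript: the class name `HyperLogLog`, the SHA-1 based hash, and the default of 2^12 registers are all choices made here for the example. The sketch stores only one small register per bucket and never the user IDs themselves, and two sketches are merged by taking the register-wise maximum. The typical relative error of the estimator is about 1.04/sqrt(m) for m registers.

```python
# Illustrative sketch only: a basic HyperLogLog, not the authors' implementation.
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        # p index bits give m = 2**p registers; the alpha formula in count()
        # assumes m >= 128, i.e. p >= 7.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash taken from SHA-1 (an assumption made for this sketch).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        # Merging is lossless: the result equals a sketch of the combined stream.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    a, b = HyperLogLog(), HyperLogLog()
    for i in range(0, 6000):
        a.add(f"user-{i}")
    for i in range(4000, 10000):
        b.add(f"user-{i}")
    print("estimate A:", round(a.count()))
    print("estimate B:", round(b.count()))
    print("estimate A ∪ B:", round(a.union(b).count()))
    print("expected relative error:", 1.04 / math.sqrt(a.m))
```

Comparing such estimates against known true counts, as in the demo above, is one simple way to check the accuracy question empirically for a given register count.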

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Using HyperLogLog to prevent data retention in social media streaming data analytics

The authors have satisfactorily addressed my concerns and the manuscript has been significantly improved. I therefore recommend its publication in the ISPRS International Journal of Geo-Information (IJGI), Special Issue on Trustful and Ethical Use of Geospatial Data.
