Improving Utility of Private Join Size Estimation via Shuffling
Abstract
1. Introduction
- We design a sketch-based protocol, SDPJoinSketch, for join size estimation under SDP. We provide detailed proofs of both the privacy amplification and the utility of SDPJoinSketch.
- We present an improved algorithm called SDPJoinSketch+, which reduces hash-collision errors by leveraging the anonymity of SDP with secure encryption techniques.
- We conduct experiments demonstrating the utility improvements of our methods compared to state-of-the-art approaches.
2. Related Work
2.1. Sketches for Join Size Estimation
2.2. Techniques and Applications Under SDP
3. Preliminaries
3.1. Fast-AGMS
3.2. Centralized Differential Privacy (CDP)
3.3. Local Differential Privacy (LDP)
3.4. Shuffle Model of DP (SDP)
4. Sketch-Based Join Size Estimation Under SDP
4.1. SDPJoinSketch
4.1.1. Client Side of SDPJoinSketch
| Algorithm 1 Local Randomizer |
| Public Parameters: |
| Input: Local data from i-th user, Hash function pairs |
| Output: indices and perturbed data |
| 1: Sample uniformly at random from [k] and [m], respectively. |
| 2: Initialize a vector |
| 3: ▹Encode (Lines 3–5) |
| 4: ▹ is a Hadamard matrix of order m |
| 5: |
| 6: Sample ▹ Perturb (Lines 6–11) |
| 7: if then |
| 8: |
| 9: else |
| 10: Unif({−1,+1}) |
| 11: end if |
| 12: return |
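
As a minimal illustration of the encode-then-perturb structure of Algorithm 1, the following Python sketch assumes a Sylvester-type Hadamard matrix (so m must be a power of two) and a keep-or-randomize perturbation. The keep probability `keep_prob`, the helper `make_hashes`, and the salted hashing it uses are hypothetical choices made for this example only; the exact perturbation probability is not fixed here.

```python
import random


def hadamard_entry(row: int, col: int) -> int:
    """Entry (row, col) of the Sylvester Hadamard matrix: (-1)^popcount(row & col)."""
    return -1 if bin(row & col).count("1") % 2 else 1


def make_hashes(k: int, m: int, seed: int = 0):
    """Hypothetical hash pairs: h(j, d) in [0, m) and xi(j, d) in {-1, +1}."""
    rng = random.Random(seed)
    salts = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(k)]

    def h(j, d):
        return hash((salts[j][0], d)) % m

    def xi(j, d):
        return 1 if hash((salts[j][1], d)) % 2 == 0 else -1

    return h, xi


def local_randomizer(d, k, m, keep_prob, h, xi, rng):
    """Return (y, j, l): one perturbed Hadamard coefficient and its indices."""
    j = rng.randrange(k)  # sampled sketch row
    l = rng.randrange(m)  # sampled Hadamard coordinate
    # Encode: v is one-hot with v[h_j(d)] = xi_j(d); only (H_m v)[l] is needed.
    w_l = xi(j, d) * hadamard_entry(l, h(j, d))
    # Perturb: keep the true coefficient with probability keep_prob (assumed),
    # otherwise replace it with a uniform sample from {-1, +1}.
    if rng.random() < keep_prob:
        y = w_l
    else:
        y = rng.choice((-1, 1))
    return y, j, l
```

Each client sends a single value in {−1, +1} together with two indices, so the message size does not depend on the domain size.
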
4.1.2. Intermediate Shuffler of SDPJoinSketch
4.1.3. Server Side of SDPJoinSketch
| Algorithm 2 Analyzer |
| Public Parameters: |
| ⊳ Sketch Aggregation: (SKA) |
| Input: Multi-set |
| Output: Private Sketch M |
| 1: Initialize a sketch |
| 2: for do |
| 3: |
| 4: end for |
| 5: |
| 6: return M |
| ⊳ Join Size Estimation (JSE): |
| Input: Private Sketches and for user groups A and B |
| Output: Join Size Estimation |
| 1: Initialize a vector |
| 2: for do |
| 3: for do |
| 4: |
| 5: end for |
| 6: end for |
| 7: |
| 8: return |
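
The two analyzer steps of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure: the debiasing factor `k / keep_prob` matches a keep-or-randomize perturbation with an assumed keep probability, and the per-row estimates are combined with a median, as is usual for Fast-AGMS. The function names are hypothetical.

```python
from statistics import median


def hadamard_entry(row: int, col: int) -> int:
    """Entry (row, col) of the Sylvester Hadamard matrix: (-1)^popcount(row & col)."""
    return -1 if bin(row & col).count("1") % 2 else 1


def aggregate_sketch(reports, k, m, keep_prob):
    """Sketch aggregation: build a k x m sketch from shuffled (y, j, l) reports."""
    scale = k / keep_prob  # assumed debiasing factor for row sampling and perturbation
    acc = [[0.0] * m for _ in range(k)]
    for y, j, l in reports:
        acc[j][l] += scale * y
    # Undo the Hadamard encoding: multiply each accumulated row by H_m^T.
    return [
        [sum(acc[j][l] * hadamard_entry(l, c) for l in range(m)) for c in range(m)]
        for j in range(k)
    ]


def estimate_join_size(sketch_a, sketch_b):
    """Join size estimation: row-wise inner products, combined by a median (assumed)."""
    per_row = [
        sum(a * b for a, b in zip(row_a, row_b))
        for row_a, row_b in zip(sketch_a, sketch_b)
    ]
    return median(per_row)
```

Because each report touches exactly one cell, the order of the reports does not matter, which is why the analyzer can work directly on the shuffled multi-set.
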
4.2. Privacy Amplification via Shuffling
4.3. Utility Analysis
5. Improving Utility of SDPJoinSketch
5.1. Framework of SDPJoinSketch+
| Algorithm 3 SDPJoinSketch+ |
| Public Parameters: |
| ⊳ Stage 1: Find frequent join values |
| Input: Sampled subsets |
| Output: Frequent Item set |
| 1: Clients: Perturb data from with Algorithm 1 |
| 2: Server: Construct SDPJoinSketch and |
| 3: Frequent item set |
| 4: return |
| ⊳ Stage 2: Improved join size estimation |
| Input: Remaining subsets , and |
| Output: Join Size Estimation |
| 1: Clients: add indicator t for each user. |
| 2: for and do |
| 3: if , else |
| 4: |
| 5: end for |
| 6: Perturb data from and with Algorithm 1 |
| 7: Return |
| 8: Shuffler: |
| 9: Server: |
| 10: Construct sketches , with |
| 11: Construct sketches , with |
| 12: in Algorithm 2 |
| 13: in Algorithm 2 |
| 14: |
| 15: return |
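
The idea behind the two-stage design can be illustrated with exact counts: the join size splits into a part over the frequent items and a part over the remaining items, so each part can be estimated with its own pair of sketches. The Python sketch below only demonstrates this decomposition on plaintext data; it involves no perturbation, and the function name `split_join_size` is hypothetical.

```python
from collections import Counter


def split_join_size(values_a, values_b, frequent_items):
    """Return (high, low): exact join sizes over frequent and non-frequent values."""
    freq_a, freq_b = Counter(values_a), Counter(values_b)
    # A Counter returns 0 for missing keys, so values absent from B contribute nothing.
    high = sum(freq_a[d] * freq_b[d] for d in freq_a if d in frequent_items)
    low = sum(freq_a[d] * freq_b[d] for d in freq_a if d not in frequent_items)
    return high, low
```

For example, with `values_a = [1, 1, 2, 3]`, `values_b = [1, 2, 2, 4]`, and `frequent_items = {1}`, both parts equal 2 and their sum is the full join size 4. Keeping frequent values out of the low-frequency sketches is what limits the collision error they would otherwise cause.
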
| Algorithm 4 FFI: Find Frequent Items |
| Public Parameters: , frequent threshold |
| Input: SDPJoinSketch |
| Output: Frequent item set |
| 1: Initialize |
| 2: for do ▹ Estimating frequency |
| 3: for do |
| 4: |
| 5: end for |
| 6: |
| 7: end for |
| 8: ▹ Threshold for high-frequency item |
| 9: Initialize |
| 10: for do |
| 11: if then |
| 12: Add d into |
| 13: end if |
| 14: end for |
| 15: return |
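
A minimal Python sketch of the frequent-item search in Algorithm 4 is given below. It assumes that the per-row frequency estimates are combined with a median and that the threshold is the fraction `theta` of the number of users `n`; both are assumptions for illustration, as are the function name and the hash callables `h(j, d)` and `xi(j, d)`.

```python
from statistics import median


def find_frequent_items(sketch, domain, h, xi, n, theta):
    """Return the set of values whose estimated frequency exceeds theta * n."""
    k = len(sketch)
    estimates = {}
    for d in domain:
        # One estimate per row: the counter d hashes to, corrected by its sign.
        per_row = [sketch[j][h(j, d)] * xi(j, d) for j in range(k)]
        estimates[d] = median(per_row)  # combining rule assumed for this sketch
    threshold = theta * n  # assumed form of the high-frequency threshold
    return {d for d in domain if estimates[d] > threshold}
```

The loop visits every value in the domain, so the cost is proportional to the domain size times the number of sketch rows.
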
5.2. Privacy and Utility Analysis
6. Experimental Evaluation
6.1. Experiment Setup
- Datasets: We use both synthetic and real-world datasets.
- Zipf datasets. We generate several datasets of size 1,000,000 following Zipf distribution, with skewness parameters ranging from to (bigger denotes higher skewness) and the data domain fixed at 500 for simplicity. The Zipf distribution reflects the skewed frequency patterns commonly found in real-world workloads.
- Gaussian dataset. We also generate a dataset of size 1,000,000 following Gaussian distribution, with a mean of 5,000 and a standard deviation of 50.
- Twitter ego-network dataset (https://snap.stanford.edu/data/ego-Twitter.html, accessed on 26 September 2025). This real-world dataset consists of 2,420,766 items across 77,072 domains from the Twitter app.
- Facebook ego-network dataset (https://snap.stanford.edu/data/ego-Facebook.html, accessed on 26 September 2025). This real-world dataset consists of 352,936 items across 4,039 domains from the Facebook app.
- Competitors: To illustrate the effectiveness of our algorithm, we compare our SDPJoinSketch (SJS) and SDPJoinSketch+ (SJS+) with existing representative SDP, LDP, and DP methods.
- Hist_KR. Applies the basic randomized response mechanism within the ESA framework.
- SFLH. A fast variant of SLH [33], which combines privacy amplification theory with optimal local hashing.
- KRR. K-ary randomized response that perturbs the join values and computes the join size with calibrated frequency vectors.
- FLH. The heuristic fast variant of Optimal Local Hashing (OLH), where multiple counters are used to improve efficiency.
- LDPJoinSketch (LJS). A sketch-based join size estimation method under LDP.
- Laplace Mechanism (Lap). Ensures privacy by adding noise drawn from a Laplace distribution to the original data.
- Metrics: In the experiments, we use Absolute Error (AE) and Relative Error (RE) to measure data utility performance.
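- Absolute Error (AE). $\mathrm{AE} = \frac{1}{T}\sum_{t=1}^{T} |J - \hat{J}_t|$, where $J$ is the actual join size, $\hat{J}_t$ is the estimated result in round $t$, and $T$ is the number of testing rounds.
- Relative Error (RE). $\mathrm{RE} = \frac{1}{T}\sum_{t=1}^{T} \frac{|J - \hat{J}_t|}{J}$. The parameters are the same as those defined in AE.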
- Methodology: We set by default. The result for each experiment is averaged over 10 runs with different hash seeds. The threshold is set to 0.1 for Zipf datasets, 0.01 for the Gaussian dataset, and 0.001 for Twitter and Facebook datasets. The privacy budget involved in the experiments represents the central guarantee.
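
The two metrics above can be computed as in the following Python sketch. It assumes that AE is the mean absolute deviation of the estimates from the actual join size over the T testing rounds and that RE normalizes this by the actual join size; the function names are hypothetical.

```python
def absolute_error(true_join_size, estimates):
    """Mean absolute deviation of the estimates over all testing rounds."""
    return sum(abs(true_join_size - e) for e in estimates) / len(estimates)


def relative_error(true_join_size, estimates):
    """Absolute error normalized by the actual (non-zero) join size."""
    return absolute_error(true_join_size, estimates) / true_join_size
```

For an actual join size of 100 and estimates 90, 110, and 105, the absolute deviations are 10, 10, and 5, so AE is 25/3 and RE is 1/12.
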
6.2. Utility Comparison
- SDPJoinSketch has better utility than methods under LDP for join size estimation.
- The enhanced mechanism SDPJoinSketch+ further improves utility by reducing hash-collision errors and has smaller communication overhead than conventional LDP protocols.
- On highly skewed data with a small privacy budget, our proposed methods demonstrate better performance.
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Leis, V.; Radke, B.; Gubichev, A.; Mirchev, A.; Boncz, P.; Kemper, A.; Neumann, T. Query optimization through the looking glass, and what we found running the join order benchmark. VLDB J. 2018, 27, 643–668.
- Chu, S.; Balazinska, M.; Suciu, D. From theory to practice: Efficient join query evaluation in a parallel database system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, VIC, Australia, 31 May–4 June 2015; pp. 63–78.
- Wang, P.; Qi, Y.; Zhang, Y.; Zhai, Q.; Wang, C.; Lui, J.C.; Guan, X. A memory-efficient sketch method for estimating high similarities in streaming sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 25–33.
- Bessa, A.; Daliri, M.; Freire, J.; Musco, C.; Musco, C.; Santos, A.; Zhang, H. Weighted minwise hashing beats linear sketching for inner product estimation. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Seattle, WA, USA, 18–23 June 2023; pp. 169–181.
- Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Venice, Italy, 10–14 July 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12.
- Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Berkeley, CA, USA, 26–29 October 2013; pp. 429–438.
- Differential Privacy Team, Apple. Learning with Privacy at Scale. 2017. Available online: https://machinelearning.apple.com/research/learning-with-privacy-at-scale (accessed on 26 September 2025).
- Erlingsson, Ú.; Pihur, V.; Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 1054–1067.
- Fanti, G.C.; Pihur, V.; Erlingsson, Ú. Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries. Proc. Priv. Enhancing Technol. 2016, 2016, 41–61.
- Ding, B.; Kulkarni, J.; Yekhanin, S. Collecting telemetry data privately. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Wang, T.; Blocki, J.; Li, N.; Jha, S. Locally differentially private protocols for frequency estimation. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017; pp. 729–745.
- Cormode, G.; Maddock, S.; Maple, C. Frequency estimation under local differential privacy [experiments, analysis and benchmarks]. arXiv 2021, arXiv:2103.16640.
- Wang, T.; Li, N.; Jha, S. Locally differentially private heavy hitter identification. IEEE Trans. Dependable Secur. Comput. 2019, 18, 982–993.
- Qin, Z.; Yang, Y.; Yu, T.; Khalil, I.; Xiao, X.; Ren, K. Heavy hitter estimation over set-valued data with local differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16), Vienna, Austria, 24–28 October 2016; pp. 192–203.
- Zhang, M.; Liu, X.; Yin, L. Sketches-Based Join Size Estimation Under Local Differential Privacy. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 1726–1738.
- Erlingsson, Ú.; Feldman, V.; Mironov, I.; Raghunathan, A.; Talwar, K.; Thakurta, A. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, San Diego, CA, USA, 6–9 January 2019; pp. 2468–2479.
- Cheu, A.; Smith, A.; Ullman, J.; Zeber, D.; Zhilyaev, M. Distributed differential privacy via shuffling. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, 19–23 May 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 375–403.
- Ghazi, B.; Golowich, N.; Kumar, R.; Manurangsi, P.; Pagh, R.; Velingker, A. Pure differentially private summation from anonymous messages. arXiv 2020, arXiv:2002.01919.
- Ghazi, B.; Golowich, N.; Kumar, R.; Pagh, R.; Velingker, A. On the power of multiple anonymous messages: Frequency estimation and selection in the shuffle model of differential privacy. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Madrid, Spain, 4–8 May 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 463–488.
- Balle, B.; Bell, J.; Gascón, A.; Nissim, K. The privacy blanket of the shuffle model. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 638–667.
- Zhang, M.; Lin, S.; Yin, L. Local differentially private frequency estimation based on learned sketches. Inf. Sci. 2023, 649, 119667.
- Rusu, F.; Dobra, A. Sketches for size of join estimation. ACM Trans. Database Syst. (TODS) 2008, 33, 1–46.
- Alon, N.; Matias, Y.; Szegedy, M. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, Philadelphia, PA, USA, 22–24 May 1996; pp. 20–29.
- Alon, N.; Gibbons, P.B.; Matias, Y.; Szegedy, M. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Philadelphia, PA, USA, 31 May–2 June 1999; pp. 10–20.
- Cormode, G.; Garofalakis, M. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, 30 August–2 September 2005; pp. 13–24.
- Ganguly, S.; Kesh, D.; Saha, C. Practical algorithms for tracking database join sizes. In Proceedings of the International Conference on Foundations of Software Technology and Theoretical Computer Science, Hyderabad, India, 15–18 December 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 297–309.
- Ganguly, S.; Garofalakis, M.; Rastogi, R. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the International Conference on Extending Database Technology, Crete, Greece, 14–18 March 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 569–586.
- Wang, F.; Chen, Q.; Li, Y.; Yang, T.; Tu, Y.; Yu, L.; Cui, B. JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation. In Proceedings of the ACM on Management of Data, Seattle, WA, USA, 18–23 June 2023; Volume 1, pp. 1–26.
- Ion, M.; Kreuter, B.; Nergiz, A.E.; Patel, S.; Saxena, S.; Seth, K.; Raykova, M.; Shanahan, D.; Yung, M. On deploying secure computing: Private intersection-sum-with-cardinality. In Proceedings of the 2020 IEEE European Symposium on Security and Privacy (EuroS&P), Genoa, Italy, 7–11 September 2020; pp. 370–389.
- Li, Y.; Lee, X.; Peng, B.; Palpanas, T.; Xue, J. Privsketch: A private sketch-based frequency estimation protocol for data streams. In Proceedings of the International Conference on Database and Expert Systems Applications, Penang, Malaysia, 28–30 August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 147–163.
- Wang, Y.; Wang, Y.; Chen, C. DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), Barcelona, Spain, 25–29 August 2024; pp. 3255–3266.
- Bittau, A.; Erlingsson, Ú.; Maniatis, P.; Mironov, I.; Raghunathan, A.; Lie, D.; Rudominer, M.; Kode, U.; Tinnes, J.; Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28–31 October 2017; pp. 441–459.
- Wang, T.; Ding, B.; Xu, M.; Huang, Z.; Hong, C.; Zhou, J.; Li, N.; Jha, S. Improving utility and security of the shuffler-based differential privacy. arXiv 2019, arXiv:1908.11515.
- Ghazi, B.; Kumar, R.; Manurangsi, P.; Pagh, R.; Sinha, A. Differentially private aggregation in the shuffle model: Almost central accuracy in almost a single message. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 3692–3701.
- Wang, N.; Zheng, W.; Wang, Z.; Wei, Z.; Gu, Y.; Tang, P.; Yu, G. Collecting and analyzing key-value data under shuffled differential privacy. Front. Comput. Sci. 2023, 17, 172606.
- Balle, B.; Bell, J.; Gascón, A.; Nissim, K. Private summation in the multi-message shuffle model. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 9–13 November 2020; pp. 657–676.
- Luo, Q.; Wang, Y.; Yi, K. Frequency Estimation in the Shuffle Model with Almost a Single Message. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, Los Angeles, CA, USA, 7–11 November 2022; pp. 2219–2232.
- Knuth, D.E. The Art of Computer Programming; Pearson Education: London, UK, 1997; Volume 3.
- Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, 4–7 March 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284.

| Notation | Description |
|---|---|
| n | The number of users |
| | Private join attribute value of the i-th user |
| D | The domain of attribute values and |
| | Privacy budget of local and central model |
| | The failure probability of differential privacy |
| M | The representation of sketch |
| | The number of rows and columns of a sketch |
| | Hash function of the j-th row |
| | Hash function of the j-th row |
| | The shuffling procedure |
| | The local randomizer |
| r | Sample rate for SDPJoinSketch+ |
| | High-frequency threshold |
| | A random permutation operation on index i |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liu, X.; Mao, Y.; Zhang, M.; Li, M. Improving Utility of Private Join Size Estimation via Shuffling. Mathematics 2025, 13, 3468. https://doi.org/10.3390/math13213468

