Hierarchical Clustering Approach for Selecting Representative Skylines
Abstract
1. Introduction
- We propose a method for selecting representative skyline points that is built on a hierarchical agglomerative clustering structure. We define a new distance measure, the θ distance, which makes it possible to obtain a set of top and diverse representative skyline points by configuring that clustering structure.
- We demonstrate the correctness and efficiency of the proposed method through a comprehensive experimental study.
2. Related Work
2.1. Skyline Query Processing Methods
2.2. Representative Skyline Query Processing Methods
2.2.1. Dominance-Based Methods
2.2.2. Distance-Based Methods
2.2.3. Analysis
3. Proposed Method
3.1. Overview
3.2. Skylining Step
3.3. Clustering Step
Algorithm 1. Clustering step
Input:
(1) S = {s1, s2, s3, …, s|S|} /*Skyline set*/
Output:
(1) C /*Root of the dendrogram*/
Algorithm:
1. initialize C = {}
2. FOR i = 1 TO |S| DO
3.   Create new cluster ci with ci.dist = 0, ci.rp = si, ci.cr = si, ci.S = {si}
4.   C.add(ci)
5. END FOR
6. WHILE |C| > 1 DO
7.   DMatrix = Calculate distance matrix of C
8.   Sort(DMatrix) in ascending order
9.   cnew = Merge(DMatrix[0])
10.   Remove C[DMatrix[0][0]] and C[DMatrix[0][1]] from C
11.   C.add(cnew)
12. END WHILE
13. RETURN C
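The clustering step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that the pseudocode leaves open: the distance between two clusters is taken to be the Euclidean distance between their centres, `cr` is taken to be the centroid of the cluster members, and the representative point `rp` of a merged cluster is taken to be the member closest to that centroid. The names `centroid`, `merge`, and `build_dendrogram` are hypothetical helpers, and the closest pair is found by direct search instead of sorting a full distance matrix.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist as euclidean
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Cluster:
    dist: float                         # distance at which the cluster was formed (0 for singletons)
    rp: Point                           # representative point
    cr: Point                           # centre (ASSUMPTION: centroid of the members)
    S: List[Point]                      # skyline points contained in the cluster
    left: Optional["Cluster"] = None    # children in the dendrogram
    right: Optional["Cluster"] = None


def centroid(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def merge(a: Cluster, b: Cluster, d: float) -> Cluster:
    """Merge two clusters; rp is ASSUMED to be the member nearest the centroid."""
    members = a.S + b.S
    cr = centroid(members)
    rp = min(members, key=lambda p: euclidean(p, cr))
    return Cluster(d, rp, cr, members, a, b)


def build_dendrogram(skyline: List[Point]) -> Cluster:
    """Agglomerative clustering of the skyline set; returns the dendrogram root."""
    if not skyline:
        raise ValueError("skyline set must not be empty")
    # One singleton cluster per skyline point.
    clusters = [Cluster(0.0, s, s, [s]) for s in skyline]
    while len(clusters) > 1:
        # Closest pair of clusters (equivalent to taking the first entry
        # of the sorted distance matrix).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: euclidean(clusters[ij[0]].cr, clusters[ij[1]].cr),
        )
        d = euclidean(clusters[i].cr, clusters[j].cr)
        merged = merge(clusters[i], clusters[j], d)
        # Replace the two merged clusters with the new one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]
```

Recomputing all pairwise distances in every iteration keeps the sketch short but is cubic in |S| overall; a practical implementation would maintain the distances incrementally.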
3.4. Configuration Step
3.4.1. θ Distance
3.4.2. Representative Skyline Points Selection
Algorithm 2. Representative skyline point selection
Input:
(1) DE /*Dendrogram that has been created in the clustering step*/
(2) input.θ /*User given input value in the unit range [0.0, 1.0]*/
(3) flag /*User given input flag with value either top or diverse*/
Output:
(1) RepSP /*List of representative skyline points*/
Algorithm:
1. initialize RepSP = {}
2. IF flag = diverse THEN
3.   RepSP = DiverseRepresentative(DE, input.θ)
4. ELSE
5.   RepSP = TopRepresentative(DE, input.θ)
6. END IF
7. RETURN RepSP
Algorithm 3. Diverse Representative Selection
Input:
(1) DE /*Dendrogram that has been created in the clustering step*/
(2) input.θ /*User given input value in unit range [0.0, 1.0]*/
Output:
(1) DivREP /*List of diverse representative skyline points*/
Algorithm:
1. initialize DivREP = {}
2. currentCluster = DE.getRoot()
3. TreeTraversal(currentCluster, input.θ, DivREP)
4. RETURN DivREP
Procedure TreeTraversal(currentCluster, input.θ, DivREP):
1. leftCluster = currentCluster.leftChild
2. rightCluster = currentCluster.rightChild
3. IF leftCluster.θ > input.θ THEN
4.   TreeTraversal(leftCluster, input.θ, DivREP)
5. IF rightCluster.θ > input.θ THEN
6.   TreeTraversal(rightCluster, input.θ, DivREP)
7. IF leftCluster.θ ≤ input.θ THEN
8.   DivREP.add(leftCluster.rp)
9. IF rightCluster.θ ≤ input.θ THEN
10.   DivREP.add(rightCluster.rp)
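The TreeTraversal recursion can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the `Node` class is a hypothetical stand-in for a dendrogram cluster, its `theta` field is assumed to hold the cluster's θ distance already scaled to [0, 1] with singleton clusters at 0, and representative points are plain labels. The handling of a dendrogram that consists of a single point is also an assumption, since the pseudocode does not cover that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    theta: float                     # ASSUMPTION: θ distance of the cluster, in [0, 1]
    rp: str                          # representative point (a label, for illustration)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def diverse_representatives(root: Node, input_theta: float) -> List[str]:
    """Collect one representative per subtree whose θ does not exceed input_theta."""
    if root.left is None or root.right is None:
        # Single-point dendrogram: not covered by the pseudocode (assumption).
        return [root.rp]

    reps: List[str] = []

    def traverse(cluster: Node) -> None:
        for child in (cluster.left, cluster.right):
            if child.theta > input_theta and child.left is not None:
                # The child is still too spread out: descend further.
                traverse(child)
            else:
                # The child is compact enough: its representative stands for it.
                reps.append(child.rp)

    traverse(root)
    return reps
```

A smaller `input_theta` forces the traversal deeper into the dendrogram and therefore yields more representatives; a larger value stops near the root and yields fewer. The sketch visits the children in left-to-right order, so the order of the returned list may differ from the pseudocode even though the selected points are the same.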
Algorithm 4. Top Representative Selection
Input:
(1) DE /*Dendrogram that has been created in the clustering step*/
(2) input.θ /*User given input value in unit range [0.0, 1.0]*/
Output:
(1) TopREP /*List of top representative skyline points*/
Algorithm:
1. initialize TopREP = {}
2. topPoint = DE.getRoot().rp
3. topCluster = findSingleton(topPoint) /*Singleton cluster of topPoint*/
4. WHILE topCluster has a parent AND topCluster.parent.θ ≤ input.θ DO
5.   topCluster = topCluster.parent
6. END WHILE
7. TopREP = topCluster.S /*Skyline points of the cluster*/
8. RETURN TopREP
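The upward walk from the singleton cluster of the top point can be sketched in Python as follows. The sketch assumes that the intended result is the member set of the largest cluster that contains the top point and whose θ distance does not exceed input.θ, so that the top point alone is returned when no ancestor qualifies. The `Node` class with a `parent` link is a hypothetical stand-in for a dendrogram cluster, and the function receives the singleton cluster directly instead of looking it up with findSingleton.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    theta: float                      # ASSUMPTION: θ distance of the cluster, in [0, 1]
    S: List[str]                      # skyline points in the cluster (labels, for illustration)
    parent: Optional["Node"] = None   # None for the root of the dendrogram


def top_representatives(singleton: Node, input_theta: float) -> List[str]:
    """Climb from the top point's singleton while the parent's θ stays within input_theta."""
    cluster = singleton
    # Stop at the root (no parent) or when the next ancestor is too spread out.
    while cluster.parent is not None and cluster.parent.theta <= input_theta:
        cluster = cluster.parent
    return list(cluster.S)
```

Because the walk follows a single path from a leaf towards the root, it touches at most as many clusters as the dendrogram is deep, whatever the value of `input_theta`.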
3.4.3. Finding Appropriate input.θ
3.5. Theoretical Analysis
4. Performance Evaluation
4.1. Experimental Datasets and Environment
4.2. Experiment Results
4.2.1. Experiment 1. Representative Skyline Points of Anisotropic Dataset
4.2.2. Experiment 2. Comparison of Re-Computation Time with Related Work on Synthetic Dataset with Varied Dimensions
4.2.3. Experiment 3. Comparison of Re-Computation Time with Related Work on Synthetic Dataset with Varied Cardinality
4.2.4. Experiment 4. Comparison of Diverse Representation Error Rate on Anisotropic Dataset with Varied k and input.θ
4.2.5. Experiment 5. Comparison of Diverse Representation Error Rate on NBA Dataset with Varied k and input.θ
4.2.6. Experiment 6. Comparison of Top Representation Error Rate on Anisotropic Dataset with Varied k and input.θ
4.2.7. Experiment 7. Comparison of Top Representation Error Rate on NBA Dataset with Varied k and input.θ
4.2.8. Experiment 8. Comparison of Number of Cluster Accesses on Synthetic Dataset with Varied input.θ
4.2.9. Experiment 9. Comparison of Number of Cluster Accesses on Real Dataset with Varied input.θ
4.2.10. Experiment 10. Ratio of the Representative Skyline Points to |S| on Synthetic Dataset with Varied input.θ
4.2.11. Experiment 11. Ratio of the Representative Skyline Points to |S| on Real Dataset with Varied input.θ
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Method Name | Re-Computation | Diverse Representative | Top Representative |
---|---|---|---|
Max-Dom | √ | - | √ |
δ-Sky | √ | - | √ |
ε-Sky | √ | - | √ |
ADR | √ | √ | - |
Greedy | √ | √ | - |
Call-to-Order | √ | - | - |
SBRSA | √ | √ | - |
RRMS | - | √ | - |
Big Picture | √ | √ | - |
Property | Value |
---|---|
CPU | 4 x Intel(R) Core(TM) i5-4570 (3.20 GHz) |
RAM | Samsung PC3-12800 8 GB |
OS | Ubuntu 16.04.3 LTS |
HDD | Seagate ATA ST500D 500 GB (7200 RPM) |
Parameter | Value Range | Default Settings |
---|---|---|
Dataset cardinality | 200 K, 400 K, 600 K, 800 K, 1 M | 600 K |
input.θ | 0.01, 0.05, 0.1, 0.5, 1.0 | 0.1 |
Dataset dimension | 2, 3, 4 | 2 |
k | 2, 4, 6, 8, 10 | 6 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Battulga, L.; Nasridinov, A. Hierarchical Clustering Approach for Selecting Representative Skylines. Information 2019, 10, 96. https://doi.org/10.3390/info10030096