1. Introduction
In data mining and data analysis, multi-criteria analysis is a conventional approach to real-life problems ranging from recommendation systems to decision making. One of the most commonly used multi-criteria analysis methods is the skyline query, which retrieves the interesting, dominant data points from a given dataset [1]. For example, consider a hotel recommendation system that stores information about the hotels in a particular location, with attributes including price and distance (i.e., distance from the beach). In most cases, the user prefers hotels that are cheap and close to the beach. Here, the skyline query returns the hotels that best satisfy these preferences, that is, the hotels for which no other hotel is at least as good in both attributes and strictly better in one. Due to its usefulness, the skyline query has been studied extensively in many fields [2].
Formally, the skyline query can be defined as follows. Let P be a set of d-dimensional data points, and let pi.vj denote the value of the j-th attribute of data point pi (1 ≤ i ≤ |P|, 1 ≤ j ≤ d). A data point pi dominates a data point pk (i ≠ k), written pi < pk, if every attribute value of pi is less than or equal to the corresponding value of pk (pi.vj ≤ pk.vj for all j) and at least one attribute value of pi is strictly less (pi.vj < pk.vj for some j). The skyline of P consists of the points of P that are not dominated by any other point of P.
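The dominance test and the resulting skyline can be sketched in a few lines of Python. This is a minimal illustration of the definition only, not the method proposed in this paper: the function names `dominates` and `skyline`, the quadratic scan, and the hotel values (price, distance) are assumptions made for the example.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """Return True if p dominates q, assuming smaller is better in every attribute."""
    no_worse = all(a <= b for a, b in zip(p, q))        # p.vj <= q.vj for all j
    strictly_better = any(a < b for a, b in zip(p, q))  # p.vj <  q.vj for some j
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2 * d) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance from the beach).
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2), (55, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 1)]
```

A point never dominates itself, because the strict inequality fails, so the scan does not need to skip the point being tested.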
Figure 1 illustrates an example of finding skyline points in two-dimensional space. Figure 1a visualizes the input dataset, which contains 16 data points, and Figure 1b shows the skyline points obtained from Figure 1a. Assume that Figure 1 depicts the hotels mentioned earlier, where A1 indicates the distance from the beach and A0 indicates the hotel price. In Figure 1b, the highlighted points represent the hotels that offer the best trade-offs between a low price and a short distance to the beach.
Generally, for a low-dimensional dataset, the skyline query returns a small number of skyline points. However, as the dimensionality of the dataset increases, the number of skyline points also increases. Depending on the data distribution and dimensionality, most of the data points may become skyline points. Koizumi et al. [3] introduced a skyline query processing technique called BJR-Tree and conducted extensive experiments that report the ratio of skyline points for various dimensionalities and three data distributions: uniform, correlated, and anti-correlated. For the uniform and correlated datasets, their results show that more data points become skyline points in the fifteen- to twenty-five-dimensional datasets. The situation is worse for the anti-correlated dataset, where more data points become skyline points starting from eight dimensions. This study exposes a weakness of skyline query processing on high-dimensional datasets: the result can become too large to be useful. With the emergence of big data applications, where data distribution and dimensionality are significant problems, it becomes necessary to obtain representative skyline points from among the resulting skyline points.
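The growth of the skyline with dimensionality is easy to observe on synthetic data. The sketch below is an illustration only and does not reproduce the experiments of Koizumi et al. [3]: it assumes independent, uniformly distributed attributes, a small hypothetical sample of 300 points, a fixed random seed, and a naive quadratic skyline computation. The helper name `skyline_ratio` is introduced here for the example.

```python
import random


def dominates(p, q):
    """p dominates q if p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_ratio(n, d, seed=0):
    """Fraction of n uniform random d-dimensional points that are skyline points."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
    sky = [p for p in points if not any(dominates(q, p) for q in points)]
    return len(sky) / n


# Assumed settings for illustration: 300 points, dimensionalities 2, 4, and 8.
ratios = {d: skyline_ratio(300, d) for d in (2, 4, 8)}
for d, ratio in ratios.items():
    print(f"d={d}: {ratio:.2%} of the points are skyline points")
```

With more attributes, it becomes less likely that one point is at least as good as another in every attribute, so fewer points are dominated and the ratio rises.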
Several methods have focused on extracting representative skyline points, with varying success [4, 5, 6]. These methods rely on a specific global threshold value. However, without prior knowledge of the given dataset, the user must guess or iteratively increase this threshold, which leads to the problem of re-computation. In other words, once the global threshold value for the representative skyline points changes, most existing methods restart the whole process from scratch. As a result, the overall processing time required for finding representative skyline points increases.
This problem can become significant in real-world applications as well. For example, consider a hotel recommendation system that recommends k representative hotels, where k is given by the user. Suppose that the system stores the hotels with a low price and a short distance to the beach, as shown in Figure 1b, and that user1, user2, and user3 connect to the system concurrently and request k = 3, k = 1, and k = 5 hotels, respectively. The system first computes three representative hotels (k = 3) and returns the result to user1. Next, it must begin a new process and re-compute a single representative hotel for user2. Finally, it re-computes once again to show five representative hotels to user3. The overall computational time of these methods therefore grows with the number of requests. Moreover, certain existing representative skyline query processing methods [4, 5] find only an approximate result when d ≥ 3.
In this paper, we propose a new representative skyline query processing method, which solves the problems of existing methods. More specifically, our contributions are as follows:
We propose a method for selecting representative skyline points that uses a hierarchical agglomerative clustering structure. We define a new distance measure, called the θ distance, that enables us to obtain a set of top and diverse representative skyline points by configuring the hierarchical agglomerative clustering structure.
We also show the correctness and efficiency of our proposed method through a comprehensive experimental study.
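To see why a hierarchical structure helps with the re-computation problem described above, consider the following sketch. It is not the proposed method: the θ distance is not defined in this section, so plain Euclidean distance stands in for it, single linkage is an assumed merging rule, and the names `build_merges` and `cut` as well as the sample points are hypothetical. The sketch only shows that a merge history built once can answer requests for different values of k without clustering again.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def build_merges(points: List[Point]) -> List[Tuple[int, int]]:
    """Single-linkage agglomerative clustering (Euclidean stand-in distance).

    Returns the merge history as (kept_cluster_id, absorbed_cluster_id) pairs.
    """
    clusters: Dict[int, List[int]] = {i: [i] for i in range(len(points))}
    merges: List[Tuple[int, int]] = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Single linkage: distance between the closest pair of members.
                dist = min(math.dist(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
        merges.append((a, b))
    return merges


def cut(n: int, merges: List[Tuple[int, int]], k: int) -> List[List[int]]:
    """Replay the first n - k merges to obtain k clusters; no distances are recomputed."""
    clusters: Dict[int, List[int]] = {i: [i] for i in range(n)}
    for a, b in merges[: n - k]:
        clusters[a].extend(clusters.pop(b))
    return list(clusters.values())


# Hypothetical skyline points; the hierarchy is built once and reused.
sky = [(1, 9), (2, 8), (4, 5), (5, 4), (8, 2), (9, 1)]
merges = build_merges(sky)
for k in (3, 1, 5):  # the requests of user1, user2, and user3
    groups = cut(len(sky), merges, k)
    # Assumption: take the lowest-index member of each cluster as its representative.
    print(k, [sky[min(group)] for group in groups])
```

Building the merge history is the expensive step; each later request only replays a prefix of it, so its cost does not depend on how many different values of k were served before.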
The rest of the paper is organized as follows. In Section 2, we first review skyline query processing methods and then discuss previous work on representative skyline query processing. In Section 3, we describe the proposed method for selecting representative skyline points, which utilizes hierarchical agglomerative clustering. In Section 4, we present the results of experiments on both synthetic and real datasets. Finally, in Section 5, we conclude the paper and discuss future work.