1. Introduction
In traditional data warehouses, sales data are usually loaded in offline mode on, e.g., a weekly or daily basis [1,2]. In that scenario, a data warehouse is not updated in real time and returns answers to queries based on stale data. Businesses, however, are growing rapidly and want to respond to their customers using the latest data. Therefore, data warehouses are migrating from periodic data uploading to near-real-time data uploading.
In near-real-time data warehousing, a stream of sales records S needs to be joined with a disk-based relation R in the staging area during the transformation phase of ETL (extraction, transformation, loading). In the transformation operation, source-level changes are mapped into the data warehouse format. Common examples of transformations are unit conversion, removal of duplicate tuples, information enrichment, filtering of unnecessary data, sorting of tuples, and translation of source data keys. This transformation operation requires a semi-stream join. Since the join is between stream S and relation R, it is also called a stream-relation join. Other examples of this type of join are network traffic monitoring [3,4], sensor data [5], web log analysis [6,7], online auctions [8], inventory and supply-chain analysis [9,10], and real-time data integration [11].
In near-real-time data warehousing, the join is typically performed between the primary (unique) key of the R input and a foreign key of the S input. For example, Figure 1 illustrates an example of information enrichment. In the figure, the attributes product-ID, date, and quantity are extracted from S. In the transformation operation, in addition to key replacement (from the source key product-ID to the warehouse key W_key), some information is added, namely the sales price, denoted by price, used to calculate the total amount, and the vendor information, denoted by vendor. In the figure, the information with attribute names W_key, price, and vendor is extracted at runtime from R and enriched into S using a join operator.
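As a concrete illustration of the enrichment step described above, the following Python sketch joins one stream tuple with a master-data row. The attribute names follow the text; the tuple values and the dictionary layout of R are illustrative assumptions, not data from the paper.

```python
# Toy master data R, keyed by the source key product_ID.
# All values here are invented for illustration only.
R = {
    "P1": {"W_key": 101, "price": 9.5, "vendor": "Acme"},
    "P2": {"W_key": 102, "price": 4.0, "vendor": "Beta"},
}

def enrich(stream_tuple, master):
    """Join one stream tuple with R: replace the source key with the
    warehouse key and enrich it with price and vendor information."""
    row = master[stream_tuple["product_ID"]]
    return {
        "W_key": row["W_key"],                              # key replacement
        "date": stream_tuple["date"],
        "quantity": stream_tuple["quantity"],
        "total": stream_tuple["quantity"] * row["price"],   # uses price
        "vendor": row["vendor"],                            # enrichment
    }

result = enrich({"product_ID": "P1", "date": "2024-01-01", "quantity": 3}, R)
```

The lookup into R stands in for the join operator of Figure 1; in a real stream-relation join, R resides on disk and the lookup dominates the cost.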
The main problem with the stream-relation join is that the stream input S is fast and bursty in nature, whereas the disk input R is slow due to high I/O cost. Many algorithms have been proposed to implement the join operation between S and R. CACHEJOIN (cache join) is one of them and is considered an optimal algorithm for skewed (non-uniform) data. The algorithm consists of two phases, the disk-probing phase and the stream-probing phase. However, these two phases execute sequentially. In this case, the stream input has to wait unnecessarily, and therefore the algorithm cannot exploit CPU resources optimally. This ultimately makes the performance of the algorithm suboptimal. This is further explained in Section 2.
In this paper we present a robust algorithm called the parallelised cache-based stream relation join (PCSRJ), which enables the execution of both phases of the existing CACHEJOIN algorithm in parallel. This means the disk-probing phase does not wait for the stream-probing phase to finish, and vice versa. In this way, the new algorithm exploits CPU resources optimally, which ultimately improves its performance. Further details about the new algorithm are provided in Section 3.
In summary, this paper makes the following contributions:
A well-known stream-relation join algorithm, CACHEJOIN, is examined with respect to its limitations.
A robust algorithm, PCSRJ, is introduced to improve on the performance of CACHEJOIN.
We propose a cost model for our algorithm and validate it.
We present an experimental comparison of PCSRJ with the existing CACHEJOIN.
The rest of the paper is organised as follows.
Section 2 presents a review of existing approaches.
Section 3 presents the new algorithm, including its architecture, algorithm, and cost model. Rigorous experimentation on our algorithm is presented in
Section 4. Finally,
Section 5 concludes the paper.
2. Related Work
In near-real-time (online) data warehousing, an efficient stream-relation join is required that joins incoming stream tuples with R effectively using limited resources. In this section we present an overview of various stream-relation joins.
A seminal algorithm, MESHJOIN (Mesh Join) [11], was designed to join R with S. In MESHJOIN, R is divided into partitions. Additionally, S is divided into chunks, where the size of each chunk is w. In each iteration, the algorithm loads w tuples of S and one partition of R into memory. Every in-memory tuple of S is then joined with the loaded partition of R. The algorithm ensures that each tuple from S is joined with the whole of R before it is removed from the join window. Once the whole of R has been read, the algorithm restarts its reading. MESHJOIN's authors report that the algorithm does not perform well with skewed data (frequent tuples). A further limitation of MESHJOIN is that when the number of tuples in R changes, the size of the disk buffer also changes, which makes the memory distribution suboptimal. Therefore, in MESHJOIN, the service rate is inversely proportional to the size of R.
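MESHJOIN's cyclic access pattern can be sketched as follows, under the simplifying assumptions that R fits in k in-memory partitions and that tuples are plain dictionaries with a "key" field. This is a toy model of the access pattern, not the original implementation.

```python
from collections import deque

def meshjoin(stream_chunks, r_partitions):
    """Toy MESHJOIN: each iteration reads one chunk of S and one
    partition of R (cyclically); a chunk stays in the join window for
    k iterations, so it meets every partition of R exactly once."""
    k = len(r_partitions)
    window = deque()          # join window of up to k stream chunks
    output = []
    for i, chunk in enumerate(stream_chunks):
        window.append(chunk)
        if len(window) > k:
            window.popleft()  # oldest chunk has now seen all of R
        lookup = {r["key"]: r for r in r_partitions[i % k]}  # cyclic read
        for in_memory_chunk in window:
            for s in in_memory_chunk:
                if s["key"] in lookup:
                    output.append((s["key"], lookup[s["key"]]["val"]))
    return output
```

The sketch makes the limitation visible: the join window must span k iterations regardless of how frequent a key is, so skewed keys gain nothing.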
Reduced Mesh Join (R-MESHJOIN) [12] is an extension of MESHJOIN that removes some of its complexities. In R-MESHJOIN, performance is improved by allocating optimal memory to the disk buffer and the hash table. The disk buffer is split into a number of logical segments l, where l is greater than one; if l is equal to one, R-MESHJOIN is equivalent to MESHJOIN. Although the service rate of R-MESHJOIN improves due to the removal of unnecessary constraints, the algorithm does not consider a common characteristic of S, namely a skewed distribution.
The partition-based join [13] is an improved version of MESHJOIN. The algorithm identifies frequently repeated stream tuples and joins them using a small memory buffer; this is the main distinction between the partition-based join and MESHJOIN. The algorithm maintains a wait-buffer that stores every stream tuple. A disk probe is invoked when the wait-buffer is full or when the number of stream tuples related to one partition exceeds a threshold. The algorithm ensures that a stream tuple is joined using only a single read from R, which minimises the frequency of disk access. However, in the partition-based join there is no bound on the time a stream tuple may wait to be joined; for infrequent stream tuples, the waiting time is greater than in MESHJOIN.
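The trigger logic just described can be sketched as follows. The class name and parameters are illustrative assumptions, and a "disk probe" is abstracted as returning the waiting tuples that a single partition read would serve.

```python
from collections import defaultdict

class WaitBuffer:
    """Missed stream tuples accumulate here, grouped by the R-partition
    they need; a disk probe fires when the buffer is full or one
    partition's group exceeds the threshold."""
    def __init__(self, capacity, threshold):
        self.capacity = capacity
        self.threshold = threshold
        self.groups = defaultdict(list)   # partition id -> waiting tuples
        self.size = 0

    def add(self, partition_id, tuple_):
        self.groups[partition_id].append(tuple_)
        self.size += 1
        if self.size >= self.capacity or len(self.groups[partition_id]) > self.threshold:
            return self.flush(partition_id)   # trigger a disk probe
        return []

    def flush(self, partition_id):
        """One disk read of the partition serves all its waiting tuples."""
        ready = self.groups.pop(partition_id, [])
        self.size -= len(ready)
        return ready
```

The sketch also shows the drawback noted in the text: a tuple whose partition group never reaches the threshold can wait indefinitely while the buffer stays below capacity.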
The semi-stream index join (SSIJ) [14] is another improved version of MESHJOIN. The SSIJ algorithm has three phases: pending, online, and join. In the pending phase, the algorithm fills the input buffer with incoming stream tuples to form a batch. When the input buffer is full, the online phase starts, in which incoming tuples are ordered based on a frequently used index. Output for a tuple is generated immediately if the tuple matches the cache. The remaining tuples, which do not match the cached pages of R, wait for the last phase, the join phase. When the online phase is complete, the join phase starts and matches the remaining tuples with the tuples of R on disk. When the join phase is complete, the algorithm starts the pending phase again. The authors of SSIJ report a worst case in which no stream tuple matches the cache in the online phase.
HYBRIDJOIN (Hybrid Join) [15] is a combination of the index nested loop join (INLJ) [16] and MESHJOIN. HYBRIDJOIN minimises the waiting and processing time of stream tuples by using an index on R, and it deals with the stream effectively. However, the algorithm performs suboptimally for non-uniform distributions, which means it is not robust if some tuples from R appear frequently in S.
X-HYBRIDJOIN [17] is a further extension of HYBRIDJOIN that uses indexing on R and caches the most frequently accessed portion of R. In X-HYBRIDJOIN, the disk buffer is split into two parts. The first is a non-swappable part that holds frequent pages of R; the second is a swappable part that loads the remaining partitions of R one by one. Although X-HYBRIDJOIN outperforms HYBRIDJOIN, some tuples in the non-swappable part can be infrequent. This does not fully exploit the cache and leaves room for improving performance.
Optimised X-HYBRIDJOIN [18] is an updated version of X-HYBRIDJOIN. It resolves the problem of unnecessary processing cost by introducing two phases that can work independently. However, the issue of cache suboptimality remains.
In [19], the authors define a new operator for the stream-relation join that uses two mechanisms, out-of-order processing and batch processing, to increase the service rate. The algorithm splits the fast-arriving stream tuples into miss and hit tuples using two separate threads (batch and out-of-order). A cache is maintained whose contents are managed by a least recently used (LRU) policy. Fast-arriving stream tuples first join with the cache: if a match is found, the tuple is considered a hit tuple and is processed immediately; tuples that do not match are known as missed tuples. Missed tuples are not processed individually but in batches: they are added to a waiting queue, and when the waiting queue is full, the key values of all missed tuples are queried against R. The authors report that when the cache is less useful, the memory allocation between the waiting queue and the cache becomes suboptimal.
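The hit/miss split with an LRU cache and batched misses can be sketched as follows. The class and function names are illustrative, the disk is abstracted as a dictionary, and both "threads" are modelled as one sequential loop for clarity.

```python
from collections import OrderedDict

class LRUCache:
    """Cache over R maintained with a least-recently-used policy."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)      # mark as recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

def process(stream, cache, disk_lookup, batch_size):
    """Hit tuples join immediately; missed tuples queue up and are sent
    to R in one batched lookup when the queue fills."""
    output, waiting = [], []
    for s in stream:
        row = cache.get(s)
        if row is not None:
            output.append((s, row))         # hit: immediate output
        else:
            waiting.append(s)               # miss: wait for a batch
            if len(waiting) >= batch_size:
                for key in waiting:         # one batched query to R
                    r = disk_lookup[key]
                    output.append((key, r))
                    cache.put(key, r)
                waiting.clear()
    return output, waiting
```

When few stream keys repeat, the cache rarely hits and the memory it occupies would serve better as waiting-queue space, which is the suboptimality the authors report.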
The semi-stream balanced join (SSBJ) [20] deals efficiently with many-to-many equijoins. In SSBJ, a cache inequality is used to minimise memory consumption and increase the service rate. SSBJ has two phases: a cache phase and a disk phase. Each stream tuple first passes through the cache phase; if a match is found, output is generated, and otherwise the tuple is shifted to the disk phase. In SSBJ the cache size is variable and is adjusted based on each join value, which helps to increase the service rate of the algorithm.
CACHEJOIN (Cache Join) [21] introduces a cache module that stores frequent tuples of R and joins them with S. CACHEJOIN has two phases: the stream-probing phase and the disk-probing phase. The stream-probing phase deals with only a limited portion of R; the larger part of R is dealt with by the disk-probing phase. Every tuple arriving from the stream first passes through the stream-probing phase to find a quick match. If a match is found, the tuple is sent to the output; otherwise the tuple goes to the disk-probing phase. After the disk-probing phase, the algorithm switches back to the stream-probing phase. However, due to the sequential execution of the two phases, stream tuples wait unnecessarily to complete the join, which decreases the performance of the algorithm. The performance could be improved if the two phases executed in parallel, which has not been considered in CACHEJOIN.
3. Parallelisation of CACHEJOIN
In this section, we present a robust algorithm called the parallelised cache-based stream relation join (PCSRJ). The algorithm distributes R across two computer nodes using a hash partitioning strategy. The hash function is based on the even and odd values of a join attribute, e.g., productid in R. Additionally, an index is applied to R on the same join attribute. At each node the algorithm implements both phases of the CACHEJOIN algorithm. A mapper directs each incoming stream tuple to the relevant node based on the even or odd value of its join attribute; the mapper implements the same hash function that is used for partitioning R. The following subsections present the architecture, algorithm, and cost model of PCSRJ.
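The even/odd partitioning scheme can be sketched in a few lines. The function names are illustrative; only the parity-based hash function itself comes from the text.

```python
def node_for(join_value):
    """Even/odd hash: even join attribute values go to node 0, odd to node 1."""
    return join_value % 2

def partition_relation(r_rows, key):
    """Hash-partition R across the two nodes with the same function the
    mapper applies to incoming stream tuples."""
    partitions = {0: [], 1: []}
    for row in r_rows:
        partitions[node_for(row[key])].append(row)
    return partitions
```

Because the mapper and the partitioner share the same hash function, a stream tuple is always routed to the node that holds its matching partition of R, so no cross-node lookup is ever needed.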
3.1. Architecture
The major components of PCSRJ at each node are: a cache H_R for storing frequent tuples of R, a queue Q that stores join attribute values, a hash table H_S that stores the stream tuples that are not matched in the stream-probing phase, and a disk buffer that stores a partition of R. Three other, minor components are the mapper, the stream buffer, and the frequency recorder. The mapper is used to distribute the incoming stream tuples to each node, and the stream buffer is used to hold stream tuples for a short interval while the tuples in memory are processed. The frequency recorder is used to find the tuples of R that are frequent in S and is responsible for loading them into H_R. The memory taken up by these three components is very small, so we do not include it in our memory cost calculation. However, to show their working we include these components in the architecture. The execution architecture of PCSRJ is shown in Figure 2.
At each node, PCSRJ consists of a stream-probing phase (also called the cache phase) and a disk-probing phase. Every new stream tuple arriving at a node from the mapper is first directed to the stream-probing phase. If the tuple matches a corresponding tuple in the stream-probing phase, output is generated; otherwise the tuple is directed to the disk-probing phase, where it is stored in H_S and its join attribute value is stored in Q. The algorithm then loads a partition of R into the disk buffer, using the oldest join attribute value from Q as an index. After loading the partition of R, the algorithm looks up each tuple from the disk buffer in H_S. If a matching tuple is found in H_S, the algorithm sends that tuple to the output and removes it from H_S, along with its join attribute value from Q. Once the algorithm has matched all the tuples from the disk buffer, it switches back to the stream-probing phase. Our experiments showed that, by executing the join in parallel at two different nodes, the number of tuples joined per second increased significantly compared to the existing CACHEJOIN.
3.2. Algorithm
The pseudo-code for PCSRJ is presented in Algorithm 1. Since identical threads of the PCSRJ algorithm run at both nodes, we present the pseudo-code for one node only. Line 1 in the algorithm represents the infinite execution of the algorithm; this is the norm in such stream-based algorithms. For each incoming stream tuple, the algorithm maps it to the relevant node by applying the hash function to the join attribute value (line 2).
Lines 3 to 10 specify the stream-probing phase for either of the two nodes. In line 3, the algorithm reads w stream tuples from the stream buffer and matches each stream tuple t from w against the cache H_R one by one. If a matching tuple is found in H_R, the algorithm generates tuple t as output (lines 4 to 6). If no matching tuple is found, the algorithm loads t into H_S and enqueues its join attribute value in Q (lines 7 to 10). In the first iteration, since the hash table H_S is empty, the value of w is equal to the size of H_S in terms of tuples. From the second iteration onward, the value of w depends on the number of slots of H_S that were emptied in the previous iteration.
Lines 11 to 21 specify the disk-probing phase, where the algorithm processes the stream tuples that were not matched in the stream-probing phase. In line 11, the algorithm loads b tuples from R into the disk buffer. The algorithm then reads each tuple r from the disk buffer one by one and looks for a matching tuple in H_S. If a matching tuple is found, the algorithm generates output for tuple r (lines 12 to 14). Since the join type is one-to-many, there can be multiple matches against r; the total number of matches is stored in f (line 15). Lines 16 to 18 determine whether tuple r is frequent by comparing the value of f with a preset frequency threshold. If f is not less than the frequency threshold, the algorithm loads tuple r into H_R; if H_R is full, the algorithm overwrites a least frequent tuple in H_R with r. Finally, the algorithm removes the tuples matched against r from H_S, along with their join attribute values from Q (line 19). The algorithm also enables tuning of the threshold value based on the following principle: if tuples are removed from the cache frequently, the value of the threshold needs to be increased; similarly, if the cache does not fill even after a complete pass over R, the threshold is too high and should be lowered slightly.
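The tuning principle just stated can be sketched as a small adjustment rule. The step size, eviction limit, and fill-ratio signal are illustrative assumptions; the text only prescribes the direction of each adjustment.

```python
def tune_threshold(threshold, evictions_per_round, cache_fill_ratio,
                   max_evictions=10, step=1):
    """Adjust the frequency threshold after each pass over R.
    Raise it when cache evictions are frequent (too many tuples qualify
    as frequent); lower it slightly when the cache stays under-full."""
    if evictions_per_round > max_evictions:
        return threshold + step               # too permissive: raise
    if cache_fill_ratio < 1.0:
        return max(1, threshold - step)       # cache under-used: lower
    return threshold
```

The rule converges towards a threshold at which the cache stays full of genuinely frequent tuples without constant churn.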
Algorithm 1 PCSRJ algorithm for one node.
Input: Stream S and a master data relation R with an index on the join attribute.
Output: Joined tuples of S and R.
Parameters: w (where w = w_S + w_D) tuples of S and b tuples of R.
1: while true do
2:   MAP incoming stream tuples to the relevant node based on the even/odd hash function on the join attribute value
3:   READ w stream tuples from the stream buffer
4:   for each tuple t in w do
5:     if t matches in H_R then
6:       OUTPUT t
7:     else
8:       LOAD stream tuple t into H_S and its join attribute value into Q
9:     end if
10:  end for
11:  LOAD b tuples of R into the disk buffer
12:  for each tuple r in the disk buffer do
13:    if r matches in H_S then
14:      OUTPUT r
15:      f ← total number of tuples matched in H_S against r
16:      if f ≥ thresholdValue then
17:        MOVE tuple r into H_R
18:      end if
19:      REMOVE the tuples matched against r from H_S along with their join attribute values from Q
20:    end if
21:  end for
22: end while
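To make the pseudo-code concrete, the following is an executable Python rendering of one iteration of Algorithm 1 at a single node, under simplifying assumptions: R is an in-memory dict indexed by the join attribute, a "disk read" returns b consecutive keys of R starting at the oldest queued value, and every stream key exists in R. Here H_R denotes the cache, H_S the hash table of unmatched stream tuples, and Q the queue.

```python
def pcsrj_node_iteration(stream_tuples, H_R, H_S, Q, R, threshold, b=2):
    """One iteration of Algorithm 1 at a single node (simplified)."""
    output = []
    # Stream-probing phase (lines 3-10): quick match against the cache.
    for t in stream_tuples:
        if t in H_R:
            output.append((t, H_R[t]))
        else:
            H_S[t] = H_S.get(t, 0) + 1   # count duplicates (one-to-many)
            Q.append(t)
    # Disk-probing phase (lines 11-21): use the oldest queued join
    # attribute value as an index and load b consecutive keys of R.
    if Q:
        keys = sorted(R)
        start = keys.index(Q[0])          # assumes every stream key is in R
        for r_key in keys[start:start + b]:
            if r_key in H_S:
                f = H_S.pop(r_key)        # matches against r (line 15)
                output.extend([(r_key, R[r_key])] * f)
                if f >= threshold:        # promote frequent tuple (16-18)
                    H_R[r_key] = R[r_key]
                while r_key in Q:         # line 19
                    Q.remove(r_key)
    return output
```

A second call with the same key then hits the cache in the stream-probing phase, which is the behaviour the frequency threshold is designed to produce.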
3.3. Cost Model
This section presents the cost model for PCSRJ. The cost model is similar in nature to the CACHEJOIN [21] cost model, which comprises both memory and processing costs. Equation (1) shows the memory cost for PCSRJ. As mentioned above, the stream buffer, mapper, and frequency recorder components are tiny (0.05 MB was enough for all our experiments), and therefore we do not include them in the memory cost calculation. Equation (2) gives the processing cost for PCSRJ. The symbols used in both cost calculations are presented in Table 1.
3.3.1. Memory Cost
The major components to which memory is assigned are the two hash tables H_R and H_S, the disk buffer, and the queue Q. We count the memory for each component twice, as the PCSRJ algorithm runs on two computer nodes. To keep it simple, we first calculate the memory for each component separately and then sum the individual requirements to obtain the total. The memory taken up by each component at both nodes is calculated as follows:
Memory for the disk buffer in bytes =
Memory for H_R in bytes =
Memory for H_S in bytes =
Memory for Q in bytes =
By combining the above, the total memory for PCSRJ can be calculated as shown in Equation (1).
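Since the per-component expressions are given symbolically in Equation (1), the following sketch only illustrates the structure stated in the text: per-component sizes summed and then doubled for the two nodes. The tuple-size parameters (v_R, v_S, v_key) and the counts are assumptions for illustration, not the paper's exact symbols.

```python
def total_memory_bytes(b, v_R, h_R, h_S, v_S, v_key):
    """Structure of Equation (1): sum the component memories, doubled
    because identical structures exist on both nodes."""
    disk_buffer = b * v_R    # b tuples of R in the disk buffer
    cache = h_R * v_R        # h_R frequent tuples of R in the cache H_R
    hash_table = h_S * v_S   # h_S unmatched stream tuples in H_S
    queue = h_S * v_key      # one join attribute value per H_S entry
    return 2 * (disk_buffer + cache + hash_table + queue)
```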
3.3.2. Processing Cost
In this section we compute the processing cost for the PCSRJ algorithm. As for the memory, we count the processing cost of each component twice, due to running the two algorithm threads in parallel on two nodes. Again, we first compute the cost of each component separately and then add these costs to compute the total cost for one loop iteration.
Cost of reading b tuples from R into the disk buffer =
Cost of matching w_S tuples in H_R =
Cost of matching all tuples in the disk buffer against H_S =
Cost of comparing the frequency of all tuples in the disk buffer with the threshold value =
Cost of producing output for w_S tuples =
Cost of producing output for w_D tuples =
Cost of reading w_S tuples from the stream buffer =
Cost of reading w_D tuples from the stream buffer =
Cost of appending w_D tuples into H_S and Q =
Cost of deleting w_D tuples from H_S and Q =
The total processing cost for one loop iteration can be computed by combining the above individual processing costs, as shown in Equation (2).
In the above equation, the factor 10^-9 is used to convert nanoseconds into seconds. Since, in each iteration, the algorithm processes w_S and w_D tuples of S at one node, the total number of tuples processed in one second (also known as the service rate) can be computed using Equation (3).
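The shape of the service-rate calculation can be sketched as follows; the symbol names w_S (tuples matched in the stream-probing phase) and w_D (tuples processed in the disk-probing phase) and the per-node formulation are assumptions consistent with the description above, not the paper's exact Equation (3).

```python
def service_rate(w_S, w_D, c_loop_ns):
    """Per-node service rate: tuples processed per second, given the
    cost of one loop iteration measured in nanoseconds."""
    c_loop_seconds = c_loop_ns / 1e9   # the 10^-9 conversion from the text
    return (w_S + w_D) / c_loop_seconds
```

For example, halving the iteration cost doubles the service rate for the same w_S and w_D, which is why parallelising the two phases raises throughput.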