Generalized Sketches for Streaming Sets

Abstract: Many real-world datasets are given as a stream of user–interest pairs, where a user–interest pair represents a link from a user (e.g., a network host) to an interest (e.g., a website) and may appear more than once in the stream. Monitoring and mining statistics, including the cardinality, intersection cardinality, and Jaccard similarity of users' interest sets on high-speed streams, is widely employed by applications such as network anomaly detection. Although estimating set cardinality, set intersection cardinality, and set Jaccard similarity individually is well studied, there is no effective method that provides a one-shot solution for estimating all three statistics. To address this challenge, we develop a novel framework, SimCar. SimCar builds an order-hashing (OH) sketch online for each user occurring in the data stream of interest. At any time of interest, one can query the cardinalities, intersection cardinalities, and Jaccard similarities of users' interest sets. Specifically, using OH sketches, we develop maximum likelihood estimation (MLE) methods to estimate the cardinalities and intersection cardinalities of users' interest sets. In addition, we use OH sketches to estimate the Jaccard similarities of users' interest sets and build locality-sensitive hashing tables to search for users with similar interests in sub-linear time. We evaluate the performance of our methods on real-world datasets. The experimental results demonstrate the superiority of our methods.


Introduction
Many real-world networks (e.g., computer networks, phone networks, and financial networks) generate data (e.g., packets, calling records, and transactions) in a stream fashion, where an element (e.g., a packet) records a link from a user (e.g., a source IP address) to an interest (e.g., a destination IP address). The data can be modeled as a data stream of user-interest pairs, including duplicates, because multiple data records (e.g., packets from a source IP address to a destination IP address) may relate to the same user-interest pair. Due to the large size and high-speed nature of these data streams, it is prohibitive to collect all the data, especially when the computation and memory resources of data-collection devices (e.g., network routers) are limited. To solve this challenge, considerable attention has been paid to designing fast and memory-efficient data streaming algorithms for monitoring and mining user behaviors. For example, the Count-Min sketch [1] and its variants have been successfully used to detect heavy hitters (e.g., user-interest pairs that occur frequently).
In addition to statistics using the frequency of duplicates, significant effort has also been made to mine users' interest sets, such as:
• User cardinality. A user's cardinality is defined as the number of distinct interests that the user links to, i.e., the cardinality of the user's interest set. Monitoring user cardinalities in computer networks is fundamental for applications such as network anomaly detection [2]. One can use a variety of sketch methods, such as LPC [3], LogLog [4], HyperLogLog [5], MinCount [6], RoughEstimator [7], and their variants [2,8-13], to generate a compact data summary (or sketch) for each user's interest set, and then infer user cardinalities from the generated sketches.
• Common-interest count. Two users' common-interest count is defined as the number of interests that they both link to, i.e., the cardinality of the intersection of the two users' interest sets. It is popularly used for applications such as friend recommendation [14] and network anomaly detection [15]. One can use the sketch and sampling methods in [13-15] to estimate users' common-interest counts.
• User Jaccard similarity. The Jaccard similarity is a popular similarity measure. Two users' similarity can be evaluated as the Jaccard similarity of their interest sets, which is defined as their common-interest count divided by the number of distinct interests that at least one of them links to. One can use sketch methods such as MinHash [16], OPH [17], and their variants [18-22] to estimate users' Jaccard similarities. Some of these sketch methods, such as MinHash, can be used for locality-sensitive hashing (LSH) indexing [16,23-26], which is capable of searching for a query user's similar users in sub-linear time.
The above three duplicate-irrelevant statistics may all be desired for applications such as network anomaly detection. For example, given a detected abnormal user (e.g., the IP address of an attacker captured by Honeynet systems), network administrators may want to quickly search for similar users among all network users, which may also be abnormal users, and then estimate their cardinalities, Jaccard similarities, and common-interest counts for further inspection. Although many cardinality estimation methods can be easily extended to estimate the above three statistics because they generate mergeable sketches (mergeable cardinality estimation sketches of two sets S_1 and S_2 can be merged into a sketch of S_1 ∪ S_2, which is used to estimate the cardinality of S_1 ∪ S_2), Cohen et al. [15] observed that these extensions exhibit large errors when estimating common-interest counts and Jaccard similarities. Even worse, the sketches generated by these cardinality methods cannot be used for LSH indexing, which results in a high computational cost for searching for users with similar interests. One can combine multiple existing methods to generate different sketches for estimating user cardinalities, common-interest counts, and Jaccard similarities, respectively. This straightforward approach increases the required memory and computational resources, which limits its usage for high-speed data streams.
To address the above challenges, we develop a novel framework, SimCar, which is shown in Figure 1. The framework consists of two modules: (1) the online processing module, which uses Giroire's algorithm [6] to build an order-hashing (OH) sketch of the interest set of each user occurring in the stream; (2) the online query module, which provides functions for estimating the cardinalities, intersection cardinalities, and Jaccard similarities of users' interest sets at any time of interest. Specifically, using OH sketches, we develop maximum likelihood estimation (MLE) methods for estimating the cardinalities and intersection cardinalities of users' interest sets. In addition, we use the optimal densification technique [22] to generate densified OH (DOH) sketches, and then use DOH sketches to estimate the Jaccard similarities of users' interest sets and build locality-sensitive hashing tables to search for users with similar interests in sub-linear time. Our main contributions are summarized as follows:
• For sets given in a stream fashion, our method, SimCar, builds only one type of sketch (i.e., the OH sketch) to effectively mine a variety of their statistics, including cardinality, intersection cardinality, and Jaccard similarity, as well as to rapidly search for similar sets. It outperforms the straightforward method of combining existing sketch-based cardinality estimation methods and Jaccard similarity estimation methods.
• We develop maximum likelihood estimation (MLE) methods to estimate set cardinalities and intersection cardinalities, which are more accurate than the original OH-sketch-based cardinality estimation method [6].
• Experimental results on real-world datasets demonstrate that, compared with state-of-the-art methods, our method, SimCar, is up to 1000 times faster for online data stream processing, and reduces memory usage by 13.5-23.8% while achieving the same accuracy.
The rest of this paper is organized as follows.
Section 2 summarizes related work. The problem formulation is presented in Section 3. Section 4 introduces the preliminaries. Section 5 presents our framework, SimCar. The performance evaluation is presented in Section 6. Concluding remarks then follow.

Related Work
LSH for similarity estimation. LSH maps similar objects (e.g., sets and vectors) to the same hash value with higher probability than dissimilar ones [23], and has been widely used for applications such as similarity estimation and nearest neighbor search. MinHash [16] is a popular LSH technique for estimating the Jaccard similarity coefficient of sets. Recently, considerable attention has been paid to improving the performance of MinHash. For example, b-bit minwise hashing [18], odd sketch [19], and MaxLogHash [27] use probabilistic techniques such as sampling and sketching to further reduce the amount of memory used by MinHash. However, all these methods process each element with a high time complexity of O(k). To solve this problem, OPH [17] and its variants [20-22,28] were developed to accelerate the processing of each element in a set by orders of magnitude. Very recently, Raul et al. [29] used HyperLogLog [5] and OPH [17] sketches to estimate the Jaccard similarity, as well as the containment. In addition to sets (i.e., binary vectors), a variety of sketch methods have also been developed to estimate the similarity between real-valued weighted vectors. For example, Charikar [25] developed a method, SimHash, for approximating the angle similarity (i.e., cosine similarity) between weighted vectors. Datar et al. [26] used p-stable distributions to estimate the l_p distance between weighted vectors, where 0 < p ≤ 2. Refs. [30-38] were developed to approximate the weighted Jaccard similarity of weighted vectors. Ryan et al. [39] proposed a Gumbel-Max-Trick-based sketching method, P-MinHash, to estimate another novel Jaccard similarity metric, the probability Jaccard similarity. They also demonstrated that the probability Jaccard similarity is scale invariant and more sensitive to changes in vectors. Qi et al. [40] proposed FastGM to further reduce the time complexity of P-MinHash by generating the hash values in order.
LSH for nearest neighbor search. In the LSH methods of refs. [23,26], objects are first hashed into a hash table. To perform a similarity search, a query object is hashed into a bucket of the hash table, and then the objects in the bucket are used as candidate objects, which need to be further verified. However, nearest neighbors may not appear as candidate objects because close objects may be hashed into different buckets. To solve this problem, refs. [41-43] proposed methods that look up more than one bucket of the hash table for a query, which significantly reduces the required amount of storage. To further improve the performance, many techniques, such as C2LSH [44], SK-LSH [45], and LSB-Forest [46], were developed to reduce memory usage and query cost. Satuluri et al. [47] developed a method, BayesLSH, that performs similarity estimation and candidate pruning using LSH sketches to accelerate the procedure of verifying candidate objects. Real-world datasets are usually not distributed uniformly over the space, which results in unbalanced LSH buckets and degrades the performance of LSH for nearest neighbor search. To solve this problem, Gao et al. [48] developed a method, DSH, that leverages data distributions to hash nearest neighbors together with a larger probability. Wang et al. [49] developed a system, FLASH, using OPH and reservoir sampling to overcome computational and parallelization hurdles. Recently, several LSH-based methods [50-52] and transformation-based methods [53-55] were developed to solve the maximum inner product search (MIPS) problem, i.e., to search for vectors having maximum inner products with a given query vector.
Cardinality estimation. To estimate a large data set's cardinality, Whang et al. [3] developed the first fast sketch method, LPC. Additionally, refs. [2,12] used different sampling methods to enlarge the estimation range of LPC. Flajolet and Martin [56] developed a sketch method, FM, which uses a register initialized to zero for cardinality estimation. To further improve the performance of FM, the sketch methods LogLog [4], HyperLogLog [5], RoughEstimator [7], and HLL-TailCut+ [57] were developed to use a list of m registers and compress the register size from 32 bits to 5 or 4 bits. Giroire [6] developed a sketch method, MinCount (also known as the bottom-k sketch [58]), which stores the k minimum hash values of elements in the stream to estimate the stream cardinality. Lumbroso [59] developed another order-statistics-based cardinality estimation method, which hashes the data stream's elements into k registers at random, where each register stores the minimum hash value of the elements hashed into it. Ting [53] developed a martingale-based estimator to further improve the accuracy of the above sketch methods, such as LPC, HyperLogLog, and MinCount. Chen et al. [60] extended HyperLogLog to sliding windows. In addition to sketch methods, two sampling methods [61,62] were also developed for cardinality estimation. Recently, considerable attention [8-11] has been given to developing fast sketch methods to monitor the cardinalities of network hosts over high-speed links. Ting [13] developed methods to estimate the cardinality of set unions and intersections from MinCount sketches. Cohen et al. [15] developed a method combining MinHash and HyperLogLog to estimate set intersection cardinalities. These two methods fail to solve the problem studied in this paper, because MinCount does not belong to the LSH family and cannot be used for sub-linear-time similarity search, while MinHash fails to deal with high-speed data streams due to its high time complexity.
LSH for clustering and feature tracking. LSH can also support other analytical tasks, such as clustering and feature tracking. Mao et al. [63] proposed the MR-PGDLSH algorithm, which uses LSH to reduce the overhead of frequent communication between nodes, thereby addressing the excessive communication overhead faced by partition-based K-means clustering algorithms in big data environments. Corizzo et al. [64,65] proposed the DENCAST algorithm to process large-scale high-dimensional data using LSH, which is more efficient than state-of-the-art distributed clustering algorithms. Cao et al. [66] used LSH-based k-nearest neighbor matching to generate feature correspondences, then used a ratio test to remove outliers from the matching set, achieving better results. Ding et al. [67] proposed a perceptual hash algorithm based on deep learning and LSH to generate edge feature samples and improve the robustness of HRRS image authentication.

Problem Formulation
Let Γ = e^(1), ..., e^(t), ... denote the stream of interest, where e^(t) = (u^(t), w^(t)) is the t-th element of stream Γ, which represents a link from a user u^(t) to an interest w^(t). Note that a pair (u, w) may appear more than once in stream Γ. Denote by U the set of occurring users, and by W the set of occurring interests. Let N_u be the interest set of user u, i.e., the set of interests that user u links to. User u's cardinality is defined as the number of its distinct interests, i.e., d_u = |N_u|. For two users u and v, we define their common-interest count as d_{u∩v} = |N_u ∩ N_v|, and their Jaccard similarity as

J_{u,v} = |N_u ∩ N_v| / |N_u ∪ N_v| = d_{u∩v} / (d_u + d_v − d_{u∩v}).

In this paper, we aim to design a fast method to build a compact data summary (or sketch) for each user in a single pass over stream Γ. For any specific query user u, the generated sketches are effective for rapidly searching for nearest neighbors of user u (i.e., users v ∈ U with large Jaccard similarity J_{u,v}) without enumerating each user in U. More importantly, they are also accurate for estimating the cardinality d_v of any user v, as well as the Jaccard similarity J_{u,v} and the common-interest count d_{u∩v} of any two users u and v. For ease of reference, we list the notation used throughout the paper in Table 1.
Table 1. Notation used throughout the paper.

k: the number of registers used for a MinHash, OPH, or OH sketch
k_1: the number of registers used for an HLL sketch
N_{u\v}: the set of interests affiliated with user u but not with user v, i.e., N_u \ N_v
J_{u,v}: the Jaccard similarity of sets N_u and N_v
Φ_u: the set of elements in x_u that are smaller than 1
k_u: the number of elements in x_u that equal 1
Φ^{(j)}_{u,v}: the set of indices i such that the relation between x_{u,i} and x_{v,i} is Case j, defined in Section 5
|Φ|: the cardinality of a set Φ
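As a concrete illustration of the three statistics defined above, the following snippet computes them exactly from two toy interest sets (the sets and the function name are illustrative, not from the paper):

```python
def exact_stats(N_u, N_v):
    """Return (cardinality of N_u, common-interest count, Jaccard similarity)."""
    inter = len(N_u & N_v)   # d_{u ∩ v}
    union = len(N_u | N_v)   # |N_u ∪ N_v|
    return len(N_u), inter, inter / union

# Two users sharing interests "b" and "c": d_u = 3, d_{u∩v} = 2, J = 2/4 = 0.5
d_u, d_uv, j_uv = exact_stats({"a", "b", "c"}, {"b", "c", "d"})
```

A sketch-based method must approximate exactly these quantities without storing the full sets.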

Preliminaries
In this section, we present Giroire's algorithm [6] for cardinality estimation, as well as the OPH method [17] and the optimal densification technique [22] for Jaccard similarity estimation, which are closely related to our framework, SimCar.

Giroire's Algorithm
Giroire [6] developed a fast method, CORE, for estimating a stream's cardinality, i.e., the number of distinct elements occurring in the stream. The basic idea behind CORE is that one can assign each element in the stream a rank that is randomly and uniformly selected from the range (0, 1), and, to some extent, the minimum of the occurring elements' ranks reflects the number of distinct occurring elements (i.e., the stream's cardinality). To accelerate the performance, CORE splits the elements in the stream into k subsets and keeps track of the minimum rank within each subset, which is finally used to infer the stream's cardinality. In detail, CORE uses k registers x_1, ..., x_k, initialized to 1, to generate a sketch of the stream. Let r_e be the rank of an element e, selected uniformly at random from the range (0, 1); that is, r_e ~ Uniform(0, 1). For an element e arriving on the stream, CORE computes the bucket index i_e = ⌊r_e k⌋ + 1 and then updates the i_e-th register as

x_{i_e} ← min(x_{i_e}, r_e k − ⌊r_e k⌋).   (1)

At the end of the stream, the stream's cardinality is estimated as d* = k(k − 1) / (x_1 + ... + x_k). Because d* is severely biased for small cardinalities, CORE also treats the list of registers x_1, ..., x_k as an LPC sketch (i.e., a bitmap of k bits) [3]. Let n_1 be the number of registers among x_1, ..., x_k that still equal 1 at the end of the stream. When n_1/k exceeds 86% (i.e., most registers are still empty, which indicates a small cardinality), CORE estimates the stream's cardinality as −k log(n_1/k).
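The construction above can be sketched as follows; the `rank` function is an illustrative stand-in for the random ranks r_e, and the linear-counting fallback is applied when most registers are still empty (small cardinalities):

```python
import hashlib
import math

def rank(e, seed=0):
    """Illustrative stand-in for the rank r_e ~ Uniform(0, 1)."""
    h = hashlib.sha256(f"{seed}:{e}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def core_sketch(stream, k):
    """Registers x_1..x_k; each keeps the minimum scaled rank in its bucket."""
    x = [1.0] * k
    for e in stream:
        r = rank(e)
        i = int(r * k)                 # bucket index (0-based here)
        x[i] = min(x[i], r * k - i)    # rank rescaled to (0, 1) within the bucket
    return x

def core_estimate(x):
    """Order-statistics estimator k(k - 1)/sum(x_i), with a linear-counting
    fallback when most registers were never updated."""
    k = len(x)
    n1 = sum(1 for v in x if v == 1.0)   # registers still equal to 1
    if n1 > 0.86 * k:
        return -k * math.log(n1 / k)
    return k * (k - 1) / sum(x)

# 10,000 distinct elements, each seen twice; duplicates do not affect the sketch.
est = core_estimate(core_sketch([str(i) for i in range(10000)] * 2, 256))
```

With k = 256 registers, the estimate is typically within a few percent of the true cardinality.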

OPH
OPH [17] can be viewed as a sampling method, which samples (at most) k distinct elements from each set of interest without replacement and then estimates the Jaccard similarity of two sets based on their sampled elements. Compared with the well-known sketch method MinHash [16], which is used to estimate set Jaccard similarity, OPH reduces the time complexity of processing each element in a set N_u from O(k) to O(1). In detail, to build a sketch of a set N_u ⊆ W, OPH [17] uses a single permutation function π(·): W → W to process each element in N_u. Specifically, it evenly divides W = {1, ..., p} into k bins W_1, ..., W_k, where W_j = {(j − 1)p/k + 1, ..., jp/k}, and sets the j-th register of the sketch x_u to

x_{u,j} = min_{w ∈ N_u : π(w) ∈ W_j} π(w),

with x_{u,j} = ∅ when no element of N_u is mapped into bin W_j. Given that at least one of x_{u,j} and x_{v,j} does not equal ∅, Li et al. [17] observed that x_{u,j} = x_{v,j} occurs with probability J_{u,v}, and proposed estimating J_{u,v} based on this observation. Let 1(P) be the indicator function that equals one when the predicate P is true, and zero otherwise. Formally, OPH estimates the Jaccard similarity J_{u,v} of two sets N_u and N_v based on their OPH sketches x_u and x_v as

Ĵ_{u,v} = (Σ_{j=1}^k 1(x_{u,j} = x_{v,j} ≠ ∅)) / (Σ_{j=1}^k 1(x_{u,j} ≠ ∅ ∨ x_{v,j} ≠ ∅)).
As shown in Algorithm 1, one can apply the above OPH algorithm online to stream Γ and compute the OPH sketch of each user's interest set N u , u ∈ U.
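A minimal sketch of the OPH construction and estimator, using a hash function as an illustrative stand-in for the permutation π and `None` for the empty symbol ∅:

```python
import hashlib

P = 2**31 - 1   # illustrative size of the universe W

def perm(w, seed=7):
    """Illustrative stand-in for the permutation pi over W."""
    h = hashlib.sha256(f"{seed}:{w}".encode()).digest()
    return int.from_bytes(h[:8], "big") % P

def oph_sketch(elements, k):
    """Bin pi(w) into k equal ranges; keep the minimum per bin (None = empty)."""
    x = [None] * k
    for w in elements:
        v = perm(w)
        j = v * k // P                 # bin that pi(w) falls into
        if x[j] is None or v < x[j]:
            x[j] = v
    return x

def oph_jaccard(xu, xv):
    """Matching bins divided by jointly non-empty bins."""
    match = sum(1 for a, b in zip(xu, xv) if a is not None and a == b)
    nonempty = sum(1 for a, b in zip(xu, xv) if a is not None or b is not None)
    return match / nonempty if nonempty else 0.0

# Two overlapping ranges with true Jaccard similarity 1000/3000 = 1/3.
xu = oph_sketch(range(2000), 512)
xv = oph_sketch(range(1000, 3000), 512)
```

Each element is placed with a single hash evaluation, which is the source of the O(1) per-element cost.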

Optimal Densification
Clearly, the OPH method generates sketches (i.e., vectors) that may have empty elements (i.e., elements equal to ∅), which hinders its application to LSH-based similarity search in sub-linear time, and also results in a large error in Jaccard similarity estimation. To solve this problem, refs. [20-22] proposed several densification techniques, which reassign empty elements using the values of non-empty elements. The state-of-the-art densification technique, optimal densification [22], is shown in Algorithm 2. Specifically, when the i-th element of an OPH sketch x_u is empty (i.e., x_u,i = ∅), optimal densification iteratively uses a hash function j = h(i, l): {1, ..., k} × {1, 2, ...} → {1, ..., k} to find a non-empty element x_{u,j} (i.e., x_{u,j} ≠ ∅). Function h(i, l) takes two arguments: (1) i, the current empty element that needs to be reassigned; (2) l, the number of failed attempts made so far to reach a non-empty element, which is initialized to 1. Let x*_u denote the optimal densification of the OPH sketch x_u. For any sets N_u and N_v, their Jaccard similarity is then estimated from the densified sketches as

Ĵ*_{u,v} = (1/k) Σ_{i=1}^k 1(x*_{u,i} = x*_{v,i}).

Shrivastava [22] proves that Ĵ*_{u,v} is an unbiased estimator of J_{u,v} and is more accurate than the original OPH method, as well as the other densification techniques in [20,21].
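A minimal sketch of optimal densification under these definitions; the hash `h(i, l)` is an illustrative stand-in, and donor values are always looked up in the original (pre-densification) sketch:

```python
import hashlib

def h(i, l, k, seed=13):
    """Illustrative stand-in for the hash h(i, l) used to pick a donor bin."""
    d = hashlib.sha256(f"{seed}:{i}:{l}".encode()).digest()
    return int.from_bytes(d[:8], "big") % k

def densify(x):
    """Fill every empty bin by probing h(i, 1), h(i, 2), ... until a
    non-empty bin of the ORIGINAL sketch x is found."""
    k = len(x)
    if all(v is None for v in x):
        return x[:]                        # nothing to copy from
    out = x[:]
    for i in range(k):
        if x[i] is None:
            l = 1                          # number of failed attempts so far
            while x[h(i, l, k)] is None:
                l += 1
            out[i] = x[h(i, l, k)]
    return out

out = densify([None, 5, None, 9, None, None, 2, None])
```

Because the probe sequence h(i, 1), h(i, 2), ... depends only on i, two sketches of similar sets fill their empty bins consistently, which preserves the collision probability.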

Our Framework SimCar
In this section, we use Giroire's algorithm [6] (Algorithm 3) to build an OH sketch for each occurring user. Based on generated OH sketches, we first present our methods for estimating the Jaccard similarity of users' interest sets and quickly searching users with similar interests. Then, we propose methods for estimating the cardinalities and common-interest counts of users' interest sets.

Jaccard Similarity Estimation and Nearest Neighbor Search
Jaccard Similarity Estimation. Let x*_u be the DOH sketch of the OH sketch x_u, u ∈ U, which is computed using the optimal densification technique introduced in Section 4. Similar to optimal densification [22], we find that P(x*_{u,i} = x*_{v,i}) = J_{u,v} for any 1 ≤ i ≤ k. Then, we estimate the Jaccard similarity J_{u,v} as

Ĵ*_{u,v} = (1/k) Σ_{i=1}^k 1(x*_{u,i} = x*_{v,i}).

Following the error analysis in [22], we find that Ĵ*_{u,v} is an unbiased estimate of J_{u,v} and is as accurate as the optimal densification of OPH.
Nearest Neighbor Search. A variety of LSH methods [16,23-26] have been developed to solve the task of nearest neighbor search in sub-linear time. The basic idea behind these LSH methods is to place similar users into the same bucket of a hash table with high probability. Following the principle of LSH, our nearest neighbor search method consists of two basic phases: • Indexing phase. For each user u ∈ U, we derive b keys g_1(x*_u), ..., g_b(x*_u) from its DOH sketch x*_u and insert user u into the corresponding buckets of b hash tables, respectively. • Retrieving phase. Given a specific user u, we search for its similar users. Instead of scanning all users in set U, we only probe the buckets with keys g_1(x*_u), ..., g_b(x*_u) from the b hash tables, respectively, and report the users in any of these buckets as potential candidates. Last, the nearest neighbors of user u are detected by enumerating the candidates, and computing and sorting their Jaccard similarity estimates.
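The two phases can be sketched with standard LSH banding; here we assume each key g_t(x*) is a band of r consecutive registers, which is one common choice rather than necessarily the paper's exact construction:

```python
from collections import defaultdict

def lsh_tables(sketches, b, r):
    """Indexing phase: insert each user into b hash tables, keyed by bands
    of r consecutive registers of its densified sketch (length b*r)."""
    tables = [defaultdict(list) for _ in range(b)]
    for user, x in sketches.items():
        for t in range(b):
            key = tuple(x[t * r:(t + 1) * r])   # band g_{t+1}(x)
            tables[t][key].append(user)
    return tables

def candidates(tables, x, b, r):
    """Retrieving phase: union of the b probed buckets for a query sketch x."""
    out = set()
    for t in range(b):
        out.update(tables[t].get(tuple(x[t * r:(t + 1) * r]), []))
    return out

# "v" shares its first band with "u"; "w" shares none.
sk = {"u": [1, 2, 3, 4], "v": [1, 2, 9, 9], "w": [7, 7, 7, 7]}
tables = lsh_tables(sk, b=2, r=2)
```

Candidates retrieved this way still need verification by the Jaccard estimator, as in the retrieving phase above.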

User Cardinality Estimation
In this section, we estimate the cardinality of any queried user. Specifically, we first build a probabilistic model according to the state of the OH sketch, and then solve it using MLE methods.
Exact probabilistic model. For any user u ∈ U, let N_{u,i} denote the set of its interests whose hash values equal i with respect to the hash function f; that is, N_{u,i} = {w ∈ N_u : f(w) = i}. At any time of stream Γ, from Equation (1), we have x_{u,i} = min_{w ∈ N_{u,i}} r_w, with x_{u,i} = 1 when N_{u,i} = ∅. Define d_{u,i} = |N_{u,i}|. Let f(x_{u,i} = x | d_{u,i} = d) denote the probability density function (PDF) of the random variable x_{u,i} at x given d_{u,i} = d. Then, we have

f(x_{u,i} = x | d_{u,i} = d) = d(1 − x)^{d−1}, 0 < x < 1, d ≥ 1,

and P(x_{u,i} = 1 | d_{u,i} = 0) = 1. Our algorithm randomly splits the interests in set N_u into k subsets N_{u,1}, ..., N_{u,k}. Using the classical balls-in-urns model, we derive the probability distribution of the vector (d_{u,1}, ..., d_{u,k}) as

P(d_{u,1}, ..., d_{u,k} | d_u) = (d_u choose d_{u,1}, ..., d_{u,k}) k^{−d_u},

where (d_u choose d_{u,1}, ..., d_{u,k}) = d_u! / (d_{u,1}! ··· d_{u,k}!). Therefore, the PDF of x_u = (x_{u,1}, ..., x_{u,k}) given d_u is computed as

f(x_u | d_u) = Σ_{d_{u,1}+...+d_{u,k}=d_u} P(d_{u,1}, ..., d_{u,k} | d_u) Π_{i=1}^k f(x_{u,i} | d_{u,i}).

Poisson approximation model. Clearly, d_{u,1}, ..., d_{u,k} are not independent, which hinders us when it comes to obtaining the MLE of d_u. Similar to [5], we use the Poisson approximation technique to remove the dependence among d_{u,1}, ..., d_{u,k}. Specifically, we assume that the value of d_u is distributed according to a Poisson distribution with parameter λ_u, i.e., d_u ~ Poisson(λ_u). Then, the PDF of x_u given λ_u is

f(x_u | λ_u) = Σ_{d=0}^∞ P(d_u = d | λ_u) f(x_u | d_u = d) = Π_{i=1}^k f(x_{u,i} | λ_u).

Given d_u ~ Poisson(λ_u), the above equation indicates that the values of x_{u,1}, ..., x_{u,k} are independent and identically distributed. In addition, the values of d_{u,1}, ..., d_{u,k} are independent and identically distributed according to a Poisson distribution Poisson(λ_u/k). Formally, we have

d_{u,i} ~ Poisson(λ_u/k), i = 1, ..., k.   (2)

For an estimator φ(x_u) of d_u (e.g., Equation (5), derived later), we let E_{d_u}(φ(x_u)) and Var_{d_u}(φ(x_u)) denote the expectation and the variance of φ(x_u) under the exact probabilistic model (i.e., f(x_u | d_u)), and let E_{Poisson(λ_u)}(φ(x_u)) and Var_{Poisson(λ_u)}(φ(x_u)) denote the expectation and the variance of φ(x_u) under the Poisson approximation model (i.e., f(x_u | λ_u)).
The authors of [5,68,69] reveal that statistical properties of E d u (φ( x u )) and Var d u (φ( x u )) are well approximated by E Poisson(λ u ) (φ( x u )) and Var Poisson(λ u ) (φ( x u )) when setting λ u = d u , which is the depoissonization step.
Estimator of λ_u. Next, we elaborate our method to derive the MLE of λ_u under the Poisson approximation model. For any 0 < x < 1, the following equation holds:

f(x_{u,i} = x | λ_u) = Σ_{d=1}^∞ P(d_{u,i} = d | λ_u) d(1 − x)^{d−1} = (λ_u/k) e^{−λ_u/k} Σ_{d=1}^∞ ((λ_u/k)(1 − x))^{d−1} / (d − 1)! = (λ_u/k) e^{−λ_u x/k},

where the last equation holds because of the Taylor series Σ_{d=1}^∞ z^{d−1}/(d − 1)! = e^z with z = (λ_u/k)(1 − x). Then, we compute the PDF of x_{u,i} at x given λ_u as

f(x_{u,i} = x | λ_u) = (λ_u/k) e^{−λ_u x/k} for 0 < x < 1, and e^{−λ_u/k} for x = 1.   (4)

The case x = 1 holds because we have d_{u,i} = 0 when x = 1, which occurs with probability P(d_{u,i} = 0 | λ_u) = e^{−λ_u/k}. Let Φ_u denote the set of elements in vector x_u that are smaller than 1, i.e., Φ_u = {x_{u,i} : x_{u,i} < 1, 1 ≤ i ≤ k}. Denote by k_u the number of elements in vector x_u that equal 1. Then, the likelihood function of λ_u given the k independent observed variables x_{u,1}, ..., x_{u,k} is computed as

L(λ_u) = Π_{x ∈ Φ_u} (λ_u/k) e^{−λ_u x/k} × (e^{−λ_u/k})^{k_u}.

Maximizing log L(λ_u) = |Φ_u| log(λ_u/k) − (λ_u/k)(Σ_{x ∈ Φ_u} x + k_u) gives the MLE

λ̂_u = k |Φ_u| / (Σ_{x ∈ Φ_u} x + k_u).   (5)

To analyze the estimation error of λ̂_u, we first compute the Fisher information of λ_u as

I(λ_u) = −E(∂² log L(λ_u) / ∂λ_u²) = E(|Φ_u|) / λ_u² = k(1 − e^{−λ_u/k}) / λ_u².

The last equation is obtained because E(k_u | λ_u) = k e^{−λ_u/k}, which is easily derived from Equation (4). From [70], we find that the MLE λ̂_u is an asymptotically efficient unbiased estimator of λ_u, and its mean square error (MSE) asymptotically approaches the Cramér-Rao lower bound (CRLB) of λ_u, which is defined as 1/I(λ_u). Formally, we have lim_{k→+∞} λ̂_u → λ_u and lim_{k→+∞} MSE(λ̂_u) → 1/I(λ_u). Lastly, at the depoissonization step, we use λ̂_u to approximate d_u. In our experiments, we will show that the above MLE estimator is more accurate than Giroire's algorithm (Section 4.1), especially for small cardinalities.
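Maximizing the likelihood described above yields a closed-form estimator, lam_hat = k|Phi_u| / (sum of registers below 1 + k_u); a small simulation sketch with stand-in uniform ranks (illustrative, not the paper's hash functions) shows it in action:

```python
import random

def mle_cardinality(x):
    """Closed-form MLE under the Poisson model:
    lam_hat = k * |Phi_u| / (sum_{x in Phi_u} x + k_u)."""
    k = len(x)
    phi = [v for v in x if v < 1.0]    # registers that were updated
    k_u = k - len(phi)                 # registers still equal to 1
    if not phi:
        return 0.0                     # nothing observed yet
    return k * len(phi) / (sum(phi) + k_u)

# Simulate k = 512 registers fed by d = 5000 distinct interests:
# each interest picks a random register and a uniform rank.
random.seed(42)
k, d = 512, 5000
x = [1.0] * k
for _ in range(d):
    i, r = random.randrange(k), random.random()
    x[i] = min(x[i], r)
est = mle_cardinality(x)
```

The relative error behaves like 1/sqrt(k), consistent with the CRLB analysis above.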

Common-Interest Count Estimation
Similarly, we use the MLE and Poisson approximation techniques to estimate the common-interest count d_{u∩v} of any two users u, v ∈ U. Specifically, for i = 1, ..., k, let N_{u∩v,i}, N_{u\v,i}, and N_{v\u,i} denote the sets of interests in N_u ∩ N_v, N_u \ N_v, and N_v \ N_u, respectively, whose hash values equal i. Let d_{u∩v,i} = |N_{u∩v,i}|, d_{u\v,i} = |N_{u\v,i}|, and d_{v\u,i} = |N_{v\u,i}|. Similar to Equation (2), we find that d_{u∩v,1}, ..., d_{u∩v,k}, d_{u\v,1}, ..., d_{u\v,k}, d_{v\u,1}, ..., d_{v\u,k} are independent, and they are distributed according to the following Poisson distributions:

d_{u∩v,i} ~ Poisson(λ_{u,v}/k), d_{u\v,i} ~ Poisson((λ_u − λ_{u,v})/k), d_{v\u,i} ~ Poisson((λ_v − λ_{u,v})/k),

where λ_{u,v} is the Poisson parameter corresponding to d_{u∩v}. Define x_{u∩v,i} = min_{w ∈ N_{u∩v,i}} r_w, x_{u\v,i} = min_{w ∈ N_{u\v,i}} r_w, and x_{v\u,i} = min_{w ∈ N_{v\u,i}} r_w. Then, we find that x_{u,i} = min(x_{u∩v,i}, x_{u\v,i}) and x_{v,i} = min(x_{u∩v,i}, x_{v\u,i}).
Let f(x_{u,i} = x, x_{v,i} = x' | λ_u, λ_v, λ_{u,v}) denote the PDF of the random variable pair (x_{u,i}, x_{v,i}) at (x, x'), given λ_u, λ_v, and λ_{u,v}. In what follows, we omit the conditions λ_u, λ_v, and λ_{u,v} for simplicity when no confusion arises. We derive f(x_{u,i} = x, x_{v,i} = x') for all possible relations between x_{u,i} and x_{v,i} as follows:

Case 1: x_{u,i} = x_{v,i} = 1. All three sets N_{u∩v,i}, N_{u\v,i}, and N_{v\u,i} are empty, so we have f(x_{u,i} = 1, x_{v,i} = 1) = e^{−(λ_u + λ_v − λ_{u,v})/k}.

Case 2: x_{u,i} = 1 ∧ 0 < x_{v,i} < 1. In this case, N_{u,i} = ∅ and x_{v,i} = x_{v\u,i}, so we have f(x_{u,i} = 1, x_{v,i} = x') = e^{−λ_u/k} ((λ_v − λ_{u,v})/k) e^{−(λ_v − λ_{u,v}) x'/k}.

Case 3: 0 < x_{u,i} < 1 ∧ x_{v,i} = 1. Similar to Case 2, we have f(x_{u,i} = x, x_{v,i} = 1) = e^{−λ_v/k} ((λ_u − λ_{u,v})/k) e^{−(λ_u − λ_{u,v}) x/k}.

Case 4: 0 < x_{u,i} = x_{v,i} < 1. This case indicates that x_{u∩v,i} < x_{u\v,i} and x_{u∩v,i} < x_{v\u,i}, so we have f(x_{u,i} = x, x_{v,i} = x) = (λ_{u,v}/k) e^{−(λ_u + λ_v − λ_{u,v}) x/k}.

Case 5: 0 < x_{u,i} < x_{v,i} < 1. In this case, the minimum on the side of u must come from N_{u\v,i}, so we have f(x_{u,i} = x, x_{v,i} = x') = ((λ_u − λ_{u,v})/k) e^{−(λ_u − λ_{u,v}) x/k} (λ_v/k) e^{−λ_v x'/k}.

Case 6: 0 < x_{v,i} < x_{u,i} < 1. Similar to Case 5, we have f(x_{u,i} = x, x_{v,i} = x') = ((λ_v − λ_{u,v})/k) e^{−(λ_v − λ_{u,v}) x'/k} (λ_u/k) e^{−λ_u x/k}.
Similar to Equation (4), the above six cases specify the joint PDF of (x_{u,i}, x_{v,i}). Let Φ^{(j)}_{u,v} denote the set of integers i ∈ {1, ..., k} such that the relation between x_{u,i} and x_{v,i} is Case j, 1 ≤ j ≤ 6, and define k^{(j)}_{u,v} = |Φ^{(j)}_{u,v}|. Then, the likelihood function of λ = (λ_u, λ_v, λ_{u,v}) given the observations x_u and x_v is

L(x_u, x_v | λ) = Π_{j=1}^6 Π_{i ∈ Φ^{(j)}_{u,v}} f(x_{u,i}, x_{v,i}).

We use the Newton-Raphson method [71] to compute the MLE of λ = (λ_u, λ_v, λ_{u,v}), i.e., λ̂ = arg max_λ L(x_u, x_v | λ). The Newton-Raphson method starts from an initial estimate λ^(0) and then repeats the following procedure until a sufficiently accurate root of g(λ) = (0, 0, 0)^T is reached:

λ^(t+1) = λ^(t) − (H(λ^(t)))^{−1} g(λ^(t)),

where g(λ) and H(λ) are the gradient and the Hessian matrix of log L(x_u, x_v | λ) at λ, respectively. The closed formulas of g(λ) and H(λ) are given in Appendix A. We set the initial estimate λ^(0) = (λ̂_u, λ̂_v, Ĵ*_{u,v}(λ̂_u + λ̂_v)/(1 + Ĵ*_{u,v})), using the estimates λ̂_u and λ̂_v of λ_u and λ_v given in Section 5.2 and the estimate Ĵ*_{u,v} of the Jaccard similarity J_{u,v} introduced in Section 5.1. Next, we analyze the error of the obtained MLE λ̂*_{u,v}. The Fisher information matrix of λ is defined as

I(λ) = −E(H(λ)).

The closed formula of E(H(λ)) is given in Appendix A. Let (I(λ))^{−1}_{3,3} be the (3, 3) element of the inverse of matrix I(λ). From [70], we find that the MLE λ̂*_{u,v} is an asymptotically efficient unbiased estimator of λ_{u,v}, and its MSE asymptotically approaches the CRLB of λ_{u,v}, i.e., (I(λ))^{−1}_{3,3}. Lastly, we use λ̂*_{u,v} to approximate d_{u∩v}.
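The Newton-Raphson starting point can be computed directly: since J = λ_{u,v} / (λ_u + λ_v − λ_{u,v}) under this model, solving for λ_{u,v} gives the initialization below (our reading of the construction; the function name is illustrative):

```python
def init_lambda(lam_u, lam_v, j_hat):
    """Starting point for the Newton-Raphson iteration. Since
    J = lam_uv / (lam_u + lam_v - lam_uv), solving for lam_uv gives
    lam_uv = J * (lam_u + lam_v) / (1 + J)."""
    lam_uv = j_hat * (lam_u + lam_v) / (1.0 + j_hat)
    return (lam_u, lam_v, lam_uv)
```

Starting close to the optimum in this way is what allows the iteration to converge in only a few steps, as noted in the complexity analysis below.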

Space and Time Complexities
Complexities of online generating OH sketches. We use k registers for each occurring user in stream Γ, and the time complexity of processing each element of stream Γ is O(1).

Complexities of densification.
No extra memory space is required for densification. From [22], we find that computing a user u's DOH sketch from its OH sketch x_u requires time O(k(k_u/(k − k_u) + 2)).

Complexities of user cardinality estimation.
No extra memory space is required for user cardinality estimation. The time complexity of estimating a user's cardinality is O(k).

Complexities of common-interest count estimation.
No extra memory space is required for common-interest count estimation. We find that our method only requires several Newton-Raphson iterations to converge, where each iteration has a negligible computational cost. Therefore, the time complexity of estimating two users' common-interest counts is O(k).
Complexities of Jaccard similarity estimation. No extra memory space is required for Jaccard similarity estimation. Based on two users' DOH sketches, time O(k) is required to estimate the two users' Jaccard similarity.
Complexities of nearest neighbor search. Given DOH sketches of users in U, space O(|U|b) and time O(|U|b) are required for building LSH tables, and searching a query user's nearest neighbors requires time O(k log |U|).

Evaluation
In this section, we evaluate the runtime and accuracy of our framework, SimCar, compared to the state of the art. All experiments were run on the same machine with an Intel Xeon E5-2620 v3 CPU running at 2.4 GHz, and all algorithms were implemented in Python.

Datasets
We performed our experiments on six real-world datasets from different areas. The dataset Flickr [72] is a bipartite network of Flickr users and their group memberships. The dataset Movie [73] is a bipartite network consisting of one million movie ratings from MovieLens (http://movielens.umn.edu/, accessed on 10 March 2022).
The dataset Wikipedia [74] is a bipartite network of excellent articles in the English Wikipedia (http://en.wikipedia.org/wiki/Wikipedia, accessed on 10 March 2022), which meet a core set of editorial standards, and the words they contain. The dataset Reuters [75] is a bipartite network of article-word inclusions in documents that appeared on the Reuters newswire in 1987. The dataset Tropes [76] is a bipartite network of TV tropes (http://tvtropes.org, accessed on 10 March 2022), characterizing artistic works by their tropes. The dataset TREC [77] is a bipartite network of 1.7 million text documents from the Text Retrieval Conference (TREC) (http://trec.nist.gov/data/docs_eng.html, accessed on 10 March 2022), containing 556,000 words in different languages. The detailed statistics of these real-world datasets are summarized in Table 2. Table 2. Real-world datasets used in our experiments, where "size" refers to the number of elements in the stream.

Baselines
We compared our framework with the following state-of-the-art methods on all four tasks: user cardinality estimation, common-interest count estimation, Jaccard similarity estimation, and nearest neighbor search.
• MinHash and HLL. For common-interest count estimation, the baseline is the method of [15], which estimates common-interest counts by generating both MinHash [16] and HyperLogLog [5] (HLL for short in the following) sketches for each user's interest set. Clearly, one can use the generated HLL sketches to estimate user cardinalities, and use the generated MinHash sketches to estimate Jaccard similarities and build LSH tables for nearest neighbor search.
• OPH+HLL. One can also solve all four tasks by combining OPH [17] and HLL, where OPH [17] is much faster than MinHash and exhibits comparable accuracy for applications such as Jaccard similarity estimation and nearest neighbor search [20-22]. In detail, HLL sketches were used to estimate user cardinalities, and optimally densified OPH sketches were used to estimate Jaccard similarities and build LSH tables for rapid nearest neighbor search. To estimate the common-interest count of any two users u and v, we first easily obtained (1) estimations d̂_u, d̂_v, and d̂_{u∪v} of d_u, d_v, and d_{u∪v} = |N_u ∪ N_v| from the HLL sketches and our OH sketches; (2) an estimation Ĵ_{u,v} of the Jaccard similarity J_{u,v} from the optimally densified OPH sketches and DOH sketches. As shown in [15], the common-interest count d_{u∩v} can then be estimated by each of the following schemes: (1) d̂_{u∩v} = Ĵ_{u,v} d̂_{u∪v}; (2) d̂_{u∩v} = d̂_u + d̂_v − d̂_{u∪v}; (3) d̂_{u∩v} = Ĵ_{u,v}(d̂_u + d̂_v)/(1 + Ĵ_{u,v}). Similar to the MinHash and HLL method [15], we initialize the parameter T for our common-interest count estimation method in Section 5.3 and then run only a single Newton-Raphson iteration to obtain a more accurate estimation of d_{u∩v}. In our experiments, common-interest count estimations given by OPH and HLL are computed by the above schemes 1, 2, and 3 in a direct manner.
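The three closed-form schemes can be written out directly (a sketch of our reading of the estimates in [15]; the function name is illustrative):

```python
def common_interest_schemes(d_u, d_v, d_union, j_hat):
    """Three closed-form estimates of the common-interest count d_{u ∩ v}
    from cardinality and Jaccard similarity estimates."""
    s1 = j_hat * d_union                       # scheme 1: J * |union|
    s2 = d_u + d_v - d_union                   # scheme 2: inclusion-exclusion
    s3 = j_hat / (1.0 + j_hat) * (d_u + d_v)   # scheme 3: avoids the union estimate
    return s1, s2, s3
```

With exact inputs all three coincide; with noisy sketch estimates they trade off differently, which is why [15] compares all of them.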
In our experiments, by default, we let MinHash, OPH, and our sketch method OH use the same k, i.e., the number of registers used to generate one MinHash/OPH/OH sketch, and we let k1 denote the number of registers used for an HLL sketch. Therefore, our method SimCar uses k1 fewer registers than the baselines MinHash+HLL and OPH+HLL. Moreover, while MinHash, OPH, and OH use 32-bit registers, HLL uses 5-bit registers [5]; thus, SimCar reduces the memory usage by 13.5% and 23.8% when k1 = k and k1 = 2k, respectively.
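The memory figures follow directly from this register accounting: each baseline stores k 32-bit registers plus k1 5-bit HLL registers, while SimCar stores only the k 32-bit OH registers. A quick check:

```python
def memory_bits(k, k1):
    # Baselines: one 32-bit sketch (MinHash or OPH) + one 5-bit-register HLL sketch.
    baseline = 32 * k + 5 * k1
    # SimCar: a single 32-bit-register OH sketch per user.
    simcar = 32 * k
    return baseline, simcar

def reduction(k, k1):
    baseline, simcar = memory_bits(k, k1)
    return 1 - simcar / baseline

k = 2 ** 9
print(round(reduction(k, k), 3))      # k1 = k:  5/37  -> 0.135
print(round(reduction(k, 2 * k), 3))  # k1 = 2k: 10/42 -> 0.238
```

Note the ratio depends only on k1/k, not on the absolute sketch size.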

Metric
We evaluated both the efficiency and effectiveness of our method SimCar in comparison with the above two baseline methods. For efficiency, we evaluated the running time of all methods. Specifically, we used the online sketching time to measure the time of online processing of stream Γ. For user cardinality estimation, common-interest count estimation, and Jaccard similarity estimation, we evaluated the error of an estimate μ̂ with respect to its true value µ using the normalized root mean square error (NRMSE), which is defined as NRMSE(μ̂) = sqrt(E[(μ̂ − µ)²]) / µ. For nearest neighbor search, given a query user, we report the average size of the retrieved user set, as well as the recall of the top 10 most similar users among that set. In our experiments, we computed the above metrics over 100 independent runs.
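The NRMSE above is computed empirically by replacing the expectation with the mean over the independent runs. A minimal sketch:

```python
import math

def nrmse(estimates, true_value):
    # Empirical NRMSE = sqrt( mean( (est - true)^2 ) ) / true,
    # where `estimates` holds one estimate per independent run.
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value

# Two runs estimating a true value of 100 as 90 and 110:
print(nrmse([90, 110], 100))  # 0.1
```

Dividing by the true value makes errors comparable across users whose cardinalities span several orders of magnitude.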

Runtime
As shown in Figure 2a, where we set k1 = k = 2^9, our framework SimCar significantly outperforms MinHash+HLL and OPH+HLL on all datasets in Table 2. For example, on Reuters, the online sketching time of SimCar is about 2 s, while MinHash+HLL and OPH+HLL need about 764 s and 5.3 s, respectively. On average, SimCar is about 400 and 2.5 times faster than MinHash+HLL and OPH+HLL, respectively. Figure 2b shows the online sketching time for different k. We can see that the online sketching time of SimCar and OPH+HLL is almost constant, while the online sketching time of MinHash+HLL increases linearly with k, because MinHash needs to update each of its k registers for every element that occurs in stream Γ.
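The linear-in-k cost of MinHash comes from its per-element update rule: every arriving element is hashed k times and compared against all k registers. A minimal sketch of this classic update (illustrative hash family, not the paper's implementation):

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, a common modulus for 2-universal hashing

def make_hashes(k, seed=1):
    # k independent hash functions h_i(x) = (a_i * x + b_i) mod P (illustrative).
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash_update(registers, hashes, x):
    # Classic MinHash: each stream element touches ALL k registers -> O(k) per element.
    for i, (a, b) in enumerate(hashes):
        v = (a * x + b) % P
        if v < registers[i]:
            registers[i] = v
```

By contrast, one-permutation schemes such as OPH (and, per the text, OH) hash each element to a single bucket, so the per-element cost stays constant as k grows.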

Accuracy
Results of cardinality estimation. We compared our method SimCar with the HyperLogLog sketch used by MinHash+HLL and OPH+HLL, as well as the CORE algorithm [59], which is introduced in Algorithm 3 in Section 5. In this experiment, all three sketch methods (HLL, CORE, and SimCar) use the same number of registers, i.e., k1 = k. As shown in Figure 3, our method SimCar is significantly more accurate than HLL and CORE on these real-world datasets. More specifically, SimCar improves on CORE by 18%∼40% and on HLL by 10%∼23%. For example, on Wikipedia, the average NRMSE of SimCar is 0.0285, while those of HLL and CORE are 0.0337 and 0.0475, respectively. Figure 4 further shows the average NRMSEs for different k. We can see that the average NRMSEs of all methods decrease as k increases, and our method is consistently more accurate than HLL and CORE. In addition, we computed the fine-grained NRMSEs with respect to user cardinalities ranging from 1 to 10,000 with k = 2^8 and k = 2^9. As shown in Figure 5, both HLL and CORE exhibit large estimation errors when the user cardinality is small, because each of them uses two different estimators for cardinalities within two different ranges. In contrast, our method SimCar uses just one estimator, and we find that SimCar decreases the NRMSEs of HLL and CORE by 10%∼42% across different user cardinalities.
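The two-range behavior of HLL noted above stems from its estimator switching: the raw harmonic-mean estimator below is accurate for large cardinalities, but HLL must fall back to a different (linear-counting) estimator when many registers are still zero. A minimal sketch of the raw HLL estimator, without those range corrections (the hash and constants are the standard ones, not the paper's code):

```python
import hashlib
import math

def hll_sketch(items, m=256):
    # m registers; each stores the maximum "rank" (position of the lowest
    # set bit, counted from 1) among the hashed items routed to it.
    M = [0] * m
    for x in items:
        h = int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")
        j = h % m                 # register index
        w = h // m                # remaining bits determine the rank
        rank = 1
        while w & 1 == 0 and rank < 64:
            rank += 1
            w >>= 1
        M[j] = max(M[j], rank)
    return M

def hll_estimate(M):
    # Raw HLL estimate: alpha_m * m^2 / sum_j 2^{-M_j}.
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)   # standard bias constant for m >= 128
    z = sum(2.0 ** -r for r in M)
    return alpha * m * m / z
```

For m = 256 registers the relative standard error of this raw estimate is roughly 1.04/sqrt(m) ≈ 6.5%, which is why the fine-grained NRMSE curves flatten once the cardinality is well above the register count.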
Results of common-interest count estimation. As shown in Figure 6a–c, where we set k1 = k, our method SimCar is more accurate than the other two methods for all three schemes introduced in Section 6.2, and scheme 1 performs the worst among the three schemes, with average NRMSEs up to 8.7 times those of the other two schemes. We then raised k1 to 2k, and the results are shown in Figure 6d–f. We can see that SimCar still outperforms MinHash+HLL and OPH+HLL on most of the datasets. Figure 7 shows the average NRMSEs for different k on the datasets Reuters, Tropes, and TREC, where we set k1 = k. We can see that the average NRMSEs of all methods decrease as k increases. For scheme 1, our method SimCar is up to four and three times more accurate than MinHash+HLL and OPH+HLL, respectively. For schemes 2 and 3, we find that OPH+HLL gives results comparable to SimCar when k is of medium size; however, our method SimCar still outperforms OPH+HLL when k ≥ 2^10.
Results of Jaccard similarity estimation and nearest neighbor search. We compared our method with the MinHash sketch and the densified OPH sketch for Jaccard similarity estimation and nearest neighbor search. We used the technique discussed in Section 5.1 to build LSH tables for all three methods. The experimental results in Table 3 demonstrate that our method SimCar is comparable to MinHash and densified OPH on all datasets.
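The LSH tables behind the nearest-neighbor numbers follow the standard banding construction: a length-(b·r) signature (MinHash, densified OPH, or OH) is split into b bands of r rows each, and two users become candidates if they collide in any band. A minimal sketch of this generic construction (not the paper's Section 5.1 code):

```python
from collections import defaultdict

def lsh_tables(signatures, bands, rows):
    # signatures: {user: tuple of bands*rows register values}.
    # One hash table per band; users sharing a band's key land in one bucket.
    tables = [defaultdict(set) for _ in range(bands)]
    for user, sig in signatures.items():
        for b in range(bands):
            key = tuple(sig[b * rows:(b + 1) * rows])
            tables[b][key].add(user)
    return tables

def query(tables, sig, bands, rows):
    # Candidate set = union of the query's buckets across all bands.
    candidates = set()
    for b in range(bands):
        key = tuple(sig[b * rows:(b + 1) * rows])
        candidates |= tables[b].get(key, set())
    return candidates

sigs = {"u": (1, 2, 3, 4), "v": (1, 2, 9, 9), "w": (7, 7, 7, 7)}
tables = lsh_tables(sigs, bands=2, rows=2)
print(query(tables, (1, 2, 3, 4), 2, 2))  # u and v collide in the first band
```

A pair with Jaccard similarity J collides in a given band with probability about J^r, hence becomes a candidate with probability 1 − (1 − J^r)^b; tuning b and r trades the retrieved-set size against the recall reported in Table 3.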

Conclusions and Future Work
In this paper, we developed a framework, SimCar, for mining users' interest sets. SimCar builds an OH sketch for each user that occurs in the data stream of interest, and one can query several mining results over users' interest sets at any time during the stream. Specifically, we developed accurate methods for estimating cardinalities, common-interest counts, and Jaccard similarities of users' interest sets. In addition, we used OH sketches to build LSH tables to search for users with similar interests in sub-linear time. We evaluated the performance of our methods on real-world datasets, and the experimental results demonstrated their effectiveness and efficiency. In the future, we plan to use techniques such as register sharing (i.e., compressing the OH sketches of all users into one large register array) to further reduce memory usage.
Funding: This research was funded by National Key R&D Program of China (2021YFB1715600).

Conflicts of Interest:
The authors declare that they have no conflict of interest.
Based on the above equations and the formula of H(λ), we easily obtain the formula for each entry of the matrix E(H(λ)).