A Data Analytic Algorithm for Managing, Querying, and Processing Uncertain Big Data in Cloud Environments

: Big data are everywhere as high volumes of varieties of valuable precise and uncertain data can be easily collected or generated at high velocity in various real-life applications. Embedded in these big data are rich sets of useful information and knowledge. To mine these big data and to discover useful information and knowledge, we present a data analytic algorithm in this article. Our algorithm manages, queries, and processes uncertain big data in cloud environments. More speciﬁcally, it manages transactions of uncertain big data, allows users to query these big data by specifying constraints expressing their interests, and processes the user-speciﬁed constraints to discover useful information and knowledge from the uncertain big data. As each item in every transaction in these uncertain big data is associated with an existential probability value expressing the likelihood of that item to be present in a particular transaction, computation could be intensive. Our algorithm uses the MapReduce model on a cloud environment for effective data analytics on these uncertain big data. Experimental results show the effectiveness of our data analytic algorithm for managing, querying, and processing uncertain big data in cloud environments.


Introduction
Big data [1][2][3] are everywhere. They are high-veracity, high-velocity, high-value, and/or high-variety data with volumes beyond the ability of commonly-used software to manage, query, and process within a tolerable elapsed time. These high volumes of valuable data can be easily collected or generated at a high velocity from different data sources (which may lead to different data formats) in various real-life applications such as bioinformatics, graph management, sensor and stream systems, smart worlds, social networks, as well as the Web [4][5][6][7][8]. Characteristics of these big data can be described by the following "5 Vs": 1. The mapper applies a mapping function to each value in the list of values and returns the resulting list; 2. The reducer applies a reducing function to combine all the values in the list of values and returns the combined result.
An advantage of using the MapReduce model is that users only need to focus on (and specify) these "map" and "reduce" functions-without worrying about implementation details for the following: • handling machine failures, • managing inter-machine communication, • partitioning the input data, or • scheduling and executing the program across multiple machines.
Several algorithms have been proposed over the past few years to use the MapReduce model-which mines the search space with parallel, distributed, or cloud computing-for big data analytics tasks like classification [20] and clustering [21]. In contrast, we focus on another big data analytics task-namely, association rule mining [22], which discovers interesting knowledge in the form of association rules A ⇒ C revealing associative relationships between (i) shopper market baskets A and C of frequently purchased merchandise items or (ii) collections A and C of frequently co-located events.
By applying association rule mining to valuable big market basket data, data scientists can help shop owners/managers find interesting or popular patterns that reveal customer purchase behaviour. The research problem of association rule mining usually consists of two key steps: 1. Mining of frequent patterns [23], and 2. Formation of association rules (by using the mined frequent patterns as antecedents and consequences of the rules).
Since the introduction of the frequent pattern mining problem [22], numerous studies [24][25][26] have been conducted (e.g., mining frequent patterns from traditional databases of shopper market basket transactions). With these traditional databases of precise data, users definitely know whether an item is present in (or is absent from) a transaction. In contrast, data in many real-life applications are riddled with uncertainty [27][28][29][30][31]. It is partially due to inherent measurement inaccuracies, sampling and duration errors, network latencies, or intentional blurring of data to preserve anonymity. As such, users are usually uncertain about the presence or absence of items. As a concrete example, a physician may highly suspect (but cannot guarantee) that a coughing patient suffers from the Middle East respiratory syndrome (MERS). The uncertainty of such suspicion can be expressed in terms of existential probability (e.g., a 60% likelihood of suffering from the MERS). With this notion, each item in a transaction t j in databases containing precise data can be viewed as an item with a 100% likelihood of being present in t j . To find frequent patterns from uncertain data, several uncertain data mining algorithms (e.g., UF-growth [32], tube-growth [33], and BLIMP-growth [34] algorithms) have been proposed.
For many real-life applications, users look for all frequent patterns. However, for some other real-life applications, users may have some particular phenomena in mind on which to focus the mining (e.g., a physician may want to find only those patients who are suffering from MERS). To avoid waiting for a long period of time for numerous patterns out of which only a tiny fraction may be interesting to the users, constrained pattern mining [35]-which aims to find those patterns that satisfy the user-specified constraints-is in demand. For example, we previously [36] exploited a special class of constraints called SAM constraints. Such exploitation helps reduce the search space when mining patterns satisfying user-specified SAM constraints. However, many commonly used constraints (e.g., sum(X.Expenses) < 300 CHF, which finds every combination X of items with a total cost less than 300 CHF) do not belong to the class of SAM constraints. In the current article, we explore another class of constraints, called anti-monotone (AM) constraints, to which commonly used constraints belong. We explore two sub-classes of anti-monotone constraints: (i) the frequency constraint; and (ii) non-frequency AM constraints.
In this article, our key contribution is our data analytic algorithm called MrCloud-which uses the MapReduce model in cloud environments for managing, querying, and processing uncertain big data. More specifically, MrCloud manages transactions of uncertain big data, allows users to query these big data by specifying anti-monotone constraints expressing their interests, and processes the user-specified constraints to discover useful information and knowledge in the form of frequent patterns from the uncertain big data. So, MrCloud can be considered as a non-trivial integration of (i) frequent pattern mining; (ii) big data mining; (iii) constrained mining; and (iv) uncertain data mining.
The remainder of this article is organized as follows. The next section presents background and related works. In Section 3, we propose our data analytic algorithm for mining patterns satisfying AM constraints from uncertain big data using MapReduce. Evaluation results and conclusions are given in Sections 4 and 5, respectively.

Background and Related Works
In this section, we present some background information and related works on extensions of frequent pattern mining, namely (Section 2.1) big data mining with the MapReduce model, (Section 2.2) constrained mining, and (Section 2.3) uncertain data mining.

Big Data Mining with the MapReduce Model
MapReduce [13] is a high-level programming model for processing vast amounts of data. It usually uses parallel and distributed computing on clouds of nodes (i.e., computers). As implied by its name, MapReduce involves two key functions: "map" and "reduce".
First, the input data are read, divided into several partitions (sub-problems), and assigned to different processors. Each processor executes the map function on each partition (sub-problem). The map function takes a pair of key, value and returns a list of key, value pairs as an intermediate result: where (i) key 2 is a key in some domains; and (ii) value 2 & value 3 are the corresponding values in some domains. Examples of MapReduce applications include the construction of an inverted index as well as the word counting of a document for data processing [13].
To mine frequent patterns from precise data using the MapReduce model, three Apriori-based algorithms called SPC, FPC and DPC [37] were proposed. Among them, SPC uses single-pass counting to find frequent patterns of cardinality k at the k-th pass (i.e., the k-th database scan) for k ≥ 1. FPC uses fixed-passes combined-counting to find all patterns of cardinalities k, (k+1), ..., (k+m) in the same pass or database scan. On the one hand, this fixed-passes technique fixes the number of required passes from k max (where k max is the maximum cardinality of all frequent patterns that can be mined from the precise data) to a user-specified constant. On the other hand, due to combined-counting, the number of generated candidates is higher than that of SPC. In contrast, DPC uses dynamic-passes combined-counting, which takes the benefits of both SPC and FPC by taking into account the workloads of nodes when mining frequent patterns with MapReduce. In addition, a parallel randomized algorithm called PARMA [38] was proposed for mining approximations to the top-k frequent patterns and association rules from precise data by using MapReduce.
As a preview, our MrCloud algorithm also uses MapReduce. However, unlike SPC, FPC or DPC [37] (which use the Apriori-based approach to mine frequent patterns from precise data), our MrCloud uses a tree-based approach to mine frequent patterns from uncertain data-which deals with a much larger search space than that for mining precise data due to the presence of the existential probability values. Moreover, unlike PARMA (which mines all of the approximately frequent patterns from precise data), our data analytic algorithm mines some-specifically, those interesting patterns that satisfy the user-specified constraints-of the truly frequent patterns from uncertain data.

Constrained Mining
Allowing users to specify their interest via the use of constraints [35,39,40] helps guide the data mining process so that only those sets of frequently co-occurring items satisfying the user-specified constraints can be found, which in turn avoids unnecessary computation for mining those uninteresting frequent patterns. These user-specified constraints can be imposed on items, events or objects in various domains, including shopper market baskets, meteorological records, and event planning calendars. In general, these constraints can be categorized into several overlapping classes according to the properties that they possess. In this article, we examine one class of constraints called anti-monotone (AM) constraints, which possess the properties of anti-monotonicity. This class of constraints can be further subdivided into the following two sub-classes: 1. Frequency constraints include the following: C 1 ≡ sup(X) ≥ minsup expresses the user interest in finding frequent patterns from precise data, i.e., every pattern X with actual support (or frequency) meeting or exceeding the user-specified minimum support threshold minsup; and C 2 ≡ expSup(X) ≥ minsup expresses the user interest in finding frequent patterns from uncertain data, i.e., every pattern X with expected support meeting or exceeding the user-specified minimum support threshold minsup.

Non-frequency AM constraints, with examples include the following:
• C 3 ≡ min(X.RewardPoints) ≥ 2000 expresses the user interest in finding every pattern X such that the minimum reward points earned by travellers among all airports visited are at least 2000; • C 4 ≡ max(X.Rainfall) ≤ 10 mm says that the maximum rainfall among all meteorological records in X is at most 10 mm (i.e., "relatively dry"); • C 5 ≡ X.Location = Europe expresses the user interest in finding every pattern X such that all places in X are located in Europe; • C 6 ≡ X.Weight ≤ 23 kg says that the weight of each object in X is at most 23 kg (e.g., no heavy checked baggage for a trip); and • C 7 ≡ sum(X.Expenses) < 300 CHF says that the total expenses on all items in X is less than 300 CHF.
Note that a constraint C is anti-monotone [40] if and only if all subsets of a pattern satisfying C also satisfy C. Equivalently, a constraint C is anti-monotone (AM) if and only if all supersets of a pattern violating C also violate C. A pattern X is valid in a database if such a pattern also satisfies the user-specified constraints. Given (i) a database; (ii) user-specified minsup; and (iii) user-specified AM constraints, the research problem of constrained pattern mining from uncertain data is to discover from the database a complete set of patterns satisfying the user-specified AM constraints (i.e., valid patterns).

Uncertain Data Mining
Let (i) Item be a set of m domain items and (ii) X = {x 1 , x 2 , . . . , x k } be a k-itemset (i.e., a pattern consisting of k items), where X ⊆ Item and 1 ≤ k ≤ m. Then, a transactional database is the set of n transactions, where each transaction t j ⊆ Item (for 1 ≤ j ≤ n). The projected database of X is the set of all transactions containing X. Each item x i in a transaction t j = {x 1 , x 2 , . . . , x h } in an uncertain database is associated with an existential probability P (x i , t j ), which represents the likelihood of the presence of x i in t j [41], with value The existential probability P (X, t j ) of a pattern X in t j is then the product of the corresponding existential probabilities of every item x i within X when these items are independent [41,42]: Finally, the expected support expSup(X) of X is the sum of P (X, t j ) over all n transactions in the database: With this notion of expected support, existing tree-based algorithms-such as UF-growth [32], tube-growth [33] and BLIMP-growth [34]-mine frequent patterns from uncertain data by first scanning the uncertain database once to compute the expected support of all domain items (i.e., singleton patterns). Infrequent items are pruned as their extensions/supersets are guaranteed to be infrequent. The algorithms then scan the database a second time to insert all transactions (with only frequent items) into a tree (e.g., UF-tree [32], TPC-tree [33], or BLIMP-tree [34]). Each node in the tree captures (i) an item x; (ii) its existential probability P (x, t j ); and (iii) its occurrence count. At each step during the mining process, the frequent patterns are expanded recursively.
A pattern X is frequent in an uncertain database if expSup(X) ≥ minsup. Given a database and minsup, the research problem of frequent pattern mining from uncertain data is to discover from the database a complete set of frequent patterns having expected support ≥ minsup.

MrCloud: Our Data Analytic Algorithm
Given (i) uncertain big data; (ii) user-specified minsup; (iii) a user-specified constraint C (e.g., an AM constraint), the research problem of constrained frequent pattern mining from uncertain big data is to discover from big data a complete set of patterns having expected support ≥ minsup and satisfying C (i.e., valid frequent patterns). In this section, we present our data analytic algorithm-called MrCloud-that uses MapReduce for managing, querying, and processing uncertain big data in cloud environments. Specifically, our algorithm manages transactions of uncertain big data, allows users to query these big data by specifying constraints expressing their interests, and processes the user-specified constraints to discover useful information and knowledge from the uncertain big data.

Managing Uncertain Big Data
To manage uncertain big data, our MrCloud algorithm keeps track of both (i) the transactions of uncertain data; and (ii) an auxiliary file capturing information about domain items in uncertain data. Here, items in each transaction of uncertain data are associated with existential probability values expressing the likelihood of these items in the transaction in the form of a set of every item x i with its existential probability value P (x i , t j ): Table 1 for an illustrative sample capturing transactions of uncertain data collected about the airports visited by travellers. Also see Table 2 for the auxiliary information about reward points that can be earned by travellers visiting those airports. For instance, t 1 captures the uncertain data that a traveller may have visited six airports (namely, AMS, BCN, CPH, DEL, EDI and FRA). Among them, it is 100% sure that he has visited BCN and EDI (where he earned 3000 and 2000 points, respectively). There is a 90% chance that he has visited AMS or DEL (where he would earn 2400 and 3200 points, respectively), 50% chance that he has visited CPH (where he would earn 2600 points), and only 20% chance that he has visited FRA (where he would earn 2200 points). Table 1. An illustrative sample set of uncertain big data.

Querying Uncertain Big Data
Once our MrCloud algorithm managed uncertain big data, it allows users to query the data. Users can express their interests by selecting one of SQL-style constraints in the form of (i) "X.attribute θ constant"; (ii) "agg(X.attribute) θ constant"; and (iii) their logical combinations via the conjunction operator "AND" (∧) or the disjunction operator "OR" (∨), where (i) agg is an aggregate function including max, min, sum, and (ii) θ is a comparison operator including >, ≥, =, ≤, <. Examples are not confined to the aforementioned (i) frequency constraints C 1 & C 2 or (ii) non-frequency AM constraints C 3 -C 7 , users can also specify AM constraints involving more than one aggregate function. The following is an example: • C 8 ≡ difference(X.Temperature) = max(X.Temperature) − min(X.Temperature) ≤ 10 • C says that the difference between the maximum and minimum temperatures in X is at most 10 • C (which involves the difference between two aggregate functions maximum and minimum).
Moreover, users can also specify constraints involving more than one AM constraint. Examples include the following: in finding every pattern X with temperature between 20 • C to 30 • C inclusive (which involves a logical conjunction "AND" of two AM constraints); • C 10 ≡ [min(X.RewardPoints) ≥ 2000] ∧ [X.Location = Europe] expresses the user interest in finding every pattern X such that the minimum reward points earned by travellers among all European airports visited are at least 2000 (which again involves a logical conjunction "AND" of two AM constraints); and Rainfall) ≤ 10 mm] expresses the user interest in finding all meteorological records matching "warm" or "relatively dry" patterns (which involves a logical conjunction "OR" of two AM constraints).
Last but not least, users can also specify constraints that are not AM (e.g., C 12 ≡ avg(X.Temperature) ≥ 25 • C expressing the user interest in finding every pattern X with average temperature at least 25 • C). Since the users specify their constraints by selecting one of the SQL-style constraints, MrCloud can easily determine whether the user-specified constraints are AM or not. Table 3 shows examples of the classification. Table 3. Classification of some AM and non-AM constraints.

Processing Uncertain Big Data
Once the users queried uncertain big data by specifying their constraints that express their interest, our MrCloud algorithm processes these user-specified queries to find frequent patterns that satisfy these user-specified constraints. Given (i) an implicit frequency constraint C 2 ≡ expSup(X) ≥ minsup; and (ii) an explicit user-specified constraints, MrCloud explores the anti-monotonicity of these constraints in pruning the search space. More specifically, the implicit frequency constraint satisfies the anti-monotonicity: • If a pattern X is frequent (i.e., expSup(X) ≥ minsup), then all subsets of X are guaranteed to satisfy the AM constraints because expSup(X ) ≥ expSup(X) ≥ minsup for every subset X ⊆ X.
• If a pattern Y is infrequent (i.e., expSup(Y ) < minsup), then all supersets of Y are guaranteed to be infrequent because expSup(Y ) < expSup(Y ) < minsup for every superset Y ⊇ Y . Thus, every superset Y of Y can be pruned.
However, if a pattern X is frequent, then some supersets of X may be frequent while some other may not be frequent. Thus, frequency checking is needed to be applied to every superset of X. Similarly, if the user-specified constraint satisfies the anti-monotonicity, then we can prune the search space due to the following: • If a pattern X satisfies AM constraints, then all subsets of X are guaranteed to satisfy the AM constraints.
• If a pattern Y does not satisfy AM constraints, then all supersets of Y are guaranteed not to satisfy the AM constraints and thus can be pruned.
However, if a pattern X satisfies AM constraints, then some supersets of X may satisfy the AM constraints while some other may not satisfy the AM constraints. Thus, constraint checking is needed to be applied to every superset of X. These observations about anti-monotone constraints hold not only for constraints involving one AM constraint but also constraints involving multiple AM constraints due to the following. Theorem 1. If constraints C a and C b are anti-monotone, then the constraint (C a ∧ C b ) is also anti-monotone.
Proof. For anti-monotone constraints C a and C b , if a pattern X satisfies C a and C b , then all subsets of X are guaranteed to satisfy C a and satisfy C b . In other words, all subsets of X are guaranteed to satisfy (C a ∧ C b ). Conversely, if a pattern Y does not satisfy C a and does not satisfy C b , then all supersets of Y are guaranteed not to satisfy C a and not to satisfy C b . In other words, all supersets of Y are guaranteed not to satisfy (C a ∧ C b ). Thus, (C a ∧ C b ) is also anti-monotone. Theorem 2. If constraints C a or C b are anti-monotone, then the constraint (C a ∨ C b ) is also anti-monotone.
Proof. For anti-monotone constraints C a or C b , if a pattern X satisfies C a and C b , then all subsets of X are guaranteed to satisfy C a or satisfy C b . In other words, all subsets of X are guaranteed to satisfy (C a ∨ C b ). Conversely, if a pattern Y does not satisfy C a or does not satisfy C b , then all supersets of Y are guaranteed not to satisfy C a or not to satisfy C b . In other words, all supersets of Y are guaranteed not to satisfy (C a ∨ C b ). Thus, (C a ∨ C b ) is also anti-monotone.
On the other hand, if the explicit user-specified constraints do not satisfy anti-monotonicity, our MrCloud first discovers all frequent patterns and then verifies the validity of each of these discovered patterns to see if it satisfies the user-specified constraints at a post-processing step.
To process uncertain big data, our MrCloud algorithm first reads and divides the uncertain big data into several partitions and assigns them to different processors. The map function (denoted as map 1 ) receives transaction ID, content of that transaction as input. To facilitate time-efficient and space-efficient constrained frequent pattern mining, MrCloud pushes the user-specified non-frequency AM constraints C AM early in the mining process by pushing them into the map 1 function. So, for every transaction t j , the map 1 function performs constraint checking for C AM and emits an x, P (x, t j ) pair for each occurrence of valid item x ∈ t j (i.e., those domain items satisfying the non-frequency AM constraints): In other words, by specifying the following, the map 1 function produces a list of valid x, P (x, t j ) pairs with many different valid x and P (x, t j ) for the keys and values: for each t j ∈ partition of the uncertain big data do for each item x ∈ t j and {x} satisfies C AM do emit x, P (x, t j ) .
Afterwards, these valid x, P (x, t j ) pairs are shuffled and sorted. Each processor then executes the reduce function (denoted as reduce 1 ) on the shuffled and sorted pairs to obtain the expected support of x. Recall that each item in the uncertain big data is associated with an existential probability value. The reduce 1 function computes the expected support of all domain items (i.e., singleton patterns) by using MapReduce with Equation (1), which can be simplified to become the following when computing singleton patterns: where P (x, t j ) is an existential probability of item x in transaction t j . In other words, the reduce 1 function sums all existential probabilities of x for each valid x to compute its expected support: More specifically, the reduce 1 function finds those items satisfying the frequency constraints by specifying the following: For the explicit user-specified non-frequency non-AM constraints C nonAM , our MrCloud verifies the validity of each discovered frequent {x} returned by the reduce 1 function to see if it satisfies the user-specified constraints. Consequently, we obtain all valid frequent singletons (i.e., domain items that satisfy the user-specified constraints) and their associated existential support values.
MrCloud then proceeds to the next step, which is computationally intensive, by rereading each transaction in the uncertain big data to form an {x}-projected database (i.e., a collection of all prefixes of transactions ending with x) for each valid frequent singleton {x} returned by the reduce 1 function. This second map function (denoted as map 2 ) map 2 : ID of transaction t j , content of t j → list of valid frequent {x}, part of t j with x can be specified as follows: for each t j ∈ partition of the uncertain big data do for each {x} ∈ {x}, expSup({x}) do if prefix of t j ending with x contains items besides x then emit {x}, prefix of t j ending with x .
The worker node corresponding to each partition helps to form an {x}-projected database for every valid frequent item x in the transactions assigned to that partition. The {x}-projected database consists of prefixes of relevant transactions (from the uncertain big data) that end with x. More precisely, the worker node outputs {x}, portion of t j for forming the {x}-projected database pairs.
Then, the reduce function reduce 2 : valid frequent {x}, {x}-projected database → list of valid frequent X, expSup(X) (6) shuffles and sorts these pairs of {x}-projected databases, from which valid frequent non-singleton patterns can be found and their expected support values can be computed. As any non-singleton patterns containing valid singleton items are not guaranteed to be valid, additional constraint check on AM constraints C AM is required when forming the projected database in mining valid frequent patterns. So, the worker node corresponding to each projected database then builds appropriate trees (e.g., UF-tree, TPC-tree, or BLIMP-tree)-based on the projected databases assigned to the worker node-to mine every valid frequent non-singleton pattern X (with cardinality k, where k ≥ 2). The worker node also outputs X, expSup(X) , i.e., every valid frequent non-singleton pattern with its expected support: for each x ∈ {x}-projected database do build a tree for {x}-projected database to find X; if X satisfies C AM and expSup(X) ≥ minsup then emit X, expSup(X) .
Again, for the explicit user-specified non-frequency non-AM constraints C nonAM , our MrCloud verifies the validity of each discovered frequent X returned by the reduce 2 function to see if it satisfies the user-specified constraints. Example 1. Let us consider an illustrative sample set of an uncertain big database (as shown in Table 1) and its auxiliary information (as shown in Table 2) with (i) the user-specified minsup=0.9; and (ii) a user-specified constraint C 10 ≡ [min(X.RewardPoints) ≥ 2000] ∧ [X.Location = Europe] (which expresses the user interest in finding every pattern X such that the minimum reward points earned by travellers among all European airports visited are at least 2000). Based on the auxiliary information, we learn that airports AMS, BCN, CPH, EDI, FRA and HEL (but not DEL, GUM or IST) satisfy C 10 .
More specifically, (i) both DEL and GUM are not in Europe; and (ii) travellers visiting either GUM or IST would not be able to earn at least 2000 reward points.
Then, for the first transaction t 1 , the map 1 function outputs only AMS, 0.

Evaluation Results
We evaluated our proposed data analytic algorithm MrCloud in mining user-specified constraints from uncertain big data. We used various benchmark datasets, which include real-life datasets (e.g., accidents, connect4, and mushroom) from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and the FIMI Repository (http://fimi.ua.ac.be/). We also used IBM synthetic datasets, which were generated using the IBM Quest Dataset Generator [22]. For our experiments, the generated data ranges from 2 M to 10 M transactions with an average transaction length of 10 items from a domain of 1 K items. As the above real-life and synthetic datasets originally contained only precise data, we assigned to each item contained in every transaction an existential probability from the range (0, 1.0]. All experiments were run using either (i) a single machine with an Intel Core i7 4-core processor (1.73 GHz) and 8 GB of main memory running a 64-bit Windows 7 operating system; or (ii) the Amazon Elastic Compute Cloud (EC2) cluster-specifically, 11 High-Memory Extra Large (m2.xlarge) computing nodes (http://aws.amazon.com/ec2/). We implemented existing mining framework [35,39,40], UF-growth [32], tube-growth [33], BLIMP-growth [34], and our data analytic algorithm MrCloud all in the Java programming language. The stock version of Apache Hadoop 0.20.0 was used.
First, we demonstrated the functionality and capability of MrCloud by using (i) a database consisting of items all with existential probability value of 1.0 (indicating that all items are definitely present in the database); and (ii) a user-specified AM constraint. Experimental results show that, in terms of accuracy, our data analytic algorithm returned the same collection of valid frequent patterns as those returned by the existing mining framework [35,39,40] for finding valid frequent patterns from precise data. Note that, in terms of flexibility, MrCloud is not confined to finding valid frequent patterns from a database in which existential probability values of all items are 1.0. MrCloud is capable of finding valid frequent patterns from any database, in which existential probability values of all items are ranging from 0 to 1.0.
Moreover, we also experimented with (i) an uncertain database; and (ii) a user-specified AM constraint with 100% selectivity (so that every item is selected). Experimental results show that, in terms of accuracy, MrCloud returned the same collection of frequent patterns as those returned by UF-growth [32], tube-growth [33] and BLIMP-growth [34]. Note that, in terms of flexibility, MrCloud is not confined to handling AM constraints with 100% selectivity. MrCloud is capable of handling AM constraints with any selectivity.
Afterwards, we demonstrated the efficiency of MrCloud. Figure 1 shows that MrCloud took much shorter runtimes than the runtimes required by the existing UF-growth algorithm [32] when handling AM constraints with 100% selectivity because UF-growth was not designed to handle different selectivity of constraints. Hence, for a fair comparison, we used 100% selectivity for this experiment. We will show in Figure 2 how the runtimes for MrCloud decreased when the selectivity increased (i.e., fewer patterns were selected). For the current experiment, MrCloud was run in the aforementioned EC2 cluster. As a MapReduce-based algorithm, MrCloud takes advantage of all 11 nodes in the EC2 cluster. In contrast, as a non-MapReduce-based algorithm, UF-growth was run on a single machine (i.e., does not take advantage of multiple nodes.) This explains why MrCloud is faster than UF-growth. Moreover, the figure shows the scalability of MrCloud. When the number of transactions in the IBM synthetic dataset increased, the runtimes required by both MrCloud and UF-growth also increased. However, MrCloud required significantly lower runtimes than UF-growth. The same comments apply to other tested datasets (e.g., real-life accidents, connect4, and mushroom datasets).   Figure 2a shows the benefits of constraint pushing in the big data mining process than applying constraint pushing as a post-processing step in the IBM synthetic dataset, while Figure 2b shows those for the three real-life accidents, connect4, and mushroom datasets. Both figures show that, when selectivity decreased (i.e., fewer frequent patterns satisfy the constraints), runtimes also decreased, because (i) fewer pairs were returned by the map function; (ii) fewer pairs were shuffled and sorted by the reduce function; and/or (iii) fewer constraint checks were performed. Figure 3 shows that MrCloud led to high speedup (e.g., more than 7 times for the IBM synthetic dataset) even with just 11 nodes when compared with UF-growth [32].   and Figure 4c for the real-life connect4 and mushroom datasets, respectively.

Conclusions
Big data are everywhere. Existing big data analytic algorithms discover frequent patterns from precise databases. However, there are situations in which data are uncertain. As items in each transaction of these uncertain data are usually associated with existential probabilities expressing the likelihood of these items to be present in the transaction, the corresponding search space for uncertain data is much larger than that for from precise data. This matter is worsened when we are dealing with uncertain big data. Furthermore, in many real-life applications, users may be interested in only a tiny portion of this large search space. To avoid wasting lots of time and space in first discovering all frequent patterns and then pruning uninteresting ones at a post-processing step, we presented in this article a data analytic algorithm called MrCloud that manages uncertain big data, allows users to query these uncertain big data by expressing their interest, and processes the user-specified query by using the MapReduce model in cloud environments. As an output, MrCloud discovers from uncertain big data all and only those patterns that are interesting to the users. Experimental results show the effectiveness of our data analytic algorithm for managing, querying, and processing uncertain big data in cloud environments.