Relaxed Rule-Based Learning for Automated Predictive Maintenance: Proof of Concept

In this paper we propose a novel rule learning approach called Relaxed Separate-and-Conquer (RSC): a modification of the standard Separate-and-Conquer (SeCo) methodology that does not require elimination of covered rows. This method can be seen as a generalization of SeCo and weighted covering that does not suffer from fragmentation. We present an empirical investigation of the proposed RSC approach in the area of Predictive Maintenance (PdM) of complex manufacturing machines, predicting forthcoming failures of these machines. In particular, our experiments use a real industrial case study of a machine that manufactures plastic bottle caps. We compare the RSC approach with Decision Tree (DT) based and SeCo algorithms and demonstrate that RSC significantly outperforms both DT based and SeCo rule learners. We conclude that the proposed RSC approach is promising for PdM guided by rule learning.


Introduction
Rule Learning (RL) is a well-known methodology of Machine Learning (ML). By Occam's Razor [1], smaller models tend to make more accurate predictions. Based on this principle, RL algorithms should be designed to create small sets of small rules.
Arguably, the best known specific method of RL is the Decision Tree (DT) algorithm. However, existing literature highlights the fact that if we insist on the 'tree-likeness' of the rule set, rules become prohibitively long and complicated (Section 1.5.3 of [2]). This is due to the effect of fragmentation. A detailed discussion of this phenomenon is provided in Example 2; see also a related discussion in [3].
Therefore, an alternative approach has been actively investigated in which rules are created one by one without insisting that their set fit a DT structure. In order to design an algorithm based on this approach, the following two questions must be answered.
1. How to create a single rule?
2. How to create a collection of rules?
According to Occam's Razor, an algorithm for composing a single rule must endeavor to make the rule as small and as precise as possible. Thus, the task of rule creation can be envisaged as an optimization problem whose objective function combines these two criteria (plus a high-coverage criterion to avoid overfitting). For a reasonably complex domain such an optimization problem is intractable [1], hence there is no hope of obtaining an 'optimal' rule in a reasonable time. Consequently, the main methodology for obtaining a single rule is greedy local search (mainly hill climbing) [2]. The common feature of these algorithms is the absence of backtracking. In other words, a local search algorithm grows a rule by adding constraints on attributes one by one and cannot remove a constraint once it has been added.
In the above local search framework, the main effort concentrates on the heuristic for choosing the next attribute constraint to add to a rule. Moreover, the constraints have a special form outlined in Example 1 below. Throughout, we identify a dataset with a table, an attribute (attr) with a column of the dataset, and an instance with a row of the dataset.
Example 1. Assume that our learning task is to learn a concept depending on 5 attributes attr_1, ..., attr_5 of a dataset, and that these attributes have integer values ranging between 0 and 50. Then the rules are of the form attr_1 ∈ [10, 40] ∧ attr_3 ∈ [20, 35] → true. The above rule states that every instance of the dataset with the value of attr_1 between 10 and 40 and the value of attr_3 between 20 and 35 'belongs' to the concept.
Note that the constraints in Example 1 are given in the form of intervals. Moreover, the same attribute can occur more than once; in this case, the actual constraint on the attribute is the intersection of the intervals of all occurrences of this attribute. For example, the rule in Example 1 can be rewritten as attr_1 ∈ [0, 40] ∧ attr_3 ∈ [20, 35] ∧ attr_1 ∈ [10, 50] → true.
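The intersection of repeated interval constraints can be sketched as follows (a minimal illustration, assuming a rule is represented as a mapping from attribute index to an inclusive (lo, hi) interval; this representation is our choice, not the paper's):

```python
def normalize(avps):
    """Merge repeated (attr, (lo, hi)) constraints by interval intersection."""
    rule = {}
    for attr, (lo, hi) in avps:
        if attr in rule:
            plo, phi = rule[attr]
            rule[attr] = (max(plo, lo), min(phi, hi))  # intersect with previous
        else:
            rule[attr] = (lo, hi)
    return rule

# The rewritten rule from Example 1:
# attr_1 in [0, 40] ∧ attr_3 in [20, 35] ∧ attr_1 in [10, 50]
avps = [(1, (0, 40)), (3, (20, 35)), (1, (10, 50))]
print(normalize(avps))  # {1: (10, 40), 3: (20, 35)} — the original rule
```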
The rule growth procedure starts from an empty rule and performs a number of iterations. Each iteration chooses an attribute and an interval and adds to the rule being formed the respective constraint, stating that the value of the chosen attribute belongs to the chosen interval. The procedure also needs a terminating condition. An obvious one is that all the instances covered by the current rule are invariant w.r.t. the concept being studied (all belong or all do not belong to the concept). However, such a terminating condition may lead to long rules that are prone to overfitting. To avoid this situation, there are terminating conditions that stop the rule growth procedure even when there is no full invariance.
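The greedy growth loop above can be sketched as follows (an illustrative hill-climbing variant, not the paper's algorithm: rules are dicts of attribute intervals, rows are dicts with an "out" key, and the terminating condition is simply "no candidate improves precision"):

```python
def covers(rule, row):
    """rule: dict attr -> (lo, hi); row: dict attr -> value (plus 'out')."""
    return all(lo <= row[a] <= hi for a, (lo, hi) in rule.items())

def precision(rule, data):
    cov = [r for r in data if covers(rule, r)]
    return sum(r["out"] for r in cov) / len(cov) if cov else 0.0

def grow_rule(data, candidate_avps):
    """Greedy hill climbing without backtracking: on each iteration add the
    AVP that most improves precision; stop when no candidate improves it."""
    rule = {}
    while True:
        best, best_prec = None, precision(rule, data)
        for attr, interval in candidate_avps:
            if attr in rule:
                continue
            trial = dict(rule)
            trial[attr] = interval
            p = precision(trial, data)
            if p > best_prec:
                best, best_prec = (attr, interval), p
        if best is None:
            return rule
        rule[best[0]] = best[1]

data = [{1: 5, 2: 5, "out": 1},
        {1: 5, 2: 0, "out": 0},
        {1: 0, 2: 5, "out": 0}]
print(grow_rule(data, [(1, (3, 10)), (2, (3, 10))]))
# {1: (3, 10), 2: (3, 10)}
```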
Let us now discuss approaches to forming a collection of rules. The main issue that needs to be addressed is the handling of conflicting predictions. Indeed, suppose that the same instance is covered by two rules, one of which states that the instance belongs to the concept and the other that it does not. Another matter to address is the terminating condition for the procedure of forming a rule collection: we add rules one by one to the collection, so when are we to stop? Both of the above issues can be relatively straightforwardly addressed by the methodology of Separate-and-Conquer (SeCo) [2,3]. According to this methodology, when a new rule is formed, the instances covered by this rule are removed. So, the newly formed rules are guaranteed to cover new instances, and the process stops when there are no new instances (of course, the terminating condition can be relaxed to avoid overfitting). Unlike in the case of DT, the rules may overlap. Indeed, suppose a rule R_1 has been formed and the instances covered by it removed. When a new rule R_2 is being formed, the procedure growing it does not 'see' the removed instances, but this does not mean that these instances cannot be covered. However, the rules formed by a SeCo procedure are ordered chronologically (according to the time they have been formed). When a prediction is about to be made about some instance x, the prediction is made by the first rule covering this instance. To understand the intuition, assume that the instance x is covered by the 5th rule. Then rules 1 to 4 do not cover the instance, hence there is no point using them for making the prediction. Rules 6 onward may cover the instance. However, rule 5 has been formed as a result of analyzing a larger training set, so it is rational to assume that it will be more precise on the instances it covers than the subsequent rules.
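The SeCo covering loop and the first-match prediction scheme can be sketched as follows (a simplified illustration using the same dict representation of rules as above; `grow` stands for any rule growth procedure and is passed in as a callable):

```python
def covers(rule, row):
    """rule: dict attr -> (lo, hi); row: dict attr -> value (plus 'out')."""
    return all(lo <= row[a] <= hi for a, (lo, hi) in rule.items())

def seco(data, grow):
    """Separate-and-Conquer: grow a rule on the remaining rows, then remove
    ('separate') the rows it covers; repeat while positives remain."""
    remaining = list(data)
    rules = []
    while any(r["out"] == 1 for r in remaining):
        rule = grow(remaining)
        if not rule:
            break
        rules.append(rule)  # chronological order
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules

def predict(rules, row):
    """The first rule (in chronological order) covering the row predicts 1."""
    for rule in rules:
        if covers(rule, row):
            return 1
    return 0

rules = [{1: (10, 40)}, {2: (0, 5)}]
print(predict(rules, {1: 20, 2: 99}))  # 1, matched by the first rule
print(predict(rules, {1: 50, 2: 50}))  # 0, no rule covers the row
```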
Both DT and SeCo methods suffer from fragmentation, though for the SeCo algorithm the effect is milder, as demonstrated by Example 2 below.

Example 2 (Fragmentation). Consider the same attribute settings as in Example 1. Suppose that the concept can be described as the disjunction of the following two rules. The sets of instances covered by R_1 and R_2 clearly overlap. Therefore, after a SeCo procedure discovers the first rule, it is likely to be more difficult to discover the second rule.
Indeed, suppose such a procedure discovers R_1. After that, the rule learner will have to look at the part of the dataset that is not covered by R_1 in order to discover rule R_2.
A rule learner that learns non-overlapping rules (e.g., a DT algorithm) will have to discover rules that are covered by R_2 but not by R_1. There are several ways to present the corresponding set of rules; the most compact of them would look as follows. That is, instead of a single rule of length 2, the rule learner will have to learn 4 rules of length 3. In the case of k short rules, the number of rules to be learned grows exponentially with the number of rules already learned. We informally refer to this effect as fragmentation.
In the case of the more general SeCo method, the situation is not as acute as in the case of non-overlapping rule learning, because there is no need to avoid overlapping with R_1. However, the dataset resulting from removal of the rows covered by R_1 is smaller than the original dataset and, more importantly, is 'distorted' by a non-uniform removal of instances. As a result, learning R_2 in this distorted dataset becomes more difficult than in the original one. In the case of more than two rules to be learned, this difficulty becomes even more pronounced.
One way to address the above deficiency is to assign weights that direct the RL algorithm towards instances that have not yet been covered. The two main approaches to doing so are weighted covering [2] and boosting [4].
Weighted covering [2] attempts to generalize the SeCo method as follows. We can see SeCo as a method that assigns weight 1 to the instances not yet covered by the existing rules and 0 to the instances that are covered. New rules are then sought over instances of non-zero weight. Weighted covering uses more flexible methods of weight assignment. The related heuristics are organized so as to choose heavier instances; as a result, this creates a 'fuzzy' version of SeCo. It is important to note that, whichever way the weights are assigned, the part of the dataset covered by the existing rules will be discriminated against in favor of the instances that are not yet covered. In other words, the distortion of the dataset presented in Example 2 will still be present in the case of weighted covering.
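One common weight assignment scheme is multiplicative decay, where an instance's weight is reduced each time a rule covers it (the halving factor here is an illustrative assumption, not the paper's choice):

```python
def update_weights(weights, covered, decay=0.5):
    """Decay the weight of every covered instance; uncovered weights stay 1."""
    return [w * decay if i in covered else w for i, w in enumerate(weights)]

w = [1.0, 1.0, 1.0, 1.0]
w = update_weights(w, covered={0, 1})  # rule 1 covers rows 0 and 1
w = update_weights(w, covered={1, 2})  # rule 2 covers rows 1 and 2
print(w)  # [0.5, 0.25, 0.5, 1.0]
```

Note that covered rows are discriminated against but never excluded, which is the key difference from SeCo's hard 0/1 weighting.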
The boosting method [5] is a theoretical approach whose purpose is to show that a reasonable but not very accurate learning algorithm can undergo several rounds of retraining to learn a concept with an arbitrary degree of accuracy. The idea applied to RL [4] is that the learning algorithm first produces a single rule and then, as a result of boosting, new rules are added to the collection.
In this paper we propose an alternative rule learning approach that keeps considering instances already covered by the previous rules. We call this approach Relaxed Separate-and-Conquer (RSC). In particular, when a new rule is formed, it is required to cover at least one instance not covered by the previous rules. This means that already covered instances are neither excluded (as in Separate-and-Conquer) nor discriminated against (as in weighted covering). However, the algorithm is forced to look not only at the already covered instances but also elsewhere.
The proposed RSC approach generalizes both SeCo and weighted covering. In particular, any reasonable rule growing heuristic for SeCo or weighted covering can be simulated by an appropriate rule growing heuristic for the RSC method. Also, the rule growing heuristic can control the 'tree-likeness' of the rules and hence can simulate any DT algorithm. More technical details related to the generalization are provided in Section 2.1 (Theorem 1). In addition, in that subsection we propose Conjecture 1 formalising the intuition outlined in Example 2 regarding the advantage of RSC over SeCo.
In this paper we apply the RSC method in the area of failure prediction of complex manufacturing machines. These sophisticated machines are equipped with a series of sensors and actuators that provide a combination of real-time data about the state of the machines (performance) and the product state (quality) during the production process. The attributes of the dataset correspond to sensors and the values of the attributes are the respective sensor readings. The binary outcome column is interpreted as an alarm (outcome 1) or no alarm (outcome 0). The purpose of failure prediction is to notify the operator of a forthcoming failure. Therefore, there is no point in learning rules whose outcome is 0; that is, all the rules learned by the method we are going to present have 1 (an alarm) as the outcome. This allows us to introduce the following two simplifications.
1. The outcome can be omitted. Therefore, each rule can be presented as a conjunction of (attribute, interval) pairs.
2. Since all the rules have outcome 1, conflicting predictions between overlapping rules cannot occur.
The failure prediction task explored in this paper is an important method of Predictive Maintenance (PdM). PdM is a set of techniques helping engineers to organize maintenance based on actual information about forthcoming failures [6]. The main aim of PdM is to reduce the operating costs of two other maintenance strategies [6]: (1) Run-to-Failure (R2F), where corrective maintenance is performed only after the occurrence of failures; and (2) Preventive Maintenance (PvM), where equipment checks are performed at fixed periods of time. PdM also prolongs the useful life of the equipment [7] and optimizes the use and management of assets [8]. PdM uses predictive techniques, based on continuous machine monitoring, to decide when maintenance needs to be performed. Two main approaches to the design of PdM software are Discrete-Event Simulation [9] and ML.
The ML approach is based on prediction of future performance from historical data. Large volumes of past performance data have been collected in large enterprises. With the advent of modern ML approaches, the analysis of these data can provide very useful information about future performance. There are many results applying ML to the past performance data of equipment; see the surveys [10-12] for comprehensive overviews. The existing ML methodologies for PdM are based on methods such as Support Vector Machines [6,13-18], k-Nearest Neighbors [6,13,16], Artificial Neural Networks and Deep Learning [16,19,20], stochastic processes [21], K-means [13,16,22], and Bayesian reasoning [23]. Ensemble methodologies, where several methods are combined and the weighted average of their predictions is used, are reported in e.g. [24-26].
Rule-based methods are rather under-represented in PdM. DT based methods have been proposed in e.g. [16,27,28]. Random Forest (RF) based methods have been used in e.g. [29-31]. The use of more generic rule learning such as SeCo is even more limited in the area of PdM: we are only aware of the work [32] (thanks to the anonymous reviewer for bringing this paper to our attention). Our paper reports progress towards further exploration of the potential of rule learning in the area of PdM.
In the context of failure prediction, we report the following technical results.
1. We present a generic framework for forming a collection of rules according to the RSC approach. In particular, this framework allows the implementation of a wide range of heuristics.
2. We present one particular heuristic that aims to maximize the precision of the newly formed rule as well as the coverage of the positive instances that are not covered by the previous rules.
3. We present an empirical investigation of the resulting rule learner. In particular, we compare the RSC approach with DT based and SeCo rule learners on two domains: (a) a randomly generated dataset simulating alarms caused by a small number of factors; (b) a real industrial dataset collected from a machine which manufactures plastic bottle caps. This dataset records alarms that occurred in this machine and the associated sensor readings.
In both cases the RSC algorithm significantly outperforms the DT based rule learner and the SeCo method using the same heuristic. RSC produces a set of rules that is smaller and much more accurate than those of the DT based and SeCo rule learners. We conclude that RSC is a promising approach deserving further investigation.
The rest of the paper is organized as follows. In Section 2 we describe the Relaxed Separate-and-Conquer (RSC) rule learning approach and provide its theoretical justification. In Section 3 we describe the experiments. Section 4 concludes the paper.

Relaxed Separate-and-Conquer Rule Learning approach
In this section, we describe the Relaxed Separate-and-Conquer (RSC) method of rule learning. We emphasize that, like Separate-and-Conquer (SeCo), this is an approach rather than a single algorithm: several heuristic choices need to be made in order to turn this approach into an algorithm. We present the approach equipped with a quite straightforward heuristic based on common sense. We also demonstrate that the RSC approach generalizes the SeCo and weighted covering methods, the latter under a mild restriction.
In order to present the RSC approach, we first introduce the related terminology. Our dataset is presented as a table called DATA having n + 1 columns. The first n columns are referred to as attributes attr_1, ..., attr_n. The values of attr_i are integer numbers between 0 and some maximum possible value max_i. The last column is called the outcome and denoted by out. The out column is binary with possible values 1 (interpreted as 'alarm') and 0 ('no alarm'). Our aim is to learn rules predicting alarms. We assume that no two distinct rows have the same tuple of attribute values, to make sure that the dataset represents a function.
A rule is a set of attribute-value pairs (AVPs), where an AVP is a pair (attr_j, [a, b]) constraining the value of attr_j to lie in the interval [a, b]. A row of DATA is covered by the rule if it is covered by all of its AVPs. In other words, we can see a rule as a conjunction of AVPs.

Definition 3. A collection of rules is a set of rules. A row is covered by a collection of rules if it is covered by at least one rule of the collection.
Thus a collection of rules can be seen as a monotone (negation-free) Disjunctive Normal Form (DNF) with AVPs used instead of Boolean variables.
The algorithm consists of a generic function for forming a collection of rules and a function for growing a single rule, which needs a heuristic to choose the next AVP (if any) to add to the current rule.
The main function is called RSC, which is an abbreviation of Relaxed Separate-and-Conquer. It starts with an empty collection of rules and repeatedly runs a function FormRule that returns a rule. If this rule is not empty then it is added to the collection. If the rule returned by FormRule is empty then the algorithm stops and the collection formed so far is returned. The pseudocode of the RSC and FormRule functions is given in Algorithm 1.
Note that it is the responsibility of the function FormRule to ensure that the loop of the function RSC stops: FormRule must eventually return an empty rule. FormRule runs a heuristic function called ChooseNext. ChooseNext either returns an AVP, which is added to the rule being formed, or returns nil, meaning that the heuristic has determined that the current rule should not be grown further. In this case FormRule returns the current rule.
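The structure of the RSC and FormRule functions can be sketched as follows (an illustrative rendering of Algorithm 1 under the same dict representation of rules used earlier; the stub ChooseNext at the end is purely hypothetical, standing in for the heuristic described below):

```python
def rsc(data, form_rule):
    """Main RSC loop: keep adding rules until FormRule returns an empty rule."""
    collection = []
    while True:
        rule = form_rule(data, collection)
        if not rule:              # empty rule: terminate
            return collection
        collection.append(rule)

def make_form_rule(choose_next):
    """FormRule grows a rule by repeatedly asking the ChooseNext heuristic
    for the next AVP; nil (None) stops the growth."""
    def form_rule(data, collection):
        rule = {}
        while True:
            avp = choose_next(data, collection, rule)
            if avp is None:
                return rule
            attr, interval = avp
            rule[attr] = interval  # a fuller version would intersect intervals
    return form_rule

# A stub ChooseNext that emits one AVP for the first rule, then stops:
def stub_choose_next(data, collection, rule):
    if not collection and not rule:
        return (1, (0, 5))
    return None

print(rsc([], make_form_rule(stub_choose_next)))  # [{1: (0, 5)}]
```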
The ChooseNext heuristic is, as we mentioned above, central in turning the approach into an algorithm. The heuristic chooses whether to return an AVP and, if so, which one to return. The RSC approach does not prescribe a particular algorithm for ChooseNext; however, it imposes one important constraint: the returned AVP must cover a row not covered by the current collection of rules. The particular algorithm for ChooseNext presented below is just one possible variant fitting the pattern. First of all, for the sake of speed, rather than running through all AVPs, the heuristic runs only through half-intervals of the attributes, as defined below.
In other words, (j, [a, b]) is a half-interval if either a is the initial value of attr_j or b is the final value of this attribute. For attribute j there are 2·max_j half-intervals but O(max_j^2) AVPs in general. Therefore, going through half-intervals only significantly saves runtime. Note that the expressive power is not affected, because any AVP can be seen as a rule containing two half-intervals.
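The enumeration of half-intervals can be sketched as follows (skipping the vacuous full interval [0, max_j] is our assumption, chosen so that the count matches the 2·max_j figure in the text):

```python
def half_intervals(max_j):
    """Enumerate the 2*max_j half-intervals of an attribute with values
    0..max_j. Either endpoint a is the initial value 0, or endpoint b is
    the final value max_j; the vacuous [0, max_j] is skipped."""
    lower = [(0, b) for b in range(max_j)]              # a fixed at 0
    upper = [(a, max_j) for a in range(1, max_j + 1)]   # b fixed at max_j
    return lower + upper

print(len(half_intervals(50)))  # 100, versus O(50**2) general intervals
```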
The pseudocode of the ChooseNext heuristic is provided in Algorithm 2. ChooseNext uses two auxiliary functions: IsChosen and IsReplaced. The function IsChosen operates when no AVP has yet been chosen to add to the current rule; this function decides whether the currently considered interval is a viable (though possibly not the best) candidate for the rule growth. The function IsReplaced operates when there is already a candidate AVP to be returned and a new one is considered; the function decides whether the new AVP is preferable to the current favorite.
It is the responsibility of IsChosen to ensure that the whole algorithm does not enter an infinite loop. In particular, when all the rows with outcome 1 have been covered by the current collection of rules, IsChosen must reject all the candidate AVPs. Then an empty rule will be returned by the function FormRule and the run of the main function RSC will be terminated.
In order to describe the functions IsChosen and IsReplaced, we need to introduce some terminology. First of all, each row of the table DATA is associated with its index (row 1, row 2, and so on). When we refer to a set of rows, we mean the set of their respective indices.
Let R be a rule. We denote by POS(R) and NEG(R) the sets of rows covered by R that have positive and negative outcomes, respectively. That is, POS(R) ∪ NEG(R) is the total set of rows covered by R.
Let C be a collection of rules. Then POS(C) = ∪_{R∈C} POS(R). In other words, the set of positive rows covered by the collection is the union of the positive rows covered by the rules of this collection.

Definition 6. Let C be a collection of rules and let R be a rule such that R ∉ C. Then the free coverage of R w.r.t. C is POS(R) \ POS(C), denoted by Free(R, C). In other words, the free coverage corresponds to the positive rows that are covered by the new rule R being formed but not by the current collection C of rules.

The pseudocode of the function IsChosen is given in Algorithm 3. IsChosen uses two parameters (thresholds), init_free and init_prec. They are not specified by the algorithm and their right values are determined by experiments. IsChosen decides not to grow the rule with an AVP if the result of adding the AVP to the current rule covers fewer 'new' positive rows than the specified threshold. For this condition to prevent the whole algorithm from running into an infinite loop, init_free must be at least 1. Setting the parameter to a larger value will force new rules to cover more new positive rows and, as a result, potentially decrease the total number of rules needed. The initial precision threshold init_prec is not necessary for a properly functioning algorithm. However, by making sure that the initial precision is sufficiently high, the algorithm potentially avoids the creation of overly long rules.
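The free coverage computation and the acceptance test of IsChosen can be sketched as follows (an illustrative version, not Algorithm 3 itself; the init_prec default of 0.5 is an arbitrary placeholder, and rules/rows use the same dict representation as before):

```python
def covers(rule, row):
    return all(lo <= row[a] <= hi for a, (lo, hi) in rule.items())

def pos(rule, data):
    """POS(R): indices of positive rows covered by the rule."""
    return {i for i, r in enumerate(data) if r["out"] == 1 and covers(rule, r)}

def free(rule, collection, data):
    """Free(R, C): positive rows covered by R but by no rule in C."""
    covered = set()
    for R in collection:
        covered |= pos(R, data)
    return pos(rule, data) - covered

def is_chosen(rule, collection, data, init_free=1, init_prec=0.5):
    """Accept a candidate grown rule only if it covers at least init_free
    new positive rows and meets the initial precision threshold."""
    if len(free(rule, collection, data)) < init_free:
        return False
    cov = [r for r in data if covers(rule, r)]
    prec = sum(r["out"] for r in cov) / len(cov) if cov else 0.0
    return prec >= init_prec

data = [{1: 2, "out": 1}, {1: 8, "out": 0}]
print(is_chosen({1: (0, 5)}, [], data))            # True: one new positive row
print(is_chosen({1: (0, 5)}, [{1: (0, 5)}], data)) # False: no free coverage
```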
The pseudocode of the function IsReplaced is also provided in Algorithm 3. IsReplaced compares two different AVPs to be added to the current rule. In order to compare them, IsReplaced forms two new rules, Rule_1 and Rule_2: Rule_1 with the current best candidate added to the current rule and Rule_2 with the new AVP added. If Rule_2 covers fewer new rows than init_free then the new AVP is immediately discarded. The new AVP replaces the current one if the precision of Rule_2 is greater than the precision of Rule_1. Another reason to prefer the new AVP is if Rule_2 and Rule_1 have the same precision but Rule_2 has a greater free coverage. In fact, the function is ready to sacrifice a little precision for the sake of greater coverage. In particular, we introduce a parameter prec_loss and consider Rule_2 preferable to Rule_1 if the precision of Rule_2 is at least the precision of Rule_1 minus prec_loss but its free coverage is larger.
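The comparison performed by IsReplaced can be sketched as follows (again an illustrative version rather than Algorithm 3; the prec_loss default of 0.005 matches the 0.5% used in the experiments, while everything else is a sketch):

```python
def covers(rule, row):
    return all(lo <= row[a] <= hi for a, (lo, hi) in rule.items())

def precision(rule, data):
    cov = [r for r in data if covers(rule, r)]
    return sum(r["out"] for r in cov) / len(cov) if cov else 0.0

def free(rule, collection, data):
    pos = lambda R: {i for i, r in enumerate(data)
                     if r["out"] == 1 and covers(R, r)}
    covered = set()
    for R in collection:
        covered |= pos(R)
    return pos(rule) - covered

def is_replaced(rule1, rule2, collection, data, init_free=1, prec_loss=0.005):
    """Does the new candidate Rule2 replace the current favorite Rule1?"""
    if len(free(rule2, collection, data)) < init_free:
        return False                      # too little new coverage: discard
    p1, p2 = precision(rule1, data), precision(rule2, data)
    if p2 > p1:
        return True                       # strictly more precise
    # trade a little precision for strictly more free coverage
    return (p2 >= p1 - prec_loss and
            len(free(rule2, collection, data)) >
            len(free(rule1, collection, data)))

data = [{1: 2, "out": 1}, {1: 3, "out": 1}, {1: 8, "out": 0}]
print(is_replaced({1: (2, 2)}, {1: (2, 3)}, [], data))  # True: same precision, more coverage
print(is_replaced({1: (2, 2)}, {1: (2, 8)}, [], data))  # False: precision drops too far
```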
The purpose of the parameters. The parameter init_prec allows the whole algorithm to stop even if a small percentage of positive rows remains uncovered. In particular, this parameter is used to fight off noise. The parameter prec_loss helps to create rules that are possibly not 100% accurate but have good coverage. Changing these parameters can affect (positively or negatively) the quality of rule learning. An extensive study of the right choice of parameters for SeCo has been performed in [33,34]. Studying the interplay of these parameters for RSC is left for future work.
SeCo with the ChooseNext heuristic. Below and in the next section, we use the SeCo method running exactly the same ChooseNext heuristic (Algorithms 2 and 3) as RSC. The only modification needed is a different way to calculate precision: without taking into account the rows covered by the collection of existing rules. Let us state this formally.
Let C be the current collection of rules. Then, for SeCo, the precision of a candidate rule is computed only over the rows not covered by C. The only constraint of the RSC method is that each new rule is reasonable w.r.t. the collection of the previously formed rules, that is, it covers at least one positive row not covered by that collection. This condition is significantly weaker than that required for SeCo.
Indeed, let C be a collection of rules and recall that COV(C) = ∪_{R∈C} (POS(R) ∪ NEG(R)). The SeCo method, having formed C, excludes the rows of COV(C) from the dataset. A new rule R must have positive coverage outside of COV(C); otherwise such a rule simply does not make sense. Clearly, such a rule R is reasonable.
Note also that the ChooseNext heuristic receives the current collection C of rules as an argument. Therefore, ChooseNext can compute COV(C) and hence implement any SeCo heuristic. We conclude that the RSC method generalizes SeCo.
Having access to COV(C) also allows the implementation of any weight function within ChooseNext. We conclude that any rule growing heuristic for weighted covering that is guaranteed to return a reasonable rule w.r.t. C can be implemented within ChooseNext.
The above discussion is summarized by the following theorem.
Theorem 1. The RSC method is a generalization of the SeCo method. The RSC method also generalizes weighted covering for an arbitrary rule growing heuristic guaranteed to return a reasonable rule w.r.t. the currently formed collection of rules.
1. It is unlikely that weighted covering can simulate RSC. Indeed, any assignment of weights discriminates against the rows covered by the existing collection of rules. This is a stronger constraint than the RSC requirement that the new rule must merely be reasonable.
2. Our implementation of the RSC method maintains COV(C) and Free(R, C), where C is the current collection of rules and R is the new rule being formed. Therefore, simulation of the weighted covering or SeCo methods does not involve any computational overhead.
3. Since the ChooseNext heuristic receives the current collection of rules as an argument, it can enforce tree-likeness of the collection of rules. Hence, any DT algorithm can easily be implemented within this framework.
Thus we have seen that RSC generalizes SeCo. We now need to show whether there is any advantage in this generalization. In the next section, we provide empirical evidence to that effect. In the rest of this section we argue that RSC is better than SeCo from a theoretical perspective as well.
In particular, we propose a conjecture that, in order to have a comparable performance, SeCo must have a much larger training set.
This conjecture is stated for a broad domain called truth table learning; see Section 3.1.
We start by considering one particular scenario in which a rule learner has little choice but to reach a wrong conclusion. In particular, consider a set of rules R = (x_1 ∧ x_2) ∨ (x_3 ∧ x_4) over a binary domain (that is, two rules x_1 = 1 ∧ x_2 = 1 and x_3 = 1 ∧ x_4 = 1). Assume further that the rule learning algorithm runs on the following rather unfortunate training set: in all the rows covered by R the variable x_5 equals 1, and in all the rows not covered by R the variable x_5 equals 0. In this case, a rule learner seeking a short rule would gladly report that the underlying rule is x_5 (that is, the outcome equals 1 whenever x_5 = 1).
The above anomaly can easily occur in small training sets, but the larger the training set becomes, the less likely anomalous patterns are to occur, because many random choices tend to concentrate around the expectation. In the particular example above, the values of x_5 can be considered as outcomes of independent coin tosses. If there are many such tosses then the percentages of 1 and 0 outcomes are likely to be close to 50%. Consequently, if there are many rows covered by rule R and many rows not covered by rule R, then the above anomaly is very unlikely to happen.
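The coin-toss argument can be made quantitative: under the assumption that the spurious variable is an independent fair coin, the probability that it agrees with the outcome on every row halves with each additional row.

```python
from fractions import Fraction

def anomaly_prob(n_rows):
    """Probability that an irrelevant fair binary variable agrees with the
    outcome on all n_rows rows of the training set (independent tosses)."""
    return Fraction(1, 2) ** n_rows

print(anomaly_prob(4))    # 1/16: quite plausible in a tiny training set
print(float(anomaly_prob(100)))  # astronomically small for a large set
```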
The discussion above suggests that a rule learner needs a sufficiently large training set in order to work properly. Let us formalise this intuition. Suppose that we have n variables and the domain of each variable has m values. Further, let f be a function on these variables induced by a set S of at most r random rules, each involving at most k variables. Let A be a rule learner. Denote by Q_A(n, m, r, k) the size of a training set such that, with high probability, A guesses the function correctly given a training set of this size. Denote by Q_SeCo and Q_RSC the respective training set sizes for SeCo with the ChooseNext heuristic and for RSC. Then we make the following conjecture.

Conjecture 1. Q_SeCo(n, m, r, k) is exponentially (by a factor of about 2^r) larger than Q_RSC(n, m, r, k).
The intuition behind this conjecture is that SeCo in fact considers not one but many training sets, obtained from the original set by removal of the rows covered by the already discovered rules. Since we do not know in advance which rules will be discovered first, we must consider removal of the rows covered by all possible 2^r subsets of rules. After those removals, the remaining training set must still be sufficiently large to derive the remaining rules. RSC, on the other hand, is not subject to such a constraint. Thus we predict that the training set needed for good performance of SeCo is larger by a factor exponential in r than that needed for RSC. This exponential factor is the price paid for the distortion of the learning space carried out by SeCo.
Conjecture 1 is closely related to the so-called Juntas Learning Problem [35], which is essentially a theoretical abstraction of the task of feature selection. The important difference is that we consider not the problem in general but rather specific algorithms for solving it.

Experiments
The purpose of this section is to empirically assess the potential of our Relaxed Separate-and-Conquer (RSC) approach. For this purpose, we compare RSC with Decision Tree (DT) and Separate-and-Conquer (SeCo) methods. We use the SeCo method equipped with the same heuristic as RSC (but computed over the part of the dataset not yet covered by the current collection of rules). Below we overview the DT method that we use for the experiments.
In the context of ML, a DT is a directed rooted tree whose non-leaf nodes correspond to conditions on attributes of a dataset and whose leaves correspond to outcomes. The outgoing edges of each non-leaf node are labeled with True and False, meaning whether or not the condition associated with that node is satisfied. Thus each edge is associated with a condition which is either the condition associated with its tail or the negation of this condition.
The semantics of a DT is tied to its root-leaf paths. Each such path P is seen as the set of conditions Cond_1, ..., Cond_q associated with the edges of P plus the outcome out associated with the leaf. Thus each root-leaf path P of a DT can be seen as a rule of the form Cond_1 ∧ ... ∧ Cond_q → out, where Cond_1 ∧ ... ∧ Cond_q is the body of the rule, consisting of the conjunction of the individual conditions, and out is the outcome of the rule.
The procedure of turning a DT into a set of rules as described above is called linearization. For example, the rules corresponding to the DT in Figure 1 are the following: We use a standard DT algorithm provided by the ML Python library Scikit-Learn [36], with the Gini index serving as the splitter and the DT depth upper-bounded by 7. To obtain a set of rules, the resulting DT is linearized. For failure prediction we have only two types of outcome: out = 1 associated with a failure and out = 0 otherwise. We record only those rules whose outcome is 1. In other words, we ignore the rules with outcome 0, which explain why a particular failure does not occur, because these rules are simply not relevant for our task.
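The linearization step, restricted to rules with outcome 1, can be sketched as follows (a minimal illustration using Scikit-Learn's public `tree_` attributes; the toy one-feature dataset and the string rendering of conditions are our assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def linearize(clf):
    """Walk the fitted tree from the root; each root-leaf path whose
    predicted class is 1 becomes one rule (a list of edge conditions)."""
    t = clf.tree_
    rules = []
    def walk(node, conds):
        if t.children_left[node] == -1:            # leaf node
            if np.argmax(t.value[node][0]) == 1:   # keep outcome-1 rules only
                rules.append(conds)
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node],  conds + [f"x{f} <= {thr:.3f}"])
        walk(t.children_right[node], conds + [f"x{f} > {thr:.3f}"])
    walk(0, [])
    return rules

# Toy data: outcome 1 iff x0 >= 2.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
clf = DecisionTreeClassifier(criterion="gini", max_depth=7).fit(X, y)
print(linearize(clf))  # a single rule of the form ['x0 > ...']
```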
Choice of parameters. As specified in the previous section, the RSC algorithm requires the setting of three parameters: init_free, init_prec, and prec_loss. In all of our experiments, we set these parameters to 1, 1%, and 0.5%, respectively.
The rest of the section consists of two subsections. In each subsection we consider a particular domain and compare our RSC approach with the DT based and SeCo rule learners on this domain.

Learning the Truth Table of the Given Collection of Rules
Any function over finite-domain variables can be defined by a truth table. The truth table consists of all possible tuples of assignments of domain values to the variables, and each tuple is associated with the respective value of the function.
In our first experiment we randomly generate a small collection of small rules and then randomly select a subset S of rows of the truth table of the collection. Next, we run an RL algorithm (RSC, SeCo, or a DT based rule learner) on S with the goal of creating a collection of rules matching the original one as closely as possible.
The rest of the subsection is organized as follows.
1. We define a truth table for a collection of rules.
2. We specify an algorithm for the generation of a random collection of rules and of a random subset of its truth table.
3. We describe the tests that we performed and their results.

Truth table for a collection of rules and the induced function.
A collection of rules can be associated with many truth tables. This is because, in addition to the variables of the rules, the truth table can also contain many variables that are not essential for the rules. However, since the RL algorithm does not 'know' that these extra variables are inessential, they make the RL task more difficult.
For example, consider a single rule consisting of a single AVP (x_1, [2,4]). A truth table for this rule may consist of 100 variables x_1, ..., x_100, each with domain, say, {1, ..., 10}. However, the value of the respective function is determined only by the above AVP: it is 1 if the value of x_1 is between 2 and 4, and 0 otherwise.
Having in mind the above example, we give below a formal definition of a truth table for a collection of rules. As an intermediate notion we also define the function induced by the collection of rules, a notion that we will use for the description of a training set.

Definition 8. Let C be a collection of rules and let X be the set of variables of C. For each x ∈ X, let val(x) be the set of values of x used in the rules of C. Let X* be a set of variables such that X ⊆ X*. For each x ∈ X*, the domain dom(x) of x is defined under the following constraint: if x ∈ X then val(x) ⊆ dom(x); otherwise, dom(x) is an arbitrary finite set. The function f induced by C is then defined as follows. Let X* = {x_1, ..., x_n} and let (val_1, ..., val_n) be a tuple of assignments of domain values to the respective variables. If this tuple is covered by at least one rule of C then the corresponding value of f is 1, otherwise it is 0.
Given X* and the domains of its variables as above, the truth table of C is the truth table of f. The columns of the table correspond to x_1, ..., x_n, with the last column reserved for the outcome. The rows of the table are all the tuples of assignments to X*, each together with the corresponding value of f as described above.
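Definition 8 can be sketched in code as follows. The representation of a rule as a list of AVPs (variable index, (lo, hi)) and all helper names are our own illustrative choices, not the paper's implementation.

```python
# Minimal sketch of the function induced by a collection of rules and of its
# truth table (Definition 8). A rule is a list of AVPs (var, (lo, hi)).
from itertools import product

def covers(rule, row):
    """True iff every AVP of the rule is satisfied by the row."""
    return all(lo <= row[var] <= hi for var, (lo, hi) in rule)

def induced_f(rules, row):
    """Value of the function induced by the rule collection on one row."""
    return 1 if any(covers(r, row) for r in rules) else 0

def truth_table(rules, domains):
    """All rows over the given variable domains with the value of f appended."""
    return [row + (induced_f(rules, row),) for row in product(*domains)]

# One rule (x1, [2, 4]) over three variables with domain {1, ..., 5} each:
rules = [[(0, (2, 4))]]
table = truth_table(rules, [range(1, 6)] * 3)
# f depends only on x1, so 3 of the 5 values of x1 give f = 1:
print(sum(row[-1] for row in table))   # 75  (= 3 * 5 * 5 rows with f = 1)
```

The two extra variables are inessential, exactly as in the example above: the outcome column is determined by x_1 alone.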

Generation of a random collection of rules and a random subset of the related truth table.
1. Choose the following parameters: (a) num_attr, the number of attributes; (b) max_val, the largest value for each attribute, meaning that the attribute values lie in the interval [0, max_val]; (c) num_rules, the number of rules to be generated; (d) len_rule, the length of the generated rules; (e) num_rows, the number of rows of the training set.
2. Randomly generate a collection C of num_rules rules. Each rule is a random combination of len_rule AVPs, generated as follows: (a) randomly choose len_rule attributes for the given rule; (b) for each chosen attribute attr, randomly generate an interval [a, b] such that 0 ≤ a ≤ b ≤ max_val, the resulting AVP being (attr, [a, b]).
3. Randomly generate num_rows rows of the truth table of the above rules in order to create a training set. A row is generated as follows: (a) randomly generate a value between 0 and max_val for each attribute; (b) let tp be the resulting tuple of attribute values; (c) if tp is covered by C then out = 1, otherwise out = 0; (d) add out as the last column of tp.
To make the work of a rule learner more complicated, we also introduce random noise. In order to do this, we choose a small parameter noise_prob (e.g. 0.005) and then, in the above algorithm, after having computed the outcome, alter it with probability noise_prob.
Example 3 demonstrates the experiment.
Example 3. Suppose max_val = 1, meaning that all the attribute values are binary: 0 or 1. Moreover, this also means that the collection of rules becomes a Disjunctive Normal Form (DNF).
Then num_rules becomes the number of conjuncts and len_rule becomes the length of the conjuncts. Suppose that both of them equal 2, and let the collection of rules be (x_1 ∧ x_2) ∨ (x_3 ∧ x_4). Let also the number of attributes be 10. Thus we have defined the function f(x_1, ..., x_10) = (x_1 ∧ x_2) ∨ (x_3 ∧ x_4). The whole dataset is just the truth table of this function. The num_rows parameter is the size of the training set (seen by the algorithm); these num_rows rows are randomly selected out of the whole dataset. An RL algorithm is supposed to guess the whole function from these rows.
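The generation of the training set in Example 3, including the optional noise step, might look as follows. The function names, the fixed random seed and the choice of 200 rows are our own illustrative assumptions.

```python
# Sketch of the row-generation procedure for Example 3: sample rows of the
# truth table of f(x1..x10) = (x1 AND x2) OR (x3 AND x4) and flip the outcome
# with probability noise_prob.
import random

random.seed(0)
num_attr, max_val, num_rows, noise_prob = 10, 1, 200, 0.005

def f(row):
    """The target DNF (0-indexed): (x1 AND x2) OR (x3 AND x4)."""
    return 1 if (row[0] and row[1]) or (row[2] and row[3]) else 0

def random_row():
    """One training row: random attribute values plus a possibly noisy out."""
    values = tuple(random.randint(0, max_val) for _ in range(num_attr))
    out = f(values)
    if random.random() < noise_prob:     # random noise: flip the outcome
        out = 1 - out
    return values + (out,)

training_set = [random_row() for _ in range(num_rows)]
```

With max_val = 1 the full truth table has 2^10 = 1024 rows, of which the training set is a random sample.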

Testing and analysis of the results.
Now, suppose an RL algorithm returns a collection of rules g. How can we determine the closeness of g to the function f induced by the original collection of rules? The truth table of g consists of the same tuples as the truth table of f (but the values of the function can, of course, be different). Therefore, we proceed as follows.
1. Calculate the numbers of rows on which f, g and f ∧ g (both f and g) equal 1, and denote them by |f|, |g| and |f ∧ g|, respectively.
2. The number |f ∧ g|/|f| is the fraction of rows covered by f that are also covered by g. The larger this number, the better the quality of the learned model.
3. The number |f ∧ g|/|g| is the fraction of rows covered by g that are also covered by f. The larger this number, the smaller the number of rows of g that are not covered by f, and hence the better the quality of g.
Note that computing these numbers of rows is in general an intractable problem. However, since we consider small collections of small rules, this can be done by a brute-force algorithm.
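A minimal brute-force sketch of the two quality measures above, again assuming our own (variable, (lo, hi)) encoding of AVPs:

```python
# Brute-force computation of |f|, |g| and |f AND g| over the full truth table,
# feasible only for small collections of small rules.
from itertools import product

def covered(rules, row):
    """True iff at least one rule of the collection covers the row."""
    return any(all(lo <= row[v] <= hi for v, (lo, hi) in r) for r in rules)

def similarity(f_rules, g_rules, domains):
    """Return (|f AND g| / |f|, |f AND g| / |g|) by enumerating all rows."""
    n_f = n_g = n_fg = 0
    for row in product(*domains):
        in_f, in_g = covered(f_rules, row), covered(g_rules, row)
        n_f += in_f
        n_g += in_g
        n_fg += in_f and in_g
    return n_fg / n_f, n_fg / n_g

# f covers x1 in [1, 4]; the learned g covers only the sub-interval [1, 2]:
f_rules = [[(0, (1, 4))]]
g_rules = [[(0, (1, 2))]]
print(similarity(f_rules, g_rules, [range(0, 10)]))   # (0.5, 1.0)
```

In this toy case g misses half of the rows of f (first ratio 0.5) but produces no false positives (second ratio 1.0).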
We test the algorithms (RSC, SeCo, and DT) in the modes specified by the following parameters.
1. The number of extra variables. Extra variables are those that do not take part in the rules of f. They are inessential; however, their presence can seriously hinder the performance of an RL algorithm.
Getting rid of such variables is the main task of feature selection algorithms. We consider two extreme modes: few extra variables and many extra variables. We gradually increase the number of variables in order to identify the point where the rule learner starts to work much worse. In the case of many variables, the performance can be significantly improved by introducing feature selection algorithms. However, in this experiment we want to see how the algorithm deals with this matter on its own.
Having many variables has another interesting feature: the size of the truth table becomes very large (100 variables with a domain of size 2 each result in a truth table of 2^100 rows). This means that a training set (in which rows are explicitly presented) becomes tiny compared to the whole truth table. It is interesting to see how an RL algorithm copes with this situation.
2. The number of extra values. Suppose x is a variable occurring in the rules of f. However, the domain of x may contain many values that do not take part in any interval of an AVP of x in f. When there are many such additional values, the event of the function f equaling 1 becomes rare, and hence it is more difficult for a rule learner to 'spot' the rule. We check truth tables with few and with many extra values.
Thus the options of few/many extra variables and few/many extra values combined give us 4 modes of testing. If we add presence/absence of random noise, this makes 8 modes of testing in total.
We perform experiments according to the above classification. Our conclusions based on these experiments are summarized below.
1. Small number of extra variables. In this case the RSC algorithm correctly reconstructs the original function. However, if we increase the number of values, the algorithm splits the original rules, so the number of resulting rules is larger than the number of original rules. The effect of splitting can be demonstrated on the following example. Consider a rule (x_1, [1,10]) ∧ (x_2, [1,10]). As a result of splitting, this rule can be represented by the following collection of four rules: {(x_1, [1,5]) ∧ (x_2, [1,5]), (x_1, [1,5]) ∧ (x_2, [6,10]), (x_1, [6,10]) ∧ (x_2, [1,5]), (x_1, [6,10]) ∧ (x_2, [6,10])}. The larger the intervals, the more strongly the output of our algorithm is affected by splitting. This effect can be alleviated by using a non-zero prec_loss parameter, for instance about half a percent (prec_loss is defined for the function IsReplaces in Algorithm 3 in Section 2). As a result, the algorithm is 'encouraged' to move to a larger interval even if the resulting precision is slightly smaller. However, some fragmentation of the rules still remains. We believe this can be addressed by a post-learning algorithm that tries to simplify the already created rules [4]; this is an interesting topic for future research.
2. Many extra variables. In this case, the RSC algorithm has a tendency to include irrelevant variables into the rules. This inclusion has an interesting side effect: redundant variables in correct rules. For example, suppose we have a rule (x_1, [1,5]) ∧ (x_2, [1,5]). If the number of variables is, say, 100 and the size of the sample of the truth table considered by the algorithm is, say, 1000 (a tiny fraction of the 2^100 rows of the full truth table), then there might be some irrelevant variable with an interval whose precision is better than any interval of x_1 or x_2. In this case the algorithm picks something like (x_10, [1,2]) and only then a relevant variable. This effect makes the rules longer than needed. Still, in the vast majority of cases, the function of the collection of rules formed was exactly the function of the original rules.
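The splitting effect of item 1 can be checked mechanically: the four split rules cover exactly the same rows as the original rule. The sketch below, using our own AVP encoding, verifies this over a small domain.

```python
# Check that the four split rules are semantically equal to the original
# rule (x1, [1,10]) AND (x2, [1,10]).
from itertools import product

def covers(rule, row):
    """True iff every AVP of the rule is satisfied by the row."""
    return all(lo <= row[v] <= hi for v, (lo, hi) in rule)

original = [(0, (1, 10)), (1, (1, 10))]
split = [
    [(0, (1, 5)),  (1, (1, 5))],
    [(0, (1, 5)),  (1, (6, 10))],
    [(0, (6, 10)), (1, (1, 5))],
    [(0, (6, 10)), (1, (6, 10))],
]

rows = list(product(range(0, 12), repeat=2))
same = all(covers(original, r) == any(covers(s, r) for s in split)
           for r in rows)
print(same)   # True
```

The two collections induce the same function, which is why splitting hurts only the size and readability of the model, not its accuracy.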
In about 1% of cases, we obtained rules with false positives. The reason for this is an effect of 'shadowing': when a training set is so tiny compared to the 'full' data, some statistical 'anomalies' are possible. For example, it may happen that an interval of an irrelevant variable perfectly correlates with the rows where the function is 1. Clearly, in this case, the algorithm will pick the correlating interval of the irrelevant variable.
The above situation can be fixed if the algorithm considers several random training sets of the same size.This allows the 'stray' irrelevant variable to be 'shaken off'.
If, in addition to many extra variables, some relevant variables contain many extra values, the negative effects specified above are, of course, moderately aggravated. Even so, the RSC algorithm failed to correctly recover the function in only about 3% of cases.
3. The influence of noise. The noise does not significantly affect the behavior of the algorithm as described above. In particular, the RSC algorithm is still able to recognize the main rules and does not try to 'collate' the 'noisy' rows with the main ones.
4. Comparison of RSC with DT based and Separate-and-Conquer (SeCo) rule learners. Finally, it is important to note that on this domain the RSC algorithm works much better than the DT based and SeCo (with the ChooseNext heuristic) rule learners.
Indeed, in those rare cases where RSC returns an incorrect collection of rules, the difference between the output and the original collection of rules has never been more than 2%. On the other hand, the rules returned by the DT based rule learner, even in the case of few extra variables and small domains, are at least 20% different from the original collection of rules. In the case of many extra variables, the difference can be up to 40%. SeCo (with the ChooseNext heuristic) is only marginally better than DT.
A typical situation where both DT and SeCo fail to discover the right set of rules can be described by the following simple example. Suppose that the dataset consists of 10 attributes attr_1, ..., attr_10, each attribute can take values 1, ..., 5, and the outcome is 1 only for rows covered by one of two rules. Both RSC and SeCo easily discover the first rule, and RSC then quickly discovers the second rule as well. However, by the time SeCo attempts to learn the second rule, many of the rows covered by that rule have already been removed. SeCo therefore picks an unrelated variable and then creates many irrelevant rules just to cover the remaining rows. Unsurprisingly, on the testing set such rules are far from accurate.

Failure Prediction Using a Real Industrial Dataset
For our experiments we use a real industrial dataset collected from a machine that manufactures plastic bottle caps. This dataset consists of the following two parts.
The first part is a collection of tuples of sensor readings, provided in CSV format, that have been collected from this machine over more than one year. Each tuple of sensor readings is associated with a timestamp. We create a table R with columns (attributes) corresponding to the sensors and rows being the tuples of the corresponding readings. To connect with the second part of the data, we also keep the timestamps of the tuples in memory.
The second part is information about alarms. This data consists of tuples having three components: the start and end timestamps of an alarm and the alarm error code. The alarms are associated with failures of the industrial machine, in the sense that if an alarm occurs, the machine should be switched off to find the failure. The alarm error codes are organized into four groups: shutdown, stoppage, mandatory action and message. The first two groups (shutdown and stoppage) are the main errors that should be predicted to prevent failures of the machine. Five types of shutdown and stoppage alarms happen most often; in this section we refer to them by index i ∈ {1, 2, 3, 4, 5} for the purpose of explanation.
The rest of this subsection is divided into the following four parts.
1. Testing the ability of the considered algorithms to predict the actual alarms occurring at a given moment of time.
2. Testing Remaining Useful Life (RUL) prediction, that is, effectively, the ability to predict an alarm occurring in the near future.
3. Testing the true and false positive rates.
4. Drawing conclusions based on the obtained empirical results.

Prediction of actual alarms.
For all alarms, we form the respective datasets D_1, ..., D_5. Each D_i is formed as follows.
1. We take the table R created from the first part of the dataset and add to it one extra column out.
2. For each row of R, we check whether alarm i occurred at the moment of the timestamp associated with the row. If it did, the value of out in this row is set to 1; otherwise, the value of out is set to 0.
As a result, we obtain datasets where the sensor readings serve as attributes and the values of the last column serve as outcomes.
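The labeling step above can be sketched as follows. The plain integer timestamps, the sample values and all names are our own simplification for illustration.

```python
# Sketch of the construction of dataset D_i: a row of R gets out = 1 iff its
# timestamp falls inside some (start, end) interval of alarm i.

def label_rows(rows, timestamps, alarm_intervals):
    """Append the out column: 1 iff the row's timestamp lies inside an alarm."""
    labeled = []
    for row, ts in zip(rows, timestamps):
        active = any(start <= ts <= end for start, end in alarm_intervals)
        labeled.append(row + (1 if active else 0,))
    return labeled

rows = [(0.3, 12.1), (0.5, 13.0), (0.4, 11.8)]   # sensor readings of table R
timestamps = [100, 160, 220]                     # one timestamp per row
alarm_intervals = [(150, 180)]                   # alarm i active once
D_i = label_rows(rows, timestamps, alarm_intervals)
print([r[-1] for r in D_i])   # [0, 1, 0]
```

Only the middle row falls inside the alarm interval, so only it receives out = 1.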
We perform the experiments for RSC, DT based and SeCo (with ChooseNext heuristic) algorithms as follows.
1. We run the algorithm for each D_i separately. For this, we randomly partition the rows of D_i into the training set (70% of the rows) and the testing set (30% of the rows), and record all the rules.
2. Each rule is tested on the testing set corresponding to the predicted alarm. That is, for the predicted alarm i, any rule obtained from the training set of D_i is tested on the testing set of D_i. For each rule, we record its precision with respect to the testing set (see Definition 5).
3. We record together the rules obtained from the exploration of all datasets D_1, ..., D_5, replacing the outcome 1 with the respective real alarm code, and remove as insignificant those rules that cover fewer than 20 rows of the dataset.
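Steps 2 and 3 above can be sketched as follows: precision in the sense of Definition 5 (the fraction of covered testing rows whose outcome is 1) and the filter dropping rules that cover fewer than 20 rows. The rule representation and the toy data are our own assumptions.

```python
# Sketch of rule precision on a testing set and of the significance filter.
# A rule is a list of AVPs (variable index, (lo, hi)); rows end with out.

def precision(rule, test_rows):
    """Fraction of rows covered by the rule whose outcome is 1."""
    covered = [r for r in test_rows
               if all(lo <= r[v] <= hi for v, (lo, hi) in rule)]
    if not covered:
        return 0.0
    return sum(r[-1] for r in covered) / len(covered)

def significant(rule, rows, min_cover=20):
    """Keep only rules covering at least min_cover rows of the dataset."""
    n = sum(all(lo <= r[v] <= hi for v, (lo, hi) in rule) for r in rows)
    return n >= min_cover

test_rows = [(1, 1), (2, 1), (3, 0), (9, 0)]   # one attribute plus out
rule = [(0, (1, 3))]                            # covers the first three rows
print(precision(rule, test_rows))               # 2 of 3 covered rows fire
```

Here the rule covers three rows of which two have outcome 1, giving a precision of about 0.67; with only four rows it would of course be dropped by the 20-row filter.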
Some rules and their precision are reported in Table 1 for RSC, in Table 2 for a DT based rule learner and in Table 3 for SeCo. Each row in the tables corresponds to a rule. The first column 'alarm' states the predicted alarm code. The second column 'rules' describes the body of the rule; we grouped together several rules predicting the same alarm. For example, the rule on the first row of Table 1 should be interpreted as follows: if the value of the attribute %ZP is 0 and the value of the attribute Ads.HmiVis.EXTR.PRESS_I is in the interval [8.63, 170.95], then alarm 1017 occurs. The rule on the last row of Table 2 is interpreted as follows: if the value of the attribute %PS is greater than 40.7 and the value of the attribute Ads.HmiVis.CENTR.POT_M1B is greater than 61.1, then alarm 3099 occurs. The last column 'precision' measures how precise the given rule is on the dataset, calculated as the percentage of rows of the testing set on which the alarm actually occurs among those covered by the respective rule (as in Definition 5).

Let us make one interesting remark. The rules generated by the above algorithms are overlapping, in the sense that a row of table R can be covered by more than one rule, meaning that the set of rules predicts that more than one alarm is taking place at the corresponding timestamp. This means that two or more alarms may occur simultaneously. In fact, classifying each alarm separately is a standard ML approach to multiple classification tasks. It is called unordered rules, and there is evidence that this approach makes more accurate predictions than learning mutually exclusive rules, see e.g., [37].

RUL prediction.
We also test the ability of the algorithms to predict the remaining time to failure, or RUL (Remaining Useful Life). In particular, for a time of t seconds, we modify the tables D_1, ..., D_5 to obtain the respective tables D_1^t, ..., D_5^t as follows. Take table D_i and set the out column to 1 in those rows whose timestamp is at most t seconds before the timestamp of a row having out = 1 in D_i. The resulting table is D_i^t. We report experiments with two values of t: 60 and 120, chosen for the sake of demonstration. The resulting rules and the testing results of the respective RSC, DT based and SeCo algorithms are reported in Tables 4-6 for RUL = 60 s and in Tables 7-9 for RUL = 120 s.

TP and FP rates.
We also calculate the True Positive (TP) and False Positive (FP) rates for our experiments. For this task we form the dataset D = D_1 ∨ D_2 ∨ D_3 ∨ D_4 ∨ D_5. That is, if any alarm occurred at the moment of the timestamp associated with a row, out = 1 in this row; otherwise, out = 0. We run the RSC, DT based and SeCo rule learners on the obtained dataset D for rule generation.
To calculate TP (correct prediction of alarms), we define A as the set of rows associated with any alarm in the testing set of D (having out = 1) and a as the number of all rows in A. Then TP = t/a, where t is the number of rows of A which are covered by at least one rule. The results are TP = 97% for RSC, TP = 90% for the DT based rule learner and TP = 92% for SeCo.
To calculate FP (incorrect prediction of alarms), we define N to be the set of rows in the testing set of D which are not associated with any alarm (with out = 0) and let n be the number of all rows in N.
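The TP computation above, together with the symmetric FP computation over N, can be sketched as follows. The document's explicit FP formula does not appear in this excerpt, so the definition FP = (covered rows of N)/n used below is our assumption, as are the rule representation and the toy data.

```python
# Sketch of the TP/FP rate computation: TP = t/a over the alarm rows A, and
# (assumed symmetric) FP = covered non-alarm rows / |N| over the rows N.

def covered_by_any(rules, row):
    """True iff at least one rule of the collection covers the row."""
    return any(all(lo <= row[v] <= hi for v, (lo, hi) in r) for r in rules)

def tp_fp_rates(rules, test_rows):
    """test_rows are tuples (..., out); returns (TP rate, FP rate)."""
    A = [r for r in test_rows if r[-1] == 1]   # rows with an alarm
    N = [r for r in test_rows if r[-1] == 0]   # rows without an alarm
    tp = sum(covered_by_any(rules, r) for r in A) / len(A)
    fp = sum(covered_by_any(rules, r) for r in N) / len(N)
    return tp, fp

rules = [[(0, (1, 3))]]
test_rows = [(1, 1), (2, 1), (5, 1), (3, 0), (9, 0)]
print(tp_fp_rates(rules, test_rows))   # covers 2 of 3 alarms, 1 of 2 others
```

In this toy case the rule catches two of the three alarm rows (TP = 2/3) and also fires on one of the two non-alarm rows (FP = 1/2).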

Figure 1. DT example.

Algorithm listing (Section 2): two auxiliary functions for the ChooseNext heuristic.