
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Classification is one of the data mining problems receiving enormous attention in the database community. Although artificial neural networks (ANNs) have been successfully applied in a wide range of machine learning applications, they are often regarded as black boxes, since their learned knowledge is buried in the network weights and is difficult for humans to interpret.

Data mining, also popularly known as knowledge discovery in databases, refers to the process of automated extraction of hidden, previously unknown and potentially useful information from large databases. It is the process of finding and interpreting valuable information by using knowledge from multidisciplinary fields such as statistics, artificial intelligence, machine learning, and database management.

ANNs have the ability of distributed information storage, parallel processing, reasoning, and self-organization. They also have the capability of rapidly fitting nonlinear data, so they can solve many problems that are difficult for other methods.

In many applications, it is highly desirable to extract symbolic rules from these networks. Unlike a collection of weights, symbolic rules can be easily interpreted and verified by human experts. They can also provide new insights into the application problems and the corresponding data.

In this paper we have proposed a new data mining scheme, referred to as ESRNN (Extraction of Symbolic Rules from ANNs), to extract symbolic rules from trained ANNs. A four-phase training algorithm is proposed by using backpropagation learning. In the first and second phases, an appropriate network architecture is determined using weight freezing based constructive and pruning algorithms. In the third phase, the continuous activation values of the hidden nodes are discretized by using an efficient heuristic clustering algorithm. Finally, in the fourth phase, symbolic rules are extracted using the frequently occurring pattern based rule extraction algorithm by examining the discretized activation values of the hidden nodes.

The rest of the paper is organized as follows. Section 2 describes the related work. The proposed data mining scheme is presented in Section 3. We discuss the performance evaluation in Section 4. Finally, in Section 5 we conclude the paper.

A neural network-based approach to mining classification rules from given databases has been proposed in the literature.

In the literature, there are many different approaches for rule extraction from ANNs. A number of algorithms for extracting rules from trained ANNs have been developed in the last two decades.

Liu and Tan proposed X2R, an algorithm for extracting concise classification rules from data.

Setiono presented MofN3, a new method for extracting M-of-N rules from ANNs.

Jin and Sendhoff provide an up-to-date yet not necessarily complete review of the existing research on Pareto-based multiobjective machine learning (PMML) algorithms.

The limitations of the existing rule extraction algorithms are summarized as follows:

They use a predefined, fixed number of hidden nodes, which requires human experience and prior knowledge of the problem to be solved;

The clustering algorithms used to discretize the output values of hidden nodes are not efficient;

They are computationally expensive;

They cannot produce concise rules; and

The extracted rules are order sensitive.

To overcome these limitations we have proposed a scheme for data mining by extracting symbolic rules from trained ANNs. The proposed system successfully solves a number of data mining classification problems from the literature and is described in detail in the next section.

Developing algorithms and applications that are able to learn from experience and previous examples, and that show intelligent behavior, is the domain of machine learning and ANNs. Data mining, on the other hand, deals with the analysis of large and complex databases in order to discover new, useful and interesting knowledge using techniques from machine learning and statistics. The data mining process using ANNs, with the emphasis on symbolic rule extraction, is described in this section. The proposed data mining scheme is composed of two steps, data preparation and rule extraction, as shown in the figure.

In many fields of artificial intelligence, such as pattern recognition, information retrieval, machine learning, and data mining, one needs to prepare quality data by pre-processing the raw data. The input to data mining algorithms is assumed to be nicely distributed, containing no missing or incorrect values, with all features being important. Real-world data, however, may be incomplete, noisy, and inconsistent, which can disguise useful patterns. Data preparation is the process of transforming the original data to make it fit a specific data mining method. It is the first important step in data mining and plays an important role in the entire data mining process.

Data mining using ANNs can only handle numerical data. How to represent the input and output attributes of a learning problem in a neural network is one of the key decisions influencing the quality of the solutions one can obtain. Depending on the kind of problem, there may be several different kinds of attributes that must be represented. For all of these attribute kinds, multiple reasonable methods of neural network representation exist. We will now discuss each attribute kind and some common methods to represent such an attribute.
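For instance, a continuous attribute such as age is commonly rescaled into a fixed interval, while a nominal attribute such as outlook is expanded into a 1-of-C binary vector. A minimal Python sketch of these two common representations (the function names are our own):

```python
# Two common input encodings for neural networks; names are illustrative.

def min_max_scale(values):
    """Scale a continuous attribute into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def one_hot(value, categories):
    """Encode a nominal attribute as a 1-of-C binary vector."""
    return [1.0 if value == c else 0.0 for c in categories]

ages = [21, 35, 50]
print(min_max_scale(ages))                               # [0.0, 0.48..., 1.0]
print(one_hot("rainy", ["sunny", "overcast", "rainy"]))  # [0.0, 0.0, 1.0]
```

The 1-of-C encoding mirrors the class representation used later for the iris dataset, where class 1 becomes {1, 0, 0}, class 2 becomes {0, 1, 0}, and class 3 becomes {0, 0, 1}.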


It is becoming increasingly apparent that without some form of explanation capability, the full potential of ANNs may not be realized. The rapid and successful proliferation of applications incorporating ANNs in many fields, such as commerce, science, industry, and medicine, has made this need for explanation even more pressing.

A standard three-layer feedforward ANN is the basis of the proposed ESRNN algorithm. The hyperbolic tangent function, which can take any value in the interval (−1, 1), is used as the hidden node activation function. Rules are extracted from the near-optimal ANN by using a new rule extraction algorithm. The aim of ESRNN is to search for simple rules with high predictive accuracy. The major steps of ESRNN are summarized in the figure.

The rules extracted by ESRNN are compact and comprehensible, and do not involve any weight values. The accuracy of the rules from pruned networks is as high as the accuracy of the original networks. The important features of the ESRNN algorithm are that the rule extraction algorithm is recursive in nature and that the extracted rules are order insensitive, i.e., they do not need to be applied in a fixed sequence.

One drawback of the traditional backpropagation algorithm is the need to determine the number of nodes in the hidden layer prior to training. To overcome this difficulty, many algorithms that construct a network dynamically have been proposed.

The training time is an important issue in designing ANNs. One approach for reducing the number of weights to be trained is to train a few weights rather than all weights in a network and keep the remaining weights fixed, commonly known as weight freezing. The idea behind the weight freezing-based constructive algorithm is to freeze the input weights of a hidden node when its output does not change much over a few successive training epochs. Theoretical and experimental studies reveal that some hidden nodes of an ANN maintain almost constant output after some training epochs, while others continuously change during the whole training period.

In our algorithm, it has been proposed that the input weights of a hidden node can be frozen when its output does not change much over successive training epochs. This weight freezing method can be considered as a combination of the two extremes: training all the weights of the ANN and training the weights of only the newly added hidden node.
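A minimal sketch of the freezing test, under the assumption that a node is considered stable when its mean output varies by less than a tolerance over a few successive epochs (the tolerance, window size, and names are our own, not the paper's):

```python
# Sketch of weight freezing: a hidden node's input weights are frozen
# once its recorded mean output changes by less than `tol` over `window`
# successive epochs. All parameter values here are illustrative.

def update_frozen_flags(output_history, tol=1e-3, window=3):
    """output_history: one list per hidden node, holding that node's
    mean output recorded at each epoch. Returns a frozen flag per node."""
    flags = []
    for history in output_history:
        recent = history[-window:]
        stable = (len(recent) == window and
                  max(recent) - min(recent) < tol)
        flags.append(stable)
    return flags

# Node 0 has stabilized; node 1 is still changing and keeps training.
print(update_frozen_flags([[0.50, 0.5004, 0.5001],
                           [0.10, 0.30, 0.55]]))  # [True, False]
```

In the constructive phase, a frozen node's input weights are simply excluded from subsequent backpropagation updates, so only the active (unfrozen) part of the network is trained.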

The error function is the sum of squared errors between the target value t_{pi} and the actual output o_{pi} at output node p for input pattern i. The activation function used for the output layer is the sigmoid function f(y) = 1 / (1 + e^{−y}), and for the hidden layer the hyperbolic tangent function f(y) = (e^{y} − e^{−y}) / (e^{y} + e^{−y}).

Pruning offers an approach for dynamically determining an appropriate network topology. Pruning techniques begin by training a larger than necessary network and then eliminate weights and nodes that are deemed redundant.

As the nodes of the hidden layer are determined automatically by the weight freezing based constructive algorithm in ESRNN, the aim of the pruning algorithm used here is to remove as many unnecessary nodes and connections as possible. A node is pruned if all the connections to and from the node are pruned. Typically, methods for removing weights from the network involve adding a penalty term to the error function. It is hoped that by adding a penalty term to the error function, unnecessary connections will have small weights, and therefore pruning can reduce the complexity of the network significantly. The simplest and most commonly used penalty term is the sum of the squared weights.

Given a set of input patterns x_{i} ∈ R^{n}, let w_{ml} denote the weight of the connection from input node l to hidden node m, and v_{pm} the weight of the connection from hidden node m to output node p.

The backpropagation algorithm is applied to update the weights (w, v) by gradient descent on the error function augmented with the penalty term. The output at node p for pattern i is computed as o_{pi} = f(v_{p}^{T}h_{i}), where h_{i} is the vector of hidden node activations for pattern x_{i}.

The values for the weight decay parameters ε_{1}, ε_{2} > 0 must be chosen to reflect the relative importance of the accuracy of the network versus its complexity. More weights may be removed from the network, at the cost of a decrease in network accuracy, with larger values of these two parameters. They also determine the range of values where the penalty for each weight in the network is approximately equal to ε_{1}. The parameter β determines how steeply the penalty saturates at ε_{1} as the weight grows.

The value of the function βw^{2} / (1 + βw^{2}) is small when w is close to zero and approaches 1 for large |w|; its derivative 2βw / (1 + βw^{2})^{2} indicates that the backpropagation training will be very little affected by the addition of the penalty function for weights having large values. Consider the plot of the function ε_{1}βw^{2} / (1 + βw^{2}) + ε_{2}w^{2} with ε_{1} = 0.1 and ε_{2} = 10^{−5}: the penalty value is approximately equal to ε_{1} over a wide interval of w. By decreasing ε_{2}, the interval over which the penalty value is approximately equal to ε_{1} can be made wider, as with ε_{2} = 10^{−6}. A weight is prevented from taking too large a value, since the quadratic term ε_{2}w^{2} becomes dominant for larger values of |w|.
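Under the parameter values used in the figure, and assuming β = 10 (its value is not given in this text), the penalty and its derivative can be written out directly:

```python
# Sketch of the weight-decay penalty used during pruning:
#   P(w) = eps1 * sum(beta*w^2 / (1 + beta*w^2)) + eps2 * sum(w^2)
# eps1 and eps2 follow the figure caption; beta = 10 is our assumption.

def penalty(weights, eps1=0.1, eps2=1e-5, beta=10.0):
    """Total penalty added to the error function."""
    return (eps1 * sum(beta * w * w / (1.0 + beta * w * w) for w in weights)
            + eps2 * sum(w * w for w in weights))

def penalty_grad(w, eps1=0.1, eps2=1e-5, beta=10.0):
    """Derivative for one weight: 2*eps1*beta*w/(1+beta*w^2)^2 + 2*eps2*w."""
    return 2 * eps1 * beta * w / (1.0 + beta * w * w) ** 2 + 2 * eps2 * w

# The gradient is large for small weights and nearly vanishes for large
# ones, so well-established weights are barely affected by the penalty.
print(penalty_grad(0.1), penalty_grad(10.0))
```

This matches the qualitative behavior described above: small weights are pushed toward zero (making them candidates for removal), while large weights are constrained only by the quadratic ε_{2}w^{2} term.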

This pruning algorithm removes the connections of the ANN according to the magnitudes of their weights. As the eventual goal of the ESRNN algorithm is to get a set of simple rules that describe the classification process, it is important that all unnecessary nodes and connections be removed. In order to remove as many connections as possible, the weights of the network must be prevented from taking values that are too large.

The steps of the pruning algorithm are explained as follows:

Let η_{1} and η_{2} be positive scalars such that (η_{1} + η_{2}) < 0.5, where η_{1} ∈ [0, 0.5) is the error tolerance and η_{2} is a threshold that determines whether a weight can be removed. Let (w, v) be the weights of the fully trained network.

In the first phase, connections between input nodes and hidden nodes are removed. A weight w_{ml} is removed if the product |v_{pm}w_{ml}| is sufficiently small for every output node p; the network is then retrained.

In the second phase, connections between hidden nodes and output nodes are removed. For each weight v_{pm}, if its magnitude |v_{pm}| falls below the threshold, v_{pm} is removed and the network is retrained.

If no weight satisfies the removal condition, the connection w_{ml} with the smallest such product is removed and the network retrained; if the required accuracy can no longer be achieved, w_{ml} is restored and pruning terminates.

The pruning algorithm used in the ESRNN algorithm is intended to reduce the amount of training time. Although it can no longer be guaranteed that the retrained pruned ANN will give the same accuracy rate as the original ANN, the experiments show that many weights can be eliminated simultaneously without deteriorating the performance of the ANN. The two thresholds η_{1} and η_{2} control this trade-off between accuracy and network size.
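The structural effect of pruning can be sketched as follows, with a plain magnitude threshold standing in for the η_{2}-based conditions above (the dictionary layout and all names are illustrative, not the paper's implementation):

```python
# Magnitude-based pruning sketch: drop hidden-to-output weights below a
# threshold, then drop hidden nodes that have lost all their connections.

def prune(weights, threshold):
    """weights: dict (hidden_node, output_node) -> weight value.
    Returns the surviving weights and the surviving hidden nodes."""
    kept = {k: w for k, w in weights.items() if abs(w) >= threshold}
    surviving_nodes = {h for (h, _) in kept}
    return kept, surviving_nodes

w = {(0, 0): 2.1, (0, 1): -0.02, (1, 0): 0.01, (1, 1): 0.003}
kept, nodes = prune(w, threshold=0.05)
print(kept)   # {(0, 0): 2.1}
print(nodes)  # {0}  -> hidden node 1 is pruned entirely
```

This illustrates the rule stated above: a node is pruned exactly when all the connections to and from it have been pruned.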

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group in many applications.

After applying the pruning algorithm in ESRNN, the ANN architecture produced by the weight freezing based constructive algorithm contains only important nodes and connections. Nevertheless, rules are not readily extractable because the hidden node activation values are continuous. The discretization of these values paves the way for rule extraction. It is found that some hidden nodes of an ANN maintain almost constant output while other nodes change continuously during the whole training process.

The aim of the clustering algorithm is to discretize the output values of the hidden nodes. Consider that the number of hidden nodes in the pruned network is H. The algorithm proceeds as follows:

Find the smallest positive integer d such that the network still classifies the training patterns correctly when each hidden node activation value is rounded to d decimal places.

Represent each activation value α by the integer nearest to α × 10^{d}. Let the distinct discretized activation values at hidden node i be δ_{i,1}, δ_{i,2}, …, δ_{i,k}, collected in the set H_{i}, for each of the sets H_{1}, H_{2}, …, H_{H}.

Let i = 1 and take the distinct discretized values in H_{i} as the initial clusters of hidden node i.

Sort the values of the set H_{i} in increasing order.

Find a pair of distinct adjacent values δ_{i,j} and δ_{i,j+1} in H_{i} such that the network still classifies the training patterns correctly if δ_{i,j+1} is replaced by δ_{i,j}.

If such a pair of values exists, replace all occurrences of δ_{i,j+1} in H_{i} by δ_{i,j} and repeat the previous step; otherwise proceed to the next hidden node.

In our scheme, the activation value of an input pattern at hidden node i is thus replaced by the representative value of the cluster to which it belongs.

The array of discretized activation values, one per hidden node and input pattern, forms the input to the rule extraction phase.
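A minimal sketch of such a discretization, assuming a simple distance-based merging criterion in place of the accuracy check above (the `gap` threshold, the choice d = 1, and all names are our own):

```python
# Sketch of the heuristic discretization: round activations to d decimal
# places, then greedily merge adjacent distinct values closer than `gap`.
# The merging criterion is a simplification of the paper's accuracy test.

def discretize(activations, d=1, gap=0.15):
    values = sorted({round(a, d) for a in activations})
    clusters = [values[0]]
    for v in values[1:]:
        if v - clusters[-1] >= gap:   # far enough apart: new cluster
            clusters.append(v)
    # map every activation to its nearest cluster representative
    return [min(clusters, key=lambda c: abs(c - round(a, d)))
            for a in activations]

acts = [0.12, 0.18, 0.55, 0.61, 0.95]
print(discretize(acts))  # [0.1, 0.1, 0.6, 0.6, 0.9]
```

After this step each hidden node takes only a small number of distinct values, which is what makes the subsequent rule extraction tractable.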

Classification rules are sought in many areas, such as automatic knowledge acquisition.

Rule Extraction: This function first initializes the extracted rule list to be empty and sorts the examples according to their frequency. Then it picks the most frequently occurring example as the base to generate a rule and adds the rule to the list of extracted rules. It then finds all the examples that are covered by the rule and removes them from the example space. The above process repeats, continuously adding extracted rules to the rule list, until the example space becomes empty, i.e., until every example has been covered by some extracted rule and removed.

Rule Clustering: Rules are clustered in terms of their class levels. Rules of the same class are clustered together as one group of rules.

Rule Pruning: Redundant or more specific rules in each cluster are removed. In each of these clusters, more than one rule may cover the same example. For example, the rule “if (color = green) and (height < 4) then grass” is already contained in the more general rule “if (color = green) then grass”, and thus the rule “if (color = green) and (height < 4) then grass” is redundant. RE eliminates these redundant rules in each cluster to further reduce the size of the best rule list.
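The redundancy check can be sketched as a subset test on rule conditions; the representation of a rule as a frozenset of attribute-value conditions is our own simplification:

```python
# A rule is redundant if a strictly more general rule of the same class
# uses a subset of its conditions. Rules are (conditions, class) pairs.

def prune_redundant(rules):
    kept = []
    for conds, cls in rules:
        subsumed = any(other < conds            # strict subset, same class
                       for other, c in rules if c == cls)
        if not subsumed:
            kept.append((conds, cls))
    return kept

rules = [(frozenset({("color", "green"), ("height<4", True)}), "grass"),
         (frozenset({("color", "green")}), "grass")]
print(prune_redundant(rules))
# only the general rule "if color = green then grass" survives
```

The same subset test also explains why pruning never changes which examples the rule list covers: every example matched by a removed rule is matched by the general rule that subsumed it.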

A default rule should be chosen to accommodate possible unclassifiable patterns. If rules are clustered, the choice of the default rule is based on clusters of rules. The steps of the rule extraction algorithm are explained as follows:

Sort-on-frequency (data-without-duplicates);

while (data-without-duplicates is NOT empty) {

extract rule R_{i} based on the most frequently occurring example;

remove all the patterns covered by R_{i};

}

The core of this step is a greedy algorithm that finds the shortest rule based on the first order information, which can differentiate the pattern under consideration from the patterns of other classes. It then iteratively extracts the shortest rules and removes the patterns covered by each rule until all patterns are covered by the rules.
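The covering loop above can be sketched as follows. For brevity this illustration keeps the full example as the rule's condition, whereas the algorithm described here shortens each rule to the minimal sufficient conditions; the data and names are our own:

```python
# Greedy covering sketch: repeatedly take the most frequent uncovered
# example, turn it into a rule, and remove everything the rule covers.
from collections import Counter

def extract_rules(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    remaining = list(examples)
    rules = []
    while remaining:
        # the most frequently occurring example is the base for a rule
        base, _ = Counter(remaining).most_common(1)[0]
        rules.append(base)                      # rule: attributes -> class
        remaining = [e for e in remaining if e[0] != base[0]]
    return rules

data = [(("sunny", "high"), "no"), (("sunny", "high"), "no"),
        (("rainy", "strong"), "no"), (("overcast", "weak"), "yes")]
print(extract_rules(data))  # three rules, most frequent example first
```

The loop terminates because every iteration removes at least one example, which mirrors the termination argument given in the prose.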

Cluster rules according to their class levels. Rules extracted in the previous step are grouped so that rules of the same class form one cluster.

Replace specific rules with more general ones;

Remove noise rules;

Eliminate redundant rules;

RE exploits the first order information in the data and finds the shortest sufficient conditions for a rule of a class that can differentiate it from patterns of other classes. It can extract concise and perfect rules in the sense that the error rate of the rules is not worse than the inconsistency rate found in the original data. The novelty of RE is that the rules it extracts are order insensitive, i.e., they can be applied in any order.

This section evaluates the performance of the ESRNN algorithm on a set of well-known benchmark classification problems, including diabetes, iris, wine, season, golf playing, and lenses, that are widely used in machine learning and data mining research. All datasets contain real-world data.

This subsection briefly describes the datasets used in this study. The characteristics of the datasets are summarized in

The diabetes dataset has eight attributes: number of times pregnant (A1), plasma glucose concentration (A2), blood pressure (A3), triceps skin fold thickness (A4), two-hour serum insulin (A5), body mass index (A6), diabetes pedigree function (A7), and age (A8). In this database, 268 instances are positive (output equals 1) and 500 instances are negative (output equals 0).

The iris dataset has four attributes: sepal length (A1), sepal width (A2), petal length (A3), and petal width (A4), all in centimeters. Among the three classes, class 1 is linearly separable from the other two classes, and classes 2 and 3 are not linearly separable from each other. To ease knowledge extraction, we reformulate the data with three outputs, where class 1 is represented by {1, 0, 0}, class 2 by {0, 1, 0}, and class 3 by {0, 0, 1}.

In all experiments, one bias node with a fixed input of 1 was used for the hidden and output layers. The learning rate was set in the interval [0.1, 1.0] and the weights were initialized to random values in [−1.0, 1.0]. A hyperbolic tangent function f(y) = (e^{y} − e^{−y}) / (e^{y} + e^{−y}) was used as the hidden node activation function, and a sigmoid function f(y) = 1 / (1 + e^{−y}) as the output node activation function.

In this study, all datasets representing the problems were divided into two sets: the training set and the testing set. The numbers of examples in the training set and the testing set were chosen to be the same as those in other works, in order to make the comparison with those works possible. The sizes of the training and testing datasets used in this study are given in

The training time error for diabetes data with weight freezing is shown in

The number of rules extracted by the ESRNN algorithm and the accuracy of the rules are presented in

Rule 1: If Plasma glucose concentration (_{2}) <= 0.64 and Age (_{8}) <= 0.69 then tested negative

Default Rule: tested positive.

Rule 1: If Petal-length (_{3}) <= 1.9 then iris setosa

Rule 2: If Petal-length (_{3}) <= 4.9 and Petal-width (_{4}) <= 1.6 then iris versicolor

Default Rule: iris virginica.

Rule 1: If Input 10 (_{10}) <= 3.8 then class 2

Rule 2: If Input 13 (_{13}) >= 845 then class 1

Default Rule: class 3.

Rule 1: If Tree (A2) = yellow then autumn

Rule 2: If Tree (A2) = leafless then autumn

Rule 3: If Temperature (A3) = low then winter

Rule 4: If Temperature (A3) = high then summer

Default Rule: spring.

Rule 1: If Outlook (A1) = sunny and Humidity >= 85 then don’t play

Rule 2: If Outlook (A1) = rainy and Wind = strong then don’t play

Default Rule: play.

Rule 1: If Tear Production Rate (A4) = reduced then no contact lenses

Rule 2: If Age (A1) = presbyopic and Spectacle Prescription (A2) = hypermetrope and Astigmatic (A3) = yes then no contact lenses

Rule 3: If Age (A1) = presbyopic and Spectacle Prescription (A2) = myope and Astigmatic (A3) = no then no contact lenses

Rule 4: If Age (A1) = pre-presbyopic and Spectacle Prescription (A2) = hypermetrope and Astigmatic (A3) = yes and Tear Production Rate (A4) = normal then no contact lenses

Rule 5: If Spectacle Prescription (A2) = myope and Astigmatic (A3) = yes and Tear Production Rate (A4) = normal then hard contact lenses

Rule 6: If Age (A1) = pre-presbyopic and Spectacle Prescription (A2) = myope and Astigmatic (A3) = yes and Tear Production Rate (A4) = normal then hard contact lenses

Rule 7: If Age (A1) = young and Spectacle Prescription (A2) = myope and Astigmatic (A3) = yes and Tear Production Rate (A4) = normal then hard contact lenses

Default Rule: soft contact lenses.

This section compares the experimental results of the ESRNN algorithm with the results of other works. The primary aim of this comparison is to gain a deeper understanding of rule generation, not to exhaustively evaluate ESRNN against every other work.

In this paper we have presented a neural network based data mining scheme for mining classification rules from given databases. This work is an attempt to apply the connectionist approach to data mining by extracting symbolic rules similar to those of decision trees. An important feature of the proposed rule extraction algorithm is its recursive nature. A set of experiments was conducted to test the proposed approach using a well-defined set of data mining problems. The results indicate that, using the proposed approach, high quality rules can be discovered from the given datasets. The extracted rules are concise, comprehensible, order insensitive, and do not involve any weight values. The accuracy of the rules from the pruned network is as high as the accuracy of the fully connected networks. Experiments showed that this method reduces the number of rules significantly without sacrificing classification accuracy. In almost all cases ESRNN outperformed the others. With the rules extracted by the method introduced here, ANNs should no longer be regarded as black boxes.

This work was supported by Hankuk University of Foreign Studies Research Fund of 2011.

Data mining technique using ANNs.

Flow chart of the proposed ESRNN algorithm.

Flow chart of the weight freezing based constructive algorithm.

Plots of the function ε_{1}βw^{2} / (1 + βw^{2}) + ε_{2}w^{2} and its derivative f^{′}(w) = 2ε_{1}βw / (1 + βw^{2})^{2} + 2ε_{2}w, with ε_{1} = 0.1, ε_{2} = 10^{−5} and with ε_{1} = 0.1, ε_{2} = 10^{−6}.

Output of the hidden nodes.

A pruned network for the diabetes data.

Training time error for the diabetes data.

Training time error for the diabetes data with weight freezing.

Hidden node addition for the diabetes data.

Characteristics of datasets.

No. | Dataset | Examples | Attributes | Classes
---|---|---|---|---
1 | Diabetes | 768 | 8 | 2
2 | Iris | 150 | 4 | 3
3 | Wine | 178 | 13 | 3
4 | Season | 11 | 3 | 4
5 | Golf Playing | 14 | 4 | 2
6 | Lenses | 24 | 4 | 3

Sizes of the training and the testing datasets.

No. | Dataset | Training examples | Testing examples
---|---|---|---
1 | Diabetes | 384 | 384
2 | Iris | 75 | 75
3 | Wine | 89 | 89
4 | Season | 6 | 5
5 | Golf Playing | 7 | 7
6 | Lenses | 12 | 12

ANN architectures and the training epochs for the diabetes dataset.

Mean | 11 (8-1-2) | 10 | 13.1 | 31 | 12.1 | 19.7 | 306.4 |

Min | 11 (8-1-2) | 10 | 12.3 | 23 | 11.9 | 15 | 283 |

Max | 11 (8-1-2) | 10 | 13.9 | 38 | 13.2 | 24.4 | 329 |

ANN architectures and the training epochs for the iris dataset.

Mean | 8 (4-1-3) | 7 | 9 | 13 | 9 | 10.2 | 198.2 |

Min | 8 (4-1-3) | 7 | 8 | 8 | 8 | 7 | 185 |

Max | 8 (4-1-3) | 7 | 11 | 22 | 10 | 13.8 | 220 |

ANN architectures and the training epochs for the wine dataset.

Mean | 17 (13-1-3) | 16 | 18.6 | 38 | 17 | 24.8 | 215 |

Min | 17 (13-1-3) | 16 | 17.2 | 18 | 16 | 22 | 198 |

Max | 17 (13-1-3) | 16 | 21 | 62 | 21 | 42 | 240 |

ANN architectures and the training epochs for the season dataset.

Mean | 8 (3-1-4) | 7 | 8.9 | 13.1 | 8.8 | 11 | 87 |

Min | 8 (3-1-4) | 7 | 8 | 7 | 8 | 9.1 | 74 |

Max | 8 (3-1-4) | 7 | 10 | 14.2 | 10.5 | 15 | 105 |

ANN architectures and the training epochs for the golf playing dataset.

Mean | 7 (4-1-2) | 6 | 8.2 | 13 | 7.9 | 10.5 | 95.2 |

Min | 7 (4-1-2) | 6 | 7.3 | 6.1 | 7.1 | 6.2 | 88 |

Max | 7 (4-1-2) | 6 | 9.1 | 18.2 | 9.2 | 13.8 | 103 |

ANN architectures and the training epochs for the lenses dataset.

Mean | 8 (4-1-3) | 7 | 9.1 | 13.2 | 8.7 | 12 | 106 |

Min | 8 (4-1-3) | 7 | 8.3 | 7 | 8.2 | 7.8 | 99 |

Max | 8 (4-1-3) | 7 | 10.4 | 20.8 | 11 | 16 | 126 |

Number of extracted rules and rule accuracies.

No. | Dataset | Number of rules | Accuracy
---|---|---|---
1 | Diabetes | 2 | 76.56%
2 | Iris | 3 | 98.67%
3 | Wine | 3 | 91.01%
4 | Season | 4 | 100%
5 | Golf Playing | 3 | 100%
6 | Lenses | 8 | 100%

Performance comparison of the ESRNN with other algorithms for the diabetes data.

2 | 2 | 4 | – | – | – | – | ||

2 | 1 | 3 | – | – | – | – | ||

76.56 | 75 | 76.32 | 70.9 | 76.4 | 72.4 | 72.4 |

Performance comparison of the ESRNN algorithm with other algorithms for the iris data.

BIO RE | Partial RE | |||||||
---|---|---|---|---|---|---|---|---|

3 | 3 | 3 | 4 | 4 | 6 | 3 | ||

1 | 1 | 1 | 1 | 3 | 3 | 2 | ||

98.67 | 91.3 | 97.33 | 94.67 | 78.67 | 78.67 | 97.33 |

Performance of the ESRNN algorithm for the wine data.

3 | ||

3 | ||

91.01 |

Performance comparison of ESRNN with other algorithms for season data.

5 | 7 | 6 | ||

1 | 2 | 1 | ||

100 | 100 | 100 |

Performance comparison of ESRNN with other algorithms for golf playing data.

3 | 8 | 14 | 3 | ||

2 | 2 | 2 | 2 | ||

100 | 100 | 100 | 100 |

Performance comparison of ESRNN with other algorithm for lenses data.

8 | 9 | ||

3 | – | ||

100 | 100 |