A Simple Convolutional Neural Network with Rule Extraction

Guido Bologna

doi:10.3390/app9122411

Department of Computer Science, University of Applied Sciences and Arts of Western Switzerland, Rue de la Prairie 4, 1202 Geneva, Switzerland


Current address: University of Applied Sciences and Arts of Western Switzerland, Rue de la Prairie 4, 1202 Geneva, Switzerland.

Appl. Sci. 2019, 9(12), 2411; https://doi.org/10.3390/app9122411

This article belongs to the Special Issue Advances in Deep Learning

Abstract

Classification responses provided by Multi Layer Perceptrons (MLPs) can be explained by means of propositional rules. So far, many rule extraction techniques have been proposed for shallow MLPs, but not for Convolutional Neural Networks (CNNs). To fill this gap, this work presents a new rule extraction method applied to a typical CNN architecture used in Sentiment Analysis (SA). We focus on the textual data on which the CNN is trained with “tweets” of movie reviews. Its architecture includes an input layer representing words by “word embeddings”, a convolutional layer, a max-pooling layer, followed by a fully connected layer. Rule extraction is performed on the fully connected layer, with the help of the Discretized Interpretable Multi Layer Perceptron (DIMLP). This transparent MLP architecture allows us to generate symbolic rules, by precisely locating axis-parallel hyperplanes. Experiments based on cross-validation emphasize that our approach is more accurate than that based on SVMs and decision trees that substitute DIMLPs. Overall, rules reach high fidelity and the discriminative n-grams represented in the antecedents explain the classifications adequately. With several test examples we illustrate the n-grams represented in the activated rules. They present the particularity to contribute to the final classification with a certain intensity.

Keywords: CNN; model explanation; rule extraction; sentiment analysis; n-grams

1. Introduction

Artificial neural networks learn by examining numerous examples many times. After training, it is very difficult to explain their decisions, because their knowledge is embedded within the values of the parameters and neuron activations, which are at first glance incomprehensible. Deep neural networks are at the root of the significant progress accomplished over the past five years in areas such as artificial vision, natural language processing and speech recognition. In addition, a number of studies have been conducted to clarify the potential of deep models, such as Convolutional Neural Networks (CNNs) in Sentiment Analysis (SA) [,]. Nevertheless, the transparency of bio-inspired models is currently an open and important research topic, as in the long term, the acceptance of these models will depend on it. Furthermore, transparency is essential in relation to the recent European General Data Protection Regulation (GDPR), which also introduces a right to explanation. Specifically, when an automated decision is taken by a system, one has the right to request a meaningful explanation.

Neural networks being considered as black-boxes have been made transparent by techniques that were first applied to shallow Multi Layer Perceptrons (MLPs). A natural way to explain MLP responses is through the use of propositional rules []. Andrews et al. introduced a taxonomy describing the general characteristics of all rule extraction methods []. Later, a desire to make neural network ensembles transparent became apparent and several techniques were proposed. Bologna introduced the Discretized Interpretable Multi Layer Perceptron (DIMLP) to generate symbolic rules from both single networks and ensembles [,]. The key idea behind rule extraction in DIMLPs is the precise localization of axis-parallel discriminative hyperplanes [,,,]. A brief explanation is provided in Section 3.2.

Recently, many works involving deep neural networks have gained momentum. On one hand, many techniques aiming at explaining CNN decisions in image classification are based on visualization of areas that are mainly relevant for the outcome [], but as stated by Rudin []: “it does not explain anything except where the network is looking”. On the other hand, the purpose of several methods is to learn an interpretable model in the local region close to an input instance []. Adadi and Berrada presented a comprehensive overview of Explainable Artificial Intelligence (XAI), including neural networks []. In addition, a survey on black-box models with its “explanators” is proposed by Guidotti et al. []. The state of the art highlights a lack of global methods aiming at extracting symbolic rules from CNNs. Local methods could represent good candidate techniques, in order to define a global algorithm that may aggregate all local models in an ensemble. Currently, the approach of generating rules from aggregated local models has not been tackled, since ensembles tend to be more opaque than single models. Note however that it could be carried out by the same technique used in [].

In this article we propose a rule extraction technique for a CNN, typically used in Sentiment Analysis [] with textual data. Our technique is not restricted to local regions, but it is global to all the samples of the input space. The layers of this network are: a two-dimensional input layer representing sentences by means of word embeddings []; a convolutional layer; a max-pooling layer; and finally a fully connected layer. After the training phase, the fully connected layer is replaced by a DIMLP [] that makes it possible to generate propositional rules. The DIMLP subnetwork approximates the fully connected part of the original CNN to any desired precision. The antecedents of the extracted rules represent maximal responses of convolutional filters. By propagating the rules back to the input layer, it is possible to determine the discriminative combination of words in the decision-making process. These words are structured into n-grams that depend on the size of the convolutional filters. As a result, each antecedent in a rule is given as: “if an n-gram in a sentence is present and maximal with respect to the list of possible n-grams, then...”. As an example, with the sentence “Hilarious, touching and wonderfully dyspeptic.”, the following rule with several n-grams is activated (symbol “-” separates the words composing an n-gram; it is worth noting that punctuation is also taken into account): “if hilarious and hilarious-, and hilarious-,-touching and touching-and-wonderfully and wonderfully-dyspeptic-. then POSITIVE”.

This article extends our recent approach proposed in []. Specifically, the rule extraction algorithm is formalized and more details are provided. The experimental part is now based on cross-validation with many more examples of extracted rules. We also define a measure that makes it possible to determine for each rule antecedent its contribution to the final classification. Finally, a comparison with decision trees and SVMs has been carried out; it shows that our approach is more accurate. In the following paragraphs, Section 2 illustrates a number of representative works aiming at giving insight into deep architectures, Section 3 describes the proposed model with the CNN architecture and the DIMLP subnetwork, then we present the experimental results, followed by the conclusion.

3. Proposed Model

3.1. CNN Architecture

A CNN architecture is composed of several successive layers of neurons. Specifically, in this work we use:

a two-dimensional input layer;
a convolutional layer;
a max-pooling layer;
a fully connected layer.

This network model shown in Figure 1 is very similar to that proposed in []. The only structural difference is the number of fully connected layers, which is equal to one in this work and equal to two in Figure 1. The different layers are described in more detail in the following paragraphs.

Figure 1. The Convolutional Neural Networks (CNN) architecture used in this work. From left to right are shown a two-dimensional input layer, a convolutional layer, a max-pooling layer, and an output layer which is fully connected.

3.1.1. Two-Dimensional Input Layer

A two-dimensional input layer is used to encode text. As shown by Figure 2, several words are represented on the vertical axis. Horizontally, a single word could be viewed as a boolean vector with zeros everywhere, except for a component whose value is equal to one. However, a drawback is the fact that thousands of components are typically required. A more parsimonious coding is achieved by word embeddings []. Specifically, a word is no longer a boolean vector, but a vector of continuous values, typically of size 300. It is worth noting that the dimensionality of word embeddings is often between 100 and 300. As stated in [], 300 is one of the most adopted sizes in various studies.

Figure 2. Representation of text by a matrix of real numbers. A word is represented on the horizontal axis by word embeddings, while vertically several words are expressed.

3.1.2. Two-Dimensional Convolutional Layer and Max-Pooling Layer

A key element in CNNs is the convolution operator. Given a two-dimensional kernel $w_{p, q}$ of size $P \times Q$ and a data matrix of elements $x_{a, b}$, the calculation of an element $c_{i j}$ of the convolutional layer is

\begin{matrix} c_{i j} = f (\sum_{p}^{P} \sum_{q}^{Q} w_{p, q} \cdot x_{i + p, j + q} + b_{p, q}); \end{matrix}

(1)

with $f$ a transfer function and $b_{p, q}$ the bias. As a transfer function we use a hyperbolic tangent:

\begin{matrix} \tanh (x) = \frac{\exp (2 x) - 1}{\exp (2 x) + 1} . \end{matrix}

(2)

We define $S_{p}$ and $S_{q}$ as the stride parameters along the horizontal and vertical axis between two successive convolutions. In this work $S_{p} = S_{q} = 1$. Moreover, we require the kernel to be completely inside the sample matrix (without zero padding). As an example, with data samples of size 6 × 6 and $P = Q = 3$, the resulting convolved map has size 4 × 4.
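The size of the convolved map can be checked with a short sketch of Equation (1). It is a minimal illustration and not the implementation used in this work: the function name `conv2d_valid`, the toy data and the use of a single scalar bias per kernel are assumptions made for the example.

```python
import math

def conv2d_valid(x, w, bias=0.0, f=math.tanh):
    """Stride-1 convolution of data matrix x with kernel w, without zero padding.

    Assumption: a single scalar bias is added to each weighted sum.
    """
    rows, cols = len(x), len(x[0])
    P, Q = len(w), len(w[0])
    out = []
    for i in range(rows - P + 1):          # the kernel stays inside the matrix
        row = []
        for j in range(cols - Q + 1):
            s = sum(w[p][q] * x[i + p][j + q]
                    for p in range(P) for q in range(Q))
            row.append(f(s + bias))        # transfer function, tanh by default
        out.append(row)
    return out

# Arbitrary 6 x 6 sample and 3 x 3 kernel (illustrative values only).
x = [[(i + j) % 3 - 1 for j in range(6)] for i in range(6)]
w = [[0.1] * 3 for _ in range(3)]
c = conv2d_valid(x, w)
print(len(c), len(c[0]))  # 4 4
```

With stride one and no padding, each dimension shrinks from 6 to 6 − 3 + 1 = 4.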

Figure 3 illustrates a particular case of convolution with respect to a text matrix of size 7 × 5 and a kernel of size 4 × 5. This kernel moves over the text matrix, carrying out an element-wise multiplication with the part of the data it currently covers and summing the products. This is repeated by sliding the kernel down by one position, vertically. Hence, the result of convolution is a vector of four components (7 − 4 + 1 = 4). The term “wide convolution” is used here when the horizontal size of the kernel is equal to the horizontal size of the data matrix. With wide convolution, the size of a kernel is defined as its vertical size. For instance, in Figure 3 the kernel size is equal to four.

Figure 3. Wide convolution: The number of columns of the kernel is equal to the number of columns of the text matrix. The result of convolution (shown after the arrow) is a vector.

A remarkable relationship is fulfilled between the size of the kernels and the number of consecutive words taken into account by the convolution operator. Since the size of a kernel is the number of components on the vertical axis, it corresponds to the number of consecutive words processed at any position in a sentence. As an example, three consecutive words are denoted as ’trigrams’ and can be detected by kernels of size three. Similarly, two consecutive words are called ’bigrams’ and can be taken into account by kernels of size two. Finally, to emphasize single words it is possible to define filters of size one.

The max-pooling layer reduces the size of a vector or a matrix by applying a “Max” operator over non-overlapping regions. Figure 4 illustrates on the left a number of vectors obtained after convolution. From each vector the maximal value is extracted and concatenated in a new layer, which enables n-gram position invariance.

Figure 4. The max-pool operator: The maximal value of each vector on the left is selected and concatenated in a new layer denoted as the “Max-pooling” layer (rightmost vector).
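The wide convolution of Figure 3 followed by the max-pool operator of Figure 4 can be sketched as follows. The names `wide_convolution` and `max_pool`, the scalar bias and the numerical values are illustrative assumptions; only the matrix sizes (7 × 5 text matrix, kernel of size four) are taken from the text.

```python
import math

def wide_convolution(text, kernel, bias=0.0):
    """Slide a kernel that spans all embedding columns down the text matrix."""
    size, width = len(kernel), len(kernel[0])
    out = []
    for i in range(len(text) - size + 1):   # one output per window of `size` words
        s = sum(kernel[p][q] * text[i + p][q]
                for p in range(size) for q in range(width))
        out.append(math.tanh(s + bias))
    return out

def max_pool(feature_maps):
    """Keep the maximal response of each convolved vector."""
    return [max(fm) for fm in feature_maps]

# 7 words with 5-dimensional embeddings, and one kernel of size four.
text = [[0.1 * (r + c) for c in range(5)] for r in range(7)]
kernel = [[0.05] * 5 for _ in range(4)]

fm = wide_convolution(text, kernel)
pooled = max_pool([fm])
print(len(fm), len(pooled))  # 4 1
```

Because only the maximum of each convolved vector is kept, the position of the winning n-gram in the sentence no longer matters.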

3.1.3. Fully Connected Layer

In this work, a single fully connected layer of weights follows the max-pooling layer. First, scalars $s_{l}$ are calculated as dot products:

\begin{matrix} s_{l} = \sum_{k} (v_{k l} \cdot m_{k}) . \end{matrix}

(3)

Symbol $m_{k}$ represents vector components of the max-pooling layer and $v_{k l}$ is a matrix term of weight coefficients, the bias being included in the sum. Then, a Softmax activation function is applied. Specifically, for a number N of $s_{l}$ scalars it calculates an N-dimensional vector with values between 0 and 1:

\begin{matrix} o_{l} = \frac{\exp (s_{l})}{\sum_{k} \exp (s_{k})}; \end{matrix}

(4)

with $o_{l}$ as the activation of a neuron in the output layer. The architecture of the CNN used in this work is summarized in Table 1. Specifically, I designates the input layer, C represents the convolutional layer with 40 kernels for each size ($C_{1}, C_{2}, C_{3}$); this value was fixed empirically, without trying to reach the best possible predictive accuracy. M is the max-pooling layer (120 neurons), and O designates the output layer including two neurons. Each word is coded in a vector of 300 components, with a maximum number of words per sample equal to 59.
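Equations (3) and (4) can be illustrated with a small sketch. The function names, the three-component max-pooling vector and the weight values are made up for the example; the bias is passed explicitly here, whereas Equation (3) includes it in the sum.

```python
import math

def fully_connected(m, v, bias):
    """s_l = sum_k v[k][l] * m[k] + bias[l] (Equation (3) with explicit bias)."""
    return [sum(v[k][l] * m[k] for k in range(len(m))) + bias[l]
            for l in range(len(bias))]

def softmax(s):
    """Equation (4); subtracting max(s) only improves numerical stability."""
    shift = max(s)
    e = [math.exp(value - shift) for value in s]
    total = sum(e)
    return [value / total for value in e]

m = [0.3, -0.1, 0.8]                          # toy max-pooling activations
v = [[0.5, -0.5], [0.2, 0.1], [-0.4, 0.9]]    # toy weights, one row per m_k
bias = [0.0, 0.1]

o = softmax(fully_connected(m, v, bias))
predicted_class = o.index(max(o))
print(predicted_class)  # 1
```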

Table 1. CNN architecture. Symbols for each layer are specified in the second row and sizes in the last.

To train the network, the loss function is the cross-entropy (J); here we give its version for two classes:

\begin{matrix} J (W) = - \sum_{p} \sum_{l} [t_{l}^{(p)} log (o_{l}^{(p)}) + (1 - t_{l}^{(p)}) log (1 - o_{l}^{(p)})] . \end{matrix}

(5)

Symbol W represents all the network weights, index p is related to training samples, index l designates an index for the neurons of the output layer, $o_{l}^{(p)}$ is the activation of an output neuron, and $t_{l}^{(p)}$ represents a target value.
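As a worked instance of Equation (5), the following sketch evaluates the loss on two toy samples. The function name `cross_entropy`, the clipping constant `eps` (added only to avoid taking the logarithm of zero) and the numerical values are assumptions of the example.

```python
import math

def cross_entropy(targets, outputs, eps=1e-12):
    """Two-class cross-entropy summed over samples p and output neurons l."""
    total = 0.0
    for t_p, o_p in zip(targets, outputs):
        for t, o in zip(t_p, o_p):
            o = min(max(o, eps), 1.0 - eps)   # keep log() finite
            total += t * math.log(o) + (1 - t) * math.log(1 - o)
    return -total

targets = [[1, 0], [0, 1]]            # one-hot targets of two samples
outputs = [[0.9, 0.1], [0.2, 0.8]]    # corresponding Softmax activations
J = cross_entropy(targets, outputs)
```

For these values the loss reduces to $-2 (\log 0.9 + \log 0.8)$, since each sample contributes the same term for both output neurons.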

3.2. The DIMLP Model

DIMLP differs from a standard MLP in the number of connections between the input layer and the first hidden layer. Specifically, each neuron of the first hidden layer receives a connection from only one input neuron and from the bias neuron, while all other layers are fully connected []. The activation function above the first hidden layer of a typical DIMLP is a sigmoid function given as:

\begin{matrix} σ (x) = \frac{1}{1 + exp (- x)} . \end{matrix}

(6)

For the first hidden layer a step function or its generalization corresponding to a staircase activation function is used. For simplicity, we first give the step function $τ (x)$, which is a particular case of the staircase function with only one step:

\begin{matrix} τ (x) = \begin{cases} 1 & \text{if } x > 0; \\ 0 & \text{otherwise} . \end{cases} \end{matrix}

(7)

The key idea behind rule extraction from DIMLPs is the precise localization of axis-parallel discriminative hyperplanes. In other words, the input space is split into hyper-rectangles representing propositional rules. Specifically, the first hidden layer creates for each input variable a number of axis-parallel hyperplanes that are effective or not, depending on the weight values of the neurons above the first hidden layer. As an example, Figure 5 illustrates an elementary DIMLP network with a hidden neuron, a weight w between the input neuron and the hidden neuron and a bias b between the bias neuron and the hidden neuron. Because of the step function as the activation of the hidden neuron, a potential hyperplane discriminator lies in $- b / w$. It will depend on the layers above the first hidden layer, whether the hyperplane discriminator will be effective or not. Generally, these hyperplanes are parallel to the axis of the input neurons. Hence they represent possible rule antecedents.

Figure 5. A Discretized Interpretable Multi Layer Perceptron (DIMLP) network that potentially creates a discriminative hyperplane in $- b / w$. The activation function of the hidden neuron is a step function, while for the output neuron it is a sigmoid.
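The location of the potential hyperplane can be verified with a few lines of code. This is only a sketch of the elementary network of Figure 5; the values of w and b are arbitrary, and a positive weight is assumed so that the activation switches from 0 to 1 when the input increases.

```python
def step(x):
    """Step function of Equation (7)."""
    return 1 if x > 0 else 0

def hidden_activation(x, w, b):
    """Activation of the single hidden neuron: step(w * x + b)."""
    return step(w * x + b)

w, b = 2.0, -1.0          # arbitrary weight and bias, with w > 0
threshold = -b / w        # potential hyperplane discriminator

print(threshold)                        # 0.5
print(hidden_activation(0.4, w, b))     # 0
print(hidden_activation(0.6, w, b))     # 1
```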

The starting point of the rule extraction algorithm is the list of all potential hyperplane discriminators. The number of these hyperplanes depends on the number of stairs in the staircase activation function. Then, a decision tree is built and rules are generated from each tree path. Typically, at this stage the number of rules and antecedents is too large. Hence, a greedy algorithm progressively removes antecedents and rules. More details on the rule extraction algorithm can be found in [].

Since the CNN defined in the previous Section is trained with a Softmax function in the output layer (cf. Equation (4)), we replace the sigmoid by it. As stated previously, the activation function in the first hidden layer of DIMLPs is a staircase function $S (x)$, with $Θ$ stairs that approximate the Identity function ($I (x)$) on a compact interval:

\begin{matrix} I (x) = x; \end{matrix}

(8)

\begin{matrix} S (x) = R_{\min}, & \text{if } x \leq R_{\min}; \end{matrix}

(9)

$R_{\min}$ represents the abscissa of the first stair. By default $R_{\min} = - 1$.

\begin{matrix} S (x) = R_{\max}, & \text{if } x \geq R_{\max}; \end{matrix}

(10)

$R_{\max}$ represents the abscissa of the last stair. By default $R_{\max} = 1$. Between $R_{\min}$ and $R_{\max}$, $S (x)$ is:

\begin{matrix} S (x) = I (R_{\min} + [θ \cdot \frac{x - R_{\min}}{R_{\max} - R_{\min}}] (\frac{R_{\max} - R_{\min}}{θ})) . \end{matrix}

(11)

Square brackets indicate the integer part function, with $θ = 1, \dots, Θ$. The approximation of the Identity function by a staircase function depends on the number of stairs $Θ$. The larger the number of stairs, the better the approximation.
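The effect of the number of stairs can be illustrated numerically. The sketch below follows Equations (9)–(11) under one assumption that should be stated clearly: the symbol $θ$ of Equation (11) is taken to be the total number of stairs, passed as the hypothetical parameter `n_stairs`. The evaluation grid is also arbitrary.

```python
import math

def staircase(x, n_stairs, r_min=-1.0, r_max=1.0):
    """Staircase approximation of the identity on [r_min, r_max]."""
    if x <= r_min:
        return r_min
    if x >= r_max:
        return r_max
    # Integer part of the rescaled input, mapped back to the original scale.
    k = math.floor(n_stairs * (x - r_min) / (r_max - r_min))
    return r_min + k * (r_max - r_min) / n_stairs

def max_error(n_stairs, n_points=1001):
    """Largest |S(x) - x| on a regular grid of [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    return max(abs(staircase(x, n_stairs) - x) for x in xs)

for n in (10, 50, 200):
    print(n, max_error(n))
```

On the interval the error is bounded by the width of one stair, $(R_{\max} - R_{\min}) / Θ$, so it shrinks as stairs are added.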

3.3. The Interpretable CNN Architecture

The interpretable CNN architecture used in this work is illustrated in Table 2. Its layers are: I-C-M-H-O, with a DIMLP subnetwork included in layers M-H-O. Rule extraction is performed after the training of a CNN with layers I-C-M-O. Note that the CNN with layers I-C-M-H-O is not trained. Specifically, the weight matrix between layers M and O of the trained CNN is transferred to layers H and O of the CNN with layers I-C-M-H-O. Because of the Identity function between M and H, network I-C-M-H-O approximates network I-C-M-O to an arbitrary precision that depends on the number of stairs of the staircase activation function. Hence, rules can be generated from the DIMLP subnetwork. Yet, rule antecedents are related to layer M representing filter values. From that layer, discriminative combinations of words represented in the input layer have to be determined (see below).

Table 2. Interpretable CNN architecture with symbols and sizes.

Figure 6 depicts an example going from a tweet to a rule. First, each word of a sentence is provided to the input layer as a horizontal vector of numbers. Subsequently, these vectors are convolved by the convolutional layer; note that each rectangle on this layer represents a convolution filter. After convolution, the max-pooling layer takes over; its role is to simplify the processed data by retaining maximal values. From this layer to the output layer we have a DIMLP subnetwork. Hence, the extracted rule antecedents represent activation values of the max-pooling layer. For clarity, those which are not represented in the rule at the bottom are left blank. Finally, the correspondence between antecedents and n-grams is shown at the bottom; the algorithm described below allows us to determine them.

Figure 6. Flow of data in the interpretable CNN (see the text for more details).

Generally, many rule extraction techniques generate ordered rules, which means that rules are given in a sequential order, with two consecutive rules linked by an “else” statement. A long list of ordered rules involves many implicit antecedents, which makes the interpretation difficult. Rules generated from the DIMLP subnetwork (M-H-O) are unordered. With this type of rules, the “else” statement is absent. Thus, each rule is considered as a single piece of knowledge that can be examined in isolation [].

Each rule antecedent related to the max-pooling layer is given as $a < t$, or $a \geq t$. Since a rule antecedent can be true with one or more n-grams in the input layer, it involves a disjunction of n-grams (one or more n-grams connected by a logical or). Nevertheless, for a given sample and with the use of the “Max” function in the M layer, a unique n-gram becomes dominant (the one with the highest activation). Before giving an algorithm that allows us to determine discriminative n-grams for a given sample, let us define several sets and variables:

G: Set of n-Grams generated from a dataset;
R: Set of rules (with respect to layer M);
$R_{i}$ : A rule in R;
$A_{i j}$ : An antecedent in $R_{i}$ ; specifically $A_{i j} = (m_{j} < c_{i j})$ or $A_{i j} = (m_{j} \geq c_{i j})$ , with $m_{j}$ designating a neuron in the M layer and $c_{i j} \in ℜ$ ;
$S_{k}$ : A sample covered by $R_{i}$ ;
H: Set of n-Grams generated from $S_{k}$ ;
$Γ$ : Set of sought discriminative n-grams $(Γ \subseteq H)$ ;
$\mathrm{act} (\cdot)$ : The activation(s) of a specified neuron in layer M, with respect to one or more inputs in the input layer.

It is worth noting that with this CNN architecture the activation of neuron $m_{j}$ in the M layer depends solely on kernel $K_{j}$ in the C layer. Hence, given a sample $S_{k}$ covered by $R_{i}$ and its set of n-grams H, each discriminative n-gram related to antecedent $A_{i j}$ is found by characterizing the n-gram (in H) activating neuron $m_{j}$ the most. Given $R_{i}$ and $S_{k}$, the following algorithm generates discriminative n-grams:

$Γ = \emptyset$
For all $A_{i j} \subset R_{i}$ loop (w.r.t. variable j)
$G_{j} = \{g \in G \cap H \mid A_{i j}$ is true $\}$
$F_{j} = \{x \in ℜ^{q} \mid q = |G_{j}|, x = \mathrm{act} (G_{j})$ for neuron $m_{j}\}$
$f_{j}^{*} = \max (F_{j})$
$g_{j}^{*} = \{g \in G_{j} \mid f_{j}^{*} = \mathrm{act} (g)$ for neuron $m_{j}\}$
$Γ = Γ \cup g_{j}^{*}$
end loop

For clarity let us explain the different steps of the algorithm specified above. First, its result will be stored in $Γ$, which is initialized as an empty set. Secondly, a main loop that iterates on all rule antecedents is defined. Subsequently, in step three, for each rule antecedent $A_{i j}$ we determine the n-grams of $S_{k}$ (a sample) that make $A_{i j}$ true. Then, for each antecedent $A_{i j}$ the purpose is to determine the n-gram involving the highest activation of neuron $m_{j}$ in the M layer (steps four to six). For instance, with the antecedent $f_{113} \geq 0.24$ of a particular rule generated from an interpretable CNN and a sentence given as “a real movie, about real people, that gives us a rare glimpse into a culture most of us don’t know”, four trigrams make the antecedent true:

“culture most of” (0.251715)
“rare glimpse into” (0.257523)
“into a culture” (0.273310)
“us a rare” (0.311517)

The number after each trigram is the activation value of the 113th neuron in the M layer. The fourth trigram provides the highest activation, hence it is the detected trigram. Generally, it is possible to obtain a global view of each extracted rule by characterizing for each antecedent its winning n-grams with respect to all the covered samples.
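The selection of the winning n-gram for one antecedent can be sketched as follows, using the four trigram activations quoted above. The function name `winning_ngram` and the dictionary representation are assumptions of this illustration; in the actual model the activations would be computed by the convolutional filter associated with the antecedent.

```python
def winning_ngram(activations, threshold, op=">="):
    """Return the n-gram with the highest activation among those for which
    the antecedent (activation op threshold) is true, or None if none is.

    activations: dict mapping each n-gram of the sample to the activation
    it produces for the neuron of the M layer named in the antecedent.
    """
    if op == ">=":
        candidates = {g: a for g, a in activations.items() if a >= threshold}
    else:  # antecedent of the form activation < threshold
        candidates = {g: a for g, a in activations.items() if a < threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Activations of the four trigrams listed in the text (antecedent >= 0.24).
trigram_activations = {
    "culture most of": 0.251715,
    "rare glimpse into": 0.257523,
    "into a culture": 0.273310,
    "us a rare": 0.311517,
}

winner = winning_ngram(trigram_activations, 0.24)
print(winner)  # us a rare
```

Repeating this selection for every antecedent of a rule and taking the union of the winners gives the set $Γ$ of discriminative n-grams for the sample.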

For a given rule antecedent and a given sample, it is useful to determine the linear contribution of each winning n-gram $g_{j}^{*}$ with respect to the final classification. Specifically, this measure corresponds to the product of the filter activation in the M layer and the weight connecting it to the neuron of highest activation in the output layer. Since the output neuron of strongest activation indicates the class, a positive value of this measure means that the n-gram contributes in favor of the classification, while a negative value is against it. The linear contribution is given as a function:

\begin{matrix} L C (g_{j}^{*}) = v_{k} \cdot \mathrm{act} (g_{j}^{*}); \end{matrix}

(12)

with $\mathrm{act} (g_{j}^{*})$ designating the activation of a neuron $m_{k}$ in the M layer and $v_{k}$ corresponding to a weight between $m_{k}$ and the output layer (the one related to the neuron of highest activation).
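Equation (12) and the resulting ranking of n-grams can be sketched in a few lines. All activations and weights below are hypothetical values chosen for illustration; they are not taken from the trained network.

```python
def linear_contribution(activation, weight_to_winning_output):
    """Equation (12): weight to the winning output neuron times activation."""
    return weight_to_winning_output * activation

# Hypothetical (activation, weight) pairs for the winning n-grams of a rule.
winners = {
    "too-bad-the": (0.40, 0.90),
    "too-bad": (0.30, 0.50),
    "style": (0.30, -0.20),
}

contributions = {g: linear_contribution(a, v) for g, (a, v) in winners.items()}

# Rank n-grams from the most favorable to the most unfavorable contribution.
ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
for ngram, value in ranked:
    print(f"{ngram}: {value:+.3f}")
```

An n-gram with a negative value, such as "style" in this toy example, argues against the predicted class.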

4. Experimental Results

In this section, we first present the general results on the accuracy of CNNs based on cross-validation. Secondly, Decision Trees (DTs) [] are trained with CNN prediction classes instead of the true label. Thirdly, CNNs are compared to Support Vector Machines (SVMs) []. Subsequently, we replace DIMLP subnetworks by DTs. Then, representative examples of rules extracted from CNNs are shown. They emphasize how discriminative n-grams intuitively explain “tweet” classifications. Finally, we illustrate with two examples how to inspect rules, globally.

4.1. General Results

A CNN including a convolutional layer with 120 kernels of size one, two and three was defined empirically (cf. Section 3.1.3). Note that our purpose was not to find the best possible CNN architecture, but rather to create an acceptable CNN in terms of performance and then to generate rules to explain classifications. Training was performed with Lasagne libraries, version 0.2 []. The training parameters were:

learning rate: $0.02$;
momentum: $0.9$;
dropout: $0.2$.

To illustrate the results of rule extraction, we applied the CNN architecture defined above to a well-known binary classification problem related to “tweets” of movie reviews []. The characteristics of the dataset are:

number of samples: 10,662;
maximal number of words in a “tweet”: 59;
positive sentiment samples: 5331;
negative sentiment samples: 5331;
single words: 21,426;
bigrams: 111,590;
trigrams: 174,628.

The positive and negative subsets have been divided into 10 folds, in order to carry out ten-fold cross validation trials. Moreover, a randomly selected subset of the training set representing 10% of it was extracted as a tuning set for early-stopping [], which is useful to avoid over-training. Table 3 illustrates the results. The first row of this Table is related to the original CNN, while the other ones provide results obtained by interpretable CNNs, which depend on the number of stairs in the staircase activation function. Columns from left to right designate:

average train accuracy;
average predictive accuracy on the testing set;
average fidelity, which is the degree of matching between generated rules and the CNN;
average predictive accuracy of the rules;
average predictive accuracy of the rules when rules and CNN agree;
average number of extracted rules;
average number of rule antecedents.

Table 3. Average results based on cross-validation. The models are CNNs and interpretable CNN approximations with varying numbers of stairs in the staircase activation function. Standard deviations are given between parentheses.
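The accuracy and fidelity measures reported in the columns can be computed as in the following sketch. The function name `evaluate`, the dictionary keys and the toy label vectors are assumptions of the example; fidelity is taken here as the fraction of samples on which rules and CNN give the same class.

```python
def evaluate(labels, cnn_pred, rule_pred):
    """Accuracy and fidelity measures for one test fold."""
    n = len(labels)
    agree = [i for i in range(n) if cnn_pred[i] == rule_pred[i]]
    return {
        "cnn_accuracy": sum(c == y for c, y in zip(cnn_pred, labels)) / n,
        "fidelity": len(agree) / n,
        "rule_accuracy": sum(r == y for r, y in zip(rule_pred, labels)) / n,
        # Accuracy restricted to samples where rules and CNN agree.
        "rule_accuracy_when_agree": (
            sum(rule_pred[i] == labels[i] for i in agree) / len(agree)
            if agree else float("nan")
        ),
    }

labels = [1, 0, 1, 1, 0]      # toy true classes
cnn_pred = [1, 0, 0, 1, 1]    # toy CNN predictions
rule_pred = [1, 0, 0, 1, 0]   # toy rule predictions

metrics = evaluate(labels, cnn_pred, rule_pred)
print(metrics)
```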

The average predictive accuracy of the rules is slightly lower than that obtained by the CNN (73.7% versus 74.1%, for the best result). Moreover, average fidelity on the testing set is above 95%, meaning that rules explain CNN responses in a large majority of cases. Finally, the best average predictive accuracy of the rules when rules and model agree is higher than that obtained by the CNN (75.1% versus 74.1%).

As a baseline, a simple pedagogical technique aiming at explaining CNN predictions is represented by a DT that learns training samples with CNN prediction classes, instead of true labels. Table 4 illustrates the results obtained by cross-validation. The $λ$ parameter, which is the minimum number of samples required to be at a leaf node, makes it possible to control the size of the trees. The number of nodes of a tree is in turn related to the proportion of learned training samples. It is worth noting that the fidelity on the training samples decreases when $λ$ is increased. Moreover, the average predictive accuracy is never above 60%, the average fidelity being always below 63%. On one hand, this performance is substantially lower than that obtained by interpretable CNNs with DIMLP subnetworks (see Table 3). On the other hand, with $λ \geq 10$, a lower number of rules are generated.

Table 4. Average results obtained by Decision Trees (DTs) trained to learn datasets with CNN targets, instead of true labels. Columns from left to right represent average results on: Training accuracy; predictive accuracy; fidelity on the training set; fidelity on the testing set; number of extracted rules; and number of antecedents in the rules. In the rows, the $λ$ parameter controlling the size of the trees varies.

We may wonder about replacing the DIMLP subnetwork in interpretable CNNs by Decision Trees, from which rules can be extracted. Note that a decision tree can be viewed as a set of rules, with each rule represented by a path going from the root to a leaf. Table 5 illustrates the results with respect to the extracted rules, by varying the $λ$ parameter. The best average predictive accuracy is equal to 68.6%, which is much lower than that obtained by CNNs with the DIMLP subnetwork (68.6% versus 73.7%). An intuitive reason explaining this result is that the fully connected layer of the original CNN is better approximated by DIMLPs than by DTs. However, DTs generate a significantly smaller number of rules, on average: 201.8 versus 573.8.

Table 5. Average results obtained by the rules generated from Decision Trees that replace the fully connected layer of CNNs. Columns from left to right represent average results on: Training accuracy; testing accuracy; number of extracted rules; number of antecedents in the rules. In the rows, the $λ$ parameter controlling the size of the trees varies.

Support Vector Machines are usually very competitive on very high-dimensional classification problems. Here we have 17,700 ($59 \times 300$) input variables for each sample; thus, a natural question is whether SVMs perform better than CNNs. Due to the high input dimensionality it is recommended to use linear SVMs. Table 6 presents the results by varying the C parameter, which controls the proportion of misclassified training samples []. The best average predictive accuracy is less than that obtained by interpretable CNNs (70.6% versus 73.7%). A question arising is whether this difference is statistically significant.

Table 6. Average results obtained by Support Vector Machines (SVMs). Each row illustrates the accuracy results with respect to the C parameter.

A multiple statistical comparison test is performed to find out whether average predictive accuracies are significantly different. With an ANOVA statistical test we aim at determining whether all the models used here obtain the same average predictive accuracy against the alternative hypothesis that they are not all the same. In other words, the null hypothesis states that the means are all equal. For this statistical test we define a significance level equal to 0.01, which involves a 1% risk of concluding that a difference exists when there is no actual difference. Not surprisingly, ANOVA rejects the null hypothesis that all model average predictive accuracies are equal (p-value $= 3.1252 \cdot 10^{- 59}$).

At this point it is possible to compare a subset of the models, such as the SVMs with the highest average predictive accuracy and the interpretable CNNs. These results are shown in Table 7. The small p-values imply that, with very high probability, the predictive accuracy of the rules generated from each interpretable CNN is significantly different from the one measured on SVMs with C parameter equal to 0.01. Similar results are illustrated in Table 8, with respect to SVMs with C parameter equal to 0.005.

Table 7. ANOVA comparison between CNNs and SVMs providing the best average predictive accuracy (equal to 70.6%).

Table 8. ANOVA comparison between CNNs and SVMs providing the second best average predictive accuracy (equal to 70.5%).

4.2. Examples of Detected N-Grams from the Rules

First, rules are expressed with antecedents given as filter responses. Then, n-grams are determined from the antecedents. Specifically, an antecedent can be made true by a long list of n-grams. Nevertheless, for a given sample and a given rule antecedent, only a unique n-gram “wins the competition” (cf. Section 3.3). Rules are ranked according to their support with respect to the training set, which corresponds to the number of covered samples. Finally, rules are not disjoint, which means that a sample can activate more than one rule.

4.2.1. Discriminative N-Grams Determined from $R_{26}$

We first illustrate rule number 26 ($R_{26}$), with support equal to 464 samples with respect to the training set and 47 samples with respect to the testing set:

$(m_{35} < 0.24) (m_{54} \geq 0.16) (m_{77} < 0.22) (m_{98} \geq 0.18) (m_{112} \geq 0.12) (m_{120} < 0.1)$ Class = NEGATIVE

Here, $m_{i}$ ($i = 1, \dots, 120$) designates the neuron activations in the max-pooling layer. Indices between 1 and 40 are related to single words, those between 41 and 80 correspond to bigrams, and those between 81 and 120 involve trigrams. The accuracy of this rule is 93.3% on the training set and 87.2% on the testing set.
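A minimal sketch of how such a rule can be checked against a vector of max-pooling activations is given below. The thresholds are those of $R_{26}$ as stated above; the helper names and the activation values of the three samples are hypothetical.

```python
import operator

# Antecedents of R_26 as stated in the text: (neuron index, comparison, threshold).
R26 = [
    (35, operator.lt, 0.24),
    (54, operator.ge, 0.16),
    (77, operator.lt, 0.22),
    (98, operator.ge, 0.18),
    (112, operator.ge, 0.12),
    (120, operator.lt, 0.10),
]


def ngram_order(index):
    """Map a 1-based max-pooling index to the n-gram size it encodes (1, 2 or 3)."""
    if not 1 <= index <= 120:
        raise ValueError("index must lie between 1 and 120")
    return (index - 1) // 40 + 1


def rule_fires(antecedents, m):
    """True when every antecedent holds for the activations m (index -> value)."""
    return all(compare(m[i], threshold) for i, compare, threshold in antecedents)


def support_and_accuracy(antecedents, samples, rule_class):
    """Count the covered samples and the fraction of them labelled rule_class."""
    covered = [label for m, label in samples if rule_fires(antecedents, m)]
    if not covered:
        return 0, None
    return len(covered), sum(label == rule_class for label in covered) / len(covered)


# Hypothetical activations: all neurons at 0 except the ones set explicitly.
fires = {i: 0.0 for i in range(1, 121)}
fires.update({54: 0.30, 98: 0.20, 112: 0.15})
blocked = dict(fires)
blocked[35] = 0.50  # violates m_35 < 0.24

samples = [(fires, "NEGATIVE"), (dict(fires), "POSITIVE"), (blocked, "NEGATIVE")]
support, accuracy = support_and_accuracy(R26, samples, "NEGATIVE")
print([ngram_order(i) for i, _, _ in R26], support, accuracy)
```

With this index mapping, $R_{26}$ tests one single-word filter, two bigram filters and three trigram filters.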

The following figures depict examples of “tweets” that belong to the testing set and include the discriminative n-grams, ranked according to their linear contribution in the output layer (cf. Equation (12)). Figure 7 shows a correctly classified “tweet” related to $R_{26}$. Single words, bigrams and trigrams are listed vertically. A “*” or a “+” designates a repeated n-gram, since two different antecedents can code the same n-gram. In Figure 7, “too-bad-the” is a trigram, “too-bad” is a bigram and “style” is a single word. Note that “style” and “style-.” (punctuation is also coded in word embeddings) go against the final classification. The most contributing n-gram is “too-bad-the”, which appears twice. In this case, all n-grams containing “bad” contribute to the negative polarity, which agrees with common sense.
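The ranking shown in these figures can be sketched as follows. Equation (12) is not reproduced in this section, so the sketch assumes that the linear contribution of a max-pooling neuron is its activation multiplied by the weight that connects it to the predicted class. All activations and weights below are hypothetical and do not reproduce the values of Figure 7.

```python
from collections import Counter


def linear_contributions(winners, weights):
    """Rank winning n-grams by contribution.

    winners: list of (neuron index, winning n-gram, activation).
    weights: neuron index -> weight toward the predicted class (assumed form).
    """
    contributions = [(ngram, idx, act * weights[idx]) for idx, ngram, act in winners]
    return sorted(contributions, key=lambda c: c[2], reverse=True)


# Hypothetical winners for one "tweet": two antecedents select the same trigram.
winners = [
    (98, "too-bad-the", 0.30),
    (112, "too-bad-the", 0.20),
    (54, "too-bad", 0.25),
    (35, "style", 0.10),
]
weights = {98: 0.8, 112: 0.7, 54: 0.6, 35: -0.5}

ranked = linear_contributions(winners, weights)
total = sum(value for _, _, value in ranked)
against = [ngram for ngram, _, value in ranked if value < 0]
counts = Counter(ngram for ngram, _, _ in ranked)
repeated = sorted(ngram for ngram, n in counts.items() if n > 1)

for ngram, idx, value in ranked:
    print(f"{ngram:12s} m_{idx:<4d} {value:+.3f}")
print(f"sum = {total:.3f}, against = {against}, repeated = {repeated}")
```

N-grams with a negative value go against the final classification, and an n-gram listed in `repeated` corresponds to the “*” or “+” marker described above.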

Figure 7. N-grams determined from $R_{26}$ and the following “tweet” (in the testing set): “it is just too bad the film’s story does not live up to its style”. The vertical axis lists the n-grams, ranked according to their linear contribution. Negative values are against the final classification, while positive values are in its favor. The sum of the n-grams’ linear contributions is 0.435.

Figure 8 illustrates another example in which the most contributing n-gram to the final classification is a trigram: “mediocre-special-effects”. Note that this is also in accordance with our perception. Bigram “worst-of” is the second most contributing n-gram. Surprisingly, single word “worst” is almost neutral with respect to the final classification.

Figure 8. N-grams determined from $R_{26}$ and the following “tweet”: “a zippy 96 min of mediocre special effects, hoary dialogue, fluxing accents, and—worst of all—silly-looking morlocks”. The sum of the n-grams’ linear contributions is 0.302.

Figure 9 shows that the most discriminative n-gram is trigram “so-mind_numbingly-awful”, then follows bigram “so-mind_numbingly”. Note also that three n-grams are against the final classification. Among them, single word “as” is the strongest element in contradiction with the final classification.

Figure 9. N-grams determined from $R_{26}$ and the following “tweet”: “so mind-numbingly awful that you hope britney won’t do it one more time, as far as movies are concerned”. The sum of the n-grams’ linear contributions is 0.211.

As a last example for rule $R_{26}$, we can see in Figure 10 that trigram “just-bad-;” and bigram “just-bad” are the most contributing n-grams. Again, “as” is the strongest element in favor of the positive polarity (thus, in contradiction with the “tweet” classification).

Figure 10. N-grams determined from $R_{26}$ and the following “tweet”: “the tuxedo wasn’t just bad; it was, as my friend david cross would call it, hungry-man portions of bad.” The sum of the n-grams’ linear contributions is 0.283.

4.2.2. Discriminative N-Grams Determined from $R_{24}$

As a second rule, $R_{24}$ is:

$(m_{29} < 0.18) (m_{54} < 0.26) (m_{93} < 0.16) (m_{95} < 0.18) (m_{110} \geq 0.22) (m_{113} \geq 0.24) (m_{118} < 0.20)$ Class = POSITIVE

Its accuracy is 93.4% on the training set, with support equal to 469 samples, and 98.0% on the testing set, with support equal to 51 samples. Figure 11 depicts a covered “tweet” in the testing set. The most important n-gram is trigram “has-a-subtle”, which is selected by two different antecedents. Moreover, n-grams related to “it’s over” are associated with negative polarities.

Figure 11. N-grams determined from $R_{24}$ and the following “tweet”: “it has a subtle way of getting under your skin and sticking with you long after it’s over”. The sum of the n-grams’ linear contributions is 0.260.

In Figure 12, we illustrate a “tweet” that is more difficult to classify, since it starts in a rather negative manner and then becomes strongly positive. As a consequence, the CNN detects a considerable number of negative elements. Four n-grams are not in favor of the correct classification, two of which include “not”. Overall, the contribution of the positive parts is stronger, and the strongest trigram (“offers-gorgeous-imagery”) is related to two different antecedents.

Figure 12. N-grams determined from $R_{24}$ and the following “tweet”: “though the controversial korean filmmaker’s latest effort is not for all tastes, it offers gorgeous imagery, effective performances, and an increasingly unsettling sense of foreboding.” The sum of the n-grams’ linear contributions is 0.251.

Figure 13 illustrates a short “tweet”. Two trigrams appear twice and, surprisingly, single word “entertaining” contributes negatively to the final class. In Figure 14 the strongest n-gram is “beautifully-accomplished-lyrical”, which is very complimentary. Curiously, bigram “beautifully-accomplished” is slightly negative in its contribution to the final class. This would indicate a defect in the classifier, as would “entertaining” in the previous “tweet”. A possible explanation for the negative contribution of this single word is that it appears 31 times in the negative “tweets” of the dataset. Finally, Figure 15 illustrates a testing case that is wrongly classified by the CNN, but correctly classified by $R_{24}$. Here, “is-a-feast” is the only element that contributes positively to the classification.

Figure 13. N-grams determined from $R_{24}$ and the following “tweet”: “thoughtful, provocative and entertaining”. The sum of the n-grams’ linear contributions is 0.398.

Figure 14. N-grams determined from $R_{24}$ and the following “tweet”: “it’s a beautifully accomplished lyrical meditation on a bunch of despondent and vulnerable characters living in the renown chelsea hotel . . .”. The sum of the n-grams’ linear contributions is 0.355.

Figure 15. N-grams determined from $R_{24}$ and the following “tweet” (in the testing set): “some movies are like a tasty hors-d’oeuvre; this one is a feast”. It is correctly classified by the rule, but wrongly classified by the network.

4.3. Global View of Rules

A rule antecedent can be true with several different n-grams. Hence, to analyze a rule as a whole we must detect its related n-grams for each antecedent and for all covered samples. Note that a rule with antecedents depending on neuron activations in the M-layer can be formulated in the input layer as:

“if one of the n-grams related to the first antecedent is present and maximal then ⋯” and ⋯ “if one of the n-grams related to the last antecedent is present and maximal then ⋯”. Here, “maximal” means that only the n-gram that most strongly activates its related neuron in the M-layer makes the antecedent true.
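The notion of a “present and maximal” n-gram can be sketched by sliding one filter over a “tweet” and keeping the position selected by max-pooling. The two-dimensional embeddings, the bigram kernel and the threshold are hypothetical toy values (the embeddings used in this work have 300 dimensions), and padding and the activation function are omitted; any increasing activation function would leave the winner unchanged.

```python
def ngram_responses(tokens, embeddings, kernel, bias=0.0):
    """Response of one filter of width n = len(kernel) at every position."""
    n = len(kernel)
    responses = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Dot product between the kernel rows and the embeddings of the window.
        value = bias + sum(
            w * x
            for row, token in zip(kernel, window)
            for w, x in zip(row, embeddings[token])
        )
        responses.append(("-".join(window), value))
    return responses


def winning_ngram(tokens, embeddings, kernel, bias=0.0):
    """The n-gram kept by max-pooling, together with its response."""
    return max(ngram_responses(tokens, embeddings, kernel, bias), key=lambda r: r[1])


# Hypothetical toy embeddings and bigram kernel.
embeddings = {
    "just": [0.1, 0.0],
    "too": [0.2, 0.1],
    "bad": [-0.9, 0.8],
    "the": [0.0, 0.1],
}
kernel = [[1.0, 0.0], [-1.0, 1.0]]
tokens = ["just", "too", "bad", "the"]

ngram, response = winning_ngram(tokens, embeddings, kernel)
antecedent_true = response >= 0.28  # hypothetical antecedent of the form m_j >= 0.28
print(ngram, round(response, 3), antecedent_true)
```

Only the winning n-gram decides whether the antecedent holds for this sample; the other n-grams of the “tweet” do not reach the max-pooling layer for that filter.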

Let us give a first example with a rule covering a moderate number of tweets in the training set. We consider a CNN with parameter $\Theta$ equal to 100. Rule 347 ($R_{347}$) has two antecedents. It covers 32 samples in the training set and is given by:

$(m_{36} \geq 0.28) (m_{110} \geq 0.28)$ Class = POSITIVE;

with $m_{36}$ and $m_{110}$ representing neurons in the M-layer. Antecedent $m_{36}$ is related to five single words: “cinema”; “see”; “you”; “you’ll”; “your”. Antecedent $m_{110}$ is related to 31 different trigrams, with five of them starting with “is” (“is-a-beautiful”, “is-a-compelling”, “is-a-satisfying”, “is-a-startling”, “is-a-sweet”, “is-the-heart”); four trigrams starting with “it’s” (“it’s-a-beautiful”, “it’s-a-smart”, “it’s-a-sweet”, “it’s-a-very”); three trigrams starting with “yet” (“yet-deeply-watchable”, “yet-sensual-entertainment”, “yet-shadowy-vision”); two trigrams starting with “warm” (“warm-and-charming”, “warm-your-heart”), and 17 other trigrams (“touching-and-tender”, “simple-,-sweet”, “intriguing-and-beautiful”, etc.). Note that these n-grams clearly belong to the class of positive polarity.

Another example, with a rule covering more samples, is rule $R_{66}$ (253 samples in the training set):

$(m_{74} \geq 0.12) (m_{84} < 0.24) (m_{110} \geq 0.28)$ Class = POSITIVE;

Antecedent $m_{74}$ is related to 155 distinct bigrams. About two thirds of them start with “a”, “an” or “and” (for instance: “a-perfect”; “an-amusing”; “and-touching”). Other bigrams start with “deeply”, such as: “deeply-moving”; “deeply-absorbing”; “deeply-touching”, etc. Finally, “surprisingly” is also present in five bigrams (“surprisingly-charming”, “surprisingly-engaging”, “surprisingly-manages”, “surprisingly-powerful”, “surprisingly-touching”).
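Such a global inspection amounts to collecting, for each antecedent, the winning n-grams over all covered samples and then grouping them, for instance by first word. The sketch below uses only the bigrams quoted above for $m_{74}$ (a small part of the 155); the helper names are hypothetical, and it assumes that tokens never contain the “-” separator themselves.

```python
from collections import Counter, defaultdict


def ngrams_per_antecedent(winners):
    """Group winning n-grams by antecedent; winners holds (neuron index, n-gram)."""
    table = defaultdict(set)
    for index, ngram in winners:
        table[index].add(ngram)
    return table


def first_word_counts(ngrams):
    """Count the distinct n-grams sharing the same first word."""
    return Counter(ngram.split("-")[0] for ngram in ngrams)


# Bigrams quoted in the text for antecedent m_74 of R_66.
quoted = [
    "a-perfect", "an-amusing", "and-touching",
    "deeply-moving", "deeply-absorbing", "deeply-touching",
    "surprisingly-charming", "surprisingly-engaging", "surprisingly-manages",
    "surprisingly-powerful", "surprisingly-touching",
]
winners = [(74, bigram) for bigram in quoted]
# The same bigram can win for several covered samples; the set keeps it once.
winners.append((74, "deeply-moving"))

table = ngrams_per_antecedent(winners)
counts = first_word_counts(table[74])
print(len(table[74]), counts.most_common(2))
```

The same grouping can be repeated for every antecedent of a rule to see in which context the rule applies.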

Regarding $m_{110}$, similarly to $R_{347}$ in which this antecedent is present, we see many trigrams starting with “is” or “it’s”. Other first words shared by several trigrams are: “charming”; “deeply”; “funny”; “heart”; “intelligent”; “intriguing”; “lovely”; “offers”; “powerful”; “smart”; “surprisingly”; “touching”; “warm”; “yet”; etc. For antecedent $m_{84}$ we also have a certain number of trigrams starting with “is” or “it’s”, though significantly fewer than previously. Moreover, we counted more than 40 trigrams starting with “,” and more than 20 trigrams starting with “and”. Less frequent first words shared by several trigrams are: “a”; “but”; “film”; “in”; “melodrama”; “movie”; “smart”; “surprisingly”; “that”.

Since we generate unordered rules, each rule can be viewed as a single classifier that can be analyzed without considering the other rules. The greater the number of rule antecedents and covered samples, the longer the examination takes. However, in principle this is always possible. Specifically, by looking at the n-grams related to each antecedent it is possible to understand in which context a rule is applied.

4.4. Discussion

DTs were not as good as DIMLPs at approximating the fully connected layer at the top of the original CNN. Nevertheless, more rules were generated from DIMLPs. Thus, the better approximation comes at the cost of more complicated rulesets. With DIMLPs it would be feasible to reduce the size of the extracted rulesets by using reduced training sets, as was carried out in [].

The linear contribution of n-grams is a measure that allows us to characterize the relative importance of discriminative combinations of words. On the one hand, it helps to characterize flaws in the classifier, such as the detection of n-grams that contribute to an opposite class. On the other hand, it makes it possible to understand which word combinations are used correctly. The contribution of n-grams is easily calculated with a single fully connected layer, but an open question is how to calculate it with a greater number of fully connected layers. A solution to this problem could be the use of a feature importance value, as described in the framework reported in [].

To characterize a rule as a whole, it is sufficient to determine all discriminative n-grams over all covered samples. From the linear contributions it is possible to determine the n-grams that go against the rule class. Then, flaws can be detected, and a corrective strategy could possibly be developed with additional training examples that involve the problematic n-grams.

The CNN architecture used in this work is simple, as it includes a single convolutional layer. We might wonder whether our rule extraction technique could be adapted to more complex architectures used in object recognition, such as LeNet [], AlexNet [], and VGGNet []. In a similar manner to what has been achieved here, we would extract unordered rules at the level of the first fully connected layer. Given a sample and a rule activated by this sample, it would be conceivable, for each rule antecedent, to determine one or more image regions that contribute positively or negatively to the final classification. The extent of these regions would be characterized by going back to the input layer, after passing through an arbitrary number of convolutional and max-pooling layers.

5. Conclusions

We presented a new rule extraction technique applied to a CNN in sentiment analysis. Our approach is global and could be extended to object recognition problems encompassing a moderate number of convolutional layers. Our rule extraction method consists in approximating a trained CNN by transforming the top layers into a transparent neural network model. Rules generated from the transparent subnetwork are then propagated backward to the input layer and become comprehensible, as the antecedents represent n-grams. The rules can be inspected globally, by determining for each antecedent all the related n-grams, or locally, by characterizing for a given sample the discriminative n-grams that contribute to the final classification. These n-grams allowed us to explain with several examples why the classifier worked well or badly. In the future it will be interesting to determine how to correct flaws in a neural network with the help of the extracted rules. One way could be to inject supplementary training examples, with the aim of modifying the linear contributions of discriminative n-grams.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SA	Sentiment Analysis
CNN	Convolutional Neural Network
DIMLP	Discretized Interpretable Multi Layer Perceptron
MLP	Multi Layer Perceptron
SVM	Support Vector Machines
DT	Decision Trees
FI	Features Importance

References

Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
Cliche, M. BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs. arXiv 2017, arXiv:1704.06125.
Holzinger, A.; Biemann, C.; Pattichis, C.S.; Kell, D.B. What do we need to build explainable AI systems for the medical domain? arXiv 2017, arXiv:1712.09923.
Andrews, R.; Diederich, J.; Tickle, A.B. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl.-Based Syst. 1995, 8, 373–389.
Bologna, G. A study on rule extraction from several combined neural networks. Int. J. Neural Syst. 2001, 11, 247–255.
Bologna, G. Is it worth generating rules from neural network ensembles? J. Appl. Log. 2004, 2, 325–348.
Bologna, G. Rule extraction from a multilayer perceptron with staircase activation functions. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy, 27 July 2000; Volume 3, pp. 419–424.
Bologna, G. A model for single and multiple knowledge based networks. Artif. Intell. Med. 2003, 28, 141–163.
Zhang, Q.S.; Zhu, S.C. Visual interpretability for deep learning: A survey. Front. Inf. Technol. Electron. Eng. 2018, 19, 27–39.
Rudin, C. Please Stop Explaining Black Box Models for High Stakes Decisions. arXiv 2018, arXiv:1811.10154.
Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should i trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
Guidotti, R.; Monreale, A.; Ruggieri, S.; Pedreschi, D.; Turini, F.; Giannotti, F. Local rule-based explanations of black box decision systems. arXiv 2018, arXiv:1805.10820.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
Bologna, G. A Rule Extraction Study Based on a Convolutional Neural Network. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction; Springer: Cham, Switzerland, 2018; pp. 304–313.
Tran, S.N.; Garcez, A.D. Knowledge extraction from deep belief networks for images. In Proceedings of the IJCAI-2013 Workshop on Neural-Symbolic Learning and Reasoning, Beijing, China, 3–9 August 2013.
Hayashi, Y. Use of a Deep Belief Network for Small High-Level Abstraction Data Sets Using Artificial Intelligence with Rule Extraction. Neural Comput. 2018, 30, 3309–3326.
Setiono, R.; Baesens, B.; Mues, C. Recursive neural network rule extraction for data with mixed attributes. IEEE Trans. Neural Netw. 2008, 19, 299–307.
Zilke, J. Extracting Rules from Deep Neural Networks. Master’s Thesis, Computer Science Department, Technische Universitat Darmstadt, Darmstadt, Germany, 2015.
Zilke, J.R.; Mencía, E.L.; Janssen, F. DeepRED—Rule extraction from deep neural networks. In International Conference on Discovery Science; Springer: Cham, Switzerland, 2016; pp. 457–473.
Bologna, G.; Hayashi, Y. A rule extraction study on a neural network trained by deep learning. In Proceedings of the IEEE 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 668–675.
Bologna, G.; Hayashi, Y. Characterization of symbolic rules embedded in deep DIMLP networks: A challenge to transparency of deep learning. J. Artif. Intell. Soft Comput. Res. 2017, 7, 265–286.
Hendricks, L.A.; Akata, Z.; Rohrbach, M.; Donahue, J.; Schiele, B.; Darrell, T. Generating visual explanations. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 3–19.
Babiker, H.K.B.; Goebel, R. Using KL-divergence to focus Deep Visual Explanation. arXiv 2017, arXiv:1711.06431.
Lapuschkin, S.; Binder, A.; Montavon, G.; Muller, K.R.; Samek, W. Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2912–2920.
Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 818–833.
Mahendran, A.; Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5188–5196.
Dosovitskiy, A.; Brox, T. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4829–4837.
Turner, R. A model explanation system. In Proceedings of the 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), Salerno, Italy, 13–16 September 2016; pp. 1–6.
Koh, P.W.; Liang, P. Understanding black-box predictions via influence functions. arXiv 2017, arXiv:1703.04730.
Frosst, N.; Hinton, G. Distilling a neural network into a soft decision tree. arXiv 2017, arXiv:1711.09784.
Zhang, Q.; Yang, Y.; Wu, Y.N.; Zhu, S.C. Interpreting CNNs via decision trees. arXiv 2018, arXiv:1802.00121.
Jacovi, A.; Shalom, O.S.; Goldberg, Y. Understanding Convolutional Neural Networks for Text Classification. arXiv 2018, arXiv:1809.08037.
Arras, L.; Horn, F.; Montavon, G.; Müller, K.R.; Samek, W. “What is relevant in a text document?” An interpretable machine learning approach. PLoS ONE 2017, 12, e0181142.
Craven, M.; Shavlik, J.W. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems; MIT Press: Denver, CO, USA, 1996; pp. 24–30.
Augasta, M.G.; Kathirvalavakumar, T. Reverse engineering the neural networks for rule extraction in classification problems. Neural Process. Lett. 2012, 35, 131–150.
Yin, Z.; Shen, Y. On the dimensionality of word embedding. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 887–898.
Lakkaraju, H.; Bach, S.H.; Leskovec, J. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1675–1684.
Quinlan, J.R. C4.5: Programs for machine learning. Morgan Kaufmann Publishers, Inc., 1993. Mach. Learn. 1994, 16, 235–240.
Vapnik, V.N.; Vapnik, V. Statistical Learning Theory; Wiley: New York, NY, USA, 1998; Volume 1.
Dieleman, S.; Schlüter, J.; Raffel, C.; Olson, E.; Sønderby, S.K.; Nouri, D.; Maturana, D.; Thoma, M.; Battenberg, E.; Kelly, J.; et al. Lasagne: First Release; Zelando: Genève, Switzerland, 2015.
Pang, B.; Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; Association for Computational Linguistics; p. 271.
Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315.
Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774.
LeCun, Y.; Bottou, L.; Bengio, Y. LeNet-5, Convolutional Neural Networks. 2015, p. 20. Available online: http://yann.lecun.com/exdb/lenet (accessed on 13 June 2019).
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.

Figure 1. The Convolutional Neural Networks (CNN) architecture used in this work. From left to right are shown a two-dimensional input layer, a convolutional layer, a max-pooling layer, and an output layer which is fully connected.

Figure 2. Representation of text by a matrix of real numbers. A word is represented on the horizontal axis by word embeddings, while vertically several words are expressed.

Figure 3. Wide convolution: The number of columns of the kernel is equal to the number of columns of the text matrix. The result of convolution (showed after the arrow) is a vector.

Figure 4. The max-pool operator: The maximal value of each vector in the left is selected and concatenated in a new layer denoted as the “Max-pooling” layer (right most vector).

Figure 5. A Discretized Interpretable Multi Layer Perceptron (DIMLP) network that potentially creates a discriminative hyperplane in $-b/w$. The activation function of the hidden neuron is a step function, while for the output neuron it is a sigmoid.

Figure 6. Flow of data in the interpretable CNN (see the text for more details).

Table 1. CNN architecture. Symbols for each layer are specified in the second row and sizes in the last.

Input	Convolution	Max-Pooling Layer	Output
I	$C = (C_{1}, C_{2}, C_{3})$	M	O
59 × 300	1 × 300 × 40; 2 × 300 × 40; 3 × 300 × 40	120	2

Table 2. Interpretable CNN architecture with symbols and sizes.

Input	Conv.	Max-Pooling Layer	DIMLP Hid. Layer	Output
I	$C = (C_{1}, C_{2}, C_{3})$	M	H	O
59 × 300	1 × 300 × 40; 2 × 300 × 40; 3 × 300 × 40	120	120	2

Table 3. Average results based on cross-validation. The models are CNNs and interpretable CNN approximations with varying numbers of stairs in the staircase activation function. Standard deviations are given between parentheses.

	Tr. Acc.	Tst. Acc.	Fid.	Rul. Acc. (1)	Rul. Acc. (2)	#Rul.	#Ant.
CNN	82.1 (1.3)	74.1 (1.1)	–	–	–	–	–
CNN ( $Θ = 100$ )	82.1 (1.3)	74.1 (1.1)	95.3 (0.6)	73.7 (1.0)	75.1 (1.0)	651.7 (53.1)	5368.9 (379.3)
CNN ( $Θ = 200$ )	82.2 (1.4)	74.1 (1.0)	95.8 (0.4)	73.7 (1.0)	74.9 (1.0)	573.8 (49.0)	5040.7 (354.0)
CNN ( $Θ = 500$ )	82.1 (1.4)	74.1 (1.1)	95.6 (0.8)	73.6 (1.0)	75.0 (0.9)	566.1 (28.0)	4980.7 (364.5)

Table 4. Average results obtained by Decision Trees (DTs) trained to learn datasets with CNN targets, instead of true labels. Columns from left to right represent average results on: training accuracy; predictive accuracy; fidelity on the training set; fidelity on the testing set; number of extracted rules; and number of antecedents in the rules. In the rows, the $\lambda$ parameter controlling the size of the trees varies.

	Tr. Acc.	Tst. Acc.	Tr. Fid.	Tst Fid.	#Rul.	#Ant.
DT $(λ = 1)$	82.1 (1.3)	57.5 (1.4)	100.0 (0.0)	60.9 (0.8)	797.9 (16.9)	10,354.3 (704.6)
DT $(λ = 5)$	78.6 (1.1)	57.9 (0.9)	94.3 (0.2)	61.0 (1.0)	615.7 (6.4)	7134.4 (304.6)
DT $(λ = 10)$	75.6 (0.8)	58.4 (1.3)	89.1 (0.2)	61.4 (1.3)	456.8 (7.4)	4863.7 (149.3)
DT $(λ = 15)$	73.3 (0.7)	58.5 (1.4)	85.4 (0.4)	61.6 (0.8)	362.2 (5.6)	3666.2 (116.4)
DT $(λ = 20)$	71.9 (0.6)	58.5 (1.5)	82.9 (0.3)	61.5 (2.2)	299.5 (4.5)	2904.5 (86.4)
DT $(λ = 25)$	70.6 (0.6)	58.6 (1.6)	81.1 (0.4)	61.4 (1.8)	253.9 (3.5)	2363.1 (71.6)
DT $(λ = 30)$	69.8 (0.6)	59.3 (1.0)	79.8 (0.4)	62.1 (1.8)	222.7 (4.4)	2015.9 (52.1)
DT $(λ = 35)$	69.0 (1.0)	59.3 (1.6)	78.5 (0.4)	62.5 (1.5)	194.4 (4.2)	1707.9 (64.1)
DT $(λ = 40)$	68.5 (0.7)	59.7 (1.8)	77.4 (0.3)	62.3 (1.4)	173.9 (3.8)	1482.8 (42.4)
DT $(λ = 50)$	67.6 (0.6)	59.6 (1.1)	75.9 (0.4)	62.7 (1.2)	143.8 (2.7)	1168.4 (31.5)

Table 5. Average results obtained by the rules generated from Decision Trees that replace the fully connected layer of CNNs. Columns from left to right represent average results on: training accuracy; testing accuracy; number of extracted rules; number of antecedents in the rules. In the rows, the $\lambda$ parameter controlling the size of the trees varies.

	Tr. Acc.	Tst. Acc.	#Rul.	#Ant.
DT $(λ = 1)$	100.0 (0.0)	66.0 (1.6)	943.2 (23.6)	11,160.8 (367.7)
DT $(λ = 5)$	93.4 (0.2)	66.8 (1.4)	640.0 (16.3)	6710.8 (191.2)
DT $(λ = 10)$	89.0 (0.3)	67.0 (1.4)	450.4 (9.3)	4414.5 (97.7)
DT $(λ = 15)$	86.5 (0.4)	67.6 (1.1)	342.6 (5.6)	3198.0 (60.5)
DT $(λ = 20)$	84.8 (0.4)	68.0 (1.0)	276.8 (7.4)	2483.2 (69.9)
DT $(λ = 25)$	83.6 (0.4)	67.9 (1.3)	234.6 (4.7)	2043.1 (38.4)
DT $(λ = 30)$	82.7 (0.4)	68.6 (0.9)	201.8 (4.2)	1704.5 (38.1)
DT $(λ = 35)$	82.0 (0.5)	68.3 (1.1)	178.0 (3.5)	1468.0 (40.1)
DT $(λ = 40)$	81.3 (0.5)	68.3 (1.2)	159.1 (3.5)	1281.6 (29.0)

Table 6. Average results obtained by Support Vector Machines (SVMs). Each row illustrates the accuracy results with respect to the C parameter.

	Tr. Acc.	Tst. Acc.
SVM $(C = 0.005)$	75.1 (0.1)	70.5 (1.4)
SVM $(C = 0.01)$	77.1 (0.2)	70.6 (1.3)
SVM $(C = 0.05)$	81.6 (0.2)	70.2 (0.8)
SVM $(C = 0.1)$	83.6 (0.3)	69.9 (1.0)
SVM $(C = 0.5)$	88.6 (0.2)	68.1 (0.9)
SVM $(C = 1)$	90.7 (0.2)	67.5 (0.9)

Table 7. ANOVA comparison between CNNs and SVMs providing the best average predictive accuracy (equal to 70.6%).

	CNN Rul. Acc. (Average)	CNN Rul. Acc. (Median)	p-Value
CNN ( $Θ = 100$ ) and SVM ( $C = 0.01$ )	73.7	73.8	$1.0946 \cdot 10^{- 6}$
CNN ( $Θ = 200$ ) and SVM ( $C = 0.01$ )	73.7	73.6	$9.6623 \cdot 10^{- 7}$
CNN ( $Θ = 500$ ) and SVM ( $C = 0.01$ )	73.6	73.9	$1.5288 \cdot 10^{- 6}$

Table 8. ANOVA comparison between CNNs and SVMs providing the second best average predictive accuracy (equal to 70.5%).

	p-Value
CNN ( $Θ = 100$ ) and SVM ( $C = 0.005$ )	$8.8375 \cdot 10^{- 7}$
CNN ( $Θ = 200$ ) and SVM ( $C = 0.005$ )	$8.4786 \cdot 10^{- 7}$
CNN ( $Θ = 500$ ) and SVM ( $C = 0.005$ )	$1.0091 \cdot 10^{- 6}$

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
