A Simple Convolutional Neural Network with Rule Extraction

Abstract: Classification responses provided by Multi Layer Perceptrons (MLPs) can be explained by means of propositional rules. So far, many rule extraction techniques have been proposed for shallow MLPs, but not for Convolutional Neural Networks (CNNs). To fill this gap, this work presents a new rule extraction method applied to a typical CNN architecture used in Sentiment Analysis (SA). We focus on textual data, on which the CNN is trained with "tweets" of movie reviews. Its architecture includes an input layer representing words by "word embeddings", a convolutional layer, a max-pooling layer, followed by a fully connected layer. Rule extraction is performed on the fully connected layer, with the help of the Discretized Interpretable Multi Layer Perceptron (DIMLP). This transparent MLP architecture allows us to generate symbolic rules by precisely locating axis-parallel hyperplanes. Experiments based on cross-validation show that our approach is more accurate than approaches based on SVMs and on decision trees that substitute DIMLPs. Overall, the rules reach high fidelity, and the discriminative n-grams represented in the antecedents explain the classifications adequately. With several test examples we illustrate the n-grams represented in the activated rules; their particularity is to contribute to the final classification with a certain intensity.


Introduction
Artificial neural networks learn by examining numerous examples many times. After training, it is very difficult to explain their decisions, because their knowledge is embedded within the values of the parameters and neuron activations, which are at first glance incomprehensible. Deep neural networks are at the root of the significant progress accomplished over the past five years in areas such as artificial vision, natural language processing and speech recognition. In addition, a number of studies have been conducted to clarify the potential of deep models, such as Convolutional Neural Networks (CNNs), in Sentiment Analysis (SA) [1,2]. Nevertheless, the transparency of bio-inspired models is currently an open and important research topic, as in the long term the acceptance of these models will depend on it. Furthermore, transparency is essential in relation to the recent European General Data Protection Regulation (GDPR), which includes a right to explanation. Specifically, when an automated decision is taken by a system, one has the right to request a meaningful explanation.
Neural networks, often considered as black boxes, have been made transparent by techniques that were first applied to shallow Multi Layer Perceptrons (MLPs). A natural way to explain MLP responses is through the use of propositional rules [3]. Andrews et al. introduced a taxonomy describing the general characteristics of all rule extraction methods [4]. Later, a desire to make neural network ensembles transparent became apparent and several techniques were proposed. Bologna introduced the Discretized Interpretable Multi Layer Perceptron (DIMLP) to generate symbolic rules from both single networks and network ensembles. Recently, Hayashi trained DBNs and then transferred the last layer of weights into MLPs having a unique hidden layer [17]. Rules were generated from these shallow MLPs by the Re-RX algorithm [18].
Zilke proposed a rule extraction technique applied to deep networks of stacked auto-associators [19,20]. Specifically, rules were first generated by decision trees between two successive layers. Subsequently, rules putting into play the input layer and the output layer were formed by transitivity. Similarly, in [21,22] the author generated rules from deep DIMLPs based on stacked auto-encoders.

Explanation of CNN Classifications
On object recognition problems, Hendricks et al. explained CNN classifications by training a second network that learns to produce sentences as explanations [23]. Specifically, it learned to generate sentences satisfying a global sentence property through reinforcement learning. Babiker et al. proposed to generate a heat map from a CNN to localize important regions of the image for classification [24]. The key idea behind this approach is the use of the Kullback-Leibler divergence gradient. Lapuschkin et al. introduced Layer-wise Relevance Propagation (LRP) to determine the inherent reasoning of deep neural networks [25]. In practice, this technique generated heatmaps of relevant areas contributing to the final classification. Furthermore, the authors characterized the importance of the context with respect to targeted objects in images. For medical images, Holzinger et al. used very deep CNN architectures with residuals [3]. To understand the internal structure of the trained networks, they proposed to replace the majority of convolutional layers by AM-FM components (Amplitude Modulation-Frequency Modulation) and to retrain the upper network layers. Overall, convolution layers were visualized through their frequency coverage.
Zeiler and Fergus presented DeconvNet [26], in which strong activations are propagated backward to determine parts of the image causing these activations. Mahendran and Vedaldi proposed an optimization method based on image priors to invert a CNN [27]. As a result, visualizations yielded insight on the information represented at each layer. Dosovitskiy and Brox proposed to analyze which data is preserved by a feature representation and which data is discarded [28]. Specifically, from a feature vector a neural network was trained to predict the expected pre-image, corresponding to the average image producing the given feature vector.
In the context of text classification, Layer-wise Relevance Propagation (LRP) has also been used to estimate which individual words are relevant to the overall classification decision.

Differences and Similarities with Our Approach
Many approaches aiming at explaining CNN classifications are based on the visualization of image areas that are mainly relevant to the result. A recent survey on this topic is presented in [9]. Guidotti et al. present a comprehensive survey of techniques aimed at elucidating deep neural networks [13]. In their view, explanators of black-box models include decision trees and symbolic rules, but also other entities such as Feature Importance (FI) and Sensitivity Analysis. Specifically, the latter allows the inspection of the black box by observing changes in the classification result when varying the inputs. FI is a measure representing the importance of the inputs, very often related to the coefficients resulting from trained linear models. Note that both Sensitivity Analysis and FI highlight specific properties of the model, without requiring an overall understanding of it.
Our ultimate goal is to determine propositional rules, because they are close to the logic used by humans. Moreover, with the use of symbolic rules, discrimination between different classes is explained through rule antecedent values. This is much more precise than simply characterizing the relevant image subregions, because the way in which discrimination between different classes is carried out is undetermined [10]. The main difference between our method and the most recent techniques described in [12,13] aiming at generating decision trees or symbolic rules from deep neural networks is that our method is global and the others are local.
With respect to the taxonomy introduced by Andrews et al. [4] describing the general characteristics of all rule extraction methods, pedagogical techniques can potentially generate symbolic rules from any neural network architecture (globally or locally). Specifically, a pedagogical method learns the associations between the input layer and the classification responses of the black-box model, without taking into account the values of its weights. Subsequently, rules can be generated, since the model that has been trained to recognize input-output associations is transparent. Examples of pedagogical techniques are reported in [35,36]. Despite their potential, pedagogical techniques have only very rarely been applied to deep architectures like CNNs [31]. It is worth noting that our rule extraction technique is not pedagogical, since it belongs to the 'eclectic' category [4].
Eclectic techniques combine pedagogical and decompositional elements. The decompositional category means that the transparent model is generated by analyzing the weight values of the neural network model. Note that the majority of decompositional techniques suffer from exponential algorithmic complexity [6]. As a result, in many cases a pruning step is carried out to reduce the number of weights. Hence, it seems very difficult to apply decompositional algorithms to CNN architectures. The DIMLP rule extraction method has polynomial algorithmic complexity [8]. It is eclectic because the weights of the first hidden layer define possible rule antecedents. In a second step, these rule antecedents are confirmed or rejected by a pedagogical algorithm.

CNN Architecture
A CNN architecture is composed of several successive layers of neurons. Specifically, in this work we use:
• a two-dimensional input layer;
• a convolutional layer;
• a max-pooling layer;
• a fully connected layer.
This network model, shown in Figure 1, is very similar to that proposed in [2]. The only structural difference is the number of fully connected layers, which is equal to one in this work and to two in [2]. The different layers are described in more detail in the following paragraphs.

(Figure 1: 2D input → convolutional layer → max-pooling layer.)

Two-Dimensional Input Layer
A two-dimensional input layer is used to encode text. As shown in Figure 2, the words of a sentence are laid out on the vertical axis. Horizontally, a single word could be viewed as a boolean vector with zeros everywhere, except for a single component whose value is equal to one. However, a drawback of this coding is that thousands of components are typically required. A more parsimonious coding is achieved by word embeddings [14]. Specifically, a word is no longer a boolean vector, but a vector of continuous values, typically of size 300. The dimensionality of word embeddings is often between 100 and 300; as stated in [37], 300 is one of the most adopted sizes in various studies.
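As an illustration, the embedding lookup can be sketched as follows (a toy Python example; the vocabulary, the dimension of 3 instead of 300, and the `encode` helper are ours, not from the paper):

```python
import numpy as np

# Toy sketch of the 2D input encoding: each word is mapped to a dense
# embedding vector; the sentence becomes a (max_words x dim) matrix.
rng = np.random.default_rng(0)
vocab = {"a": 0, "rare": 1, "glimpse": 2, "<pad>": 3}
dim, max_words = 3, 5
embeddings = rng.normal(size=(len(vocab), dim))

def encode(sentence, max_words=max_words):
    ids = [vocab.get(w, vocab["<pad>"]) for w in sentence.split()]
    ids = (ids + [vocab["<pad>"]] * max_words)[:max_words]  # pad/truncate
    return embeddings[ids]                                  # (max_words, dim)

x = encode("a rare glimpse")
print(x.shape)  # (5, 3)
```

In the paper's setting, the corresponding matrix would be 59 × 300 per sample.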

Two-Dimensional Convolutional Layer and Max-Pooling Layer
A key element in CNNs is the convolution operator. Given a two-dimensional kernel of elements w_pq, of size P × Q, and a data matrix of elements x_ab, an element c_ij of the convolutional layer is calculated as

c_ij = f( Σ_{p=1..P} Σ_{q=1..Q} w_pq · x_(i+p-1)(j+q-1) + b ),

with f a transfer function and b the bias. As a transfer function we use the hyperbolic tangent:

f(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).

We define S_p and S_q as the stride parameters along the horizontal and vertical axes between two successive convolutions; in this work S_p = S_q = 1. Moreover, we require the kernel to be completely inside the sample matrix (without zero padding). As an example, with data samples of size 6 × 6 and P = Q = 3, the resulting convolved map has size 4 × 4.

Figure 3 illustrates a particular case of convolution with respect to a text matrix of size 7 × 5 and a kernel of size 4 × 5. This kernel moves over the text matrix, carrying out an element-wise multiplication with the part of the data it is currently on. This is repeated by sliding the kernel down by one position, vertically. Hence, the result of the convolution is a vector of four components. "Wide convolution" denotes the case in which the horizontal size of the kernel is equal to the horizontal size of the data matrix. With wide convolution, the size of a kernel is defined as its vertical size; for instance, in Figure 3 the kernel size is equal to four.

A useful relationship holds between the size of the kernels and the number of consecutive words taken into account by the convolution operator. Since the size of a kernel is the number of components on the vertical axis, it corresponds to the number of consecutive words processed at any position in a sentence. As an example, three consecutive words are denoted as "trigrams" and can be detected by kernels of size three. Similarly, two consecutive words are called "bigrams" and can be taken into account by kernels of size two. Finally, to emphasize single words it is possible to define filters of size one.
The max-pooling layer reduces the size of a vector or a matrix by applying a "Max" operator over non-overlapping regions. Figure 4 illustrates on the left a number of vectors obtained after convolution. From each vector the maximal value is extracted and concatenated into a new layer, which enables n-gram position invariance.
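Wide convolution with stride one followed by max-pooling can be sketched as follows (a minimal Python illustration with random values; the function name `wide_conv_maxpool` is ours):

```python
import numpy as np

# Sketch of "wide convolution" followed by max-pooling (cf. Figures 3 and 4):
# the kernel spans the full embedding width, slides vertically with stride 1,
# and the "Max" operator keeps one value per kernel (n-gram position invariance).
def wide_conv_maxpool(text_matrix, kernel, bias=0.0):
    n = kernel.shape[0]                       # kernel (n-gram) size
    rows = text_matrix.shape[0] - n + 1       # e.g. 7 - 4 + 1 = 4 positions
    c = np.array([np.tanh(np.sum(kernel * text_matrix[i:i + n]) + bias)
                  for i in range(rows)])
    return c, c.max()                         # convolution vector, pooled max

rng = np.random.default_rng(1)
text = rng.normal(size=(7, 5))                # 7 words, embedding dim 5
kern = rng.normal(size=(4, 5))                # kernel of size 4 (4-grams)
c, m = wide_conv_maxpool(text, kern)
print(c.shape)  # (4,)
```

With the 7 × 5 text matrix and 4 × 5 kernel of Figure 3, the convolution indeed yields four components, of which only the maximum survives pooling.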

Fully Connected Layer
In this work, a unique fully connected layer of weights follows the max-pooling layer. First, a dot product producing scalars s_l is calculated:

s_l = Σ_k v_kl · m_k,

where m_k represents the vector components of the max-pooling layer and v_kl is a matrix term of weight coefficients, the bias being included in the sum. Then, a Softmax activation function is applied. Specifically, for a number N of scalars s_i it calculates an N-dimensional vector with values between 0 and 1:

o_l = e^{s_l} / Σ_{i=1..N} e^{s_i},

with o_l the activation of a neuron in the output layer.

The architecture of the CNN used in this work is summarized in Table 1. Specifically, I designates the input layer; C represents the convolutional layer (C_1, C_2, C_3), with 40 kernels for each size (this value has been fixed empirically, without trying to reach the best possible predictive accuracy); M is the max-pooling layer (120 neurons); and O designates the output layer, which includes two neurons. Each word is coded as a vector of 300 components, with a maximum number of words per sample equal to 59.

Table 1. CNN architecture. Symbols for each layer are specified in the second row and sizes in the last.

Input | Convolution | Max-Pooling | Output
I | C_1, C_2, C_3 | M | O
59 × 300 | 40 kernels per size | 120 | 2
To train the network, the loss function is the cross-entropy J; here we give its version for two classes:

J(W) = − Σ_p Σ_{l=1,2} t_pl · log(o_pl),

where W represents all the network weights, index p runs over the training samples, index l designates an index for the neurons of the output layer, o_pl is the corresponding output activation and t_pl the target value.
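A minimal numerical sketch of the fully connected layer, Softmax and two-class cross-entropy described above (Python with placeholder random weights; `forward` and `cross_entropy` are illustrative names, not from the paper):

```python
import numpy as np

# Sketch of the fully connected layer, Softmax output and two-class
# cross-entropy loss; weights and activations are random placeholders.
def forward(m, V):                  # m: max-pooling activations, V: weights
    s = m @ V                       # s_l = sum_k v_kl * m_k (bias folded in)
    e = np.exp(s - s.max())         # numerically stable Softmax
    return e / e.sum()

def cross_entropy(o, t):            # o: outputs, t: one-hot targets
    return -np.sum(t * np.log(o))

rng = np.random.default_rng(2)
m = rng.random(120)                 # 120 max-pooling neurons (cf. Table 1)
V = rng.normal(size=(120, 2))       # two output classes
o = forward(m, V)
t = np.array([1.0, 0.0])            # one-hot target for the positive class
J = cross_entropy(o, t)
print(round(float(o.sum()), 6))  # 1.0
```

The Softmax outputs sum to one and the loss is positive whenever the predicted probability of the true class is below one.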

The DIMLP Model
DIMLP differs from a standard MLP in the number of connections between the input layer and the first hidden layer. Specifically, any hidden neuron receives only a connection from an input neuron and the bias neuron, while all other layers are fully connected [7]. The activation function above the first hidden layer of a typical DIMLP is the sigmoid function:

σ(x) = 1 / (1 + e^{−x}).

For the first hidden layer a step function, or its generalization corresponding to a staircase activation function, is used. For simplicity, we first give the step function τ(x), which is a particular case of the staircase function with only one step:

τ(x) = 1 if x > 0; τ(x) = 0 otherwise.

The key idea behind rule extraction from DIMLPs is the precise localization of axis-parallel discriminative hyperplanes. In other words, the input space is split into hyper-rectangles representing propositional rules. Specifically, the first hidden layer creates for each input variable a number of axis-parallel hyperplanes that are effective or not, depending on the weight values of the neurons above the first hidden layer. As an example, Figure 5 illustrates an elementary DIMLP network with a hidden neuron, a weight w between the input neuron and the hidden neuron, and a bias b between the bias neuron and the hidden neuron. Because of the step function used as the activation of the hidden neuron, a potential hyperplane discriminator lies at −b/w. Whether this hyperplane discriminator is effective or not depends on the layers above the first hidden layer. Generally, these hyperplanes are parallel to the axes of the input neurons; hence they represent possible rule antecedents.
The starting point of the rule extraction algorithm is the list of all potential hyperplane discriminators. The number of these hyperplanes depends on the number of stairs in the staircase activation function. Then, a decision tree is built and rules are generated from each tree path. Typically, at this stage the number of rules and antecedents is too large; hence, a greedy algorithm progressively removes antecedents and rules. More details on the rule extraction algorithm can be found in [8]. Since the CNN defined in the previous Section is trained with a Softmax function in the output layer (cf. Equation (4)), we replace the sigmoid by it. As stated previously, the activation function in the first hidden layer of DIMLPs is a staircase function S(x) with Θ stairs that approximates the Identity function I(x) on a compact interval:

S(x) = R_min, if x ≤ R_min;
S(x) = R_max, if x ≥ R_max.

R_min represents the abscissa of the first stair; by default R_min = −1. R_max represents the abscissa of the last stair; by default R_max = 1. Between R_min and R_max, S(x) is:

S(x) = R_min + (R_max − R_min)/Θ · [ Θ · (x − R_min) / (R_max − R_min) ],

where square brackets indicate the integer part function. The approximation of the Identity function by a staircase function depends on the number of stairs Θ: the larger the number of stairs, the better the approximation.
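The staircase activation can be sketched as follows (a Python illustration under the default values R_min = −1 and R_max = 1; the number of stairs chosen here is arbitrary):

```python
import numpy as np

# Sketch of the staircase activation S(x) approximating the identity on
# [R_min, R_max] with Theta stairs (square brackets = integer part).
def staircase(x, theta=6, r_min=-1.0, r_max=1.0):
    if x <= r_min:
        return r_min
    if x >= r_max:
        return r_max
    step = int(theta * (x - r_min) / (r_max - r_min))  # integer part
    return r_min + step * (r_max - r_min) / theta

xs = np.linspace(-1, 1, 9)
print([round(staircase(x), 3) for x in xs])
```

The approximation error is bounded by the stair width (R_max − R_min)/Θ, so increasing Θ tightens the approximation, as stated above.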

The Interpretable CNN Architecture
The interpretable CNN architecture used in this work is illustrated in Table 2. Figure 6 depicts an example going from a tweet to a rule. First, each word of a sentence is provided to the input layer as a horizontal vector of numbers. Subsequently, these vectors are convolved by the convolutional layer; note that each rectangle on this layer represents a convolution filter. After convolution, the max-pooling layer takes over; its role is to simplify the processed data by retaining maximal values. From this layer to the output layer we have a DIMLP subnetwork. Hence, the extracted rule antecedents represent activation values of the max-pooling layer. For clarity, those which are not represented in the rule at the bottom are left blank. Finally, the correspondence between antecedents and n-grams is shown at the bottom; the algorithm described below allows us to determine them.

Generally, many rule extraction techniques generate ordered rules, which means that rules are given in a sequential order, with two consecutive rules linked by an "else" statement. A long list of ordered rules involves many implicit antecedents, which makes interpretation difficult. Rules generated from the DIMLP subnetwork (M-H-O) are unordered. With this type of rules, the "else" statement is absent; thus, each rule is considered as a single piece of knowledge that can be examined in isolation [38].
Each rule antecedent related to the max-pooling layer is given as a < t or a ≥ t. Since a rule antecedent can be true with one or more n-grams in the input layer, it involves a disjunction of n-grams (one or more n-grams connected by a logical or). Nevertheless, for a given sample and with the use of the "Max" function in the M layer, a unique n-gram becomes dominant (the one with the highest activation). Before giving an algorithm that allows us to determine discriminative n-grams for a given sample, let us define several sets and variables:
• G: set of n-grams generated from a dataset;
• R: set of rules (with respect to layer M);
• A_ij: antecedent j of rule R_i;
• S_k: a sample, with H its set of n-grams;
• Γ: the resulting set of discriminative n-grams.

It is worth noting that with this CNN architecture the activation of neuron m_j in the M layer depends solely on kernel K_j in the C layer. Hence, given a sample S_k covered by rule R_i and its set of n-grams H, each discriminative n-gram related to antecedent A_ij is found by characterizing the n-gram (in H) activating neuron m_j the most.

Given R_i and S_k, the algorithm proceeds as follows. First, its result is stored in Γ, which is initialized as an empty set. Second, a main loop iterates over all rule antecedents. In step three, for each rule antecedent A_ij we determine the n-grams of S_k that make A_ij true. Then, for each antecedent A_ij the purpose is to determine the n-gram involving the highest activation of neuron m_j in the M layer (steps four to six).

For instance, with the antecedent f_113 ≥ 0.24 of a particular rule generated from an interpretable CNN and the sentence "a real movie, about real people, that gives us a rare glimpse into a culture most of us don't know", four trigrams make the antecedent true:
1. "culture most of" (0.251715)
2. "rare glimpse into" (0.257523)
3. "into a culture" (0.273310)
4. "us a rare" (0.311517)

The number after each trigram is the activation value of the 113th neuron in the M layer. The fourth trigram provides the highest activation; hence it is the detected trigram. Generally, it is possible to obtain a global view of each extracted rule by characterizing, for each antecedent, its winning n-grams with respect to all the covered samples.
For a given rule antecedent and a given sample, it is useful to determine the linear contribution of each winning n-gram g*_j with respect to the final classification. Specifically, this measure corresponds to the product of the filter activation in the M layer and the weight connecting this neuron to the output neuron of highest activation. Since the output neuron of strongest activation indicates the class, a positive value of this measure means that the n-gram contributes in favor of the classification, while a negative value counts against it. The linear contribution is given as:

L(g*_j) = act(g*_j) · v_k,

with act(g*_j) designating the activation of a neuron m_k in the M layer and v_k the weight between m_k and the output neuron of highest activation.
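A minimal sketch of this measure (Python; the numbers are illustrative, not taken from the paper):

```python
# Sketch of the linear contribution of a winning n-gram: the activation of
# its max-pooling neuron times the weight to the winning output neuron.
# Positive values support the predicted class, negative values oppose it.
def linear_contribution(act, v_k):
    return act * v_k

# Summing contributions over a tweet's winning n-grams (illustrative values):
contribs = {"too-bad-the": linear_contribution(0.31, 0.9),
            "style": linear_contribution(0.12, -0.4)}
total = sum(contribs.values())
print(total > 0)  # True
```

The sums reported under the figures of Section "Examples of Detected N-Grams from the Rules" (e.g. 0.211, 0.260) are exactly such totals over all winning n-grams of a tweet.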

Experimental Results
In this section, we first present general results on the accuracy of CNNs based on cross-validation. Secondly, Decision Trees (DTs) [39] are trained with CNN prediction classes instead of the true labels. Thirdly, CNNs are compared to Support Vector Machines (SVMs) [40]. Subsequently, we replace DIMLP subnetworks by DTs. Then, representative examples of rules extracted from CNNs are shown; they emphasize how discriminative n-grams intuitively explain "tweet" classifications. Finally, we illustrate with two examples how to inspect rules globally.

General Results
A CNN including a convolutional layer with 120 kernels of sizes one, two and three was defined empirically (cf. Section 3.1.3). Note that our purpose was not to find the best possible CNN architecture, but rather to create an acceptable CNN in terms of performance and then to generate rules to explain classifications. Training was performed with the Lasagne libraries, version 0.2 [41]. The training parameters were:
• learning rate: 0.02;
• momentum: 0.9;
• dropout: 0.2.

To illustrate the results of rule extraction, we applied the CNN architecture defined above to a well-known binary classification problem related to "tweets" of movie reviews [42]. The positive and negative subsets were divided into 10 folds, in order to carry out ten-fold cross-validation trials. Moreover, a randomly selected subset representing 10% of the training set was extracted as a tuning set for early stopping [43], which is useful to avoid over-training. Table 3 illustrates the results. The average predictive accuracy of the rules is slightly lower than that obtained by the CNN (73.7% versus 74.1%, for the best result). Moreover, average fidelity on the testing set is above 95%, meaning that rules explain CNN responses in a large majority of cases. Finally, the best average predictive accuracy of the rules when rules and model agree is higher than that obtained by the CNN (75.1% versus 74.1%).
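The fidelity measures used throughout this section can be illustrated with a short sketch (Python; the prediction vectors are toy values, not the paper's results):

```python
# Sketch of the fidelity measures of Table 3: fidelity is the fraction of
# samples where the extracted rules agree with the CNN; "accuracy when rules
# and model agree" restricts accuracy to that subset of samples.
def fidelity(rule_pred, model_pred):
    agree = [r == m for r, m in zip(rule_pred, model_pred)]
    return sum(agree) / len(agree)

def accuracy_when_agree(rule_pred, model_pred, labels):
    pairs = [(r, y) for r, m, y in zip(rule_pred, model_pred, labels) if r == m]
    return sum(r == y for r, y in pairs) / len(pairs)

rules = [1, 0, 1, 1, 0]   # class predicted by the extracted rules
model = [1, 0, 0, 1, 0]   # class predicted by the CNN
truth = [1, 0, 1, 0, 0]   # true labels
print(fidelity(rules, model))                    # 0.8
print(accuracy_when_agree(rules, model, truth))  # 0.75
```

This explains why the accuracy when rules and model agree (75.1%) can exceed the CNN's own accuracy: it is computed on the easier subset where both predictors coincide.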
As a baseline, a simple pedagogical technique aiming at explaining CNN predictions is represented by a DT that learns training samples with CNN prediction classes, instead of true labels. Table 4 illustrates the results obtained by cross-validation. The λ parameter, which is the minimum number of samples required at a leaf node, makes it possible to control the size of the trees. The number of nodes of a tree is in turn related to the proportion of learned training samples. It is worth noting that the fidelity on the training samples decreases when λ is increased. Moreover, the average predictive accuracy is never above 60%, the average fidelity being always below 63%. On one hand, this performance is substantially lower than that obtained by interpretable CNNs with DIMLP subnetworks (see Table 3). On the other hand, with λ ≥ 10, a lower number of rules is generated.

We may wonder about replacing the DIMLP subnetwork in interpretable CNNs by Decision Trees, from which rules can be extracted. Note that a decision tree can be viewed as a set of rules, with each rule represented by a path going from the root to a leaf. Table 5 illustrates the results with respect to the extracted rules, by varying the λ parameter. The best average predictive accuracy is equal to 68.6%, which is much lower than that obtained by CNNs with the DIMLP subnetwork (68.6% versus 73.7%). An intuitive reason explaining this result is that the fully connected layer of the original CNN is better approximated by DIMLPs than by DTs. However, DTs generate a significantly smaller number of rules, on average: 201.8 versus 573.8.

Table 5. Average results obtained by the rules generated from Decision Trees that replace the fully connected layer of CNNs. Columns from left to right represent average results on: training accuracy; testing accuracy; number of extracted rules; number of antecedents in the rules. In the rows, the λ parameter controlling the size of the trees varies.

Tr. Acc. | Tst. Acc. | #Rul. | #Ant.

Support Vector Machines are usually very competitive on highly dimensional classification problems. Here we have 17,700 (59 × 300) input variables per sample; thus, a natural question is whether SVMs perform better than CNNs. Due to the high input dimensionality, it is recommended to use linear SVMs. Table 6 presents the results obtained by varying the C parameter, which controls the proportion of misclassified training samples [44]. The best average predictive accuracy is less than that obtained by interpretable CNNs (70.6% versus 73.7%). A question arising is whether this difference is statistically significant.

A multiple statistical comparison test is performed to find out whether the average predictive accuracies are significantly different. With an ANOVA statistical test we aim at determining whether all the models used here obtain the same average predictive accuracy, against the alternative hypothesis that they do not. In other words, the null hypothesis states that the means are all equal. For this statistical test we define a significance level of 0.01, which involves a 1% risk of concluding that a difference exists when there is no actual difference. Not surprisingly, ANOVA rejects the null hypothesis that all model average predictive accuracies are equal (p-value = 3.1252 · 10^−59).
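This kind of comparison can be reproduced in spirit with `scipy.stats.f_oneway` (a sketch with illustrative per-fold accuracies, not the paper's actual fold results):

```python
from scipy.stats import f_oneway

# Sketch of a one-way ANOVA on per-fold test accuracies of two models;
# the fold values below are illustrative placeholders.
cnn_rules = [73.1, 74.0, 73.5, 74.2, 73.8, 73.3, 74.1, 73.6, 73.9, 73.5]
svm       = [70.2, 70.9, 70.4, 71.0, 70.6, 70.3, 70.8, 70.5, 70.7, 70.6]

stat, p = f_oneway(cnn_rules, svm)   # null hypothesis: equal means
print(p < 0.01)  # True
```

A p-value below the 0.01 significance level leads to rejecting the null hypothesis of equal mean accuracies, as in the comparison reported above.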
At this point it is within reach to compare a subset of the models, such as the SVMs achieving their highest average predictive accuracy and interpretable CNNs. These results are shown in Table 7. The small p-values imply that, with very high probability, the predictive accuracy of the rules generated from each interpretable CNN is significantly different from that measured on SVMs with C parameter equal to 0.01. Similar results are illustrated in Table 8, with respect to SVMs with C parameter equal to 0.005. Table 7. ANOVA comparison between CNNs and the SVMs providing the best average predictive accuracy (equal to 70.6%).

Examples of Detected N-Grams from the Rules
First, rules are expressed with antecedents given as filter responses. Then, n-grams are determined from the antecedents. Specifically, the truthfulness of an antecedent involves a long list of n-grams. Nevertheless, for a given sample and a given rule antecedent, only a unique n-gram "wins the competition" (cf. Section 3.3). Rules are ranked according to their support with respect to the training set, which corresponds to the number of covered samples. Finally, rules are not disjoint, which means that a sample can activate more than one rule.

Discriminative N-Grams Determined from R_26
We first illustrate rule number 26 (R_26), with support equal to 464 samples with respect to the training set and 47 samples with respect to the testing set. Here, m_i (i = 1, . . . , 120) designates neuron activations in the max-pooling layer. Indexes between one and 40 are related to single words, those between 41 and 80 correspond to bigrams, and those between 81 and 120 involve trigrams. The accuracy of this rule is 93.3% on the training set and 87.2% on the testing set.
The following figures depict examples of "tweets" belonging to the testing set and including the discriminative n-grams, ranked according to their linear contribution in the output layer (cf. Equation (12)). Figure 7 shows a correctly classified "tweet" related to R_26. Single words, bigrams and trigrams are illustrated vertically. A "*" or a "+" designates a repeated n-gram, since two different antecedents are able to code the same n-gram. In Figure 7, "too-bad-the" is a trigram, "too-bad" is a bigram and "style" is a single word. Note that "style" and "style-." (punctuation is also coded in word embeddings) are against the final classification. The most contributing n-gram is "too-bad-the", which appears twice. In this case, all n-grams containing "bad" contribute to the negative polarity, which is also in agreement with our common sense.

Figure 8 illustrates another example in which the most contributing n-gram to the final classification is a trigram: "mediocre-special-effects". Note that this is also in accordance with our perception. Bigram "worst-of" is the second most contributing n-gram. Surprisingly, the single word "worst" is almost neutral with respect to the final classification.

Figure 9 shows that the most discriminative n-gram is the trigram "so-mind_numbingly-awful", followed by the bigram "so-mind_numbingly". Note also that three n-grams are against the final classification. Among them, the single word "as" is the strongest element in contradiction with the final classification.

Figure 9. N-grams determined from R_26 and the following "tweet": "so mind-numbingly awful that you hope britney won't do it one more time, as far as movies are concerned". The sum of the n-grams' linear contributions is 0.211.
As a last example for rule R_26, we can see in Figure 10 that the trigram "just-bad-;" and the bigram "just-bad" are the most contributing n-grams. Again, "as" is the strongest element in favor of the positive polarity (thus, in contradiction with the "tweet" classification).

Discriminative N-Grams Determined from R_24
As a second rule, we illustrate R_24. Its accuracy is 93.4% on the training set, with support equal to 469 samples, and 98.0% on the testing set, with support equal to 51 samples. Figure 11 depicts a covered "tweet" in the testing set. The most important n-gram is the trigram "has-a-subtle", which is evoked by two different antecedents. Moreover, n-grams related to "it's over" are associated with negative polarities.

Figure 11. N-grams determined from R_24 and the following "tweet": "it has a subtle way of getting under your skin and sticking with you long after it's over". The sum of the n-grams' linear contributions is 0.260.
In Figure 12, we illustrate a "tweet" that is more difficult to classify, since it starts in a rather negative manner and then becomes strongly positive. As a consequence, the CNN detects a considerable number of negative elements. Four n-grams are not in favor of the correct classification, two of them including "not". Overall, the contribution of the positive parts is stronger, with the strongest trigram related to two different antecedents ("offers-gorgeous-imagery").

Figure 13 illustrates a short "tweet". Two trigrams appear twice and, surprisingly, the single word "entertaining" contributes negatively to the final class. In Figure 14 the strongest n-gram is "beautifully-accomplished-lyrical", which is very complimentary. Curiously, the bigram "beautifully-accomplished" is slightly negative in its contribution to the final class. This would indicate a defect in the classifier, as would "entertaining" in the previous "tweet". An explanation for the negative connotation of this single word could be that it appears 31 times in the dataset of negative "tweets".

Finally, Figure 15 illustrates a testing case which is wrongly classified by the CNN, but correctly classified by R_24. Here, "is-a-feast" is the only element that contributes positively to the classification.

Figure 15. N-grams determined from R_24 and the following "tweet" (in the testing set): "some movies are like a tasty hors-d'oeuvre; this one is a feast". It is correctly classified by the rule, but wrongly classified by the network.

Global View of Rules
A rule antecedent can be true with several different n-grams. Hence, to analyze a rule as a whole we must detect its related n-grams for each antecedent and for all covered samples. Note that a rule with antecedents depending on neuron activations in the M-layer can be formulated in the input layer as: "if one of the n-grams related to the first antecedent is present and maximal, and . . . , and one of the n-grams related to the last antecedent is present and maximal, then . . .". Here, "maximal" means that only the n-gram that most strongly activates its related neuron in the M-layer makes the antecedent true.
Since we generate unordered rules, each rule can be viewed as a single classifier that can be analyzed without considering the other rules. The greater the number of rule antecedents and covered samples, the longer this examination takes; however, in principle it is always possible. Specifically, by looking at the n-grams related to each antecedent, it is possible to understand in which context a rule applies.
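The rule evaluation described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration: the function and variable names, as well as the toy activation table, are our own assumptions, whereas in the actual system the activations come from the CNN's convolutional and max-pooling layers.

```python
def antecedent_true(sample_ngrams, activation, neuron, related_ngrams):
    """An antecedent is true when the n-gram that maximally activates
    its M-layer neuron over this sample belongs to the antecedent's
    set of related n-grams."""
    winner = max(sample_ngrams, key=lambda g: activation(g, neuron))
    return winner in related_ngrams

def rule_covers(sample_ngrams, activation, antecedents):
    """An unordered rule is a conjunction over all its antecedents."""
    return all(antecedent_true(sample_ngrams, activation, neuron, related)
               for neuron, related in antecedents)

# Toy activations: (n-gram, M-layer neuron) -> activation value.
act = {("has-a-subtle", 0): 0.8, ("it's-over", 0): 0.2,
       ("has-a-subtle", 1): 0.1, ("it's-over", 1): 0.6}

def activation(ngram, neuron):
    return act.get((ngram, neuron), 0.0)

sample = ["has-a-subtle", "it's-over"]
antecedents = [(0, {"has-a-subtle"}), (1, {"it's-over"})]
print(rule_covers(sample, activation, antecedents))  # prints True
```

Because the rule is a plain conjunction, it can be checked sample by sample and antecedent by antecedent, which is what makes the global analysis of a rule tractable.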

Discussion
DTs were not as good as DIMLPs at approximating the fully connected layer at the top of the original CNN. However, more rules were generated from DIMLPs; thus, the better approximation comes at the cost of more complex rulesets. With DIMLPs, it would be feasible to reduce the size of the extracted rulesets by using reduced training sets, as was done in [22].
The linear contribution of n-grams is a measure that allows us to characterize the relative importance of discriminative combinations of words. On the one hand, it helps to characterize flaws in the classifier, such as the detection of n-grams that contribute to an opposite class. On the other hand, it enables us to understand which word combinations are used correctly. The contribution of n-grams is easily calculated when there is only a single fully connected layer, but how to calculate it with a greater number of fully connected layers remains an open question. A solution to this problem could be the use of a feature importance value, as described in the framework reported in [45].
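With a single fully connected layer, the class score is a weighted sum of the max-pooled features (one per convolutional filter, i.e., per detected n-gram), so each n-gram's linear contribution is simply its activation times the corresponding class weight. The following sketch illustrates this computation; the weight values, array shapes, and function name are hypothetical and chosen only for illustration.

```python
import numpy as np

def ngram_contributions(max_pool, W, target_class):
    # Element-wise products: each entry is one n-gram's linear
    # contribution; their sum (plus the bias) gives the class score.
    return max_pool * W[target_class]

# Toy setting: 4 convolutional filters, 2 classes (0 = neg, 1 = pos),
# with W of shape (n_classes, n_filters).
W = np.array([[-0.4,  0.3, -0.1, 0.2],
              [ 0.5, -0.2,  0.4, 0.1]])
max_pool = np.array([0.9, 0.1, 0.5, 0.3])

contrib = ngram_contributions(max_pool, W, target_class=1)
print(contrib)        # per-n-gram contributions to the positive class
print(contrib.sum())  # total linear contribution to the class score
```

Negative entries in `contrib` correspond to n-grams that pull against the target class, which is exactly how the defective n-grams discussed above (e.g., "entertaining") can be spotted.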
To characterize a rule as a whole, it is sufficient to determine all discriminative n-grams over all covered samples. From the linear contributions, it is possible to identify the n-grams that go against the rule class. Flaws can thus be detected, and a corrective strategy could possibly be developed by adding training examples involving the problematic n-grams.
The CNN architecture used in this work is simple, as it includes a single convolutional layer. We might wonder whether our rule extraction technique could be adapted to more complex architectures used in object recognition, such as LeNet [46], AlexNet [47], and VGGNet [48]. In a similar manner to what has been achieved here, we would extract unordered rules at the level of the first fully connected layer. Given a sample and a rule activated by this sample, it would be conceivable for each rule antecedent to determine one or more image regions that contribute positively or negatively to the final classification. The extent of these regions would be characterized by going back to the input layer, after passing through an arbitrary number of convolutional and max-pooling layers.

Conclusions
We presented a new rule extraction technique applied to a CNN in sentiment analysis. Our approach is global and could be extended to object recognition problems encompassing a moderate number of convolutional layers. Our rule extraction method consisted of approximating a trained CNN by transforming its top layers into a transparent neural network model. Thus, rules generated from the transparent subnetwork were propagated backward to the input layer and became comprehensible, as their antecedents represent n-grams. The rules can be inspected globally, by determining for each antecedent all the related n-grams, or locally, by characterizing for a given sample the discriminative n-grams that contribute to the final classification. These n-grams allowed us to explain, with several examples, why the classifier worked well or poorly. In the future, it will be interesting to determine how to correct flaws in a neural network with the help of the extracted rules. One way could be to inject supplementary training examples aimed at modifying the linear contributions of discriminative n-grams.
Funding: This research received no external funding.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: