Pat-in-the-Loop: Declarative Knowledge for Controlling Neural Networks

Abstract: The dazzling success of neural networks in natural language processing systems is imposing an urgent need to control their behavior with simpler, more direct declarative rules. In this paper, we propose Pat-in-the-Loop as a model to control a specific class of syntax-oriented neural networks by adding declarative rules. In Pat-in-the-Loop, distributed tree encoders make it possible to exploit parse trees in neural networks, heat parse trees visualize the activation of parse trees, and parse subtrees are used as declarative rules in the neural network. Hence, Pat-in-the-Loop is a model to include human control in specific natural language processing (NLP)-neural network (NN) systems that exploit syntactic information; we will generically call the human in the loop Pat. A pilot study on question classification showed that declarative rules representing human knowledge, injected by Pat, can be effectively used in these neural networks to ensure correctness, relevance, and cost-effectiveness.


Introduction
Neural networks are obtaining dazzling successes in natural language processing (NLP). General neural networks trained on terabytes of data are replacing decades of scientific investigation by showing unprecedented performance in a variety of NLP tasks [1]. Hence, systems based on NLP and on neural networks (NLP-NN) are everywhere.
As a consequence of their success, public opinion is extremely quick to spot possibly catastrophic, unwanted behavior in deployed NLP-NN systems (see, for example, [2,3]). Like many learned systems [4,5], NLP-NN systems are also exposed to biased decisions or biased production of utterances. This problem is becoming so important that extensive analyses are performed, for example, for the tricky class of systems for sentiment analysis [6].
To recover promptly from catastrophic failures, NLP-NN systems should be endowed with the possibility of modifying their behavior through declarative languages. Deductive teaching is an extremely difficult task even in the human learning process [7,8]. Active learning techniques [9,10] can require too many examples and may focus the attention of NLP-NN systems on irrelevant peculiarities of datasets [11]. Usually we do not have the time or budget for human input on every data point, and so we need strategies for deciding which data points are the most important for human review. Because human-generated activity data are costly to obtain, solutions that need only a very limited number of supervised examples, such as few-shot or one-shot learning [12,13], could be used. However, the core issue of these techniques is the unreliable empirical risk minimizer, which makes them hard to learn. Understanding this core issue helps to categorize the different works. In Pat-in-the-Loop, distributed tree encoders make it possible to produce heat parse trees, and developers can explore the activation of parse trees for specific decisions to derive rules for correcting system behavior.
In this work, we performed a pilot study on question classification where Pat-in-the-Loop showed that human knowledge can be effectively used to control the behavior of a syntactic NLP-NN system.
In the next section (Section 2), we report related work on the visualization of neural network models. A description of how Pat-in-the-Loop works follows (Section 3) and, finally, we show the improvements achieved by the proposed model (Section 4).

Related Work
In recent years, with the advent of neural networks, many methods to visualize neural networks have been developed. The most common method to display a neural network is a node-link graph, where nodes depict computational units and weighted edges indicate input-output connections between these nodes. Generally, for ease of understanding and to engage the user, the magnitude of a parameter or activation is displayed using different colors and sizes for the edges.
For example, ActiVis [19] offers a view of neuron activations and can be used to view interactive model interpretations of large heterogeneous data formats such as images and text.
ActiVis closely integrates multiple coordinated views, such as a computation graph of the model architecture and a neuron activation view for model discovery and comparison, so that users can explore complex deep neural network models at both the instance and the subset level. Although it is an advanced system, ActiVis does not support recurrent architectures, a common type of architecture in natural language tasks.
To this end, Ming et al. [20] and Strobelt et al. [21] proposed dedicated visualizers for recurrent neural networks (RNNviz) and long short-term memory networks (LSTMviz), respectively, that are able to inspect the dynamics of the hidden state. The ultimate purpose is to show the functions of hidden state units and explain them using their expected response to input texts, i.e., words. This allows users to gain a more complete understanding of, and greater confidence in, the hidden RNN and LSTM mechanisms through various visual techniques.
Recently, with the advent of transformer models [22], a lot of work has been done to interpret the activations of attention heads [23,24,25]. The multi-layered, multi-headed attention mechanisms of the Transformer model can be difficult to decipher. To make the model more accessible, many researchers have worked on open-source tools that visualize attention at multiple scales, each of which provides a unique perspective on the attention mechanism. All these Transformer visualizers make it possible to view the magnitude of softmax attention heads in relation to input tokens in order to interpret the model's decisions. We selected BERTviz [23] as the representative of this category of transformer visualizers.
Embedding Projector [26] is an interactive tool for visualizing and interpreting embeddings. This tool uses different dimensionality reduction techniques to map high-dimensional embedding vectors into low-dimensional output vectors that are easier to visualize. It can be used to analyze the geometry of words and explore the embedding space, although it cannot be used directly to explain a neural network model.
The following table (Table 1) shows a sample of the most common types of visualization tools for neural networks in the context of natural language processing. From Table 1, we can observe the basic characteristics offered by the above-mentioned works. The features that they all share are: the target audience, i.e., Developer-Friendly; the time when these systems can be used, i.e., After Training; and the purpose of the systems themselves, i.e., improving Interpretability & Explainability. The distinguishing features offered by our system are: the display and easy choice of the underlying model to use, i.e., Algorithm Attribution & Features Visualization, and the ability to manipulate the model itself to improve it in a very simple way.
In addition to visualization, we propose Pat-in-the-Loop as a model to include human control in specific NLP-NN systems that exploit syntactic information. Our system displays heat parse trees, which are a handy way to represent the contributions of syntactic nodes in a neural network directly on syntactic trees, and offers a declarative language for controlling the behavior of neural networks. The following section describes in detail how it works.

The Model
In Pat-in-the-Loop (see Figure 2), a generic developer, whom we call Pat, may inspect the reasons why her/his neural network takes some decisions. Pat's neural network model is based on distributed tree encoders $W_{dt}$ to directly exploit parse trees in neural networks (Section 3.2). Pat can visualize why some decisions are taken by the network according to the parse trees of examples $x$ by using "heat parse trees" (Sections 3.1 and 3.3). Hence, Pat can control the behavior of neural networks with declarative rules represented as subtrees by encoding these rules in $W_H$ (Section 3.4). In other words, the key idea of the Pat-in-the-Loop model is to use "heat parse trees" to analyze which parts of parse trees are responsible for the activation of specific neurons (Section 3.3) and then to control the behavior of neural networks with declarative rules derived from the analysis of these heat parse trees (Section 3.4). This is a loop (see the red arrow in Figure 2) where Pat analyzes the output of the neural network (NN). The red block, the declarative rule embedder, is a special module that allows Pat to encode declarative rules. These rules, which are embedded in special vectors (see Section 3.4), will affect the decisions of the neural network by modifying its behavior during training.
Before describing the core components of the Pat-in-the-Loop model, Section 3.1 introduces some preliminary notation and the notion of heat parse trees. Section 3.2 then presents the foundations of the proposed system. We close with the visualization (Section 3.3) and the additional layer (Section 3.4).

Preliminary Notation
Parse trees and heat parse trees are core representations in our model. This section introduces the notation to describe these two representations.
Parse trees $T$ and parse subtrees $\tau$ are recursively represented as trees $t = (r, [t_1, \ldots, t_k])$ where $r$ is the label representing the root of the tree and $[t_1, \ldots, t_k]$ is the list of child trees $t_i$. Leaves $t$ are represented as trees $t = (r, [])$ with an empty list of children or directly as $t = r$.
Heat parse trees, similarly to "heat trees" in biology [27], are heatmaps over parse trees (see Figure 1). The underlying representation is an active tree $t$, that is, a tree where an activation value $v_r \in \mathbb{R}$ is associated with each node: $t = (r, v_r, [t_1, \ldots, t_k])$. Heat parse trees are then the graphical visualization of active trees $t$ where colors and sizes of nodes $r$ depend on their activation values $v_r$.
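As an illustration of the notation above, the following minimal Python sketch represents trees and active trees as small data classes. The class names `Tree` and `ActiveTree`, the helpers `to_bracketed` and `activate`, and the toy activation function are assumptions made for this example; they are not part of the original system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tree:
    """A tree t = (r, [t_1, ..., t_k]); a leaf has an empty child list."""
    label: str
    children: List["Tree"] = field(default_factory=list)

    def size(self) -> int:
        """Number of nodes in the tree."""
        return 1 + sum(child.size() for child in self.children)


@dataclass
class ActiveTree:
    """An active tree t = (r, v_r, [t_1, ..., t_k])."""
    label: str
    value: float
    children: List["ActiveTree"] = field(default_factory=list)


def to_bracketed(t: Tree) -> str:
    """Render a tree in the usual bracketed notation."""
    if not t.children:
        return t.label
    inner = " ".join(to_bracketed(child) for child in t.children)
    return "(" + t.label + " " + inner + ")"


def activate(t: Tree, value_of: Callable[[Tree], float]) -> ActiveTree:
    """Attach an activation value to every node (value_of is hypothetical)."""
    return ActiveTree(t.label, value_of(t),
                      [activate(child, value_of) for child in t.children])


# The subtree (SQ,[(VBD,[did]),NP,VP]) used as an example in the text.
example = Tree("SQ", [Tree("VBD", [Tree("did")]), Tree("NP"), Tree("VP")])
# Toy activation: the size of the subtree rooted at each node.
heat = activate(example, lambda node: float(node.size()))
```

A renderer for heat parse trees would then map each `value` to a node color and size; that drawing step is left out here.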

Distributed Tree Encoders for Exploiting Parse Trees in Neural Networks
Distributed tree encoders are the encoders used in Pat-in-the-Loop to directly exploit parse trees in neural networks. These encoders, stemming from tree kernels [28] and distributed tree kernels [29], make it possible to represent parse trees in vector spaces $\mathbb{R}^d$ that embed huge spaces of subtrees $\mathbb{R}^n$.
Tree kernels [28] have offered an important opportunity to fully exploit parse trees in learning with kernel machines [30,31]. Tree kernels are functions that implicitly compute the similarity among parse trees $T$ mapped to vectors $x_T \in \mathbb{R}^n$ whose dimensions are subtrees $\tau$. For example, the 52629-th dimension of $x_T \in \mathbb{R}^n$ can represent the subtree $\tau^{(52629)} =$ (SQ,[(VBD,[did]),NP,VP]) (see Table 2). Vectors $x_T$ for parse trees $T$ generally have:
$$x_T^{(i)} = \begin{cases} \sqrt{\lambda^{|\tau^{(i)}|}} & \text{if } \tau^{(i)} \in S(T) \\ 0 & \text{otherwise} \end{cases}$$
where $S(T)$ is the set of valid subtrees of $T$, $0 < \lambda < 1$ is a decay factor penalizing large subtrees, and $|\tau^{(i)}|$ is the size of the node set of $\tau^{(i)}$. Valid subtrees $\tau \in S(T)$ in [28] are connected subtrees of $T$ with at least two nodes and, if $\tau$ contains a node $c$, it should contain all the siblings of the node $c$ in $T$. For example, $x_{T_e}^{(52629)} = \lambda^{5/2}$ for the parse tree $T_e$ in Figure 1, since $\tau^{(52629)}$ is a valid subtree of $T_e$. The power of these tree kernels is that parse trees are never explicitly represented as vectors $x_T$; instead, the tree kernel functions implicitly compute their dot product. Distributed tree kernels [29] can bring the opportunity given by tree kernels [28] to neural networks, since distributed tree kernels implicitly embed vectors $x_T \in \mathbb{R}^n$ into a reduced space $\mathbb{R}^d$ in the context of support vector machines. Distributed tree kernels build on the Johnson-Lindenstrauss Transformation [32] and holographic reduced representations (HRR) [33].
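To make the subtree space concrete, the sketch below enumerates the valid subtrees of a small tree explicitly and assigns each one the weight $\sqrt{\lambda^{|\tau|}}$, which agrees with the $\lambda^{5/2}$ value quoted for the five-node example. It is only an illustration: real tree kernels never build this vector, the helper names are hypothetical, and the toy tree is just the example subtree rather than the full parse tree of Figure 1.

```python
import math
from itertools import product


def node(label, *children):
    """Build a tree t = (r, [t_1, ..., t_k]) as a hashable nested tuple."""
    return (label, tuple(children))


def size(t):
    """Number of nodes of a tree."""
    return 1 + sum(size(child) for child in t[1])


def fragments_rooted_at(t):
    """Valid subtrees rooted at the root of t.

    A fragment keeps all children of a node or none of them, and each kept
    child is either a frontier node or is expanded recursively.
    """
    label, children = t
    if not children:
        return []  # a single node is not a valid subtree
    options = []
    for child in children:
        options.append([node(child[0])] + fragments_rooted_at(child))
    return [(label, combo) for combo in product(*options)]


def valid_subtrees(t):
    """The set S(T): fragments rooted at every node of t."""
    found = set(fragments_rooted_at(t))
    for child in t[1]:
        found |= valid_subtrees(child)
    return found


def explicit_vector(t, lam):
    """Sparse x_T as a dict: subtree -> sqrt(lam ** |subtree|)."""
    return {tau: math.sqrt(lam ** size(tau)) for tau in valid_subtrees(t)}


# Toy tree (assumption): only the example subtree from the text.
T = node("SQ", node("VBD", node("did")), node("NP"), node("VP"))
x_T = explicit_vector(T, lam=0.6)
```

For this toy tree the set $S(T)$ has three elements: (VBD,[did]), (SQ,[VBD,NP,VP]), and (SQ,[(VBD,[did]),NP,VP]). The number of fragments grows very quickly with tree size, which is why the implicit computation matters.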
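Building on distributed tree kernels, we propose distributed tree encoders that may be seen as linear transformations $W_{dt} \in \mathbb{R}^{d \times n}$ (similarly to the Johnson-Lindenstrauss Transformation [32]). These linear transformations embed vectors $x_T \in \mathbb{R}^n$ in the space of tree kernels into smaller vectors $y_T \in \mathbb{R}^d$:
$$y_T = W_{dt} x_T$$
Columns $w_i$ of $W_{dt}$ encode subtrees $\tau^{(i)}$ and are computed with an encoding function $w_i = E(\tau^{(i)})$ as follows:
$$E(t) = \begin{cases} \mathbf{r} & \text{if } t = (r, []) \\ \mathbf{r} \otimes E(t_1) \otimes \ldots \otimes E(t_k) & \text{if } t = (r, [t_1, \ldots, t_k]) \end{cases} \tag{1}$$
where the operation $u \otimes v$ is the shuffled circular convolution, that is, a circular convolution $*$ (as for HRR [33]) with a permutation matrix $\Phi$: $u \otimes v = u * \Phi v$; and $\mathbf{r} \sim N(0, \frac{1}{\sqrt{d}} I)$ is drawn from a multivariate Gaussian distribution.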
As for tree kernels, in distributed tree encoders the linear transformations $W_{dt}$ and the vectors $x_T \in \mathbb{R}^n$ are never explicitly produced, and encoders are implemented as recursive functions [29].
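A minimal NumPy sketch of such a recursive encoder is given below. It is not the authors' implementation and rests on several assumptions: the class name `TreeEncoder` is hypothetical, $1/\sqrt{d}$ is read as the per-component standard deviation of label vectors, the permutation $\Phi$ is a random index shuffle, and the chain of $\otimes$ operations is nested to the right. With circular convolution, a left-nested chain would not distinguish the order of children, which motivates the right nesting chosen here.

```python
import numpy as np


class TreeEncoder:
    """Toy encoder E(tau) based on shuffled circular convolution."""

    def __init__(self, d, seed=0):
        self.d = d
        self.rng = np.random.default_rng(seed)
        self.perm = self.rng.permutation(d)  # plays the role of Phi
        self.label_vectors = {}

    def label_vector(self, label):
        # Assumption: std 1/sqrt(d) per component, so norms are close to 1.
        if label not in self.label_vectors:
            self.label_vectors[label] = self.rng.normal(
                0.0, 1.0 / np.sqrt(self.d), self.d)
        return self.label_vectors[label]

    def sconv(self, u, v):
        # u (x) v = u * (Phi v), with * the circular convolution via FFT.
        spectrum = np.fft.rfft(u) * np.fft.rfft(v[self.perm])
        return np.fft.irfft(spectrum, n=self.d)

    def encode(self, tree):
        """Encode a tree given as (label, (child_1, ..., child_k))."""
        label, children = tree
        vec = self.label_vector(label)
        if not children:
            return vec
        # Assumption: right-nested chain r (x) (E(t_1) (x) (... (x) E(t_k))).
        acc = self.encode(children[-1])
        for child in reversed(children[:-1]):
            acc = self.sconv(self.encode(child), acc)
        return self.sconv(vec, acc)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


enc = TreeEncoder(d=4096, seed=1)
t1 = ("A", (("B", ()), ("C", ())))
t2 = ("A", (("C", ()), ("B", ())))  # same labels, children swapped
e1, e2 = enc.encode(t1), enc.encode(t2)
```

In this sketch the two trees `t1` and `t2` receive nearly orthogonal codes, so the encoding is sensitive to child order, while encoding the same tree twice returns the same vector.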

Visualizing Activation of Parse Trees
Distributed tree encoders give the possibility of using heat parse trees to visualize the activation of parse trees in final decisions or intermediate neuron outputs.
To compute the active trees $t$ needed to produce heat parse trees, a neural network should be sliced at the desired layer. Let NN be the sliced neural network, $x = (x_T, x_r)$ its input, and $o$ its output:
$$o = NN(W_{dt} x_T, x_r)$$
where, given an example $x$, $x_T$ is the vector representing the tree $T$ in the space of subtrees related to the example $x$, $W_{dt}$ is the distributed tree encoder, and $x_r$ is the rest of the features associated with $x$.
Using parse trees $T$ in neural networks is straightforward with distributed trees. In fact, distributed trees $y_T = W_{dt} x_T$ for parse trees $T$ may be directly used in neural networks, as these distributed trees are vectors.
Our heat parse trees show the overlap of the activations of subtrees in $S(T)$ of a specific tree $T$ related to a specific example $x$ in a specific network. This shows how subtrees in $S(T)$ contribute to the final activation $o_i$, that is, a dimension of $o$. We believe this is more convenient than representing an extremely large heatmap for the list of subtrees in $S(T)$ and their related values $o_i$ (see Table 2).
The computation of active trees $t$ for displaying heat parse trees is the following. The activation value $v_r$ of each node $r$ represents how much the node is responsible for the activation of the overall syntactic tree for the output $o_i$ of the given neuron. The activation value $v_r$ is then computed as follows:
$$v_r = \sum_{\tau \in S(T)} \delta(r \in \tau) \, NN(W_{dt} \vec{\tau}, x_r)_i$$
where $\vec{\tau}$ is the one-hot vector in the subtree space that indicates the subtree $\tau$ and $\delta(r \in \tau)$ detects whether $r$ is a node in $\tau$.
With the above computation of $t$, active subtrees $\tau$ for the output $o_i$ of a specific neuron are overlapped in a single heat parse tree.
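One way to read this overlap is sketched below in Python: every valid subtree is scored once, and its score is added to each node it covers. The `score` callback is a hypothetical stand-in for the output $o_i$ of the sliced network when it is fed a single subtree; nodes are identified by their path from the root, which is a choice made for this example.

```python
from itertools import product


def fragments_rooted_at(t, path=()):
    """Valid subtrees rooted at `path`, each with the set of nodes it covers."""
    label, children = t
    if not children:
        return []
    options = []
    for i, child in enumerate(children):
        child_path = path + (i,)
        # Either stop at the child (frontier node) or expand it further.
        stop_here = ((child[0], ()), frozenset([child_path]))
        options.append([stop_here] + fragments_rooted_at(child, child_path))
    fragments = []
    for combo in product(*options):
        fragment = (label, tuple(f for f, _ in combo))
        covered = frozenset([path]).union(*(nodes for _, nodes in combo))
        fragments.append((fragment, covered))
    return fragments


def all_fragments(t, path=()):
    """All valid subtrees of t, rooted at any node."""
    found = fragments_rooted_at(t, path)
    for i, child in enumerate(t[1]):
        found = found + all_fragments(child, path + (i,))
    return found


def node_activations(t, score):
    """v_r for each node path: sum of scores of the subtrees containing it."""
    values = {}
    for fragment, covered in all_fragments(t):
        s = score(fragment)
        for node_path in covered:
            values[node_path] = values.get(node_path, 0.0) + s
    return values


# Toy tree (assumption) and a constant score, so v_r just counts subtrees.
T = ("SQ", (("VBD", (("did", ()),)), ("NP", ()), ("VP", ())))
counts = node_activations(T, score=lambda fragment: 1.0)
```

With a constant score the VBD node, at path `(0,)`, obtains the largest value because it belongs to all three valid subtrees of the toy tree; with a real network the scores would differ per subtree.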
The activation value can also be calculated in other ways, for example using Layer-wise Relevance Propagation (LRP) [34]. LRP is a framework that explains the decisions of a generic neural network using local redistribution rules and identifies which input features contributed most to the final classification; with it, the activation values $v_r$ in an active tree $t$ can be computed. Unfortunately, this method does not allow the network to be split at the desired layer, so it has not been taken into account.

Human-in-the-Loop Layer
Pat now has an important possibility of understanding why decisions are taken by a specific network and, hence, s/he can define specific rules to control the behavior of the neural network. By looking at the activation of specific neurons for specific examples, Pat can understand why a decision has been made. For example, the heat parse tree in Figure 1 suggests that the subtree (SQ,[VBD,NP,VP]) is the most active in generating the decision, if the tree is drawn for the output of a neuron that represents a final class.
If Pat aims to correct the behavior of the system for a given output, s/he selects the specific instances, derives some declarative rules, and embeds these rules into the network to control its behavior. More specifically, Pat selects a subtree $\tau$ and inserts $E(\tau)$ as a row in the matrix $W_H$ that embeds declarative rules (see Figure 2). This specific rule will affect the decisions made by the network on the example under review and on all similar examples when the neural network is re-trained after rule injection in $W_H$.
The actual procedure to build the matrix $W_H$ is the following. Let us say that Pat aims to capture $k$ different groups of characteristics s/he assumes to be important to control the behavior of the neural network. For each group $i$, s/he selects a set $S_i$ of subtrees $\tau^{(i)}$ corresponding to the $i$-th characteristic. The matrix $W_H$ is then the following:
$$W_H = \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix}$$
where $w_i = \sum_{\tau^{(i)} \in S_i} E(\tau^{(i)})$ and $E(\tau^{(i)})$ is specified in Equation (1). Hence, the matrix $W_H$ is the editable component of the overall system, and the procedure to build it offers an actionable way for external agents, that is, Pats, to interact with this neural network-based system. The matrix $W_H$ allows external agents to manipulate the behavior of the neural network by encoding rules that capture the characteristics they consider relevant for a specific task.
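The construction of the rule matrix can be sketched as follows, under the assumption that each row is the sum of the encodings of the subtrees in one group. The function `build_rule_matrix` and the toy encoder are hypothetical: the toy encoder simply assigns a random vector to each distinct rule string and stands in for the encoding function $E$. The grouping of the two example subtrees, both taken from the text, is also purely illustrative.

```python
import numpy as np


def build_rule_matrix(rule_groups, encode, d):
    """Stack one row per group: the sum of the encodings of its subtrees."""
    W_H = np.zeros((len(rule_groups), d))
    for i, group in enumerate(rule_groups):
        for subtree in group:
            W_H[i] += encode(subtree)
    return W_H


def make_toy_encoder(d, seed=0):
    """Hypothetical stand-in for E: one random vector per distinct rule."""
    rng = np.random.default_rng(seed)
    cache = {}

    def encode(subtree):
        if subtree not in cache:
            cache[subtree] = rng.normal(0.0, 1.0 / np.sqrt(d), d)
        return cache[subtree]

    return encode


d = 2048
encode = make_toy_encoder(d)
# Illustrative groups only; these are not the rule sets of the experiment.
group_a = ["(SQ,[VBD,NP,VP])", "(VBD,[did])"]
group_b = ["(NP (NP (NNP)(NNP)(POS))(NN))"]
W_H = build_rule_matrix([group_a, group_b], encode, d)

# A vector containing a rule of group_a activates the first row only.
scores = W_H @ encode(group_a[0])
```

Because random high-dimensional vectors are nearly orthogonal, `scores[0]` is close to 1 and `scores[1]` is close to 0, which is the sense in which a row of $W_H$ detects the presence of its rules in a distributed tree.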

Pilot Experiment
We experimented with Pat-in-the-Loop by using a question classification dataset [35]. The task is to classify questions into categories based on the type of answer they expect, such as a numerical answer, a text description, a place, or a human name. The dataset is extremely well studied, and systems achieve very high performance even though the dataset is extremely small. Hence, the dataset offers a very intriguing possibility to run a complex experiment where a human in the loop can make the difference in calibrating the overall system.

Experimental Set-Up
We experimented with the Question Classification dataset [35], which contains 5242 training questions and 500 testing questions. We focused on the coarse-grained classification problem with 6 target classes: Abbreviation (ABBR), Description (DESC), Entity (ENTY), Human (HUM), Location (LOC), and Numeric (NUM).
The Pat-in-the-Loop system (see Figure 2) used in the experiments has the following configuration. Distributed trees $W_{dt} x_T$ are encoded in a space $\mathbb{R}^d$ with $d = 4000$. The decay factor of tree kernels is $\lambda = 0.6$. The module $NN(W_{dt} x_T, x_r)$ is a multi-layer perceptron that combines two multi-layer perceptrons: $Synt(W_{dt} x_T)$ and $Sem(x_r)$. Synt exploits syntactic information and has an output of size 1800. Sem exploits a bag-of-words model of the input, with fastText [36] word embeddings of size 300 as input and an output of size 180. The outputs of Synt and Sem are concatenated and fed to a multi-layer perceptron with two layers of sizes 100 and 6. Finally, $W_H$ has an input dimension of $d = 4000$ and an output dimension of 6, the number of output categories in the question classification dataset [35]. In this case, we opted for a $W_H$ that encodes 6 different characteristics, each linked to an output class. The output of $W_H$ and the output of $NN(W_{dt} x_T, x_r)$ are then concatenated into a single vector that feeds a final linear layer. We used ReLU activation functions between layers. The last activation function is a softmax. The optimizer is Adam [37]. All experiments were run for 20 epochs in Keras [38]. Finally, we used the CoreNLP constituent-based parser [39] to parse the questions.
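The data flow of this configuration can be checked with a small NumPy forward pass. This is a shape sketch with random weights, not the Keras model used in the experiments. Several details are assumptions: Synt and Sem are each reduced to a single dense layer, a ReLU is applied after the 6-unit layer of NN, the final linear layer is given 6 outputs, and $W_H$ is filled with random values instead of encoded rules.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TREE, D_WORD, N_CLASSES = 4000, 300, 6


def dense(n_in, n_out):
    """Random weights and zero bias for one fully connected layer."""
    weights = 0.01 * rng.standard_normal((n_in, n_out), dtype=np.float32)
    return weights, np.zeros(n_out, dtype=np.float32)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


synt = dense(D_TREE, 1800)            # Synt: syntactic branch
sem = dense(D_WORD, 180)              # Sem: bag-of-words branch
hidden = dense(1800 + 180, 100)       # first layer after concatenation
nn_out = dense(100, N_CLASSES)        # second layer, output of NN
W_H = 0.01 * rng.standard_normal((N_CLASSES, D_TREE), dtype=np.float32)
final = dense(2 * N_CLASSES, N_CLASSES)  # assumed size of the final layer


def forward(y_T, x_r):
    """y_T: distributed trees (batch, 4000); x_r: word features (batch, 300)."""
    s = relu(y_T @ synt[0] + synt[1])
    m = relu(x_r @ sem[0] + sem[1])
    h = relu(np.concatenate([s, m], axis=-1) @ hidden[0] + hidden[1])
    o = relu(h @ nn_out[0] + nn_out[1])
    rules = y_T @ W_H.T               # human-in-the-loop layer
    z = np.concatenate([o, rules], axis=-1) @ final[0] + final[1]
    return softmax(z)


batch = 2
y_T = rng.standard_normal((batch, D_TREE), dtype=np.float32)
x_r = rng.standard_normal((batch, D_WORD), dtype=np.float32)
probs = forward(y_T, x_r)
```

Each row of `probs` is a distribution over the 6 question classes. The rule layer reads the same distributed tree as Synt, so an injected rule reaches the final decision through a short path that bypasses the hidden layers.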
We performed a 3-fold cross-validation on the training set to accumulate misclassified examples for the human learning loop. Pat inspected these examples by using heat parse trees and encoded the declarative rules in $W_H$ (Table 3). The declarative rules in $W_H$ are derived from examples (as in Figure 1) and then injected as rows in the matrix $W_H$, as described in Section 3.4. We compared three systems: BoW, which contains only the word embeddings used as a bag of words; PureNN, which is the system without human knowledge; and HumNN, which is the full system with Pat's declarative knowledge.
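The accumulation of misclassified examples through cross-validation can be sketched in plain Python as below. The function `collect_misclassified`, the round-robin fold assignment, and the majority-class trainer are assumptions made for the example; in the experiments the trained model is the neural network described above, and the data shown here are invented.

```python
from collections import Counter


def collect_misclassified(examples, labels, train_fn, k=3):
    """Run k-fold cross-validation and gather held-out errors.

    train_fn(train_examples, train_labels) must return a predict function.
    """
    n = len(examples)
    errors = []
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        predict = train_fn([examples[i] for i in train],
                           [labels[i] for i in train])
        for i in held_out:
            predicted = predict(examples[i])
            if predicted != labels[i]:
                errors.append((i, examples[i], labels[i], predicted))
    return errors


def majority_trainer(train_examples, train_labels):
    """Toy stand-in for the real model: always predict the majority class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda example: majority


# Invented toy data, only to exercise the loop.
questions = ["q0", "q1", "q2", "q3", "q4", "q5"]
classes = ["NUM", "NUM", "NUM", "NUM", "ABBR", "NUM"]
errors = collect_misclassified(questions, classes, majority_trainer, k=3)
```

The returned list holds, for each error, the example index, the example, the gold class, and the predicted class, which is the material Pat would inspect with heat parse trees.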

Results and Discussion
Results of our pilot experiment show important facts that we will examine in the following, focusing also on the limitations of this analysis.
Distributed tree encoders positively introduce syntactic information in neural networks: the f-measure improves from 0.84 for BoW to 0.93 for PureNN (Table 3). This confirms a general trend observed in similar experiments on other classification tasks [40].
The analysis of the errors in the training set produced very reasonable rules for two specific classes: Abbreviation (ABBR) and Numeric (NUM) (Table 4). For the abbreviation class, Pat selected very reasonable rules: a question asks for the explanation of an abbreviation if it contains parse subtrees representing the verbal phrases "stand for" or "mean", or noun phrases containing the adjective "full" or the noun "abbreviation". For the NUM class, rules are either much more specific or definitely more general. Important indicators that a question is asking for a numerical answer are, respectively, that the question contains the WH-noun-phrase "What debts" or that it contains noun phrases made of a sequence of two proper nouns, a possessive ending, and another noun, that is, (NP (NP (NNP)(NNP)(POS))(NN)). The latter is a very general rule. These rules are then used to build the matrix $W_H$ used in the model with human knowledge (HumNN). Pat could positively change the behavior of the system, although the global results of HumNN are similar to, and only slightly higher than, those of PureNN. On the general results, the effect is small: the micro-average is 0.93 for both models, and the macro-average is 0.92 for HumNN against 0.91 for PureNN. Looking more specifically at the confusion matrices (Tables 5 and 6), we may observe that Pat has changed the behavior of the system where s/he wanted. Since Pat aimed to manipulate the behavior of the system in favor of the classes ABBR and NUM, s/he focused on examples where PureNN fails and coded the resulting rules in $W_H$. After learning the new model HumNN, perturbed by human declarative knowledge, results on the test set are encouraging. Although the overall performance is essentially unchanged, the target classes improved: both ABBR and NUM have one additional correctly classified example (Table 6).
This tiny improvement suggests that the model can positively use declarative human knowledge. Finally, heat parse trees are informative: Pat could understand why some specific cases were misclassified and could select declarative rules to change the behavior of the system. Being a pilot study, the experiment has some intrinsic limitations. The first limitation is that the model has been tested on a single, small dataset. However, this first pilot experiment supports our hypothesis. The second limitation is that we have not performed an ablation test on the rules in Table 4. When adding external knowledge, introducing rules and consequently manipulating the processes of NNs could have a negative impact on the system, depending on the rules introduced. However, in our pilot experiment, we introduced a very small set of rules, which shows that Pat can obtain a positive variation in the behavior of the overall system. This is the major objective of the present study. Globally, the results of the pilot experiment confirmed our hypothesis: humans can positively manipulate the results of the system by inducing rules from the training set.
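The different behavior of the micro- and macro-averages discussed above can be illustrated with a short computation of both from a confusion matrix. The two matrices below are invented toy data with three classes and are not the confusion matrices of Tables 5 and 6. The sketch assumes single-label classification, where the micro-averaged f-measure equals accuracy.

```python
def f_scores(confusion):
    """Return (micro, macro) f-measure; rows are gold classes, columns predictions."""
    n = len(confusion)
    per_class = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        denominator = 2 * tp + fp + fn
        per_class.append(2 * tp / denominator if denominator else 0.0)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    micro = correct / total          # equals accuracy in the single-label case
    macro = sum(per_class) / n       # every class counts the same
    return micro, macro


# Invented toy data: class 0 is rare and has one error before the fix.
before = [[1, 1, 0],
          [0, 8, 0],
          [0, 1, 9]]
after = [[2, 0, 0],
         [0, 8, 0],
         [0, 1, 9]]

micro_before, macro_before = f_scores(before)
micro_after, macro_after = f_scores(after)
```

Correcting a single example of a rare class moves the macro-average more than the micro-average, because the macro-average weights every class equally regardless of its size.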

Conclusions and Future Work
In line with the effort to understand neural networks and to control their behavior by means other than training examples, we presented Pat-in-the-Loop. Our model exploits syntactic information in neural networks by using distributed tree encoders, visualizes the activation of syntactic information with heat parse trees, and encodes declarative knowledge in a neural network by keeping humans in the learning loop. In Pat-in-the-Loop, Pat can understand why decisions are taken by a specific network by looking at the activation of specific neurons for specific examples and, hence, can define specific rules to control the behavior of the neural network. According to our pilot study, Pat can obtain the desired change in the behavior of the overall system. Although it gives encouraging results, our pilot experiment leaves some issues unanswered: the impact of the size of the dataset on the results and the impact of the quality of the introduced rules. These open issues will shape our future research. Hence, these encouraging results of a pilot study are a first "declarative pat" on neural networks applied to natural language processing, which may open a wide range of possible research directions and demonstrate that keeping humans in the loop is an important direction to ensure correctness, relevance, and cost-effectiveness.
Our future plans stem from our recent results. We have expanded our approach with the Kernel-inspired Encoder with Recursive Mechanism for Interpretable Trees (KERMIT) [40] and its visualizer KERMITviz. Hence, our future goal is to analyze more carefully the interaction between the syntactic and semantic sources of information on heterogeneous tasks and to set up a clear procedure for selecting positive declarative rules by means of ablation tests on a development set. The improvement given by this analysis may open the possibility of producing better rules for controlling the neural network. Then, we may better keep the human in the loop of an Artificial Intelligence system [41].