# Language Semantics Interpretation with an Interaction-Based Recurrent Neural Network


## Abstract


## 1. Introduction

#### 1.1. Overview

#### 1.2. Problems in RNN

#### 1.3. Problems in Text Classification Using RNN

#### 1.4. Performance Diagnosis Test

#### 1.5. Remark

#### 1.6. Contributions

#### 1.7. Organization of Paper

## 2. A Novel Influence Measure: I-Score

#### 2.1. I-Score, Confusion Table, and AUC

**Table 3.** Confusion Table. This table defines the components of a confusion table and the relationship between sensitivity/recall, precision, and the F1 score. For simplicity, denote true positive, false negative, false positive, and true negative to be ${\alpha}_{1}$, ${\alpha}_{2}$, ${\alpha}_{3}$, and ${\alpha}_{4}$, respectively. In generalized situations where the predicted condition has more than two partitions, the cells of the confusion table can be simplified using the same $\alpha$’s directly.

| | Predicted Condition Positive | Predicted Condition Negative | |
|---|---|---|---|
| Actual Condition Positive | True positive ${T}_{p}={\alpha}_{1}$ (Correct) | False negative ${F}_{n}={\alpha}_{2}$ (Incorrect) | Sensitivity/Recall Rate (RR): $\frac{{T}_{p}}{{T}_{p}+{F}_{n}}\times 100\%=\frac{{\alpha}_{1}}{{\alpha}_{1}+{\alpha}_{2}}\times 100\%$ |
| Actual Condition Negative | False positive ${F}_{p}={\alpha}_{3}$ (Incorrect) | True negative ${T}_{n}={\alpha}_{4}$ (Correct) | Specificity Rate (SR): $\frac{{T}_{n}}{{T}_{n}+{F}_{p}}\times 100\%=\frac{{\alpha}_{4}}{{\alpha}_{3}+{\alpha}_{4}}\times 100\%$ |
| | Precision/Positive Predictive Value (PPV): $\frac{{T}_{p}}{{T}_{p}+{F}_{p}}\times 100\%$ | Negative Predictive Value (NPV): $\frac{{T}_{n}}{{T}_{n}+{F}_{n}}\times 100\%$ | F1 Score: $\frac{2\cdot {T}_{p}}{2\cdot {T}_{p}+{F}_{p}+{F}_{n}}\times 100\%$ |
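As a quick illustration of the rates in Table 3, the sketch below computes them from the four counts ${\alpha}_{1},\dots,{\alpha}_{4}$; the function name and interface are ours for illustration, not part of the paper.

```python
def confusion_metrics(a1, a2, a3, a4):
    """Rates from Table 3, with a1 = true positives, a2 = false
    negatives, a3 = false positives, a4 = true negatives."""
    sensitivity = a1 / (a1 + a2)       # recall rate (RR)
    specificity = a4 / (a3 + a4)       # specificity rate (SR)
    precision = a1 / (a1 + a3)         # positive predictive value (PPV)
    npv = a4 / (a4 + a2)               # negative predictive value (NPV)
    f1 = 2 * a1 / (2 * a1 + a3 + a2)   # F1 score
    return sensitivity, specificity, precision, npv, f1
```

For example, `confusion_metrics(40, 10, 20, 30)` yields a sensitivity of 0.8 and a specificity of 0.6.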

- First, part (i) is a function of sensitivity. More importantly, the I-score serves as a lower bound of sensitivity: the proposed I-score statistic is high when the sensitivity is high, which means the I-score can be used as a metric to select high-sensitivity variables. A nice benefit of this property is that high sensitivity is the most important driving force for raising AUC values. This relationship is presented in the top right plot in Figure 1.
- Second, part (ii) is a function of ${\alpha}_{2}$, which approaches zero when the variable is highly predictive. This leaves the second part largely determined by the global average of the response variable Y, scaled up in proportion to the number of observations that fall in the second partition (${X}_{1}=0$), which is the sum ${\alpha}_{2}+{\alpha}_{4}$. An interesting benefit of this property is that a near-zero ${\alpha}_{2}$ value, jointly with part (i), implies that the specificity is high, which is another important driving force for raising AUC values. In other words, when the predictor has all the information needed to make a good prediction, the value of ${\alpha}_{2}$ is expected to be approximately zero. In addition, the global mean of the true condition can be written as $\overline{Y}=\frac{1}{n}({\alpha}_{1}+{\alpha}_{2})$. Hence, part (ii) can be rewritten as ${\left(\overline{Y}{\alpha}_{4}\right)}^{2}$, where ${\alpha}_{4}$ positively affects specificity because specificity is ${\alpha}_{4}/({\alpha}_{3}+{\alpha}_{4})$. Thus, part (ii) is a function of specificity.
- Third, the I-score is capable of measuring a variable set as a whole without making any assumption about the underlying model. AUC, however, is defined between a response variable Y and a predictor $\widehat{Y}$. If a variable set has more than one variable, some assumptions about the underlying model need to be made—we would need $\widehat{Y}:=f({X}_{1},{X}_{2},\dots)$—in order to compute the AUC value.
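For readers who want to experiment, here is a minimal sketch of a partition-based influence score, using the common normalization $I=\frac{1}{n\hat{\sigma}^{2}}\sum_{j}n_{j}^{2}{(\overline{Y}_{j}-\overline{Y})}^{2}$; the paper's Equation (14) may normalize differently, so treat this as illustrative rather than the paper's exact formula.

```python
from collections import defaultdict

def i_score(X_rows, y):
    """Partition-based influence score for a discrete variable set.
    X_rows: one tuple of discrete values per observation; y: 0/1 responses.
    Uses I = (1 / (n * var(y))) * sum_j n_j^2 * (ybar_j - ybar)^2,
    one common normalization (the paper's Equation (14) may differ)."""
    n = len(y)
    ybar = sum(y) / n
    var = sum((v - ybar) ** 2 for v in y) / n
    cells = defaultdict(list)          # partition cell -> responses
    for row, v in zip(X_rows, y):
        cells[row].append(v)
    score = 0.0
    for vals in cells.values():
        nj = len(vals)
        score += nj ** 2 * (sum(vals) / nj - ybar) ** 2
    return score / (n * var) if var > 0 else 0.0
```

On a balanced XOR design, the joint set $\{X_1, X_2\}$ scores high while $X_1$ alone scores zero, illustrating how the I-score rewards the variable set as a whole without any model assumption.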

#### Remark

#### 2.2. An Interaction-Based Feature: Dagger Technique

#### 2.3. Discretization

**Algorithm 1:** Discretization. Procedure of Discretization for an Explanatory Variable.

#### 2.4. Backward Dropping Algorithm

**Algorithm 2:** BDA. Procedure of the Backward Dropping Algorithm (BDA).

#### 2.5. A Toy Example

- When the data set has many noisy variables and no variable carries any marginal signal, the common-practice AUC value will miss the information, because AUC still relies on the marginal signal. In addition, AUC is defined between the response variable Y and its predictor $\widehat{Y}$, which requires us to make assumptions about the underlying model formulation. This is problematic because mistakes carried over from those assumptions can largely affect the resulting AUC. However, this challenge is not a roadblock for the proposed I-score at all. In the same scenario with no marginal signal, as long as the important variables are included in the selection, the I-score has no problem signaling high predictive power, regardless of whether the correct form of the underlying model can be found.
- The proposed I-score is defined using the partition of the variable set. This variable set can contain multiple variables, and the computation of the I-score does not require any assumption about the underlying model. This means the proposed I-score is not subject to mistakes carried over from assuming or searching for the true model. Hence, the I-score is a non-parametric measure.
- The construction of the I-score can also be used to create a new variable based on the partition of any variable set. We call this new variable ${X}^{\dagger}$, hence the name “dagger technique”. It is an engineering technique that combines a variable set to form a new variable that contains all the predictive power the entire variable set can provide. This is a very powerful technique due to its high flexibility. In addition, it can be constructed using the variables with a high I-score value after running the Backward Dropping Algorithm.
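A minimal sketch of the dagger construction described above, assuming (consistent with the partition-mean description around Equation (15)) that ${X}^{\dagger}$ assigns to each observation the local response mean ${\overline{Y}}_{j}$ of its partition cell; the paper's exact formula may differ, so this is illustrative only.

```python
from collections import defaultdict

def dagger_feature(X_rows, y):
    """Sketch of the 'dagger technique': collapse a discrete variable
    set into a single feature whose value for each observation is the
    partition mean ybar_j of the response over that observation's cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, v in zip(X_rows, y):
        sums[row] += v
        counts[row] += 1
    means = {cell: sums[cell] / counts[cell] for cell in sums}
    return [means[row] for row in X_rows]
```

On noiseless XOR data, the dagger feature reproduces the response exactly, matching the claim that it preserves all the predictive power of the variable set.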

**Table 4.** Simulation Results. This table presents the simulation results for the model $Y=({X}_{1}+{X}_{2})\ \mathrm{mod}\ 2$. In this simulation, we create toy data with just 10 variables (all drawn from the Bernoulli distribution with probability $1/2$). The true model uses only the first two variables; the remaining variables are noise. The task is to report the AUC and I-score values. The experiment is repeated 30 times, and we present the average and standard deviation (SD) of the AUC and I-score values. No variable contributes marginal information on its own: each variable by itself has a low AUC value and an I-score below 1 (an I-score below 1 indicates almost no predictive power). We use guessed models (i) to (iv) composed of the true variables (here we assume that we know $\{{X}_{1},{X}_{2}\}$ is important, but we do not know the true functional form). We set $\epsilon=0.0001$ so that the division in model (iv) is well defined when ${X}_{2}=0$. Last, we present the true model as a benchmark. Note: an “NA” entry means that the AUC cannot be computed, i.e., it is not applicable. The AUC measure has a major drawback: it cannot reliably detect useful information. Even with the correct variables selected (all guessed models use only the important variables $\{{X}_{1},{X}_{2}\}$), the AUC is subject to serious distortion from incorrect model assumptions. This flaw makes model selection based on the AUC measure sub-optimal. In contrast, the proposed I-score indicates that ${X}_{1}$ and ${X}_{2}$ are the most important variables regardless of the form of the underlying model. Moreover, the dagger technique of building ${X}^{\dagger}$ from the partitions generated by the variable set $\{{X}_{1},{X}_{2}\}$ completely recovers the full information of the true model even before any machine learning or model selection procedure, a novel construction that the literature has not yet seen.

| | | Average AUC | SD of AUC | Average I-Score | SD of I-Score |
|---|---|---|---|---|---|
| Important | ${X}_{1}$ | 0.51 | 0.01 | 0.49 | 0.52 |
| | ${X}_{2}$ | 0.51 | 0.01 | 0.52 | 0.68 |
| Noisy | ${X}_{3}$ | 0.51 | 0.01 | 0.65 | 0.71 |
| | ${X}_{4}$ | 0.5 | 0.01 | 0.41 | 0.6 |
| | ${X}_{5}$ | 0.51 | 0.01 | 0.61 | 0.77 |
| | ${X}_{6}$ | 0.5 | 0.01 | 0.27 | 0.29 |
| | ${X}_{7}$ | 0.51 | 0.01 | 0.42 | 0.7 |
| | ${X}_{8}$ | 0.5 | 0.01 | 0.33 | 0.48 |
| | ${X}_{9}$ | 0.51 | 0.01 | 0.49 | 0.68 |
| | ${X}_{10}$ | 0.51 | 0.01 | 0.39 | 0.48 |
| Guessed models (using $\{{X}_{1},{X}_{2}\}$) | model (i): ${X}_{1}+{X}_{2}$ | 0.51 | 0.01 | 749.68 | 0.61 |
| | model (ii): ${X}_{1}-{X}_{2}$ | 0.51 | 0.01 | 749.54 | 0.37 |
| | model (iii): ${X}_{1}\cdot {X}_{2}$ | 0.49 | 0.25 | 248.2 | 18.26 |
| | model (iv): ${X}_{1}/({X}_{2}+\epsilon)$ | 0.57 | 0.11 | 250.03 | 11.72 |
| | ${X}^{\dagger}$ (see Equation (15)) | 1 | 0 | 999.14 | 0.54 |
| | $\{{X}_{1},{X}_{2}\}$ | NA | NA | 500.08 | 0.48 |
| True model | $({X}_{1}+{X}_{2})\ \mathrm{mod}\ 2$ | 1 | 0 | 999.14 | 0.54 |
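The qualitative pattern in Table 4 (marginal AUC near 0.5, perfect AUC for the XOR interaction that the dagger feature recovers) can be reproduced with a small simulation; this is our own sketch under the stated setup, not the paper's code. The `auc` helper uses the Mann-Whitney formulation.

```python
import random

def auc(scores, labels):
    """Mann-Whitney AUC: the probability that a random positive case
    outranks a random negative case (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy setting of Table 4: Y = (X1 + X2) mod 2 with Bernoulli(1/2) inputs.
random.seed(0)
n = 500
X1 = [random.randint(0, 1) for _ in range(n)]
X2 = [random.randint(0, 1) for _ in range(n)]
Y = [(a + b) % 2 for a, b in zip(X1, X2)]
# Marginally, X1 carries no signal, so its AUC hovers near 0.5; the
# XOR interaction (a stand-in for the dagger feature's partition means)
# recovers Y exactly, so its AUC is 1.
marginal_auc = auc(X1, Y)
dagger_auc = auc([(a + b) % 2 for a, b in zip(X1, X2)], Y)
```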

#### 2.6. Why I-Score?

#### 2.6.1. Why Is I-Score the Best Candidate?

#### 2.6.2. Non-Parametric Nature

#### 2.6.3. High I-Score Produces High AUC Values

#### 2.6.4. I-Score and the “Dagger Technique”

## 3. Application

#### 3.1. Language Modeling

#### 3.1.1. N-Gram

#### 3.1.2. Recurrent Neural Network

#### 3.1.3. Backward Propagation Using Gradient Descent

#### 3.1.4. Implementation with I-Score

- First, we can compute the I-score for each RNN unit. For example, in Figure 3A, we can first compute the I-score on the text vectorization layer and then on the embedding layer. Given the distribution of I-score values over the feature matrix, we can use a particular threshold as the cutoff to screen for the important features that should be fed into the RNN architecture. We denote this action by $\mathsf{\Gamma}(\cdot)$, defined as $\mathsf{\Gamma}(\mathcal{X}):=\mathcal{X}\cdot\mathbb{1}(\mathrm{I}(Y,\mathcal{X})>\mathrm{threshold})$. For the input layer, each feature ${X}_{t}$ is released or omitted according to its I-score value. That is, we use $\mathsf{\Gamma}({X}_{t}):={X}_{t}\cdot\mathbb{1}(\mathrm{I}(Y,{X}_{t})>\mathrm{threshold})$ to determine whether the input feature ${X}_{t}$ is predictive and important enough to be fed into the RNN architecture. For the hidden layer, each hidden neuron ${h}_{t}$ is released or omitted according to its I-score value. In other words, we use $\mathsf{\Gamma}({h}_{t}):={h}_{t}\cdot\mathbb{1}(\mathrm{I}(Y,{h}_{t})>\mathrm{threshold})$ to determine whether the hidden neuron ${h}_{t}$ is important enough to be kept in the RNN architecture. If at a certain t the input feature ${X}_{t}$ fails to meet the I-score threshold (so that $\mathsf{\Gamma}({X}_{t})=0$), then this feature is not fed into the architecture, and the corresponding hidden unit is computed, using Equation (19), as ${h}_{t}=g(W\cdot{h}_{t-1}+U\cdot 0+b)$. The same holds for any hidden neuron: if the previous hidden neuron ${h}_{t-1}$ of a certain hidden neuron ${h}_{t}$ fails to meet the I-score criterion, then ${h}_{t}$ is computed as ${h}_{t}=g(W\cdot 0+U\cdot{X}_{t}+b)$. 
Hence, this $\mathsf{\Gamma}(\cdot)$ function acts as a gate that allows the information in a neuron to pass through according to a certain I-score threshold. If $\mathsf{\Gamma}({X}_{t})$ is zero, the input feature is not important at all and can be omitted by replacing it with a zero value; in other words, it is as if the feature never existed, and there is no need to construct $\mathsf{\Gamma}({h}_{t})$ for it. We show later in Section 3 that important long-term dependencies associated with language semantics can be detected using this $\mathsf{\Gamma}(\cdot)$ function because the I-score has the power to omit noisy and redundant features in the RNN architecture. Since the I-score is compared throughout the entire length T, long-term dependencies between features that are far apart can be captured via high I-score values.
- Second, we can use the “dagger technique” to engineer novel features using Equation (15) and then calculate the I-score on these dagger feature values to see how important they are. We can directly use a 2-gram model, and the I-score is capable of indicating which 2-gram phrases are important; these phrases act as two-way interactions. Based on the I-score, we can then decide whether we want the words in 2-gram models, 3-gram models, or even higher-order N-gram models. When n is large, we recommend using the proposed Backward Dropping Algorithm to reduce dimensions within the N-word phrase before creating a new feature with the proposed “dagger technique”. For example, suppose we use the 2-gram model. A sentence such as “I love my cats” can be processed into (I, love), (love, my), (my, cats); each feature set has two words. We can denote the original sentence “I love my cats” by four features $\{{X}_{1},{X}_{2},{X}_{3},{X}_{4}\}$. The “dagger technique” suggests that we can use Equation (15) with 2-gram models fed in as inputs. In other words, we can take $\{{X}_{1},{X}_{2}\}$ and construct ${\overline{Y}}_{j}$, where j is the running index tracking the partitions formed using $\{{X}_{1},{X}_{2}\}$. If we discretize ${X}_{1}$ and ${X}_{2}$ (see Section 2.3 for a detailed discussion of discretization using the I-score) and they both take values in $\{0,1\}$, then there are ${2}^{2}=4$ partitions, and hence j can take values in $\{1,2,3,4\}$. In this case, the novel feature ${X}_{1}^{\dagger}$ can take on four values; an example can be seen in Table 2. The combination of the I-score, the Backward Dropping Algorithm, and the “dagger technique” allows us to prune the useful and predictive information in a feature set so that we can achieve maximum prediction power with as few features as possible.
- Third, we can concatenate several N-gram models with different n values. For example, we can carry out N-gram modeling with $n=2$, $n=3$, and $n=4$; this way, we can incorporate more high-order interactions. To avoid overfitting, we can use the I-score to select the important interactions and then use these selected phrases (which can be two-word, three-word, or four-word) to build RNN models.
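The $\mathsf{\Gamma}(\cdot)$ gating and the n-gram construction described in the list above can be sketched as follows; the scalar RNN update and all function names here are illustrative simplifications of ours, not the paper's implementation (Equation (19) uses weight matrices rather than scalars).

```python
import math

def gamma(value, i_score_value, threshold):
    """The gate Γ(·): pass a neuron or input through only when its
    I-score clears the threshold; otherwise replace it with zero."""
    return value if i_score_value > threshold else 0.0

def rnn_step(h_prev, x_t, W, U, b, i_h, i_x, threshold):
    """One scalar RNN update h_t = g(W * Γ(h_{t-1}) + U * Γ(x_t) + b),
    with g = tanh. Scalar weights are for illustration only."""
    return math.tanh(W * gamma(h_prev, i_h, threshold)
                     + U * gamma(x_t, i_x, threshold) + b)

def ngrams(tokens, n):
    """Contiguous n-grams, e.g. the 2-gram split of 'I love my cats'
    into (I, love), (love, my), (my, cats)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concat_ngrams(tokens, orders=(2, 3, 4)):
    """Concatenate n-gram features of several orders, as in the
    '2-gram + 3-gram + 4-gram' models; I-score screening would then
    prune this combined feature set."""
    return [g for n in orders for g in ngrams(tokens, n)]
```

When both the previous hidden state and the input fail the threshold, the update collapses to $g(b)$, which mirrors the text's description of omitted features behaving as if they never existed.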

#### 3.2. IMDB Dataset

**Table 5.** Sample Data and Selected Semantics. This table presents two samples. The second column contains paragraphs taken directly from the IMDB movie database, and the final column is the corresponding label. The proposed I-score selects words that have a significant association with the semantics of the sentence, because the I-score selects features that are highly predictive of the target variable. In this application, the target variable carries the tone and preference of the reviewer who wrote the critique. The semantics of the critique reflect that tone and preference, which is why the I-score is able to detect these features from the provided labels.

| No. | Samples | I-Score Features (Uni-Gram) | I-Score Features (2-Gram, 3-Gram) | Label |
|---|---|---|---|---|
| 1 | $<\mathrm{UNK}>$ this film was just brilliant casting location scenery story direction everyone’s really suited the part they played and you could just imagine being there robert $<\mathrm{UNK}>$ is an amazing actor and now the same being director $<\mathrm{UNK}>$ father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and … | {congratulations, lovely, true} | {amazing actor, really suited} | 1 |
| 2 | $<\mathrm{UNK}>$ big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i’ve seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it’s just so damn terribly written … | {bad, ridiculous} | {bad music, terribly written}, {damn terribly written} | 0 |

**Table 6.** Interpreted Semantics Using the I-score. This table presents two samples. For each, we show the original paragraph taken from the IMDB movie database, the features selected by the I-score at two thresholds (the top 7.5% and the top 25% are used as examples), and the corresponding label. The semantics of the selected features form a subset of words from the original sample; we observe that the I-score can select a subset of words while maintaining the same semantics.

| No. | Samples (Original Paragraphs) | Top 7.5% I-Score | Top 25% I-Score | Label |
|---|---|---|---|---|
| 1 | $<\mathrm{UNK}>$ this film was just brilliant casting location scenery story direction everyone’s really suited the part they played and you could just imagine being there robert $<\mathrm{UNK}>$ is an amazing actor and now the same being director $<\mathrm{UNK}>$ father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy’s that played the $<\mathrm{UNK}>$ of … (400 words) | {congratulations often the play them all a are and should have done you think the lovely because it was true and someone’s life after all that was shared with us all} (31 words) | {for it really at the so sad you what they at a must good this was also congratulations to two little boy’s that played the $<\mathrm{UNK}>$ of norman and paul they were just brilliant children are often left out the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done do not you think the whole story was so lovely because it was true and was someone’s life after all that was shared with us all} (101 words) | 1 |
| 2 | $<\mathrm{UNK}>$ big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i’ve seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it’s just so damn terribly written the clothes are sickening and funny in equal measures the hair is big lots of boobs bounce men wear those cut tee shirts that show off their $<\mathrm{UNK}>$ sickening that men actually wore them and the music is just $<\mathrm{UNK}>$ trash that plays over and over again in almost every scene there is trashy music boobs and … (400 words) | {those $<\mathrm{UNK}>$ every is trashy music away all aside this whose only is to look that was the 80’s and have good old laugh at how bad everything was back then} (31 words) | {script best worked who the just so terribly the clothes in equal hair lots boobs men wear those cut shirts that show off their $<\mathrm{UNK}>$ sickening that men actually wore them and the music is just $<\mathrm{UNK}>$ trash that over and over again in almost every scene there is trashy music boobs and $<\mathrm{UNK}>$ taking away bodies and the gym still does not close for $<\mathrm{UNK}>$ all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80’s and have a good old laugh at how bad everything was back then} (101 words) | 0 |

#### 3.3. Results

**Remark 1.**

**Table 7.** Test Performance Comparison. This table presents experimental results of prediction performance on the IMDB dataset. Performance is measured in AUC; the proposed method raises the test result to 97%, an 85% error reduction relative to the 81% average of the baseline models.

| Model | IMDB (Test Set) |
|---|---|
| CNN [33] | 37.5% |
| RNN | 85.7% |
| LSTM | 86.6% |
| GRU | 86.7% |
| 2-gram | 92.2% |
| 2-gram + 3-gram | 91.1% |
| 2-gram + 3-gram + 4-gram | 90.2% |
| Average | 81% |
| **Proposed:** | |
| 2-gram: use high I-score features | 96.5% |
| 2-gram + 3-gram: use high I-score features | 96.7% |
| 2-gram + 3-gram + 4-gram: use high I-score features | 97.2% |
| 2-gram + 3-gram + 4-gram: use novel “dagger” features (based on Equation (15)) | 97.2% |

## 4. Conclusions

**Theoretical Contribution.** We provide theoretical and mathematical reasoning for why the I-score can be considered a function of AUC. The construction of the I-score can be analyzed using partitions, and rearranging the I-score formula shows that sensitivity is a major component. This provides the fundamental driving force for raising AUC values when the variables selected to compute the I-score are important and significant. Beyond its theoretical parallelism with AUC, the I-score can be used anywhere in a neural network architecture, allowing end-users to deploy this computation flexibly, a feature that AUC does not offer. AUC is also vulnerable to incorrect model specifications, whereas any estimate of the true model, accurate or not, is harmless to the performance of the I-score due to its non-parametric nature, making it a novel measure for feature selection that the literature has not yet seen.

**Backward Dropping Algorithm.** We also propose a greedy search algorithm, the Backward Dropping Algorithm, that handles long-term dependencies in the dataset. Under the curse of dimensionality, the Backward Dropping Algorithm can efficiently screen out noisy and redundant information. Its design also exploits the nature of the I-score: the I-score increases when the variable set has fewer noisy features and decreases when noisy features are included.
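A greedy sketch of the backward-dropping idea, written in our own notation: starting from the full variable set, drop one variable at a time, keeping the deletion that scores highest, and return the best subset seen along the path. The `score_fn` callable stands in for the I-score of a candidate subset; both names are illustrative, not the paper's exact interface.

```python
def backward_dropping(variables, score_fn):
    """Greedy sketch of the Backward Dropping Algorithm: repeatedly
    drop the single variable whose removal yields the highest score,
    and return the best-scoring subset visited along the way."""
    current = tuple(variables)
    best_set, best_score = current, score_fn(current)
    while len(current) > 1:
        # Evaluate every one-variable deletion and keep the best one.
        candidates = [tuple(v for v in current if v != drop) for drop in current]
        current = max(candidates, key=score_fn)
        s = score_fn(current)
        if s > best_score:
            best_set, best_score = current, s
    return best_set, best_score
```

Because the I-score rises as noisy variables are removed and falls once an informative variable is dropped, the highest-scoring subset along the dropping path retains the influential variables.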

**Dagger Technique.** We propose a novel engineering technique, the “dagger technique”, that combines a set of features using partition retention to form a new feature that fully preserves the relationship between the explanatory variables and the response variable. The proposed “dagger technique” can successfully combine words and phrases with long-term dependencies into one new feature that carries long-term memory. It can also be used to construct features in many other types of deep neural networks, such as Convolutional Neural Networks (CNNs). Although we present empirical evidence on a sequential data application, the “dagger technique” extends beyond image and sequential data.

**Application.** We show, with empirical results, that the “dagger technique” can fully reconstruct target variables from the correct features, a method that generalizes to any feed-forward Artificial Neural Network (ANN) and Convolutional Neural Network (CNN). We demonstrate the usage of the I-score and the proposed methods on simulated data. We also show, in a real-world application on the IMDB movie dataset, that the proposed methods achieve a 97% AUC value, an approximately 85% error reduction from the 81% average of peer RNN models without the I-score.

**Future Research Directions.** We call for further exploration of using the I-score to extract features that have long-term dependencies in time-series and sequential data. Since relying on high-performance CPUs/GPUs is computationally costly, the I-score leads researchers to rethink the practice of designing ever longer and deeper neural networks. Instead, future research can continue under the approach of building low-dimensional but extremely informative features, so that less complicated models can be constructed for end-users.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent advances in recurrent neural networks. arXiv **2017**, arXiv:1801.01078.
- Bengio, Y.; Boulanger-Lewandowski, N.; Pascanu, R. Advances in optimizing recurrent networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8624–8628.
- Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. **2006**, 18, 1527–1554.
- Mahmood, M.A.; Popescu, A.C.; Oane, M.; Channa, A.; Mihai, S.; Ristoscu, C.; Mihailescu, I.N. Bridging the analytical and artificial neural network models for keyhole formation with experimental verification in laser-melting deposition: A novel approach. Results Phys. **2021**, 26, 104440.
- Mahmood, M.A.; Visan, A.I.; Ristoscu, C.; Mihailescu, I.N. Artificial neural network algorithms for 3D printing. Materials **2021**, 14, 163.
- Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. **1994**, 5, 157–166.
- Sutskever, I.; Martens, J.; Hinton, G.E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, 28 June–2 July 2011.
- Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information **2019**, 10, 150.
- Spirovski, K.; Stevanoska, E.; Kulakov, A.; Popeska, Z.; Velinov, G. Comparison of different model’s performances in task of document classification. In Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, Novi Sad, Serbia, 25–27 June 2018; pp. 1–12.
- Mandrekar, J.N. Receiver operating characteristic curve in diagnostic test assessment. J. Thorac. Oncol. **2010**, 5, 1315–1316.
- Halligan, S.; Altman, D.G.; Mallett, S. Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: A discussion and proposal for an alternative approach. Eur. Radiol. **2015**, 25, 932–939.
- Apicella, A.; Donnarumma, F.; Isgrò, F.; Prevete, R. A survey on modern trainable activation functions. Neural Netw. **2021**, 138, 14–32.
- Bengio, Y.; LeCun, Y. Scaling learning algorithms towards AI. Large-Scale Kernel Mach. **2007**, 34, 1–41.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. **2002**, 46, 389–422.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. **2012**, 25, 1097–1105.
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. **1989**, 2, 396–404.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature **2015**, 521, 436–444.
- Chernoff, H.; Lo, S.H.; Zheng, T. Discovering influential variables: A method of partitions. Ann. Appl. Stat. **2009**, 3, 1335–1369.
- Lo, A.; Chernoff, H.; Zheng, T.; Lo, S.H. Why significant variables aren’t automatically good predictors. Proc. Natl. Acad. Sci. USA **2015**, 112, 13892–13897.
- Lo, A.; Chernoff, H.; Zheng, T.; Lo, S.H. Framework for making better predictions by directly estimating variables’ predictivity. Proc. Natl. Acad. Sci. USA **2016**, 113, 14277–14282.
- Lo, S.H.; Yin, Y. An interaction-based convolutional neural network (ICNN) towards better understanding of COVID-19 X-ray images. arXiv **2021**, arXiv:2106.06911.
- Lo, S.H.; Yin, Y. A novel interaction-based methodology towards explainable AI with better understanding of pneumonia chest X-ray images. arXiv **2021**, arXiv:2104.12672.
- Lo, S.H.; Zheng, T. Backward haplotype transmission association algorithm—A fast multiple-marker screening method. Hum. Hered. **2002**, 53, 197–215.
- Carrington, A.M.; Fieguth, P.W.; Qazi, H.; Holzinger, A.; Chen, H.H.; Mayr, F.; Manuel, D.G. A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Med. Inform. Decis. Mak. **2020**, 20, 4.
- Baker, S.G. The central role of receiver operating characteristic (ROC) curves in evaluating tests for the early detection of cancer. J. Natl. Cancer Inst. **2003**, 95, 511–515.
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv **2016**, arXiv:1607.01759.
- Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. **1988**, 24, 513–523.
- Goldberg, Y.; Levy, O. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv **2014**, arXiv:1402.3722.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Tang, D.; Qin, B.; Feng, X.; Liu, T. Effective LSTMs for target-dependent sentiment classification. arXiv **2015**, arXiv:1512.01100.

**Figure 1.** Mechanism between I-score gain and AUC gain. This figure presents the mechanism of how the I-score can increase AUC. There are four plots. The top left plot is a ROC curve with one particular pair of (1-Specificity, Sensitivity). The top right plot presents sensitivity gain from the I-score. The bottom left plot presents specificity gain from the I-score. Both sensitivity and specificity are driving forces of the AUC values because they move the dot up or left, which then increases the area-under-curve (the blue area). The bottom right plot presents performance gain from both sensitivity and specificity. In summary, the implementation of the proposed I-score can increase AUC by selecting the features that raise both sensitivity (from part (i) of the I-score, see Equation (14)) and specificity (from part (ii) of the I-score, see Equation (14)).

**Figure 2.**A Simple RNN for Text Classification. This diagram illustrates the basic steps of using an RNN for text classification. The input features are $\{{X}_{1},{X}_{2},\dots \}$. The hidden neurons are $\{{h}_{1},{h}_{2},\dots \}$. The output prediction is $\widehat{Y}$. Since this is a text classification problem, the architecture has many inputs and one output, hence the name “many-to-one”. The architecture has parameters $\{U,V,W\}$, and these weights (or parameters) are shared throughout the architecture.
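A minimal sketch of the many-to-one forward pass in Figure 2 follows. The dimensions, initialization scale, and the tanh/sigmoid choices are illustrative assumptions; the key point is that the same $U$, $V$, $W$ are reused at every time step and only the final hidden state produces $\widehat{Y}$.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_h = 4, 3          # input and hidden sizes (illustrative)
T = 5                     # sequence length (number of tokens)

# Shared weights, reused at every time step (the U, V, W of Figure 2).
U = rng.normal(scale=0.1, size=(d_h, d_in))   # input -> hidden
W = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden
V = rng.normal(scale=0.1, size=(1, d_h))      # hidden -> output

X = rng.normal(size=(T, d_in))                # token features X_1..X_T
h = np.zeros(d_h)
for x_t in X:                                 # recurrence: h_t = tanh(U x_t + W h_{t-1})
    h = np.tanh(U @ x_t + W @ h)

# Many-to-one: a single prediction from the last hidden state.
y_hat = 1 / (1 + np.exp(-(V @ h)))            # sigmoid output
print(float(y_hat))
```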

**Figure 3.**Executive Diagram for the Proposed Method. This figure represents the proposed methodologies using N-grams, the I-score, and the “dagger technique”. The forward propagation and the backward propagation remain the same before and after using the I-score. The function $\mathsf{\Gamma}(\cdot)$ acts as a gate that releases a neuron according to certain I-score criteria. (**A**) Panel A presents the Recurrent Neural Network design to implement the I-score at each step of the neural network architecture. (**B**) Panel B presents features constructed using the “dagger technique” and then fed into the Recurrent Neural Network (RNN) architecture.
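The gate $\mathsf{\Gamma}(\cdot)$ described in the caption can be sketched as a threshold on per-feature influence scores. The `i_score` below is the partition-based influence statistic up to normalization (squared cell counts times squared deviations of local means); the exact normalization of Equation (14) may differ, and the names `gamma_gate` and `threshold` are hypothetical, not from the paper.

```python
import numpy as np

def i_score(x_bin, y):
    # Partition-based influence score of a discrete feature (up to
    # normalization): sum over partitions of n_j^2 * (ybar_j - ybar)^2 / n.
    x_bin, y = np.asarray(x_bin), np.asarray(y, dtype=float)
    n, y_bar, score = len(y), y.mean(), 0.0
    for level in np.unique(x_bin):
        mask = x_bin == level
        score += mask.sum() ** 2 * (y[mask].mean() - y_bar) ** 2 / n
    return score

def gamma_gate(h, scores, threshold):
    # Hypothetical gate: keep an activation only when its associated
    # I-score clears the threshold; zero it out otherwise.
    return np.where(np.asarray(scores) >= threshold, h, 0.0)

# A feature that separates y perfectly scores higher than an independent one.
y      = np.array([0, 0, 1, 1])
strong = np.array([0, 0, 1, 1])   # matches y exactly
weak   = np.array([0, 1, 0, 1])   # independent of y
print(i_score(strong, y), i_score(weak, y))
```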

**Figure 4.**Learning Paths Before and After Discretization. This figure presents the training procedure. All graphs show the training and validation paths. The first graph is from the original bi-gram data. The second uses the discretized bi-grams (discretized by the I-score). The third uses the top 18 variables ranked by I-score values. The proposed method significantly improves computational efficiency.

**Figure 5.**Learning Paths Before and After Text Reduction Using I-score. This figure presents the training procedure. All graphs show the training and validation paths. The first graph is from the original bi-gram data. The second uses the discretized bi-grams (discretized by the I-score). The third uses the top 18 variables ranked by I-score values. The proposed method significantly improves computational efficiency.

**Table 1.**Famous Activation Functions. This table presents three well-known non-linear activation functions used in neural network architectures. We use the ReLU as the activation function in the hidden layers and the Sigmoid as the activation function for the output unit. These activation functions are discussed in detail in Apicella (2021) [12], and we also compute the derivatives of these common activations in the table.

Name | Function | Derivative |
---|---|---|
Sigmoid | $\sigma \left(x\right)=\frac{1}{1+{e}^{-x}}$ | $\frac{\partial}{\partial x}\sigma \left(x\right)=\frac{{e}^{-x}}{{({e}^{-x}+1)}^{2}}$ |
tanh | $\sigma \left(x\right)=\frac{{e}^{x}-{e}^{-x}}{{e}^{x}+{e}^{-x}}$ | $\frac{\partial}{\partial x}\sigma \left(x\right)=\frac{4{e}^{2x}}{{({e}^{2x}+1)}^{2}}$ |
ReLU | $f\left(x\right)=\begin{cases}0 & \mathrm{if}\;x<0\\ x & \mathrm{if}\;x\ge 0\end{cases}$ | $\frac{\partial}{\partial x}f\left(x\right)=\begin{cases}0 & \mathrm{if}\;x<0\\ 1 & \mathrm{if}\;x>0\\ \mathrm{not}\;\mathrm{applicable} & \mathrm{if}\;x=0\end{cases}$ |
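The closed-form derivatives in Table 1 can be checked numerically against central finite differences; the sketch below does so for all three activations (avoiding $x=0$, where the ReLU derivative is not defined).

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x): return np.exp(-x) / (np.exp(-x) + 1.0) ** 2
def tanh(x):      return np.tanh(x)
def d_tanh(x):    return 4.0 * np.exp(2 * x) / (np.exp(2 * x) + 1.0) ** 2
def relu(x):      return np.where(np.asarray(x) >= 0, x, 0.0)
def d_relu(x):    return np.where(np.asarray(x) > 0, 1.0, 0.0)  # undefined at x = 0

# Sanity-check the table's closed forms against central differences.
x = np.array([-2.0, -0.5, 0.7, 2.0])   # avoid x = 0 for ReLU
eps = 1e-6
for f, df in [(sigmoid, d_sigmoid), (tanh, d_tanh), (relu, d_relu)]:
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
    assert np.allclose(df(x), numeric, atol=1e-4)
```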

**Table 2.**Interaction-Based Engineering: the “Dagger Technique”. This table summarizes the construction of ${X}^{\dagger}$ (the “dagger technique”). Suppose there is a variable set $\{{X}_{1},{X}_{2}\}$, each variable taking values in $\{0,1\}$. The new feature ${X}^{\dagger}$ takes, as its value, the local average of the target variable Y within each partition induced by the variable set $\{{X}_{1},{X}_{2}\}$. Here the variable set produces four partitions, so ${X}^{\dagger}$ is defined according to the following table. In the test set, the target variable (or response variable) Y is not observed, so the training-set values are used; as a reminder, when generating ${X}^{\dagger}$ for the test set, we use the ${\widehat{y}}_{j}$’s computed on the training set.
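The construction in Table 2 can be sketched as a group-by on the partitioning variables. The toy data below are illustrative; the essential steps are taking the local average of Y within each $(X_1, X_2)$ cell on the training set, then looking those averages up for test rows, where Y is unobserved.

```python
import pandas as pd

# Toy training data: two binary features and a binary target (illustrative).
train = pd.DataFrame({
    "X1": [0, 0, 1, 1, 0, 1, 1, 0],
    "X2": [0, 1, 0, 1, 1, 0, 1, 0],
    "Y":  [0, 1, 1, 1, 0, 1, 1, 0],
})

# X-dagger on the training set: the local average of Y within each of
# the four partitions defined by (X1, X2).
train["X_dagger"] = train.groupby(["X1", "X2"])["Y"].transform("mean")

# On the test set Y is unobserved, so look up the training-set averages.
lookup = train.groupby(["X1", "X2"])["Y"].mean()
test = pd.DataFrame({"X1": [1, 0], "X2": [1, 1]})
test["X_dagger"] = lookup.reindex(
    list(zip(test["X1"], test["X2"]))).to_numpy()
print(test)
```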

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lo, S.-H.; Yin, Y.
Language Semantics Interpretation with an Interaction-Based Recurrent Neural Network. *Mach. Learn. Knowl. Extr.* **2021**, *3*, 922-945.
https://doi.org/10.3390/make3040046
