Language Semantics Interpretation with Interaction-based Recurrent Neural Networks

Text classification is a fundamental task in Natural Language Processing. A variety of sequential models are capable of making good predictions, yet there is a lack of connection between language semantics and prediction results. This paper proposes a novel influence score (I-score), a greedy search algorithm called the Backward Dropping Algorithm (BDA), and a novel feature engineering technique called the "dagger technique". First, the paper proposes the I-score to detect and search for the important language semantics in a text document that are useful for making good predictions in text classification tasks. Next, the Backward Dropping Algorithm, a greedy search algorithm, is proposed to handle long-term dependencies in the dataset. Moreover, the paper proposes the "dagger technique", which fully preserves the relationship between the explanatory variables and the response variable. The proposed techniques can be generalized to feed-forward Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), and indeed any neural network. A real-world application on the Internet Movie Database (IMDB) is presented, where the proposed methods improve prediction performance with an 81% error reduction compared with popular peers that do not implement the I-score and the "dagger technique".


Introduction
Overview Artificial Neural Networks (ANNs) are created from many layers of fully connected units called artificial neurons. A "shallow network" refers to an ANN with one hidden layer, while a "deep network" can have many hidden layers [1]. In the architecture of a feed-forward ANN, each neuron has a linear component and a non-linear component defined by an activation function. Shallow networks provide simple architectures, while deeper networks generate more abstract data representations [1]. An important roadblock is the optimization difficulty caused by the non-linearity at each layer. Due to this, few significant advances could be achieved before 2006 [2,3]. Another important issue is the generation of large pools of datasets [4,5]. A family of ANNs with recurrent connections are called Recurrent Neural Networks (RNNs). These architectures are designed to model sequential data for sequence recognition, classification, and prediction [6]. There is a body of research in the literature on RNNs investigating both discrete-time and continuous-time frameworks. In this paper, we focus on discrete-time RNNs.
Problems in RNN The development of back-propagation using gradient descent (GD), namely back-propagation through time (BPTT), has provided many great opportunities for training RNNs [7]. However, many challenges remain unsolved in modelling long-term dependencies [1]. If we have a sequence of time-series features X_t, it is extremely challenging to detect the interaction relationship between X_t and X_{t+c} when c is a large constant. This roadblock makes BPTT training inefficient, leading to extremely costly training procedures and a lack of interpretation.

arXiv:2112.02997v1 [cs.CL] 2 Nov 2021

Table 1: Famous Activation Functions. This table presents three famous non-linear activation functions used in neural network architectures. We use the ReLU as the activation function in the hidden layers and the Sigmoid as the activation function for the output unit. The activation functions are discussed in detail in Apicella (2021) [12], and we also compute the derivatives of these common activations in the table.

Name | Function | Derivative
Sigmoid | σ(x) = 1/(1 + exp(−x)) | σ(x)(1 − σ(x))
ReLU | max(0, x) | 1(x > 0)
[third row of Table 1 lost in extraction]

Problems in Text Classification Using RNN In recent years, there has been exponential growth in the number of complex documents and texts that require a deeper understanding of prediction performance [8]. A major drawback of using deep RNNs is the lack of interpretability of the prediction performance. Due to the nature of language, one word can appear in different forms (i.e. singular versus plural, present tense versus past tense) while the semantic meaning of each form is the same [9]. Though many researchers have proposed techniques to tackle the interpretation and semantics problems in natural language, few have been successful, and most of the work still faces limitations in feature selection [9]. One famous technique is the N-gram technique (see Section 3.1 for a detailed overview). However, a major drawback of the N-gram technique is its difficulty in extracting features with long-term dependencies in the text document.
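As a minimal sketch of the N-gram technique discussed above (the tokens are our own illustrative example), the following shows both the extraction and why long-range dependencies are hard to capture with a small window:

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of width n over the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the movie was not good at all".split()
bigram_counts = Counter(ngrams(tokens, 2))
# ("not", "good") is captured by a bigram, but a dependency spanning many
# tokens is missed unless n is made impractically large -- the drawback
# noted above.
```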
Performance Diagnosis Test The performance of a diagnostic test in machine learning, as in Natural Language Processing, can in the case of a binary predictor be evaluated using measures of sensitivity and specificity [10]. Due to the nature of the activation functions (see Table 1) used in a deep learning architecture, we oftentimes obtain predictors on a continuous scale. This means we cannot directly measure the sensitivity, specificity, and accuracy rate, which can cause inconsistency between the tuning process on the validating set and the robustness of the test set performance. To tackle this problem, a range of cutoff points for the predictor is swept to construct confusion tables; the result is called a Receiver Operating Characteristic (ROC) curve [10]. One major problem of using the ROC curve to compute area-under-curve (AUC) values is that ROC treats sensitivity and specificity as equally important across all thresholds [11].
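The cutoff-sweeping construction above can be sketched using the equivalent rank (Mann-Whitney) formulation of AUC; the function name and data are illustrative only:

```python
def roc_auc(scores, labels):
    """AUC via the rank (Mann-Whitney) formulation, equivalent to sweeping
    every cutoff of the continuous score to trace the ROC curve: it equals
    the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating score gives AUC 1.0; a score that ranks every negative above every positive gives 0.0.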
Remark We start with a discussion of the field of Natural Language Processing. In this field, we face data sets of a sequential nature, for which RNNs are designed to replace conventional neural network architectures. A fundamental problem in this field is studying language semantics through sequential data in text classification tasks. Many methods and models have been proposed, yet the conclusions are questionable. This is because, in diagnostic tests of prediction performance, accuracy or AUC values are used as the benchmark to assess the robustness of a model. We show with both theoretical argument and simulation results that AUC exhibits major flaws in measuring how predictive a variable set or a model is at predicting the target variable (or response variable). This is a fatal flaw of AUC, because even with the correct features given, AUC performs extremely poorly if the estimation of the true model is incorrect. We propose a novel I-score that increases as the AUC value increases, but that is not subject to any attack from incorrect estimation of the true model. Accompanying the I-score, the "dagger technique" is proposed to further combine important features to form extremely powerful predictors. We regard this as the major innovation of our paper.
Contributions The proposed I-score exists in order to make good predictions. In supervised learning, suppose there are explanatory variables or features X and target variables Y. The goal is to build a model (or a machine) to learn and estimate the true relationship between X and Y. Depending on the type of explanatory variables, many different neural network approaches have been designed, such as the Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN). There is broad consensus on making the network deeper [13,14,15,16,17,18,19,20,2,21] and relying on the convolutional operation to extract features, yet the literature lacks exploration in truly understanding features by directly looking at how features impact the prediction performance. If a machine f̂(·) is trained, the prediction Ŷ of the target variable Y is a combination of f̂(·) and X. It is not clear why Ŷ makes good predictions. Is it because of f̂(·), or is it because of X? If it is because of X, how much impact does X have on Y? The proposal of the I-score is essentially to help us answer these questions cleanly, without f̂(·) clouding our judgement.
I-score is derived from the theoretical prediction rate of explanatory features based on the partition retention technique [22,23,24]. Based on the I-score, the Backward Dropping Algorithm (BDA) is also introduced to iteratively search for high-order interactions while omitting noisy and redundant variables (see the detailed introduction in 2.4 Backward Dropping Algorithm). This paper extends the design of the I-score and introduces a concept called the "dagger technique" (see the discussion in 2.2 Interaction-based Feature: Dagger Technique). I-score and BDA screen for important and predictive features directly, while the "dagger technique" constructs a new feature from the selected features using the local averages of the target variable Y in the training set. This powerful technique efficiently screens for important features and constructs them into modules before they are fed into any neural network for training.
A theoretical novelty in this paper is the technical reasoning provided to show that I-score is a function of AUC. While AUC can only be used at the end of a model, which means the impact of the features is clouded by the fitted model, I-score can be used anywhere in a neural network architecture. This paper shows a proposed design for implementing I-score in a type of neural network for text classification: the Recurrent Neural Network (RNN). Though this paper focuses on RNNs, similar designs can be carried out to implement I-score with other types of neural networks.
In practice, deep neural networks are generally considered "black box" techniques. This means that we typically feed in training data X and Y, and predictions are generated by the network without clearly illustrating how X affects the prediction. This inexplainability presents issues to end-users and prevents them from deploying a machine, sometimes well trained with high prediction performance, to a live application. The entire production chain may hit a roadblock simply because end-users do not have sophisticated tools to understand the performance. I-score, in practice, sheds light on this problem. With direct influence measured by I-score, the impact that X has on Y can be computed without any incorrect assumption about the underlying model. The "dagger technique" can combine a subset of explanatory variables, which has two benefits. First, "dagger technique" features can be directly used to make predictions. Second, "dagger technique" features also give end-users an explainable and interpretable description of the local average of the target variable within each partition. In other words, if an instance in the test set falls in a certain partition, we can directly read off the potential Y value for this instance from the "dagger technique". This practical novelty has not previously appeared in the literature.

Organization of Paper
The rest of the paper is organized as follows. Section 2 starts with the presentation of a novel Influence Measure (i.e. Influence score or I-score). This definition is introduced in 2.1 I-score, Confusion Table, and AUC. The definition, based on previous work [22,23,24], is derived from the lower bound of the predictivity, and we further discover the relationship between I-score and AUC. Next, we introduce an interaction-based feature engineering method called the "dagger technique" in 2.2 Interaction-based Feature: Dagger Technique. In addition, we present a greedy search algorithm called the Backward Dropping Algorithm (BDA) in 2.4 Backward Dropping Algorithm, which is an extension of previous work [22,23,24]. We provide a toy example in 2.5 Toy Example to demonstrate the application of I-score, showing with simulation that I-score reflects the true information of the features in ways that AUC cannot. Section 3 Application discusses basic language modeling and the procedure for implementing the proposed I-score and the "dagger technique". This section presents the basics of N-gram models and RNNs in 3.1 Language Modeling, an introduction to the dataset in 3.2 IMDB Dataset, and experimental results in 3.3 Result.

A Novel Influence Measure: I-score
This section introduces a novel statistical measure that assesses the predictivity of a variable set given the response variable (for the definition of predictivity, see [23] and [24]). This I-score is formally introduced in the following.
Suppose the response variable Y is binary (taking values 0 and 1) and all explanatory variables are discrete. Consider the partition P_k generated by a subset of k explanatory variables {X_b1, ..., X_bk}. Assume all variables in this subset are binary. Then there are 2^k partition elements; see the first paragraph of Section 3 in Chernoff et al. (2009) [22]. Let n_1(j) be the number of observations with Y = 1 in partition element j, and let n̂(j) = n_j × π_1 be the expected number of Y = 1 in element j under the null hypothesis that the subset of explanatory variables has no association with Y, where n_j is the total number of observations in element j and π_1 is the proportion of Y = 1 observations in the sample. In Lo and Zheng (2002) [25], the influence score is defined as

I = Σ_j (n_1(j) − n̂(j))², (1)

where the sum runs over all partition elements. The statistic I is the summation of squared deviations of the frequency of Y from what is expected under the null hypothesis. Two properties are associated with the statistic I. First, the measure I is non-parametric: there is no need to specify a model for the joint effect of {X_b1, ..., X_bk} on Y. The measure I is designed to capture the discrepancy between the conditional means of Y on {X_b1, ..., X_bk} regardless of the form of the conditional distribution. Second, under the null hypothesis that the subset has no influence on Y, the expectation of I remains non-increasing when variables are dropped from the subset. This second property makes I fundamentally different from Pearson's χ² statistic, whose expectation depends on the degrees of freedom and hence on the number of variables selected to define the partition. We can rewrite the statistic I in its general form, when Y is not necessarily discrete, as

I = Σ_j n_j² (Ȳ_j − Ȳ)², (2)

where Ȳ_j is the average of the Y-observations over the j-th partition element (the local average) and Ȳ is the global average.
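A minimal sketch of equation 1, assuming discrete variables stored column-wise in a NumPy array (the function name is ours, not from the paper):

```python
import numpy as np

def i_score(X, y):
    """I-score of a discrete variable set (equation 1): the sum over the
    joint partition elements j of (n1(j) - n(j) * pi1)^2, where n1(j)
    counts Y = 1 in element j and n(j) * pi1 is its null expectation."""
    X, y = np.asarray(X), np.asarray(y)
    pi1 = y.mean()                                 # global proportion of Y = 1
    # each distinct row of X is one partition element
    _, cell = np.unique(X, axis=0, return_inverse=True)
    score = 0.0
    for j in range(cell.max() + 1):
        mask = cell == j
        score += (y[mask].sum() - mask.sum() * pi1) ** 2
    return score
```

For a perfectly associated binary pair (X identical to Y) the score is positive; for an independent pair it is zero, matching the null-hypothesis construction above.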
Under the same null hypothesis, it is shown (Chernoff et al., 2009 [22]) that the normalized I, I/nσ² (where σ² is the variance of Y), is asymptotically distributed as a weighted sum of independent χ² random variables of one degree of freedom each, such that the total weight is less than one. It is precisely this property that serves as the theoretical foundation for the following algorithm.

I-score, Confusion Table, and AUC

From the confusion table (Table 2), common performance measures are defined as

Balanced Accuracy = (Sensitivity + Specificity)/2, (6)
F1 Score = 2 · True positive / (2 · True positive + False positive + False negative), (7)

while the counts of true positives, true negatives, false positives, and false negatives are presented in Table 2.

As Table 2 presents, these performance measures are well defined and suitable for the two-class classification problem. However, the output from the forward pass of a neural network is generated by a sigmoid activation. The sigmoid takes the form σ(x) = 1/(1 + exp(−x)), and this definition bounds the output of a sigmoid function to the range [0, 1]. Additional non-linear activation functions can be found in Table 1. Given a cutoff on the continuous output, the confusion counts α1, α2, α3, α4 (true positives, false negatives, false positives, and true negatives, respectively; see Table 2) yield

Sensitivity = α1 / (α1 + α2), (8)
Specificity = α4 / (α3 + α4), (9)

and the pair of statistics (Sensitivity, 1 − Specificity) gives one dot on the figure, which allows us to compute the area under the curve.

Figure 1: Mechanism between I-score Gain and AUC Gain. This figure presents the mechanism of how I-score can increase AUC. There are four plots. The top left plot is a ROC curve with one particular pair of (1 − Specificity, Sensitivity). The top right plot presents the sensitivity gain from I-score. The bottom left plot presents the specificity gain from I-score. Both sensitivity and specificity are driving forces of the AUC value because they move the dot up or to the left, which increases the area under the curve (the blue area). The bottom right plot presents the performance gain from both sensitivity and specificity. In summary, using the proposed I-score can increase AUC by selecting features that raise both sensitivity (from part (i) of I-score, see equation 14) and specificity (from part (ii) of I-score, see equation 14).

In order to compute the I-score, the partition is an important concept to understand. Since X1 ∈ {0, 1}, the two partition elements are simply X1 = 1 and X1 = 0. In the element X1 = 1, Y = 1 occurs with α1 observations and Y = 0 with α3 observations. In the element X1 = 0, Y = 1 has α2 observations and Y = 0 has α4 observations. With this information, the global average is Ȳ = (α1 + α2)/n, and next we can write out each term in the proposed I-score formula (here each term corresponds to one partition element).
The term for the first partition element, (α1 − (α1 + α3)(α1 + α2)/n)², can be rewritten by dividing top and bottom (inside the fraction) by (α1 + α2):

(α1 − (α1 + α3)(α1 + α2)/n)² = (α1/(α1 + α2) − (α1 + α3)/n)² / (1/(α1 + α2))² = (α1 + α2)² (Sensitivity − (α1 + α3)/n)², (12)

recalling that Sensitivity = α1/(α1 + α2); hence this term is a function of sensitivity. This means that if variables with high sensitivity are selected, the I-score will increase, which corresponds to an increase in AUC. Alternatively put, the nature of I-score dictates selecting highly predictive variables with high sensitivity, which directly moves the ROC towards the top left corner of the plot (see Figure 1), i.e. results in higher AUC values.
Next, the term for the second partition element can be written as

(α2 − (α1 + α2)(α2 + α4)/n)². (13)

Notice that when the predictor X1 is extremely informative, there can be very few false negatives (which is α2). In other words, α2 ≈ 0 when an extremely predictive variable is present. In this case, the value of the second term is completely determined by (α1 + α2)(α2 + α4)/n, which is largely dictated by the global average of the response variable Y, scaled up by a factor of (α2 + α4). A benefit of near-zero α2 is that specificity can also be high, which is another way to push AUC higher.
As a summary, the two terms that constitute the main body of the I-score can be written as

I = (α1 − (α1 + α3)(α1 + α2)/n)² + (α2 − (α1 + α2)(α2 + α4)/n)² =: part (i) + part (ii), (14)

which allows us to conclude: • First, part (i) is a function of sensitivity. More importantly, I-score serves as a lower bound of sensitivity.
The proposed statistic I-score is high when sensitivity is high, which means I-score can be used as a metric to select high-sensitivity variables. A nice benefit of this phenomenon is that high sensitivity is the most important driving force in raising AUC values. This relationship is presented in the top right plot of Figure 1. • Second, part (ii) is a function of α2, which approaches zero when the variable is highly predictive.
This leaves the second term largely determined by the global average of the response variable Y, scaled up in proportion to the number of observations that fall in the second partition element (X1 = 0), which is the sum α2 + α4. An interesting benefit of this phenomenon is that a near-zero α2 value, jointly with part (i), implies that the specificity is high, which is another important driving force in raising AUC values. In other words, when the predictor has all the information needed for good prediction performance, the value of α2 is expected to be approximately zero. In addition, the global mean of the true condition can be written as Ȳ = (α1 + α2)/n. Hence, part (ii) can be rewritten as (Ȳα4)², where α4 positively affects specificity, because specificity is α4/(α3 + α4). Thus, part (ii) is a function of specificity.
• Third, I-score is capable of measuring a variable set as a whole without making any assumption about the underlying model. However, AUC is defined between a response variable Y and a predictor Ŷ. If a variable set has more than one variable, some underlying assumption about the model needs to be made (we would need Ŷ := f(X1, X2, ...)) in order to compute an AUC value.
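A quick numeric check of the identities above, with hypothetical confusion-table counts α1, ..., α4:

```python
# Hypothetical confusion-table counts: a1 = TP, a2 = FN, a3 = FP, a4 = TN
a1, a2, a3, a4 = 40, 10, 15, 35
n = a1 + a2 + a3 + a4

# Part (i) of equation 14: the X1 = 1 partition element
part_i = (a1 - (a1 + a3) * (a1 + a2) / n) ** 2

# Equation 12: the same term rewritten as a function of sensitivity
sensitivity = a1 / (a1 + a2)
as_sensitivity = ((sensitivity - (a1 + a3) / n) * (a1 + a2)) ** 2
assert abs(part_i - as_sensitivity) < 1e-9

# Part (ii): when a2 is near zero, the term reduces to (Ybar * a4)^2
a2 = 0
m = a1 + a2 + a3 + a4                 # recompute the total with a2 = 0
ybar = (a1 + a2) / m                  # global mean of Y
part_ii = (a2 - (a1 + a2) * (a2 + a4) / m) ** 2
assert abs(part_ii - (ybar * a4) ** 2) < 1e-9
```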
Remark In the generalized situation, the confusion table in Table 2 may have more than two partitions for the predicted condition. In any scenario where the table has three or more partitions, they can be reduced to two partitions using a threshold. For example, instead of positive or negative, the predicted condition may take values in {1, 2, ..., K}. In this case, any value excluding 1 and K can be taken as the threshold to reduce the K levels to 2 levels. Suppose "2" is used as the threshold: the partitions greater than 2 form one partition, and the partitions less than or equal to 2 form the other. This reduces K levels to 2 levels, and the same proof as in equation 14 follows.

An Interaction-based Feature: Dagger Technique
The concept of an interaction-based feature was initially proposed in Lo and Yin (2021) [26]. In their work, the authors defined an interaction-based feature used to replace the filters in the design of Convolutional Neural Networks (CNNs). The conventional practice relies on pre-defined filters: small 2-by-2 or 3-by-3 windows designed to capture certain information based on prior knowledge. The art of using an interaction-based feature to create novel features within a 2-by-2 or 3-by-3 window of an image is to allow the data, rather than meaningless filters, to indicate the predictive information in the image. These new features are denoted X†, hence the name "dagger technique". The rest of this subsection formally defines this method of using partitions to define novel features.
A major benefit of the proposed I-score is the partition retention technique. This is a feature engineering technique that helps us preserve the information of a variable set and convert it into one feature. Since ROC AUC cannot be computed directly between a response variable and a candidate variable set, the common procedure is to fit a model first and then compute AUC. This is a very costly method for two reasons. First, fitting a regression or classification model can be very costly. Second, the model fitting procedure cannot guarantee the prediction results of the final predictor: if the AUC value is low, there is no way to distinguish whether the poor AUC result comes from the model fitting or from the variable selection. It is shown with simulation evidence that AUC can be extremely poor even when the correct features are present but the estimation of the model is incorrect (see subsection 2.5 for the simulation discussion and Table 4 for the simulation results). We consider this the major drawback of using ROC AUC.

Table 3: Interaction-based Engineering: the "Dagger Technique". This table summarizes the construction procedure of X† (the "dagger technique"). Suppose there is a variable set {X1, X2} and each variable takes values in {0, 1}. X† is constructed so that the values of this new feature are the local averages of the target variable Y on the partition retained from the variable set {X1, X2}. Here the variable set {X1, X2} produces 4 partition elements, so X† is defined according to the following table. In the test set, the target variable (or response variable) Y is not observed, so the training-set values are used. Hence, the reminder is that in generating the test-set X† we use the ȳj's from the training set.

X1 | X2 | X† (training set)
1 | 1 | ȳ1 (generated from training set)
1 | 0 | ȳ2 (generated from training set)
0 | 1 | ȳ3 (generated from training set)
0 | 0 | ȳ4 (generated from training set)
To tackle this problem, we propose to use partition retention. The new features are denoted X†, hence the name "dagger technique". We now introduce the technique as follows. Suppose there is a supervised learning problem with explanatory variables X and response variable Y, and suppose X forms a partition of size k. A novel non-parametric feature can be created using

X† := Ȳj for the partition element j ∈ {1, ..., k} that an observation falls in, (15)

where k is the size of the total partition formed by X and Ȳj is the local average of Y on element j in the training set. For example, suppose X1 ∈ {1, 0} and X2 ∈ {1, 0}.
Then the variable set {X1, X2} has 4 partition elements, computed as 2² = 4. In this case, the index j of a partition element takes values in {1, 2, 3, 4}. Then, based on this variable set {X1, X2}, a new feature X†_{X1,X2} can be created as a combination of X1 and X2 using partition retention. Hence, this new feature is defined as X†_{X1,X2} := Ȳj with j ∈ {1, 2, 3, 4}, as discussed above. The results of this example are summarized in tabular form (see Table 3).
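A sketch of the Table 3 construction with hypothetical training data; `fit_dagger` is our illustrative name, and unseen test cells fall back to the global training mean (an assumption on our part, since the paper does not specify this case):

```python
import numpy as np

def fit_dagger(X_train, y_train):
    """Learn the dagger feature of Table 3: map every joint cell of the
    variable set to the local average of y over that cell in training."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    cells, inv = np.unique(X_train, axis=0, return_inverse=True)
    local_avg = {tuple(c): y_train[inv == j].mean()
                 for j, c in enumerate(cells)}
    fallback = y_train.mean()          # for cells never seen in training
    def transform(X):
        return np.array([local_avg.get(tuple(r), fallback)
                         for r in np.asarray(X)])
    return transform

dagger = fit_dagger([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1]],
                    [1, 0, 0, 0, 1])
x_dagger = dagger([[1, 1], [0, 0]])    # test set reuses training averages
```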

Discretization
The partition retention technique requires the partition space of the subset of variables in the I-score formulation to be countable. In other words, the partition must be formed, and it can be defined by the subset X_b regardless of how many variables are selected in this group b. If each variable takes values in {0, 1}, then we have 2^k partition elements for this subset X_b. This is, however, not always guaranteed in practice. In some situations, there are variables with many unique values that can be considered continuous. To avoid the sparsity problem, we need to discretize the continuous variables. The algorithm is presented in Algorithm 1.
Suppose an explanatory variable Xj has l unique levels, with j ∈ {1, 2, ..., p}. Using order statistics, we can write out all the unique levels as Xj,(1), ..., Xj,(l). To discretize Xj from l levels into two levels, we choose a cutoff t and compare all values against this threshold, i.e. the binary output based on t is 1(Xj > t), which creates a new variable taking values only in {0, 1}. The threshold t can take any value among the unique levels, i.e. ∀t ∈ {Xj,(1), ..., Xj,(l)}, and at each t we compute the I-score using equation 2. The best threshold t* is the candidate that maximizes the I-score. The objective function can be stated as

t* = argmax_{t ∈ {Xj,(1), ..., Xj,(l)}} Σ_{st} n_{st}² (Ȳ_{st} − Ȳ)²,

where the partition index st depends on the threshold t and can take values in {1, 2}, because the partition is constructed from the indicator function 1(Xj > t) and can only have two elements.
Algorithm 1: Discretization. Procedure of Discretization for an Explanatory Variable
Define unique levels: Xj,(1), Xj,(2), ..., Xj,(l)
Initialize: set t* = 0 and I* = 0
for t in unique levels do
    Compute I(t), the I-score (equation 2) of the binary variable 1(Xj > t)
    if I(t) > I* then update I* ← I(t) and t* ← t
end
Conversion: use the indicator function to convert Xj into binary form according to threshold t*, i.e. 1(Xj > t*)
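Algorithm 1 can be sketched as follows (illustrative function names; the I-score here is the two-element special case of equation 1):

```python
import numpy as np

def i_score_two_cells(xb, y):
    """I-score (equation 1) for a binary variable: two partition elements."""
    xb, y = np.asarray(xb), np.asarray(y)
    pi1 = y.mean()
    return sum((y[xb == v].sum() - (xb == v).sum() * pi1) ** 2
               for v in (0, 1))

def discretize(x, y):
    """Algorithm 1 sketch: scan the unique levels of x, score the binary
    split 1(x > t) at each level, and keep the I-score-maximizing cutoff."""
    x = np.asarray(x)
    t_star, i_star = None, -1.0
    for t in np.unique(x):
        s = i_score_two_cells((x > t).astype(int), y)
        if s > i_star:
            t_star, i_star = t, s
    return (x > t_star).astype(int), t_star
```

On a toy variable whose large values align with Y = 1, the scan recovers the natural split point.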

Backward Dropping Algorithm
In many situations, a variable set carries noisy information that can damage prediction performance. In this case, we recommend the Backward Dropping Algorithm to omit the noisy variables before doing machine learning or applying the "dagger technique" (the feature engineering technique in equation 15).
The Backward Dropping Algorithm is a greedy algorithm that searches for the subset of variables maximizing the I-score through step-wise elimination of variables from an initial subset sampled in some way from the variable space. The steps of the algorithm are as follows. Consider a training set of n observations and p explanatory variables; the size p can be very large, and all explanatory variables are discrete. Select an initial subset of k explanatory variables S_b and compute its I-score. Tentatively drop each variable in S_b and recalculate the I-score with one variable less; then drop the variable whose removal gives the highest I-score. Call this new subset S_b, which has one variable less than before. Continue to the next round of dropping variables in S_b until only one variable is left. Keep the subset that yields the highest I-score over the entire process; refer to this subset as the return set R_b. This will be the most important and influential variable module from this initial subset. The above steps are summarized in Algorithm 2.
Algorithm 2: Backward Dropping Algorithm (BDA)
Initialize: sample an initial subset S_b of k explanatory variables; compute I(S_b)
while |S_b| > 1 do
    Drop variables: tentatively drop each variable in S_b and recalculate the I-score with one variable less; drop the one that gives the highest I-score and call the new subset S_b
    l = |S_b| (update l with the length of the current subset of variables)
end
Return: the subset R_b with the highest I-score observed during the process
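A sketch of Algorithm 2 on a small deterministic example where Y = X1 XOR X2 and two noise columns are present (function names are ours):

```python
import itertools
import numpy as np

def i_score(X, y):
    """I-score (equation 1) over the joint partition of the columns of X."""
    X, y = np.asarray(X), np.asarray(y)
    pi1 = y.mean()
    _, cell = np.unique(X, axis=0, return_inverse=True)
    return sum((y[cell == j].sum() - (cell == j).sum() * pi1) ** 2
               for j in range(cell.max() + 1))

def backward_dropping(X, y, initial):
    """Algorithm 2 sketch: tentatively drop each variable, keep the drop
    that yields the highest I-score, repeat down to one variable, and
    return the best-scoring subset seen along the way."""
    X = np.asarray(X)
    subset = list(initial)
    best_set, best_score = list(subset), i_score(X[:, subset], y)
    while len(subset) > 1:
        score, victim = max((i_score(X[:, [c for c in subset if c != v]], y), v)
                            for v in subset)
        subset.remove(victim)
        if score > best_score:
            best_set, best_score = list(subset), score
    return best_set, best_score

# All 16 combinations of 4 binary variables; y depends only on columns 0, 1
X = np.array(list(itertools.product([0, 1], repeat=4)))
y = X[:, 0] ^ X[:, 1]
```

Starting from all four columns, the search discards the two noise columns and returns the interacting pair.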

A Toy Example
Let us create the following simulation to further investigate the advantage the proposed I-score has over the conventional AUC measure. In this experiment, suppose there are X1, ..., X10 ~ iid Bernoulli(0.5). In other words, each of the 10 independent variables takes values only in {1, 0}. We create a sample of 2,000 observations. Suppose we define the model

Y = X1 + X2 (mod 2),

where "mod 2" refers to modulo 2, i.e. 1 + 1 = 2 ≡ 0. We compute the AUC and I-score values for all 10 variables. In addition, we also compute both measures for the following models (assuming we do not know the true form of the real model): (i) X1 + X2, (ii) X1 − X2, (iii) X1 · X2, (iv) X1/(X2 + ε). In model (iv), we add ε = 10^−5 to keep the division legal. To further illustrate the power of the I-score statistic, we introduce a new variable specifically constructed by taking advantage of partition retention, X† := ȳj with j ∈ Π_{X1,X2} (the novel dagger technique defined from the variable partition, widely used in the application; see equation 15). The simulation has 2,000 observations, and we make a 50-50 split. The first 1,000 observations are used to create the partitions, and the local averages of the target variable Y required in creating the X† feature are taken only from these first 1,000 observations. For the next 1,000 observations, we directly observe {X1, X2} and retain the partition element; for each element, we then go to the training set (the first 1,000 observations) and use the ȳj values created from the training set alone. We present the simulation results in Table 4.
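The toy experiment can be reproduced in miniature (the seed and helper name are ours; exact scores vary with the draw, but the joint score dwarfs any marginal one):

```python
import numpy as np

def i_score(X, y):
    """I-score (equation 1) over the joint partition of the columns of X."""
    X, y = np.asarray(X), np.asarray(y)
    pi1 = y.mean()
    _, cell = np.unique(X, axis=0, return_inverse=True)
    return sum((y[cell == j].sum() - (cell == j).sum() * pi1) ** 2
               for j in range(cell.max() + 1))

rng = np.random.default_rng(1)
n = 2000
X = rng.integers(0, 2, size=(n, 10))   # X1, ..., X10 ~ iid Bernoulli(0.5)
y = (X[:, 0] + X[:, 1]) % 2            # Y = X1 + X2 (mod 2)

marginal = i_score(X[:, [0]], y)       # X1 alone: no marginal signal
joint = i_score(X[:, [0, 1]], y)       # the pair carries all the signal
```

The marginal score stays near the null level while the joint score is enormous, mirroring the pattern reported in Table 4.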
The simulation results show that the proposed I-score is capable of selecting truly predictive variables while the AUC values may be misleading. The challenge of this toy dataset is that the variables have no marginal signal; in other words, the variables alone have no predictive power. We can see this from the average AUC and I-score values. Since the AUC value cannot be computed on multiple variables directly, assumptions about the underlying model must be made. We guess four models: (i) X1 + X2, (ii) X1 − X2, (iii) X1 · X2, (iv) X1/(X2 + ε); these models all have low AUC values even though the variables are correct. This implies that under a false assumption about the underlying model, AUC produces no reliable measure of how significant the variables are. However, the proposed statistic on the assumed models (i)-(iv) is much higher than on individual variables alone. This means the proposed I-score can detect important variables even under an incorrect assumption about the model formulation.
The "Guessed" section of Table 4 consists of the 4 assumed models containing the correct variables, X†, and the variable set {X1, X2}. In this section, we observe that the AUC values are high only for X† and rather poor for the rest of the assumed models. In addition, we cannot compute AUC for the variable set {X1, X2} directly; alternatively, we can use the I-score. The I-score values in this section are drastically different from those in the rest of the table. The I-scores for models (i) and (ii) are both above 700. The I-score for X† is exactly the same as the theoretical value, and so is the AUC for X†. This means the X† technique, by taking advantage of the partition of the variable set, successfully captures all the information needed to make the best prediction. This has not previously appeared in the literature.
As a summary for this simulation, we can conclude the following.
• In the scenario where the dataset has many noisy variables and no observed variable has any marginal signal, the common-practice AUC value will miss the information, because AUC still relies on marginal signal. In addition, AUC is defined through the response variable Y and its predictor Ŷ, which requires us to make an assumption about the underlying model formulation. This is problematic because mistakes carried over from that assumption can largely affect the resulting AUC. This challenge is not a roadblock for the proposed I-score statistic at all: in the same scenario with no marginal signal, as long as the important variables are included in the selection, the I-score signals their high predictive power regardless of whether the correct form of the underlying model can be found.
• The proposed I-score is defined using the partition of a variable set. This variable set can contain multiple variables, and the computation of the I-score does not require any assumption about the underlying model. This means the proposed I-score is not subject to the mistakes carried over from assuming or searching for the true model. Hence, the I-score is a non-parametric measure.
• The construction of the I-score can also be used to create a new variable based on the partition of any variable set. We call this new variable X†, hence the name "dagger technique". It is a feature engineering technique that combines a variable set into a single new variable carrying all the predictive power the entire variable set can provide. This is a very powerful technique due to its high flexibility. In addition, it can be constructed from the variables with high I-score values after the Backward Dropping Algorithm.
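The partition-and-lookup scheme above can be sketched in a few lines. This is a minimal illustration of the simulation and the X† construction, assuming the scheme described in the text (cell averages Ȳ_j estimated on the first half, looked up on the second half); all function names are ours.

```python
import random

def dagger_feature(train_x, train_y, test_x):
    """Sketch of the "dagger technique": X† := Ybar_j, the local mean of Y
    over each partition cell j formed by the chosen variable set. Cell means
    are estimated on the training half only, then looked up for held-out
    rows that fall in the same cell."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        key = tuple(x)
        sums[key] = sums.get(key, 0) + y
        counts[key] = counts.get(key, 0) + 1
    grand_mean = sum(train_y) / len(train_y)
    cell_mean = {k: sums[k] / counts[k] for k in sums}
    # Unseen cells fall back to the grand mean of Y on the training half.
    return [cell_mean.get(tuple(x), grand_mean) for x in test_x]

random.seed(0)
n = 2000
X = [[random.randint(0, 1) for _ in range(10)] for _ in range(n)]
Y = [(row[0] + row[1]) % 2 for row in X]          # Y = X_1 + X_2 (mod 2)

train_x = [row[:2] for row in X[:1000]]           # partition on {X_1, X_2}
test_x = [row[:2] for row in X[1000:]]
x_dagger = dagger_feature(train_x, Y[:1000], test_x)

# Each cell is pure under the true model, so X† recovers Y exactly
# on the held-out half.
accuracy = sum((xd > 0.5) == y for xd, y in zip(x_dagger, Y[1000:])) / 1000
print(accuracy)  # → 1.0
```

Because the true model is a deterministic function of the partition cells, every cell mean is exactly 0 or 1, which is why X† achieves perfect held-out accuracy regardless of the seed.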

Why I-score?
Why is the I-score the best candidate? The design of the proposed I-score has the following three benefits. First, the I-score is a non-parametric measure: it does not require any model fitting procedure. Second, the behavior of the I-score parallels that of AUC. If a predictor has a high I-score, this predictor must have a high AUC value. However, AUC cannot be directly defined for a set of variables; in that case, an estimate of the true model must be used in order to compute AUC, which means any error from an incorrect estimate of the true model will lower the AUC value. The I-score is not subject to this attack. Third, the proposed I-score can be developed into an interaction-based feature engineering method called the "dagger technique". This technique can recover all useful information from a subset of variables. Next, we explain each of the three points above.

Table 4: Simulation Results. This table presents the simulation results for the model Y = X_1 + X_2 (mod 2). In this simulation, we create a toy dataset with just 10 variables (all drawn from a Bernoulli distribution with probability 1/2). We define the true model using only the first two variables; the remaining variables are noise. The task is to present the AUC and I-score values. The experiment is repeated 30 times and we present the average and standard deviation (SD) of the AUC and I-score values. We can see that no variable alone contributes marginal information, because each variable by itself has a low AUC value and an I-score below 1 (an I-score below 1 indicates almost no predictive power). We use the guessed models (i) to (iv), which are composed of the true variables (here we assume that we know {X_1, X_2} is important but do not know the true form). We assign ε = 0.0001 to keep the division in model (iv) legal in case X_2 = 0. Last, we present the true model as a benchmark.
Note: the "NA" entry means that AUC cannot be computed, i.e. not applicable or NA. The measure of AUC values has a major drawback: it cannot successfully detect the useful information.
Even with the correct variables selected (all guessed models use only the important variables {X_1, X_2}), the AUC measure is subject to serious attack from incorrect model assumptions. This flaw renders applications that use the AUC measure to select models sub-optimal. However, the proposed I-score is capable of indicating the most important variables, X_1 and X_2, regardless of the form of the underlying model. Moreover, the dagger technique of building X† using the partitions generated by the variable set {X_1, X_2} completely recovers the full information of the true model even before any machine learning or model selection procedure, which is a novel contribution that the literature has not yet seen.
Non-parametric Nature. The proposed I-score (equation 2) does not rely on or make any assumptions about the model fitting procedure. Model fitting refers to the step of searching for a model that estimates the true relationship between features and the target variable. Suppose we have a subset of the features X ∈ R^d and the corresponding target variable Y. The relationship between X and Y (denote this relationship f(·)) is often not observable. We would have to assume an underlying model f̂(·) in order to construct a predictor Ŷ := f̂(X). Only then can we measure the difference between our estimate and the true target variable, by using a loss function L(Y, Ŷ) to estimate the error the predictor Ŷ makes at predicting the target variable Y. Regardless of the numerical magnitude of the loss, we remain uncertain about the source of the mistakes, because the predictor Ŷ is a composition of both the estimate f̂(·) of the true model and the selected features X. The literature has not yet discovered any procedure to directly measure the influence X has on the target variable Y without an estimate f̂(·) of the true model. Hence, the error, whether measured by the loss L(Y, Ŷ) or by ROC AUC values, is confounded, and we do not know clearly how the features X influence the target variable Y, if they have any influence at all.
The proposed I-score can directly measure the influence the features X have on the target variable Y without using any estimate of the true model. In other words, the formula of the I-score (equation 2) has no component of f̂(·), and there is no need to construct f̂(·) in order to compute the I-score. This clarifies our judgement in feature selection and allows end-users to immediately see not only the impact of the features X but also how important they are at predicting the target variable Y.
From the toy example (see the simulation results in Table 4 in subsection 2.5), we can see that an incorrect estimate of the true model produces extremely poor AUC values (models (i)-(iv) in Table 4 have AUC values of approximately 50%, equivalent to random guessing) even though the important and significant features are present in the variable set. The I-score is not subject to this attack of incorrect model estimation: the I-scores of the same incorrect models are extremely high as long as the important features {X_1, X_2} are included. Consequently, we regard this as the most important contribution and the reason why the I-score and the "dagger technique" are the best tools.
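The non-parametric nature of the I-score can be made concrete with a short sketch. Equation 2 is not reproduced in this section, so the code below assumes the common partition-based form I = (1/n) Σ_j n_j² (Ȳ_j − Ȳ)² from the I-score literature, up to normalization; the function name and the demo data are ours.

```python
import random

def i_score(rows, y):
    """Assumed partition-based influence score: for the partition cells j
    induced by the variable set, I = (1/n) * sum_j n_j^2 (Ybar_j - Ybar)^2.
    No model fit of any kind is required."""
    n = len(y)
    ybar = sum(y) / n
    cells = {}
    for x, yy in zip(rows, y):
        cells.setdefault(tuple(x), []).append(yy)
    return sum(len(v) ** 2 * (sum(v) / len(v) - ybar) ** 2
               for v in cells.values()) / n

random.seed(1)
n = 2000
X = [[random.randint(0, 1) for _ in range(10)] for _ in range(n)]
Y = [(r[0] + r[1]) % 2 for r in X]                 # Y = X_1 + X_2 (mod 2)

signal = i_score([r[:2] for r in X], Y)            # variable set {X_1, X_2}
noise = i_score([r[2:4] for r in X], Y)            # noisy set {X_3, X_4}
marginal = i_score([r[:1] for r in X], Y)          # X_1 alone: no marginal signal
print(signal, noise, marginal)
```

The score on {X_1, X_2} is orders of magnitude larger than on a noisy pair or on X_1 alone, even though no model form was ever assumed — mirroring the Table 4 contrast between the I-score and AUC.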
High I-score Produces High AUC Values. We provide in subsection 2.1 a technical analysis of why a high I-score induces high AUC values. In the anatomy of the I-score formula in subsection 2.1, we can see that the first partition (the partition where Ŷ is positive) is directly related to sensitivity. In addition, we reward this outcome with a factor of n_1², where n_1 is the number of observations that fall in this partition, i.e. where Ŷ is positive. This construction raises the I-score significantly and proportionately to the number of true positive cases, which is a major component causing the AUC value to rise. Hence, the benefit of using the I-score is that it allows statisticians and computer scientists to immediately identify the features that can raise AUC values. This parallel behavior allows easy interpretation of I-score values for its end-users.
Researchers have pointed out that ROC treats sensitivity and specificity equally [11], which can cause a lack of interpretability in areas of practice where the diagnostic focuses on true positives. Poor sensitivity could mean that the experiment is missing positive cases. In other words, AUC can consider a test that increases sensitivity at low specificity superior to one that increases sensitivity at high specificity. It is recommended that better tests should reward sensitivity so that positive cases are not missed [27]. The construction of the I-score directly rewards sensitivity proportionately to the number of observations correctly classified as positive, which allows the I-score to grow rapidly as the sample size of the training set increases. This allows the I-score not only to behave in parallel with AUC values but also to reflect sensitivity values proportionately to the sample size.
I-score and the "Dagger Technique" The "dagger technique" is the most innovative method ever introduced based on the literature in developing I-score [25] [22] [23] [24]. The proposed "dagger technique" organically combines the raw information from each feature by using partition retention [22] which allows the final combined feature to enrich the prediction performance free of incorrect estimation of the true model. This nature of the "dagger technique" is reflected in Table 4. The guessed models are all formulated using the correct variable sets: {X 1 , X 2 }. However, model (i) -(iv) produced shocking 50% AUC values because AUC subjects to the attack of incorrect model estimation. This is a major drawback of using AUC to conduct model selection. However, the "dagger technique", as long as the variable set includes the important variables, produce an outstanding 100% AUC with no variation. This means the "dagger technique" is capable of reconstructing the original information and enrich the predictor. We can see that the remaining methods all have some standard deviation when we simulate experiment with a different seed. However, the setting of different seed has no impact on the proposed "dagger technique" which means any time we use the "dagger technique" we are able to recover the full information of the target variable Y with the correct variable set {X 1 , X 2 }. This fact justifies the remarkable value of the "dagger technique" towards the literature. The only other place that this is true in Table 4 is the model. Thus, the "dagger technique" has the capability to recover the full information in target variable if the features are included correctly and it does not subject to any attacks of false specification or estimation of the true model. We strongly recommend practitioners to use the proposed I-score and the "dagger technique" in the future whenever possible.

Language Modeling
N-gram. In Natural Language Processing, predicting the probability of a word or classifying the label of a paragraph is a fundamental goal in building language models. The most basic language models are the "N-gram" models.
In "N-gram" modeling, a single word such as "machine" as a feature is a uni-gram (n = 1). The 2-word phrase "machine learning" is a bi-gram (n = 2). The 3-word phrase "natural language processing" is a tri-gram (n = 3). Any text document can be processed to construct N-gram features using this technique. For example, we may observe a sentence and it states "I love this movie". If the goal is to classify the sentence into one of the two classes: {positive, negative}, a common practice is to estimate the probability of the label Y belongs to a certain class based on the previous words in the same sentence: P(Y = 1|i), uni-gram P(Y = 1|i love), bi-gram P(Y = 1|i love this), tri-gram P(Y = 1|i love this movie), 4-gram (18) Joulin et. al. (2016) [28] summarized a list of NLP algorithms to produce a range of performance from 80% to 90% range. They also show evidence that N-gram models can increase the accuracy slightly more by using more N-grams. However, the aggregate N-gram models are not yet discovered. In addition, continuing adding N-gram features would result in overfitting which leads to more and more complex architecture of RNN such as Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), and bi-directional RNN. Thus far, there has not been any dimensionality reduction technique implemented within a RNN in the literature.

Recurrent Neural Network
We introduced N-gram models above, where the conditional probability of the label of a sentence y depends only on the previous n − 1 words. Let us briefly discuss the basic RNN that we will use in the application. The diagram for the basic RNN is presented in Figure 2. Suppose we have input features X_1, X_2, .... These features are processed directly from the text document; they can be processed word indices or embedded word vectors. The features are fed into the hidden layer, where the neurons (or units) are denoted h_1, h_2, .... A weight W connects the previous neuron with the current neuron, and each current neuron receives a contribution from the current feature through a weight parameter U. For any t in {1, 2, ..., T}, we compute each hidden neuron by

h_t = g(W · h_{t−1} + U · X_t + b), (19)

where W and U are trainable parameters, b is the bias term, and g(·) is an activation function. The choice of activation function is determined entirely by the dataset and the end-user; a list of well-known activation functions can be found in Table 1. At the end of the architecture, we compute the predicted probability of Y given the hidden neurons by

Ŷ = σ(V · h_T + c), (20)

where σ(·) is the sigmoid function, c is the output bias, and the weights W, U, and V are shared across the entire architecture. Forward propagation in an RNN refers to the procedure in which information flows from the input features to the output predictor using equation 19 and equation 20.
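The many-to-one forward pass of equations 19 and 20 can be sketched with scalar weights. This is an illustrative simplification we introduce here: in practice W, U, and V are matrices and g(·) is chosen from Table 1; we fix g = tanh.

```python
import math

def rnn_forward(xs, W, U, V, b, c):
    """Minimal many-to-one RNN forward pass with scalar weights.
    h_t = tanh(W*h_{t-1} + U*x_t + b), then Yhat = sigmoid(V*h_T + c)."""
    h = 0.0
    for x in xs:
        h = math.tanh(W * h + U * x + b)          # equation 19 with g = tanh
    return 1.0 / (1.0 + math.exp(-(V * h + c)))   # equation 20

p = rnn_forward([1.0, 0.0, 1.0], W=0.5, U=1.0, V=2.0, b=0.0, c=0.0)
print(p)  # a probability in (0, 1)
```

The same weights W, U, and V are reused at every step t, which is what "shareable across the architecture" means.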
Backward Propagation Using Gradient Descent. Forward propagation passes information from the input layer (features extracted from the text document) to the output layer (the predictor). To search for the optimal weights, we compute a loss function and optimize it by updating the weights. Since the task is text classification, we only have one output in the output layer. In addition, the task is a two-class classification problem because Y can only take values in {0, 1}. This means we can define the loss function using the cross-entropy function,

L(y, ŷ) = −(1/n) Σ_{i=1}^{n} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)],

where y_i is the true label for instance i and ŷ_i is the predicted value for instance i. Notice that in the proposed architecture we use the I-score and the "dagger technique" to construct X† for the input features of length T′, where T′ is smaller than T and is determined by tuning. Next, we can use gradient descent (GD) to update the parameters. At each step s, we update the weights by

θ_{s+1} = θ_s − η · ∇L(θ_s),

where η is the learning rate (usually a very small number), the symbol ∇ means gradient, and ∇L(·) is the gradient (partial derivative) of the loss function L(·). We can formally write the gradients as ∇_W L, ∇_U L, and ∇_V L, where ∇_parameter means the gradient (partial derivative) with respect to that parameter. The output prediction is Ŷ. Since this is a text classification problem, the architecture has many inputs and one output, hence the name "many-to-one". The architecture has parameters {U, V, W}, and these weights (or parameters) are shared throughout the architecture.
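The loss and update rule above can be sketched directly. This is a minimal illustration with our own variable names; the actual gradients of an RNN are computed by backpropagation through time, which is omitted here.

```python
import math

def cross_entropy(y_true, y_pred):
    """Binary cross-entropy, as used for the two-class task."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

def gd_step(theta, grad, eta=0.01):
    """One gradient-descent update: theta_{s+1} = theta_s - eta * grad."""
    return [t - eta * g for t, g in zip(theta, grad)]

loss = cross_entropy([1, 0], [0.9, 0.1])
print(loss)  # ≈ 0.105, i.e. -log(0.9)
```

A confident, correct prediction (0.9 for a positive, 0.1 for a negative) yields a small loss; the GD step then nudges each parameter against its gradient by the learning rate η.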
Implementation with I-score. In this section we introduce the proposed algorithm, which is built upon the novel statistical measure I-score. The executive diagram for "N-gram" models using the I-score as a feature selection and engineering technique is presented in Figure 3. Figure 3 presents the two main pipelines of this paper, which implement the proposed I-score in a basic RNN structure. Panel A presents the first, simple framework combining "N-grams" with the I-score. The "N-gram" procedure processes text data into averaged numerical data. The conventional setup is presented in the left plot of Figure 2: the text document is processed using text vectorization and then embedded into matrix form, after which we can fit a basic feed-forward ANN or a sequential RNN. Panel B presents an alternative input using the "dagger technique". We introduce them as follows.
The proposed I-score can be used in the following ways. • First, we can compute the I-score for each RNN unit. For example, in Panel A of Figure 3, we can first compute the I-score on the text vectorization layer and then on the embedding layer. With the distribution of I-score values provided by the feature matrix, we can use a particular threshold as the cutoff to screen for the important features to feed into the RNN architecture. We denote this action by Γ(·), defined as Γ(X) := X · 1(I(Y, X) > threshold). For the input layer, each feature X_t can be released or omitted according to its I-score value; that is, we use Γ(X_t) := X_t · 1(I(Y, X_t) > threshold) to determine whether the input feature X_t is predictive and important enough to be fed into the RNN architecture. For the hidden layer, each hidden neuron h_t can be released or omitted according to its I-score value; in other words, we can use Γ(h_t) := h_t · 1(I(Y, h_t) > threshold) to determine whether the hidden neuron h_t is important enough to be kept in the RNN architecture. If at some t the input feature X_t fails to meet the I-score threshold (so that Γ(X_t) = 0), then this feature is not fed into the architecture, and the corresponding unit is computed, using equation 19, as h_t = g(W · h_{t−1} + U · 0 + b). The same holds for any hidden neuron: if the previous hidden neuron h_{t−1} of some neuron h_t fails to meet the I-score criterion, then h_t is computed as h_t = g(W · 0 + U · X_t + b). Hence, the Γ(·) function acts as a gate that allows the information of a neuron to pass through according to an I-score threshold. If Γ(X_t) is zero, the input feature is not important at all and can be omitted by replacing it with a zero value; in other words, it is as if this feature never existed, and in this case there is no need to construct Γ(h_t).
We show later in section 3 that important long-term dependencies associated with language semantics can be detected using this Γ(·) function, because the I-score has the power to omit noisy and redundant features in the RNN architecture. Since the I-score is computed across the entire length T, long-term dependencies between features that are far apart can be captured through high I-score values. • Second, we can use the "dagger technique" to engineer and craft novel features using equation 15. We can then calculate the I-score on these dagger feature values to see how important they are. We can directly use a 2-gram model, and the I-score is capable of indicating which 2-gram phrases are important; these phrases act as two-way interactions. According to the I-score, we can then determine whether we want all the words in the 2-gram models, 3-gram models, or even higher-level N-gram models. When n is large, we recommend using the proposed Backward Dropping Algorithm to reduce the dimension within the N-word phrase before creating a new feature with the proposed "dagger technique". For example, suppose we use a 2-gram model. A sentence such as "I love my cats" can be processed into (I, love), (love, my), (my, cats), where each feature set has two words. We can denote the original sentence "I love my cats" by the 4 features {X_1, X_2, X_3, X_4}. The "dagger technique" suggests that we can use equation 15 with the 2-gram models fed in as inputs. In other words, we can take {X_1, X_2} and construct Ȳ_j, where j is the running index tracking the partitions formed using {X_1, X_2}.
If we discretize X_1 and X_2 (see subsection 2.3 for a detailed discussion of discretization using the I-score) and they both take values in {0, 1}, then there are 2² = 4 partitions and j can take values in {1, 2, 3, 4}. In this case, the novel feature X†_1 can take on 4 values; an example can be seen in Table 3. The combination of the I-score, the Backward Dropping Algorithm, and the "dagger technique" allows us to prune the useful and predictive information in a feature set so that we can achieve maximum prediction power with as few features as possible.
• Third, we can concatenate many N-gram models with different values of n. For example, we can carry out N-gram modeling with n = 2, n = 3, and n = 4. This way we capture more combinations of higher-order interactions. To avoid overfitting, we can use the I-score to select the important interactions and then use the selected phrases (which can be two-word, three-word, or four-word) to build RNN models.
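The Γ(·) gate described in the first bullet above can be sketched as follows. The I-score values here are hypothetical placeholders; in the actual pipeline they would be computed from the data as described in the text.

```python
def gamma_gate(features, scores, threshold):
    """Sketch of the Γ(·) gate: zero out any feature whose I-score falls
    below the threshold, so it enters the RNN as if it never existed.
    Γ(X_t) := X_t · 1(I(Y, X_t) > threshold)."""
    return [x if s > threshold else 0.0 for x, s in zip(features, scores)]

# Hypothetical I-score values for four input features.
gated = gamma_gate([0.7, -1.2, 0.3, 2.0], [5.1, 0.2, 8.4, 0.9], threshold=1.0)
print(gated)  # → [0.7, 0.0, 0.3, 0.0]
```

A gated-out position contributes U · 0 to the next hidden unit, matching the h_t = g(W · h_{t−1} + U · 0 + b) case in the text.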

IMDB Dataset
In this application, we use the IMDB Movie Database, which consists of 25,000 paragraphs each in the training and testing sets, for a total of 50,000 paragraphs. The dataset has equal portions of the two classes, and each paragraph carries a dichotomous label: Y = 1 if the movie review is positive and Y = 0 if the movie review is negative. The goal is to read in a paragraph and predict whether the tone of this movie review is positive or negative. We present sample data in Table 5; the first sample is a positive review and the second is a negative review.
In Table 5, we present evidence that the I-score can detect important words that are highly impactful for predicting a positive or negative tone. In Table 6, we present the semantics of the same two samples that the I-score narrows down. The original RNN uses the processed word vectors to make predictions. For example, scholars have used N-gram models to extract features from text documents [28]. Other common feature extraction techniques are Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency (TF) [29], Word2Vec [30], and Global Vectors for Word Representation (GloVe) [31]. However, the I-score is able to shrink the word counts, with the reduced dimension serving as a tuning parameter. In this experiment, we reduce the number of words from 400 to 100 or even 30 while retaining the same semantics. This dimension reduction technique is completely novel and can go beyond human intuition in language semantics problems in NLP.
The training and validating process is summarized in Figure 4, where we show the training and validating paths generated from the experiment before and after using the proposed I-score statistic. The first plot of Figure 4 presents the training and validating paths for the original bi-gram data (here processed into 100 features). The second plot of Figure 4 shows the same data discretized using the I-score. Figure 5 presents the learning paths before and after dimensionality reduction using the proposed I-score. To demonstrate the potential of the I-score, we first use the features extracted from the embedding layer, which generates 400 features (a tuning result). We can feed these 400 features into a feed-forward ANN or a sequential RNN.

Table 5: Sample Data and Selected Semantics. This table presents two samples. The first column contains paragraphs taken directly from the IMDB movie database; the second column is the corresponding label. The proposed I-score selects words that have a significant association with the semantics of the sentence, because the I-score selects features that are highly predictive of the target variable. In this application, the target variable carries the tones and preferences of the reviewer who provides the critique. The semantics in the critiques reflect the reviewers' tones and preferences, which is why the I-score is able to detect these features using the preferences provided in the label.
Table 5 is laid out with four columns: No., Sample, I-score Features (uni-gram; 2-gram, 3-gram), and Label.

Sample 1 (Label = 1): "<UNK> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and ..."
I-score features: uni-gram {congratulations, lovely, true}; 2-gram, 3-gram {amazing actor, really suited}.

Sample 2 (Label = 0): "<UNK> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written ..."
I-score features: uni-gram {bad, ridiculous}; 2-gram, 3-gram {bad music, terribly written}, {damn terribly written}.

Table 6: Interpreted Semantics Using I-score. This table presents two samples. The first column contains paragraphs taken directly from the IMDB movie database. The second column presents features selected by the I-score at different thresholds (we use the top 7.5% and top 25% as examples). The last column presents the corresponding label. The semantics of the selected features are a subset of words from the original sample; we observe that the I-score can select a subset of words while maintaining the same semantics.
Table 6 is laid out with four columns: No., Sample (original paragraph), I-score Features (at the top 7.5% and top 25% thresholds), and Label.

Sample 1 (Label = 1; original 400 words, top 7.5% selects 31 words, top 25% selects 101 words): "<UNK> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of ..."
Top 7.5% I-score features: {congratulations often the play them all a are and should have done you think the lovely because it was true and someone's life after all that was shared with us all}
Top 25% I-score features: {for it really at the so sad you what they at a must good this was also congratulations to two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all}

Sample 2 (Label = 0; original 400 words, top 7.5% selects 31 words, top 25% selects 101 words): "<UNK> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal measures the hair is big lots of boobs bounce men wear those cut tee shirts that show off their <UNK> sickening that men actually wore them and the music is just <UNK> trash that plays over and over again in almost every scene there is trashy music boobs and ..."
Top 7.5% I-score features: {those <UNK> every is trashy music away all aside this whose only is to look that was the 80's and have good old laugh at how bad everything was back then}
Top 25% I-score features: {script best worked who the just so terribly the clothes in equal hair lots boobs men wear those cut shirts that show off their <UNK> sickening that men actually wore them and the music is just <UNK> trash that over and over again in almost every scene there is trashy music boobs and <UNK> taking away bodies and the gym still doesn't close for <UNK> all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then}

Figure 4: Learning Paths Before and After Discretization. This figure presents the training procedure. All graphs present the training and validating paths. The first graph is from the original bi-gram data. The second uses the bi-gram data discretized by the I-score. The third uses the top 18 variables according to I-score values. The proposed method can significantly improve computational efficiency.

Figure 5: Learning Paths Before and After Text Reduction Using I-score. This figure presents the training procedure. All graphs present the training and validating paths. The first graph is from the original bi-gram data. The second uses the bi-gram data discretized by the I-score. The third uses the top 18 variables according to I-score values. The proposed method can significantly improve computational efficiency.
The test set performance is 94%, measured using AUC values. We can compute the marginal I-score (i.e. the I-score of each variable used as a predictor independently). Among these 400 features, we can rank them by I-score value and pick the top 30 features. Feeding these 30 features into a feed-forward ANN or a sequential RNN, we already achieve 87% on the test set. To further improve the learning performance, we can relax the I-score threshold so that we include more of the top influential features: we use the top 30, 100, and 145, respectively. We plot the learning paths in Figure 5. The first graph "1" is the learning path for the original 400 features. We can see that the training and validating set error merely breached 0.2 in 30 epochs. However, when we use the top 100 features, graph "3" shows that we achieve near convergence in approximately 10 epochs. This increased convergence speed is largely due to the I-score's ability to erase noisy and redundant features from the input layer. In addition, the I-score delivers this efficient learning performance with only 25% of the original number of features, which is something the literature has not yet seen. We regard this as another major benefit of using the proposed technique in training neural networks.

Result
We show in Table 7 the experiment results for the text classification task on the IMDB Movie Dataset. We start with bi-gram models. Using the I-score to select the important bi-gram features produces 96.5% AUC on the test set, while the bi-gram model without the I-score produces 92.2%. We also used combinations of different N-gram models with N in {2, 3, 4}, i.e. 2-gram, 3-gram, and 4-gram, respectively. We can concatenate the N-gram features and then feed them directly into a feed-forward neural network. First, we concatenate the 2-gram and 3-gram models together and then use the I-score to reduce the dimension. We recommend using the top 5% I-score threshold to screen for the important semantics among all the 2-gram and 3-gram features; this corresponds to approximately 40 of the 800 concatenated 2-gram and 3-gram features. We observe that the I-score raises the prediction performance to 97.7%, while the combination of 2-gram and 3-gram models without the I-score only produces 91.1%, which is a 74% error reduction.

Conclusion
This paper proposes a novel I-score to detect and search for the important language semantics in text documents that are useful for making good predictions in text classification tasks.
Theoretical Contribution. We provide theoretical and mathematical reasoning for why the I-score can be regarded as a function of AUC. The construction of the I-score can be analyzed through partitions. A rearrangement of the I-score formula shows that sensitivity plays a major role, and this role provides the fundamental driving force that raises AUC values when the variables selected to compute the I-score are important and significant. Beyond this theoretical parallel with AUC, the I-score can be used anywhere in a neural network architecture, giving end-users the flexibility to deploy the computation where needed, a property AUC does not share. AUC is also vulnerable to incorrect model specification, whereas any estimate of the true model, accurate or not, is harmless to the I-score owing to its non-parametric nature, a property for feature selection that the literature has not yet seen.
Backward Dropping Algorithm. We also propose a greedy search algorithm, the Backward Dropping Algorithm, that handles long-term dependencies in the dataset. Under the curse of dimensionality, the Backward Dropping Algorithm efficiently screens out noisy and redundant information. Its design also exploits a key property of the I-score: the score increases when the variable set contains fewer noisy features and decreases when noisy features are included.
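A minimal sketch of such a backward search is given below, assuming a joint partition-based I-score over a discrete feature subset. The data, variable names, and stopping rule here are illustrative; the paper's actual algorithm may differ in details.

```python
import numpy as np

def i_score(X_sub, y):
    # Joint I-score of a feature subset: partition samples by the joint
    # levels of the columns, then (1/n) * sum_j n_j^2 * (ybar_j - ybar)^2.
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    _, cell = np.unique(np.asarray(X_sub), axis=0, return_inverse=True)
    cell = cell.ravel()
    return sum((cell == j).sum() ** 2 * (y[cell == j].mean() - ybar) ** 2
               for j in np.unique(cell)) / n

def backward_dropping(X, y, start_vars):
    """Greedy sketch of the Backward Dropping Algorithm: from an initial
    subset, repeatedly drop the single variable whose removal raises the
    I-score the most; stop once every removal would lower the score."""
    current = list(start_vars)
    best = i_score(X[:, current], y)
    while len(current) > 1:
        trials = [(i_score(X[:, [v for v in current if v != d]], y), d)
                  for d in current]
        top, drop = max(trials)
        if top <= best:
            break  # noisy features are gone; dropping more only hurts
        best, current = top, [v for v in current if v != drop]
    return current, best

# XOR signal in columns 0 and 1; columns 2-4 are pure noise.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 5))
y = X[:, 0] ^ X[:, 1]
kept, score = backward_dropping(X, y, start_vars=[0, 1, 2, 3, 4])
```

Because the signal here is an interaction (XOR), neither informative column helps marginally, yet the backward search retains exactly the pair: dropping a noise column raises the joint I-score, while dropping either signal column collapses it.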
Dagger Technique. We propose a novel engineering technique, the "dagger technique", which combines a set of features through partition retention to form a new feature that fully preserves the relationship between the explanatory variables and the response variable. This proposed "dagger technique" can combine words and phrases with long-term dependencies into one new feature that carries long-term memory. It can also be used to construct features in many other types of deep neural networks, such as Convolutional Neural Networks (CNNs). Although we present empirical evidence on a sequential-data application, the "dagger technique" can be applied beyond image and sequential data.
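One way such a partition-retention combination can be sketched is below: each sample is mapped to the training-set mean of the response inside its partition cell, folding the joint relationship of several features with the response into a single column. This is a hedged illustration (the XOR data and function name are ours), not the paper's exact construction.

```python
import numpy as np

def dagger_feature(X_sub, y):
    """Combine the columns of X_sub into one new feature via partition
    retention: one cell per distinct joint value of the columns, and each
    sample is mapped to the mean of the response y inside its cell."""
    y = np.asarray(y, dtype=float)
    _, cell = np.unique(np.asarray(X_sub), axis=0, return_inverse=True)
    cell = cell.ravel()
    cell_mean = np.array([y[cell == j].mean() for j in range(cell.max() + 1)])
    return cell_mean[cell]

# XOR: neither column predicts y on its own, but the dagger feature of
# the pair reconstructs y exactly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0.0, 1.0, 1.0, 0.0])
x_dagger = dagger_feature(X, y)
```

On this toy interaction the single combined column carries the full joint signal, which is the sense in which the technique "fully preserves" the relationship with the response.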
Application. We show with empirical results that the "dagger technique" can fully reconstruct the target variable given the correct features, a method that generalizes to any feed-forward Artificial Neural Network (ANN) or Convolutional Neural Network (CNN). We demonstrate the usage of the I-score and the proposed methods on simulated data. We also show, in a real-world application on the IMDB Movie Dataset, that the proposed methods achieve a 97% AUC value, an 81% error reduction from their peers, i.e. similar RNNs without the I-score.
Future Research Directions. We call for further exploration of using the I-score to extract features with long-term dependencies in time-series and sequential data. Since relying on high-performance CPUs/GPUs is computationally costly, the direction the I-score takes leads researchers to rethink the design of ever longer and deeper neural networks. Instead, future research can continue the approach of building low-dimensional but highly informative features, so that less complicated models can be constructed for end-users.

Acknowledgment
We would like to dedicate this work to H. Chernoff, a world-renowned statistician and mathematician, in honor of his 98th birthday and his contributions to the Influence Score (I-score) and the Backward Dropping Algorithm (BDA). We are particularly fortunate to have received many useful comments from him. Moreover, we are very grateful for his guidance on how the I-score plays a fundamental role in measuring the potential of a small group of explanatory variables for classification, which leads to a much broader impact in the fields of pattern recognition, computer vision, and representation learning.