# Measuring Interactions in Categorical Datasets Using Multivariate Symmetrical Uncertainty

## Abstract

## 1. Introduction

**Multiple Correlation**. In the multivariate world, many of the observed phenomena require a nonlinear model, and hence, a good measure of correlation should be able to detect both linear and nonlinear correlations. The so-called Coefficient of Multiple Correlation ${R}^{2}$ is computed in multiple regression from the square matrix ${R}_{xx}$ formed by all the paired correlations between variables [3]. It measures how well a given variable can be predicted using a linear function of the set of the other variables. In effect, R measures the linear correlation between the observed and the predicted values of the target attribute or response Y.
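For reference, the standard closed form behind this description can be written as follows. The symbol $\mathbf{c}$ is introduced here only for illustration (it is not notation taken from the paper) and denotes the vector of correlations between each predictor and the response $Y$; ${R}_{xx}$ is read here as the correlation matrix among the predictors.

$$R^{2}=\mathbf{c}^{\top}{R}_{xx}^{-1}\,\mathbf{c},\qquad \mathbf{c}={\left({r}_{{X}_{1}Y},\dots ,{r}_{{X}_{n}Y}\right)}^{\top}$$

With a single predictor this reduces to ${R}^{2}={r}_{{X}_{1}Y}^{2}$, the squared Pearson correlation, which makes the purely linear nature of the measure explicit.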

**Interaction.** Consider a pure multivariate linear regression model of a continuous random variable Y explained by a set of continuous variables ${X}_{1},{X}_{2},\dots ,{X}_{n}$. From here on, we adopt the statistical convention whereby capital letters refer to random variables and the corresponding lowercase letters refer to particular observed values or outcomes. Each outcome ${y}_{i}$ is modeled as a linear combination of the observed variable values [31],

**Contributions**. The main contribution of this paper is that it proposes a formalization of the concept of interaction for both continuous and categorical responses. Interaction is often found in Multiple Linear Regression [31] and Analysis of Variance models [34], and it is described as a departure from the linearity of effect in each variable. However, for an all-categorical-variables context, there is no definition of interaction. This work proposes a definition that is facilitated by the MSU measure and shows that it is suitable for both types of variables. The detection and quantification of interactions in any group of features of a categorical dataset is the second aim of the work.

## 2. Patterned Records and the Detection of Interactions

**Example**

**Definition of MSU.** Let ${X}_{i}$ be a categorical (discrete) random variable with cardinality $c\left({X}_{i}\right)\in \mathbb{N}$ and possible values ${x}_{ij}$ with $j\in \{1,\dots ,c\left({X}_{i}\right)\}$. Let $P\left({X}_{i}\right)$ be its probability mass function. The entropy H of the individual variable ${X}_{i}$ is a measure of the uncertainty in predicting the value of ${X}_{i}$ and is defined as:

- The MSU values are in the unit range, $MSU\left({X}_{1:n}\right)\in [0,1]$;
- Higher values in the measure correspond to higher correlation among variables, i.e., a value of 0 implies that all variables are independent while a value of 1 corresponds to a perfect correlation among variables; and
- MSU detects linear and non-linear correlations between any mix of categorical and/or discretized numerical variables.
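The properties above can be checked on a small pattern. The following Python sketch is illustrative only: the function names are hypothetical, and the formula it uses, $MSU = \frac{n}{n-1}\left[1 - H(X_{1:n}) / \sum_i H(X_i)\right]$, is an assumption on our part because the defining equation is not reproduced in this excerpt. It does reproduce the equal-likelihood values reported in Tables 1 to 3.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a sequence of hashable outcomes."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def msu(records):
    """MSU of a list of equal-length tuples (one tuple per record).

    Assumed formula: n/(n-1) * (1 - H(joint) / sum of marginal entropies).
    """
    n = len(records[0])
    joint = entropy(records)
    marginals = sum(entropy([r[i] for r in records]) for i in range(n))
    if marginals == 0:
        return 0.0  # convention chosen here: every variable is constant
    return n / (n - 1) * (1 - joint / marginals)


# Equal-likelihood patterns: each distinct record appears exactly once.
xor3 = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
and4 = [(a, b, c, a & b & c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

print(round(msu(xor3), 3))  # 0.5
print(round(msu(and4), 3))  # 0.205
```

Both results lie in the unit interval, as the first property requires, and the XOR case shows a non-linear dependence being detected.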

**Interaction among continuous variables.** Let us begin with a two-variable example. Consider the regression model

**Interaction among categorical variables.** Categorical or nominal features are also employed to build various types of multivariate models with a categorical response. Established modeling techniques include, for example, Categorical Principal Components Analysis, Multiple Correspondence Analysis, and Multiple Factor Analysis [36]. In this realm, we can measure the strength of association between two, three, or more categorical variables by means of both MSU and the study of the behavior of patterns; this will, in turn, allow us to detect interactions.

## 3. Simulations Using Patterns

#### 3.1. Three-Way XOR

#### 3.2. Four-Way XOR

#### 3.3. Four-Way AND

#### 3.4. Further Simulations

#### 3.5. Discussion and Interpretation of Results

**gain in multiple correlation** obtained by adding B (or BC) to AC, forming ABC, is defined as

**interaction** among variables in $\mathcal{C}$ on top of j variables as

**Complexity of Interaction Calculation**. The following approach is module-based. In a dataset of r observation rows on n variables, let ${c}_{i}$ be the cardinality of the i-th variable. The two sets being considered are $\mathcal{C}$ with k variables and $\mathcal{A}$ with j variables, such that $\mathcal{A}\subset \mathcal{C}$.

- Entropy of each attribute: For each attribute ${X}_{i}$, there are ${c}_{i}$ frequencies $P\left({x}_{i}\right)$ and ${c}_{i}$ logarithms ${\log}_{2}\left(P\left({x}_{i}\right)\right)$, which are multiplied according to Equation (4), giving $3{c}_{i}$ operations. This is conducted k times, giving $3{\sum}_{i=1}^{k}{c}_{i}$.
- Joint entropy of all k attributes: There are ${\prod}_{i=1}^{k}{c}_{i}$ combinations of values, and for each one of them, the frequencies as well as their logarithms are calculated and multiplied according to Equation (5), giving $3{\prod}_{i=1}^{k}{c}_{i}$ operations. This is conducted one time.
- $msucost\left(\mathcal{C}\right)$: Using Equation (6), the costs of the numerator and the denominator are added, followed by one division and one difference. This gives $3{\sum}_{i=1}^{k}{c}_{i}+3{\prod}_{i=1}^{k}{c}_{i}+2$ operations.
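The operation count in the last bullet is easy to evaluate for a concrete set of cardinalities. The sketch below simply transcribes that expression; the function name `msu_cost` is hypothetical and chosen only for this illustration.

```python
from math import prod


def msu_cost(cardinalities):
    """Operation count 3*sum(c_i) + 3*prod(c_i) + 2 for one MSU evaluation."""
    return 3 * sum(cardinalities) + 3 * prod(cardinalities) + 2


# Three binary variables: 3*6 + 3*8 + 2
print(msu_cost([2, 2, 2]))  # 44
```

The product term dominates as soon as k or the cardinalities grow, which is why the joint entropy is the expensive part of the calculation.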

**Proof.**

**intrinsic interaction** due to pattern $\mathcal{P}$.

## 4. Comparison with Interaction on Continuous Variables

- Discretize bf, st and mc;
- Take as pattern the set of distinct observed records, discretized;
- Simulate sampling scenarios to find ${M}_{L}$;
- Check whether the ${M}_{L}$ value reveals interactions.
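The first two steps of this procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the cut-off values in `CUTS` are hypothetical placeholders (the actual cut-offs are not given in this excerpt), the helper names are invented for the example, and only three of the twenty observations are used.

```python
from bisect import bisect_right

LABELS = ("low", "med", "high")


def discretize(value, cutoffs, labels=LABELS):
    """Map a number to a label; cutoffs must be sorted in ascending order."""
    return labels[bisect_right(cutoffs, value)]


def pattern(records):
    """Distinct records, kept in order of first appearance."""
    return list(dict.fromkeys(records))


# Hypothetical cut-offs, assumed here for illustration only.
CUTS = {"st": (-1.5, 3.5), "mc": (-1.0, 1.2), "bf": (19.0, 22.7)}

# (st.c, mc.c, bf) for observations 1, 2 and 5 of the body fat data.
rows = [(-5.805, 1.48, 11.9), (-0.605, 0.58, 22.8), (-6.205, 3.28, 12.9)]

discretized = [
    (
        discretize(st, CUTS["st"]),
        discretize(mc, CUTS["mc"]),
        discretize(bf, CUTS["bf"]),
    )
    for st, mc, bf in rows
]

# Observations 1 and 5 collapse into the same record of the pattern.
print(pattern(discretized))
# [('low', 'high', 'low'), ('med', 'med', 'high')]
```

The remaining steps (simulating sampling scenarios over the pattern and inspecting ${M}_{L}$) operate on the output of `pattern` and are not covered by this sketch.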

#### 4.1. Discretization

#### 4.2. Seeking Interaction in the Pattern

#### 4.3. Creating Ad Hoc Interaction

#### 4.4. Discretizing the Modified Data

${}^{o}$ exponent, meaning that they have been recategorized only because of the modified cutoff values. This can be verified by comparing Table 9 with Table 6.

#### 4.5. Interaction in the New Pattern

## 5. Discussion on ${\mathbf{M}}_{\mathbf{L}}$ and Linear Models

## 6. Conclusions and Future Work

**Figure 1.** Dataset, pattern, and sample in a 3-variable example. The dataset (or population) may contain many records, of which only a sample is actually collected. Pattern is the name given to the set of distinct records in the sample.

**Figure 2.** Moving a few body fat data points to produce an interaction: on a graph of bf as a function of the product $st.c\cdot mc.c$, six points were moved to induce interaction in the linear regression.

**Table 1.** MSU values of 3-way XOR: minimum of 0.5 and maximum of 0.75. Here $C=A\oplus B$, where $\oplus$ represents the XOR operation.

| A | B | C | X | $P(X)$ | 3-Way ABC: $P(X)\log P(X)$ | 1-Way A: $P(X)\log P(X)$ | 1-Way B: $P(X)\log P(X)$ | 1-Way C: $P(X)\log P(X)$ |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 000 | 0.25 | −0.5 | | | |
| 0 | 1 | 1 | 011 | 0.25 | −0.5 | −0.5 | −0.5 | −0.5 |
| 1 | 0 | 1 | 101 | 0.25 | −0.5 | | | |
| 1 | 1 | 0 | 110 | 0.25 | −0.5 | −0.5 | −0.5 | −0.5 |
| $H(X)$ | | | | | 2 | 1 | 1 | 1 |
| $MSU$ | | | | | 0.5 | | | |

| A | B | C | X | $P(X)$ | 3-Way ABC: $P(X)\log P(X)$ | 1-Way A: $P(X)\log P(X)$ | 1-Way B: $P(X)\log P(X)$ | 1-Way C: $P(X)\log P(X)$ |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 000 | 0.25 | −0.5 | | | |
| 0 | 1 | 1 | 011 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | −0.5 | −0.31 | $-5.30\times {10}^{-78}$ |
| 1 | 0 | 1 | 101 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | | | |
| 1 | 1 | 0 | 110 | 0.75 | −0.311 | −0.311 | −0.5 | 0 |
| $H(X)$ | | | | | 0.811 | 0.811 | 0.811 | $5.30\times {10}^{-78}$ |
| $MSU$ | | | | | 0.75 | | | |
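Both scenarios of Table 1 can be reproduced directly from the record probabilities. The sketch below is illustrative: the function names are hypothetical, and the MSU expression it uses, $\frac{n}{n-1}\left[1 - H(X_{1:n}) / \sum_i H(X_i)\right]$, is assumed because it matches the tabulated values, not quoted from an equation in this excerpt.

```python
from collections import defaultdict
from math import log2


def entropy_from_probs(probs):
    """Shannon entropy (base 2) of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


def msu_from_distribution(dist):
    """MSU for a mapping {record tuple: probability} (assumed formula)."""
    n = len(next(iter(dist)))
    joint = entropy_from_probs(dist.values())
    marginal_sum = 0.0
    for i in range(n):
        marginal = defaultdict(float)
        for record, p in dist.items():
            marginal[record[i]] += p  # accumulate P(X_i = value)
        marginal_sum += entropy_from_probs(marginal.values())
    return n / (n - 1) * (1 - joint / marginal_sum)


records = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # C = A xor B

equal = dict(zip(records, [0.25, 0.25, 0.25, 0.25]))
skewed = dict(zip(records, [0.25, 1e-80, 1e-80, 0.75]))

print(round(msu_from_distribution(equal), 3))   # 0.5
print(round(msu_from_distribution(skewed), 3))  # 0.75
```

Keeping the two near-zero probabilities strictly positive preserves the four-record pattern while letting the distribution approach the extreme case.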

**Table 2.** MSU values of the 4-way XOR with a minimum of 1/3 and a maximum of 0.746. Here $D=A\oplus B\oplus C$.

| A | B | C | D | X | $P(X)$ | 4-Way ABCD: $P(X)\log P(X)$ | 1-Way A: $P(X)\log P(X)$ | 1-Way B: $P(X)\log P(X)$ | 1-Way C: $P(X)\log P(X)$ | 1-Way D: $P(X)\log P(X)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0000 | 0.125 | −0.375 | | | | |
| 0 | 0 | 1 | 1 | 0011 | 0.125 | −0.375 | | | | |
| 0 | 1 | 0 | 1 | 0101 | 0.125 | −0.375 | | | | |
| 0 | 1 | 1 | 0 | 0110 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.5 |
| 1 | 0 | 0 | 1 | 1001 | 0.125 | −0.375 | | | | |
| 1 | 0 | 1 | 0 | 1010 | 0.125 | −0.375 | | | | |
| 1 | 1 | 0 | 0 | 1100 | 0.125 | −0.375 | | | | |
| 1 | 1 | 1 | 1 | 1111 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.5 |
| $H(X)$ | | | | | | 3 | 1 | 1 | 1 | 1 |
| $MSU$ | | | | | | 0.333 | | | | |

| A | B | C | D | X | $P(X)$ | 4-Way ABCD: $P(X)\log P(X)$ | 1-Way A: $P(X)\log P(X)$ | 1-Way B: $P(X)\log P(X)$ | 1-Way C: $P(X)\log P(X)$ | 1-Way D: $P(X)\log P(X)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0000 | 1.000 | 0.000 | | | | |
| 0 | 0 | 1 | 1 | 0011 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | | | | |
| 0 | 1 | 0 | 1 | 0101 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | | | | |
| 0 | 1 | 1 | 0 | 0110 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 0 | 1 | 1001 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | | | | |
| 1 | 0 | 1 | 0 | 1010 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | | | | |
| 1 | 1 | 0 | 0 | 1100 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | | | | |
| 1 | 1 | 1 | 1 | 1111 | $1.00\times {10}^{-80}$ | $-2.66\times {10}^{-78}$ | $-1.06\times {10}^{-77}$ | $-1.06\times {10}^{-77}$ | $-1.06\times {10}^{-77}$ | $-1.06\times {10}^{-77}$ |
| $H(X)$ | | | | | | $1.86\times {10}^{-77}$ | $1.06\times {10}^{-77}$ | $1.06\times {10}^{-77}$ | $1.06\times {10}^{-77}$ | $1.06\times {10}^{-77}$ |
| $MSU$ | | | | | | 0.746 | | | | |

**Table 3.** MSU values of the 4-way AND show a minimum of 0.2045 and a maximum of 1. Here, $D=A\wedge B\wedge C$.

| A | B | C | D | X | $P(X)$ | 4-Way ABCD: $P(X)\log P(X)$ | 1-Way A: $P(X)\log P(X)$ | 1-Way B: $P(X)\log P(X)$ | 1-Way C: $P(X)\log P(X)$ | 1-Way D: $P(X)\log P(X)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0000 | 0.125 | −0.375 | | | | |
| 0 | 0 | 1 | 0 | 0010 | 0.125 | −0.375 | | | | |
| 0 | 1 | 0 | 0 | 0100 | 0.125 | −0.375 | | | | |
| 0 | 1 | 1 | 0 | 0110 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.169 |
| 1 | 0 | 0 | 0 | 1000 | 0.125 | −0.375 | | | | |
| 1 | 0 | 1 | 0 | 1010 | 0.125 | −0.375 | | | | |
| 1 | 1 | 0 | 0 | 1100 | 0.125 | −0.375 | | | | |
| 1 | 1 | 1 | 1 | 1111 | 0.125 | −0.375 | −0.5 | −0.5 | −0.5 | −0.375 |
| $H(X)$ | | | | | | 3 | 1 | 1 | 1 | 0.544 |
| $MSU$ | | | | | | 0.205 | | | | |

| Name | n | c | k | Probability Distribution | Partial MSU Values | Global MSU |
|---|---|---|---|---|---|---|
| XOR | 3 | 2 | 4 | Equal likelihoods | MSU(AC) = 0 | MSU(ABC) = 0.5 |
| | | | | | MSU(BC) = 0 | |
| | 3 | 2 | 4 | 0.25; $1.00\times {10}^{-80}$; $1.00\times {10}^{-80}$; 0.75 | MSU(AC) = 0 | MSU(ABC) = 0.75 |
| | | | | | MSU(BC) = 0 | |
| XOR | 4 | 2 | 8 | Equal likelihoods | MSU(AD) = 0 | MSU(ABCD) = 0.333 |
| | | | | | MSU(BD) = 0 | |
| | | | | | MSU(CD) = 0 | |
| | 4 | 2 | 8 | 1; $1.00\times {10}^{-80}$; $1.00\times {10}^{-80}$; … | MSU(AD) = 0.371 | MSU(ABCD) = 0.746 |
| | | | | | MSU(BD) = 0.371 | |
| | | | | | MSU(CD) = 0.371 | |
| AND | 3 | 2 | 4 | Equal likelihoods | MSU(AC) = 0.258 | MSU(ABC) = 0.433 |
| | | | | | MSU(BC) = 0.258 | |
| | 3 | 2 | 4 | 0.25; $1.00\times {10}^{-21}$; $1.00\times {10}^{-21}$; 0.75 | MSU(AC) = 0.75 | MSU(ABC) = 1 |
| | | | | | MSU(BC) = 0.75 | |
| AND | 4 | 2 | 8 | Equal likelihoods | MSU(AD) = 0.179 | MSU(ABCD) = 0.205 |
| | | | | | MSU(BD) = 0.179 | |
| | | | | | MSU(CD) = 0.179 | |
| | 4 | 2 | 8 | 0.2; $1.00\times {10}^{-80}$; …; $1.00\times {10}^{-80}$; 0.8 | MSU(AD) = 1 | MSU(ABCD) = 1 |
| | | | | | MSU(BD) = 1 | |
| | | | | | MSU(CD) = 1 | |
| OR | 3 | 2 | 4 | $1.00\times {10}^{-21}$; 0.1; $1.00\times {10}^{-21}$; 0.9 | MSU(AC) = 0 | MSU(ABC) = 0 |
| | | | | | MSU(BC) = 0.654 | |
| | 3 | 2 | 4 | Equal likelihoods | MSU(AC) = 0.344 | MSU(ABC) = 0.433 |
| | | | | | MSU(BC) = 0.344 | |
| | 3 | 2 | 4 | 0.4; $1.00\times {10}^{-21}$; $1.00\times {10}^{-21}$; 0.6 | MSU(AC) = 1 | MSU(ABC) = 1 |
| | | | | | MSU(BC) = 1 | |
| OR | 4 | 2 | 8 | $1.00\times {10}^{-80}$; 0.001; 0.001; 0.009; 0.01; 0.125; 0.125; 0.729 | MSU(AD) = 0 | MSU(ABCD) = 0.005 |
| | | | | | MSU(BD) = 0 | |
| | | | | | MSU(CD) = 0 | |
| | 4 | 2 | 8 | Equal likelihoods | MSU(AD) = 0.179 | MSU(ABCD) = 0.205 |
| | | | | | MSU(BD) = 0.179 | |
| | | | | | MSU(CD) = 0.179 | |
| | 4 | 2 | 8 | 0.2; $1.00\times {10}^{-80}$; …; $1.00\times {10}^{-80}$; 0.8 | MSU(AD) = 1 | MSU(ABCD) = 1 |
| | | | | | MSU(BD) = 1 | |
| | | | | | MSU(CD) = 1 | |
| $A\wedge \lnot B$ | 3 | 2 | 4 | $1.00\times {10}^{-21}$; 0.25; $1.00\times {10}^{-21}$; 0.75 | MSU(AC) = 0 | MSU(ABC) = 0 |
| | | | | | MSU(BC) = 0.654 | |
| | 3 | 2 | 4 | $1.00\times {10}^{-21}$; $1.00\times {10}^{-21}$; 0.1; 0.9 | MSU(AC) = 0 | MSU(ABC) = 0.75 |
| | | | | | MSU(BC) = 1 | |

| # | st.c | mc.c | bf |
|---|---|---|---|
| 1 | −5.805 | 1.48 | 11.9 |
| 2 | −0.605 | 0.58 | 22.8 |
| 3 | 5.395 | 9.38 | 18.7 |
| 4 | 4.495 | 3.48 | 20.1 |
| 5 | −6.205 | 3.28 | 12.9 |
| 6 | 0.295 | −3.92 | 21.7 |
| 7 | 6.095 | −0.02 | 27.1 |
| 8 | 2.595 | 2.98 | 25.4 |
| 9 | −3.205 | −4.42 | 21.3 |
| 10 | 0.195 | −2.82 | 19.3 |
| 11 | 5.795 | 2.38 | 25.4 |
| 12 | 5.095 | 0.68 | 27.2 |
| 13 | −6.605 | −4.62 | 11.7 |
| 14 | −5.605 | 0.98 | 17.8 |
| 15 | −10.705 | −6.32 | 12.8 |
| 16 | 4.195 | 2.48 | 23.9 |
| 17 | 2.395 | −1.92 | 22.6 |
| 18 | 4.895 | −3.02 | 25.4 |
| 19 | −2.605 | −0.52 | 14.8 |
| 20 | −0.105 | −0.12 | 21.1 |

| # | dst | dmc | dbf |
|---|---|---|---|
| 1 | low | high | low |
| 2 | med | med | high |
| 3 | high | high | low |
| 4 | high | high | med |
| 5 | low | high | low |
| 6 | med | low | med |
| 7 | high | med | high |
| 8 | med | high | high |
| 9 | low | low | med |
| 10 | med | low | med |
| 11 | high | high | high |
| 12 | high | med | high |
| 13 | low | low | low |
| 14 | low | med | low |
| 15 | low | low | low |
| 16 | high | high | high |
| 17 | med | low | med |
| 18 | high | low | high |
| 19 | low | med | low |
| 20 | med | med | med |

| Pattern 1 | | | $P(X)$ | $P(X)\log P(X)$ | 1-Way dst | 1-Way dmc | 1-Way dbf |
|---|---|---|---|---|---|---|---|
| low | low | low | 0.027 | −0.141 | −0.302 | −0.360 | −0.390 |
| low | low | med | 0.027 | −0.141 | | | |
| low | med | low | 0.008 | −0.054 | | | |
| low | high | low | 0.023 | −0.126 | | | |
| med | low | med | 0.015 | −0.093 | −0.228 | −0.194 | −0.530 |
| med | med | med | 0.008 | −0.054 | | | |
| med | high | high | 0.023 | −0.126 | | | |
| high | low | high | 0.046 | −0.205 | −0.186 | −0.209 | −0.507 |
| high | med | high | 0.019 | −0.110 | | | |
| high | high | low | 0.077 | −0.285 | | | |
| high | high | med | 0.332 | −0.528 | | | |
| high | high | high | 0.386 | −0.530 | | | |
| Entropy: | | | | 2.448 | 0.716 | 0.763 | 1.428 |
| MSU: | | | | 0.237 | | | |

| # | st.c | mc.c | bf.mod |
|---|---|---|---|
| 1 | −5.805 | 1.48 | 11.9 |
| 2 | −0.605 | 0.58 | 22.8 |
| 3 | 5.395 | 9.38 | 31 |
| 4 | 4.495 | 3.48 | 20.1 |
| 5 | −6.205 | 3.28 | 12.9 |
| 6 | 0.295 | −3.92 | 21.7 |
| 7 | 6.095 | −0.02 | 24 |
| 8 | 2.595 | 2.98 | 25.4 |
| 9 | −3.205 | −4.42 | 21.3 |
| 10 | 0.195 | −2.82 | 19.3 |
| 11 | 5.795 | 2.38 | 25.4 |
| 12 | 5.095 | 0.68 | 22 |
| 13 | −6.605 | −4.62 | 28 |
| 14 | −5.605 | 0.98 | 17.8 |
| 15 | −10.705 | −6.32 | 32 |
| 16 | 4.195 | 2.48 | 23.9 |
| 17 | 2.395 | −1.92 | 22.6 |
| 18 | 4.895 | −3.02 | 17 |
| 19 | −2.605 | −0.52 | 14.8 |
| 20 | −0.105 | −0.12 | 21.1 |

**Table 9.** Modified Body Fat Data, discretized. The superscript ${}^{o}$ denotes data recategorized because of modified cutoff values. The superscript * denotes an underlying numerical value modified to produce interaction.

| # | dst | dmc | dbf |
|---|---|---|---|
| 1 | low | high | low |
| 2 | med | med | med ${}^{o}$ |
| 3 | high | high | high * |
| 4 | high | high | low ${}^{o}$ |
| 5 | low | high | low |
| 6 | med | low | med |
| 7 | high | med | high * |
| 8 | med | high | high |
| 9 | low | low | med |
| 10 | med | low | low ${}^{o}$ |
| 11 | high | high | high |
| 12 | high | med | med * |
| 13 | low | low | high * |
| 14 | low | med | low |
| 15 | low | low | high * |
| 16 | high | high | high |
| 17 | med | low | med |
| 18 | high | low | low * |
| 19 | low | med | low |
| 20 | med | med | med |

| Pattern 2 | | | $P(X)$ | $P(X)\log P(X)$ | 1-Way dst | 1-Way dmc | 1-Way dbf |
|---|---|---|---|---|---|---|---|
| low | low | med | 0.04 | −0.185 | −0.523 | −0.521 | −0.468 |
| low | low | high | 0.06 | −0.244 | | | |
| low | med | low | 0.08 | −0.292 | | | |
| low | high | low | 0.13 | −0.383 | | | |
| med | low | low | 0.06 | −0.244 | −0.435 | −0.494 | −0.423 |
| med | low | med | 0.03 | −0.152 | | | |
| med | med | med | 0.03 | −0.152 | | | |
| med | high | high | 0.05 | −0.216 | | | |
| high | low | low | 0.11 | −0.350 | −0.491 | −0.515 | −0.514 |
| high | med | med | 0.06 | −0.244 | | | |
| high | med | high | 0.07 | −0.269 | | | |
| high | high | low | 0.18 | −0.445 | | | |
| high | high | high | 0.1 | −0.332 | | | |
| Entropy: | | | | 3.506 | 1.449 | 1.530 | 1.406 |
| MSU: | | | | 0.301 | | | |

| Name | n | c | k | Record Frequencies | Partial MSU Values | Global MSU | Interaction |
|---|---|---|---|---|---|---|---|
| Pattern 1 | 3 | 3 | 13 | 7, 7, 2, 6, 4, 2, 2, 6, 12, 5, 20, 86, 100 | MSU(dst, dbf) = 0.142 | MSU(dst, dmc, dbf) = 0.237 | 0.095 |
| | | | | | MSU(dmc, dbf) = 0.012 | | |
| | 3 | 3 | 13 | 2, 1, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 2 (original observations) | MSU(dst, dbf) = 0.441 | MSU(dst, dmc, dbf) = 0.367 | −0.074 |
| | | | | | MSU(dmc, dbf) = 0.097 | | |
| | 3 | 3 | 13 | Equal frequencies | MSU(dst, dbf) = 0.312 | MSU(dst, dmc, dbf) = 0.326 | 0.014 |
| | | | | | MSU(dmc, dbf) = 0.043 | | |
| Pattern 2 | 3 | 3 | 13 | 4, 6, 8, 13, 6, 3, 3, 5, 11, 6, 7, 18, 10 | MSU(dst, dbf) = 0.037 | MSU(dst, dmc, dbf) = 0.301 | 0.176 |
| | | | | | MSU(dmc, dbf) = 0.124 | | |
| | 3 | 3 | 13 | 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 3 (original observations) | MSU(dst, dbf) = 0.152 | MSU(dst, dmc, dbf) = 0.367 | 0.206 |
| | | | | | MSU(dmc, dbf) = 0.161 | | |
| | 3 | 3 | 13 | Equal frequencies | MSU(dst, dbf) = 0.043 | MSU(dst, dmc, dbf) = 0.326 | 0.186 |
| | | | | | MSU(dmc, dbf) = 0.141 | | |
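
Reading across the rows of this last table, each Interaction entry equals the global MSU minus the largest partial MSU, up to rounding in the third decimal. The Python sketch below encodes that reading; it is inferred from the tabulated values, not from a formula reproduced in this excerpt, and the function name `interaction` is hypothetical.

```python
def interaction(global_msu, partial_msus):
    """Global MSU minus the largest partial MSU (reading assumed from the table)."""
    return global_msu - max(partial_msus)


# Pattern 1, simulated frequencies: 0.237 - max(0.142, 0.012)
print(round(interaction(0.237, [0.142, 0.012]), 3))  # 0.095

# Pattern 1, original observations: a negative value, i.e. no gain over the best pair
print(round(interaction(0.367, [0.441, 0.097]), 3))  # -0.074
```

A positive result means the full variable set is more strongly correlated than its best subset, which is the signal of interaction the simulations look for.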

