# An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images


## Abstract


## 1. Introduction

#### 1.1. AI Systems for COVID-19 Chest X-rays

#### 1.2. What Is XAI?

#### 1.3. Problems in Image Classification and Deep CNNs

#### 1.4. An Interaction-Based Convolutional Neural Network (ICNN) to Address XAI Problems

## 2. Contributions of the Paper

#### 2.1. Why the Proposed Methodology Satisfies XAI Dimensions

**Proposed Architecture**. The schematic diagram of the proposed architecture is presented in Figure 2. First, the architecture starts with image data consisting of X-ray pictures sized 128 × 128 (see the detailed discussion of the COVID-19 data set in Section 4). The architecture proposes using a 2 × 2 rolling window (we use a 2 × 2 window for simplicity; larger sizes are applicable in practice depending on the data). Since the window size is 2 × 2, there are four variables every time the window rests on a certain location of the image. Within this subset of variables, we execute the proposed backward dropping algorithm (BDA). This procedure finely selects a highly predictive subset of variables by omitting the noisy variables in this small neighborhood of the image. Next, the selected variables (which can be any subset of the original four) undergo a proposed procedure called the interaction-based feature engineer (see Equation (6) for a definition). The BDA procedure is illustrated in the bottom left corner of Figure 2 (we use a 2 × 2 window for demonstration purposes). In addition, we set the starting point to be 12, which means we start from the pixel in the 12th row and the 12th column. From the data (sized 128 × 128) to the 1st layer (58 × 58), this procedure produces a new feature matrix with size $\lfloor (128-12-2+1)/2+1\rfloor =58$ on both edges, which means the new feature matrix has 58 × 58 variables (the formula is presented in Equation (18)). This feature matrix constitutes the first interaction-based convolutional layer. We can then use the same methodology to construct the second and third interaction-based convolutional layers. The third interaction-based convolutional layer can be used as the input layer for a neural network classifier.
For each layer, we can compute the proposed I-score and the AUC value (see Section 4.4 for a detailed discussion of AUC values) for each variable (using this variable as the sole predictor when computing the AUC value). The I-score and AUC values exhibit parallel behavior, as shown by the color spectrum in Figure 2.
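As a quick sanity check of the 58 × 58 layer size quoted above, the following sketch (our own illustration, not the authors' code) enumerates the top-left corners that a 2 × 2 rolling window visits along one axis of a 128 × 128 image with starting point 12 and stride 2:

```python
def window_positions(size, start, window, stride):
    """Top-left indices visited by a rolling window along one image axis."""
    positions = []
    i = start
    while i + window - 1 < size:  # the window must fit inside the image
        positions.append(i)
        i += stride
    return positions

rows = window_positions(128, start=12, window=2, stride=2)
print(len(rows))  # 58 positions per axis, so the first layer is 58 x 58
```

The count agrees with the closed-form expression $\lfloor (128-12-2+1)/2+1\rfloor =58$.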

**Why the Proposed I-score Satisfies XAI Dimensions**. The design of the proposed architecture in Figure 2 mainly focuses on using the I-score and the BDA to extract and engineer features from the original images. The proposed I-score is nonparametric (see Section 3.1 for a definition of this measure). This means the impact of the explanatory variables on the response variable measured by the I-score does not rely on the knowledge of the correct specification of the underlying model. In other words, the computation of the I-score does not rely on any model fitting procedure. This characteristic satisfies the first dimension, $\mathcal{D}1$, defined in Section 1 about interpretable measures.

#### 2.2. Organization of the Paper

## 3. Proposed Method

#### 3.1. Influence Score (I-Score)

#### 3.2. Backward Dropping Algorithm (BDA)

- Training Set: Consider a training set $\{({y}_{1},{x}_{1}),\dots ,({y}_{n},{x}_{n})\}$ of n observations, where ${x}_{i}=({x}_{1i},\dots ,{x}_{pi})$ is a p-dimensional vector of explanatory variables. The size p can be very large. All explanatory variables are discrete.
- Sampling from Variable Space: Select an initial subset of k explanatory variables ${S}_{b}=\{{X}_{{b}_{1}},\dots ,{X}_{{b}_{k}}\}$, $b=1,\dots ,B$
- Compute Standardized I-score: Calculate $I\left({S}_{b}\right)=\frac{1}{n{\sigma}^{2}}{\sum}_{j\in {\mathcal{P}}_{k}}{n}_{j}^{2}{({\overline{Y}}_{j}-\overline{Y})}^{2}$, where ${\mathcal{P}}_{k}$ is the partition induced by the k discrete variables in ${S}_{b}$, ${n}_{j}$ is the number of observations in cell j, ${\overline{Y}}_{j}$ is the mean of Y within cell j, and ${\sigma}^{2}$ is the sample variance of Y. For the rest of the paper, we refer to this formula as the influence measure or influence score (I-score).
- Drop Variables: Tentatively drop each variable in ${S}_{b}$ and recalculate the I-score with one variable less. Then drop the one that produces the highest I-score. Call this new subset ${S}_{b}^{\prime}$, which has one variable less than ${S}_{b}$.
- Return Set: Continue to the next round of dropping variables in ${S}_{b}^{\prime}$ until only one variable is left. Keep the subset that yields the highest I-score in the entire process. Refer to this subset as the return set, ${R}_{b}$. This will be the most important and influential variable module from this initial training set.
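The steps above can be sketched in a few lines of Python. This is a minimal illustration (the function names and the XOR toy data are our own, not from the paper's code); the I-score follows the formula in the earlier bullet, with cells indexed by the joint values of the discrete variables:

```python
from collections import defaultdict
import random

def i_score(X_cols, y):
    """I(S) = (1/(n*sigma^2)) * sum_j n_j^2 * (ybar_j - ybar)^2,
    where j runs over the cells induced by the discrete variables in S."""
    n = len(y)
    ybar = sum(y) / n
    sigma2 = sum((v - ybar) ** 2 for v in y) / n
    cells = defaultdict(list)
    for i in range(n):
        cells[tuple(col[i] for col in X_cols)].append(y[i])
    total = sum(len(c) ** 2 * (sum(c) / len(c) - ybar) ** 2
                for c in cells.values())
    return total / (n * sigma2)

def backward_dropping(X_cols, y):
    """Tentatively drop each variable, keep the drop that maximizes the
    I-score, and return the best subset seen during the whole descent."""
    current = list(range(len(X_cols)))
    best_set, best_score = list(current), i_score(X_cols, y)
    while len(current) > 1:
        candidates = [[v for v in current if v != d] for d in current]
        scores = [i_score([X_cols[v] for v in s], y) for s in candidates]
        current = candidates[scores.index(max(scores))]
        if max(scores) > best_score:
            best_set, best_score = list(current), max(scores)
    return best_set, best_score

# Toy data: y depends on X1 XOR X2; X3 and X4 are pure noise.
random.seed(0)
n = 500
cols = [[random.randint(0, 1) for _ in range(n)] for _ in range(4)]
y = [cols[0][i] ^ cols[1][i] for i in range(n)]
best, score = backward_dropping(cols, y)
print(sorted(best))  # the signal pair: [0, 1]
```

The noise variables dilute the cell means, so dropping them raises the I-score, and the descent settles on the interacting pair.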

#### 3.3. Interaction-Based Convolutional Layer

#### 3.4. Interaction-Based Feature Engineer

#### 3.5. Simulation with Artificial Examples

#### 3.5.1. Artificial Example I: Variable Investigation

**Scenario I.** Assume the statistician knows the model in this simulation. This means they are fully aware of ${S}_{1}=\{{X}_{1},{X}_{2}\}$ as an important variable set and ${S}_{2}=\{{X}_{3},{X}_{4},{X}_{5}\}$ as the other. In other words, they can use the first module as a predictor to make predictions on the response variable Y. They are able to compute the theoretical prediction rate of the first variable set as 75%. This is because the response variable is defined by the first variable module ${S}_{1}=\{{X}_{1},{X}_{2}\}$ exactly 50% of the time, so ${S}_{1}$ is able to guess Y correctly at least 50% of the time. The other 50% of the time, the response variable is defined by the second variable module ${S}_{2}=\{{X}_{3},{X}_{4},{X}_{5}\}$. Since there is no marginal signal, the performance on these occurrences is exactly like random guessing, so they are correct only 50% of the time. In other words, assuming the training sample has n data points, we can write the following:
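The 75% rate can be checked by a small Monte Carlo simulation. This is our own sketch: we assume XOR-type interaction modules purely for illustration, since the exact module functions are specified elsewhere in the paper.

```python
import random

random.seed(1)
n = 20000
correct = 0
for _ in range(n):
    x = [random.randint(0, 1) for _ in range(5)]
    if random.random() < 0.5:
        y = x[0] ^ x[1]         # response generated by module S1 = {X1, X2}
    else:
        y = x[2] ^ x[3] ^ x[4]  # response generated by module S2 = {X3, X4, X5}
    correct += (x[0] ^ x[1]) == y  # predictor that only knows S1
print(correct / n)  # close to the theoretical rate of 0.75
```

The predictor is always right on the S1 half and right by chance on the S2 half, giving $0.5\times 1+0.5\times 0.5=0.75$.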

**Scenario II.** In practice, it is often the case that we do not exactly observe the underlying model in a given data set. The recommendation is to use the I-score to select the important local information and then create an interaction-based feature based on the selected variable. The diagram for this action is depicted in Figure 4.

**Constructing the Proposed Convolutional Layer.** In the proposed artificial example, we have a data set with 36 variables. This means each observation can be reshaped into a $6\times 6$ grid structure. In other words, each observation can be considered an image. The variable in the first row and the first column is ${X}_{1}$, and the variable in the first row and the last column is ${X}_{6}$, so we can arrange these variables into the following structure:

**Prediction.** After the variable modules (the variables named ${X}^{\dagger}$’s) are constructed, we can proceed with building the neural network classifier. In Table 3, there are 25 variable modules, i.e., $\{{X}_{1}^{\dagger},\dots ,{X}_{25}^{\dagger}\}$. We use these variable modules as the input layer of a neural network architecture. In other words, the input layer has 25 neurons (these neurons are exactly the 25 variable modules named ${X}^{\dagger}$’s). For the variable modules in Table 4, the input layer consists of 16 neurons, which are $\{{X}_{1}^{\dagger},\dots ,{X}_{16}^{\dagger}\}$. Since we are dealing with a rather small input layer (only 25 neurons), no hidden layer is recommended. The output layer can simply be one neuron, and the final outcome $\widehat{Y}$ can be computed using a sigmoid function as follows:

#### 3.5.2. Artificial Example II: Four Modules with Discrete Variables

**Scenario I.** Suppose we know the model formulation. We see that the first module occurs with a probability of 0.5. In this case, by correctly identifying the first module, we achieve a 50% accuracy rate. Additionally, this module is able to determine half of the remaining occurrences in the data. Hence, the theoretical Bayes’ rate for the model is 75%, which is computed below. Consider the first correct module identified as ${S}_{1}=\{{X}_{1},{X}_{2}\}$. In this case, the theoretical prediction rate ${\theta}_{c}\left({S}_{1}\right)$ can be calculated, assuming n samples in the training data,

**Scenario II.** In practice, we assume we do not have knowledge of the underlying model. To illustrate the performance of the I-score as a feature selection methodology, we conducted the following simulation: We created data with 49 variables drawn independently from $\mathrm{Bernoulli}(1/2)$ random variables. We allowed the in-sample training size to be $\{50,100,1000\}$. For each value of the training sample size, we conducted experiments using machine learning algorithms such as bagging (we used the statistical package ipred: Improved Predictors; source: https://cran.r-project.org/web/packages/ipred/index.html (2 November 2021)), logistic regression, random forest (RF) (we used the statistical package randomForest; source: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest (2 November 2021)), iterative random forest (iRF) (we used the statistical package irf; source: https://www.rdocumentation.org/packages/vars/versions/1.5-3/topics/irf (2 November 2021)), and neural network (NN) (we used the Keras package and produced a neural network with no hidden layer and one output neuron; source: https://github.com/yiqiao-yin/YinsLibrary__/blob/9822f36ca097b1e19f7b669e4f42ca39ea9aa608/r/KerasNN.R#L50-L57 (2 November 2021)). We used the default values from the packages for the first four algorithms. For the NN, we used the variable modules (i.e., ${X}^{\dagger}$) as the input layer, and the architecture takes the form shown in Figure 5, which states the classification rules using a forward propagation structure (as shown in Section 4.2, Figure 5, there is a linear transformation using weights $\overrightarrow{w}$ and a nonlinear transformation using an activation function $a(\cdot)$). The performance was measured using an out-of-sample test set with 1000 data points.
The metric for performance is the area under the curve (AUC) from the receiver operating characteristic (ROC) curve (see Section 4.4 for a detailed discussion of AUC values). An AUC value of 50% indicates that the predictor has no power to predict the ground truth, whereas an AUC value of 100% indicates that the predictor can perfectly recover the ground truth. For each machine learning algorithm and each training size, we conducted experiments using all variables as the benchmark. We also ran the same experiments using the variable modules selected by the I-score.

## 4. Application

#### 4.1. Background

#### 4.2. Model Training

**Forward Propagation.** To illustrate the procedure of model training, let us consider a set of input variables $\{{X}_{1}^{\dagger},{X}_{2}^{\dagger},{X}_{3}^{\dagger}\}$. In the proposed work, this refers to the variable modules, also notated as ${X}^{\dagger}$, that we created using the interaction-based feature engineer (see Equation (6)). For this discussion, we define a set of weights $\{{w}_{1},{w}_{2},{w}_{3}\}$ to construct a linear transformation. The symbol $\Sigma$ in the following diagram represents this linear transformation, which takes the form ${X}_{1}^{\dagger}{w}_{1}+{X}_{2}^{\dagger}{w}_{2}+{X}_{3}^{\dagger}{w}_{3}$. For simplicity of notation, we write $\Sigma ={\sum}_{j=1}^{3}{w}_{j}{X}_{j}^{\dagger}$. Then, we denote $a(\cdot)$ as an activation function. We chose the sigmoid to be this activation function $a(\cdot)$. This means the output $\widehat{y}$ is defined as $a\left(\Sigma \right)$. In other words, we can write the following:

$$\widehat{y}=a\left(\Sigma \right)=\frac{1}{1+{e}^{-{\sum}_{j=1}^{3}{w}_{j}{X}_{j}^{\dagger}}}$$
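Numerically, this forward pass can be sketched as follows (the weights and module values below are illustrative, not from the paper):

```python
import math

def forward(x, w):
    """One forward pass: Sigma = sum_j w_j * X_j, then the sigmoid a(Sigma)."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

y_hat = forward([1.0, 0.0, 1.0], [0.5, -0.2, 0.3])  # Sigma = 0.8
print(round(y_hat, 4))  # sigmoid(0.8) = 0.69
```
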

**Architecture.** The architecture of the neural network is presented below. For simplicity when drawing this picture, we assume there are 3 input variable modules: $\{{X}_{1}^{\dagger},{X}_{2}^{\dagger},{X}_{3}^{\dagger}\}$. In practice, the number of variable modules (the total number of ${X}^{\dagger}$) depends on the image data dimensions, window size, stride level, and starting point (please see Section 4.3 and Equation (18) for the exact calculation).

**Backward Propagation.** To search for the optimal weights, we used an optimizer algorithm called root mean square propagation (RMSprop), a name suggested by Geoffrey Hinton. With the loss function computed above, we derive the gradient of the loss function as $\nabla \mathcal{L}:=\partial \mathcal{L}(y,\widehat{y})/\partial w$. At each iteration t, we compute ${v}_{t,\nabla \mathcal{L}}:=\beta {v}_{t-1,\nabla \mathcal{L}}+(1-\beta )\nabla {\mathcal{L}}^{2}$, where $\beta$ is a tuning parameter. Note that the square on $\nabla \mathcal{L}$ is element-wise multiplication. Then, we can update the weights using ${w}_{t}:={w}_{t-1}-\eta \cdot \nabla \mathcal{L}/\sqrt{{v}_{t,\nabla \mathcal{L}}}$, where $\eta$ is the learning rate, a tuning parameter that is usually a very small number. This process starts with the loss function and works back to the beginning to update the weights $w=\{{w}_{1},{w}_{2},{w}_{3}\}$; hence, it earned the name backward propagation.
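The update rule above can be sketched on a toy one-parameter problem. This is our own illustration; the values of $\beta$, $\eta$, and the quadratic loss are illustrative choices, and the small `eps` added to the denominator is a standard numerical safeguard not shown in the formula:

```python
def rmsprop_step(w, grad, v, beta=0.9, eta=0.01, eps=1e-8):
    v = beta * v + (1 - beta) * grad ** 2  # running average of squared gradients
    w = w - eta * grad / (v ** 0.5 + eps)  # step scaled by the root mean square
    return w, v

# Toy problem: minimize L(w) = (y - w * x)^2 for a single sample.
x_val, y_val = 2.0, 1.0
w, v = 0.0, 0.0
for _ in range(200):
    grad = 2 * (w * x_val - y_val) * x_val  # dL/dw
    w, v = rmsprop_step(w, grad, v)
loss = (y_val - w * x_val) ** 2
print(loss < 1e-2)  # True: the loss has shrunk toward zero
```

Scaling the gradient by its recent root-mean-square keeps the step size roughly constant regardless of the raw gradient magnitude.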

#### 4.3. Model Parameters

**Window Size.** The window size is the size of the local area on which we run the backward dropping algorithm. For example, in the first artificial example in Section 3.5.1, the data size is $6\times 6$. A $2\times 2$ window means that we start with a local area covering the first two rows and the first two columns. For each row i and each column j in this $6\times 6$ grid structure, a window size of $2\times 2$ means a local area of the following $2\times 2$ matrix:

**Stride Level.** The stride level is the number of rows or columns that are skipped. This tuning parameter allows the algorithm to move faster, but its disadvantage is that some variables are skipped. For example, consider a stride level of one starting from row i and column j. Assume we use a $2\times 2$ window and start from $(i,j)$. We can visualize this action using the following diagram:

**Starting Point.** Another tuning parameter that we recommend adjusting is the starting point. The starting point represents the location of the first pixel in the proposed operation. The most common choice is to start the rolling window from the pixel located in the first row and the first column. This is illustrated in the following matrix:

**Computation of Dimensions.** The above discussion introduced the tuning parameters: window size, stride level, and starting point. These parameters update our input matrix and generate a new matrix with different dimensions. Let us denote the window size as w, the stride level as l, and the starting point as p. Given an ${s}_{\mathrm{in}}$ by ${s}_{\mathrm{in}}$ input matrix, the output matrix has new dimensions computed as:

$${s}_{\mathrm{out}}=\lfloor ({s}_{\mathrm{in}}-p-w+1)/l+1\rfloor$$
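A sketch of this computation (Equation (18), ${s}_{\mathrm{out}}=\lfloor ({s}_{\mathrm{in}}-p-w+1)/l+1\rfloor$), checked against the layer sizes reported in Section 4.5; the parameter values for the sets $\triangle$ and $\square$ are inferred from the arithmetic there, not stated in this section:

```python
import math

def out_dim(s_in, p, w, l):
    """Output edge length for starting point p, window size w, stride l."""
    return math.floor((s_in - p - w + 1) / l + 1)

print(out_dim(128, 12, 2, 2))  # 58: data -> 1st layer in Figure 2
print(out_dim(128, 6, 2, 2))   # 61: 1st Conv. Layer in Models 1-6
print(out_dim(61, 1, 2, 2))    # 30: 2nd Conv. Layer in Models 3-6
```
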

#### 4.4. Evaluation Metrics

**Sensitivity and Specificity.** The notion of sensitivity is interchangeable with recall or the true positive rate. In a simple two-class classification problem, the goal is to investigate the covariate matrix X in order to produce an estimated value of Y. From the output of an NN model, the predicted values are always between zero and one, which acts as a probabilistic statement describing the chance that an observation is class one or class zero. Given a threshold between zero and one, we can compute the sensitivity as $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, where TP and FN are the counts of true positives and false negatives at that threshold. Analogously, the specificity (the true negative rate) is $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$.
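At a fixed threshold, these two quantities can be computed directly; the sketch below (our own code) uses toy vectors of labels and predicted probabilities:

```python
def sens_spec(y_true, y_prob, t):
    """Sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP) at threshold t."""
    y_pred = [1 if p > t else 0 for p in y_prob]
    tp = sum(yt == 1 and yp == 1 for yt, yp in zip(y_true, y_pred))
    fn = sum(yt == 1 and yp == 0 for yt, yp in zip(y_true, y_pred))
    tn = sum(yt == 0 and yp == 0 for yt, yp in zip(y_true, y_pred))
    fp = sum(yt == 0 and yp == 1 for yt, yp in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sens_spec([0, 0, 1, 1], [0, 0.2, 0.4, 0.8], t=0.3)
print(sens, spec)  # 1.0 1.0 at threshold 0.3
```
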

**Area under the curve (AUC).** The AUC value is a single number derived from the predicted probabilities of a classification rule and the true labels [32]. Given a vector of true labels Y and a vector of predicted probabilities $\widehat{Y}$, we can use the statistical package pROC (the package is called Display and Analyze ROC Curves; source: https://github.com/xrobin/pROC (2 November 2021)) to assist with this computation. The package uses automatically generated thresholds to convert $\widehat{Y}$ into binary format. For example, we can use a threshold of ${t}_{1}=0.3$ to convert a vector of predicted probabilities $\widehat{Y}=[0,0.2,0.4,0.8]$ into binary form by writing ${\widehat{Y}}_{{t}_{1}}=\mathbb{1}(\widehat{Y}>{t}_{1})=[0,0,1,1]$. Let us assume the true labels to be $Y=[0,0,1,1]$. Thus, we can compute the specificity as one and the sensitivity as one. We can then change the threshold to a different value to compute another pair of specificity and sensitivity. By tracking all pairs of specificity and sensitivity, we can generate a curve called the ROC curve [32]. The value of the AUC is exactly the area under the ROC curve. Now assume the predicted probabilities contain some mistakes; in other words, let us assume the predicted probability vector is $\widehat{Y}=[0,0.2,0.2,0.8]$. Two observations from different classes now share the same predicted probability, so at least one of them must be ranked incorrectly. This information is reflected using the same procedure (Figure 7).
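The threshold sweep described above is equivalent to a rank statistic: the AUC equals the probability that a randomly chosen positive receives a higher predicted probability than a randomly chosen negative, with ties counting one half. A sketch of this equivalence (our own code, not the pROC package):

```python
def auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random
    negative; cross-class ties contribute one half."""
    pos = [p for yt, p in zip(y_true, y_prob) if yt == 1]
    neg = [p for yt, p in zip(y_true, y_prob) if yt == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0, 0.2, 0.4, 0.8]))  # 1.0: perfect ranking
print(auc([0, 0, 1, 1], [0, 0.2, 0.2, 0.8]))  # 0.875: one cross-class tie
```

The tie in the second vector costs half of one of the four positive-negative comparisons, reducing the AUC from 1 to 3.5/4 = 0.875.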

#### 4.5. Performance: Proposed Models and AUC Values

**Model 1.** This model starts with input images sized 128 × 128. Using the parameters in set $\triangle$, we create the first interaction-based convolutional layer (i.e., 1st Conv. Layer in Table 10). This new matrix has dimension $\lfloor (128-6-2+1)/2+1\rfloor \times \lfloor (128-6-2+1)/2+1\rfloor =61\times 61=3721$. These 3721 variables are directly used to create the output layer with two units (assuming SoftMax is used in the output layer). Therefore, the total number of parameters for the network architecture is $3721\times 2=7442$. The test set performance, measured by the AUC, is 98.5% for Model 1.

**Model 2.** This model builds upon the architecture of Model 1. The only difference is that there is one hidden layer with 64 units (or neurons). We fully connect each variable in the 1st Conv. Layer with each neuron in the hidden layer; afterward, we fully connect the hidden layer with the output layer. This means that from the 1st Conv. Layer to the hidden layer, there are $3721\times 64=238,144$ parameters. From the hidden layer of 64 neurons to the output layer with two units, there are $64\times 2=128$ parameters. In total, there are $238,144+128=238,272$ parameters. The performance of this architecture is 99.7%. The design of this one hidden layer with 64 units reduced the error rate from 1.5% in Model 1 to 0.3% in Model 2, which is an 80% error reduction.

**Model 3.** This model has two interaction-based convolutional layers. The 1st Conv. Layer uses the set of parameters in $\triangle$ and the 2nd Conv. Layer uses the set of parameters in $\square$. From the 1st Conv. Layer in Model 1, we are left with $61\times 61=3721$ variables. Using the parameters in $\square$, we obtain a new matrix with size $\lfloor (61-1-2+1)/2+1\rfloor \times \lfloor (61-1-2+1)/2+1\rfloor =30\times 30=900$ variables. These 900 variables serve as the input layer, and we can pass them directly into the output layer for making predictions. In other words, the output layer has $900\times 2=1800$ parameters. The test set AUC value is 97.0%.

**Model 4.** This model is the deepest amongst all six models. Model 4 has two interaction-based convolutional layers and one hidden layer. From the 2nd Conv. Layer in Model 3, we are left with 900 variables. The architecture has one hidden layer with 64 units. The 900 variables are fully connected with the hidden layer, which creates $900\times 64=57,600$ parameters. From the hidden layer with 64 units to the output layer with 2 units, there are $64\times 2=128$ parameters. In total, there are $57,600+128=57,728$ parameters. The prediction performance is 99.6% on the test set.

**Model 5.** Both Models 5 and 6 have wider convolutional layers instead of aiming for depth. Model 5 has a concatenation of features from both convolutional layers. This means the architecture takes the 3721 variables from the 1st Conv. Layer and the 900 variables from the 2nd Conv. Layer of the previous models together as one large convolutional layer. In other words, Model 5 has a 1st Conv. Layer with $3721+900=4621$ variables. These 4621 variables can be fed directly into the output layer with two units. In total, this architecture creates $4621\times 2=9242$ parameters, with a test set performance of 98.3%.

**Model 6.** The last model, Model 6, is just as wide as the previous Model 5. Model 6 also has a first convolutional layer that is a concatenation of features; it has $3721+900=4621$ variables. In addition to Model 5's design, it has one hidden layer with 64 units. We fully connect the convolutional layer of 4621 variables with the hidden layer of 64 units, which results in $4621\times 64=295,744$ parameters. The hidden layer with 64 units is then fully connected with the output layer, which produces $64\times 2=128$ parameters. In total, the model has $295,744+128=295,872$ parameters. This model has the highest AUC value on the test set, i.e., 99.8%.
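The parameter counts of the six models above can be reproduced in a few lines (our own sketch; following the text, the counts cover fully connected weight matrices and exclude bias terms):

```python
def dense_params(n_in, n_out):
    """Weights of a fully connected layer, excluding biases (as in the text)."""
    return n_in * n_out

m1 = dense_params(61 * 61, 2)                                   # Model 1: 7442
m2 = dense_params(61 * 61, 64) + dense_params(64, 2)            # Model 2: 238,272
m3 = dense_params(30 * 30, 2)                                   # Model 3: 1800
m4 = dense_params(30 * 30, 64) + dense_params(64, 2)            # Model 4: 57,728
m5 = dense_params(61 * 61 + 30 * 30, 2)                         # Model 5: 9242
m6 = dense_params(61 * 61 + 30 * 30, 64) + dense_params(64, 2)  # Model 6: 295,872
print(m1, m2, m3, m4, m5, m6)
```
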

#### 4.6. Visualization: Images and Convolutional Layer

**Original Images in the 1st Conv. Layer.** The input images are sized 128 by 128. With the 1st Conv. Layer constructed, we have $61\times 61=3721$ new variables. We return to the same samples shown in the first row of Figure 9 and use these 3721 variables only. When we plot these samples with these new variables, we resize them back into a 61 by 61 matrix form. Panel A represents the COVID class and panel B represents the non-COVID class. In addition, we use Model 1 in Table 10 to produce the text stating the predicted probability of the COVID class. Red indicates that the ground truth is the COVID class (panel A) and green indicates that the ground truth is the non-COVID class (panel B).

**1st Conv. Layer to 2nd Conv. Layer.** From the resulting matrix of the 1st Conv. Layer, we are left with 3721 variables. We apply the proposed design in Table 10 to create a new convolutional layer, i.e., the 2nd Conv. Layer. This new layer has $30\times 30=900$ variables. We take the same 10 sampled images from before and use these 900 variables to present them. In this presentation, we resize these 900 variables into 30 by 30; in other words, we obtain a smaller matrix with similar patterns as before. We use Model 4 to generate the predicted probabilities. These probabilities are printed in the top left corner of each image, and they are color-coded as before (red probabilities indicate a ground truth of the COVID class and green probabilities indicate a ground truth of the non-COVID class).

**Visualization Interpretation.** The plot in Figure 9 of the original images for patients infected by COVID-19 shows grey and cloudy textures in the chest area. Because an X-ray image is at its brightest where most of the emitted beams are blocked by the object, bones show as white and the margin is completely black. For muscle and organs inside the human body, the emitted X-rays are only partially blocked; this causes the greyscale on the X-ray images in the chest area. For COVID-19 patients, there are grey and shaded areas in the chest X-rays. This is due to the inflammatory fluid produced when patients exhibit pneumonia-like symptoms. The fluid inside the chest area is a consequence of the human immune system fighting outside diseases. This shaded area (as seen in panel A in Figure 9) prevents us from observing clear areas in the lungs. This is different in panel B, where the lung areas are dark and almost black, because a healthy lung is filled with air (i.e., the black present in the X-ray images of normal cases). The black and white contrast in the two panels is directly related to how much inflammatory fluid is present in the lungs. This contrast translates to greyscale on the pictures and is directly related to COVID and non-COVID cases (i.e., the response variable Y). The same contrast can be seen using the new variables (these are the ${X}^{\dagger}$ based on Equation (6)) in the 1st Conv. Layer (sized 61 by 61). For COVID-19 patients, the lung area is cloudy and unclear, whereas for the healthy cases, it is clearly visible. This is not a surprising coincidence, because the proposed new variable modules, ${X}^{\dagger}$, are engineered using Equation (6), which relies on the response variable ${\overline{y}}_{j}$ in the training set. The 61 by 61 images from the proposed algorithm are a direct translation of not only the original pixels but also the response variable. In other words, this visualization presents how the I-score considers image data.

## 5. Conclusions

**Explainable AI System for Early COVID-19 Screening.** As the most important contribution of this paper, an explainable artificial intelligence (XAI) system is proposed to assist radiologists in the initial screening of COVID-19 and other related diseases using chest X-ray images for treatment and disease control. This innovation can revolutionize the application of AI systems in hospitals and healthcare systems. We anticipate that other related diseases with viral pneumonia signs can use the same detection methods proposed in our paper, which ensure the development of testing procedures with accountability, responsibility, and transparency for human users and patients.

**A Heuristic and Theoretical Framework of XAI.** This paper introduced a heuristic and theoretical framework for addressing the XAI problems in large-scale and high-dimensional data sets. We provided three criteria as necessary conditions and premises for a measure to be regarded as explainable and interpretable. The first dimension, $\mathcal{D}1$, states that an interpretable measure does not need to rely on the knowledge of the true model, because any mistakes made in model fitting would be carried over in explaining the features. The second dimension, $\mathcal{D}2$, states that an interpretable measure should be able to indicate the impact of a combination of variables on the response variable. This means that any inclusion of influential variables would increase this measure, whereas any injection of noisy and useless variables would decrease this measure. This desirable property allows human users to directly compare the impact of the features when any classifier is trained to make prediction decisions. Though we provided detailed work with an arbitrary image data set, the proposed method can be generalized and adapted to any big data problem. Moreover, it opens up future possibilities for feature selection and dimension reduction in any large-scale and high-dimensional data set. Last, the third dimension, $\mathcal{D}3$, associates an interpretable measure with the predictivity of a set of features. This property benefits human users because it allows us to establish connections and foresee the potential prediction performance (such as AUC values) that a set of features can deliver before any model fitting procedure.

**An ICNN.** To address the XAI problems heuristically described above, this paper introduced a novel design of an explainable and self-interpretable interaction-based convolutional neural network (ICNN).

**We provided a flexible approach to address the major issues regarding explainability, interpretability, transparency, and trustworthiness in black-box algorithms. We introduced and implemented a nonparametric and interaction-based feature selection methodology and used it as a replacement for the predefined filters that are widely used in ultra-deep CNNs. Under this paradigm, we presented an ICNN that extracts important features. The proposed architecture uses these extracted features to construct influential and predictive variable modules that are directly associated with the predictivity of the response variable. The proposed design and its many characteristics provide an extremely flexible pipeline that can learn, extract useful information, and identify the hidden potential from any large-scale or high-dimensional data set.** The proposed methods were presented with both artificial examples and real data applications to COVID-19 chest X-ray image data. We conclude from both the simulation and the empirical application results that the I-score has unparalleled potential to explain informative and influential local information in large-scale data sets. High I-score values suggest that local information possesses a higher lower bound of the predictivity, which leads not only to highly accurate prediction performance but also to strong explanatory power. By arranging features according to the I-score from high to low, we are able to tailor the dimensions of our model to any neural network architecture. Furthermore, we also show potential applications of the interaction-based neural network architecture, which can help advance the field of explainable artificial intelligence. We think that the proposed design can be adapted to any type of CNN. Thus, any CNN architecture that adopts the proposed technology can be regarded as an interaction-based convolutional neural network (ICNN or interaction-based network).
We encourage both the statistics and computer science communities to further explore this area to increase the transparency, trustworthiness, and accountability of deep learning algorithms and to build a world with truly responsible AI.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Velavan, T.P.; Meyer, C.G. The COVID-19 epidemic. Trop. Med. Int. Health **2020**, 25, 278.
- Li, F.; Wang, Y.; Li, X.Y.; Nusairat, A.; Wu, Y. Gateway placement for throughput optimization in wireless mesh networks. Mob. Netw. Appl. **2008**, 13, 198–211.
- Wang, S.; Kang, B.; Ma, J.; Zeng, X.; Xiao, M.; Guo, J.; Cai, M.; Yang, J.; Li, Y.; Meng, X.; et al. A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19). Eur. Radiol. **2021**, 31, 6096–6104.
- Li, Q.; Guan, X.; Wu, P.; Wang, X.; Zhou, L.; Tong, Y.; Ren, R.; Leung, K.S.; Lau, E.H.; Wong, J.Y.; et al. Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia. N. Engl. J. Med. **2020**, 382, 1199–1207.
- Ai, T.; Yang, Z.; Hou, H.; Zhan, C.; Chen, C.; Lv, W.; Tao, Q.; Sun, Z.; Xia, L. Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases. Radiology **2020**, 296, E32–E40.
- Aloysius, N.; Geetha, M. A review on deep convolutional neural networks. In Proceedings of the 2017 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 6–8 April 2017; pp. 0588–0592.
- Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. **2020**, 53, 5455–5516.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. **2012**, 25, 1097–1105.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
- Katona, J.; Ujbanyi, T.; Sziladi, G.; Kovari, A. Examine the effect of different web-based media on human brain waves. In Proceedings of the 2017 8th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Debrecen, Hungary, 11–14 September 2017; pp. 000251–000256.
- Katona, J.; Kovari, A. Speed control of Festo Robotino mobile robot using NeuroSky MindWave EEG headset based brain-computer interface. In Proceedings of the 2016 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Wroclaw, Poland, 16–18 October 2016; pp. 000251–000256.
- Katona, J.; Ujbanyi, T.; Sziladi, G.; Kovari, A. Electroencephalogram-Based Brain-Computer Interface for Internet of Robotic Things. Available online: https://link.springer.com/chapter/10.1007/978-3-319-95996-2_12#citeas (accessed on 2 November 2021).
- Katona, J. Analyse the Readability of LINQ Code using an Eye-Tracking-based Evaluation. Acta Polytech. Hung. **2021**, 18, 193–215.
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv **2017**, arXiv:1702.08608.
- Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. **2019**, 267, 1–38.
- Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access **2018**, 6, 52138–52160.
- DARPA. Broad Agency Announcement, Explainable Artificial Intelligence (XAI). Available online: https://www.darpa.mil/attachments/DARPA-BAA-16-53.pdf (accessed on 2 November 2021).
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. **2019**, 1, 206–215.
- Lo, A.; Chernoff, H.; Zheng, T.; Lo, S.H. Framework for making better predictions by directly estimating variables’ predictivity. Proc. Natl. Acad. Sci. USA **2016**, 113, 14277–14282.
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst.
**1990**, 2, 396–404. [Google Scholar] - Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv
**2016**, arXiv:1610.02357. [Google Scholar] - Chernoff, H.; Lo, S.H.; Zheng, T. Discovering influential variables: A method of partitions. Ann. Appl. Stat.
**2009**, 3, 1335–1369. [Google Scholar] [CrossRef][Green Version] - Lo, S.; Zheng, T. Backward haplotype transmission association algorithm—A fast multiple-marker screening method. Hum. Hered.
**2002**, 53, 197–215. [Google Scholar] [CrossRef][Green Version] - Lo, A.; Chernoff, H.; Zheng, T.; Lo, S.H. Why significant variables aren’t automatically good predictors. Proc. Natl. Acad. Sci. USA
**2015**, 112, 13892–13897. [Google Scholar] [CrossRef][Green Version] - Wang, H.; Lo, S.H.; Zheng, T.; Hu, I. Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics
**2012**, 28, 2834–2842. [Google Scholar] [CrossRef][Green Version] - LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput.
**1989**, 1, 541–551. [Google Scholar] [CrossRef] - Yann, L.; Corinna, C.; Christopher, J. The Mnist Database of Handwritten Digits. 1998. Available online: http://yhann.lecun.com/exdb/mnist (accessed on 2 November 2021).
- Bai, H.X.; Wang, R.; Xiong, Z.; Hsieh, B.; Chang, K.; Halsey, K.; Tran, T.M.L.; Choi, J.W.; Wang, D.C.; Shi, L.B.; et al. Artificial intelligence augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other origin at chest CT. Radiology
**2020**, 296, E156–E165. [Google Scholar] [CrossRef] [PubMed] - Minaee, S.; Kafieh, R.; Sonka, M.; Soufi, G. Deep-COVID: Predicting COVID-19 from Chest X-ray Images Using Deep Transfer Learning. Med. Image Anal.
**2020**, 65, 101794. [Google Scholar] [CrossRef] [PubMed] - Hand, D.J. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Mach. Learn.
**2009**, 77, 103–123. [Google Scholar] [CrossRef][Green Version]

**Figure 1.** DARPA document (DARPA-BAA-16-53) proposed this figure to illustrate a basic Explainable AI problem [19,20]. It presents the relationship between learning performance (usually measured by prediction performance) and the effectiveness of explanations (also known as explainability). The proposed method in our work aims to take any deep learning method and provide explainability without sacrificing prediction performance. In the diagram, the proposed method is the orange dot in the upper right corner of the relationship plot.

**Figure 2.** This executive diagram summarizes the key components of the methods proposed in this paper. We start with the COVID-19 image data. With a small rolling window defined, we execute the backward dropping algorithm (BDA) to select the important features within this window. For example, the rolling window may cover 4 variables, $\{{X}_{1},{X}_{2},{X}_{3},{X}_{4}\}$, and the BDA could select $\{{X}_{1},{X}_{2}\}$ as a variable module. Then, we can construct a new variable using the interaction-based feature engineering technique (see the construction of ${X}^{\dagger}$ in Equation (6) to appreciate this new design). In other words, using the selected variables ${X}_{1}$ and ${X}_{2}$, we construct ${X}^{\dagger}$. The procedure of the BDA is illustrated in the bottom left corner of the figure (we use a 2 × 2 window for simplicity). We set the starting point to be 12 (i.e., we start from the pixel in the 12th row and the 12th column). From the data (size 128 × 128) to the 1st layer (58 × 58), this gives us a new dimension computed as $\lfloor (128-12-2+1)/2+1\rfloor =58$ (see Equation (18)). We repeat the process for the 2nd layer and the 3rd layer. After the 3rd layer, we shrink the dimension to 14 × 14 (which gives us 196 new variable modules, i.e., the new ${X}^{\dagger}$). We fully connect these 196 variable modules with the 10 neurons in the hidden layer (in practice, the number of hidden layers and the number of neurons are tuning parameters). This novel design is fundamentally different from the conventional practice of using pretrained filters because it uses an explainable measure, the I-score, to extract and build information directly from the images in the training data. For each local variable (in the data, this refers to pixels; in the layers, we refer to variables), we compute the I-score value and the AUC (see Section 4.4 for a detailed discussion of AUC values) for that variable (using this variable as a predictor alone). We observe that the I-score value fully represents the predictivity of each local variable, which can be confirmed by the variable's AUC value. The color spectra of both the I-score and the AUC are presented in the bottom part of the diagram. The I-score values exhibit behavior parallel to the AUC values: variables with high I-score values have high AUC values, which indicates strong predictive power for the information in that location. This design relies heavily on the I-score and has an architecture that is interpretable at each location of the image at each convolutional layer. More importantly, the proposed design satisfies all three dimensions ($\mathcal{D}1$, $\mathcal{D}2$, and $\mathcal{D}3$ in the Introduction) of the definition of interpretability and explainability.
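The layer-size formula quoted in the caption is easy to check directly. The following is a small sketch (the function name is ours, not the paper's) of the floor computation in Equation (18):

```python
import math

def icnn_output_size(n, start, window, stride):
    """Edge length of an interaction-based convolutional layer, following the
    floor formula in Equation (18): floor((n - start - window + 1)/stride + 1)."""
    return math.floor((n - start - window + 1) / stride + 1)

# Caption example: 128 x 128 input, starting point 12, 2 x 2 window, stride 2.
print(icnn_output_size(128, 12, 2, 2))  # 58, matching the 58 x 58 first layer
```

The same helper reproduces the other sizes used in this paper: a starting point of 6 on the 128 × 128 input gives 61 (the 61 × 61 first layer in Figure 9), and a starting point of 1 on a 61 × 61 layer gives 30.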

**Figure 3.** The architecture of the interaction-based convolutional neural network (ICNN). In Panel **A**, the input matrix is 64 × 64. Suppose the small rolling window is 4 × 4. By rolling this small window from the top left corner to the bottom right corner of the input matrix, we create a 61 × 61 output matrix. These 61 × 61 variables are then used to build a neural network classifier with a hidden layer that has 32 units. The number of hidden layers and the number of units per layer are tuning parameters. Panel **B** depicts a much deeper architecture, with each layer adopting the design in Panel **A**.


**Figure 4.** The network architecture with simulated data. The artificial data has $6\times 6=36$ variables, which can be arranged in a grid with shape 6 × 6. We use a window size of 2 × 2. By passing this window from the top left corner to the bottom right corner of the original 6 × 6 matrix, we create a new 5 × 5 matrix. The number 32 is the number of units in the hidden layer. In this example, there is one hidden layer, which is sufficient for the dimension of the data.

**Figure 5.** The above architecture presents a feed-forward neural network with three input variables. The input variables are $\{{X}_{1}^{\dagger},{X}_{2}^{\dagger},{X}_{3}^{\dagger}\}$, which are variable modules created using Equation (6).

**Figure 6.** A total of 16 images randomly sampled from each of the (**A**) COVID class and (**B**) non-COVID class. The images for the COVID class appear cloudier and more unclear than the images for the non-COVID class. This is because the X-ray images for the COVID class contain substances that are not air. These substances may be liquid, germs, or inflammatory fluid, which causes the images to have cloudy, unclear, and shady areas.


**Figure 7.** Two paths of ROC curves: ROC1 and ROC2. From each ROC curve, we can compute an AUC value (AUC1 and AUC2, respectively). The mistake discussed in this section is reflected by a reduction in the AUC value from path 1 to path 2.
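As a reference for how the AUC values reported throughout this paper can be computed, here is a minimal sketch (our own helper, not code from the paper) using the rank interpretation of the AUC: the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
import numpy as np

def auc_score(y_true, y_score):
    # AUC via the Mann-Whitney formulation: over all positive/negative pairs,
    # count pairs where the positive is scored higher; ties count one half.
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This pairwise count equals the area under the empirical ROC curve, so it agrees with trapezoid-based implementations.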

**Figure 8.**The AUC path for all six models in the proposed work. These models are listed in Table 10 with detailed information including the parameters from each layer and the out-of-sample prediction performance.

**Figure 9.** A summary of randomly sampled images from the COVID and non-COVID classes (10 each). Panel **A** represents COVID patients and Panel **B** represents non-COVID individuals. The first row plots the original 128 by 128 images. The 1st Conv. Layer generates $61\times 61=3721$ new variables. We plot the same 10 images from both classes using these 3721 variables in the second row, and we provide the predicted COVID probabilities in the top left corner of each image. The 2nd Conv. Layer generates $30\times 30=900$ variables. We plot the same 10 images from both classes using these 900 variables in the third row, again with the predicted COVID probabilities in the top left corner of each image, assuming only these 900 variables are used as predictors. The plot of the original images for patients infected by COVID-19 has grey and cloudy textures in the chest area, which are due to inflammatory fluid produced when patients exhibit pneumonia-like symptoms. This shaded area (as seen in Panel **A**) prevents us from clearly observing the lungs. This is different in Panel **B**, where the lung areas are dark and almost black, which means the lungs are filled with air (i.e., normal cases). The black/white contrast in the two panels is directly related to the amount of inflammatory fluid in the lungs, which translates to greyscale on the pictures. The same contrast can be seen using the new variables (these are ${X}^{\dagger}$ based on Equation (6)) in the 1st Conv. Layer (sized 61 by 61). For COVID-19 patients, the lung area is cloudy and unclear, whereas for the healthy cases, it is clearly visible.

Figure 9 layout. Panel A (true label: COVID) | Panel B (true label: non-COVID). Row 1: input images, 128 by 128 (10 randomly selected samples per class). Row 2: 1st Conv. Layer, 61 by 61 (starting point = 6, window 2 by 2, stride = 2; $61\times 61=3721$ variables; labels predicted using Model 1). Row 3: 2nd Conv. Layer, 30 by 30 (starting point = 6, window 2 by 2, stride = 2; $30\times 30=900$ variables; labels predicted using Model 4).

Name | Number of Parameters |
---|---|
LeNet [22] | 60,000 |
AlexNet [8] | 60 million |
ResNet50 [9] | 25 million |
DenseNet [10] | 0.8–40 million |
VGG16 [11] | 138 million |

Step | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Drop | Start | ${\mathit{X}}_{8}$ | ${\mathit{X}}_{7}$ | ${\mathit{X}}_{2}$ |
I-score | 160.18 | 319.65 | 638.17 | 0.65 |
Investigation by iterative dropping | ${X}_{1}$ | ${X}_{1}$ | ${X}_{1}$ | ${X}_{1}$ |
 | ${X}_{2}$ | ${X}_{2}$ | ${X}_{2}$ | |
 | ${X}_{7}$ | ${X}_{7}$ | | |
 | ${X}_{8}$ | | | |
Best result: | $\{{X}_{1},{X}_{2}\}$ | | | |
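The dropping path in the table can be reproduced in spirit with a short greedy routine. The sketch below is our own illustration, not the paper's code: `i_score` uses one common normalization of the influence score (cell-count-squared deviations of partition means, scaled by n times the variance of Y; the paper's exact form may differ), and `backward_dropping` drops whichever variable's removal maximizes the score, returning the subset where the score peaked.

```python
import numpy as np

def i_score(X, y):
    # Influence score (I-score): partition samples by the joint levels of the
    # selected variables, then sum squared deviations of cell means from the
    # grand mean, weighted by squared cell counts. The normalization by
    # n * var(y) is one common convention, assumed here for illustration.
    y = np.asarray(y, dtype=float)
    cells = {}
    for key, yi in zip(map(tuple, np.asarray(X)), y):
        cells.setdefault(key, []).append(yi)
    ybar, var = y.mean(), y.var()
    total = sum(len(v) ** 2 * (np.mean(v) - ybar) ** 2 for v in cells.values())
    return total / (len(y) * var) if var > 0 else 0.0

def backward_dropping(X, y, cols):
    # Greedy BDA: at each step, drop the variable whose removal gives the
    # highest I-score; return the subset at which the score peaked.
    current = list(cols)
    best_subset, best_score = list(current), i_score(X[:, current], y)
    while len(current) > 1:
        score, drop = max(
            (i_score(X[:, [c for c in current if c != d]], y), d) for d in current
        )
        current.remove(drop)
        if score > best_score:
            best_score, best_subset = score, list(current)
    return best_subset, best_score
```

On a toy binary data set where the response is X1 XOR X2 and the remaining columns are pure noise, the routine returns the module {X1, X2}, mirroring the table's best result: dropping a noise variable shrinks the partition cells and raises the score, while dropping a signal variable collapses it.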

**Table 3.** Interaction-based convolutional layer. For an artificial data set with ${6}^{2}=36$ variables and a window size of $2\times 2$, we pass the window from the top left corner down to the bottom right corner. For each particular location, we have 4 variables with which to run the BDA. Before, each observation had 36 features that could be shaped into $6\times 6$. Afterward, each observation has 25 new features with the shape of $5\times 5$. In other words, we create ${X}_{b}^{\dagger}$, where $b=\{1,2,\dots ,25\}$. The asterisk indicates extremely influential variable module(s). For each variable module, we also present the AUC value (see Section 4.4 for a detailed discussion of AUC values) assuming a classifier is built using this variable module alone.

New Mod. | Variables | I-Score | AUC | New Mod. | Variables | I-Score | AUC |
---|---|---|---|---|---|---|---|
${X}_{1}^{\dagger}$ * | X1, X2 | 638.17 | 0.75 | ${X}_{14}^{\dagger}$ | X28 | 1.3729 | 0.50 |
${X}_{2}^{\dagger}$ | X7 | 1.2162 | 0.50 | ${X}_{15}^{\dagger}$ | X28 | 1.3729 | 0.50 |
${X}_{3}^{\dagger}$ | X13, X20 | 2.3597 | 0.51 | ${X}_{16}^{\dagger}$ | X11 | 0.2347 | 0.50 |
${X}_{4}^{\dagger}$ | X19, X20, X26 | 0.7218 | 0.50 | ${X}_{17}^{\dagger}$ | X11 | 0.2347 | 0.50 |
${X}_{5}^{\dagger}$ | X26, X31 | 2.6751 | 0.50 | ${X}_{18}^{\dagger}$ | X16, X22 | 0.0777 | 0.51 |
${X}_{6}^{\dagger}$ | X8, X9 | 0.5067 | 0.49 | ${X}_{19}^{\dagger}$ | X28 | 1.3729 | 0.51 |
${X}_{7}^{\dagger}$ | X8, X9 | 0.5067 | 0.50 | ${X}_{20}^{\dagger}$ | X28 | 1.3729 | 0.51 |
${X}_{8}^{\dagger}$ | X15, X21 | 1.8013 | 0.50 | ${X}_{21}^{\dagger}$ | X6, X12 | 0.4378 | 0.49 |
${X}_{9}^{\dagger}$ | X20, X21, X26, X27 | 0.7554 | 0.50 | ${X}_{22}^{\dagger}$ | X11, X12 | 0.6184 | 0.51 |
${X}_{10}^{\dagger}$ | X27, X32 | 1.017 | 0.50 | ${X}_{23}^{\dagger}$ | X18, X24 | 1.3814 | 0.51 |
${X}_{11}^{\dagger}$ | X9, X10 | 0.6894 | 0.50 | ${X}_{24}^{\dagger}$ | X23, X24, X29 | 0.8788 | 0.51 |
${X}_{12}^{\dagger}$ | X9, X10, X15 | 0.9346 | 0.51 | ${X}_{25}^{\dagger}$ | X30, X35 | 1.2105 | 0.51 |
${X}_{13}^{\dagger}$ | X15, X16, X21, X22 | 1.0933 | 0.50 | | | | |

**Table 4.** Interaction-based convolutional layer. For an artificial data set with ${6}^{2}=36$ variables and a window size of $3\times 3$, we create 16 new features that can be shaped into $4\times 4$. The procedure for generating these 16 new features follows the same procedure as in Table 3. The only difference is the window size, i.e., here, we use $3\times 3$. In other words, we create ${X}_{b}^{\dagger}$, where $b=\{1,2,\dots ,16\}$. The asterisk indicates extremely influential variable module(s). For each variable module, we also present the AUC value (see Section 4.4 for a detailed discussion of AUC values) assuming a classifier is built using this variable module alone.

New Mod. | Variables | I-Score | AUC | New Mod. | Variables | I-Score | AUC |
---|---|---|---|---|---|---|---|
${X}_{1}^{\dagger}$ * | X1, X2 | 638.17 | 0.75 | ${X}_{9}^{\dagger}$ * | X3, X4, X5 | 350.2429 | 0.75 |
${X}_{2}^{\dagger}$ | X8, X9, X13, X15, X21 | 0.6344 | 0.50 | ${X}_{10}^{\dagger}$ | X11 | 0.2347 | 0.51 |
${X}_{3}^{\dagger}$ | X19, X21, X25 | 1.4386 | 0.50 | ${X}_{11}^{\dagger}$ | X16, X17, X21, X22 | 0.8097 | 0.51 |
${X}_{4}^{\dagger}$ | X19, X21, X25 | 1.4386 | 0.50 | ${X}_{12}^{\dagger}$ | X28 | 1.3729 | 0.51 |
${X}_{5}^{\dagger}$ | X8, X9 | 0.5067 | 0.51 | ${X}_{13}^{\dagger}$ | X11 | 0.2347 | 0.51 |
${X}_{6}^{\dagger}$ | X10, X15, X21 | 0.9883 | 0.50 | ${X}_{14}^{\dagger}$ | X18, X24 | 1.3814 | 0.51 |
${X}_{7}^{\dagger}$ | X14, X15, X21, X22, X26 | 0.9816 | 0.50 | ${X}_{15}^{\dagger}$ | X18, X24 | 1.3814 | 0.51 |
${X}_{8}^{\dagger}$ | X20, X32, X33 | 2.0205 | 0.50 | ${X}_{16}^{\dagger}$ | X22, X23, X24, X29, X34 | 1.5684 | 0.51 |

**Table 5.** The performance of a simulation. In this simulation, we defined the underlying model to be (7). The theoretical prediction rate is 75%. Conventional models such as Net-3 and LeNet-5 do not perform well in this example [28,29]. The proposed method that uses the I-score has prediction performance close to the theoretical prediction rate (see Section 4.4 for a detailed discussion of AUC values). NN, neural network. This is discussed in Section 4.2; please review the architecture in Figure 5 for a detailed description.

Algorithm | Test AUC |
---|---|
Theoretical Prediction | 0.75 |
Net-3 | 0.50 |
LeNet-5 | 0.50 |
Interaction-based Conv. Layer: window size $2\times 2$ (25 modules listed in Table 3), using $\widehat{Y}$ defined in Equation (10) | |
Interaction-based Conv. Layer + NN | 0.75 |
Interaction-based Conv. Layer: window size $3\times 3$ (16 modules listed in Table 4), using $\widehat{Y}$ defined in Equation (11) | |
Interaction-based Conv. Layer + NN | 0.76 |

**Table 6.** Simulation results for Model (12). The theoretical prediction rate was calculated using Equation (13) as 75%. In other words, we expected prediction performance on the out-of-sample test set to be approximately 75% on average. For each experiment below, the out-of-sample test set had 1000 sample data points, and the performance was calculated using the area under the curve (AUC) from the receiver operating characteristic (ROC). For each experiment, we changed the in-sample training size to be 50, 100, or 1000, and we fixed all data to have 49 variables. For each pair of in-sample training size and number of variables, we ran experiments using all original variables, i.e., "All Var.", with a variety of different classifiers. Alternatively, we used the proposed method to construct new variable modules, which were used as new features for the same classifier. The table shows that the proposed method provides improved prediction performance regardless of sample size or classifier used.

Variables: $7\times 7$; entries are out-of-sample test AUC.

Training Sample Size | Algorithm | Bagging | Logistic | RF | iRF | NN |
---|---|---|---|---|---|---|
50 | All Var. | 0.52 | 0.52 | 0.51 | 0.51 | 0.51 |
 | $3\times 3$ window, 25 mod. | 0.56 | 0.57 | 0.54 | 0.55 | 0.55 |
 | $4\times 4$ window, 16 mod. | 0.60 | 0.60 | 0.57 | 0.59 | 0.57 |
100 | All Var. | 0.53 | 0.52 | 0.50 | 0.51 | 0.52 |
 | $3\times 3$ window, 25 mod. | 0.60 | 0.58 | 0.55 | 0.55 | 0.55 |
 | $4\times 4$ window, 16 mod. | 0.64 | 0.62 | 0.58 | 0.59 | 0.58 |
1000 | All Var. | 0.60 | 0.54 | 0.53 | 0.59 | 0.53 |
 | $3\times 3$ window, 25 mod. | 0.76 | 0.75 | 0.69 | 0.74 | 0.80 |
 | $4\times 4$ window, 16 mod. | 0.77 | 0.76 | 0.71 | 0.75 | 0.77 |

**Table 7.** The simulation results for Model (12). The theoretical prediction rate was calculated using Equation (13) as 75%. In other words, we expected the prediction performance on the out-of-sample test set to be approximately 75% on average. For each experiment below, the out-of-sample test set had 1000 sample data points, and the performance was calculated using the AUC from the ROC (see Section 4.4 for a detailed discussion of AUC values). Continuing from Table 6, we fixed the in-sample training size to 1000 data points and allowed the number of variables in the toy data to be $12\times 12=144$ or $14\times 14=196$. Compared with the 49 variables in Table 6, this is a more challenging situation because it lowers the chance of finding the correct variable modules.

Training sample size: 1000; entries are out-of-sample test AUC.

Variables | Algorithm | Bagging | Logistic | RF | iRF | NN |
---|---|---|---|---|---|---|
$12\times 12$ | All Var. | 0.54 | 0.52 | 0.51 | 0.53 | 0.51 |
 | I-score Modules: | | | | | |
 | $2\times 2$ window, 121 mod. | 0.77 | 0.68 | 0.61 | 0.72 | 0.74 |
 | $3\times 3$ window, 100 mod. | 0.78 | 0.69 | 0.63 | 0.72 | 0.76 |
 | $4\times 4$ window, 81 mod. | 0.78 | 0.72 | 0.64 | 0.72 | 0.76 |
$14\times 14$ | All Var. | 0.54 | 0.52 | 0.50 | 0.52 | 0.51 |
 | I-score Modules: | | | | | |
 | $2\times 2$ window, 169 mod. | 0.77 | 0.64 | 0.59 | 0.70 | 0.72 |
 | $3\times 3$ window, 144 mod. | 0.77 | 0.70 | 0.60 | 0.70 | 0.73 |
 | $4\times 4$ window, 121 mod. | 0.77 | 0.70 | 0.62 | 0.71 | 0.73 |

**Table 8.** The dimensions of the data. We downloaded the COVID data from Minaee et al. [31]. This totalled 576 COVID images and 2000 non-COVID images. First, we split the test set from the total images. The test set consisted of 60 COVID cases and 60 non-COVID cases. We were then left with 516 images for the COVID class and 1940 images for the non-COVID class. This was our in-sample set, which we used for training and validating (Tr. and Val., respectively). For the in-sample set, we upsampled the images by adding noise drawn from a normal distribution. This produced 5000 COVID images and 5000 non-COVID images for training and validation. The out-of-sample test set had 120 observations, which were only used at the end to verify the learning performance.

Data | COVID | Non-COVID |
---|---|---|
Total Data Downloaded from [31] | 576 | 2000 |
Out-of-Sample: Test | 60 | 60 |
In-Sample: Tr. and Val. | 516 | 1940 |
In-Sample: Tr. and Val. (upsampled) | 5000 | 5000 |
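The noise-based upsampling described in the caption can be sketched as follows. This is our own minimal illustration: the noise level `sigma`, the clipping to [0, 1], and the sampling-with-replacement scheme are assumptions, since the paper does not specify these details here.

```python
import numpy as np

def upsample_with_noise(images, target, sigma=0.01, seed=0):
    # Draw `target` images with replacement from the class, then perturb each
    # copy with Gaussian noise, keeping pixel values inside [0, 1].
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(images), size=target)
    noisy = images[idx] + rng.normal(0.0, sigma, size=(target,) + images.shape[1:])
    return np.clip(noisy, 0.0, 1.0)

# Stand-in for the 516 COVID images (8 x 8 here to keep the demo light;
# the paper's images are 128 x 128), upsampled to 5000 as in Table 8.
covid = np.full((516, 8, 8), 0.5)
covid_up = upsample_with_noise(covid, 5000)
print(covid_up.shape)  # (5000, 8, 8)
```

The added noise keeps the duplicated samples from being exact copies, which is the point of this augmentation step.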

**Table 9.** The experimental results on the COVID-19 data set from the literature. A number of different ultra-deep CNNs were used to classify COVID patients from non-COVID people. The performance is summarized below. * Minaee et al. disclosed that SqueezeNet has approximately 50 times fewer parameters than AlexNet. The proposed architecture achieves AUC values ranging from 98% to 99.8%. For details, please refer to Figure 8 and Table 10. The average number of parameters of the ultra-deep CNNs can exceed 25 million with a top AUC value of 99.2%. The proposed method has an average number of parameters of less than 100,000 with a top AUC value of 99.8%. This is a 99% reduction in the number of parameters without sacrificing prediction performance.

Previous Work | Number of Param. | AUC |
---|---|---|
DenseNet161 [31] | 0.8–40 M param. | 97.6% |
ResNet18 [31] | 11 M param. | 98.9% |
ResNet50 [31] | 25 M param. | 99.0% |
SqueezeNet [31] | ∼1.2 M param. * | 99.2% |
Average | >25 M | 97–99.2% |
Proposed Methods | Average 100,000 param. (a 99% reduction in no. of param.) | 98.3–99.8% |

**Table 10.** A summary of the design statistics of the proposed network, the interaction-based convolutional neural network (ICNN), for Models 1–6. Each model can take one or two interaction-based convolutional layers (i.e., the 1st Conv. or 2nd Conv. Layer). A model can be designed to proceed directly from the interaction-based convolutional layer to the output layer. For example, Models 1 and 3 proceed directly from the convolutional layer to the output layer, i.e., with no hidden layer.

Proposed Work | 1st Conv. | 2nd Conv. | Hidden | Output Layer | Num. of Param. | AUC |
---|---|---|---|---|---|---|
Model 1 | $\triangle$ | None | None | 2 | 7442 | 98.5% |
Model 2 | $\triangle$ | None | 1L (64 units) | 2 | 238,272 | 99.7% |
Model 3 | $\triangle$ | $\square$ | None | 2 | 1800 | 97.0% |
Model 4 | $\triangle$ | $\square$ | 1L (64 units) | 2 | 57,728 | 99.6% |
Model 5 | $\triangle + \square$ | None | None | 2 | 9242 | 98.3% |
Model 6 | $\triangle + \square$ | None | 1L (64 units) | 2 | 295,872 | 99.8% |
Remark | $\triangle$: Starting Point = 6, Window Size 2 by 2, Stride = 2, Output 61 by 61 | $\square$: Starting Point = 1, Window Size 2 by 2, Stride = 2, Output 30 by 30 | | | | |

**Table 11.** Multiclass prediction data: a summary of the number of training, validation, and test samples in the multiclass X-ray image classification.

Classes | Train | Validate | Test |
---|---|---|---|
Healthy | 437 | 44 | 52 |
Tuberculosis | 422 | 41 | 52 |
Pneumonia | 88 | 9 | 1 |
COVID-19 | 88 | 9 | 11 |
Total | 994 | 99 | 121 |

**Table 12.** Multiclass lung disease diagnosis: the experimental results for multiclass lung disease classification. In total, there are 4 classes (0: healthy, 1: COVID-19, 2: pneumonia, and 3: tuberculosis). The table summarizes the benchmark performance as well as the prediction performance of the proposed method. All results are measured as the area under the curve (AUC). Multiclass AUC values are the average of the AUC values from the 4 different classes, calculated using a statistical software package named pROC, which uses the same computations of sensitivity and specificity (these definitions are discussed in Equations (19) and (20)). The average prediction performance for 4-class diagnosis is 89% with 26 million parameters across a variety of different neural network designs. The average prediction performance for 4-class diagnosis is 97.5% with only 13,000 parameters in the proposed network architecture.

Model | AUC (Test Set) | No. of Parameters |
---|---|---|
Proposed: | | |
ICNN (starting point: 6, window size: 2 by 2, stride: 2) | 0.97 | 12,000 |
ICNN (starting point: 4, window size: 3 by 3, stride: 3) | 0.98 | 13,000 |
Average | 0.975 | 12,500 |
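The per-class averaging described in the caption can be sketched with a one-vs-rest macro average. This is our own illustration of the idea, not the pROC implementation (pROC's multiclass routines use their own averaging of pairwise class comparisons):

```python
import numpy as np

def ovr_auc(y_true, prob, k):
    # One-vs-rest AUC for class k via the pairwise-comparison formulation:
    # P(score of a class-k sample > score of a non-class-k sample).
    pos = prob[y_true == k, k]
    neg = prob[y_true != k, k]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def macro_multiclass_auc(y_true, prob):
    # Average the per-class one-vs-rest AUC values, as the caption describes
    # for the 4 classes (healthy, COVID-19, pneumonia, tuberculosis).
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    return float(np.mean([ovr_auc(y_true, prob, k) for k in range(prob.shape[1])]))

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
prob = np.eye(4)[y_true]                 # perfect one-hot predictions
print(macro_multiclass_auc(y_true, prob))  # 1.0
```

With perfect one-hot predictions each one-vs-rest AUC is 1, so the macro average is 1.0; degraded probabilities for any single class pull the average down proportionally.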

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lo, S.-H.; Yin, Y.
An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images. *Algorithms* **2021**, *14*, 337.
https://doi.org/10.3390/a14110337

**AMA Style**

Lo S-H, Yin Y.
An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images. *Algorithms*. 2021; 14(11):337.
https://doi.org/10.3390/a14110337

**Chicago/Turabian Style**

Lo, Shaw-Hwa, and Yiqiao Yin.
2021. "An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images" *Algorithms* 14, no. 11: 337.
https://doi.org/10.3390/a14110337