Radiomics-Based Detection of COVID-19 from Chest X-ray Using Interpretable Soft Label-Driven TSK Fuzzy Classifier

The COVID-19 pandemic has posed a significant global public health threat with an escalating number of new cases and death toll daily. The early detection of COVID-related CXR abnormality potentially allows the early isolation of suspected cases. Chest X-Ray (CXR) is a fast and highly accessible imaging modality. Recently, a number of CXR-based AI models have been developed for the automated detection of COVID-19. However, most existing models are difficult to interpret due to the use of incomprehensible deep features in their models. Confronted with this, we developed an interpretable TSK fuzzy system in this study for COVID-19 detection using radiomics features extracted from CXR images. There are two main contributions. (1) When TSK fuzzy systems are applied to classification tasks, the commonly used binary label matrix of training samples is transformed into a soft one in order to learn a more discriminant transformation matrix and hence improve classification accuracy. (2) Based on the assumption that the samples in the same class should be kept as close as possible when they are transformed into the label space, the compactness class graph is introduced to avoid overfitting caused by label matrix relaxation. Our proposed model for a multi-categorical classification task (COVID-19 vs. No-Findings vs. Pneumonia) was evaluated using 600 CXR images from publicly available datasets and compared against five state-of-the-art AI models in aspects of classification accuracy. Experimental findings showed that our model achieved classification accuracy of over 83%, which is better than the state-of-the-art models, while maintaining high interpretability.


Introduction
Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus (SARS-CoV-2), which has rapidly spread worldwide and posed significant public health threats. As of 25 January 2021, there have been more than 99.257 million confirmed cases of COVID-19 and 2,131,000 cumulative deaths worldwide, according to the Global COVID-19 Status Report released by Johns Hopkins University (https://coronavirus. jhu.edu/map.html (accessed on 1 August 2021)). The number of confirmed/fatal cases is still on the rise. Currently, the main screening methods for COVID-19 cases include reverse transcription polymerase chain reaction (RT-PCR) or gene sequencing for respiratory or blood specimens [1]. However, the overall RT-PCR positive rate of throat swab samples is reported to be 30 to 60%; thus, subjects with false-negative test results may continue to transmit the virus within a community [2]. CXR is a routine diagnostic tool incorporated in the initial diagnosis and subsequent monitoring for COVID-19 patients. It boasts short examination time, low radiation dose and low cost of examination. CXR images show visual indexes correlated with COVID-19 [3]. Reported CXR features include airspace opacities, which can show multilobar involvement, peripheral dominance, asymmetric In this regard, Takagi-Sugeno-Kang (TSK) fuzzy systems are also in great favor owing to their capability to maintain outstanding balance between approximation ability and interpretability [24][25][26]. Although fuzzy systems were originally designed for regression or binary tasks (it can be considered as special regression tasks), by virtue of strategic approaches, fuzzy systems have achieved satisfactory performance in clinical diagnosis, mostly in disease classification. For example, Jiang et al. proposed a multi-view TSK fuzzy system for epileptic EEG signal detection [27]. Hsieh et al. combined an adaptive neuro-fuzzy inference system (ANFIS) with greedy forward feature selection to develop an intelligent diagnostic system for differentiating cases with influenza from those with cold [28]. Khayamnia et al. established a Mamdani fuzzy system for migraine headache diagnosis [29]. In each of the above studies, the label vector was transformed into a strict binary matrix in order to apply the fuzzy system for multi-class classification tasks. However, numerous studies indicated that transforming the training samples to a strict binary label matrix is too rigid to learn a discriminative transformation matrix [30][31][32].
Confronted with all these, in this study, we developed a TSK fussy system to enhance the interpretability of our AI model for COVID-19 detection (COVID-19 vs. No-Findings vs. Pneumonia). Meanwhile, we introduced a soft strategy to adaptively relax the binary matrix during the label transformation. Inspired by manifold learning, a class compactness graph was constructed to minimize the risk of model overfitting following the introduction of label relaxation. The core idea lies in that samples sharing the same labels should be kept as close as possible when they are transformed into the label space. The contributions of this study can be summarized from the model and experiment perspectives.
(1) When classic TSK fuzzy systems are applied to classification tasks, margins between different classes are expected to be as large as possible after they are transformed into their label space. However, this assumption is too rigid to learn a discriminative transformation matrix. Additionally, excessive label fitting may cause overfitting. To address these issues, label softening and the compactness class graph are embedded into the objective function of the classic TSK fuzzy systems, which can bring two advantages: firstly, the new method can enlarge the margins between different classes as much as possible, and it has more freedom to fit the labels better. Secondly, to avoid the problem of overfitting, the new method uses the class compactness graph to guarantee that the samples from the same class can be kept close together in the transformed space. (2) Five state-of-the-art algorithms are introduced for comparison studies, and experimental results are reported from the perspectives of classification performance, sensitivity and interpretability. The proposed algorithm significantly performs better than the state-of-the-art algorithm in terms of accuracy and macro F1-score due to the intro- duced label soften and the compactness class graph. What is more, the improved generalization ability can be observed from the lower accuracy difference between training and testing. In addition, radiomics features have physical meaning; with the help of transparent fuzzy rules generated by the proposed algorithm, the interpretability can be guaranteed.
The rest of the sections are organized as follows: In Section 2, we briefly describe the working mechanism of the zero-order TSK fuzzy system for classification tasks. In Section 3, we introduce the objective function, optimization, and the corresponding algorithm of our proposed TSK fuzzy system. In Section 4, we conduct experiments on radiomics features extracted from CXR images for model evaluation. In Section 5, we discuss the experimental results and conclude our study. The abbreviations and symbols used in this study are given in the Abbreviation section.

ZERO-TSK-FS for Classification
The original TSK fuzzy systems were designed for regression tasks. However, the diagnosis of COVID-19 based on radiomics belongs to a classification task, as we know that the zero-order TSK fuzzy system (ZERO-TSK-FS) has very concise fuzzy rules and is widely used due to low model complexity. Therefore, in this section, we re-design ZERO-TSK-FS specifically for the classification task. Suppose we have a training set . , x id ] T ∈ R d , and y i ∈ R is the corresponding label. Then, the k-th fuzzy rule of ZERO-TSK-FS can be defined as follows, where ϑ k j represents a fuzzy set subscribed by feature x ij for the k-th rule, ∧ is a fuzzy conjunction operator, and K is the number of fuzzy rules. Each fuzzy rule is premised on the input feature space [x i1 , x i2 , . . . , x id ] T ∈ R d and transforms the fuzzy sets in the input feature space into a varying singleton represented by f k (x i ). After a series of operation and defuzzification steps, we can obtain the output of x i , i.e., f (x i ), where the normalized fuzzy membership function ϕ k (x i , µ k , δ k ) of the k-th fuzzy rule can be formulated as . If the Gaussian fuzzy membership function is adopted, we have where µ k = [µ k 1 , µ k 2 , . . . , µ k d ] and δ k = [δ k 1 , δ k 2 , . . . , δ k d ] denote the kernel center vector and kernel width vector, respectively. By denoting φ(x i ) = [ϕ 1 (x i , µ 1 , δ 1 ), ϕ 2 (x i , µ 2 , δ 2 ), . . ., ϕ K (x i , µ K , δ K )], and β= [β 1 , β 2 , . . . , β K ] T , then f (x i ) in (3) can be updated as It can be seen that the optimization of the consequent β can be actually considered as solving a linear regression problem in (5). There are many algorithms that can be used to solve this problem. A representative one proposed by Deng et al. has been demonstrated its promising performance [33], which can be formulated as where Θ= [φ(x 1 ), φ(x 2 ), . . . , φ(x N )] T , and y= [y 1 , y 2 , . . . , y N ] T ∈ R N×1 is the output vector. Notably, the model obtained from (6) is a regression model that cannot be directly used for our classification task. Usually, when we use a regression model for a multi-class classification task, a label transformation should be conducted [27]. That is to say, we should transform the label space from R N×1 to R N×C , where C is the number of classes. For example, suppose we have a three-class classification training set having four samples, where y = 2, 1, 3, 1] T ∈ R 4×1 ; then, after label space transformation, we have the new label space Y ∈ R 4×3 , which means that the four samples belong to the second, first, third and first class, respectively. Therefore, by substituting the new label space Y ∈ R N×C into (6) to replace the original label space y = R N×1 , we have Here, we use ∆ ∈ R K×C as the consequent matrix to replace β ∈ R K×1 in (6). In this manner of label transformation, ZERO-TSK-FS can be used for the multi-class classification task by transforming the training samples to a strict binary label matrix, such as Y ∈ R N×C shown in (7).
However, as we previously stated, transforming the training samples to a strict binary label matrix may not be adequately flexible to learn a discriminative transformation matrix. Therefore, in the following sections, we develop a novel zero TSK fuzzy system for COVID-19 cases detection which can relax the strict binary label matrix to a soft one.

Objective Function
To fit the labels in a soft way, we introduced a non-negative label soften matrix Ω ∈ R N×C and a luxury matrix Ξ ∈ R N×C to relax the strict binary label matrix. Let us also take the three-class classification training set having four samples as an example; with Ω and Ξ, the strict binary label matrix can be relaxed into the following form: where the non-negative label matrix Ω ∈ R N×C is defined as Diagnostics 2022, 12, 2613 6 of 18 the luxury matrix Ξ ∈ R N×C is defined as and is a Hadamard product operator of matrices. Therefore, with label matrix relaxation, the objective function of ZERO-TSK-FS in (8) can be updated as min Although label relaxation can help learn a more discriminant transformation matrix ∆, it may lead to overfitting problems. In manifold learning, the class compactness graph is often used to alleviate overfitting problems. The core idea is that samples sharing the same labels should be kept as close as possible when they are transformed into the label space. More specifically, a weight matrix W is defined as follows to capture the relationship that the samples sharing the same labels should be kept as close as possible, In the class compactness graph, if sample x i and sample x j have the same labels, they are connected by an undirected edge, and the corresponding weight can be computed by (13). We find that the closer sample x i and sample x j are, the bigger the corresponding weight. Therefore, when all training samples in the training space are transformed into the label space, any two samples having bigger weights in the training space should also be kept as close as possible in the label space. That is to say, it is reasonable to minimize the following objective to achieve our goal, where W ij is the i-th row and j-th column element in the weight matrix W, and f ( (14) and performing equivalent mathematical transformation, we have where L is the graph Laplacian, which is defined as L = Z − W. Z is a diagonal matrix, and the diagonal element can be computed by Finally, by embedding (15) to (12), we have the following objective function for LR-ZERO-TSK-FS,

Optimization
The objective function shown in (16) has two components that need to be optimized, i.e., the transformation matrix ∆ and the non-negative label relaxation matrix Ω. Thus, we introduced an iterative strategy to search for the optimal solution for ∆ and Ω [34].
Firstly, suppose that the non-negative label relaxation matrix Ω is fixed; then, the optimization problem becomes where U = Y + Ω Ξ. By setting the partial derivative with respect to ∆ to 0, we have Secondly, suppose that the transformation matrix ∆ is fixed; since the second term in (16) is uncorrelated with Ω, the optimization problem becomes We have common knowledge that the squared Frobenius norm of the matrix can be decoupled element by element. Therefore, the optimization problem in (19) can be equivalently decoupled into N × C sub-problems. For the i-th row and j-th column element Ω ij in Ω, we have the corresponding sub-problem, With the closed-form solution of ∆ in (18) and Ω in (21), we can use an iterative strategy to find their optimal values.

Algorithm
The training steps of LR-ZERO-TSK-FS are shown in Algorithm 1. We can see that the asymptotic time complexity of LR-ZERO-TSK-FS is mainly contributed by two components, i.e., FCM clustering for antecedent learning and the computing of ∆ for consequent learning. The time complexity of FCM clustering is O(d 2 NK), where d is the dimension of the training set. The time complexity of the computing of ∆ is O(K 3 ). Therefore, the asymptotic time complexity of LR-ZERO-TSK-FS is O(d 2 NK + K 3 ). Since d and K are relatively small compared with N, the asymptotic time complexity of LR-ZERO-TSK-FS can be considered linear with N.

Settings
After extracting radiomics features from CXR images, we followed the workflow, as shown in Figure 2, to evaluate the proposed model.  In data preprocessing, we aimed to extract radiomics features from CXR images. Firstly, a binary lung region mask was generated by a pretrained U-NET segmentation network. With the obtained lung region on each CXR image, we then used the radiomics package to extract different types of radiomics features for downstream modeling.

Settings
After extracting radiomics features from CXR images, we followed the workflow, as shown in Figure 2, to evaluate the proposed model.
To simulate an independent test, we introduce holdout cross-validation, as shown in Figure 2. To be specific, we firstly independently select a holdout set as a testing set. Then, the rest is cut into K (K = 3 in this study) folds; one is taken as the validation set and the rest are taken as the training set. In the validation phase, minimum-redundancy-maximumrelevance (mRMR) is adopted as a feature-ranking algorithm. Then, the cross-validation (3-CV) strategy was used to determine the optimal feature set and hyper-parameters with respect to the proposed model. In the training phase, with the optimal feature set and hyper-parameters, the best model can be obtained. In the testing phase, with the best model, we can obtain the corresponding testing results.
Firstly, a binary lung region mask was generated by a pretrained U-NET segmentation network. With the obtained lung region on each CXR image, we then used the radiomics package to extract different types of radiomics features for downstream modeling.

Settings
After extracting radiomics features from CXR images, we followed the workflow, as shown in Figure 2, to evaluate the proposed model.  To simulate an independent test, we introduce holdout cross-validation, as shown in Figure 2. To be specific, we firstly independently select a holdout set as a testing set. Then, the rest is cut into K (K = 3 in this study) folds; one is taken as the validation set and the rest are taken as the training set. In the validation phase, minimum-redundancy-maxi-

Model
Parameter Settings The criteria accuracy and macro F1-score were introduced to quantitatively evaluate the classification performance of all studied models.

Experimental Results
In this section, we report our evaluation results from three main aspects, i.e., classification performance analysis, sensitivity analysis and interpretability analysis.

Classification Performance Analysis
The testing and training accuracy of the proposed model LR-ZERO-TSK-FS and the studied benchmarking models are shown in Figure 3. The horizontal axis represents the top-k features selected by using the feature selection method, as shown in Figure 2. It can be observed that LR-ZERO-TSK-FS wins the best performance under the most top-k features except top-5 and top-35. Specially, LR-ZERO-TSK-FS always performs better than the baseline ZERO-TSK-FS on the testing set. As indicated in Figure 3b, LR-ZERO-TSK-FS also wins the best from the top-20 to top-50 features. Statistical analysis of the testing performance in terms of t-test is shown in Table 3, where "*" means that there exist significant differences between LR-ZERO-TSK-FS and the state-of-the-art models (p < 0.05). A similar observation can also be obtained by using another metric macro F1-score, as shown in Tables 4 and 5. In Table 6, we also report the CPU seconds each algorithm consumed. shown in Tables 4 and 5. In Table 6, we also report the CPU seconds each algorithm consumed.
(a) (b)  Table 3. Statistical analysis of testing performance in terms of t-test. * * * * * * means that there exist significant differences between LR-ZERO-TSK-FS and the state-of-the-art models (p < 0.05). Table 4. Classification performance in terms of training macro F1-score.  Table 5. Classification performance in terms of testing macro F1-score.  Table 3. Statistical analysis of testing performance in terms of t-test.  Table 5. Classification performance in terms of testing macro F1-score. To further investigate the impact of training sample size on the difference in model accuracy between training and testing sets (i.e., model generalizability), we trained the models under varying training sample sizes. Figure 4 shows the accuracy differences (the absolute error of training accuracy and test accuracy) between the training and testing datasets, where the horizontal axis represents the size of the training samples. We have common knowledge that fewer training samples are more likely to result in model overfitting, which is also demonstrated in Figure 4. It can also be noticed that the proposed model LR-ZERO-TSK-FS is more effective in inhibiting model overfitting compared with the benchmarking models. As we stated in the introduction part, label relaxation can help to learn a more discriminant transformation matrix, but it also tends to result in overfitting. This statement can be demonstrated by the results shown in Figure 5.  Table 6. CPU seconds all algorithms consumed. To further investigate the impact of training sample size on the difference in model accuracy between training and testing sets (i.e., model generalizability), we trained the models under varying training sample sizes. Figure 4 shows the accuracy differences (the absolute error of training accuracy and test accuracy) between the training and testing datasets, where the horizontal axis represents the size of the training samples. We have common knowledge that fewer training samples are more likely to result in model overfitting, which is also demonstrated in Figure 4. It can also be noticed that the proposed model LR-ZERO-TSK-FS is more effective in inhibiting model overfitting compared with the benchmarking models. As we stated in the introduction part, label relaxation can help to learn a more discriminant transformation matrix, but it also tends to result in overfitting. This statement can be demonstrated by the results shown in Figure 5. When the hyperparameter γ is set to 0, the manifold regularization term in (15) becomes useless. From Figure 5a, it can be observed that LR-ZERO-TSK-FS with =0 γ has larger accuracy differences than the benchmarking models. When compared with ZERO-

Sensitivity Analysis
The proposed model LR-ZERO-TSK-FS has two regularization hyper-par and γ that needed to be set in advance. In this study, we analyzed the robust ZERO-TSK-FS with respect to λ and γ . As illustrated in Figure 6, it can b that LR-ZERO-TSK-FS is sensitive to λ and γ , and smaller λ and γ see better testing accuracy. Therefore, according to Figure 6, λ and γ can be det using cross-validation under a small range. When the hyperparameter γ is set to 0, the manifold regularization term in (15) becomes useless. From Figure 5a, it can be observed that LR-ZERO-TSK-FS with γ= 0 has larger accuracy differences than the benchmarking models. When compared with ZERO-TSK-FS, in particular, the result demonstrated that the label relaxation indeed aggravates overfitting in the absence of manifold regularization. When the manifold regularization term was activated, see Figure 5b, the advantageous testing accuracy can basically be maintained. That is to say, the combination of label relaxation and manifold regularization achieve an outstanding balance between classification accuracy and generalization.

Sensitivity Analysis
The proposed model LR-ZERO-TSK-FS has two regularization hyper-parameters λ and γ that needed to be set in advance. In this study, we analyzed the robustness of LR-ZERO-TSK-FS with respect to λ and γ. As illustrated in Figure 6, it can be observed that LR-ZERO-TSK-FS is sensitive to λ and γ, and smaller λ and γ seem to yield better testing accuracy. Therefore, according to Figure 6, λ and γ can be determined by using cross-validation under a small range. Results from comparative analyses indicated that LR-ZERO-TSK-FS yielded the best overall model performance on the COVID-19 cases detection task (Figure 3). From Figure 5b, we can notice that when the manifold regularization is not activated (γ= 0), LR-ZERO-TSK-FS also performs satisfactorily and better than LR-ZERO-TSK-FS with manifold regularization. That is to say, from Figures 3 and 5b, the capability of our label relaxation strategy for enhancing the classification performance of the TSK fuzzy systems was successfully demonstrated. Moreover, from Figures 4 and 5a, we can see that when the manifold regularization is activated, the differences in accuracy between the training and testing sets remarkably drop. In particular, the comparative analysis with the ZERO-TSK-FS explicitly indicated the superiority of LR-ZERO-TSK-FS in minimizing the risk of model overfitting. It is noteworthy that the present evidence relies on the following particularities of LR-ZERO-TSK-FS: (1) Streamlined classification accuracy owing to the introduction of a soft strategy to adaptively relax the binary matrix during label transformation, rendering more flexibility during label transformation and capability in enlarging the margins between different classes. (2) Alleviated risk of model overfitting in virtue of the adoption of a class compactness graph during manifold learning, based on the assumption that samples sharing the same labels should be kept as close as possible when they are transformed into the label space.

Interpretability Analysis
The proposed model LR-ZERO-TSK-FS was derived from the zero order TSK fuzzy system, carrying the inherent characteristic of model interpretability. Therefore, in the following, we give an example to illustrate how the proposed model can diagnose a subject with COVID-19 with the use of radiomics features.
The optimal radiomics features selected by step 3 shown in Figure 2 are listed in Table 7, which were used for model training. The training results (five trained fuzzy rules) of LR-ZERO-TSK-FS using the selected radiomics features in terms of antecedent and consequent parameters are listed in Table 8.

LR-ZERO-TSK-FS
k-th fuzzy rule: If x i1 is ϑ k 1 ∧x i2 is ϑ k 2 ∧. . . ∧x id is ϑ k d , then f k (x i ) = β k , k = 1, 2, . . . , K. Figure 7 shows the fuzzy membership function of one radiomics feature (No. 1: wavelet-HH_glrlm_GrayLevelVariance) trained on five fuzzy rules. According to the center µ k of each fuzzy membership function and domain knowledge, we can assign a linguistic description, e.g., "low", "a little low", "medium", "a little high" and "high", to each feature. For example, from the fuzzy mapping of "wavelet-HH_glrlm_GrayLevelVariance" in the first fuzzy rule, it can be interpreted that "wavelet-HH_glrlm_GrayLevelVariance" is "low". According to Yang et al. [38], linguistic meanings are dependent on expert knowledge. This is the beauty of interpretable models because the acquired knowledge by models can be refined and integrated with expert knowledge. Accordingly, the trained five fuzzy rules are described in Table 9.  Table 8. Five trained fuzzy rules.

LR-ZERO-TSK-FS
k-th fuzzy rule: If     From Figure 8, it can be found that the classification of the case depends on th tained five fuzzy rules. As shown in Table 6, according to the parameters learned i if-part, the five fuzzy rules can be assigned different linguistic meaning based on do knowledge. Therefore, each fuzzy rule actually indicates the classification result of expert (or one kind of knowledge). For example, the first fuzzy rule classifies the ca COVID-19 based on the domain knowledge embedded in the if-part. Finally, by lin combining all fuzzy rules (combing knowledge of all experts), the case is classifie COVID-19.

Conclusions
COVID-19 has posed a significant public health threat globally. The automated tection of CXR abnormality potentially aids identifying patients with significant ri COVID infection earlier. In this study, we developed an interpretable and soft-l driven TSK fuzzy system for multi-class COVID-19 detection (COVID-19 vs. No-Find vs. Pneumonia) using radiomics features extracted from CXR images. The risk of m overfitting is alleviated in virtue of the adoption of a class compactness graph du manifold learning, which is based on the assumption that samples sharing the same l should be kept as close as possible when they are transformed into the label space successfully demonstrated that our model outperformed the comparable state-of-th models while maintaining high interpretability.
This study is not without limitations. For example, in the construction of the compactness graph, we just used Euclidean distance. In addition, in our experim multi-center based external validation is not carried out. Therefore, in our future w we will carry out more in-depth research from the above aspects. From Figure 8, it can be found that the classification of the case depends on the obtained five fuzzy rules. As shown in Table 6, according to the parameters learned in the if-part, the five fuzzy rules can be assigned different linguistic meaning based on domain knowledge. Therefore, each fuzzy rule actually indicates the classification result of one expert (or one kind of knowledge). For example, the first fuzzy rule classifies the case to COVID-19 based on the domain knowledge embedded in the if-part. Finally, by linearly combining all fuzzy rules (combing knowledge of all experts), the case is classified to COVID-19.

Conclusions
COVID-19 has posed a significant public health threat globally. The automated detection of CXR abnormality potentially aids identifying patients with significant risk of COVID infection earlier. In this study, we developed an interpretable and soft-label-driven TSK fuzzy system for multi-class COVID-19 detection (COVID-19 vs. No-Findings vs. Pneumonia) using radiomics features extracted from CXR images. The risk of model overfitting is alleviated in virtue of the adoption of a class compactness graph during manifold learning, which is based on the assumption that samples sharing the same labels should be kept as close as possible when they are transformed into the label space. We successfully demonstrated that our model outperformed the comparable state-of-the-art models while maintaining high interpretability.
This study is not without limitations. For example, in the construction of the class compactness graph, we just used Euclidean distance. In addition, in our experiments, multi-center based external validation is not carried out. Therefore, in our future work, we will carry out more in-depth research from the above aspects.