A Quantitative Comparison between Shannon and Tsallis–Havrda–Charvat Entropies Applied to Cancer Outcome Prediction

In this paper, we propose to quantitatively compare loss functions based on the parameterized Tsallis–Havrda–Charvat entropy and the classical Shannon entropy for the training of a deep network on the small datasets usually encountered in medical applications. Shannon cross-entropy is widely used as a loss function for most neural networks applied to the segmentation, classification and detection of images. In this work, we compare these two entropies through a medical application for predicting recurrence in patients with head–neck and lung cancers after treatment. Based on both CT images and patient information, a multitask deep neural network is proposed to perform a recurrence prediction task using cross-entropy as a loss function and an image reconstruction task. Tsallis–Havrda–Charvat cross-entropy is a parameterized cross-entropy with parameter α; Shannon entropy is the particular case of Tsallis–Havrda–Charvat entropy obtained for α = 1. The influence of this parameter on the final prediction results is studied. The experiments are conducted on two datasets including in total 580 patients, of whom 434 suffered from head–neck cancers and 146 from lung cancers. The results show that Tsallis–Havrda–Charvat entropy can achieve better prediction accuracy for some values of α.


Introduction
This paper is devoted to studying the loss function based on Tsallis-Havrda-Charvat entropy in deep neural networks [1] for the prediction of outcomes in lung and head-neck cancers. When used for categorical prediction, the loss function is generally a cross-entropy related to a given entropy function. Indeed, cross-entropy-based loss functions are appropriate for evaluating how close a probability distribution is to the Dirac distribution. In deep neural networks, the output is a probability for each class obtained from a softmax activation function, and the Dirac distribution concentrated on one class represents the ground truth. There are several ways to compare these distributions [2][3][4]. Entropy-based metrics, such as divergences and cross-entropies, are the most common because they are the most appropriate way to sum up the informative content of a distribution, as explained in [5]. In Ref. [6], different entropies are presented. In classification and prediction, the cross-entropy is derived from an entropy measure and used as a loss function measuring the difference between the predicted probability and the true Dirac probability. In most neural networks used for prediction, Shannon-related cross-entropy is the most common and is widely used for segmentation [7], classification [8], detection [9] and many other applications [10][11][12][13]. The reason why Shannon entropy is the most used is twofold: first, it was the first entropy introduced in information theory, and secondly, it is extensive in the sense that the entropy of a multivariate distribution with independent margins is the sum of the marginal entropies. This last property makes the calculation of Shannon entropy easy. In Ref. [14], different ways of choosing an entropy and an associated divergence are detailed. Among them, Shannon entropy can be extended by replacing the logarithm by another function.
Cross-entropies can be defined by replacing the counting measure (resp. the Lebesgue measure in the continuous case) by a Radon–Nikodym derivative between probability measures. Shannon entropy can be generalized to other entropies such as Rényi [15] and Tsallis-Havrda-Charvat [16,17]. In this paper, we are interested in a particular generalization of Shannon cross-entropy: Tsallis-Havrda-Charvat cross-entropy [18]. This class of entropies has the particularity of being parametrized by a single parameter α, and Shannon entropy is recovered when the parameter is equal to 1. The relevance and possibilities of Tsallis-Havrda-Charvat in the medical field have been discussed before, and this paper expands on the Tsallis-Havrda-Charvat formula.
Tsallis-Havrda-Charvat entropy was introduced independently by Tsallis [19] in the context of statistical physics and by Havrda and Charvat [20] in the context of information theory. It has been used in publications in several fields, including medical imaging [21,22]. However, it has rarely been used in deep learning, especially because of the difficulty of interpreting the hyperparameter α. There nevertheless exist some scientific articles on this issue. In Ref. [23], Tsallis-Havrda-Charvat entropy is used to reduce the loss while classifying an image. In Ref. [18], the authors define Tsallis-Havrda-Charvat entropy in terms of axiomatization and propose a generalization based on it. In Ref. [24], the maximization of the entropy measure is studied for different classes of entropies such as Tsallis-Havrda-Charvat; the maximization of Tsallis-Havrda-Charvat entropy under constraints appears to be a way to generalize Gaussian distributions. In our previous work [17], Tsallis-Havrda-Charvat cross-entropy was used for the detection of noisy images in pulmonary microendoscopy. To capitalize and improve on this previous work, we again use deep learning in this paper, taking advantage of the previously used architecture by tuning and improving it to achieve better results.
Deep learning has been widely developed in the medical field for classification or segmentation tasks [25][26][27]. Classification can be used to automatically identify the kind of cancer from which the patient is suffering [28,29] or the relevant outcomes after treatment, such as survival expectation [30] or response to treatment [31]. Recurrence of cancer after treatment is one of the main concerns for physicians [32], as it can dramatically impact the outcome for patients and their life expectancy. It would be beneficial for treatment selection if one could predict whether a recurrence will occur. Some studies have been carried out using CT scan images and clinical data. To our knowledge, there is no article using Tsallis-Havrda-Charvat for recurrence prediction [17]. The novelty of this article lies in the performance comparison of Shannon and Tsallis-Havrda-Charvat entropies for cancer recurrence prediction from CT scan data combined with clinical information for patients affected by head and neck (H&N) or lung cancers (examples of the analyzed CT images are displayed in Figure 1). Moreover, we study the value of the parameter α in particular, to examine its impact on the prediction of recurrence in both kinds of cancer. As medical data are generally scarce, the choice of a good entropy is important even if it improves the performance only by 1 or 2 percent. The paper is organized as follows. In the first section, we recall how categorical Shannon cross-entropy is defined and how it can be generalized to Tsallis-Havrda-Charvat. The second section is devoted to experiments and a comparison between the two entropies.

Entropy
As our problem concerns binary prediction, we focus only on finite-state random variables whose state space is provided with the counting measure. These results can obviously be generalized to the finite-dimensional vector space R^n provided with the Lebesgue measure.

Shannon Entropy and Related Cross-Entropy
For a discrete random variable Y, taking its values in Ω = {1, . . . , k} with respective probabilities p_1, . . . , p_k, Shannon entropy is defined by:

H(p) = −∑_{i=1}^{k} p_i log(p_i).

Shannon entropy is minimal and equal to 0 if p_i = 1 for one i and 0 otherwise; it is maximal if Y is uniformly distributed. The corresponding cross-entropy is given by:

H(p; q) = −∑_{i=1}^{k} p_i log(q_i).

Generally, in a classification problem, the true distribution p is a Dirac distribution δ_{i0}, where i0 is the class of the data. In this case, the cross-entropy is H(p; q) = −log(q_{i0}) ≥ 0. As a consequence, the closer q_{i0} is to 0, the higher H(p; q) is. Furthermore, minimizing the cross-entropy forces q to be as close as possible to the distribution p, and since p is the Dirac distribution δ_{i0}, the minimum of the cross-entropy is 0.
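As a minimal illustration (not the network code used in this work), the categorical cross-entropy above can be computed in a few lines of Python:

```python
import math

def shannon_cross_entropy(p, q):
    """Shannon cross-entropy H(p; q) = -sum_i p_i * log(q_i).

    Terms with p_i = 0 are skipped, following the convention 0 * log(q) = 0.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Ground truth as a Dirac distribution on class 0; q is a predicted softmax output.
p = [1.0, 0.0, 0.0]
q = [0.7, 0.2, 0.1]
print(shannon_cross_entropy(p, q))  # -log(0.7) ≈ 0.357
```

The closer the predicted mass q_{i0} on the true class is to 1, the closer the cross-entropy gets to its minimum, 0.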

Tsallis-Havrda-Charvat Cross-Entropy
There are several ways to generalize Shannon entropy, as explained in [14]. Shannon entropy can be expressed by:

H(p) = −∑_{i=1}^{k} h(p_i), where h(u) = u log(u).

h is a convex function such that h(1) = 0. The idea is to choose another function satisfying the same properties. Tsallis-Havrda-Charvat entropy is defined by choosing:

h_α(u) = (u^α − u) / (α − 1),

where α > 0 and α ≠ 1, and is given by:

H_α(p) = −∑_{i=1}^{k} h_α(p_i) = (1 / (α − 1)) (1 − ∑_{i=1}^{k} p_i^α).

The associated cross-entropy is given by:

H_α(p; q) = −∑_{i=1}^{k} p_i (q_i^{α−1} − 1) / (α − 1),

which converges to Shannon cross-entropy when α → 1. As for classical cross-entropy, Tsallis-Havrda-Charvat cross-entropy forces the predicted distribution q to be as close as possible to p when p is a Dirac distribution.
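Assuming the definitions above, the convergence of the Tsallis-Havrda-Charvat cross-entropy to the Shannon one as α → 1 can be checked numerically with a short Python sketch (an illustration only, not the training code):

```python
import math

def shannon_cross_entropy(p, q):
    """H(p; q) = -sum_i p_i * log(q_i); terms with p_i = 0 are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def thc_cross_entropy(p, q, alpha):
    """H_alpha(p; q) = -sum_i p_i * (q_i**(alpha-1) - 1) / (alpha - 1), alpha != 1."""
    return -sum(pi * (qi ** (alpha - 1) - 1) / (alpha - 1)
                for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0]   # Dirac ground truth on class 0
q = [0.8, 0.2]   # predicted probabilities
print(shannon_cross_entropy(p, q))       # ≈ 0.2231
print(thc_cross_entropy(p, q, 1.0001))   # ≈ 0.2231 (close to the Shannon value)
print(thc_cross_entropy(p, q, 2.0))      # ≈ 0.2
```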

Neural Network Architecture for Recurrence Prediction
The proposed architecture used for recurrence prediction is a multitask neural network with a U-Net backbone, with one branch able to perform the prediction task and another to reconstruct the input image, extracting features that help the prediction. The architecture is presented in Figure 2. The U-Net backbone is composed of three convolutional blocks connected by skip connections (concatenations) that add the features extracted in the descending path to the ascending one. Within the convolution layers, we use ReLU activation functions.
At the bottom of the network, the extracted features are sent as inputs to one branch of fully connected layers in charge of making a decision path to determine whether the patient is at risk of recurrence.
Two main tasks are jointly carried out by the network. T1 is the reconstruction task, specific to the U-Net part of the network. It allows for determining whether the features extracted by the descending part of the U are relevant for prediction and classification and representative of the whole CT scan at the same time. The loss function used in this task is the mean squared error, defined as follows:

L_MSE = (1/N) ∑_{n=1}^{N} ‖y_n − ŷ_n‖²,

where y_n represents the true data values for the n-th patient and ŷ_n is the estimated output from the network, with ‖·‖ being the Euclidean norm and N the number of patients. The mean squared error computes the squared distance between the predicted output and the input image volumes. This function allows for comparing the predicted images and the true ones voxel-by-voxel and for training the network to recompose the images from the extracted features. T2 is the prediction task. It is used to determine, from the same input, whether the current patient is at risk of a recurrence of their cancer. It is constructed with fully connected layers and ends in a binary classification. The prediction task's loss function was the subject of our tests: we compared two entropies through this task.
The first, and most commonly used, was Shannon's binary cross-entropy:

L_Shannon = −(1/N) ∑_{n=1}^{N} [p_n log(q_n) + (1 − p_n) log(1 − q_n)],

where p_n is the true class (p_n = 1 if recurrence, p_n = 0 otherwise) and q_n is the estimated probability of recurrence, with N being the number of patients. The second was the generalized formula, Tsallis-Havrda-Charvat binary cross-entropy:

L_α = −(1/N) ∑_{n=1}^{N} [p_n (q_n^{α−1} − 1)/(α − 1) + (1 − p_n) ((1 − q_n)^{α−1} − 1)/(α − 1)].
Binary cross-entropies are loss functions that are able to compare binary predictions with ground truths, which makes them relevant for our binary labels.
The total loss function of the network is the sum of the two losses. The choice of this total loss function was motivated by different experiments in which we used different weights for the individual loss functions. It appears that the sum with equal weights provided the best results.
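The total loss described above can be sketched in plain Python over per-patient scalars for readability (an illustrative simplification; the actual network computes the MSE over 128 × 128 × 64 reconstructed volumes):

```python
def mse_loss(y_true, y_pred):
    """Reconstruction loss T1: mean squared error over N patients."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

def thc_binary_cross_entropy(p, q, alpha):
    """Prediction loss T2 (alpha != 1): THC binary cross-entropy, p_n in {0, 1}."""
    def phi(u):  # (u**(alpha-1) - 1) / (alpha - 1) tends to log(u) as alpha -> 1
        return (u ** (alpha - 1) - 1) / (alpha - 1)
    n = len(p)
    return -sum(pn * phi(qn) + (1 - pn) * phi(1 - qn) for pn, qn in zip(p, q)) / n

def total_loss(y_true, y_pred, p, q, alpha):
    # Sum with equal weights for the two tasks, as retained in the experiments.
    return mse_loss(y_true, y_pred) + thc_binary_cross_entropy(p, q, alpha)

labels = [1, 0, 1]         # 1 = recurrence, 0 = no recurrence
probs  = [0.9, 0.2, 0.6]   # network outputs
print(total_loss([1.0, 0.0, 1.0], [0.9, 0.1, 0.8], labels, probs, alpha=1.5))
```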
The prediction branch is the subject of interest for this article, with the reconstruction task being used to help the prediction in the feature extraction step.
Regarding execution time and its variation with the complexity of the network, we fixed the network at a level of complexity giving an acceptable execution time of 5-7 h for 100 epochs. This was considered a good compromise between running enough epochs to achieve meaningful results and not running so many as to cause overfitting. We decided to use convolution layers in our U-Net backbone because of the limited computational power of the available machines and thus the need to limit the number of parameters in the network. Keeping the computation to a few hours enabled the experiments to be run at night so that the results were available and ready for analysis the next day.

Datasets
The datasets were composed of 580 patients, of whom 434 suffered from head-neck cancer and 146 from lung cancer. Both datasets were small. We chose to conduct experiments on the two subsets (head-neck and lung) separately. Indeed, the optimal value of α depends on the kind of data used. Moreover, we had already tested the combined dataset of 580 patients and the results were poor. We therefore chose to show only the results for the separate datasets.
CT images used in the neural network were resized with an image resolution of 128 × 128 × 64 voxels. The patient information used as input data in the neural network was of two kinds, namely quantitative and qualitative, as shown in Tables 1 and 2. Our experiments consisted of comparing the accuracy of Tsallis-Havrda-Charvat and Shannon for both datasets.

Evaluation Method
Since the study was conducted on small datasets (434 and 146 patients), a result validation strategy was required. We used k-fold cross-validation, which is well suited to validation on small datasets; in our work, we used five-fold cross-validation.
The procedure unfolds as follows: the dataset is randomly split into five folds of equal size; each fold is held out once as a test set while the network is trained on the remaining four folds; the five test accuracies are then averaged. Furthermore, accuracy was used as the evaluation metric. It consisted, in this case, in comparing the values of the ground truth and the prediction, counting the matches and dividing by the size of the dataset.
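The evaluation loop can be sketched as follows, with a placeholder train-and-predict step standing in for the multitask network (a schematic, not the experimental code):

```python
def five_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the ground truth."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_validate(labels, train_and_predict, k=5):
    """Hold each fold out once, train on the rest, and average fold accuracies."""
    n = len(labels)
    scores = []
    for test_idx in five_fold_indices(n, k):
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        preds = train_and_predict(train_idx, test_idx)
        scores.append(accuracy([labels[i] for i in test_idx], preds))
    return sum(scores) / k

# Toy run: a dummy "model" that always predicts the majority class of the training set.
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
majority = lambda tr, te: [int(sum(labels[i] for i in tr) * 2 >= len(tr))] * len(te)
print(cross_validate(labels, majority))
```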

Results
The results achieved during the tests are displayed in this section. Reconstructed images are mainly used to show the relevance of the extracted features for the prediction. Their quality is therefore not of primary importance here, because the objective is the prediction of recurrence. The original and reconstructed images are shown in Figure 3. We can see that the reconstructed images are similar to the input images, meaning that our network is able to recover the input images. The loss of information generates uncertainty in each image. A possible improvement would be to use a fuzzy image processor to improve the quality of the obtained images, as described in [33].

Comparison Results
Regarding Tsallis-Havrda-Charvat, we studied its hyperparameter α as varying from 0.1 to 2.0. When α = 1, this entropy corresponds to Shannon entropy. The displayed p-value measures whether the results acquired by Tsallis-Havrda-Charvat five-fold cross-validation are statistically different from Shannon's. Two conditions must be satisfied in order to accept that the Tsallis-Havrda-Charvat entropy provides better results than Shannon entropy: the average of the five-fold results must be superior to Shannon's and the p-value must be smaller than 0.05. The results are described in the following tables.
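The acceptance rule above can be expressed compactly (a sketch; the p-value is assumed to have been computed beforehand, e.g., with a statistical test comparing the five fold accuracies of the two losses):

```python
def thc_beats_shannon(thc_fold_accuracies, shannon_mean_accuracy, p_value, threshold=0.05):
    """Accept THC over Shannon only if BOTH conditions hold:
    (1) the mean of the five-fold accuracies is higher than Shannon's mean, and
    (2) the p-value of the comparison is below the significance threshold."""
    thc_mean = sum(thc_fold_accuracies) / len(thc_fold_accuracies)
    return thc_mean > shannon_mean_accuracy and p_value < threshold

# Hypothetical fold accuracies for illustration only (not results from the paper).
print(thc_beats_shannon([0.78, 0.80, 0.82, 0.79, 0.81], 0.67, p_value=0.01))  # True
```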
The results achieved for the dataset of head-neck cancers are described in Table 3.
Regarding the dataset containing lung cancers, the results achieved are described in Table 4.
After fine-tuning, we obtained a set of optimal hyperparameters. The results achieved via the Tsallis-Havrda-Charvat formula confirm that, for most values of the hyperparameter α, the final accuracy is not superior to the accuracy obtained with Shannon's loss function. It can also be observed that the loss function derived from the Tsallis-Havrda-Charvat equation can provide better results than Shannon's in some cases. However, it is difficult to know a priori what value of α is good for an application; its choice is still a challenge.

Table 3. Accuracy obtained by the loss function derived from Tsallis-Havrda-Charvat entropy as a function of α for the head-neck cancer dataset (p-values lower than 0.05 and accuracies higher than Shannon's are highlighted in blue).

Table 4. Accuracy obtained by the loss function derived from Tsallis-Havrda-Charvat entropy as a function of α for the lung cancer dataset (p-values lower than 0.05 and accuracies higher than Shannon's are highlighted in blue).

We highlighted in blue the results providing both better accuracy and a significant p-value. The most promising values were achieved for α equal to 1.5 with head-neck cancers and between 1.9 and 3.5 with lung cancers. In this regard, we can state that the results obtained by Tsallis-Havrda-Charvat can be significantly better than those obtained by Shannon.

When analyzing the lung cancer results, a plateau can be easily noticed where Tsallis-Havrda-Charvat achieves better results than Shannon. This involves a specific set of values of α, from 1.9 to 3.5, in which the Tsallis-Havrda-Charvat loss function is significantly more efficient than Shannon's.

Discussion
It has been determined that the Tsallis-Havrda-Charvat loss function performs equally well or better than Shannon cross-entropy, depending on the value of its hyperparameter α. It can be said that the Tsallis-Havrda-Charvat loss function, depending on the value of its hyperparameter α, can fit a wider array of input data and can potentially yield better results.
However, we can state that, based on the calculated p-values and standard deviations, Tsallis-Havrda-Charvat entropy seems more unstable than Shannon entropy, as its standard deviation may reach 0.12 where Shannon's is only 0.06. Furthermore, for several values of α, the p-values show that the results of Tsallis-Havrda-Charvat are not statistically different from Shannon's. The instability of the results obtained using Tsallis-Havrda-Charvat could be explained by the fact that, despite the five-fold method, the data are still too scarce to reach a stable answer. In addition, the data are three-dimensional, which makes extracting relevant features more difficult than with 2D images and complicates the network's task even further. In the analysis of 3D images, multiple slices must be taken into account in order to make a decision, which drastically increases the number of variables to be learned by the neural network. Moreover, in order to be usable by the neural network, all the data must have the same size. This is why, also for reasons of computational power, the data were resized to 128 × 128 × 64 voxels. This size still retains substantial information for the network, but implies a loss of information for wider, larger and deeper images.
Nevertheless, the value of α plays a large part in the behavior of the loss function, and it is the key element that can be set to fit the input data, but the question remains: which α fits which data? The choice of the value of the hyperparameter α remains a challenge, as it depends strongly on the kind of data. As a perspective, it would be interesting to develop an algorithm for automatically selecting the value of this hyperparameter in order to fit it as accurately as possible to the data. The aim is to reach the plateau, or area, of α where the Tsallis-Havrda-Charvat loss function provides consistently better results with a smaller standard deviation and p-value. A further analysis needs to be conducted on the link between the input data and the location of the best α area, in order to determine the kind of extracted features and the kind of neuronal paths that cause one area to be more efficient than another. For instance, in our case, the question is which key feature of the input lung cancer images leads to better values between 1.9 and 3.5, and which key feature of the input H&N cancer images leads to better results for α equal to 1.5.

Conclusions
In this article, we established that, for our data and in some cases, Tsallis-Havrda-Charvat cross-entropy performs better than a Shannon-based loss function. Tsallis-Havrda-Charvat performed best on the head-neck and lung datasets, with 80% and 81% correct recurrence prediction, respectively, while Shannon's results for these two datasets were 67% and 52%, respectively. This makes the Tsallis-Havrda-Charvat formula the best candidate for further research on these datasets. For further research, we might adapt Tsallis-Havrda-Charvat binary cross-entropy to a categorical cross-entropy. This would allow for multi-class predictions, including estimating the time between the end of cancer treatment and recurrence. Another axis of evolution could be finding a way to automatically determine the proper value of α for a given application.