Eigenloss: Combined PCA-Based Loss Function for Polyp Segmentation

Colorectal cancer is one of the leading causes of cancer death worldwide, but early diagnosis greatly improves survival rates. The success of deep learning has also benefited this clinical field. When training a deep learning model, it is optimized based on the selected loss function. In this work, we consider two networks (U-Net and LinkNet) and two backbones (VGG-16 and Densenet121). We analyzed the influence of seven loss functions and used a principal component analysis (PCA) to determine whether a PCA-based decomposition allows defining the coefficients of a non-redundant primal loss function that can outperform the individual loss functions and different linear combinations. The eigenloss is defined as a linear combination of the individual losses using the elements of the eigenvector as coefficients. Empirical results show that the proposed eigenloss improves the general performance of the individual loss functions and outperforms other linear combinations when LinkNet is used, showing potential for its application in polyp segmentation problems.


Introduction
Nowadays, colorectal cancer (CRC) presents high incidence rates in developed countries, as it is mainly considered a lifestyle disease [1]. More than 800,000 deaths are reported yearly and CRC incidence reaches one in ten people worldwide [2]. In 2020, over 104,000 new cases and 53,200 deaths are estimated in the United States alone [3], while more than 174,600 deaths are foreseen in Europe [4]. However, early detection of CRC, increasing the proportion of patients diagnosed in stage I from the current 13% to 50%, could save 130,000 lives and reduce the yearly healthcare cost by three billion [5]. In this sense, colonoscopy is the gold standard for early diagnosis (CRC screening): the gastroenterologist performs a visual exploration of the rectum and colon using a flexible endoscope inserted through the patient's anus. During the procedure, polyps are detected and treated if deemed necessary, as these are considered precursor lesions of CRC. Colonoscopy is highly dependent on the endoscopist's skills; therefore, computer-aided diagnosis and detection systems have great potential to improve clinical results [6].
Recently, deep learning has outperformed classical methods on several computer vision tasks, such as detection, segmentation, classification, and enhancement, also in the healthcare field [7]. The automatic extraction of features performed by convolutional neural networks (CNN) is the main advantage over classical hand-crafted features, but the need for large and annotated datasets is one of its major drawbacks in specific fields, such as medical applications. For this reason, several studies focus on improving the scarce and weak annotations in medical imaging datasets [8]. Although clinical evidence for deep learning algorithms is still poor (there are few studies with low confidence), they show equivalent (or even better) performance than clinicians [9,10]. In this sense, some systematic reviews show the potential of deep learning in gastric tissue diseases [11] and wireless capsule endoscopy [12], but these studies also identify risk of bias due to gaps in the evaluation metrics and public availability of the datasets [11] that must be solved through prospective multicenter studies [12].
In our field of interest, computer-aided detection (CAD) systems have classically been based on the extraction of polyp characteristics, like shape [13], texture [14,15] or depth of valleys [16,17]. Moreover, other classical segmentation methods could also be applied to this problem, such as the alternating direction method of multipliers (ADMM) approach to decompose the image into non-additive components and classify pixels into background and foreground [18], or the approach proposed by Li et al. [19], who combine semi-supervised learning of the posterior class distribution, modeled using multinomial logistic regression, with segmentation that exploits spatial information. It has recently been determined that up to 80% of missed colonic lesions could be avoided using real-time artificial intelligence assistance [20]. Therefore, there is still a need to further develop polyp detection, localization, and segmentation methods. In this last case, several authors use deep learning methods, which outperform hand-crafted methods [21], and where CNNs are widely used [22]. Among them, encoder-decoder architectures are a popular family of deep models for semantic segmentation [23]. U-Net [24] is one of the most widely used and well-known architectures for medical imaging segmentation, which has led to many variants for particular applications with great results [25][26][27][28]. While U-Net relies on concatenation in the decoder, LinkNet [29] adds the information from the encoder and has also been used for medical image segmentation [30,31], including polyp segmentation [32]. Furthermore, these two networks are usually compared in studies from different medical fields [33][34][35][36]. For polyp segmentation on colonoscopy images, different approaches can be found. Wichakam et al. [37] proposed a compressed fully convolutional network (FCN) based on VGG-16 [38], which provided state-of-the-art results (Jaccard index of 69.36) while running seven times faster than the original FCN. Vázquez et al. [39] employ an FCN8 architecture [40] and obtain a Jaccard index of 56.07 for the polyp class. Similarly, Wickstrøm et al. [41], also with an enhanced FCN8, report an overlap of 58.70.
Principal Component Analysis (PCA) is a mathematical technique for data transformation that reduces multidimensional data sets into a lower number of principal components, which are uncorrelated and retain as much variance as possible [42]. PCA has been used in medical imaging for different purposes. Ansari et al. [43] use PCA to obtain a target image from a different source image. They transform endoscopic narrow-band images (NBI) into standard colored endoscopic images. In this case, the PCA extracts the fundamental information in NBI, which usually enhances structures and textures, so the standard images can be improved for better assessment and diagnosis. Another application of PCA is the reduction of highly dimensional data, such as fluorescence spectral images of colorectal polyps [44]. In this work, near-infrared autofluorescence spectroscopy images with multiple wavelengths and intensities were obtained from healthy, hyperplastic, and adenomatous colonic tissues, and this high dimensional set of images was reduced to eight principal components for an easier and more intuitive tissue classification. It is also worth mentioning that other technical aspects, such as prevention of overfitting in neural networks, have also been addressed with PCA. Kim et al. [45] apply PCA to reduce the dimensions of the input feature vector while retaining the main information, so the number of weights in the neural network can be reduced. Related to polyp segmentation, PCA has been used as a reduction technique for contour region analysis based on hand-crafted features [46].
In statistics, a loss function is commonly used in optimization problems, where the minimum difference between estimated and true values for a given dataset is sought. Therefore, loss functions, also known as objective functions, cost functions, error functions or simply losses, guide the training process of deep learning models by computing the error between predicted values and ground truth and minimizing it to optimize the weights of the model. The relevance and influence of loss functions in deep learning models have been studied in many fields, such as handwritten digit recognition [47], fast single classification [48], cardiac MRI reconstruction [49], low-dose CT image denoising [50], or face recognition [51]. Loss functions can be categorized into distribution, region or boundary-based losses, depending on the main concept considered for minimization. Regarding distribution-based losses, the cross entropy loss function [52] is widely used. It calculates the cost at each pixel using a logarithmic function that is then averaged over all pixels to quantify the difference between the two probability distributions (ground truth vs. prediction). Other distribution-based functions are the weighted cross entropy [24] and the focal loss [53]. In the case of region-based losses, the commonly used Dice loss [25] measures the overlap of two sets, which can be easily applied to binary segmentation. Additional region-based functions are the Jaccard or Intersection over Union (IoU) loss [54], the Tversky loss [55], the Lovász-Hinge or Lovász-Softmax loss [56], and the focal Tversky loss [57]. Finally, boundary-based loss functions include, for example, the surface loss that uses a distance metric to compare the shapes of two contours [58], or the loss that uses estimations of the Hausdorff distance to measure differences between two contours [59].
When deep learning is applied to medical imaging analysis, researchers tend to use classical cross-entropy loss functions together with a second distance or overlap measure [60], so combined loss functions are commonly used [61][62][63]. For polyp segmentation, both individual and combined loss functions can be found. Wichakam et al. [37] apply the Dice loss function alone. On the other hand, while Zhou [26] combines the Dice loss function with the binary cross entropy, Mohammed et al. [64] combine it with the weighted binary cross entropy instead.
Hence, the aim of this paper is to analyze the influence of several loss functions on different models for polyp segmentation and to determine whether a PCA-based decomposition allows defining the coefficients of a non-redundant primal loss function that can outperform the individual loss functions and other aggregation methods, such as the sum and the mean of loss functions. Consequently, the same dataset and training conditions were used for all experiments, while several loss functions were applied to compare the training and testing performance of the model. Two different encoder-decoder architectures, namely U-Net [24] and LinkNet [29], with two backbones each (VGG-16 [40] and Densenet121 [65]) were used for comparison.
The rest of the paper is organized as follows. In Section 2, a detailed description of materials and methods is exposed, including dataset, network architectures, training parameters, loss functions, metrics and PCA analysis. In Section 3, the results are presented and analyzed. Finally, Section 4 discusses the results and Section 5 summarizes the most important conclusions of this work.

Dataset
CVC-EndoSceneStill is a publicly available dataset [39] that contains 912 images obtained from 44 video sequences collected from 36 patients. The dataset is divided into training, validation and test sets, including 547, 183, and 182 images, respectively. For each image, binary masks are provided for the polyp, lumen, specular lights and void (black area) classes. In the binary masks, pixels corresponding to the class are labelled with 1, and 0 otherwise (Figure 1). In this work, only the polyp and void classes are used. While the polyp binary mask is used both at training and testing, the void binary mask is only used for reporting metrics with respect to the valid endoscopic image.

Architectures and Training Parameters
For this study, two backbones (VGG-16 [40] and Densenet121 [65]) and two encoder-decoder architectures (U-Net [24] and LinkNet [29]) were selected. The backbones differ in the type of block used to obtain the feature vector that is fed into the decoder. VGG-16 (Figure 2A) employs convolutional blocks composed of convolutional and activation layers, which are then followed by a max pooling layer to reduce the input size. On the other hand, Densenet121 (Figure 2B) relies on dense blocks where all layers are interconnected with the rest of the layers in the block. In both cases, the backbones are pretrained on IMAGENET [66], rather than trained from scratch, in order to improve the performance on a small dataset such as CVC-EndoSceneStill (only 547 images for training). As for the encoder-decoder architectures, the difference between U-Net and LinkNet mainly lies in how the information from the encoder is integrated into the decoder. While U-Net (Figure 2C) concatenates the feature maps of the encoder at their corresponding level in the decoder, LinkNet (Figure 2D) adds the input to each encoder layer to the output of the corresponding layer in the decoder. In our implementation, the decoder path is created according to the selected backbone to match the dimensions and number of layers. Therefore, we have four different models (Tables S1-S4, provided as Supplementary Materials).
Each model is trained for 300 epochs using the Adam optimizer, a batch size of 16 and batch normalization. The learning rate is initially set to 0.001, decreasing by half every 5 epochs and periodically restored to its initial value. The weights of the epoch with the lowest validation loss are selected for prediction. All experiments were implemented using segmentation models [67], Keras [68] and Tensorflow [69] as backend and were run on a NVIDIA Quadro P5000 with 16 GB of memory.
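The learning-rate schedule described above can be sketched as a simple function of the epoch. The text states the initial value (0.001), the halving period (5 epochs) and that the rate is periodically restored to its initial value, but not the restart period, so `RESTART_EVERY = 50` below is an assumed placeholder, not a value from the paper:

```python
# Hypothetical sketch of the described training schedule.
INITIAL_LR = 1e-3
HALVE_EVERY = 5
RESTART_EVERY = 50  # assumption: restart period is not stated in the text

def lr_schedule(epoch):
    """Learning rate for a given (0-indexed) epoch: halved every
    HALVE_EVERY epochs, restored to INITIAL_LR every RESTART_EVERY."""
    e = epoch % RESTART_EVERY
    return INITIAL_LR * 0.5 ** (e // HALVE_EVERY)

# The four models themselves can be built with the segmentation models
# library [67], e.g. (not executed here):
#   import segmentation_models as sm
#   model = sm.Unet('vgg16', encoder_weights='imagenet',
#                   classes=1, activation='sigmoid')
#   model = sm.Linknet('densenet121', encoder_weights='imagenet',
#                      classes=1, activation='sigmoid')
```

In Keras, such a function can be plugged into training via the `LearningRateScheduler` callback.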

Loss functions
Seven loss functions are considered in this work. In order to present their formulation, we first establish the mathematical framework of the binary segmentation problem. Since we are performing binary semantic segmentation based on a pixel-wise classification, we describe the problem as that of a binary classifier.
Let X be the image of size M × N to be segmented; each pixel x ∈ X has a corresponding value, y, in the ground truth mask, Y. In a binary classification problem, the ground truth labels can take the values y ∈ {y_1, y_0}, where y_1 is the positive class, being polyp in this case, and y_0 the negative class, corresponding to the background. The segmentation model gives a prediction Ŷ, with labels ŷ ∈ {ŷ_1, ŷ_0}. In our work, labels are assigned as y_1 = 1, y_0 = 0, ŷ_1 = 1 and ŷ_0 = 0. Thus, the elements of the confusion matrix, n_ij, indicate the number of cases belonging to class i that were classified as class j. In our binary problem, the elements of the 2 × 2 confusion matrix are: TP = n_11 (true positives), FN = n_10 (false negatives), FP = n_01 (false positives) and TN = n_00 (true negatives). Therefore, any loss function L(Y, Ŷ) measures the cost or error of predicting Ŷ for a given image X with ground truth Y [70]. In the following subsections, we define the loss functions considered in this study.
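The confusion-matrix elements can be computed directly from the two binary masks; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def confusion_elements(y_true, y_pred):
    """Elements n_ij of the 2x2 confusion matrix for binary masks.

    y_true, y_pred: arrays of shape (M, N) with values in {0, 1},
    where 1 is the polyp (positive) class.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # n_11
    fn = int(np.sum(y_true & ~y_pred))   # n_10
    fp = int(np.sum(~y_true & y_pred))   # n_01
    tn = int(np.sum(~y_true & ~y_pred))  # n_00
    return tp, fn, fp, tn
```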

Jaccard Loss
The Jaccard index [71], also known as intersection over union (IoU), provides the ratio of the intersection between the ground truth and the predicted binary masks over their union, in a [0, 1] range and with the convention of solving the indetermination 0/0 = 1. In our particular dataset, this situation does not occur, as all images contain a segmented polyp, so the ground truth binary masks always contain positive elements. Based on this index, the Jaccard loss can be defined as:

L_Jaccard(Y, Ŷ) = 1 − |Y ∩ Ŷ| / |Y ∪ Ŷ|

This loss function is widely used in segmentation problems as it directly optimizes the Jaccard index, one of the most common metrics.
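A soft NumPy sketch of this loss (applicable to binary masks or probability maps; the smooth term is an assumption that implements the 0/0 = 1 convention mentioned above):

```python
import numpy as np

def jaccard_loss(y_true, y_pred, smooth=1.0):
    """Soft Jaccard (IoU) loss: 1 - |intersection| / |union|.

    The smooth term resolves the 0/0 indetermination (an empty ground
    truth with an empty prediction gives a loss of 0).
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)
```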

Dice Loss
The Dice coefficient [72] also measures similarities between two regions. This coefficient is equivalent to the F1-score. The Dice loss can be defined as:

L_Dice(Y, Ŷ) = 1 − 2|Y ∩ Ŷ| / (|Y| + |Ŷ|)
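A corresponding soft sketch (the smooth term is again an assumption for numerical stability):

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss; the Dice coefficient equals the F1-score."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (
        np.sum(y_true) + np.sum(y_pred) + smooth)
```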

Binary Cross Entropy Loss
Cross entropy or softmax loss is the most widely used loss function for classification problems [73]. For a binary classification problem, binary cross entropy can be defined as:

L_BCE(Y, Ŷ) = −(1 / (M · N)) Σ_i [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]

The aim of this distribution-based loss function is to maximize the accuracy of the predicted binary mask, considering both the positive class, or polyp, and the negative class, or background.
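A pixel-averaged sketch; the clipping constant is an assumption added to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Pixel-averaged binary cross entropy between a binary ground
    truth and predicted probabilities (clipped away from 0 and 1)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.clip(np.asarray(y_pred, dtype=float).ravel(), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))
```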

Binary Focal Loss
Lin et al. [53] defined the focal loss to counteract the effect of imbalanced classes. For a binary problem, it can be defined as:

L_BF(Y, Ŷ) = −(1 / (M · N)) Σ_i [α (1 − ŷ_i)^γ y_i log(ŷ_i) + (1 − α) ŷ_i^γ (1 − y_i) log(1 − ŷ_i)]

where α is a weighting factor, as in balanced cross entropy, and γ a focusing parameter. We have considered α = 0.25 and γ = 2.0, the default parameters in the segmentation models library [67].
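A sketch of the binary focal loss following the formulation above; with γ = 0 and α = 0.5 it reduces to half the binary cross entropy, which gives a quick sanity check:

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: well-classified pixels are down-weighted by
    the (1 - p)^gamma modulating factor."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.clip(np.asarray(y_pred, dtype=float).ravel(), eps, 1.0 - eps)
    pos = -alpha * (1.0 - y_pred) ** gamma * y_true * np.log(y_pred)
    neg = -(1.0 - alpha) * y_pred ** gamma * (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(np.mean(pos + neg))
```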

Tversky Loss
The Tversky index [74] allows us to control the penalties for FP and FN through the parameters α and β, and it is defined as:

TI(Y, Ŷ) = Σ_i y_pi ŷ_pi / (Σ_i y_pi ŷ_pi + α Σ_i y_p̄i ŷ_pi + β Σ_i y_pi ŷ_p̄i)

where p stands for the polyp class, so y_pi = 1 if pixel i belongs to the polyp class, and 0 otherwise, and the opposite for y_p̄i in the ground truth binary mask. The same definitions apply for ŷ_pi and ŷ_p̄i in the predicted binary mask. Based on this index, the Tversky loss function is defined as:

L_Tversky(Y, Ŷ) = 1 − TI(Y, Ŷ)

For implementation, we have selected α = 0.7 and β = 1 − α = 0.3, as the authors report in their original work.

Focal Tversky Loss
Based on the Tversky index (Equation (6)) and combining the idea of the binary focal loss (Section 2.3.4), Abraham et al. [57] defined the Focal Tversky loss as:

L_FocalTversky(Y, Ŷ) = (1 − TI(Y, Ŷ))^(1/γ)

to address the issue of data imbalance in medical imaging segmentation. For implementation, we have selected γ = 4/3, as the authors report in their original work.
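A sketch combining the two ideas above (same FP/FN weighting convention and smooth-term assumption as in the Tversky sketch); γ = 1 recovers the plain Tversky loss:

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3,
                       gamma=4.0 / 3.0, smooth=1.0):
    """Focal Tversky loss: (1 - TI)^(1/gamma)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1.0 - y_true) * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    ti = (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
    return (1.0 - ti) ** (1.0 / gamma)
```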

Lovász-Hinge Loss
Berman et al. [56] propose the Lovász-Hinge loss function, based on the convex Lovász extension of sub-modular losses. For our binary problem, it is defined as:

L_Lovász(Y, Ŷ) = ∆J_p(m(p))

where ∆J_p is the Lovász extension of the Jaccard loss function and m(p) is the vector of hinge losses for the polyp class, as defined in [56].
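A per-image NumPy sketch of this construction, following [56]: pixel errors come from a hinge on the signed logits, errors are sorted in decreasing order, and the sorted errors are weighted by the discrete gradient of the Jaccard extension:

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss with
    respect to the sorted errors (Berman et al. [56])."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_hinge(logits, labels):
    """Lovász-Hinge loss for one image; logits are unnormalized scores
    and labels the {0, 1} ground truth, both flattened."""
    logits = np.asarray(logits, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs      # hinge errors m(p)
    order = np.argsort(-errors)        # decreasing error order
    grad = lovasz_grad(labels[order])
    return float(np.dot(np.maximum(errors[order], 0.0), grad))
```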

Metrics
The results are always reported over the 182 images of the test set of the CVC-EndoSceneStill dataset, which allows for a fair comparison of the models. To calculate the reported metrics, only the valid area of the void binary mask (Figure 1c) is considered. For each image, seven metrics are calculated based on the elements of the confusion matrix:

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
F2-score = 5 · TP / (5 · TP + 4 · FN + FP)
Jaccard = TP / (TP + FP + FN)
Dice = 2 · TP / (2 · TP + FP + FN)
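These per-image metrics, restricted to the valid (non-void) area, can be sketched as follows; the set of seven metrics is taken from the Results section (accuracy, precision, recall, specificity, F2-score, Jaccard, Dice), and the zero-denominator handling is our assumption:

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, valid=None):
    """Seven per-image metrics from the confusion matrix; `valid` is an
    optional boolean mask (the complement of the void class) restricting
    the computation to the valid endoscopic area."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    if valid is not None:
        valid = np.asarray(valid).astype(bool)
        y_true, y_pred = y_true[valid], y_pred[valid]
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    div = lambda a, b: a / b if b > 0 else 0.0  # assumed 0-denominator rule
    return {
        "accuracy":    div(tp + tn, tp + tn + fp + fn),
        "precision":   div(tp, tp + fp),
        "recall":      div(tp, tp + fn),
        "specificity": div(tn, tn + fp),
        "f2":          div(5 * tp, 5 * tp + 4 * fn + fp),
        "jaccard":     div(tp, tp + fp + fn),
        "dice":        div(2 * tp, 2 * tp + fp + fn),
    }
```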

PCA Analysis
Prior to applying the PCA, the viability of the method was tested. The existing correlation between the variables under study was determined based on the correlation matrix and two statistical tests: the Kaiser-Meyer-Olkin (KMO) index and the χ² test. Next, the Bartlett sphericity test was carried out to determine the correlation between all variables. This test is based on Pearson's χ² distribution and indicates a significant correlation between the variables when the determinant of the correlation matrix is different from one. The mean of the KMO sample fit was also calculated to measure the proportion of variance that the variables under study have in common. This parameter ranges from 0 to 1; values greater than 0.5 indicate that the sample is suitable for PCA.
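The two viability checks can be sketched from the correlation matrix alone, assuming the standard formulas: Bartlett's statistic −(n − 1 − (2p + 5)/6) · ln|R|, and the KMO index comparing squared correlations with squared partial correlations obtained from the inverse correlation matrix:

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test statistic and degrees of freedom for a p x p
    correlation matrix R computed from n observations."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    dof = p * (p - 1) // 2
    return chi2, dof

def kmo_index(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = np.linalg.inv(R)
    # Partial correlations from the inverse of the correlation matrix.
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    np.fill_diagonal(partial, 0.0)
    r = R.copy()
    np.fill_diagonal(r, 0.0)
    return (r ** 2).sum() / ((r ** 2).sum() + (partial ** 2).sum())
```

For two variables the KMO index is always 0.5, which gives a quick sanity check.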
Under the assumption that these tests are positive, a PCA was performed using the seven metrics of the 182 images of the test set predicted by the model trained with each individual loss function, in order to identify non-correlated variables that allow for a dimensionality reduction while the included variance is maximized. We have a set of data X ∈ R^(N×D), where N is the number of metrics calculated for the 182 images of the test set and D is the number of loss functions. The objective is to project X into Y ∈ R^(N×K), with K < D, while maximizing the variance retained in Y. If the D variables are correlated, the information in X is redundant, so the PCA allows for a dimension reduction resulting in a set of independent variables. For that, the eigenvectors and eigenvalues of the correlation matrix are calculated. Eigenvalues are sorted in descending order and the K eigenvectors with the largest eigenvalues are selected. Then, the original dataset X can be transformed into Y using the projection matrix W formed by the selected K eigenvectors as:

Y = X W, with W = [w_1, ..., w_K] and w_k = (α_k1, ..., α_kD)^T

where α_kn is the coefficient of the k-th principal component for the n-th input variable. The resulting components were examined and only those with eigenvalues greater than 1 which also include a great proportion of variance (over 80%) were considered [75]. We can define an eigenloss function for each principal component as a linear combination of the individual loss functions as:

L_eigen,k(Y, Ŷ) = Σ_n α_kn L_n(Y, Ŷ)
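The decomposition step can be sketched as follows (function names ours): eigen-decompose the correlation matrix, sort components by decreasing eigenvalue, and use the coefficients of a retained component to weight the individual losses:

```python
import numpy as np

def principal_components(X):
    """Eigen-decomposition of the correlation matrix of X (rows are
    observations, columns the D loss functions), sorted by decreasing
    eigenvalue; components with eigenvalue > 1 and a large explained
    variance (> 80%) would be retained."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)  # symmetric matrix
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvals, eigvecs, explained

def eigenloss(loss_values, coefficients):
    """Linear combination sum_n alpha_kn * L_n of the individual losses."""
    return float(np.dot(coefficients, loss_values))
```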

Results
Tables 1-4 summarize the results of the different models in the test set of CVC-EndoSceneStill. Metrics calculated in the [0, 1] range were transformed into the [0, 100] range for better visualization. In all models, the accuracy is always close to 100 due to the influence of the background in this metric and the imbalance between polyp and background classes. This is also found in the case of precision and specificity, where values are in general over 80.00 and 98.00, respectively. On the contrary, recall, F2-score, Jaccard and Dice provide lower values, ranging from around 40.00 to 81.00. It is important to highlight the small deviation of the specificity in comparison to the rest of the metrics. This is due to the high presence of background pixels, which makes the TN a very stable figure. When classes are highly imbalanced, as in our case because the polyp represents 12.50 ± 11.49% (range: 0.75-66.15%) of the image, metrics that do not consider the TN (precision, recall, F2-score, Jaccard and Dice) are more representative of the performance.

At the loss function level, the Jaccard and Tversky losses appear to provide the most stable results across the different models. On the contrary, the binary cross entropy, the binary focal and the Lovász-Hinge loss functions provide different results depending on the model. On the other hand, models with the Densenet121 backbone obtain overall better results than models using VGG-16 for all individual loss functions. Using U-Net (concatenation) or LinkNet (add) leads to similar results for the individual loss functions. Figures 3 and 4 show the boxplots for the accuracy and Jaccard index for the four models. Boxplots for the rest of the metrics are provided as Supplementary Materials (Figures S1-S5). While the accuracy is similar for all loss functions and models, the Jaccard index presents a greater variability and differences among them.
The correlation matrix presents high values, ranging from 0.659 to 0.898 outside the main diagonal, for the four analyzed models. Therefore, the correlation between the variables under study can be considered high and significant (p-value ≤ 0.001 in all cases). Furthermore, the value of the determinant of the correlation matrix was 0 for each model, so there is a linear relationship between the variables under study. Next, for the Bartlett sphericity test, we obtain a χ² greater than 10,000, which indicates that there are high values outside the diagonal of the correlation matrix with significance less than 0.001, for all models. For the KMO sample fit, the value ranged from 0.937 to 0.941 for the four models, which corroborates the results of the previous statistical parameters. Therefore, we can conclude that the PCA can be applied to the variables under study in the four models.
The PCA analysis identified only one independent component for each considered model (Table 5). Based on these results, the eigenloss was calculated using the coefficients of the first principal component, α_1n, from Table 5 in (Equation (17)) for each considered model. Results for the test set of the models trained with their corresponding eigenloss are provided in the penultimate row of Tables 1-4, as well as in the penultimate boxes of Figures 3 and 4. For comparison, we also analyzed other linear combinations:
• Sum: where all coefficients are equal to 1.

• Mean: where all coefficients are equal to 1/7.
with n ∈ {Jaccard, Dice, BCE, BF, Tversky, FocalTversky, Lovász-Hinge}.
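The three aggregation schemes differ only in their coefficient vectors; a sketch with our naming:

```python
def combined_loss(loss_fns, coefficients):
    """Build a loss that is a linear combination of individual losses."""
    def loss(y_true, y_pred):
        return sum(c * f(y_true, y_pred)
                   for c, f in zip(coefficients, loss_fns))
    return loss

# With the seven individual losses L_n:
#   sum combination:  coefficients = [1.0] * 7
#   mean combination: coefficients = [1.0 / 7] * 7
#   eigenloss:        coefficients = alpha_1n, the first principal component
```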

Training and Testing Analysis
Although there are no great differences in the initial convergence within each model, the largest differences are obtained in LinkNet-VGG-16, where Lovász-Hinge is the slowest and the binary cross entropy the fastest among the individual loss functions (Figure 5). In all cases, the general trend of the eigenloss is to run within the maximum and minimum values set by the individual loss functions. The proposed eigenloss improves the general performance of the individual loss functions in all models, as it provides values closer to 100 in the seven metrics at the same time, although the maximum value might be reached by any of the individual losses. It is interesting to remark that the F2-score of the combined loss function improves, therefore providing a better balance between precision and recall than the individual loss functions. Our proposed combination also minimizes variability in the Jaccard index.
In comparison to other linear combinations, the eigenloss obtains better results than the sum and the mean when LinkNet architecture is considered, regardless of the backbone. On the other hand, for the U-Net, either the sum or the mean performs better. Table 6 summarizes the best epoch and the corresponding validation loss at which weights are selected for prediction as indicated in Section 2.2.

Models with VGG-16 converge slower, the lowest epoch is 172 and 211 for the U-Net-VGG-16 and LinkNet-VGG-16, respectively. On the other hand, LinkNet-Densenet121 converges faster than U-Net-Densenet121 for most losses. In general, all loss values in the best epochs are quite similar, ranging from 0.12 to 0.20. These findings are also observable in Figures S6-S9 included as Supplementary Materials. They show the evolution of the losses during training for the training and validation sets. Peaks are produced when the learning rate is restored to its initial value.
Figures S10-S13, provided in the Supplementary Materials, show the evolution of the loss, accuracy and Jaccard index during training for the training and validation sets for each model and each loss function. In all cases, accuracy values for the training and validation sets remain very similar from the first epochs, achieving values very close to 100. This is due to the fact that the background class is considered in the accuracy, so even when the polyp is not detected at all, accuracy is still high thanks to the negative class. Therefore, accuracy is not the most suitable metric for polyp segmentation. On the other hand, Jaccard index values in the training and validation sets vary throughout the whole training for all models and loss functions. Moreover, the Jaccard index fluctuates more, showing a more erratic behavior, when VGG-16 is used as backbone. As might be expected, the Jaccard and Dice losses provide a more stable Jaccard index during training for the training and validation sets, as the metric is either directly or indirectly included in the loss function for optimization. The proposed eigenloss remains acceptably stable during training, reaching values around 60 in the best epoch.


Discussion
Computer-aided detection systems have the potential to improve colonoscopy [6], since polyps, considered early precursors of CRC, can be detected, diagnosed and treated. In this work, we have addressed polyp detection through the application of segmentation models based on deep learning. Specifically, we have analyzed the influence of seven loss functions on four models for polyp segmentation and determined whether a PCA-based decomposition allows defining a non-redundant primal loss function that can outperform both the individual losses and different linear combinations of them.
Through the application of PCA, the eigenloss was defined as a linear combination of seven individual loss functions. It improves overall performance relative to the individual loss functions for the four considered models. On the other hand, and only for the LinkNet architecture, the eigenloss performs better than the sum and the mean. Regarding the backbones, the eigenloss performs better with Densenet121 than with VGG-16 in all cases. This might be due to the fact that VGG-16 presents a more linear architecture that cannot extract enough detail for segmenting complex structures such as polyps, given the high level of visual similarity between healthy mucosa and lesion. Furthermore, it is also worth mentioning that the eigenloss provides the most balanced results in terms of precision and recall, therefore balancing false positives and false negatives.
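As a rough sketch of the procedure (not the exact implementation from the paper), assume the values of the seven individual losses have been recorded into a matrix of observations (`loss_matrix` below is hypothetical, random data): the eigenloss coefficients come from the leading eigenvector of the covariance matrix. Taking the elements in absolute value and normalizing them to sum to one is an assumption of this sketch.

```python
import numpy as np

# Hypothetical recorded values of the 7 individual losses:
# rows = observations, columns = one column per loss function.
rng = np.random.default_rng(0)
loss_matrix = rng.random((200, 7))

# PCA via the eigendecomposition of the covariance matrix.
centered = loss_matrix - loss_matrix.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# First principal component = eigenvector of the largest eigenvalue.
w = np.abs(eigvecs[:, -1])
w /= w.sum()  # normalize so the coefficients sum to 1 (sketch assumption)

def eigenloss(individual_losses, weights=w):
    """Linear combination of the individual losses with PCA coefficients."""
    return float(np.dot(weights, individual_losses))

print("coefficients:", np.round(w, 3))
print("eigenloss:", eigenloss(loss_matrix[0]))
```

In practice the matrix would hold the real loss values observed during training, so that the leading eigenvector captures the dominant, non-redundant direction of variation among the seven losses.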
Based on the experimental results for the individual loss functions, region-based loss functions provide better results for polyp segmentation. While the highest values for individual metrics are often reached by individual loss functions, the eigenloss improves overall performance, providing a more balanced result across all metrics for each model. Therefore, one benefit of the eigenloss is that it combines the properties of both region-based and distribution-based loss functions in a single loss function; this balanced behavior arises from the capability of PCA to project the existing losses into a statistically independent eigenloss. This way, there is no need to choose one loss over another, except when one loss function behaves outstandingly well in all metrics: because PCA provides a compression with information loss, the eigenloss might in that case perform worse than that individual loss function used on its own.
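The distinction between distribution-based and region-based losses can be made concrete with minimal NumPy definitions of binary cross entropy and soft Dice loss. The equal 0.5/0.5 weighting in `combined_loss` is illustrative only; the eigenloss derives its weights from PCA instead.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-7):
    """Distribution-based: binary cross entropy averaged over pixels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def dice_loss(y_true, y_prob, smooth=1.0):
    """Region-based: 1 - soft Dice coefficient of the overlap region."""
    intersection = np.sum(y_true * y_prob)
    return 1 - (2 * intersection + smooth) / (np.sum(y_true) + np.sum(y_prob) + smooth)

def combined_loss(y_true, y_prob):
    """Illustrative fixed-weight combination of both loss families."""
    return 0.5 * bce_loss(y_true, y_prob) + 0.5 * dice_loss(y_true, y_prob)

# Flattened toy prediction: two polyp pixels, two background pixels.
y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_prob = np.array([0.9, 0.8, 0.2, 0.1])
print(round(combined_loss(y_true, y_prob), 3))
```

A combined loss of this form inherits the per-pixel gradient signal of cross entropy and the overlap-driven behavior of Dice, which is the property the eigenloss exploits with data-driven weights.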
Our work can be directly compared to other works that also used the CVC-EndoSceneStill dataset (Table 7), although, regretfully, the other authors do not report all of the metrics included in this work. Our proposed eigenloss leads to better results in all but one of the five comparable metrics. Nevertheless, we acknowledge that properly ranking the methods is challenging, and specific guidelines would help establish a common framework for comparison [76]. Analyzing the full training, accuracy values for the training and validation sets remain very similar from the first epochs, achieving values very close to 100 for all loss functions. Therefore, accuracy is not the most suitable metric for polyp segmentation, mainly because of the class imbalance. In this situation, Taha et al. [77] recommend the use of distance-based metrics, such as the Hausdorff distance, average Hausdorff distance or the Mahalanobis distance.
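As an illustration, the symmetric Hausdorff distance between two binary masks can be computed with SciPy's `directed_hausdorff`; this is a sketch on toy masks, not part of this paper's evaluation.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(mask_a, mask_b):
    """Symmetric Hausdorff distance between the foreground pixel sets
    of two binary masks, in pixel units."""
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    # directed_hausdorff is asymmetric; take the max of both directions.
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

# Two square masks whose positions differ by a 3-pixel vertical shift.
a = np.zeros((20, 20), dtype=bool); a[5:10, 5:10] = True
b = np.zeros((20, 20), dtype=bool); b[8:13, 5:10] = True
print(hausdorff(a, b))  # 3.0
```

Unlike accuracy, such a boundary-distance metric is insensitive to the amount of background and directly reflects how far a predicted contour lies from the ground truth.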
This work also has limitations to be acknowledged. Only the CVC-EndoSceneStill dataset was used. This publicly available dataset has a limited size, so other datasets might behave differently depending on the type and shape of the polyps included in their images. Since the imbalance level of the dataset has a great influence on the metrics, a different dataset where polyps are bigger or smaller might lead to a different combination. On the other hand, only seven metrics based on the confusion matrix were used for the PCA analysis. Including the previously mentioned distance-based metrics could also yield different PCA results and therefore a different eigenloss.
Therefore, future work should firstly consider the use of different datasets for comparison, as well as images of normal tissue (frames containing no polyps at all). Another prospect for future study is the influence of polyp size on the performance of the segmentation models and loss functions. We also consider the inclusion of more complex loss functions in the study, such as the rectified cross-entropy [78], random drop loss [79], the integrated Hausdorff-Sine loss [80] or the enhanced softmax loss [81], which were not considered here mainly due to time and resource constraints. Most of them were proposed for very specific cases, so it should be tested whether they are also suitable for polyp segmentation. Regarding the selected models, the performance of other U-Net-based approaches could be further analyzed, as these approaches have proven highly effective for different biomedical imaging applications [26,82,83]. In this comparison, it might also be interesting to take into account prominent segmentation networks, such as DeepLab [84] or CCNet [85], which perform well on natural images, in order to study their suitability for polyp segmentation. Lastly, it would also be interesting to carry out a detailed analysis of the transferability of PCA-based losses, evaluating whether the eigenloss coefficients can be reused, whether their level of generalization is sufficient, or whether a new PCA analysis is required for each particular model.

Conclusions
In this work, we propose the use of PCA as a method to extract the coefficients of a linear combination of seven individual loss functions. We have considered two different encoder-decoder architectures with two different backbones. LinkNet-Densenet121 provides the best performance out of the four models. Regardless of the backbone, when LinkNet is used, the eigenloss results in more stable performance (all metrics reach an acceptable value) in comparison to other linear combinations, such as the sum or the mean. Convergence of models using VGG-16 as the backbone is slower than that of models using Densenet121. Further analysis is required to determine the transferability of PCA-based losses.
Supplementary Materials: The following are available online at http://www.mdpi.com/2227-7390/8/8/1316/s1, Table S1: U-Net-VGG-16 architecture, Table S2: U-Net-Densenet121 architecture, Table S3: LinkNet-VGG-16 architecture, Table S4: LinkNet-Densenet121 architecture. Figure S1: Boxplot for precision for the different loss functions and models analyzed, Figure S2: Boxplot for recall for the different loss functions and models analyzed, Figure S3: Boxplot for specificity for the different loss functions and models analyzed, Figure S4: Boxplot for F2-score for the different loss functions and models analyzed, Figure S5: Boxplot for Dice index for the different loss functions and models analyzed, Figure S6: Train and validation loss rates for U-Net-VGG-16 for all loss functions, Figure S7: Train and validation loss rates for U-Net-Densenet121 for all loss functions, Figure S8: Train and validation loss rates for LinkNet-VGG-16 for all loss functions, Figure S9: Train and validation loss rates for LinkNet-Densenet121 for all loss functions, Figure S10: Loss rate, accuracy and Jaccard index during training for the train and validation sets for the U-Net-VGG-16, Figure S11: Loss rate, accuracy and Jaccard index during training for the train and validation sets for the U-Net-Densenet121, Figure S12: Loss rate, accuracy and Jaccard index during training for the train and validation sets for the LinkNet-VGG-16, Figure S13: Loss rate, accuracy and Jaccard index during training for the train and validation sets for the LinkNet-Densenet121.

Funding: This work was partially supported by the PICCOLO project. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 732111. The sole responsibility of this publication lies with the author. The European Union is not responsible for any use that may be made of the information contained therein.