PICCOLO White-Light and Narrow-Band Imaging Colonoscopic Dataset: A Performance Comparative of Models and Datasets

Featured Application: This dataset can be used for supervised training of models for colorectal polyp detection, localisation, segmentation and classification. Abstract: Colorectal cancer is one of the world's leading causes of death. Fortunately, an early diagnosis allows for effective treatment, increasing the survival rate. Deep learning techniques have shown their utility for increasing the adenoma detection rate at colonoscopy, but a dataset is usually required so the model can automatically learn features that characterize the polyps. In this work, we present the PICCOLO dataset, which comprises 3433 manually annotated images (2131 white-light images and 1302 narrow-band images), originated from 76 lesions from 40 patients, distributed into training (2203), validation (897) and test (333) sets while assuring patient independence between sets. Furthermore, clinical metadata are also provided for each lesion. Four different models, obtained by combining two backbones and two encoder-decoder architectures, are trained with the PICCOLO dataset and two other publicly available datasets for comparison. Results are provided for the test set of each dataset. Models trained with the PICCOLO dataset have a better generalization capacity, as they perform more uniformly across the test sets of all datasets, rather than obtaining the best results only on their own test set. This dataset is available at the website of the Basque Biobank, so it is expected that it will contribute to the further development of deep learning methods for polyp detection, localisation and classification, which would eventually result in a better and earlier diagnosis of colorectal cancer, hence improving patient outcomes.


Introduction
Colorectal Cancer (CRC) represents 10% of overall new cancer cases and presents a higher incidence rate in developed countries [1]; it could be considered a "lifestyle" disease associated with a diet high in calories and animal fat, and with sedentarism [2]. In the United States, it has increased from over 132,000 estimated new cases and nearly 50,000 estimated deaths in 2015 [3] to 147,000 estimated new cases and 53,200 estimated deaths in 2020, being the third most commonly diagnosed cancer type [4]. Nevertheless, the 5-year survival rate increases from 18% to 88.5% if CRC is diagnosed at an early stage. Furthermore, screening programs allow for detection before the appearance of symptoms, so up to 22% of symptomatic cases could be treated earlier [5]. Colonoscopy is the gold standard procedure for detection and treatment of colorectal lesions, and its efficacy is related to the Adenoma Detection Rate (ADR) of the endoscopist, defined as the percentage of colonoscopies with at least one adenoma identified [6]. It has been shown that a higher ADR is associated with lower interval CRC rates [7], and that flat/sessile and small lesions are missed more frequently than pedunculated/sub-pedunculated and large ones [8,9]. Therefore, different approaches might be followed to improve ADR and reduce the number of missed lesions. During colonoscopy, these approaches include endoscope caps, positional manoeuvres, as well as the use of imaging modalities such as narrow-band imaging (NBI), which emphasizes the capillary pattern and mucosa surface [10]. It is clear that further development of Computer Assisted Diagnosis (CAD) systems is justified thanks to their potential to eventually improve patient outcomes [11].
In recent years, artificial intelligence (AI) and deep learning (DL) [12] have significantly contributed to the field of medical imaging analysis [13][14][15], and the use of AI in colonoscopy has shown promising results for increasing ADR, although it should be further evaluated in more randomized controlled trials [16]. Along the same lines, DL has also boosted the appearance of methods for polyp detection, localisation and segmentation, where end-to-end methods based on convolutional neural networks accompanied by data augmentation strategies are frequently used [17]. Nevertheless, methods for polyp detection are currently more advanced than methods for polyp classification, which might be due to a lack of available datasets for this task [18]. Lastly, it is important to mention that the availability of high-quality endoscopic images and the increased understanding of the technology by endoscopists are two important factors for the further development of DL for endoscopy [19].
All deep learning approaches rely on a dataset from which features can be learnt. If the training method is supervised, then the dataset should be labelled. Alternatively, semi-supervised training makes the most of labelled and unlabelled datasets. In the case of medical imaging datasets [20], data are hard to retrieve from healthcare systems, so it is necessary to obtain ethical approval and, unlike the labelling of natural images, manual annotation of medical images is a cumbersome, time-consuming process that requires expert knowledge [21]. Especially when segmentation is the target, two main limitations are usually recognized: scarce annotations and weak annotations [22]. Therefore, labelled datasets are usually smaller than non-labelled datasets, but in both cases they are difficult to obtain and, in many cases, they are proprietary datasets not publicly available to the research community.
There are currently some publicly available datasets of colonoscopy images that can be used for polyp detection, localisation and segmentation [17,18]. A general trend is that the more precise the annotation, the smaller the dataset. In the matter at hand, manual annotation of precise binary masks is more time demanding than just labelling frames. The size of publicly available datasets might range from hundreds of frames when a precise manually segmented binary mask is provided, to thousands of video frames if an approximated binary mask is created. The CVC-EndoSceneStill dataset [23] provides manually segmented binary masks for the polyp, background, lumen and specular lights classes. In all, 912 images are available. It also splits the dataset into training, validation and test sets and indicates the recommended metrics, thus providing a common dataset which facilitates the proper comparison of methods. Kvasir-SEG [24] includes 1000 polyp images for which a polygonal binary mask and a bounding box are provided. On the other hand, CVC-VideoClinicDB [21,25] provides an elliptical approximation for over 30,000 frames. Nevertheless, these datasets lack clinical metadata, such as the polyp size or histological diagnosis, and only provide the Paris classification [26,27] in the best of cases. Just recently, HyperKvasir [20] has been released. It includes labelled data for supervised training, but also provides unlabelled data. Regarding polyps, besides the images and binary masks of the Kvasir-SEG dataset, it also includes 1028 images and 73 videos labelled with the "polyp" class and 99,417 unlabelled images. All these datasets include white light (WL) images, but not NBI images.
For polyp classification purposes, only one dataset is publicly available [28]. This dataset includes 76 colonoscopy videos, using both WL and NBI imaging, of 15 serrated adenomas, 21 hyperplastic lesions and 40 adenomas.
The objective of this paper is to present the PICCOLO dataset with its associated clinical metadata and to compare the performance of different deep learning models trained with it and with other publicly available datasets, as well as to analyse the influence of polyp morphology on the results. This paper is organized as follows: Section 2 details the acquisition and annotation protocols used to obtain the PICCOLO dataset, as well as the public datasets and networks used in this study. In Section 3, we present and discuss the results of the experiments. Lastly, the conclusions of this work are included in Section 4.

PICCOLO Dataset
The PICCOLO dataset contains several annotated frames from colonoscopy videos together with clinical metadata. Both WL and NBI imaging technologies are included in this dataset. The following subsections describe the acquisition and annotation protocols used to generate the dataset.

Acquisition Protocol
An acquisition protocol was followed to obtain relevant information at Hospital Universitario Basurto (Bilbao, Spain):
1. Patients included in the colon cancer screening and surveillance program were informed about the study and asked for permission to collect images/videos obtained during routine colonoscopy, with the associated information, via an informed consent and patient information sheet. If the patient gave permission, the rest of the protocol was followed.
2. If a suspicious lesion was found during the procedure, it was resected using the most appropriate method and sent to the Department of Pathological Anatomy for diagnosis.
3. Images/videos with the associated clinical information were anonymized.
4. Videos were analysed and edited/processed by the gastroenterologists who performed the colonoscopy. The video of the full procedure was divided into fragments, indicating whether the tissue was healthy or pathological and the region of the colon where it was found. Further details on the annotation process are given in the following subsection.
5. The gastroenterologist completed part of the associated metadata: a. Number of polyps of interest found during the procedure; b. Current polyp ID; c.
6. The pathologist completed part of the associated metadata: a. Final diagnosis; b. Histological stratification.
7. Clinical metadata were exported into a CSV file.
The protocol and all experiments comply with current Spanish and European Union legal regulations. The Basque Biobank was the source of samples and data. Each patient signed a specific document that was approved by the Ethics Committee of the Basque Country (CEIm-E) with identification code PI+CES-BIOEF 2017-03.

Annotation Protocol
In order to obtain the annotated dataset, a systematic procedure was established (Figure 1):
1. Video clips were processed to extract the individual frames. Uninformative frames were discarded from this process; they included (Figure S1): a. Frames outside the patient; b. Blurry frames; c. Frames with high presence of bubbles; d. Frames with high presence of stool; e. Transition frames between WL and NBI.
2. All frames were analysed and categorized as showing a polyp or not, also identifying the type of light source (WL or NBI).
3. One out of every 25 polyp frames (i.e., one frame per second) was selected to be manually annotated.
4. A researcher prepared three equally distributed sets of images to be processed using GTCreator [21]. Each set was manually annotated by one independent expert gastroenterologist, with experience of 15,000-25,000 (Á.J.C.; F.P.) or 500 (B.G.) colonoscopies. Furthermore, a void mask was also generated to indicate the valid endoscopic area of the image.
5. Segmented frames were collected and revised by a researcher to check the completeness of the dataset prior to its use.
6. Manually segmented masks were automatically corrected with the void mask to adjust segmentations to the endoscopic image area.
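Steps 3 and 6 above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual dataset tooling: the function names are ours, and it assumes 25 fps video and binary masks where 1 marks the class (lesion or void), as described in the dataset details below.

```python
import numpy as np

def sample_polyp_frames(frames, step=25):
    """Keep one out of every `step` polyp frames (one per second at 25 fps)."""
    return frames[::step]

def correct_with_void_mask(polyp_mask, void_mask):
    """Restrict a manual segmentation to the valid endoscopic area.

    Both masks are binary arrays (1 = lesion / 1 = void, respectively).
    Lesion pixels falling inside the void (black) area are removed.
    """
    valid_area = 1 - void_mask  # 1 where the endoscopic image is valid
    return polyp_mask * valid_area

# Toy example: a lesion mask clipped by a void border column on the right.
polyp = np.array([[0, 1, 1, 1],
                  [0, 1, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
void = np.array([[0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]])
corrected = correct_with_void_mask(polyp, void)
```

After correction, any lesion pixel overlapping the void border is zeroed, so segmentations never extend outside the endoscopic image area.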

PICCOLO Dataset Details
Lesions were recorded between October 2017 and December 2019 at Hospital Universitario Basurto (Bilbao, Spain) using Olympus endoscopes (CF-H190L and CF-HQ190L). Videos were recorded with an AVerMedia video capture card and stored on a hard drive. In all, the PICCOLO dataset includes 76 lesions from 40 patients. Of these 76 lesions, 62 include both white light (WL) and narrow band imaging (NBI) frames, while the remaining 14 lesions were recorded using WL only. Original videos were 70.05 ± 59.28 s long (range: 6-345 s), which corresponds to 1965.49 ± 1677.57 frames per clip (range: 187-10,364 frames). In all, more than 145,000 frames were revised. Out of the 80,847 frames showing a polyp, 3433 frames were selected for manual segmentation: 2131 WL images and 1302 NBI images (Table 1). Image resolution is either 854 × 480 or 1920 × 1080, depending on the video from which the frames were acquired. For each frame, three images are provided (Figure 2): the frame itself (a PNG file showing the WL or NBI image), a mask (binary mask indicating the area corresponding to the lesion) and a void mask (binary mask indicating the black area of the image). In the binary masks, pixels labelled with 1 correspond to the class (lesion or void area), and 0 otherwise (Figure 2).
Acquired images have been assigned to the train, validation or test set. In order to assure patient independence between sets, lesions originating from the same video are assigned to the same set. In all, there are 2203 (64.17%), 897 (26.13%) and 333 (9.70%) images in the train, validation and test sets, respectively. Their clinical information is provided in Table 2. Furthermore, Table S1 in the Supplementary Material provides all clinical information and the assigned set for each of the 76 lesions. The PICCOLO dataset is publicly available at the website of the Basque Biobank (https://www.biobancovasco.org/en/Sample-and-data-catalog/Databases/PD178-PICCOLO-EN.html), although a dedicated form to request downloading must be completed (the use of the dataset is restricted to research and educational purposes, and use for commercial purposes is forbidden without prior written permission).
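A patient-independent split of this kind can be sketched as follows. The lesion identifiers, record format and split fractions below are illustrative assumptions for the sketch; the actual PICCOLO assignment is the one distributed with the dataset.

```python
import random

def patient_independent_split(image_records, train_frac=0.64, val_frac=0.26, seed=42):
    """Assign images to train/val/test so that all frames from the same lesion
    (and hence the same patient video) end up in the same set.

    `image_records` is a list of (image_id, lesion_id) tuples.
    """
    lesions = sorted({lesion for _, lesion in image_records})
    rng = random.Random(seed)
    rng.shuffle(lesions)
    # Split at the lesion level, not the image level, to avoid leakage.
    n_train = int(len(lesions) * train_frac)
    n_val = int(len(lesions) * val_frac)
    assignment = {}
    for i, lesion in enumerate(lesions):
        if i < n_train:
            assignment[lesion] = "train"
        elif i < n_train + n_val:
            assignment[lesion] = "val"
        else:
            assignment[lesion] = "test"
    return {img: assignment[lesion] for img, lesion in image_records}

# Hypothetical records: f1 and f2 come from the same lesion.
records = [("f1", "L01"), ("f2", "L01"), ("f3", "L02"), ("f4", "L03")]
split = patient_independent_split(records)
```

Because the shuffle operates on lesion identifiers rather than individual images, frames of the same lesion can never straddle two sets.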

Public Datasets
In order to establish the utility of our dataset and gauge the learning capabilities of models trained on it, two other publicly available datasets have been used in this work:
1. CVC-EndoSceneStill [23]. It contains 912 WL images which are manually segmented. A distribution into training, validation and test sets is provided by the owners; each set contains 547, 183 and 182 images, respectively. This dataset is available at http://www.cvc.uab.es/CVC-Colon/index.php/databases/cvc-endoscenestill/.
2. Kvasir-SEG [24]. It contains 1000 WL images which are manually segmented. Distribution into training, validation and test sets has been done randomly, so each set includes 600, 200 and 200 images, respectively. This dataset is available at https://datasets.simula.no/kvasir-seg/.

These datasets do not provide clinical information about the images they include. In both cases, for each image, a binary mask indicating the polyp area is provided. While CVC-EndoSceneStill also provides void area binary masks, Kvasir-SEG does not. The polyp binary mask is used at training and testing, but the void binary mask is used to report metrics with respect to the valid endoscopic image, so void binary masks have been manually created for the images included in the test set of Kvasir-SEG.

Architectures and Training Process
In this study, we consider four models (Figure 3), obtained by combining a backbone (VGG-16 [30] or Densenet121 [31]) with an encoder-decoder architecture (U-Net [32] or LinkNet [33]), replicating the architectures and training parameters defined by Sánchez-Peralta et al. [34]. For the implementation, the segmentation models library [35], Keras [36] and Tensorflow [37] have been used. Each model has been independently trained with the train and validation sets of each of the three datasets.
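As a minimal sketch, the four backbone/architecture combinations can be enumerated programmatically. With the segmentation models package, these names map onto constructors such as `sm.Unet('vgg16')` or `sm.Linknet('densenet121')`; the actual training parameters follow [34] and are not reproduced here.

```python
from itertools import product

# Backbones and encoder-decoder architectures considered in the study.
BACKBONES = ["vgg16", "densenet121"]
ARCHITECTURES = ["Unet", "Linknet"]

def model_grid():
    """Enumerate the four evaluated models as 'architecture+backbone' names.

    With the segmentation models package, each name corresponds to a call
    such as sm.Unet('vgg16') or sm.Linknet('densenet121') (sketch only).
    """
    return [f"{arch}+{bb}" for arch, bb in product(ARCHITECTURES, BACKBONES)]

grid = model_grid()
```

Each of the four names in the grid is then trained three times, once per dataset, yielding the twelve train/test combinations reported in Table 4.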

Reporting
In order to properly compare results, they are always reported over the test set of each dataset, thus over 182, 200 and 333 images for CVC-EndoSceneStill, Kvasir-SEG and PICCOLO, respectively. Furthermore, results are also reported over all test images together (i.e., 715 images).
In order to characterize the images in each dataset, the following objective measures are calculated on each of the test sets: 1. Percentage of the image that corresponds to the void area; 2. Percentage of the valid area that is polyp; 3. Mean value of the brightness channel in HSV [38], in the range [0, 1]; 4. Histogram flatness measure [39], in the range [0, 1]; 5. Histogram spread [39], in the range [0, 1].
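Measures 3-5 can be sketched as follows. The histogram flatness and spread definitions below are our reading of [39] (geometric-to-arithmetic mean ratio of the histogram, and normalised inter-quartile range of the cumulative histogram, respectively), so treat them as an approximation rather than the paper's exact implementation.

```python
import numpy as np

def brightness_mean(rgb):
    """Mean of the HSV brightness (V) channel, in [0, 1].
    For RGB in [0, 1], V is the per-pixel maximum over the channels."""
    return float(np.max(rgb, axis=-1).mean())

def histogram_flatness(gray, bins=256):
    """Ratio of the geometric to the arithmetic mean of the histogram counts
    (1 for a perfectly flat histogram, near 0 for a peaked one)."""
    h, _ = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    h = h.astype(float) + 1e-12  # avoid log(0) in the geometric mean
    geometric = np.exp(np.mean(np.log(h)))
    return float(geometric / h.mean())

def histogram_spread(gray, bins=256):
    """Inter-quartile range of the cumulative histogram, normalised by the
    full intensity range; higher values indicate higher contrast."""
    h, edges = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    c = np.cumsum(h) / h.sum()
    q1 = edges[np.searchsorted(c, 0.25)]
    q3 = edges[np.searchsorted(c, 0.75)]
    return float(q3 - q1)

# Synthetic image stand-in for an endoscopy frame.
img = np.random.default_rng(0).random((32, 32, 3))
v = brightness_mean(img)
f = histogram_flatness(img.mean(axis=-1))
s = histogram_spread(img.mean(axis=-1))
```

All three measures are bounded in [0, 1], which makes them directly comparable across the three test sets as in Table 3.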
For each test image, seven metrics are calculated: accuracy, precision, recall, specificity, F2-score, Jaccard index and Dice index, all of them based on the elements of the confusion matrix [34]. Only the valid area of the image (indicated in the void binary mask) is considered.
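A minimal sketch of these confusion-matrix metrics, restricted to the valid area via the void mask (illustrative implementation, not the paper's exact code):

```python
import numpy as np

def segmentation_metrics(pred, truth, void):
    """Pixel-wise metrics restricted to the valid endoscopic area.

    `pred` and `truth` are binary masks (1 = polyp); `void` is the binary
    void mask (1 = black, non-endoscopic area), which is excluded first.
    """
    valid = void == 0
    p, t = pred[valid].astype(bool), truth[valid].astype(bool)
    tp = np.sum(p & t); tn = np.sum(~p & ~t)
    fp = np.sum(p & ~t); fn = np.sum(~p & t)
    eps = 1e-12  # guard against empty masks
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp + eps),
        "f2": 5 * precision * recall / (4 * precision + recall + eps),
        "jaccard": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
    }

# Toy example: one true positive and one false positive, no void pixels.
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])
void = np.zeros((2, 2), dtype=int)
m = segmentation_metrics(pred, truth, void)
```

Excluding void pixels first matters: counting the black border as easy true negatives would artificially inflate accuracy and specificity.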


Characterization of the Datasets
Table 3 shows the objective measures used to characterize the datasets. The largest polyps are found in PICCOLO, though with greater variability in size. The brightest and highest-contrast images are found in Kvasir-SEG, while CVC-EndoSceneStill and PICCOLO show a similar appearance in these respects.
Table 3. Characterization of the datasets. Results are reported as mean ± standard deviation. Minimum and maximum values are indicated between brackets. The void area refers to the black area in the images (Figure 2), while the remaining area is considered as valid area.

It is also important to remark that the PICCOLO dataset includes both WL and NBI images, which results in lower values in the brightness channel while the contrast is increased (higher values for the histogram flatness measure and lower values for the histogram spread).

Comparison of Models Performance
Results of the Jaccard index for the different experiments are shown in Table 4. Tables with all metrics are provided in the Supplementary Material (Tables S2-S5). When reporting over the test set of CVC-EndoSceneStill, the best results are distributed between models trained with that same dataset and models trained with Kvasir-SEG. On the other hand, the best results on the Kvasir-SEG and PICCOLO test sets are, in all cases but one, obtained when models are trained with the same dataset. Lastly, if all test sets are gathered and analysed together as a whole, the best performance is obtained with models trained using the PICCOLO dataset. Models trained with Kvasir-SEG show better performance than those trained with CVC-EndoSceneStill. Therefore, it can be observed that the PICCOLO dataset allows for training more generalized models, which perform more uniformly across all datasets, rather than obtaining the best results only on its own test set. If the comparison is made across the different test sets, the lowest results are obtained on the PICCOLO test set, which might be because it contains a wider range of different polyps, making it more difficult to learn with any of the used datasets. This result might be due to the presence of bias within each dataset, which leads to the recommendation of multi-centre or cross-dataset evaluation for DL methods in order to properly understand the performance of such models in real-world settings [40].
Similar to the results found in previous works [34], it can be concluded that, in general terms, Densenet121 works better than VGG-16 as a backbone, and the LinkNet encoder-decoder architecture obtains better results than U-Net.
Set independence is assured in our dataset, as all lesions obtained from the same patient are assigned to the same set. This is also considered when determining the train, validation and test sets in CVC-EndoSceneStill [23]. On the other hand, as Kvasir-SEG neither indicates the origin of each image nor provides a predefined division into sets, it is not possible to determine whether the random allocation of images might have violated this independence. Should this happen, it might explain the higher values for models trained and reported on Kvasir-SEG.
Regarding the clinical characteristics of the lesions in the test sets, it is not possible to establish a comparison as CVC-EndoSceneStill and Kvasir-SEG datasets do not provide clinical information.
Lastly, the inclusion of WL together with NBI images in the PICCOLO dataset might increase robustness to changes in colour when training the models, resulting in better generalization when results are reported over the test sets of the other two datasets. In this regard, metrics have also been calculated considering only the WL frames of the PICCOLO test set for a fairer comparison with the other datasets. Results on the PICCOLO test set remain similar regardless of the type of images considered (Table S4). When the three datasets are analysed together, including only the WL frames from PICCOLO (Table S5), the performances of Kvasir-SEG and PICCOLO are more similar.

Influence of the Polyp Morphology in the Results
Figure 4 shows the Jaccard index for the four analysed models, when polyps in the test set of the PICCOLO dataset are classified based on their morphology according to the Paris classification. It can be clearly seen that flat polyps (0-IIb) obtain the lowest results in all models, and that pedunculated and sessile polyps (0-Ip, 0-Ips and 0-Is) outperform the flat ones (0-IIa, 0-IIa/c and 0-IIb). This result is well aligned with clinical findings [8,9]. Furthermore, it is important to remark that the best results for flat polyps are obtained with models trained with the PICCOLO dataset in three out of the four considered models. Regretfully, this analysis cannot be done for models trained using CVC-EndoSceneStill and Kvasir-SEG, as polyp morphology is not available for those datasets. Similarly, Lee et al. [41] employed a proprietary dataset to train a polyp detection method based on YOLOv2, and they also found that DL methods show lower sensitivity for flat polyps.

Current Limitations and Future Work
This work also has some limitations that should be acknowledged. In the first place, flat polyps, especially type 0-IIb according to the Paris classification, are still underrepresented in the dataset, in common with other datasets. This type is less frequently identified, so a longer, multicentre acquisition campaign would be necessary to increase their presence in public datasets. Furthermore, these polyps are also the most difficult to detect due to their subtle appearance, so Computer Assisted Diagnosis (CAD) systems would be most helpful in assisting their detection [42]. We consider that increasing the number of images showing these challenging polyps would increase the clinical utility of CAD systems, although assistance for the rest of the types would also remain useful for novice endoscopists with a lower Adenoma Detection Rate (ADR). Secondly, images in the dataset are mainly "polyp centred", meaning that the polyp is clearly visible in the image. This type of image is highly convenient for polyp segmentation as well as classification. Nevertheless, we consider that the inclusion of images where the lumen is at or near the screen centre, leaving the polyp peripheral, a typical setup during clinical exploration, would be of help for detection systems. Besides, the availability of long video sequences, even if not all frames are manually segmented, would also benefit the development of such systems. Lastly, manual segmentation is prone to inter-observer variability in medical images [43,44]; therefore, uncertainty should be considered and analysed, both at the dataset creation stage and in the analysis of segmentation results [45].
As future work, we consider the expansion of the current dataset to include some of the previously mentioned images and sequences. A joint global initiative to gather all efforts would result in a larger and more diverse dataset of polyps, expanding the possibilities for the research community.
Lastly, it is worth mentioning that, after developing any deep learning method, it is essential to carry out prospective studies and randomized trials to compare its performance with expert clinicians, which is so far not well proven [46], with the final aim of increasing the adenoma detection rate as an indicator of CRC detection. In this regard, a survey is currently under way to compare the identification of polyps by expert gastroenterologists and residents with that of a deep learning model trained with the PICCOLO dataset.

Conclusions
This work presents a new dataset for polyp detection, localisation and segmentation. It provides 3433 polyp images together with a manually annotated binary mask of the polyp area. It also provides a set of clinical metadata for each of the lesions included. In addition, we have compared four different models trained with our PICCOLO dataset and two other publicly available datasets (CVC-EndoSceneStill and Kvasir-SEG). Our results show that the PICCOLO dataset is suitable for training deep learning models for polyp segmentation, resulting in better generalization capabilities as well as better results for flat polyps. We also provide clinical metadata which, as far as the authors know, are not available in other publicly available datasets, and which might eventually increase the potential use of this dataset for polyp classification purposes.

Figure 1 .
Figure 1. Annotation protocol for the PICCOLO dataset generation.


Figure 2 .
Figure 2. Examples of images and masks from the PICCOLO dataset. First two rows correspond to white light images, while the last two rows are NBI images. First column corresponds to the polyp frame, second column to the binary mask for the polyp area, and third column to the binary mask for the void area.


Figure 3 .
Figure 3. Models considered in this work are obtained by combining one backbone for the encoder (A,B) with one encoder-decoder architecture (C,D). Image reproduced from [34].



Figure 4 .
Figure 4. Jaccard index reported on the test set of the PICCOLO dataset for the four models. Each series represents the model trained with the corresponding dataset: (a) U-Net + VGG16; (b) U-Net + Densenet121; (c) LinkNet + VGG16; (d) LinkNet + Densenet121.

Table 1 .
Lesions and frames in the PICCOLO dataset according to clinical metadata.


Table 2 .
Frames in each of the sets according to clinical metadata.

Table 4 .
Jaccard index for each experiment.Best value per train/validation set and test set is indicated in bold.