Enhancing Retinal Blood Vessel Segmentation through Self-Supervised Pre-Training

Abstract: The segmentation of the retinal vasculature is fundamental in the study of many diseases. However, manual segmentation is arduous and partly subjective, which motivates research on automatic methods. Nowadays, these methods usually employ Fully Convolutional Networks (FCNs), whose success depends heavily on the network architecture and on the availability of large amounts of annotated data, which are scarce in medicine. In this work, we present a novel application of self-supervised multimodal pre-training to enhance retinal vasculature segmentation. Experiments with diverse FCN architectures demonstrate that, independently of the architecture, this pre-training helps overcome the scarcity of annotated data and leads to significantly better results with less training on the target task.


Introduction
Retinal vasculature segmentation is a key step in the analysis of multiple common diseases, such as glaucoma and diabetes. However, manual segmentation is arduous and partly subjective, so automatic methods have emerged as an advantageous alternative. State-of-the-art vasculature segmentation is based on Fully Convolutional Networks (FCNs). Nonetheless, using FCNs requires addressing two major difficulties: (1) determining the network architecture and (2) gathering a large amount of annotated training data. The first issue can be partly overcome by reviewing the solutions adopted for similar problems. Annotated data, however, are usually scarce in medical imaging, as producing them requires experts to take part in a tedious process. This motivates the use of self-supervised multimodal pre-training (SSMP) to learn the relevant patterns from unlabeled data and reduce the required amount of annotated data [1–3]. Specifically, the proposed SSMP consists of training an FCN to predict fluorescein angiographies (a grayscale modality that enhances the vasculature) from retinographies.
In this work, we present a novel application of SSMP to enhance vasculature segmentation in a transfer learning setting, performing a comparative analysis of several FCN architectures.

Methodology
The main objective of this work is the segmentation of the retinal vasculature using FCNs. To enhance the results, we propose a transfer learning setting that consists of SSMP followed by fine-tuning on the segmentation task [4]. To assess our proposal, we compared the results of the same networks trained with SSMP and trained from scratch, using training sets of different sizes (1, 5, 10, and 15 images). In all cases, we used the following FCN architectures: U-Net [5], FC-DenseNet [6], and ENet [7,8].
To perform the SSMP, we aligned the 59 retinography-angiography pairs of the publicly available Isfahan MISP dataset [9] using the method proposed in [10]. Then, inspired by [1,2], we used the Structural Similarity (SSIM) index to compute the reconstruction loss between the network output and its ground truth.
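As a rough illustration of an SSIM-based reconstruction loss, the sketch below computes SSIM from global image statistics and takes 1 - SSIM as the loss. This is a simplification: SSIM is normally averaged over local windows, and neither the windowing nor the exact form of the loss is specified in this text. The function names, the flat-list image representation, and the constants k1 = 0.01 and k2 = 0.03 (commonly used defaults) are assumptions made for illustration.

```python
def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM from global statistics of two grayscale images (flat lists).

    Simplified for illustration: standard SSIM averages this quantity over
    local windows instead of using whole-image statistics.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and have the same size")
    n = len(x)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(prediction, target):
    """Assumed loss form: 1 - SSIM, so identical images give a loss near 0."""
    return 1.0 - global_ssim(prediction, target)


# Hypothetical 2x2 "angiography" target, flattened, intensities in [0, 1].
target = [0.1, 0.5, 0.9, 0.3]
inverted = [1.0 - v for v in target]
print(ssim_loss(target, target))    # close to 0 for a perfect reconstruction
print(ssim_loss(target, inverted))  # larger for a poor reconstruction
```

In an actual training pipeline, a differentiable windowed SSIM from a deep learning framework would take the place of this plain-Python version.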
To train the networks for the vasculature segmentation task, we employed the DRIVE dataset [11], which consists of 40 retinographies and their corresponding vasculature segmentation masks. As the loss, we used Binary Cross-Entropy. For testing, we included the 20 annotated images of the STARE dataset [12].
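For reference, the binary cross-entropy between a predicted vessel probability map and a binary mask can be sketched as follows. The function name, the flat-list representation, and the clamping constant `eps` are assumptions for illustration; a real implementation would rely on the loss provided by the chosen deep learning framework.

```python
import math


def binary_cross_entropy(probabilities, labels, eps=1e-7):
    """Mean binary cross-entropy over pixels.

    probabilities: predicted vessel probabilities in [0, 1] (flat list).
    labels: ground-truth mask values, 0 (background) or 1 (vessel).
    """
    if len(probabilities) != len(labels) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and have the same size")
    total = 0.0
    for p, t in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical 4-pixel mask and two candidate predictions.
mask = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.5, 0.5, 0.5, 0.5]
print(binary_cross_entropy(confident, mask))
print(binary_cross_entropy(uncertain, mask))  # equals ln(2), about 0.693
```

The confident, correct prediction yields a lower loss than the uninformative one, which is the behavior the fine-tuning stage exploits.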
The networks were trained using the Adam optimization algorithm with learning rate decay and data augmentation through affine transformations and color and intensity variations.

Results and Conclusions
Table 1 shows the best AUC-ROC and AUC-PR values obtained on the STARE dataset by the different networks trained from scratch (FS) and using SSMP. Moreover, Figure 1 depicts an example of the segmentation masks predicted by the U-Net trained with 15 images, with and without SSMP. As observed, the use of SSMP has significant benefits in both quantitative and qualitative terms, mainly because vessel continuity is better preserved and pathological structures are better handled. In addition, this improvement is achieved with less training on the target task. These results demonstrate that SSMP is a valuable option when annotated data for the target task are scarce.
Regarding the diverse FCN architectures, both qualitative and quantitative results (see Table 1) demonstrated that the U-Net provided the best performance.
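The AUC-ROC and AUC-PR metrics reported in Table 1 are computed from per-pixel vessel scores and the ground-truth mask. The following is a minimal sketch under stated assumptions: AUC-ROC is computed through its pairwise-ranking interpretation, and AUC-PR is approximated by average precision, which is one common estimator and not necessarily the one used in this work. The function names and the toy data are hypothetical.

```python
def auc_roc(scores, labels):
    """Probability that a random vessel pixel outscores a random background
    pixel, counting ties as 0.5. Quadratic in the number of pixels, so only
    suitable for small examples; rank-based formulas scale better."""
    positives = [s for s, t in zip(scores, labels) if t == 1]
    negatives = [s for s, t in zip(scores, labels) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive.
    Tied scores are ordered arbitrarily in this simplified version."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


# Hypothetical scores for four pixels and their ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(auc_roc(scores, labels))            # 0.75
print(average_precision(scores, labels))  # about 0.833
```

Because vessel pixels are a small fraction of a retinography, the precision-recall view is typically more sensitive to false positives than the ROC view, which is a plausible reason for reporting both.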